NTCIR 2005: WEB Navigational Retrieval Task


 


Test Collection

Corpus

- Content

The NTCIR-5 WEB will use 'NW1000G-04' (1.36TB) as the document data. It was crawled mainly from the *.jp domain in 2004 and contains about 1 terabyte of page data (with a 300-gigabyte subset). The contents and formats of the data will be almost the same as those of 'NW100G-01', used in NTCIR-3/4 WEB. The organizers will prepare four versions of the document data as follows:

(1) RAW:

Web pages as they were crawled,

(2) EUC: (This is the one used)

Web pages with Japanese character code in RAW converted to EUC,

(3) COOKED:

Plain text data extracted from EUC by removing HTML tags and useless elements, and

(4) SEGMENTED:

Segmented text data processed from COOKED by a morphological analyzer.

RAW, EUC and COOKED are delivered to all participants on hard disk drives, as in the NTCIR-4 WEB. SEGMENTED, a newly prepared version in the NTCIR-5 WEB, is estimated to be so large that some special treatment will be necessary according to the participants' demands. Therefore, its delivery will be somewhat later.

- Location

test collection on elvis
- 2005: /u2/ntcir/2005/web/data/NW/

Link file (not going to use it this time):
  • /u2/ntcir/2005/web/data/DISK1/
  • /u2/ntcir/2005/web/data/DISK2/
  • /u2/ntcir/2005/web/data/DISK3/

Topics

- Content

The organizers will create the topics assuming both Type (i) and (ii). The requested object is supposed to be a certain product/service, shop, facility, organization, person, event, information source, document, etc.
((i) the case where the user requests the typical pages of a known object (e.g., a person, shop, or facility), and he/she carries out a search using the name of the object, and (ii) the case where the user knows the requested object but does not remember the name, so he/she carries out a search using the attribute information or the related information about the object. In both cases, the number of relevant documents tends to be just one or a few.)

- Structure

<TOPIC>

<NUM>Topic number</NUM>

<TYPE>Type code</TYPE><CATEGORY>Category code</CATEGORY>

<TITLE>Search terms</TITLE>

<DESC>Search description sentence</DESC>

<NARR>

<TERM>Explanation of terms (optional)</TERM>

<BACK>Explanation of background</BACK>

<RELE>Relevance criteria (optional)</RELE>

</NARR>

<USER SPECIALTY="Knowledge level code">Attributes of searcher</USER>

</TOPIC>

(1) TOPIC: One search topic

(2) NUM: Topic number

A four-digit decimal number with one-byte characters used as a topic identifier

(3) TYPE: Topic type

A code from 1 to 3 defined as follows:

1: A search term included in TITLE specifies the known item

2: A combination of search terms included in TITLE specifies the known item

3: A search term or a combination of search terms included in TITLE represents the known item but cannot specify it

(4) CATEGORY: Category of the known item

One or more codes from A to H and Z defined as follows, separated by “, ” (a comma and a space):

A: Products / services (not including services provided on the Web)

B: Company / organization (including shops and administrative organs, but not including online shops)

C: Persons

D: Facilities (including public and private)

E: Sights and historic spots, and natural things (including parks, etc.)

F: Information resources (including information sites, data files, etc.)

G: Online shops and online services (not including those in F)

H: Events

Z: Others

(5) TITLE: Search terms

Search terms supposed to be entered to a search engine regarding the information needs; one to three terms in the order of importance, separated by “, ” (a comma and a space)

(6) DESC: Search description

One Japanese sentence briefly describing the information need (Although it should be conceptually consistent with TITLE, the search terms themselves in TITLE may not appear in DESC)

(7) NARR: Narrative of the information needs consisting of TERM, BACK, and RELE

Explanation of the information needs which are not fully represented with TITLE and DESC

(8) NARR/TERM: Explanation of terms (optional)

Japanese sentences describing definition of meanings and explaining related terms regarding terms in TITLE and DESC when they have ambiguity or they are not popular

(9) NARR/BACK: Explanation of background

Japanese sentences explaining the background of the information needs and the motivation

(10) NARR/RELE: Relevance criteria (optional)

Japanese sentences explaining relevance criteria on the item and the pages when they are not clear just with TITLE and DESC

(11) USER: Searcher’s attributes

Searcher’s position, sex, and years of Web search experience

(12) USER/@SPECIALTY: Searcher’s knowledge level

Searcher’s knowledge level on the search target item; a code from A to D defined as follows:

A: Knows the item in detail.

B: Knows the outline of the item.

C: Knows the item to the extent the item can be identified among others.

D: Knows of the item's existence but knows very little about it.
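
As an illustration of the structure above, the following is a minimal Python sketch of a topic reader. It assumes the topic file is plain tagged text in which each element appears at most once per topic; the function name `parse_topics` and the returned dictionary layout are hypothetical and are not part of the task definition.

```python
import re

# Elements read from each topic; TERM, BACK and RELE live inside NARR.
FIELDS = ["NUM", "TYPE", "CATEGORY", "TITLE", "DESC", "TERM", "BACK", "RELE", "USER"]


def parse_topics(text):
    """Return one dict per <TOPIC> block found in `text` (hypothetical helper)."""
    topics = []
    for block in re.findall(r"<TOPIC>(.*?)</TOPIC>", text, flags=re.S):
        topic = {}
        for name in FIELDS:
            # Allow attributes on the opening tag (USER carries SPECIALTY).
            m = re.search(rf"<{name}\b[^>]*>(.*?)</{name}>", block, flags=re.S)
            topic[name] = m.group(1).strip() if m else None
        # CATEGORY and TITLE hold several values separated by a comma and a space.
        topic["CATEGORY"] = topic["CATEGORY"].split(", ") if topic["CATEGORY"] else []
        topic["TITLE"] = topic["TITLE"].split(", ") if topic["TITLE"] else []
        # Accept straight or typographic quotes around the attribute value.
        m = re.search(r'<USER\s+SPECIALTY=["\u201c\u201d]?([A-D])', block)
        topic["SPECIALTY"] = m.group(1) if m else None
        topics.append(topic)
    return topics
```

Since the topic files are distributed in EUC, the text would presumably be read with something like `open(path, encoding="euc_jp").read()` before being passed to `parse_topics`; optional elements that are absent (TERM, RELE) come back as `None`.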

- Location

training topics on elvis (will be moved to /u2 eventually)
- 2004: /u0/ntcir/2004/topics/ntcweb4b-formal-topics-euc.txt

test topics on elvis
- 2005: /u2/ntcir/2005/web/topics/ntcweb5-navi-frun-topics-1.euc.txt


Tasks

NTCIR-5 WEB Navi2 (Known item search)

The known item search is to find representative Web pages of a given item, but not a given Web page. A representative Web page may be a site top page, an entry page to a series of related pages, or a single fully informative page. Two types of users' situations are supposed as follows: (i) the case where the user requests the typical pages of a known object (e.g., a person, shop, or facility), and he/she carries out a search using the name of the object, and (ii) the case where the user knows the requested object but does not remember the name, so he/she carries out a search using the attribute information or the related information about the object. In both cases, the number of relevant documents tends to be just one or a few. Consequently, the subtask can be regarded as including home page finding and named page finding in TREC Web Track, but not restricted to them.

(doc: NTCIR-5 WEB Navigational Retrieval Task2 (Navi 2) Formal Run Manual(version 1.0) )

 

Query Term Expansion Task (see separate page)

Query Term Expansion Task is a newly proposed pilot subtask. Its detailed task definition will be fixed based on discussions among the organizers and the participants. For more information, please visit the subtask's web page.

(doc: Dry Run Specification for Query Term Expansion Task in NTCIR-5 WEB)

 

Timeline

DATE           ACTION

2004-08-01     Call for Participation (preliminary)

2004-09-20     Registration Due
               * Registrations after this date will be accepted as long as possible.

2004-10-01     Document Data Release
               * Provided in a few divisions as they are prepared. The first one will be of about 300GB.

2004-12-01     Dry-Run Topics Release

2005-01-01     Dry-Run Results Submission

2005-03-01     Dry-Run Evaluation Results Release

2005-04-15     Formal-Run Topics Release

2005-06-10     Formal-Run Results Submission

2005-08-20     Formal-Run Evaluation Results Release

2005-10-01     Submission Due of Camera-ready Manuscript for the Working Notes
               * Working Notes will be delivered at the Workshop Meeting.

2005-12-06 to 2005-12-09
               Workshop Meeting

2006-02-       Submission Due of Camera-ready Manuscript for the Proceedings
               * The Proceedings will be published broadly.


Result Submission

 

1. Files to Be Submitted

On submission of search results, the search result document lists and the system descriptions should be submitted.

2. Types of Runs

Participants can use the following combinations of topic elements for the search run execution. The other topic elements must not be used.

(1) TITLE only (mandatory)

(2) Any combination of TITLE, DESC, and NARR/BACK

(3) Any combination of TYPE and CATEGORY added to the above two

When submitting run results using (3), it is strongly recommended to also submit run results using (1) or (2), excluding TYPE and CATEGORY from it.

 

Both automatic and interactive processing modes are permitted. A run is regarded as interactive when a human interacts in any way during search topic processing and/or search execution; otherwise it is regarded as automatic.

 

3. Numbers of Runs

There is no limit on the number of run results that can be submitted. However, the number of run results included in the pooling may be limited according to the total number of run results submitted by all participants. In such a case, run results will be included in order of the priority given in the ‘Priority’ field of the system description; therefore, ‘Priority’ must be indicated.

 

4. Retrieval result format

On submission of run results, it is required to make run result document lists according to the following format.

<qid> <iter> <docid> <rank> <sim> <runid>

 

e.g.: 0001 0 0000001_0000001 0 4238 GroupID-LA40

 

<qid>: Search topic number (the content of <NUM> in each search topic). The list should have been sorted in the ascending order of <qid>.

<iter>: Fixed to ‘0’.

<docid>: Document ID. It must be included in the “doclist” file of the distribution data.

<rank>: Fixed to ‘0’.

<sim>: Similarity (real number). It should be larger for the documents to be retrieved first.

<runid>: Run ID defined by each participant, starting with Group-ID (Group's ID: IUSIS) which has already been specified in application form for participation, and with a hyphen.

 

The data should include only ASCII alphanumeric characters and hyphens. Each pair of fields in tuples is separated by a horizontal tab character (0x09) and each tuple is terminated by a new line character (0x0a).

 

The number of retrieval result documents should be no more than 100 for each topic for each run, and the documents should be sorted in the order of rank from higher to lower. On the system evaluation, the document order in the submitted retrieval result is treated as the ranks of the documents regardless of the values of <sim>.

 

The retrieval result document list of each run for all topics should be submitted in a plain text file with “<runid>.res” as its file name.
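
The format rules above (tab-separated fields, `<iter>` and `<rank>` fixed to 0, topics in ascending order, at most 100 documents per topic, best documents first) can be sketched in Python as follows. This is only an illustration: the function name `format_run`, the input layout, and the four-decimal rendering of `<sim>` are assumptions, not something the organizers prescribe.

```python
def format_run(results, run_id, max_docs=100):
    """Render search results as run-result lines.

    `results` maps a topic number (qid, e.g. "0001") to a list of
    (docid, sim) pairs in any order. Hypothetical helper.
    """
    lines = []
    # Four-digit, zero-padded topic numbers sort correctly as strings.
    for qid in sorted(results):
        # Higher similarity first; the file order is what the evaluation uses.
        ranked = sorted(results[qid], key=lambda pair: pair[1], reverse=True)
        for docid, sim in ranked[:max_docs]:
            # <qid> <iter> <docid> <rank> <sim> <runid>, separated by tabs (0x09).
            fields = [qid, "0", docid, "0", f"{sim:.4f}", run_id]
            lines.append("\t".join(fields))
    # Each tuple is terminated by a newline character (0x0a).
    return "".join(line + "\n" for line in lines)
```

The returned string could then be written to `<runid>.res`, for example with `open(f"{run_id}.res", "w", newline="\n")`, so that line endings stay as 0x0a on every platform.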

 

5. System Description

On submission of run results, it is required to make a concise description of each run according to the following format.

 

<SYSDESC>

<SUBTASK>Subtask</SUBTASK>

<RUNID>RunIDs</RUNID>

<TOPICPART>TopicPart</TOPICPART>

<QUERYMETHOD>QueryMethod</QUERYMETHOD>

<QUERYUNIT>QueryUnit</QUERYUNIT>

<QUERYEXPAN>QueryExpan</QUERYEXPAN>

<LINKINFO>LinkInfo</LINKINFO>

<URLINFO>URLInfo</URLINFO>

<TAGINFO>TagInfo</TAGINFO>

<ANCHOR>Anchor</ANCHOR>

<IRMODEL>IRModel</IRMODEL>

<RANKING>Ranking</RANKING>

<INDEXUNIT>IndexUnit</INDEXUNIT>

<INDEXTECH>IndexTech</INDEXTECH>

<INDEXSTRUC>IndexStruc</INDEXSTRUC>

<FILTERING>Filtering</FILTERING>

<RESOURCE>Resource</RESOURCE>

<PRIORITY>Priority</PRIORITY>

<RUNTIME>RunTime</RUNTIME>

<INDEXTIME>IndexTime</INDEXTIME>

<NOTE>Note</NOTE>

</SYSDESC>

Subtask: Fixed to ‘NTCIR-5 WEB Navi-2’.

RunIDs: <runid> as explained in Section 4. When designating more than one, they should be separated by “, ” (a comma and a space).

TopicPart: The part of the topic used; e.g., TITLE, DESC, TITLE+DESC, TITLE+BACK, TITLE+BACK+TYPE, etc.

QueryMethod: automatic or interactive. automatic: runs without any human intervention during query processing and search. interactive: all runs other than “automatic”.

QueryUnit: Unit of query; e.g., character bi-gram, word+phrase, etc.

QueryExpan: Techniques used to expand queries; if not used, leave it blank; e.g., pseudo-relevance feedback, no query expansion, etc.

LinkInfo: How link information is used for searching and ranking; if not used, leave it blank; e.g., PageRank, entry page detection, etc.

URLInfo: How URL is used for searching and ranking; if not used, leave it blank; e.g., re-ranking using directory depth, etc.

TagInfo: How HTML tags are used for searching and ranking; if not used, leave it blank; e.g., weighting title and headings, extracting page body, etc.

Anchor: How to use anchor text for ranking and searching; if not used, leave it blank; e.g., indexing the out-linked document, etc.

IRModel: IR model; e.g., Boolean model, vector space model, probabilistic model, etc.

Ranking: Ranking technique used for ranking; e.g., tf, tf-idf, mutual information, URL length, PageRank, etc.

IndexUnit: Unit of index, e.g., character, character bi-gram, word, phrase, etc.

IndexTech: Techniques used to process index terms; e.g., morphology, stemming, POS, etc.

IndexStruc: Index structure; e.g., PAT, inverted file, signature file, etc.

Filtering: Filtering method for extracting useful pages or for discarding unnecessary pages; e.g., Kleinberg and Chakrabarti’s topic distillation, Web page selection using an internet directory, SPAM filtering using SPAM-like words or patterns, etc.

Resource: External resources used for indexing, filtering, or searching, other than the data provided by the organizers; e.g., Yahoo directory, training data set, etc.

Priority: Priority rank of each of RunIDs (from first to the last); e.g., 1, 2, 3, …, or “RunID:1, RunID:2, …” when designating more than one RunIDs in a SYSDESC.

RunTime: (optional) Averaged seconds consumed for searching per topic.

IndexTime: (optional) Hours consumed for indexing.

Note: (optional) Any additional information.

 

It is desirable that all the items are described in detail and concretely, not limited to the examples indicated above.

 

Data should include only ASCII characters, and the characters ‘<’ and ‘>’ must not be used in any item value of a system description. Text language should be English.

 

A system description may describe a single run or a group of runs. Each system description should be flanked by “<SYSDESC>” and “</SYSDESC>”.

 

All system descriptions should be stored in one plain text file with “GroupID.sys” as its file name. “GroupID” is the group ID entered on the registration form.
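
A small Python sketch of how one system description block might be generated and checked against the constraints above (ASCII only, no ‘<’ or ‘>’ in item values). The helper name `build_sysdesc` and the choice to emit empty elements for unused items are assumptions made for illustration.

```python
# Element names in the order given in the SYSDESC template above.
SYSDESC_TAGS = [
    "SUBTASK", "RUNID", "TOPICPART", "QUERYMETHOD", "QUERYUNIT", "QUERYEXPAN",
    "LINKINFO", "URLINFO", "TAGINFO", "ANCHOR", "IRMODEL", "RANKING",
    "INDEXUNIT", "INDEXTECH", "INDEXSTRUC", "FILTERING", "RESOURCE",
    "PRIORITY", "RUNTIME", "INDEXTIME", "NOTE",
]


def build_sysdesc(values):
    """Return one <SYSDESC> block as text (hypothetical helper).

    `values` maps element names to strings; missing items are left blank.
    """
    lines = ["<SYSDESC>"]
    for tag in SYSDESC_TAGS:
        value = values.get(tag, "")
        # Item values must be ASCII and must not contain '<' or '>'.
        if not value.isascii() or "<" in value or ">" in value:
            raise ValueError(f"invalid characters in {tag}: {value!r}")
        lines.append(f"<{tag}>{value}</{tag}>")
    lines.append("</SYSDESC>")
    return "\n".join(lines) + "\n"
```

Concatenating the blocks for all runs and writing them to `GroupID.sys` would then give the single file asked for above.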

 

6. How to Submit the Search Results

On submission of search results, the search result document lists and the system descriptions should be submitted according to the following method. Upon receipt of search results, a receipt notification mail will be returned. If it is not delivered within 3 days of submission, contact the address below.

(1) Media: Attachment files of e-mail. Attachment file names should be explicitly described in the Subject field.

(2) Destination: oyama@nii.ac.jp

(3) Due date: Friday, 10th Jun, 2005.


Assessment

Relevance of each document to the search topic was judged on an absolute scale as one of the following levels:

  • Relevant: A representative page appropriate for the searched item, satisfying the retrieval needs
  • Partially Relevant: A page partially satisfying the retrieval needs; the following kinds of pages qualify:
    • A representative page of an item that is a broader or narrower concept of the searched item; the page should provide an easy-to-find hyperlink to the relevant document
    • A page that can be regarded as a substitute for the representative page of the searched item.
  • Non-relevant

Additional judgment

  • Undistinguishability:

    3. The different item is more well-known than the searched item.

    2. The different item is as well-known as the searched item.

    1. The different item is less well-known than the searched item.

    0. Not undistinguishable.

  • Duplicate pages. Note: even if the content is the same, pages that have different link target pages or images are not deemed to be duplicate pages.

Evaluation

The organizers are going to apply DCG (Discounted Cumulative Gain), WRR (Weighted Reciprocal Rank) [1, 2, 3], and an evaluation measure taking costs of inputting search terms and browsing retrieval result documents into account. In case a group of duplicated pages, a group of frame set and frame pages, or a group of pages displayed by hyper-jumping automatically is included in the search result document list, redundancy will be taken into account in calculating the evaluation value. Moreover, in case an undistinguishable document is included in the search result document list, we will investigate evaluation methods regarding the effects.

 

DCG: (Ga, Gb)=(3,0) for (RL-1) and (Ga, Gb)=(3,2) for (RL-2) (more to read on page 10-11)
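
To make the gain settings concrete, here is a rough Python sketch of a DCG computation. The gain values come from the line above; everything else is an assumption: the labels "A" (relevant) and "B" (partially relevant) are borrowed from the qrel file names, and the discount is the commonly used log-base-2 form, which may differ from the exact definition the organizers use (see the task manual for that).

```python
import math

# (Ga, Gb) per relevance level: RL-1 is rigid, RL-2 is relaxed.
GAINS = {
    "RL-1": {"A": 3, "B": 0},
    "RL-2": {"A": 3, "B": 2},
}


def dcg(judgments, level="RL-1", base=2):
    """DCG of a ranked list of judgment labels (hypothetical helper).

    `judgments` lists the label of each retrieved document in rank order;
    any label other than "A" or "B" counts as non-relevant (gain 0).
    Assumed discount: none for ranks below `base`, log_base(rank) otherwise.
    """
    gains = GAINS[level]
    total = 0.0
    for rank, label in enumerate(judgments, start=1):
        gain = gains.get(label, 0)
        discount = math.log(rank, base) if rank >= base else 1.0
        total += gain / discount
    return total
```

Under these assumptions a partially relevant page contributes nothing at RL-1 but a discounted gain of 2 at RL-2, which is the only difference between the two settings.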


Gameplan

very BASIC run (green: possible strategies; red: things need to be done; purple: things to think)

1. Query expansion: expand Kanji to Hiragana

  • Code: qindexweb1.pl , qindexweb2.pl
  • Steps:
    • Title:
      • split terms by ','
      • keep the order of importance (add different weights in the format of '~'?)
      • add "" around phrase such as "natural beauty basic"
      • different strategies for type 3 topic?
    • Description / Narrative:
      • tokenize (/usr/local/bin/kakasi -w < inputfile)
      • need to remove punctuation mark
      • transliterate Kanji to Hiragana (/usr/local/bin/kakasi -JH -ieuc -oeuc < tokenizedfile or /usr/local/bin/kakasi -JH -s < inputfile)
      • about kakasi (from Gavin)
        • kakasi -ieuc -oeuc -JH / kakasi -ieuc -oeuc -JH -s/ kakasi -ieuc -oeuc -JH -fs

          -i = input encoding (for our purposes, this is always set to 'euc') - !!! MAKE sure it is ALWAYS euc !!!
          -o = output encoding (for our purposes, this is always set to 'euc') - !!! If not, LUCENE WILL NOT WORK !!!
          -JH = convert Kanji to Hiragana
          -s = insert spaces between converted entities
          -fs = produce file with Kanji characters with their hiragana readings in brackets with a space after the Kanji
      • Need to record the user knowledge level + IR experience (e.g., B4, A3, where the letter stands for the knowledge level on the target item and the number stands for the years of IR experience)? Potential usage: to decide whether to use a long query or not.
  • results
  • under $results/query/[train|test|dry]
    • t file (t$qn$tn: keep the original information)
    • q file (q$qn$tn: tokenized and expanded topics)
    • baseline folder: q file (q$qn$tn[s|m|l]: tokenized query file with title only, title+desc/back, all fields combination)
    • expansion folder: qh file (qh$qn$tn[s|m|l]: tokenized query file with title only, title+desc/back, all fields combination, in expanded Hiragana format)
    • advanced folder: o, b, t, q files (o=overlapping, t=term boosting, b=boolean, p=proximity); see agenda for details

2. Phrase detection from topic 3 using Lucene

3. tuning if possible

For details, see the agenda…
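
The Title steps in item 1 (split on commas, keep the order of importance, quote multi-word phrases) could be sketched in Python roughly as follows. This is not what qindexweb1.pl or qindexweb2.pl do; the function name `build_title_query` and the linear weighting scheme are hypothetical. The sketch also uses '^', the boost operator in Lucene's classic query syntax, rather than the '~' floated in the notes, since '~' denotes fuzzy and proximity matching there.

```python
def build_title_query(title, boost_step=1.0):
    """Turn a TITLE string into a boosted query string (hypothetical helper).

    Terms are listed in order of importance, so earlier terms get
    larger boosts; the last term gets boost 1.
    """
    terms = [t.strip() for t in title.split(",") if t.strip()]
    parts = []
    for position, term in enumerate(terms):
        # Quote multi-word terms so they are searched as a phrase.
        if " " in term:
            term = f'"{term}"'
        boost = 1.0 + boost_step * (len(terms) - 1 - position)
        parts.append(f"{term}^{boost:g}")
    return " ".join(parts)
```

For the phrase example used above, `build_title_query("natural beauty basic, shop")` gives `"natural beauty basic"^2 shop^1`. Whether a linear step is a sensible weighting, and how type 3 topics should be treated, remain open questions from the list.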


Dry run (didn't turn in results this year) & Training data

Corpus

- Content

  • dry run: NW230G-04DR ( In the case of "raw" data, the dataset contains about 21,300,000 files and 230GB of data.)
  • training: NW100G-01

- Location

  • dry run(????)
  • training collection on elvis (will move to /u2 eventually)
    - 2004: /u0/ntcir/2004/euc/

Topics

- Location

  • dry run:  /u0/ntcir/2004/topics/ntcweb5-navi-dryrun-topics/ntcweb5-navi-dryrun-topics-v1.0.txt
  • training: topics of NTCIR4-WEB Navi-1
    • /u0/ntcir/2004/topics/ntcweb4b-formal-topics-euc.txt: topics

Relevance Judgement

  • dry run: didn't get the relevance judgements
  • training: Relevance judgements of NTCIR4-WEB Navi-1
    • qrel-fml.doc-a.lst: relevance judgment for rigid relevance level
    • qrel-fml.doc-ab.lst: relevance judgment for relaxed relevance level