- Content
NTCIR-5 WEB will use 'NW1000G-04' (1.36TB) as the document data, which was mainly crawled from the *.jp domain in 2004 and is about 1 terabyte in total page data size. Its subset of 300 gigabytes Contents and formats of the data will be almost the same as those of 'NW100G-01' used in NTCIR-3/4 WEB. The organizers will prepare four versions of the document data as follows:
(1) RAW:
Web pages as they were crawled,
(2) EUC: (This is the one used)
Web pages with Japanese character code in RAW converted to EUC,
(3) COOKED:
Plain text data extracted from EUC by removing HTML tags and useless elements, and
(4) SEGMENTED:
Segmented text data processed from COOKED by a morphological analyzer.
RAW, EUC and COOKED are delivered to all participants on hard disk drives, as in NTCIR-4 WEB. SEGMENTED, a newly prepared version in NTCIR-5 WEB, is estimated to be so large that some special treatment will be necessary according to the participants' demands. Therefore, its delivery will be somewhat later.
- Location
test collection on elvis
- 2005: /u2/ntcir/2005/web/data/NW/
- Content
The organizers will create the topics assuming both Type (i) and Type (ii) situations. The requested object is supposed to be a certain product/service, shop, facility, organization, person, event, information source, document, etc.
(Type (i): the user requests the typical pages of a known object (e.g., a person, shop, or facility) and searches using the name of the object. Type (ii): the user knows the requested object but does not remember its name, so he/she searches using attribute information or related information about the object. In both cases, the number of relevant documents tends to be just one or a few.)
- Structure
<TOPIC>
<NUM>Topic number</NUM>
<TYPE>Type code</TYPE><CATEGORY>Category code</CATEGORY>
<TITLE>Search terms</TITLE>
<DESC>Search description sentence</DESC>
<NARR>
<TERM>Explanation of terms (optional)</TERM>
<BACK>Explanation of background</BACK>
<RELE>Relevance criteria (optional)</RELE>
</NARR>
<USER SPECIALTY="Knowledge level code">Attributes of searcher</USER>
</TOPIC>
(1) TOPIC: One search topic
(2) NUM: Topic number
A four-digit decimal number with one-byte characters used as a topic identifier
(3) TYPE: Topic type
A code from 1 to 3 defined as follows:
1: A search term included in TITLE specifies the known item
2: A combination of search terms included in TITLE specifies the known item
3: A search term or a combination of search terms included in TITLE represents the known item but cannot specify it
(4) CATEGORY: Category of the known item
One or more codes from A to H and Z defined as follows, separated by “, ” (a comma and a space):
A: Products / services (not including services provided on the Web)
B: Company / organization (including shops and administrative organs, but not including online shops)
C: Persons
D: Facilities (including public and private)
E: Sights and historic spots, and natural things (including parks, etc.)
F: Information resources (including information sites, data files, etc.)
G: Online shops and online services (not including those in F)
H: Events
Z: Others
(5) TITLE: Search terms
Search terms supposed to be entered into a search engine for the information need; one to three terms in order of importance, separated by “, ” (a comma and a space)
(6) DESC: Search description
One Japanese sentence briefly describing the information need (although it should be conceptually consistent with TITLE, the search terms themselves in TITLE may not appear in DESC)
(7) NARR: Narrative of the information need, consisting of TERM, BACK, and RELE
Explanation of the aspects of the information need that are not fully represented by TITLE and DESC
(8) NARR/TERM: Explanation of terms (optional)
Japanese sentences defining the terms in TITLE and DESC and explaining related terms, when the terms are ambiguous or not widely known
(9) NARR/BACK: Explanation of background
Japanese sentences explaining the background of and motivation for the information need
(10) NARR/RELE: Relevance criteria (optional)
Japanese sentences explaining the relevance criteria for the item and the pages when they are not clear from TITLE and DESC alone
(11) USER: Searcher’s attributes
Searcher’s position, sex, and years of Web search experience
(12) USER/@SPECIALTY: Searcher’s knowledge level
Searcher’s knowledge level on the search target item; a code from A to D defined as follows:
A: Knows the item in detail.
B: Knows the outline of the item.
C: Knows the item to the extent that it can be identified among others.
D: Knows of the existence of the item but knows very little about it.
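The topic structure above can be read with a few regular expressions. The following is only a minimal sketch: `parse_topics` and the sample topic are hypothetical, and it assumes the tags appear in the file exactly as shown in the structure (the real topic files may differ, for example in how the SPECIALTY value is quoted).

```python
import re

# Element names taken from the topic structure above.
SIMPLE_FIELDS = ("NUM", "TYPE", "CATEGORY", "TITLE", "DESC", "TERM", "BACK", "RELE")


def parse_topics(text):
    """Parse <TOPIC> blocks into dictionaries (hypothetical helper)."""
    topics = []
    for block in re.findall(r"<TOPIC>(.*?)</TOPIC>", text, flags=re.S):
        topic = {}
        for tag in SIMPLE_FIELDS:
            m = re.search(rf"<{tag}>(.*?)</{tag}>", block, flags=re.S)
            # Collapse whitespace so values wrapped over lines become one string.
            topic[tag] = " ".join(m.group(1).split()) if m else None
        # TITLE and CATEGORY hold lists separated by ", " (a comma and a space).
        for tag in ("TITLE", "CATEGORY"):
            topic[tag] = topic[tag].split(", ") if topic[tag] else []
        # The knowledge level code (A to D) is the SPECIALTY attribute of USER;
        # straight or typographic quotes around the value are both accepted.
        m = re.search(r'<USER\s+SPECIALTY=["“”]?([A-D])["“”]?\s*>(.*?)</USER>',
                      block, flags=re.S)
        topic["SPECIALTY"] = m.group(1) if m else None
        topic["USER"] = " ".join(m.group(2).split()) if m else None
        topics.append(topic)
    return topics


# Made-up topic used only to exercise the parser.
sample = """<TOPIC>
<NUM>0001</NUM>
<TYPE>1</TYPE><CATEGORY>B, D</CATEGORY>
<TITLE>term one, term two</TITLE>
<DESC>One sentence.</DESC>
<NARR>
<BACK>Background text.</BACK>
</NARR>
<USER SPECIALTY="C">attributes</USER>
</TOPIC>"""
topics = parse_topics(sample)
```

The topic file names end in "euc", so the text would presumably be read with something like `open(path, encoding="euc_jp").read()` before being passed to `parse_topics`; the encoding is inferred from the file name, not confirmed.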
- Location
training topics on elvis (will be moved to /u2 eventually)
- 2004: /u0/ntcir/2004/topics/ntcweb4b-formal-topics-euc.txt
test topics on elvis
- 2005: /u2/ntcir/2005/web/topics/ntcweb5-navi-frun-topics-1.euc.txt
The known item search is to find representative Web pages of a given item, not a given Web page. A representative Web page may be a site top page, an entry page to a series of related pages, or a single fully informative page. Two user situations are assumed: (i) the user requests the typical pages of a known object (e.g., a person, shop, or facility) and searches using the name of the object, and (ii) the user knows the requested object but does not remember its name, so he/she searches using attribute information or related information about the object. In both cases, the number of relevant documents tends to be just one or a few. Consequently, the subtask can be regarded as including home page finding and named page finding in the TREC Web Track, but is not restricted to them.
(doc: NTCIR-5 WEB Navigational Retrieval Task 2 (Navi 2) Formal Run Manual (version 1.0))
Query Term Expansion Task is a newly proposed pilot subtask. Its detailed task definition will be fixed based on discussions among the organizers and the participants. For more information, please visit the subtask's web page.
DATE | ACTION
2004-08-01 | Call for Participation (preliminary)
2004-09-20 | Registration Due
2004-10-01 | Document Data Release
2004-12-01 | Dry-Run Topics Release
2005-01-01 | Dry-Run Results Submission
2005-03-01 | Dry-Run Evaluation Results Release
2005-04-15 | Formal-Run Topics Release
2005-06-10 | Formal-Run Results Submission
2005-08-20 | Formal-Run Evaluation Results Release
2005-10-01 | Submission Due of Camera-ready Manuscript for the Working Notes
2005-12-06 to 2005-12-09 | Workshop Meeting
2006-02- | Submission Due of Camera-ready Manuscript for the Proceedings
1. Files to Be Submitted
On submission of search results, the search result document lists and the system descriptions should be submitted.
2. Types of Runs
Participants can use the following combinations of topic elements for the search run execution. The other topic elements must not be used.
(1) TITLE only (mandatory)
(2) Any combination of TITLE, DESC, and NARR/BACK
(3) Any combination of TYPE and CATEGORY added to the above two
When submitting run results using (3), it is strongly recommended to also submit run results using (1) or (2), excluding TYPE and CATEGORY from it.
Both automatic and interactive processing modes are permitted. A run is regarded as interactive when a human interacts in any way during search topic processing and/or search execution; otherwise it is regarded as automatic.
3. Numbers of Runs
There is no limit on the number of run results that can be submitted. However, the number of run results included in the pooling may be limited according to the total number of run results submitted by all participants. In that case, run results will be included in order of the priority indicated in the ‘Priority’ field of the system description, starting from the highest; therefore ‘Priority’ must be indicated.
4. Retrieval Result Format
On submission of run results, it is required to make run result document lists according to the following format.
<qid> <iter> <docid> <rank> <sim> <runid>
e.g.: 0001 0 0000001_0000001 0 4238 GroupID-LA40
<qid>: Search topic number (the content of <NUM> in each search topic). The list should be sorted in ascending order of <qid>.
<iter>: Fixed to ‘0’.
<docid>: Document ID. It must be included in the “doclist” file of the distribution data.
<rank>: Fixed to ‘0’.
<sim>: Similarity (real number). It should be larger for the documents to be retrieved first.
<runid>: Run ID defined by each participant, starting with the Group-ID (Group's ID: IUSIS) already specified in the application form for participation, followed by a hyphen.
The data should include only ASCII alphanumeric characters and hyphens. Each pair of fields in a tuple is separated by a horizontal tab character (0x09), and each tuple is terminated by a new line character (0x0a).
The number of retrieval result documents should be no more than 100 for each topic for each run, and the documents should be sorted in the order of rank from higher to lower. On the system evaluation, the document order in the submitted retrieval result is treated as the ranks of the documents regardless of the values of <sim>.
The retrieval result document list of each run for all topics should be submitted in a plain text file with “<runid>.res” as its file name.
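The format rules in this section can be collected into a small writer. This is a sketch only: `format_run`, its input shape (a mapping from topic number to `(docid, sim)` pairs), and the run ID `IUSIS-test1` are hypothetical, and applying the "ASCII alphanumeric characters and hyphens" rule to the run ID is this sketch's reading of the text, since the example document ID itself contains an underscore.

```python
import re

MAX_DOCS_PER_TOPIC = 100  # "no more than 100 for each topic for each run"

# Assumption: a run ID is the group ID, a hyphen, then alphanumerics/hyphens.
RUNID_PATTERN = re.compile(r"[A-Za-z0-9]+-[A-Za-z0-9-]+")


def format_run(results, runid):
    """Return the text of a <runid>.res file (hypothetical helper).

    results maps a topic number string (the <NUM> value) to a list of
    (docid, sim) pairs in any order.
    """
    if not RUNID_PATTERN.fullmatch(runid):
        raise ValueError(f"unexpected run ID: {runid!r}")
    lines = []
    for qid in sorted(results):  # ascending order of <qid>
        # Best documents first: evaluation uses line order, not <sim>.
        ranked = sorted(results[qid], key=lambda pair: pair[1], reverse=True)
        for docid, sim in ranked[:MAX_DOCS_PER_TOPIC]:
            # <iter> and <rank> are fixed to 0; fields are tab-separated.
            lines.append("\t".join([qid, "0", docid, "0", str(sim), runid]))
    # Every tuple is terminated by a newline character (0x0a).
    return "".join(line + "\n" for line in lines)


# Made-up scores and document IDs, for illustration only.
demo = format_run(
    {"0002": [("0000001_0000003", 1.0)],
     "0001": [("0000001_0000001", 0.5), ("0000001_0000002", 2.5)]},
    "IUSIS-test1",
)
```

The returned text would then be written to `IUSIS-test1.res`, for instance with `open("IUSIS-test1.res", "w", encoding="ascii", newline="\n")`, so that line endings stay as 0x0a on every platform.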
5. System Description
On submission of run results, it is required to make a concise description of each run according to the following format.
<SYSDESC>
<SUBTASK>Subtask</SUBTASK>
<RUNID>RunIDs</RUNID>
<TOPICPART>TopicPart</TOPICPART>
<QUERYMETHOD>QueryMethod</QUERYMETHOD>
<QUERYUNIT>QueryUnit</QUERYUNIT>
<QUERYEXPAN>QueryExpan</QUERYEXPAN>
<LINKINFO>LinkInfo</LINKINFO>
<URLINFO>URLInfo</URLINFO>
<TAGINFO>TagInfo</TAGINFO>
<ANCHOR>Anchor</ANCHOR>
<IRMODEL>IRModel</IRMODEL>
<RANKING>Ranking</RANKING>
<INDEXUNIT>IndexUnit</INDEXUNIT>
<INDEXTECH>IndexTech</INDEXTECH>
<INDEXSTRUC>IndexStruc</INDEXSTRUC>
<FILTERING>Filtering</FILTERING>
<RESOURCE>Resource</RESOURCE>
<PRIORITY>Priority</PRIORITY>
<RUNTIME>RunTime</RUNTIME>
<INDEXTIME>IndexTime</INDEXTIME>
<NOTE>Note</NOTE>
</SYSDESC>
Subtask: Fixed to ‘NTCIR-5 WEB Navi-2’.
RunIDs: <runid> as explained in Section 4. When designating more than one, they should be separated by “, ” (a comma and a space).
TopicPart: The part of the topic used; e.g., TITLE, DESC, TITLE+DESC, TITLE+BACK, TITLE+BACK+TYPE, etc.
QueryMethod: automatic or interactive. automatic: runs without any human intervention during query processing and search. interactive: all runs other than “automatic”.
QueryUnit: Unit of query; e.g., character bi-gram, word+phrase, etc.
QueryExpan: Techniques used to expand queries; if not used, leave it blank; e.g., pseudo-relevance feedback, no query expansion, etc.
LinkInfo: How link information is used for searching and ranking; if not used, leave it blank; e.g., PageRank, entry page detection, etc.
URLInfo: How the URL is used for searching and ranking; if not used, leave it blank; e.g., re-ranking using directory depth, etc.
TagInfo: How HTML tags are used for searching and ranking; if not used, leave it blank; e.g., weighting title and headings, extracting page body, etc.
Anchor: How anchor text is used for ranking and searching; if not used, leave it blank; e.g., indexing the out-linked document, etc.
IRModel: IR model; e.g., Boolean model, vector space model, probabilistic model, etc.
Ranking: Ranking technique used; e.g., tf, tf-idf, mutual information, URL length, PageRank, etc.
IndexUnit: Unit of index; e.g., character, character bi-gram, word, phrase, etc.
IndexTech: Techniques used to process index terms; e.g., morphology, stemming, POS, etc.
IndexStruc: Index structure; e.g., PAT, inverted file, signature file, etc.
Filtering: Filtering method for extracting useful pages or for discarding unnecessary pages; e.g., Kleinberg and Chakrabarti’s topic distillation, Web page selection using an internet directory, SPAM filtering using SPAM-like words or patterns, etc.
Resource: External resources used for indexing, filtering, or searching, other than the data provided by the organizers; e.g., Yahoo directory, training data set, etc.
Priority: Priority rank of each of the RunIDs (from first to last); e.g., 1, 2, 3, …, or “RunID:1, RunID:2, …” when designating more than one RunID in a SYSDESC.
RunTime: (optional) Average seconds consumed for searching per topic.
IndexTime: (optional) Hours consumed for indexing.
Note: (optional) Any additional information.
It is desirable that all the items are described in detail and concretely, not limited to the examples indicated above.
Data should include only ASCII characters, and the characters ‘<’ and ‘>’ must not be used in any item value of a system description. The text language should be English.
A system description may describe a single run or a group of runs. Each system description should be flanked by “<SYSDESC>” and “</SYSDESC>”.
All system descriptions should be stored in one plain text file with “GroupID.sys” as its file name. “GroupID” is the group ID entered on the registration form.
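A system description can be generated and checked against the two character constraints above (ASCII only, no ‘<’ or ‘>’ in values). The sketch below is illustrative only: `format_sysdesc` and the example values are hypothetical, and emitting an empty element for a field left blank is an assumption about what "leave it blank" means in the file.

```python
# Element names in the order of the SYSDESC template above.
SYSDESC_FIELDS = [
    "SUBTASK", "RUNID", "TOPICPART", "QUERYMETHOD", "QUERYUNIT", "QUERYEXPAN",
    "LINKINFO", "URLINFO", "TAGINFO", "ANCHOR", "IRMODEL", "RANKING",
    "INDEXUNIT", "INDEXTECH", "INDEXSTRUC", "FILTERING", "RESOURCE",
    "PRIORITY", "RUNTIME", "INDEXTIME", "NOTE",
]


def format_sysdesc(values):
    """Render one <SYSDESC> block from a dict of field values (hypothetical)."""
    unknown = set(values) - set(SYSDESC_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    lines = ["<SYSDESC>"]
    for tag in SYSDESC_FIELDS:
        value = values.get(tag, "")  # fields not supplied are left blank
        if not value.isascii():
            raise ValueError(f"{tag}: value must be ASCII only")
        if "<" in value or ">" in value:
            raise ValueError(f"{tag}: value must not contain '<' or '>'")
        lines.append(f"<{tag}>{value}</{tag}>")
    lines.append("</SYSDESC>")
    return "\n".join(lines) + "\n"


# Example values are made up for illustration.
desc = format_sysdesc({
    "SUBTASK": "NTCIR-5 WEB Navi-2",
    "RUNID": "IUSIS-test1",
    "TOPICPART": "TITLE",
    "QUERYMETHOD": "automatic",
    "PRIORITY": "1",
})
```

Several such blocks concatenated together would form the single plain text file named after the group ID, e.g. `IUSIS.sys` for the group ID given in Section 4.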
6. How to Submit the Search Results
On submission of search results, the search result document lists and the system descriptions should be submitted according to the following method. Upon receipt of search results, a receipt notification mail will be returned. If it is not delivered within 3 days of submission, contact the destination address below.
(1) Media: E-mail attachment files. Attachment file names should be explicitly listed in the Subject field.
(2) Destination: oyama@nii.ac.jp
(3) Due date: Friday, 10 June 2005.
Relevance of each document to the search topic was judged into one of the following levels by absolute evaluation:
Additional judgment
3. The different item is more well-known than the searched item.
2. The different item is as well-known as the searched item.
1. The different item is less well-known than the searched item.
0. Not undistinguishable
The organizers are going to apply DCG (Discounted Cumulative Gain), WRR (Weighted Reciprocal Rank) [1, 2, 3], and an evaluation measure that takes into account the costs of inputting search terms and browsing retrieval result documents. If a group of duplicated pages, a group of frame set and frame pages, or a group of pages displayed by automatic hyper-jumping is included in the search result document list, redundancy will be taken into account in calculating the evaluation value. Moreover, if an undistinguishable document is included in the search result document list, we will investigate evaluation methods regarding its effects.
DCG: (Ga, Gb) = (3, 0) for (RL-1) and (Ga, Gb) = (3, 2) for (RL-2) (more to read on pages 10-11)
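To see how the two gain settings change a score, here is a rough sketch of DCG. It assumes the widely used Järvelin and Kekäläinen formulation with log base 2 (no discount at rank 1, gain divided by log2 of the rank afterwards); the manual's own definition on the pages cited above may differ. The level labels "A" and "B" are inferred from the subscripts of Ga and Gb, and `dcg` and the example ranking are hypothetical.

```python
import math

# Gain values quoted above; level labels "A"/"B" are inferred from Ga/Gb.
GAINS = {
    "RL-1": {"A": 3, "B": 0},
    "RL-2": {"A": 3, "B": 2},
}


def dcg(levels, gains, base=2):
    """Discounted cumulative gain of a ranked list of relevance levels.

    levels[i] is the judged level of the document at rank i + 1, or None
    for a document with no gain. Assumed formulation: ranks below the log
    base are not discounted; later ranks are divided by log_base(rank).
    """
    total = 0.0
    for rank, level in enumerate(levels, start=1):
        gain = gains.get(level, 0)
        discount = 1.0 if rank < base else math.log(rank, base)
        total += gain / discount
    return total


# Made-up judged ranking: level A at ranks 1 and 4, level B at rank 3.
ranking = ["A", None, "B", "A"]
score_rl1 = dcg(ranking, GAINS["RL-1"])
score_rl2 = dcg(ranking, GAINS["RL-2"])
```

Under RL-1 the level-B document at rank 3 contributes nothing, while under RL-2 it adds 2 / log2(3), so `score_rl2` is larger than `score_rl1` for this ranking.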
very BASIC run (green: possible strategies; red: things need to be done; purple: things to think)
1. Query expansion: expand kanji to hiragana
2. Phrase detection from topic 3 using lucent
3. tuning if possible
For details, see the agenda…
Dry run (didn’t turn in results this year) & Training data
- Content
- Location
- Location