|
1
|
- Kiduk Yang
- SLIS, Indiana University
- December 2002
|
|
2
|
- TREC Overview
- WIDIT in TREC
- Future Research
|
|
3
|
- Annual Information Retrieval conference
- Sponsored by
- National Institute of Science & Technology (NIST)
- Defense Advanced Research Project Agency (DARPA)
- Other U.S. agencies (e.g. DOD-ARDA)
- Attended by
- International researchers from academic, commercial, and government
institutions
- 93 groups from 22 countries in 2003
- Goals
- Increase IR research based on large-scale data
- Refine IR evaluation methodologies
- Create test collections for various aspects of IR
- Stimulate exchange of ideas & communication among academia,
industry, and government
|
|
4
|
- Static Text
- Streamed Text
- Human-in-the-loop
- Beyond English
- Spanish, Chinese, X-Language
- Beyond Text
- Web Searching
- Answers, not Documents
- Task-specific
- User-specific
- Beyond English
- Beyond Text
- Web Searching
- Web, Interactive, Terabyte
- Answers, not Documents
- Domain-specific
|
|
5
|
|
|
6
|
- Motivation
- Web search has become a daily information access mechanism
- 79% of Web users access the Internet daily.
85% of Web users use search engines to find information.
- GVU center at Georgia Tech (Kehoe et al., 1999)
- 670M Web searches per day (250M by Google)
- Search Engine Watch (2/2003)
- New Challenges
- Data: massive, dynamic, diverse
- Users: diverse, “transitory”
- New Opportunities
- Multiple sources of evidence
- content, hyperlinks, document structure, user data, taxonomies
- Data abundance/redundancy
- Review
- Yang (2005). Information Retrieval on the Web, ARIST Vol. 39
- http://ella.slis.indiana.edu./~kiyang/dissertation/litrev_webir.pdf
|
|
7
|
- Research Questions
- Which “traditional” IR approaches work for Web IR?
- What “new” approaches are effective in Web IR?
- IR Approaches tried
- Traditional
- Term Weighting
- Query Expansion
- Pseudo-Relevance Feedback
- Passage Retrieval
- Web-specific
- Link Analysis
- Indegree, PageRank, Google, anchor-text
- Document Structure
- HTML tags (meta, title, font, etc.)
- URL Information
|
|
8
|
|
|
9
|
- Data
- 18 GB, 1.25 million .gov Web pages (2002 breadth-first crawl)
- Tasks
- Topic Distillation
- Find relevant home pages given a broad query.
- pages devoted to the topic
- provides credible information on the topic
- is not a part of a larger site also devoted to the topic
- 50 broad topics
- e.g. “cotton industry”, http://cpru.usda.gov
- Home/Named Page Finding
- Find a particular page named in a query
- 150 HP & 150 NP topics
- e.g. (HP) “Internal Revenue Service”,
http://www.irs.gov
- e.g. (NP) “passport application form”, http://travel.state.gov/dsp11.pdf
|
|
10
|
- Web Information Discovery Integrated Tool
- (http://widit.slis.indiana.edu)
- Descendent of IRIS (http://ils.unc.edu/iris/)
- (elvis) 2.4GHz, dual Xeon processor, 4GB RAM, 1.3 TB RAID disk, FreeBSD
- Objectives
- Function as a modular system toolkit for IR research
- Provide an environment for
- IR system prototyping
- IT application development
- Foster collaborative research
|
|
11
|
- Indexing Modules
- HTML parser
- Stemmers
- Noun phrase Constructor
- SMART term weights
- Retrieval Modules
- Vector Space Model
- Subcollection retrieval
- Taxonomy Search
- Feedback Modules
- Adaptive Linear
- Probabilistic
- Link Analysis Modules
- Classification Modules
- Classification Dictionary
- Centroid Classifier
- Fusion Modules
- Similarity Merge
- Weighted Sum
- Web Crawler
- Taxonomy Crawler
- Sitemap Display
- Site Search
- Indexing Modules
- LSI
- Thesaurus Constructor
- OKAPI term weights
- Retrieval Modules
- Probabilistic Model
- Meta-Search
- Passage Retrieval
- Query Expansion Modules
- WordNet
- Local Context Analysis
- Link Analysis Modules
- Classification Modules
- Facetted Classifier
- Probabilistic Classifier
- Support Vector Machine
- Fusion Modules
- Web Crawler
- Link Summarizer
- Site Checker
- Personalized Search
|
|
12
|
- Basic Approach
- Index multiple sources of evidence (MSE)
- Document Content
- Document Structure
- title, meta keywords & description, emphasized words
- Link Structure
- anchor text, in/outdegree
- Parallel search
- Multiple Index
- body text (title, body)
- anchor text (title, inlink anchor text)
- Header text (title, meta kw & desc, first heading, emphasized
words)
- Multiple Query format
- simple query
- phrase, acronyms, abbreviations
- Combine results (i.e. fusion)
- Weighted Sum formula
- Overlap Weighted Sum formula
- Weighted Overlap Weighted Sum formula
- Rerank based on MSE
- Site Compression
- Phrase, Proximity & Acronym match
- In/Outdegree
|
|
13
|
|
|
14
|
- Shared Components
- Indexing & Searching
- Fusion
- Different training data ® Different
fusion formula
- Topic Distillation
- Reranking
- Site Compression
- Cluster results by site
- Rank sites
- Boost the scores of best pages from each site
- Content + Link Evidence
- inter-/intra indegree & outdegree
- phrase/proximity match counts in body/header/anchor texts
- query term match in URL
- Home/Named Page Finding
- Reranking
- Query Classification
- TD evidence + Acronym matching
|
|
15
|
- Training Data
- 2002 TD topics & relevance judgments
- Problem: 2002 TD <> 2003 TD
- bias for HP methods in 2003
- broader queries in 2003
- different evaluation metrics (p@10 vs. R-precision)
- 2002 NP topics & relevance judgments
- Problem: did not train for HP topics
- System Tuning Process
- Fusion
- 8 sys_wt combinations for each fusion formula (24 systems per task)
- Reranking
- select best single and best fusion systems
- adjust ranking heuristic by dynamic tuning
|
|
16
|
- Topic Distillation
- Baseline
- best single system
- P@10 = 0.168
- 8% improvement by fusion
- P@10 = 0.182 (f1sb90sh1sa9)
- 17% improvement by reranking
- P@10 = 0.196 (f1sb90sh1sa9_r2)
- Named Page Finding
- Baseline
- best single system
- MRR = 0.4714
- 10% improvement by fusion
- MRR = 0.5195 (f3sb80sh19sa1)
- No improvement by reranking
- MRR = 0.4714 (f3sb80sh19sa1_r2)
- Topic Distillation
- Baseline
- best single system
- P@10 = 0.076
- 21% improvement by fusion
- P@10 = 0.092 (f1sb90sh1sa9)
- 29% improvement by reranking
- P@10 = 0.098 (f1sb90sh1sa9_r1)
- P@10 = 0.088 (f1sb90sh1sa9_r2)
- Named Page Finding
- Baseline
- best single system
- MRR = 0.362
- 10% improvement by fusion
- MRR = 0.400 (f3sb80sh19sa1)
- No reranking run submitted
|
|
17
|
- Topic Distillation
- Average Precision (AP)
- Average of precision value after each relevant document is retrieved
- Performance over all relevant documents
- Rewards system that retrieve relevant documents quickly
- P@10
- Precision at rank 10
- High-precision measure (Web-specific)
- R-Precision (RP)
- Precision at rank R, where R is the number of relevant documents for
the topic
- De-emphasize the exact ranking of retrieved relevant documents
- Named Page Finding
- Reciprocal Rank (RR)
- 1/rank, where first correct document is retrieved
- S@10
- Success Rate at rank 10
- Percentage of topics where the correct answer is found in top 10 ranks
|
|
18
|
- Topic Distillation
- by Mean Average Precision (MAP)
- Best WIDIT (Baseline) ranked 38 of 107 systems, 11 of 23 groups
- Best: MRP = 0.1636, MAP = 0.1543, avgP@10 = 0.1240
- WIDIT: MRP = 0.0736, MAP = 0.1016, avgP@10 = 0.0760
- Median: MRP = 0.0699, MAP = 0.0896,
avgP@10 = 0.0700
- Worst2: MRP =
0.0230, MAP = 0.0222, avgP@10
= 0.0200
- by avgP@10
- Best WIDIT (f1sb90sh1sa9_r1) ranked 28 of 107 systems, 12 of 23 groups
- Best: MRP = 0.1485, MAP = 0.1387, avgP@10 = 0.1280
- WIDIT: MRP = 0.0626, MAP = 0.0787, avgP@10 = 0.0980
- Median: MRP = 0.0871, MAP =
0.1057, avgP@10 = 0.0807
- Worst2: MRP =
0.0181, MAP = 0.0250, avgP@10
= 0.0160
- by Mean R-Precision (MRP)
- Best WIDIT (Baseline) ranked 71 of 107 systems, 18 of 23 groups
- Best: MRP = 0.1636, MAP = 0.1543, avgP@10 = 0.1240
- Median: MRP = 0.0699, MAP =
0.0896, avgP@10 = 0.0700
- WIDIT: MRP = 0.0736, MAP = 0.1016, avgP@10 = 0.0760
- Worst2: MRP = 0.0181, MAP = 0.0250, avgP@10 = 0.0160
|
|
19
|
- Home/Named Page Finding
- by Mean Reciprocal Rank (MRR)
- Best WIDIT (f3sb80sh19sa1) ranked 48 of 75 systems, 13 of 19 groups
- Best: MRR = 0.727, S@10 = 89.3
- Median: MRR = 0.496, S@10 =
64.3
- WIDIT: MRR = 0.400, S@10 = 66.3
- Worst: MRR = 0.065, S@10 = 8.7
- by avgS@10
- Best WIDIT (f1sb90sh1sa9_r1) ranked 41 of 75 systems, 13 of 19 groups
- Best: MRR = 0.727, S@10 = 89.3
- Median: MRR = 0.465, S@10 = 68.3
- WIDIT: MRR = 0.400, S@10 = 66.3
- Worst: MRR = 0.065, S@10 = 8.7
|
|
20
|
- Dynamic Tuning
- Rank Boosting heuristic
- Keep top 5 rank static
- Boost potential home pages (root, subroot, path)
- Boost file type page with 2+ query terms in URL
- Stop if top 20 ranks are filled
- Home Page Identification
- URL Type (Tomlinson, 2003; Kraaij et al., 2002)
- HP_end: URL ends w/ index.htm,
default.htm, /, etc
- root: slash_cnt=0 or
(HP_end & slash_cnt=1)
- subroot: HP_end &
slash_cnt=2
- path: HP_end &
slash_cnt>=3
- file: rest
- Improved Query Classification heuristic
- e.g. HP if query ends in all caps
- 227 out of 300 correct classification
|
|
21
|
- Topic Distillation
- by Mean Average Precision (MAP)
- Best WIDIT (f1sb80sh10sa10_r4) ranked 10 of 107 systems, 7 of 23
groups
Best WIDIT
(Baseline) ranked 38 of 107 systems, 11 of 23 groups
- Best: MRP = 0.1636, MAP = 0.1543, avgP@10 = 0.1240
- WIDIT: MRP = 0.1139, MAP = 0.1281, avgP@10 = 0.0980
- WIDIT: MRP = 0.0736, MAP = 0.1016, avgP@10 = 0.0760
- Median: MRP = 0.0699, MAP = 0.0896,
avgP@10 = 0.0700
- Worst2: MRP =
0.0230, MAP = 0.0222, avgP@10
= 0.0200
- by avgP@10
- Best WIDIT (f1sb90sh1sa9_r2) ranked 25 of 107 systems, 11 of 23
groups
Best WIDIT
(f1sb90sh1sa9_r1) ranked 28 of 107 systems, 12 of 23 groups
- Best: MRP = 0.1485, MAP = 0.1387, avgP@10 = 0.1280
- WIDIT: MRP = 0.0626, MAP = 0.1216, avgP@10 = 0.0980
- WIDIT: MRP = 0.0626, MAP = 0.0787, avgP@10 = 0.0980
- Median: MRP = 0.0871, MAP =
0.1057, avgP@10 = 0.0807
- Worst2: MRP =
0.0181, MAP = 0.0250, avgP@10
= 0.0160
- by Mean R-Precision (MRP)
- Best WIDIT (f1sb80sh10sa10_r4) ranked 23 of 107 systems, 11 of 23
groups
Best WIDIT
(Baseline) ranked 71 of 107 systems, 18 of 23 groups
- Best: MRP = 0.1636, MAP = 0.1543, avgP@10 = 0.1240
- WIDIT: MRP = 0.1139, MAP = 0.1281, avgP@10 = 0.0980
- Median: MRP = 0.0699, MAP =
0.0896, avgP@10 = 0.0700
- WIDIT: MRP = 0.0736, MAP = 0.1016, avgP@10 = 0.0760
- Worst2: MRP = 0.0181, MAP = 0.0250, avgP@10 = 0.0160
|
|
22
|
- Home/Named Page Finding
- by Mean Reciprocal Rank (MRR)
- Best WIDIT (f3sb80sh10sa10_r4) ranked 45 of 75 systems, 13 of 19
groups
Best WIDIT
(f3sb80sh19sa1) ranked 48 of 75 systems, 13 of 19 groups
- Best: MRR = 0.727, S@10 = 89.3
- Median: MRR = 0.496, S@10 =
64.3
- WIDIT: MRR = 0.411, S@10 = 68.3
- WIDIT: MRR = 0.400, S@10 = 66.3
- Worst: MRR = 0.065, S@10 = 8.7
- by avgS@10
- Best WIDIT (f3sb80sh10sa10_r4) ranked 39 of 75 systems, 13 of 19
groups
Best WIDIT
(f3sb80sh19sa1) ranked 41 of 75 systems, 13 of 19 groups
- Best: MRR = 0.727, S@10 = 89.3
- Median: MRR = 0.465, S@10 = 68.3
- WIDIT: MRR = 0.411, S@10 = 68.3
- WIDIT: MRR = 0.400, S@10 = 66.3
- Worst: MRR = 0.065, S@10 = 8.7
|
|
23
|
- Topic Distillation
- Average 10.32 relevant documents per topic
- 5 topics w/ 1 relevant documents
- 3 topics w/ 2 relevant documents
- 20 topics w/ 5 or fewer relevant documents
- 33 topics w/ 10 or fewer relevant documents
- WIDIT: 17 w/ P@10=0, 23 w/ RP=0,
4 w/
relevant documents at top 20 rank
- Questions
- Is MRP an appropriate evaluation measure?
- compute performance statistics by topic groups listed above
- statistical analysis
- Why did WIDIT do so badly with topics w/ few relevant documents?
- did not retrieve reldocs at top 20 ranks in 29 of 33 topics (#reldoc
<= 10)
- was it WIDIT-specific or system-wide?
- Why doesn’t Link Analysis help?
|
|
24
|
- Home/Named Page Finding
- Best WIDIT (f3sb80sh10sa10_r4)
- Best: MRR = 0.727, S@10 = 89.3
- WIDIT: MRR = 0.411, S@10 = 68.3
- WIDIT (HP): MRR = 0.320, S@10 = 62.0
- 24 topics w/ first correct answer beyond rank 100
- WIDIT (NP): MRR = 0.503, S@10 = 74.7
- 6 topics w/ first correct answer beyond rank 100
- Questions
- How much effect did HP bias have on system performance?
- compare WIDIT performance levels by query type with those of the top
systems
- tune WIDIT-hp with training data and compare results
- Why did WIDIT rank correct document at very low ranks for some HP
topics?
- was it WIDIT-specific or system-wide?
- failure analysis
|
|
25
|
- Fusion helps
- What worked
- Combining multiple sources of evidence
- System tuning with training data
- Questions
- Why f1 for TD but f3 for NP?
- Other fusion methods?
- Home page identification is important for this year’s TD
- What worked
- HP identification based URL information
- Post-retrieval reranking to boost potential home pages
- Questions
- Why didn’t it help for HP/NP?
- Is dynamic tuning an optimal training approach?
- Different tasks require different approaches
- What worked
- System tuning >> Retrieval techniques
- Need both task typing and task-specific system tuning
- Good task typing but no system tuning for HP capped HP/NP task
performance
|
|
26
|
- Term Weighting
- Query Expansion/Modification
- Thesaurus (e.g. WordNet)
- SVM (Flake et al., 2002)
- Fusion
- Link Analysis
- Modified BHITS (Li, Shang & Zhang, 2002)
- addresses “small-in-large-out” problem of skewed link topology
- add more weights to inlinks if “small-in-large-out” hub page exists
- Implicit Link Analysis (Xue et al, 2003)
- addresses “small Web search” problem caused by truncated link
structure
- construct “implicit” links with association weight based on usage
data
- modify the link matrix of PageRank formula
- Personalized PageRank (Jeh & Widom, 2003; Haveliwala, 2002)
- computes PageRank scores customized for a given topic/category
- match topic to Yahoo! category
- bias PageRank with matched category pages
|
|
27
|
- Time Table
- December 03 - register for participation
- Spring 04 – L641 (http://ella.slis.indiana.edu./~kiyang/teaching/s04/l641/index.shtml)
- Summer - run TREC Experiments
- August - submit Results
- October - submit Notebook Paper
- November - attend Conference
- January 05 - submit Proceedings Paper
- Targeted Tracks
- WIDIT: Web, Robust, Q&A, Genomics
- LAIR: Video, Genomics
|
|
28
|
- Flake, G. W., Glover, E. J., Lawrence, S., Giles, C. L. (2002). Extracting
query modifications from nonlinear SVMs. Proceedings of the 11th
International WWW Conference, 317-324.
- Haveliwala, T. (2002). Topic-sensitive PageRank. Proceedings of the 11th
WWW Conference, 517-526.
- Jeh, G. & Widom, J. (2003). Scaling personalized web search. Proceedings
of the 12th International WWW Conference, 271 – 279.
- Kehoe, C., Pitkow, J., Sutton, K., Aggarwal, G., & Rogers, J. D.
(1999). Results of GVU's Tenth WWW User Survey. Retrieved November 11,
2003, from http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html
- Kraaij, W., Westerveld, T., & Hiemstra, D. (2002). The Importance of Prior Probabilities
for Entry Page Search. Proceedings
of the 25th ACM SIGIR Conference on Research and Development in
Information Retrieval, 27-34.
- Li, L., Shang, Y., & Zhang, W. (2002). Improvement of HITS-based
Algorithms on Web Documents. Proceedings of the 11th International WWW
Conference, 527-535.
- Tomlinson, S. (2003). Robust, Web and Genomic Retrieval with Hummingbird
SearchServer at TREC 2003. The 12th Text REtrieval Conference (TREC
2003) Notebook, 372-385.
- Xue, G., Zeng, H., Chen, Z., Ma, W., Zhang, H, & Lu, C. (2003). Implicit
link analysis for small web search. Proceedings of the 26th ACM SIGIR
Conference on Research and Development in Information Retrieval, 56-63.
|
|
29
|
|
|
30
|
- Strip HTML tags
- extract title, meta keywords & description, emphasized words
- parse out hyperlinks (URL & anchor texts)
- Create pseudo-documents
- anchor texts
- header texts
- Create subcollection Sequential Indexes
- stop & stem (simple plural conflation)
- count term frequencies
- Create subcollection Inverted Indexes
- count subcollection document frequencies
- compute length-normalized term weights
- Compute whole collection term statistics
|
|
31
|
- Rank documents in each subcollection of each index
- Query-Document Similarity
- Parallel Searching
- Multiple Index
- body text (title, body)
- anchor text (title, inlink anchor text)
- Header text (title, meta kw & desc, first heading, emphasized
words)
- Multiple Query format
- simple query
- phrase, acronyms, abbreviations
- Multiple Subcollections
- for search speed and scalability
- use whole collection term statistics
- 2 qry_type * 3 document_index * 46 subcollections = max. 276 per
query
- Merge search results of subcollections
- merge & sort by document score
|
|
32
|
- Fusion Formulas
- Weighted Sum (WS)
- Normalized_Score * sys_wt
- sys_wt = the relative contribution of systems
- Overlap Weighted Sum
- WS * Overlap
- Overlap = # of systems retrieving a given document
- Weighted Overlap Weighted Sum
- WS * weighted_Overlap
- weighted_Overlap = Overlap * sys_wt
- Select best system weights (sys_wt)
- based on training data
- e.g. 0.9 for body text, 0.09
for header text, 0.01 for anchor text
|
|
33
|
- Reranking Methods
- Site Compression
- Cluster results by site
- Rank sites
- sort site by highest document score of each site
- Float of top n documents from
top m site to the top
- Content + Link Evidence Heuristic
- inter-/intra indegree & outdegree
- e.g. boost TD score if large outdegree
- e.g. boost HP score if large indegree
- phrase/proximity match counts in body/header/anchor texts
- e.g. boost score if phrase match in title or anchor text
- query term match in URL
- heuristic variations for HP and NP
- query classification by term occurrence
- System Training by Dynamic Tuning
|
|
34
|
- SMART lnu weight for document terms
- SMART ltc weight for query terms
- where:
- fik = number of
times term k appears in document i
- idfk = inverse
document frequency of term k
- t = number of terms in
document/query
|
|
35
|
- Document Score
- inner product of document and query vectors
- where:
- qk = weight of term k in the query
- dik = weight of term k in document i
- t = number of terms common to query & document
|
|
36
|
- Weighted Sum
-
FSws = S(wi*NSi)
- Overlap Weighted Sum
- FSows = S(wi*NSi) * olp
- Weighted Overlap Weighted Sum
- FSwows = S(wi*NSi) * (wi*olp)
- where:
- wi = weight of system i
NSi =
normalized score of a document by system i
- = (Si – Smin)
/ (Smax – Smin)
- olp = number of systems that
retrieved a given document
|