Notes
Slide Show
Outline
1
WIDIT in TREC-2003 Web Track
  • Kiduk Yang
  • SLIS, Indiana University
  • December 2002
2
Outline
  • TREC Overview
  • WIDIT in TREC


  • Future Research
3
What is TextREtrievalConference?
  • Annual Information Retrieval conference
    • Sponsored by
      • National Institute of Science & Technology (NIST)
      • Defense Advanced Research Project Agency (DARPA)
      • Other U.S. agencies (e.g. DOD-ARDA)
    • Attended by
      • International researchers from academic, commercial, and government institutions
      • 93 groups from 22 countries in 2003
  • Goals
    • Increase IR research based on large-scale data
    • Refine IR evaluation methodologies
    • Create test collections for various aspects of IR
    • Stimulate exchange of ideas & communication among academia, industry, and government
4
TREC Tasks: Tracks
  • Static Text
    • Ad-hoc
  • Streamed Text
    • Routing, Filtering
  • Human-in-the-loop
    • Interactive
  • Beyond English
    • Spanish, Chinese, X-Language
  • Beyond Text
    • Speech
  • Web Searching
    • Web, VLC
  • Answers, not Documents
    • Q&A
  • Task-specific
    • Robust
    • Novelty
  • User-specific
    • HARD
  • Beyond English
    • moved to CLEF, NTCIR
  • Beyond Text
    • Video
  • Web Searching
    • Web, Interactive, Terabyte
  • Answers, not Documents
    • Q&A
  • Domain-specific
    • Genomics
5
TREC Timeline
6
Web IR Research
  • Motivation
    • Web search has become a daily information access mechanism
      • 79% of Web users access the Internet daily.
        85% of Web users use search engines to find information.
        • GVU center at Georgia Tech (Kehoe et al., 1999)
      • 670M Web searches per day (250M by Google)
        • Search Engine Watch (2/2003)
    • New Challenges
      • Data:  massive, dynamic, diverse
      • Users: diverse, “transitory”
    • New Opportunities
      • Multiple sources of evidence
        • content, hyperlinks, document structure, user data, taxonomies
      • Data abundance/redundancy
  • Review
    • Yang (2005). Information Retrieval on the Web, ARIST Vol. 39
    • http://ella.slis.indiana.edu./~kiyang/dissertation/litrev_webir.pdf
7
TREC Web Track: Overview
  • Research Questions
    • Which “traditional” IR approaches work for Web IR?
    • What “new” approaches are effective in Web IR?
  • IR Approaches tried
    • Traditional
      • Term Weighting
      • Query Expansion
        • Pseudo-Relevance Feedback
      • Passage Retrieval
    • Web-specific
      • Link Analysis
        • Indegree, PageRank, Google, anchor-text
      • Document Structure
        • HTML tags (meta, title, font, etc.)
      • URL Information
8
Web Track Timeline
9
Web Track 2003
  • Data
    • 18 GB, 1.25 million .gov Web pages (2002 breadth-first crawl)
  • Tasks
    • Topic Distillation
      • Find relevant home pages given a broad query.
        • pages devoted to the topic
        • provides credible information on the topic
        • is not a part of a larger site also devoted to the topic
      • 50 broad topics
        • e.g. “cotton industry”,  http://cpru.usda.gov
    • Home/Named Page Finding
      • Find a particular page named in a query
      • 150 HP & 150 NP topics
        • e.g. (HP) “Internal Revenue Service”,  http://www.irs.gov
        • e.g. (NP) “passport application form”,  http://travel.state.gov/dsp11.pdf
10
WIDIT: Overview
  • Web Information Discovery Integrated Tool
    • (http://widit.slis.indiana.edu)
    • Descendent of IRIS (http://ils.unc.edu/iris/)
    • (elvis) 2.4GHz, dual Xeon processor, 4GB RAM, 1.3 TB RAID disk, FreeBSD
  • Objectives
    • Function as a modular system toolkit for IR research
      • e.g. TREC
    • Provide an environment for
      • IR system prototyping
      • IT application development
    • Foster collaborative research
      • e.g. ICDL, ISKO



11
WIDIT: Modules
  • Indexing Modules
    • HTML parser
    • Stemmers
    • Noun phrase Constructor
    • SMART term weights
  • Retrieval Modules
    • Vector Space Model
    • Subcollection retrieval
    • Taxonomy Search
  • Feedback Modules
    • Adaptive Linear
    • Probabilistic
  • Link Analysis Modules
    • Modified HITS Algorithm
  • Classification Modules
    • Classification Dictionary
    • Centroid Classifier
  • Fusion Modules
    • Similarity Merge
    • Weighted Sum
  • Web Crawler
    • Taxonomy Crawler
    • Sitemap Display
    • Site Search
  • Indexing Modules
    • LSI
    • Thesaurus Constructor
    • OKAPI term weights
  • Retrieval Modules
    • Probabilistic Model
    • Meta-Search
    • Passage Retrieval
  • Query Expansion Modules
    • WordNet
    • Local Context Analysis
  • Link Analysis Modules
    • Personalized PageRank
  • Classification Modules
    • Facetted Classifier
    • Probabilistic Classifier
    • Support Vector Machine
  • Fusion Modules
    • Dynamic Fusion
  • Web Crawler
    • Link Summarizer
    • Site Checker
    • Personalized Search
12
WIDIT in TREC-2003
  • Basic Approach
    • Index multiple sources of evidence (MSE)
      • Document Content
      • Document Structure
        • title, meta keywords & description, emphasized words
      • Link Structure
        • anchor text, in/outdegree
    • Parallel search
      • Multiple Index
        • body text (title, body)
        • anchor text (title, inlink anchor text)
        • Header text (title, meta kw & desc, first heading, emphasized words)
      • Multiple Query format
        • simple query
        • phrase, acronyms, abbreviations
    • Combine results (i.e. fusion)
      • Weighted Sum formula
      • Overlap Weighted Sum formula
      • Weighted Overlap Weighted Sum formula
    • Rerank based on MSE
      • Site Compression
      • Phrase, Proximity & Acronym match
      • In/Outdegree
13
WIDIT: System Architecture
14
WIDIT: TD vs. HP/NP
  • Shared Components
    • Indexing & Searching
      • Same processes
    • Fusion
      • Different training data ® Different fusion formula
  • Topic Distillation
    • Reranking
      • Site Compression
        • Cluster results by site
        • Rank sites
        • Boost the scores of best pages from each site
      • Content + Link Evidence
        • inter-/intra indegree & outdegree
        • phrase/proximity match counts in body/header/anchor texts
        • query term match in URL
  • Home/Named Page Finding
    • Reranking
      • Query Classification
      • TD evidence + Acronym matching
15
WIDIT: System Tuning
  • Training Data
    • 2002 TD topics & relevance judgments
      • Problem: 2002 TD <> 2003 TD
        • bias for HP methods in 2003
        • broader queries in 2003
        • different evaluation metrics (p@10 vs. R-precision)
    • 2002 NP topics & relevance judgments
      • Problem: did not train for HP topics
        • WT10g not indexed
  • System Tuning Process
    • Fusion
      • 8 sys_wt combinations for each fusion formula (24 systems per task)
    • Reranking
      • select best single and best fusion systems
      • adjust ranking heuristic by dynamic tuning
16
WIDIT: Results
  • Topic Distillation
    • Baseline
      • best single system
        • simple query, body text
      • P@10 = 0.168
    • 8% improvement by fusion
      • P@10 = 0.182 (f1sb90sh1sa9)
    • 17% improvement by reranking
      • P@10 = 0.196 (f1sb90sh1sa9_r2)
  • Named Page Finding
    • Baseline
      • best single system
        • simple query, body text
      • MRR = 0.4714
    • 10% improvement by fusion
      • MRR = 0.5195 (f3sb80sh19sa1)
    • No improvement by reranking
      • MRR = 0.4714 (f3sb80sh19sa1_r2)
  • Topic Distillation
    • Baseline
      • best single system
        • simple query, body text
      • P@10 = 0.076
    • 21% improvement by fusion
      • P@10 = 0.092 (f1sb90sh1sa9)
    • 29% improvement by reranking
      • P@10 = 0.098 (f1sb90sh1sa9_r1)
      • P@10 = 0.088 (f1sb90sh1sa9_r2)
  • Named Page Finding
    • Baseline
      • best single system
        • simple query, body text
      • MRR = 0.362
    • 10% improvement by fusion
      • MRR = 0.400 (f3sb80sh19sa1)
    • No reranking run submitted
17
WIDIT: Evaluation Measures
  • Topic Distillation
    • Average Precision (AP)
      • Average of precision value after each relevant document is retrieved
      • Performance over all relevant documents
      • Rewards system that retrieve relevant documents quickly
    • P@10
      • Precision at rank 10
      • High-precision measure (Web-specific)
    • R-Precision (RP)
      • Precision at rank R, where R is the number of relevant documents for the topic
      • De-emphasize the exact ranking of retrieved relevant documents

  • Named Page Finding
    • Reciprocal Rank (RR)
      • 1/rank, where first correct document is retrieved
    • S@10
      • Success Rate at rank 10
      • Percentage of topics where the correct answer is found in top 10 ranks
18
WIDIT: System Rankings
  • Topic Distillation
    • by Mean Average Precision (MAP)
      • Best WIDIT (Baseline) ranked 38 of 107 systems, 11 of 23 groups
        • Best:     MRP = 0.1636,  MAP = 0.1543, avgP@10 = 0.1240
        • WIDIT:  MRP = 0.0736,  MAP = 0.1016, avgP@10 = 0.0760
        • Median: MRP = 0.0699,  MAP = 0.0896, avgP@10 = 0.0700
        • Worst2:  MRP = 0.0230,  MAP = 0.0222, avgP@10 = 0.0200
    • by avgP@10
      • Best WIDIT (f1sb90sh1sa9_r1) ranked 28 of 107 systems, 12 of 23 groups
        • Best:     MRP = 0.1485,  MAP = 0.1387, avgP@10 = 0.1280
        • WIDIT:  MRP = 0.0626,  MAP = 0.0787, avgP@10 = 0.0980
        • Median: MRP = 0.0871,  MAP = 0.1057, avgP@10 = 0.0807
        • Worst2:  MRP = 0.0181,  MAP = 0.0250, avgP@10 = 0.0160
    • by Mean R-Precision (MRP)
      • Best WIDIT (Baseline) ranked 71 of 107 systems, 18 of 23 groups
        • Best:     MRP = 0.1636,  MAP = 0.1543, avgP@10 = 0.1240
        • Median: MRP = 0.0699,  MAP = 0.0896, avgP@10 = 0.0700
        • WIDIT:  MRP = 0.0736,  MAP = 0.1016, avgP@10 = 0.0760
        • Worst2:  MRP = 0.0181,  MAP = 0.0250, avgP@10 = 0.0160
19
WIDIT: System Rankings
  • Home/Named Page Finding
    • by Mean Reciprocal Rank (MRR)
      • Best WIDIT (f3sb80sh19sa1) ranked 48 of 75 systems, 13 of 19 groups
        • Best:     MRR = 0.727,  S@10 = 89.3
        • Median: MRR = 0.496,  S@10 = 64.3
        • WIDIT:  MRR = 0.400,  S@10 = 66.3
        • Worst:   MRR = 0.065,  S@10 =   8.7
    • by avgS@10
      • Best WIDIT (f1sb90sh1sa9_r1) ranked 41 of 75 systems, 13 of 19 groups
        • Best:     MRR = 0.727,  S@10 = 89.3
        • Median: MRR = 0.465,  S@10 = 68.3
        • WIDIT:   MRR = 0.400,  S@10 = 66.3
        • Worst:    MRR = 0.065,  S@10 =   8.7
20
WIDIT: Post-Submission Runs
  • Dynamic Tuning
    • Rank Boosting heuristic
      • Keep top 5 rank static
      • Boost potential home pages (root, subroot, path)
      • Boost file type page with 2+ query terms in URL
      • Stop if top 20 ranks are filled
    • Home Page Identification
      • URL Type (Tomlinson, 2003; Kraaij et al., 2002)
        • HP_end:  URL ends w/ index.htm, default.htm, /, etc
          • root:        slash_cnt=0 or (HP_end & slash_cnt=1)
          • subroot:  HP_end & slash_cnt=2
          • path:       HP_end & slash_cnt>=3
          • file:          rest
    • Improved Query Classification heuristic
      • e.g. HP if query ends in all caps
      • 227 out of 300 correct classification
21
WIDIT: System Rankings2
  • Topic Distillation
    • by Mean Average Precision (MAP)
      • Best WIDIT (f1sb80sh10sa10_r4) ranked 10 of 107 systems, 7 of 23 groups
            Best WIDIT (Baseline) ranked 38 of 107 systems, 11 of 23 groups
        • Best:     MRP = 0.1636,  MAP = 0.1543, avgP@10 = 0.1240
        • WIDIT:  MRP = 0.1139,  MAP = 0.1281, avgP@10 = 0.0980
        • WIDIT:  MRP = 0.0736,  MAP = 0.1016, avgP@10 = 0.0760
        • Median: MRP = 0.0699,  MAP = 0.0896, avgP@10 = 0.0700
        • Worst2:  MRP = 0.0230,  MAP = 0.0222, avgP@10 = 0.0200
    • by avgP@10
      • Best WIDIT (f1sb90sh1sa9_r2) ranked 25 of 107 systems, 11 of 23 groups
            Best WIDIT (f1sb90sh1sa9_r1) ranked 28 of 107 systems, 12 of 23 groups
        • Best:     MRP = 0.1485,  MAP = 0.1387, avgP@10 = 0.1280
        • WIDIT:  MRP = 0.0626,  MAP = 0.1216, avgP@10 = 0.0980
        • WIDIT:  MRP = 0.0626,  MAP = 0.0787, avgP@10 = 0.0980
        • Median: MRP = 0.0871,  MAP = 0.1057, avgP@10 = 0.0807
        • Worst2:  MRP = 0.0181,  MAP = 0.0250, avgP@10 = 0.0160
    • by Mean R-Precision (MRP)
      • Best WIDIT (f1sb80sh10sa10_r4) ranked 23 of 107 systems, 11 of 23 groups
             Best WIDIT (Baseline) ranked 71 of 107 systems, 18 of 23 groups
        • Best:     MRP = 0.1636,  MAP = 0.1543, avgP@10 = 0.1240
        • WIDIT:  MRP = 0.1139,  MAP = 0.1281, avgP@10 = 0.0980
        • Median: MRP = 0.0699,  MAP = 0.0896, avgP@10 = 0.0700
        • WIDIT:  MRP = 0.0736,  MAP = 0.1016, avgP@10 = 0.0760
        • Worst2:  MRP = 0.0181,  MAP = 0.0250, avgP@10 = 0.0160
22
WIDIT: System Rankings2
  • Home/Named Page Finding
    • by Mean Reciprocal Rank (MRR)
      • Best WIDIT (f3sb80sh10sa10_r4) ranked 45 of 75 systems, 13 of 19 groups
         Best WIDIT (f3sb80sh19sa1) ranked 48 of 75 systems, 13 of 19 groups
        • Best:     MRR = 0.727,  S@10 = 89.3
        • Median: MRR = 0.496,  S@10 = 64.3
        • WIDIT:   MRR = 0.411,  S@10 = 68.3
        • WIDIT:  MRR = 0.400,  S@10 = 66.3
        • Worst:   MRR = 0.065,  S@10 =   8.7
    • by avgS@10
      • Best WIDIT (f3sb80sh10sa10_r4) ranked 39 of 75 systems, 13 of 19 groups
             Best WIDIT (f3sb80sh19sa1) ranked 41 of 75 systems, 13 of 19 groups
        • Best:     MRR = 0.727,  S@10 = 89.3
        • Median: MRR = 0.465,  S@10 = 68.3
        • WIDIT:   MRR = 0.411,  S@10 = 68.3
        • WIDIT:   MRR = 0.400,  S@10 = 66.3
        • Worst:    MRR = 0.065,  S@10 =   8.7
23
WIDIT: Post-submission Analysis
  • Topic Distillation
    • Average 10.32 relevant documents per topic
      • 5 topics w/ 1 relevant documents
      • 3 topics w/ 2 relevant documents
      • 20 topics w/ 5 or fewer relevant documents
      • 33 topics w/ 10 or fewer relevant documents
        • WIDIT: 17 w/ P@10=0, 23 w/ RP=0,
                        4 w/ relevant documents at top 20 rank

  • Questions
    • Is MRP an appropriate evaluation measure?
      • compute performance statistics by topic groups listed above
      • statistical analysis
    • Why did WIDIT do so badly with topics w/ few relevant documents?
      • did not retrieve reldocs at top 20 ranks in 29 of 33 topics (#reldoc <= 10)
      • was it WIDIT-specific or system-wide?
    • Why doesn’t Link Analysis help?
24
WIDIT: Post-submission Analysis
  • Home/Named Page Finding
    • Best WIDIT (f3sb80sh10sa10_r4)
      • Best:              MRR = 0.727,  S@10 = 89.3
      • WIDIT:           MRR = 0.411,  S@10 = 68.3
      • WIDIT (HP):   MRR = 0.320,  S@10 = 62.0
        • 24 topics w/ first correct answer beyond rank 100
      • WIDIT (NP):   MRR = 0.503,  S@10 = 74.7
        • 6 topics w/ first correct answer beyond rank 100
    • Questions
      • How much effect did HP bias have on system performance?
        • compare WIDIT performance levels by query type with those of the top systems
        • tune WIDIT-hp with training data and compare results
      • Why did WIDIT rank correct document at very low ranks for some HP topics?
        • was it WIDIT-specific or system-wide?
        • failure analysis
          • term weighting?
          • bug
25
Conclusions
  • Fusion helps
    • What worked
      • Combining multiple sources of evidence
      • System tuning with training data
    • Questions
      • Why f1 for TD but f3 for NP?
      • Other fusion methods?
  • Home page identification is important for this year’s TD
    • What worked
      • HP identification based URL information
      • Post-retrieval reranking to boost potential home pages
    • Questions
      • Why didn’t it help for HP/NP?
      • Is dynamic tuning an optimal training approach?
  • Different tasks require different approaches
    • What worked
      • System tuning >> Retrieval techniques
    • Need both task typing and task-specific system tuning
      • Good task typing but no system tuning for HP capped HP/NP task performance
26
WIDIT: Future Research
  • Term Weighting
    • Okapi BM25
  • Query Expansion/Modification
    • Thesaurus (e.g. WordNet)
    • SVM (Flake et al., 2002)
  • Fusion
    • Rank-based fusion
  • Link Analysis
    • Modified BHITS (Li, Shang & Zhang, 2002)
      • addresses “small-in-large-out” problem of skewed link topology
        • add more weights to inlinks if “small-in-large-out” hub page exists
    • Implicit Link Analysis (Xue et al, 2003)
      • addresses “small Web search” problem caused by truncated link structure
        • construct “implicit” links with association weight based on usage data
        • modify the link matrix of PageRank formula
    • Personalized PageRank (Jeh & Widom, 2003; Haveliwala, 2002)
      • computes PageRank scores customized for a given topic/category
        • match topic to Yahoo! category
        • bias PageRank with matched category pages
27
TREC-2004
    • Time Table
      • December 03 - register for participation
      • Spring 04 – L641 (http://ella.slis.indiana.edu./~kiyang/teaching/s04/l641/index.shtml)
      • Summer - run TREC Experiments
      • August - submit Results
      • October - submit Notebook Paper
      • November - attend Conference
      • January 05 - submit Proceedings Paper
    • Targeted Tracks
      • WIDIT: Web, Robust, Q&A, Genomics
      • LAIR: Video, Genomics
28
References
  • Flake, G. W., Glover, E. J., Lawrence, S., Giles, C. L. (2002). Extracting query modifications from nonlinear SVMs. Proceedings of the 11th International WWW Conference, 317-324.


  • Haveliwala, T. (2002). Topic-sensitive PageRank. Proceedings of the 11th WWW Conference, 517-526.


  • Jeh, G. & Widom, J. (2003). Scaling personalized web search. Proceedings of the 12th International WWW Conference, 271 – 279.


  • Kehoe, C., Pitkow, J., Sutton, K., Aggarwal, G., & Rogers, J. D. (1999). Results of GVU's Tenth WWW User Survey. Retrieved November 11, 2003, from http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html


  • Kraaij, W., Westerveld, T., & Hiemstra, D. (2002).  The Importance of Prior Probabilities for Entry Page Search.  Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval, 27-34.


  • Li, L., Shang, Y., & Zhang, W. (2002). Improvement of HITS-based Algorithms on Web Documents. Proceedings of the 11th International WWW Conference, 527-535.


  • Tomlinson, S. (2003). Robust, Web and Genomic Retrieval with Hummingbird SearchServer at TREC 2003. The 12th Text REtrieval Conference (TREC 2003) Notebook, 372-385.


  • Xue, G., Zeng, H., Chen, Z., Ma, W., Zhang, H, & Lu, C. (2003). Implicit link analysis for small web search. Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval, 56-63.
29
The End
30
WIDIT: Indexing Module

    • Strip HTML tags
      • extract title, meta keywords & description, emphasized words
      • parse out hyperlinks (URL & anchor texts)
    • Create pseudo-documents
      • anchor texts
      • header texts
    • Create subcollection Sequential Indexes
      • stop & stem (simple plural conflation)
      • count term frequencies
    • Create subcollection Inverted Indexes
      • count subcollection document frequencies
      • compute length-normalized term weights
    • Compute whole collection term statistics
31
WIDIT: Retrieval Module

    • Rank documents in each subcollection of each index
      • Query-Document Similarity
      • Parallel Searching
        • Multiple Index
          • body text (title, body)
          • anchor text (title, inlink anchor text)
          • Header text (title, meta kw & desc, first heading, emphasized words)
        • Multiple Query format
          • simple query
          • phrase, acronyms, abbreviations
        • Multiple Subcollections
          • for search speed and scalability
          • use whole collection term statistics
        • 2 qry_type * 3 document_index * 46 subcollections = max. 276 per query
    • Merge search results of subcollections
      • merge & sort by document score
32
WIDIT: Fusion Module

    • Fusion Formulas
      • Weighted Sum (WS)
        • Normalized_Score * sys_wt
          • sys_wt = the relative contribution of systems
      • Overlap Weighted Sum
        • WS * Overlap
          • Overlap = # of systems retrieving a given document
      • Weighted Overlap Weighted Sum
        • WS * weighted_Overlap
          • weighted_Overlap = Overlap * sys_wt
    • Select best system weights (sys_wt)
      • based on training data
        • e.g.  0.9 for body text, 0.09 for header text, 0.01 for anchor text
33
WIDIT: Reranking Module
    • Reranking Methods
      • Site Compression
        • Cluster results by site
          • URL pattern
        • Rank sites
          • sort site by highest document score of each site
        •  Float of top n documents from top m site to the top
      • Content + Link Evidence Heuristic
        • inter-/intra indegree & outdegree
          • e.g. boost TD score if large outdegree
          • e.g. boost HP score if large indegree
        • phrase/proximity match counts in body/header/anchor texts
          • e.g. boost score if phrase match in title or anchor text
        • query term match in URL
        • heuristic variations for HP and NP
          • query classification by term occurrence
    • System Training by Dynamic Tuning
34
Length-Normalized Term Weights

      • SMART lnu weight for document terms
      • SMART ltc weight for query terms




      • where:
      • fik     = number of times term k appears in document i
      • idfk  = inverse document frequency of term k
      • t     = number of terms in document/query
35
Document Ranking
    • Document Score
      • inner product of document and query vectors




      • where:
        • qk = weight of term k in the query
        • dik = weight of term k in document i
        • t = number of terms common to query & document
36
Fusion Formulas
  • Weighted Sum

  • FSws = S(wi*NSi)


  • Overlap Weighted Sum
  • FSows = S(wi*NSi) * olp
  • Weighted Overlap Weighted Sum
  • FSwows = S(wi*NSi) * (wi*olp)
  • where:
  •   wi  = weight of system i
      NSi = normalized score of a document by system i
  •            = (Si – Smin) / (Smax – Smin)
  •   olp = number of systems that retrieved a given document