People Search Project - 2007
About This Project
The People Search Project 2007 is a continuation of an earlier project created by Ketan Mane in
2004. The original project goal was to acquire the possible addresses of the Ph.D. students
who graduated from the School of Library and Information Science (SLIS), Indiana University, Bloomington.
This new version of the project takes up the same goal with slightly changed project
specifications. First, we began with the assumption that completely automated searching does not produce
the best results (at least with current methodologies); therefore, a mechanism was needed that lets a
person interact with the search results and evaluate them further.
Second, additional, more advanced search algorithms needed to be developed, in the
hope that a greater number of results with a higher degree of relevance would be obtained.
Methodology
The methodology for this project was multi-phased: each stage of the project had, for the
most part, its own specific methodology. Each component is discussed below.
- Retrieve Data from the Web
- Relying chiefly on the Perl libraries WWW::Mechanize and LWP, our search engine
(View Source Code) crawls various databases for relevant documents.
- Pages are then scraped: relevant data is retrieved from the appropriate sections, and the resulting material
is parsed into a usable form.
- Data is then stored in temporary result files for later processing.
- At present, an exhaustive search of the current source databases takes approximately 5 hours. Future versions of this program
might reduce this time. Because the most prominent factor in the time required to crawl the
databases is server lag, some form of multi-threading, whereby multiple databases are searched
simultaneously, could significantly reduce search times.
- Process the raw data into our database
- After raw data files have been written, our data processing engine, dataplex3.cgi
(View Source Code), formats the data and enters it into our permanent database.
- At this step, the new data is also compared with the existing data before it is
entered into the database. This allows new crawls to acquire new records
without duplicating existing records.
- Using a TF-IDF algorithm, the records in our database are searched for the most relevant documents
- Existing records are then scored for relevance using a TF-IDF algorithm (View Source Code).
- At present, each document is scored according to a static set of search terms. However,
the ability to dynamically generate TF-IDF scores from user input could be easily implemented.
- Phone numbers are extracted from the most relevant web pages retrieved for each individual. Data retrieved from
White Pages for the same individual is then searched for matching phone numbers; a match indicates
that the web page and address probably refer to the same individual (View Source Code).
- An interface was created to allow the viewing, manipulating, and importing of records into our database.
- Finally, the Main Interface for this project allows users to view and interact with the processed data records.
- Site interaction and server response times are significantly improved through the use
of AJAX.
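The crawling step above notes that server lag dominates crawl time and that searching several databases at once could help. The following is a minimal sketch of that idea, written in Python rather than the project's Perl. The names `crawl_all` and `slow_fetch` and the source labels are hypothetical, and `slow_fetch` only simulates lag with a short sleep instead of making a real request.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def crawl_all(sources, fetch, max_workers=4):
    """Call fetch(source) for every source using a thread pool.

    Threads suit this job because each worker spends most of its
    time waiting on a remote server, not using the CPU.
    Returns a dict mapping each source to its fetch result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `sources`
        results = list(pool.map(fetch, sources))
    return dict(zip(sources, results))


def slow_fetch(source):
    """Hypothetical stand-in for one database query (simulated lag)."""
    time.sleep(0.05)
    return f"results from {source}"
```

Calling `crawl_all(["db_a", "db_b", "db_c"], slow_fetch, max_workers=3)` waits on all three simulated sources at the same time instead of one after another. In a real crawler the worker count would also need to respect each server's usage limits.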
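The TF-IDF scoring step above can be illustrated with a short sketch. This is not the project's scoring code (which is in Perl and not reproduced here); it assumes one common TF-IDF variant, where term frequency is the term count divided by document length and inverse document frequency is log(N / df). The function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(documents, search_terms):
    """Return one relevance score per document.

    Each score is the sum over search terms of tf * idf, where
    tf = (term count in doc) / (doc length) and idf = log(N / df).
    Search terms are assumed to be single words.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    terms = [term.lower() for term in search_terms]
    # Document frequency: number of documents containing each term
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}

    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for t in terms:
            if df[t] == 0 or not toks:
                continue  # term absent everywhere, or empty document
            tf = counts[t] / len(toks)
            idf = math.log(n_docs / df[t])
            score += tf * idf
        scores.append(score)
    return scores
```

Because `search_terms` is an ordinary parameter, the same function could score documents against a static term list or against terms supplied by a user at query time.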
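The phone number comparison described above might look roughly like the following. This is a sketch under stated assumptions: it handles only 10-digit North American numbers in a few common formats, the regular expression and function names are hypothetical, and the sample numbers are fictional. The project's own extraction rules are not shown in this document.

```python
import re

# Matches e.g. (812) 555-0123, 812-555-0123, 812.555.0123, 8125550123
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")


def extract_phones(text):
    """Return the set of phone numbers in text, normalized to 10 digits."""
    return {"".join(match.groups()) for match in PHONE_RE.finditer(text)}


def shared_phones(web_page_text, white_pages_text):
    """Return the phone numbers that appear in both texts.

    A non-empty result suggests that the web page and the White Pages
    entry probably refer to the same individual.
    """
    return extract_phones(web_page_text) & extract_phones(white_pages_text)
```

Normalizing every number to a bare digit string before comparing means that differences in punctuation between the two sources do not hide a match.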
Input
Output
Databases Used
Suggestions for Further Research
There are a number of directions in which this project might be taken in the future.
Some key suggestions might be:
- A greater number of databases should be searched and more exhaustive searches
should be conducted. The project, as it has been developed, is aimed at acquiring and processing
large amounts of data and attempting to find correlations between the data sets. Thus,
expanding the number of resources crawled could provide more interesting results.
- It would be interesting to allow for dynamic TF-IDF ratings of the documents in our database.
The ability to expand or refine the criteria used in rating the documents found
might provide more insight.
- Other modes of comparison might be added as well. For example, our project sought to correlate
documents based upon phone numbers. Expanding this to correlate addresses, or perhaps even
titles of published documents, might also prove revealing.
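As an illustration of the address correlation suggested above, the sketch below normalizes street addresses before comparing them. It was not part of the project; the abbreviation table is a small hypothetical sample, and real address matching would need a much fuller one (for instance, "St" can also mean "Saint").

```python
import re

# Hypothetical sample of abbreviations; a real table would be far larger
ABBREVIATIONS = {
    "st": "street", "ave": "avenue", "rd": "road", "dr": "drive",
    "n": "north", "s": "south", "e": "east", "w": "west",
    "apt": "apartment",
}


def normalize_address(address):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def same_address(a, b):
    """Return True if both addresses normalize to the same string."""
    return normalize_address(a) == normalize_address(b)
```

With this table, `same_address("123 N. Main St., Apt 4", "123 north main street apartment 4")` evaluates to True, while addresses that differ in house number do not match.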
Authors:
Daniel B. Bicknell, MIS Student
Jenny Jackson, MIS Student
Last Updated: May 3rd, 2007