Exploiting Multiple Features with MEMMs for Focused Web Crawling

Type: Article
Conference: Proceedings of the 13th International Conference on Applications of Natural Language to Information Systems (NLDB 2008), June 24-27, 2008, London, United Kingdom
Subject: focused crawling; Web search; feature selection; MEMMs
Abstract: Focused web crawling traverses the Web to collect documents on a specific topic. This is not an easy task, since a focused crawler must identify the next most promising link to follow based on the topic and on the content and links of previously crawled pages. In this paper, we present a framework based on Maximum Entropy Markov Models (MEMMs) for an enhanced focused web crawler that takes advantage of richer representations of multiple features extracted from Web pages, such as anchor text and the keywords embedded in the link URL, to represent useful context. The key idea of our approach is to treat focused web crawling as a sequential task and to use a combination of content analysis and link structure to capture sequential patterns leading to targets. Experimental results show that the MEMM-based focused crawler is, in general, very competitive with Best-First crawling on Web data in terms of two metrics: Precision and Maximum Average Similarity.
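The idea of scoring links from several features can be illustrated with a minimal sketch. It is not the paper's implementation: the state set (an estimated link distance to a target page), the feature names, the hand-set weights, and the helper functions `extract_features`, `state_distribution`, and `rank_links` are all assumptions made for illustration. It shows only the MEMM-style local step, a log-linear (maximum entropy) distribution over the next state given the previous state and features drawn from anchor text and URL keywords, which is then used to order a crawl frontier.

```python
import heapq
import math
import re

# Hypothetical states: estimated link distance to an on-topic target (0 = target).
STATES = [0, 1, 2]

# Hypothetical hand-set weights keyed by (feature, state).
# A trained model would learn these from labeled crawl paths.
WEIGHTS = {
    ("anchor_has_topic", 0): 2.0,
    ("url_has_topic", 0): 1.5,
    ("prev_state=1", 0): 1.0,
    ("prev_state=2", 1): 1.0,
    ("bias", 2): 0.5,
}


def extract_features(anchor_text, url, prev_state, topic_terms):
    """Build binary features from anchor text, URL tokens, and the previous state."""
    anchor_tokens = set(re.findall(r"[a-z0-9]+", anchor_text.lower()))
    url_tokens = set(re.findall(r"[a-z0-9]+", url.lower()))
    feats = ["bias", f"prev_state={prev_state}"]
    if anchor_tokens & topic_terms:
        feats.append("anchor_has_topic")
    if url_tokens & topic_terms:
        feats.append("url_has_topic")
    return feats


def state_distribution(feats):
    """Log-linear distribution over STATES: softmax of summed feature weights."""
    scores = [sum(WEIGHTS.get((f, s), 0.0) for f in feats) for s in STATES]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_links(links, prev_state, topic_terms):
    """Order (anchor_text, url) pairs by the probability of reaching state 0."""
    heap = []
    for i, (anchor, url) in enumerate(links):
        feats = extract_features(anchor, url, prev_state, topic_terms)
        probs = state_distribution(feats)
        # heapq is a min-heap, so push the negated probability.
        heapq.heappush(heap, (-probs[0], i, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


links = [
    ("contact us", "http://example.org/contact"),
    ("machine learning course", "http://example.org/ml/learning"),
]
ranked = rank_links(links, prev_state=1, topic_terms={"learning"})
```

In this toy run, the link whose anchor text and URL both contain a topic term is ranked ahead of the unrelated link. A full MEMM would additionally decode over sequences of pages rather than scoring each link in isolation.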
Language: English
Affiliation: NRC Institute for Information Technology; National Research Council Canada
Peer reviewed: No
NRC number: 50373
NPARC number: 5765089
Record identifier: 32528c1e-e4f6-40ce-ba06-414d5bd7f94c
Record created: 2009-03-29
Record modified: 2016-05-09