Focused Crawler - Strategies

Ideally, a focused crawler downloads only web pages that are relevant to a particular topic and avoids downloading all others.

Therefore, a focused crawler may predict the probability that a linked page is relevant before actually downloading it. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. In a review of topical crawling algorithms, Menczer et al. show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. Diligenti et al. propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet.
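The anchor-text predictor described above can be sketched as a priority-ordered crawl frontier. This is a minimal illustration, not Pinkerton's actual method: the scoring rule (fraction of topic terms found in the anchor text), the names anchor_score and Frontier, and the example URLs are all assumptions made for this sketch.

```python
import heapq
import re


def anchor_score(anchor_text, topic_terms):
    """Hypothetical predictor: fraction of topic terms present in the anchor text."""
    if not topic_terms:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", anchor_text.lower()))
    return len(words & topic_terms) / len(topic_terms)


class Frontier:
    """Crawl frontier that always yields the highest-scoring unvisited URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Usage: score links by their anchor text before downloading anything.
topic = {"solar", "energy"}
frontier = Frontier()
links = [
    ("http://example.com/a", "Solar energy basics"),
    ("http://example.com/b", "Contact us"),
    ("http://example.com/c", "Energy prices"),
]
for url, text in links:
    frontier.push(url, anchor_score(text, topic))

crawl_order = [frontier.pop()[0] for _ in range(len(frontier))]
```

With these inputs the link whose anchor text matches both topic terms is fetched first and the unrelated link last. A real crawler would likely replace the term-overlap score with a trained classifier or a similarity measure.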

In another approach, the relevance of a page is determined after downloading its content. Relevant pages are sent to content indexing and their contained URLs are added to the crawl frontier; pages that fall below a relevance threshold are discarded.
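A minimal sketch of this download-then-filter approach follows. It assumes relevance is measured as cosine similarity between term-frequency vectors of the page and a topic description; that measure, the function names, and the threshold value are illustrative choices, not something the approach prescribes.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def process_page(page_text, out_links, topic_vector, threshold, index, frontier):
    """Index a downloaded page and enqueue its links only if it is relevant.

    Returns True when the page was kept, False when it was discarded.
    """
    score = cosine(term_vector(page_text), topic_vector)
    if score < threshold:
        return False  # below the relevance threshold: discard page and links
    index.append(page_text)     # hand the page to content indexing
    frontier.extend(out_links)  # its URLs join the crawl frontier
    return True


# Usage with a hypothetical topic description and threshold.
topic_vector = term_vector("focused web crawler")
index, frontier = [], []
kept = process_page(
    "A focused crawler downloads web pages",
    ["http://example.com/next"],
    topic_vector, 0.3, index, frontier,
)
dropped = process_page(
    "Recipe for apple pie",
    ["http://example.com/pie"],
    topic_vector, 0.3, index, frontier,
)
```

Because the links of a discarded page never reach the frontier, the threshold controls both index quality and how far the crawl wanders off topic.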

The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine to provide starting points.

Seed selection can be important for focused crawlers and can significantly influence crawling efficiency. A whitelist strategy is to start the focused crawl from a list of high-quality seed URLs and limit the crawling scope to the domains of these URLs. These high-quality seeds should be selected from a list of candidate URLs accumulated over a sufficiently long period of general web crawling. The whitelist should be updated periodically after it is created.
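The scope restriction in the whitelist strategy can be sketched as a hostname check applied to every discovered link. This sketch assumes an exact hostname match (subdomains of a seed are treated as out of scope); the helper names and the seed URLs are hypothetical.

```python
from urllib.parse import urlparse


def build_whitelist(seed_urls):
    """Collect the hostnames of the seed URLs; these define the crawl scope."""
    hosts = set()
    for url in seed_urls:
        host = urlparse(url).hostname  # lowercased by urlparse, None if absent
        if host:
            hosts.add(host)
    return hosts


def in_scope(url, whitelist):
    """True if the URL's hostname exactly matches a whitelisted seed host."""
    host = urlparse(url).hostname
    return host is not None and host in whitelist


# Usage: only links on seed domains are allowed into the frontier.
seeds = ["https://example.org/start", "http://docs.example.net/"]
whitelist = build_whitelist(seeds)
discovered = [
    "https://example.org/page",
    "https://other.com/",
    "/relative/path",
]
allowed = [u for u in discovered if in_scope(u, whitelist)]
```

Relative links have no hostname and are rejected here, so a crawler using this check would need to resolve them against the page URL first, for example with urllib.parse.urljoin.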
