Web Archiving - Difficulties and Limitations - Crawlers

Web archives that rely on web crawling as their primary means of collecting the Web inherit the difficulties of web crawling:

  • The robots exclusion protocol may request that crawlers not access portions of a website (a robots.txt check is sketched after this list). Some web archivists may ignore the request and crawl those portions anyway.
  • Large portions of a web site may be hidden in the deep Web. For example, a results page that sits behind a web form lies in the deep Web because a crawler cannot reach it by following links; it would have to fill in and submit the form.
  • Crawler traps (e.g., calendars) may cause a crawler to download an effectively infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl; one common limiting strategy is sketched after this list.
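
For illustration, here is a minimal Python sketch of the robots.txt check a compliant crawler performs before fetching a URL; the site and the user-agent name "ExampleArchiveBot" are hypothetical:

    # Minimal robots.txt check using only the standard library.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.org/robots.txt")  # hypothetical site
    rp.read()  # fetch and parse the robots.txt file

    # A compliant crawler runs this check for every candidate URL;
    # an archivist who chooses to ignore robots.txt simply skips it.
    url = "https://example.org/private/report.html"
    if rp.can_fetch("ExampleArchiveBot", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)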

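A common defence against crawler traps is a crawl budget. The following Python sketch shows one plausible shape for it, a cap on pages fetched per host plus a cap on query-string variants of a single path; the thresholds are illustrative, not any particular crawler's defaults:

    # Sketch of per-host and per-path crawl budgets for trap avoidance.
    from collections import defaultdict
    from urllib.parse import urlparse

    MAX_PAGES_PER_HOST = 10_000   # illustrative host-wide budget
    MAX_DYNAMIC_PER_PATH = 100    # illustrative cap on ?query variants

    pages_per_host = defaultdict(int)
    dynamic_per_path = defaultdict(int)

    def should_crawl(url):
        parts = urlparse(url)
        if pages_per_host[parts.netloc] >= MAX_PAGES_PER_HOST:
            return False  # host budget exhausted
        if parts.query:  # dynamic page, e.g. a calendar view
            key = (parts.netloc, parts.path)
            if dynamic_per_path[key] >= MAX_DYNAMIC_PER_PATH:
                return False  # endless variants: likely a trap
            dynamic_per_path[key] += 1
        pages_per_host[parts.netloc] += 1
        return True
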
However, a native-format web archive, i.e. one that is fully browsable, with working links, media, etc., is only really possible with crawler technology.

The Web is so large that crawling a significant portion of it requires substantial technical resources, and it changes so quickly that portions of a website may change before a crawler has even finished crawling it.
