Crawling The Deep Web
A vast number of Web pages lie in the deep or invisible Web. These pages are typically accessible only by submitting queries to a database, and regular crawlers cannot find them if no links point to them. Google's Sitemaps protocol and mod_oai are intended to allow discovery of these deep-Web resources.
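As a rough illustration of how a Sitemap exposes pages that no link points to, the sketch below reads the <loc> entries from a Sitemap document. The function name urls_from_sitemap and the example.org URLs are assumptions made for this example; only the Sitemap XML namespace comes from the published protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the URL listed in each <url><loc> entry of a Sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical Sitemap listing database-backed pages that a link-following
# crawler would not reach on its own.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/catalog?item=1</loc></url>
  <url><loc>https://example.org/catalog?item=2</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for url in urls_from_sitemap(EXAMPLE_SITEMAP):
        print(url)
```

A crawler could add the returned URLs to its frontier in the same way as links discovered on ordinary pages.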
Deep Web crawling also multiplies the number of Web links to be crawled. Some crawlers take only some of the URLs in <a href="URL"> form. In some cases, such as the Googlebot, Web crawling is done on all text contained inside the hypertext content, tags, or text.
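The difference between the two extraction strategies can be sketched as follows. This is an illustration only and does not describe how Googlebot or any particular crawler is implemented; AnchorCollector, anchor_urls, all_text_urls, and the sample markup are names and data invented for the example.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects only the href values found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def anchor_urls(html: str) -> list[str]:
    """Narrow strategy: URLs that appear as <a href="..."> links."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Simplified pattern: an absolute http(s) URL up to whitespace, a quote,
# or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def all_text_urls(html: str) -> list[str]:
    """Broad strategy: any absolute URL in content, tags, or attributes."""
    return URL_PATTERN.findall(html)


SAMPLE = (
    '<p>See <a href="https://example.org/a">A</a> or visit '
    'https://example.org/b directly.</p>'
    '<img src="https://example.org/c.png">'
)

if __name__ == "__main__":
    print(anchor_urls(SAMPLE))
    print(all_text_urls(SAMPLE))
```

On the sample markup the narrow strategy yields one URL while the broad strategy yields three, which shows how scanning all text enlarges the set of links to be crawled.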
Strategic approaches may be taken to target deep-Web content. With a technique called screen scraping, specialized software may be customized to automatically and repeatedly query a given Web form with the intention of aggregating the resulting data. Such software can be used to span multiple Web forms across multiple Web sites. Data extracted from the results of one Web form submission can be applied as input to another Web form, thus establishing continuity across the deep Web in a way not possible with traditional Web crawlers.
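The chaining of one form's results into another form can be sketched as below. Everything specific here is an assumption for illustration: the form addresses, the field names q and id, and the data-id attribute used to pull record identifiers out of the result page. The page fetcher is passed in as a parameter, so the sketch runs against a stand-in instead of a live site.

```python
import re
from typing import Callable
from urllib.parse import urlencode


def build_query_url(form_action: str, fields: dict[str, str]) -> str:
    """Encode form fields into the URL a GET form submission would request."""
    return f"{form_action}?{urlencode(fields)}"


def scrape_chain(
    fetch: Callable[[str], str],
    search_form: str,
    detail_form: str,
    terms: list[str],
) -> dict[str, str]:
    """Query one form per term, then feed each extracted id to a second form."""
    results = {}
    for term in terms:
        listing = fetch(build_query_url(search_form, {"q": term}))
        # Hypothetical result markup: each hit carries a numeric data-id.
        for record_id in re.findall(r'data-id="(\d+)"', listing):
            detail_url = build_query_url(detail_form, {"id": record_id})
            results[record_id] = fetch(detail_url)
    return results


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP client so the example needs no network access."""
    if url.startswith("https://example.org/search?"):
        return '<li data-id="7">x</li><li data-id="9">y</li>'
    return f"detail page for {url}"


if __name__ == "__main__":
    pages = scrape_chain(
        fake_fetch,
        "https://example.org/search",
        "https://example.net/detail",
        ["crawler"],
    )
    for record_id, page in pages.items():
        print(record_id, page)
```

In real use, fake_fetch would be replaced by a function that performs the HTTP request, and the extraction pattern would be adapted to the markup of the target site.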