Deep Web - Classifying Resources

Automatically determining whether a Web resource belongs to the surface Web or the deep Web is difficult. A resource that is indexed by a search engine is not necessarily a member of the surface Web, because it could have been discovered by another method (e.g., the Sitemap Protocol, mod_oai, OAIster) rather than by traditional crawling. If a search engine provides a backlink for a resource, one may assume that the resource is in the surface Web. Unfortunately, search engines do not always provide all backlinks to a resource. Even if a backlink does exist, there is no way to determine whether the resource providing the link is itself in the surface Web without crawling the entire Web. Furthermore, a resource may reside in the surface Web but not yet have been found by a search engine. Therefore, given an arbitrary resource, we cannot know for certain whether it resides in the surface Web or the deep Web without a complete crawl of the Web.

Most work on classifying search results has focused on categorizing the surface Web by topic. For deep Web resources, Ipeirotis et al. presented an algorithm that classifies a deep Web site into the category that generates the largest number of hits for a set of carefully selected, topically focused queries. Deep Web directories under development include OAIster at the University of Michigan, Intute at the University of Manchester, Infomine at the University of California at Riverside, and DirectSearch (by Gary Price). Searching the deep Web poses a classification challenge that requires two levels of categorization. The first level is to categorize sites into vertical topics (e.g., health, travel, automobiles) and sub-topics according to the nature of the content underlying their databases.
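
The hit-count idea described above can be illustrated with a minimal sketch. It is based only on the one-sentence description given here and is not the published algorithm of Ipeirotis et al.; the probe queries, the category names, and the hit_count callback (which stands in for submitting a query to a site's search form and reading back the number of matches) are all hypothetical.

```python
from typing import Callable, Dict, List


def classify_site(
    probe_queries: Dict[str, List[str]],
    hit_count: Callable[[str], int],
) -> str:
    """Return the category whose probe queries produce the most hits.

    probe_queries maps a category name to its topically focused queries.
    hit_count is an assumed callback that reports how many results the
    site under test returns for a single query.
    """
    totals = {
        category: sum(hit_count(query) for query in queries)
        for category, queries in probe_queries.items()
    }
    # Pick the category with the largest total number of hits.
    return max(totals, key=totals.get)


# Hypothetical probe queries, using the vertical topics named in the text.
PROBES = {
    "health": ["diabetes", "vaccine"],
    "travel": ["hotel", "flight"],
    "automobiles": ["sedan", "engine"],
}

# Made-up hit counts standing in for a real deep Web site's responses.
FAKE_HITS = {
    "diabetes": 120, "vaccine": 340,
    "hotel": 3, "flight": 0,
    "sedan": 1, "engine": 12,
}


def fake_hit_count(query: str) -> int:
    """Look up the made-up hit count for a query (0 if unknown)."""
    return FAKE_HITS.get(query, 0)


if __name__ == "__main__":
    print(classify_site(PROBES, fake_hit_count))
```

With the made-up counts above, the health probes dominate, so the site would be filed under that category. A real system would presumably need to normalize for the number of probes per category and handle ties, which this sketch does not attempt.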

The second, more difficult challenge is to categorize and map the information extracted from multiple deep Web sources according to end-user needs. Deep Web search reports cannot simply display URLs the way traditional search reports do. End users expect their search tools not only to find what they are looking for quickly, but also to be intuitive and user-friendly. To be meaningful, search reports have to convey something of the nature of the content that underlies each source; otherwise the end user will be lost in a sea of URLs that do not indicate what content lies beneath them. The format in which search results should be presented varies widely with the topic of the search and the type of content being exposed. The challenge is to find and map similar data elements from multiple disparate sources so that search results can be presented in a unified format on the search report, irrespective of their source.
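
As a rough illustration of mapping similar data elements from disparate sources into one format, the sketch below renames source-specific fields to a shared set of field names. The source names, field names, and the hand-written mapping table are all hypothetical; discovering such mappings is the hard part the text describes, and this sketch simply assumes they are already known.

```python
from typing import Any, Dict, List

# Assumed, hand-written mappings from each source's field names
# to a unified schema (name, price, location).
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "source_a": {"title": "name", "cost": "price", "loc": "location"},
    "source_b": {"headline": "name", "price_usd": "price", "city": "location"},
}


def to_unified(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename one record's fields to the unified schema.

    Fields without a mapping are dropped; the originating source is kept
    so the search report can still say where a result came from.
    """
    mapping = FIELD_MAPS[source]
    unified = {
        target: record[field]
        for field, target in mapping.items()
        if field in record
    }
    unified["source"] = source
    return unified


def merge_results(
    results: Dict[str, List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Flatten per-source result lists into one list of unified records."""
    return [
        to_unified(source, record)
        for source, records in results.items()
        for record in records
    ]


if __name__ == "__main__":
    raw = {
        "source_a": [{"title": "Used sedan", "cost": 4500, "loc": "Riverside"}],
        "source_b": [{"headline": "Compact car", "price_usd": 3900, "city": "Ann Arbor"}],
    }
    for row in merge_results(raw):
        print(row)
```

Once every record shares the same field names, a search report can present results from both sources in a single layout instead of a list of opaque URLs.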
