Structure-Based Crawling in the Hidden Web

dc.creator: Vidal, Marcio
dc.creator: Silva, Altigran S. da
dc.creator: De Moura, Edleno
dc.creator: Cavalcanti, João
dc.date: 2008
dc.date.accessioned: 2024-02-06T12:56:39Z
dc.date.available: 2024-02-06T12:56:39Z
dc.description: The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion for deciding which pages to include in a collection is based on their contents; in others, a structure-based criterion is the wiser choice. In this article, we present a proposal for building structure-based crawlers that requires only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Unlike other proposals, ours requires neither a sample database to fill in the forms nor heavy user interaction. Our experiments show 100% precision on seventeen real-world web sites, with both static and dynamic content, and 95% recall on the eleven static web sites examined.
dc.format: text/html
dc.identifier: https://doi.org/10.3217/jucs-014-11-1857
dc.identifier: https://lib.jucs.org/article/29098/
dc.identifier.uri: https://openrepository.mephi.ru/handle/123456789/9768
dc.language: en
dc.publisher: Journal of Universal Computer Science
dc.relation: info:eu-repo/semantics/altIdentifier/eissn/0948-6968
dc.relation: info:eu-repo/semantics/altIdentifier/pissn/0948-695X
dc.rights: info:eu-repo/semantics/openAccess
dc.rights: J.UCS License
dc.source: JUCS - Journal of Universal Computer Science 14(11): 1857-1876
dc.subject: Web crawling
dc.subject: hidden web
dc.subject: tree-edit distance
dc.subject: web wrappers
dc.title: Structure-Based Crawling in the Hidden Web
dc.type: Research Article