Structure-Based Crawling in the Hidden Web

Vidal, Marcio; Silva, Altigran S. da; De Moura, Edleno; Cavalcanti, João

dc.creator	Vidal,Marcio
dc.creator	Silva,Altigran S. da
dc.creator	De Moura,Edleno
dc.creator	Cavalcanti,João
dc.date	2008
dc.date.accessioned	2024-02-06T12:56:39Z
dc.date.available	2024-02-06T12:56:39Z
dc.description	The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion for deciding which pages to include in a collection is based on their contents; in others, a structure-based criterion is more appropriate. In this article, we present a proposal for building structure-based crawlers that requires only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Contrary to other proposals, ours does not require a sample database to fill in the forms, and does not require heavy user interaction. Our experiments show 100% precision on seventeen real-world web sites, with both static and dynamic content, and 95% recall on the eleven static web sites examined.
dc.format	text/html
dc.identifier	https://doi.org/10.3217/jucs-014-11-1857
dc.identifier	https://lib.jucs.org/article/29098/
dc.identifier.uri	https://openrepository.mephi.ru/handle/123456789/9768
dc.language	en
dc.publisher	Journal of Universal Computer Science
dc.relation	info:eu-repo/semantics/altIdentifier/eissn/0948-6968
dc.relation	info:eu-repo/semantics/altIdentifier/pissn/0948-695X
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	J.UCS License
dc.source	JUCS - Journal of Universal Computer Science 14(11): 1857-1876
dc.subject	Web crawling
dc.subject	hidden web
dc.subject	tree-edit distance
dc.subject	web wrappers
dc.title	Structure-Based Crawling in the Hidden Web
dc.type	Research Article

Collections

Publications in NRNU MEPhI journals
