CN101520798A - Webpage classification technology based on vertical search and focused crawler - Google Patents

Webpage classification technology based on vertical search and focused crawler

Info

Publication number
CN101520798A
Authority
CN
China
Prior art keywords
url
webpage
website
classification
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910025724A
Other languages
Chinese (zh)
Inventor
王攀
张顺颐
宫婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
RUNTREND TECHNOLOGY Inc
Original Assignee
RUNTREND TECHNOLOGY Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by RUNTREND TECHNOLOGY Inc filed Critical RUNTREND TECHNOLOGY Inc
Priority to CN200910025724A priority Critical patent/CN101520798A/en
Publication of CN101520798A publication Critical patent/CN101520798A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for identifying webpage categories based on vertical search and a focused crawler. The method has two parts: acquiring webpage source code and analyzing webpage content. Content analysis is the key part and itself has two main components: extracting the structured information of a webpage and the crawling strategy of the focused crawler. First, a URL is selected from a list of navigation-site URLs and its source file is fetched; the content analysis method then identifies and collects all categorized URLs on that navigation site. Content analysis first extracts the structured information of the webpage, then fetches URLs using a directed breadth-first search strategy based on webpage content features, and finally stores each fetched URL together with its website category in a list named Category.

Description

Webpage classification technology based on vertical search and focused crawler
Technical field
The present invention concerns a method for identifying webpage categories in a vertical search engine that works over a fixed list of navigation-type websites. It studies how to obtain the category information of webpages effectively using vertical search and focused crawler technology, and designs a recognition model and algorithm for webpage classification. It touches on several fields, including vertical search, focused crawlers, web data extraction, machine learning, data mining, and natural language processing.
Background technology
As the volume of information keeps growing, people depend more and more on search engines. General-purpose search engines such as Baidu and Google are very convenient, but as user needs diversify and expectations of result quality rise, they cannot meet user requirements in some specialized fields. Vertical search arose to fill this gap. It is a precise search technology serving a particular professional domain: it is more specialized, its results are more targeted, and by drawing on the domain knowledge of a specific industry topic it can support queries based on semantic information, thereby meeting users' special search needs.
As vertical search engines grow in popularity, the focused crawler, a key technology of vertical search engines, becomes increasingly important. A focused crawler is a program that downloads webpages automatically. Guided by a predefined crawl target, it selectively visits webpages and related links on the World Wide Web to obtain the required information.
Identifying webpage categories with vertical search and focused crawler technology is difficult for the following reasons:
First, it is hard for a focused crawler to decide which pages in the queue of URLs waiting to be crawled are most likely to contain topic-relevant information.
Second, many current open-source crawler systems cannot extract webpage structural information in a directed way from the pages they fetch.
Third, the content and structure of the same webpage change frequently, and the revisit strategy of a focused crawler has difficulty adapting to such changes.
Traditional open-source focused crawler technology therefore has difficulty identifying webpages of different categories accurately, and a different approach is needed.
By studying vertical search and focused crawler technology, the following problems can be solved:
(1) Using vertical search and a focused crawler to obtain, from navigation websites, the URLs that correspond to each category.
(2) Returning precise, targeted search results for a user's specialized search on a specific industry topic.
(3) Using vertical search and a focused crawler to obtain the webpage category of an unknown URL from the categorized sites.
Summary of the invention
Technical problem: The object of the invention is to establish, for navigation-type websites, a webpage category recognition method based on vertical search and focused crawler technology, and to design its recognition model and algorithm. By recognizing navigation-type websites, the method obtains the URLs of the different categories within them, which makes precise site search easier for users, and it can also report the webpage category to which an unknown URL belongs.
Technical solution: The invention proposes a technical framework for effectively identifying the URLs of each category in navigation-type websites, and designs the recognition algorithm in detail. The system is divided into three layers, from bottom to top: the data collection layer, the content parsing layer, and the application presentation layer.
The key method of the invention is the webpage category recognition method based on vertical search and focused crawler technology. It comprises two parts: webpage source code acquisition and webpage content analysis. Content analysis is the core and has two main components: extracting the structured information of a webpage and the crawling strategy of the focused crawler. Study of the source code of navigation websites shows that such a site consists essentially of two kinds of pages: a main directory page and a sub-directory page for each category. The main directory page contains a large number of links pointing to the category sub-pages, and each category sub-directory page contains links to the websites belonging to that category. The category sub-directory pages of one navigation website are also strongly similar to one another; that is, they share a similar structure, so pattern learning can summarize the structured information of these pages in one (or a few) regular expressions. Once the regular expressions that represent this page structure are found, they can guide the focused crawler to fetch, as far as possible, only pages relevant to the categories. Taking www.hao123.com as an example, to find all URLs in the "amusement and recreation" category one can write the regular expression href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+)), which matches links of the form href="..." in a string, and thereby obtain all URLs of that category.
To adapt to the irregular updates of navigation websites and to extract the structural information of directory pages more reliably, a timed learner for the URL regular expressions is added, so that the method can follow the continual changes of navigation sites. The invention also proposes, with reference to three existing URL search strategies, a directed breadth-first search strategy based on webpage content features. The basic idea is that, during crawling, the structured information of a webpage is first extracted in a directed way according to the page's content features, and webpages are then fetched from that structured information using a breadth-first strategy. This reduces the number of pages collected, saves network bandwidth, and improves the efficiency of information search.
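As an illustration only, the href-matching expression described above can be tried out in a few lines of Python. This is a sketch under stated assumptions, not the patented implementation: the quoted pattern uses .NET syntax, in which both alternatives may share the group name 1, whereas Python does not allow that, so the sketch uses two unnamed groups instead. The function name extract_links, the case-insensitive flag, and the sample HTML are hypothetical.

```python
import re

# Python adaptation of the .NET-style pattern from the text. Two unnamed
# groups replace the shared group "1": group 1 captures a quoted value,
# group 2 an unquoted one. IGNORECASE is an assumption added here.
HREF_RE = re.compile(r'href\s*=\s*(?:"([^"]*)"|(\S+))', re.IGNORECASE)


def extract_links(html):
    """Return every href value found in html, in document order."""
    links = []
    for match in HREF_RE.finditer(html):
        quoted, unquoted = match.group(1), match.group(2)
        links.append(quoted if quoted is not None else unquoted)
    return links


# Hypothetical sample: one quoted and one unquoted href.
sample = ('<a href="http://a.example/">A</a> '
          '<a href=http://b.example/ class=x>B</a>')
links = extract_links(sample)
print(links)
```

Note that the unquoted alternative runs up to the next whitespace character, so an unquoted href that is immediately followed by ">" would capture trailing markup; a production pattern would likely need to be stricter.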
Each layer of this design, the webpage category recognition method based on vertical search and a focused crawler, and the webpage content analysis method are described in detail below.
1. Data collection layer
Function: This layer collects web data over the various Web protocols and passes the collected pages to the layer above for further processing.
Interface: This layer is the interface between the focused crawler and the Internet. Its interface to the layer above is the webpage source code as string data, which serves as the upper layer's input.
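A minimal sketch of such a collection step, assuming Python's standard urllib is an acceptable stand-in for the "various Web protocols" mentioned above, might look as follows. The function name fetch_source, the timeout value, and the UTF-8 fallback are assumptions made for illustration; the source does not specify them.

```python
import urllib.request


def fetch_source(url, timeout=10.0):
    """Download url and return the page source as a string.

    The charset declared in the Content-Type header is used when present;
    falling back to UTF-8 is an assumption of this sketch.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        charset = resp.headers.get_content_charset() or "utf-8"
    # Undecodable bytes are replaced rather than raising, so the parsing
    # layer always receives a string.
    return raw.decode(charset, errors="replace")
```

The returned string corresponds to the "webpage source code string data" that this layer hands to the content parsing layer.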
2. Content parsing layer
Function: This is the core layer of the framework. It parses the HTML of the pages gathered by the data collection layer below, extracts the hyperlinks in them, and adds those links to the URL queue. The URLs given in page links take several forms: a link may be complete, with protocol, site, and path; it may omit part of that content; or it may be a relative path. The webpage content analysis method is therefore used to extract the structured information of the page, and webpage URLs are fetched from that structured information with a breadth-first strategy. The result is Category, a mapping table between website categories and URLs, which supports the webpage category search of the application presentation layer above.
Interface: The interface between this layer's webpage category recognition and the application is a mapping table, namely the table of website categories and their corresponding URLs.
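The Function paragraph above notes that a link may be complete, partial, or relative. One common way to bring such links into a single form before queuing them is to resolve each against the URL of the page it was found on. The sketch below uses Python's standard urllib.parse for this; the function name normalize and the decision to drop #fragments are assumptions, not something the source prescribes.

```python
from urllib.parse import urldefrag, urljoin


def normalize(base_url, href):
    """Resolve href against the page it came from and drop any #fragment."""
    absolute = urljoin(base_url, href.strip())
    url, _fragment = urldefrag(absolute)
    return url


# Hypothetical page address used as the base for resolution.
base = "http://www.hao123.com/dir/index.htm"
examples = [
    normalize(base, "games.htm"),              # relative path
    normalize(base, "/top.htm#a"),             # site-absolute path with fragment
    normalize(base, "//other.example/x"),      # protocol omitted
    normalize(base, "http://full.example/p"),  # already complete
]
```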
The main method of this layer is the webpage content analysis method, which has two main components: extracting the structured information of a webpage and the crawling strategy of the focused crawler. Structural information is extracted first, and URLs are then fetched with the directed breadth-first search strategy based on webpage content features.
◆ Webpage content analysis method. The processing flow is shown in Fig. 2.
(1) Use the focused crawler to fetch the source file of the webpage;
(2) Determine whether the webpage matches the page-structure features obtained through pattern learning by the timed regular-expression learner; if not, go to step (6);
(3) Use a regular expression to extract the structured information of the webpage, which is the content block holding the URL category information;
(4) Extract the new URLs that meet the requirements from the structured information block, using a regular expression;
(5) Add the new URLs to the URL queue;
(6) Determine whether the URL queue is empty; if it is, go to step (8);
(7) Take a URL from the queue and determine whether it satisfies the search strategy; if it does, add it to the URL category table Category and go to step (1); otherwise, go to step (6);
(8) End.
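The eight steps above can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: the block pattern BLOCK_RE stands in for whatever the timed regular-expression learner would produce, and the names analyze, fetch, and accept, the seen set (added so the loop is guaranteed to terminate), and the sample pages are all hypothetical.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical learned page structure: a category block with a title.
BLOCK_RE = re.compile(r'<div class="category" title="([^"]*)">(.*?)</div>', re.S)
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"')


def analyze(start_url, fetch, accept):
    """Return a Category table mapping each accepted URL to its category."""
    category = {}
    queue = deque()           # FIFO queue, so traversal is breadth-first
    seen = {start_url}        # assumption: skip URLs already queued
    url = start_url
    while True:
        html = fetch(url)                                   # step 1
        # Steps 2-3: pages without the learned structure yield no blocks.
        for label, block in BLOCK_RE.findall(html):
            for href in HREF_RE.findall(block):             # step 4
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, label))             # step 5
        url = None
        while queue:                                        # step 6
            candidate, label = queue.popleft()              # step 7
            if accept(candidate):
                category[candidate] = label
                url = candidate                             # back to step 1
                break
        if url is None:
            return category                                 # step 8


# Hypothetical in-memory "web" so the sketch runs without network access.
pages = {
    "http://nav.example/": (
        '<div class="category" title="Games">'
        '<a href="/g1">g</a><a href="http://ext.example/">e</a></div>'
        '<div class="category" title="News"><a href="/n1">n</a></div>'
    ),
}
table = analyze("http://nav.example/", lambda u: pages.get(u, ""), lambda u: True)
```

Passing the fetcher and the search-strategy test as parameters keeps the control flow separate from the two parts the text describes as learned or configurable.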
3. Application presentation layer for webpage category recognition
Function: This layer accepts user input and returns search results. By entering keywords, a user can search precisely for the URLs of a specific field; for an unknown URL, the user can also query the website category it belongs to.
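The two queries described for this layer can be illustrated against a small Category table. The sketch below is only one possible reading: matching a keyword against the category label and falling back to a host-name comparison for unknown URLs are assumptions of this example, and the names sites_for and category_of and the sample table are hypothetical.

```python
from urllib.parse import urlsplit


def sites_for(category_table, keyword):
    """Return the URLs whose category label contains keyword."""
    needle = keyword.lower()
    return sorted(url for url, label in category_table.items()
                  if needle in label.lower())


def category_of(category_table, url):
    """Look up a URL's category: exact match first, then same host name."""
    if url in category_table:
        return category_table[url]
    host = urlsplit(url).hostname
    for known_url, label in category_table.items():
        if urlsplit(known_url).hostname == host:
            return label
    return None  # the table holds no site with this host


# Hypothetical Category table as produced by the content parsing layer.
table = {"http://g.example/": "Games", "http://n.example/": "News"}
```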
■ Webpage category recognition method based on vertical search and focused crawler technology. The processing flow is shown in Fig. 1.
(1) Read the preset URL list of navigation websites and determine whether it is empty; if it is, go to step (8);
(2) Take one website URL and put it into the list of unvisited URLs (the UV_URL list);
(3) If the UV_URL list is empty, go to step (1);
(4) Take a URL from the UV_URL list and check against the V_URL table whether it has already been visited; if so, go to step (3);
(5) Fetch the webpage source code for the URL and parse the webpage content using vertical search and focused crawler technology, obtaining the webpage categories under this site and the websites belonging to each category;
(6) Add the webpage categories and the websites belonging to each category to the Category list;
(7) Delete the URL from the UV_URL table, add it to V_URL, and go to step (1);
(8) End.
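The outer loop formed by these steps can be sketched as below. It is an illustration under assumptions rather than the patented code: classify_sites and its analyze parameter are hypothetical names, the content analysis of step (5) is passed in as a callable that returns a URL-to-category mapping, and the demonstration data is invented.

```python
from collections import deque


def classify_sites(nav_urls, analyze):
    """Visit each navigation site once and merge the results into Category."""
    pending = deque(nav_urls)   # preset navigation-site URL list
    uv_url = deque()            # UV_URL: URLs not yet visited
    v_url = set()               # V_URL: URLs already visited
    category = {}
    while pending:                              # step 1
        uv_url.append(pending.popleft())        # step 2
        while uv_url:                           # step 3
            url = uv_url.popleft()              # step 4 (also removes it, step 7)
            if url in v_url:
                continue                        # already visited
            category.update(analyze(url))       # steps 5-6
            v_url.add(url)                      # step 7
    return category, v_url                      # step 8


# Hypothetical stand-in for the content analysis of each navigation site.
demo = {
    "http://nav-a.example/": {"http://g.example/": "Games"},
    "http://nav-b.example/": {"http://n.example/": "News"},
}
calls = []


def fake_analyze(url):
    calls.append(url)
    return demo[url]


category, visited = classify_sites(
    ["http://nav-a.example/", "http://nav-b.example/", "http://nav-a.example/"],
    fake_analyze,
)
```

The duplicate entry in the input list shows the role of V_URL: the first site is analyzed only once.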
■ Beneficial effects
Webpage category recognition has broad significance and practical value. Its main applications include:
◆ vertical search for specific information by specialist users in a professional domain;
◆ searching and mining the Deep Web;
◆ analysis of website structure;
◆ analysis of the focal points of Internet users' interests;
◆ improving the search efficiency of topic-specific search engines;
◆ construction of digital libraries.
Description of drawings
Fig. 1 is the overall flowchart of the webpage category recognition method based on vertical search and focused crawler technology. It shows each processing step in recognizing webpage categories.
Fig. 2 is the flowchart of the webpage content analysis method. It shows each processing step of that method.
Embodiment
The navigation-website ingestion engine and the broadband network user behavior analysis system developed according to this method use a B/S (browser/server) architecture, with VS2005 and Oracle 9i as the development platform. Users can connect them, as needed, to any existing system that requires website classification. Deployment only requires editing a configuration file, and the system can run on a single PC or on several PCs at once. The system has been verified in practice during development and construction: the URLs it fetched cover 98% of the Alexa Top 100 Chinese sites, 87% of the Alexa Top 500 global sites, and 56% of websites with local characteristics. Actual operation and testing during development demonstrate the effectiveness of the webpage category recognition method based on vertical search and a focused crawler, and verify its accuracy.

Claims (2)

1. A webpage category recognition method based on vertical search and focused crawler technology, characterized in that the steps of the method are:
(1) Read the preset URL list of navigation websites and determine whether it is empty; if it is, go to step (8);
(2) Take one website URL and put it into the list of unvisited URLs (the UV_URL list);
(3) If the UV_URL list is empty, go to step (1);
(4) Take a URL from the UV_URL list and check against the V_URL table whether it has already been visited; if so, go to step (3);
(5) Fetch the webpage source code for the URL and parse the webpage content using vertical search and focused crawler technology, obtaining the webpage categories under this site and the websites belonging to each category;
(6) Add the webpage categories and the websites belonging to each category to the Category list;
(7) Delete the URL from the UV_URL table, add it to V_URL, and go to step (1);
(8) End.
2. A webpage content analysis method, which is the core method within the webpage category recognition method based on vertical search and focused crawler technology, characterized in that it uses vertical search and focused crawler technology to accurately identify the URL categories of a navigation website and the websites under each category, the steps of the method being:
(1) Use the focused crawler to fetch the source file of the webpage;
(2) Determine whether the webpage matches the page-structure features obtained through pattern learning by the timed regular-expression learner; if not, go to step (6);
(3) Use a regular expression to extract the structured information of the webpage, which is the content block holding the URL category information;
(4) Extract the new URLs that meet the requirements from the structured information block, using a regular expression;
(5) Add the new URLs to the URL queue;
(6) Determine whether the URL queue is empty; if it is, go to step (8);
(7) Take a URL from the queue and determine whether it satisfies the search strategy; if it does, add it to the URL category table Category and go to step (1); otherwise, go to step (6);
(8) End.
CN200910025724A 2009-03-06 2009-03-06 Webpage classification technology based on vertical search and focused crawler Pending CN101520798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910025724A CN101520798A (en) 2009-03-06 2009-03-06 Webpage classification technology based on vertical search and focused crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910025724A CN101520798A (en) 2009-03-06 2009-03-06 Webpage classification technology based on vertical search and focused crawler

Publications (1)

Publication Number Publication Date
CN101520798A true CN101520798A (en) 2009-09-02

Family

ID=41081387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910025724A Pending CN101520798A (en) 2009-03-06 2009-03-06 Webpage classification technology based on vertical search and focused crawler

Country Status (1)

Country Link
CN (1) CN101520798A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908071A (en) * 2010-08-10 2010-12-08 厦门市美亚柏科信息股份有限公司 Method and device thereof for improving search efficiency of search engine
CN102073678A (en) * 2010-12-03 2011-05-25 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102117320A (en) * 2011-01-11 2011-07-06 百度在线网络技术(北京)有限公司 Structured data searching method and device
CN102118400A (en) * 2009-12-31 2011-07-06 北京四维图新科技股份有限公司 Data acquisition method and system
CN102591992A (en) * 2012-02-15 2012-07-18 苏州亚新丰信息技术有限公司 Webpage classification identifying system and method based on vertical search and focused crawler technology
CN102654861A (en) * 2011-03-01 2012-09-05 腾讯科技(深圳)有限公司 Method and system for calculating webpage extraction accuracy
CN102968495A (en) * 2012-11-29 2013-03-13 河海大学 Vertical search engine and method for searching contrast association shopping information
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103246675A (en) * 2012-02-10 2013-08-14 百度在线网络技术(北京)有限公司 Method and equipment for capturing data of website
CN103455492A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Method and device for searching web pages
US20140046938A1 (en) * 2011-11-01 2014-02-13 Tencent Technology (Shen Zhen) Company Limited History records sorting method and apparatus
CN103778238A (en) * 2014-01-27 2014-05-07 西安交通大学 Method for automatically building classification tree from semi-structured data of Wikipedia
CN103885957A (en) * 2012-12-20 2014-06-25 百度在线网络技术(北京)有限公司 Webpage information extraction method and device
CN103984749A (en) * 2014-05-27 2014-08-13 电子科技大学 Focused crawler method based on link analysis
CN104750804A (en) * 2015-03-24 2015-07-01 南京途牛科技有限公司 Plug-in type configurable vertical network spider implementation method
CN104765823A (en) * 2015-04-08 2015-07-08 天脉聚源(北京)传媒科技有限公司 Method and device for collecting website data
US9286408B2 (en) 2013-01-30 2016-03-15 Hewlett-Packard Development Company, L.P. Analyzing uniform resource locators
CN106202467A (en) * 2016-07-18 2016-12-07 浪潮集团有限公司 A kind of definable towards peer-to-peer network searches for the web crawlers method of emphasis
CN106446068A (en) * 2016-09-06 2017-02-22 北京邮电大学 Directory database generation and query methods and apparatuses
CN106649823A (en) * 2016-12-29 2017-05-10 淮海工学院 Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler
CN106776636A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN106933944A (en) * 2017-01-20 2017-07-07 深圳前海勇艺达机器人有限公司 Method and its robot device with reciting news can automatically be captured
CN107291916A (en) * 2017-06-28 2017-10-24 上海尚工机器人技术有限公司 Internet Information Integration engine
CN107862039A (en) * 2017-11-06 2018-03-30 工业和信息化部电子第五研究所 Web data acquisition methods, system and Data Matching method for pushing
CN110633446A (en) * 2019-11-25 2019-12-31 湖南蚁坊软件股份有限公司 Webpage column recognition model training method, using method, device and storage medium
CN110704711A (en) * 2019-09-11 2020-01-17 中国海洋大学 Object automatic identification system for lifetime learning

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102118400A (en) * 2009-12-31 2011-07-06 北京四维图新科技股份有限公司 Data acquisition method and system
CN102118400B (en) * 2009-12-31 2013-07-17 北京四维图新科技股份有限公司 Data acquisition method and system
CN101908071A (en) * 2010-08-10 2010-12-08 厦门市美亚柏科信息股份有限公司 Method and device thereof for improving search efficiency of search engine
CN101908071B (en) * 2010-08-10 2012-09-05 厦门市美亚柏科信息股份有限公司 Method and device thereof for improving search efficiency of search engine
CN102073678B (en) * 2010-12-03 2013-02-27 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102073678A (en) * 2010-12-03 2011-05-25 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102117320A (en) * 2011-01-11 2011-07-06 百度在线网络技术(北京)有限公司 Structured data searching method and device
CN102654861A (en) * 2011-03-01 2012-09-05 腾讯科技(深圳)有限公司 Method and system for calculating webpage extraction accuracy
CN102654861B (en) * 2011-03-01 2017-12-08 深圳市世纪光速信息技术有限公司 Webpage extraction accuracy computational methods and system
US20140046938A1 (en) * 2011-11-01 2014-02-13 Tencent Technology (Shen Zhen) Company Limited History records sorting method and apparatus
CN103246675A (en) * 2012-02-10 2013-08-14 百度在线网络技术(北京)有限公司 Method and equipment for capturing data of website
CN103246675B (en) * 2012-02-10 2018-01-12 百度在线网络技术(北京)有限公司 A kind of method and apparatus for being used to capture website data
CN102591992A (en) * 2012-02-15 2012-07-18 苏州亚新丰信息技术有限公司 Webpage classification identifying system and method based on vertical search and focused crawler technology
CN103455492A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Method and device for searching web pages
CN103455492B (en) * 2012-05-29 2018-10-30 腾讯科技(深圳)有限公司 A kind of method and apparatus of search and webpage
CN102968495B (en) * 2012-11-29 2015-11-18 河海大学 The vertical search engine of search contrast association shopping information and method
CN102968495A (en) * 2012-11-29 2013-03-13 河海大学 Vertical search engine and method for searching contrast association shopping information
CN102982161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Method and device for acquiring webpage information
CN103885957A (en) * 2012-12-20 2014-06-25 百度在线网络技术(北京)有限公司 Webpage information extraction method and device
US9286408B2 (en) 2013-01-30 2016-03-15 Hewlett-Packard Development Company, L.P. Analyzing uniform resource locators
CN103778238A (en) * 2014-01-27 2014-05-07 西安交通大学 Method for automatically building classification tree from semi-structured data of Wikipedia
CN103984749A (en) * 2014-05-27 2014-08-13 电子科技大学 Focused crawler method based on link analysis
CN103984749B (en) * 2014-05-27 2017-10-20 电子科技大学 A kind of focused crawler method based on link analysis
CN104750804A (en) * 2015-03-24 2015-07-01 南京途牛科技有限公司 Plug-in type configurable vertical network spider implementation method
CN104765823A (en) * 2015-04-08 2015-07-08 天脉聚源(北京)传媒科技有限公司 Method and device for collecting website data
CN106776636A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN106202467A (en) * 2016-07-18 2016-12-07 浪潮集团有限公司 A kind of definable towards peer-to-peer network searches for the web crawlers method of emphasis
CN106446068A (en) * 2016-09-06 2017-02-22 北京邮电大学 Directory database generation and query methods and apparatuses
CN106446068B (en) * 2016-09-06 2020-02-07 北京邮电大学 Directory database generation and query method and device
CN106649823A (en) * 2016-12-29 2017-05-10 淮海工学院 Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler
CN106933944A (en) * 2017-01-20 2017-07-07 深圳前海勇艺达机器人有限公司 Method and its robot device with reciting news can automatically be captured
CN107291916A (en) * 2017-06-28 2017-10-24 上海尚工机器人技术有限公司 Internet Information Integration engine
CN107862039A (en) * 2017-11-06 2018-03-30 工业和信息化部电子第五研究所 Web data acquisition methods, system and Data Matching method for pushing
CN110704711A (en) * 2019-09-11 2020-01-17 中国海洋大学 Object automatic identification system for lifetime learning
CN110633446A (en) * 2019-11-25 2019-12-31 湖南蚁坊软件股份有限公司 Webpage column recognition model training method, using method, device and storage medium
CN110633446B (en) * 2019-11-25 2020-03-13 湖南蚁坊软件股份有限公司 Webpage column recognition model training method, using method, device and storage medium

Similar Documents

Publication Publication Date Title
CN101520798A (en) Webpage classification technology based on vertical search and focused crawler
CN102591992A (en) Webpage classification identifying system and method based on vertical search and focused crawler technology
CN101630330A (en) Method for webpage classification
CN103491205B (en) The method for pushing of a kind of correlated resources address based on video search and device
CN101154231B (en) Method and system for applying web page semantics
CN101452453B (en) A kind of method of input method Web side navigation and a kind of input method system
CN102622445B (en) User interest perception based webpage push system and webpage push method
CN102831121B (en) Method and system for extracting webpage information
Zheng et al. Template-independent news extraction based on visual consistency
CN100514323C (en) System and method for automatically extracting by-line information
CN103294781B (en) A kind of method and apparatus for processing page data
CN102156737B (en) Method for extracting subject content of Chinese webpage
US8380693B1 (en) System and method for automatically identifying classified websites
CN103246732B (en) A kind of abstracting method of online Web news content and system
CN101344881A (en) Index generation method and device and search system for mass file type data
CN104182412A (en) Webpage crawling method and webpage crawling system
CN101655862A (en) Method and device for searching information object
CN102306201B (en) Method and system for analyzing webpage title
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN105718585B (en) Document and label word justice correlating method and its device
CN102054028A (en) Web crawler system with page-rendering function and implementation method thereof
CN101393565A (en) Facing virtual museum searching method based on noumenon
CN103020123A (en) Method for searching bad video website
CN103984749A (en) Focused crawler method based on link analysis
CN102779135A (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
DD01 Delivery of document by public notice

Addressee: RunTrend Technology Inc. Xu Jingjing

Document name: Notification that Application Deemed to be Withdrawn

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20090902