CN102591992A - Webpage classification identifying system and method based on vertical search and focused crawler technology - Google Patents

Webpage classification identifying system and method based on vertical search and focused crawler technology

Info

Publication number
CN102591992A
Authority
CN
China
Prior art keywords
url
module
webpage
website
data acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100341952A
Other languages
Chinese (zh)
Inventor
曹武龙
王国圃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUZHOU YAXINFENG INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SUZHOU YAXINFENG INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SUZHOU YAXINFENG INFORMATION TECHNOLOGY Co Ltd
Priority to CN2012100341952A
Publication of CN102591992A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage classification and recognition system based on vertical search and focused crawler technology. The system comprises an application presentation module, a data acquisition module and a content analysis module. The data acquisition module acquires webpage data through Web protocols and passes the acquired page data to the content analysis module. The content analysis module performs HTML (HyperText Markup Language) parsing on that page data, extracts the hyperlinks in each page and adds them to a URL (uniform resource locator) queue, producing a correspondence table between website categories and URLs. The application presentation module receives the keywords a user enters for a search and returns the matching websites of a specific field and/or their website categories to the user. Operation and testing during development demonstrated the effectiveness of the webpage classification and recognition method based on vertical search and focused crawling, and verified its accuracy.

Description

Webpage classification and recognition system and method based on vertical search and focused crawler technology
Technical field
The invention belongs to the technical field of web search engines, and specifically relates to a webpage classification and recognition system and method based on vertical search and focused crawler technology.
Background
As the volume of information keeps growing, people rely more and more on search engines. General-purpose search engines such as Baidu and Google are very convenient, but as user needs diversify and expectations of result quality rise, they cannot meet users' requirements in some specialized fields. Vertical search arose to fill this gap: it is a precise search technology that serves a particular professional domain. It is more specialized and returns more targeted results, and by drawing on the domain knowledge of a specific industry topic it can offer queries based on semantic information, thereby meeting users' special search needs.
As vertical search engines grow in popularity, the focused crawler, one of their key technologies, is becoming increasingly important. A focused crawler is a program that downloads web pages automatically: guided by a predefined crawl target, it selectively visits pages and related links on the World Wide Web to obtain the information required.
Webpage classification and recognition with vertical search and focused crawler technology is difficult for the following reasons. First, a focused crawler has difficulty judging which pages in the queue of URLs to be crawled are most likely to contain topic-relevant information. Second, many current open-source crawler systems cannot extract structural information from the pages they fetch in a targeted way. Third, the content and structure of a given page change frequently, and the revisit policy of a focused crawler adapts poorly to such change. Conventional open-source focused crawler technology therefore struggles to identify webpages of different categories accurately, and a different approach is needed. The present invention arose from this.
Summary of the invention
The object of the invention is to provide a webpage classification and recognition system based on vertical search and focused crawler technology. It establishes a webpage classification and recognition method for navigation (directory) websites and designs the recognition model and algorithm. By recognizing a navigation website, the URLs under each of its categories are obtained, which makes precise search of websites easier for users and also allows the webpage category of an unknown URL to be reported.
To solve these problems in the prior art, the technical solution provided by the invention is as follows:
A webpage classification and recognition system based on vertical search and focused crawler technology, characterized in that the system comprises an application presentation module, a data acquisition module and a content analysis module. The data acquisition module collects web page data through Web protocols and then passes the collected page data to the content analysis module. The content analysis module performs HTML parsing on the page data collected by the data acquisition module, extracts the hyperlinks in the page and adds them to the URL queue, obtaining a mapping table between website categories and URLs. The application presentation module accepts keyword searches entered by the user and returns to the user the websites found in the specific field and/or the website category they belong to.
Preferably, the system is arranged between the focused crawler process and the Internet, and the focused crawler process automatically crawls navigation site information from the Internet according to rules.
Another object of the invention is to provide a webpage classification and recognition method using the system, characterized in that the method comprises the following steps:
(1) create a focused crawler process, which reads the preset URL list of Web navigation websites;
(2) the data acquisition module takes from the URL list a website URL for which data acquisition is required and fetches the page source of that URL; the content analysis module parses the page content using vertical search technology and focused crawler technology, obtains the webpage categories under this website and the website information belonging to each category, and adds them to the Category list; this is repeated until the whole URL list has been traversed. The Category list stores each recognized URL together with the website category it belongs to.
Preferably, in step (2) of the method, if the URL list is empty, the traversal ends immediately.
Preferably, in step (2) of the method, after taking from the URL list the website URL for which data acquisition is required, the data acquisition module first puts that URL into the unvisited URL list. While the unvisited URL list is not empty, a URL is taken from it, its page source is fetched and parsed by the content analysis module, and the URL is then added to the visited URL list and deleted from the unvisited URL list.
Preferably, in step (2) of the method, when the unvisited URL list is empty, the focused crawler process is notified to read the preset URL list of Web navigation websites.
Preferably, in the method, if a URL taken from the unvisited URL list has already been visited, processing continues with the next URL in the unvisited URL list.
Preferably, the content analysis performed by the content analysis module in step (2) of the method comprises the following steps:
A1) the focused crawler process fetches the page source file of the obtained URL, and then uses regular expressions to extract the structured information of the page, based on the page structure features that the periodic regular-expression learner obtained through pattern learning;
A2) regular expressions are used to extract from the structured information of the page the new links that match the website classification information, and the new links are added to the URL queue;
A3) URLs are taken from the URL queue one at a time and checked against the search strategy of the application presentation module; if a URL satisfies the search strategy, it is added, together with its website category, to the website classification table Category.
Preferably, in step A2) of the method, the regular expression extracts new links from the source file according to a breadth-first strategy.
Preferably, in step A1) of the method, when the periodic regular-expression learner cannot recognize the page structure features, the method proceeds directly to checking whether the URLs in the URL queue satisfy the search strategy of the application presentation module.
Preferably, in step A3) of the method, if a URL does not satisfy the search strategy, the method continues by checking whether the next URL in the URL queue satisfies it.
By studying vertical search and focused crawler technology, the invention solves the following problems: 1) vertical search and a focused crawler are used to obtain, from a navigation website, the websites that belong to each category; 2) precise, targeted results can be returned for a user's specialized search on a specific industry topic; 3) the webpage category of an unknown URL can be determined from the classified websites through vertical search and the focused crawler.
The webpage classification and recognition method of the invention, based on vertical search and focused crawler technology, provides a technical framework for effectively recognizing the URLs of each category in a navigation website, together with a detailed recognition algorithm. The system is divided into three modules, from bottom to top: the data acquisition module, the content analysis module and the application presentation module.
The method has two key steps: fetching the page source and analyzing the page content. Page content analysis is the core, and has two main parts: extracting the structured information of the page, and the crawl strategy of the focused crawler. A study of the page source of navigation websites shows that such a site basically consists of two kinds of page: one main directory page and one sub-directory page per category. The main directory page contains a large number of links to the category sub-pages, and each category sub-directory page contains links to the websites that belong to that category. The sub-directory pages on the same navigation website are also highly similar to one another, that is, they share a similar structure, so the structured information of these pages can be summarized by one (or a few) regular expressions obtained through pattern learning. Finding the regular expressions that represent this page structure is therefore enough to guide the focused crawler to fetch, as far as possible, only pages relevant to the classification. Taking www.hao123.com as an example, to find all URLs in the "amusement and recreation" category one can write the regular expression href(?:"(?<1>[^"]*)"|(?<1>)), which matches links of the form href="..." in a string, and thereby obtain all URLs of that category.
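The link-matching idea in the www.hao123.com example can be sketched in Python. This is only an illustration under assumptions: `HREF_RE` is a Python approximation that accepts a quoted value or an unquoted run of characters, not necessarily the exact expression the inventors used, and `extract_links` and the sample HTML are hypothetical.

```python
import re

# Assumed Python counterpart of the href pattern: a double-quoted value or,
# failing that, an unquoted run of characters up to whitespace or '>'.
HREF_RE = re.compile(r'href\s*=\s*(?:"([^"]*)"|([^\s>]+))', re.IGNORECASE)


def extract_links(html: str) -> list[str]:
    """Return every href value found in html, in document order."""
    links = []
    for match in HREF_RE.finditer(html):
        # Exactly one of the two groups is set for each match.
        quoted, bare = match.group(1), match.group(2)
        links.append(quoted if quoted is not None else bare)
    return links


# Made-up fragment standing in for one category block of a directory page.
sample = (
    '<div class="block"><a href="http://a.example/">A</a> '
    "<a href=http://b.example/>B</a></div>"
)
print(extract_links(sample))
```

A regular expression is enough here because the category blocks of a directory page are highly regular; for arbitrary HTML a real parser would usually be the safer choice.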
To cope with the irregular updates of navigation websites and to better extract the structural information of directory pages, the invention provides a periodic learner for URL regular expressions, which can adapt to the continual changes of navigation websites. Drawing on three URL search strategies, the invention also proposes a directed breadth-first search strategy based on page content features. Its basic idea is as follows: while crawling, first extract the structured information of a page in a targeted way according to the page's content features, and then crawl the pages linked from that structured information breadth-first. This reduces the number of pages collected, saves network bandwidth, and improves the efficiency of information search.
Compared with prior-art solutions, the advantages of the invention are:
The URLs crawled by this system cover 98% of the Alexa Top 100 Chinese websites, 87% of the Alexa Top 500 global websites, and 56% of local-interest websites. Operation and testing during development demonstrated the effectiveness of the webpage classification and recognition method based on vertical search and focused crawling, and verified its accuracy. The invention is of broad significance and practical value for webpage classification and recognition. Its main applications include vertical search for specific information by specialist user groups, Deep Web search and mining, website structure analysis, analysis of Internet users' interests and hot topics, improving the search efficiency of topical search engines, and digital library construction.
Description of drawings
The invention is further described below with reference to the drawings and embodiments:
Fig. 1 is the overall flow chart of the webpage classification and recognition method based on vertical search and focused crawler technology, showing each processing step in recognizing webpage categories.
Fig. 2 is the flow chart of the page content analysis method, showing each of its processing steps.
Detailed description
The above solution is further explained below with reference to a specific embodiment. It should be understood that this embodiment serves to illustrate the invention and does not limit its scope. The implementation conditions used in the embodiment may be adjusted further to suit a particular vendor's conditions, and implementation conditions not specified are generally those of routine experiments.
Embodiment
The navigation website ingestion engine and the broadband network user behavior analysis system developed in this embodiment adopt a B/S (browser/server) architecture; the development platform is VS2005 + Oracle 9i. Users can easily integrate the system, as needed, into an existing system that requires website classification. Deployment only requires editing the configuration file, and the system can run on a single PC or on several PCs simultaneously.
Each module of this design and the webpage classification and recognition method based on vertical search and focused crawling are described in detail below. The processing flow of the webpage classification and recognition method is shown in Fig. 1 and proceeds as follows:
(1) read the preset URL list of Web navigation websites and check whether the URL list is empty; if it is empty, go to step (8);
(2) take out one website URL and put it into the unvisited URL list (the UV_URL list);
(3) if the UV_URL list is empty, go to step (1);
(4) take a URL from the UV_URL list and check against the V_URL table whether it has already been visited; if so, go to step (3);
(5) fetch the page source of the URL and parse the page content using vertical search and focused crawler technology, obtaining the webpage categories under this website and the website information belonging to each category;
(6) add the webpage categories and the website information belonging to each category to the Category list;
(7) delete the URL from the UV_URL table, add it to V_URL, and go to step (1);
(8) end.
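The eight steps above can be sketched as a small Python loop. This is a minimal sketch under stated assumptions, not the patent's implementation: `classify_sites`, `fake_fetch`, `fake_parse` and the `category|url` page format are hypothetical, and the fetch and parse stages are passed in as callables so the flow can be run without network access. One deliberate difference is that a URL is removed from UV_URL as soon as it is taken out, so an already-visited URL cannot be selected again.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def classify_sites(
    seed_urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], List[Tuple[str, str]]],
) -> Dict[str, str]:
    """Walk the seed list and return a {site URL: category} table."""
    uv_url = deque()   # UV_URL: URLs not yet visited
    v_url = set()      # V_URL: URLs already visited
    category = {}      # Category: recognized URL -> category
    for seed in seed_urls:               # steps (1)-(2); an empty list ends at once
        uv_url.append(seed)
        while uv_url:                    # step (3)
            url = uv_url.popleft()       # step (4); popping also removes it from UV_URL
            if url in v_url:
                continue                 # already visited: move on to the next URL
            for cat, site in parse(fetch(url)):   # step (5)
                category.setdefault(site, cat)    # step (6)
            v_url.add(url)               # step (7)
    return category                      # step (8)


# Stand-in fetcher and parser so the sketch runs offline (hypothetical data).
pages = {"http://nav.example/": "news|http://n.example/\nmusic|http://m.example/"}


def fake_fetch(url: str) -> str:
    return pages.get(url, "")


def fake_parse(source: str) -> List[Tuple[str, str]]:
    rows = [line.split("|", 1) for line in source.splitlines() if "|" in line]
    return [(cat, site) for cat, site in rows]


table = classify_sites(
    ["http://nav.example/", "http://nav.example/"], fake_fetch, fake_parse
)
print(table)
```

Passing the same seed twice shows the role of V_URL: the second occurrence is skipped rather than fetched again.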
Web page classifying recognition methods based on vertical search and focused crawler needs following module: data acquisition module, Context resolution module and application representation module.
The function of data acquisition module: the main effect of this module is to accomplish the collection to web data through various Web agreements, gives a last module with the page that collects then and does further processing.
The interface of data acquisition module: this module is the interface of focused crawler and the Internet, with the interface of a last module be webpage source code string data, to the upper strata input data are provided.
The function of Context resolution module: this module is the nucleus module of whole framework, is that the page that data collecting module collected is got off carries out the HTML parsing according to next module mainly, extracts hyperlink wherein, joins in the URL formation.The URL that provides in the page link generally is multiple form, possibly be complete, comprises agreement, website and path; Also possibly omit partial content; Or a relative path, therefore need structured message with web page contents analytical method extraction webpage, from structured message, grasp webpage URL with breadth-first strategy; Obtain the mapping table Category of network address classification and URL, to satisfy of the search of a last module application representation module to Web page classifying.
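Because links may be complete, partial or relative, a crawler normally resolves each one against the page it was found on before queueing it. The following is a minimal sketch using Python's standard `urllib.parse`; the helper name `normalize_link` and the example URLs are hypothetical, and dropping the `#fragment` is an added assumption rather than something the text specifies.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(page_url: str, href: str) -> str:
    """Resolve href against the page it came from and drop any #fragment."""
    absolute = urljoin(page_url, href.strip())
    return urldefrag(absolute).url


base = "http://nav.example/dir/index.html"
for href in ("http://other.example/a", "//cdn.example/x", "music.html#top", "/news/"):
    # Complete URL, scheme-less URL, relative path, and site-root path.
    print(href, "->", normalize_link(base, href))
```

Normalizing before insertion keeps the URL queue free of duplicates that differ only in how the link was written.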
Interface of the content analysis module: the interface between this module's webpage classification and recognition and the application presentation module should be a mapping table, namely the table that maps website categories to URLs.
The main method of the content analysis module is the page content analysis method, which has two main parts: extracting the structured information of the page, and the crawl strategy of the focused crawler. The structural information of the page is extracted first, and URLs are then crawled using the directed breadth-first search strategy based on page content features.
The processing flow of the page content analysis method is shown in Fig. 2 and proceeds as follows:
(1) fetch the source file of the page with the focused crawler;
(2) check whether the page matches the page structure features obtained through pattern learning by the periodic regular-expression learner; if it does not, go to step (6);
(3) use the regular expression to extract the structured information of the page, which is the content block holding the website classification information;
(4) extract from the structured information block the new links that match the regular expression;
(5) add the new links to the URL queue;
(6) check whether the URL queue is empty; if it is, go to step (8);
(7) take out a URL and check whether it satisfies the search strategy; if it does, add it to the website classification table Category and go to step (1); otherwise, go to step (6);
(8) end.
Here, UV_URL stores the URLs not yet visited; V_URL stores the URLs already visited; Category stores the recognized URLs and the website categories they belong to.
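The content analysis flow can be sketched in Python as a breadth-first loop over a FIFO queue. Everything named here is an assumption made for illustration: `analyse`, the `block_re` pattern standing in for the learned page structure, the `accepts` predicate standing in for the search strategy, the `category_of` callback that supplies a category label, and the `max_pages` safety limit are not described in the text.

```python
import re
from collections import deque
from typing import Callable, Dict, Optional

HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def analyse(
    start_url: str,
    fetch: Callable[[str], str],
    block_re: "re.Pattern[str]",
    accepts: Callable[[str], bool],
    category_of: Callable[[str], str],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Directed breadth-first analysis starting from one page."""
    queue = deque()                  # the URL queue (FIFO gives breadth-first order)
    seen = {start_url}
    category: Dict[str, str] = {}    # Category: URL -> category
    url: Optional[str] = start_url
    pages_fetched = 0
    while url is not None and pages_fetched < max_pages:
        source = fetch(url)                      # step (1)
        pages_fetched += 1
        block = block_re.search(source)          # step (2)
        if block:                                # steps (3)-(5)
            for link in HREF_RE.findall(block.group(0)):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        url = None
        while queue:                             # step (6)
            candidate = queue.popleft()          # step (7)
            if accepts(candidate):
                category[candidate] = category_of(candidate)
                url = candidate                  # back to step (1) with this URL
                break
    return category                              # step (8)


# Hypothetical two-level directory site used to exercise the sketch offline.
site = {
    "http://nav.example/": '<ul id="cats"><a href="http://nav.example/music">m</a>'
                           '<a href="http://ads.example/">ad</a></ul>',
    "http://nav.example/music": '<ul id="cats"><a href="http://nav.example/music/s1">s</a></ul>',
}
result = analyse(
    "http://nav.example/",
    fetch=lambda u: site.get(u, ""),
    block_re=re.compile(r'<ul id="cats">.*?</ul>', re.DOTALL),
    accepts=lambda u: u.startswith("http://nav.example/"),
    category_of=lambda u: "music" if "/music" in u else "other",
)
print(result)
```

Restricting link extraction to the matched block is what makes the search "directed": links outside the category block never enter the queue, which is where the savings in fetched pages would come from.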
Function of the application presentation module: it handles the user's input and the feedback of search results. By entering keywords, the user can search precisely for websites in a specific field; for an unknown URL, the user can also look up the website category it belongs to.
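The two lookups offered to the user can be sketched over the Category mapping table. This is a hypothetical illustration: the function names, the sample table, and in particular the fallback that matches an unknown URL by host name are assumptions, since the text does not say how an unknown URL is matched.

```python
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def sites_in_category(category: Dict[str, str], keyword: str) -> List[str]:
    """URLs whose category name contains the keyword (case-insensitive)."""
    key = keyword.casefold()
    return sorted(url for url, cat in category.items() if key in cat.casefold())


def category_of_url(category: Dict[str, str], url: str) -> Optional[str]:
    """Exact lookup first, then fall back to any known URL on the same host."""
    if url in category:
        return category[url]
    host = urlsplit(url).netloc.lower()
    for known, cat in category.items():
        if urlsplit(known).netloc.lower() == host:
            return cat
    return None


# Hypothetical Category table: recognized URL -> website category.
lookup_table = {"http://m.example/": "Music", "http://n.example/": "News"}
print(sites_in_category(lookup_table, "music"))
print(category_of_url(lookup_table, "http://n.example/world"))
print(category_of_url(lookup_table, "http://x.example/"))
```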
The above example merely illustrates the technical concept and features of the invention. Its purpose is to enable those familiar with the art to understand and implement the invention, and it does not limit the scope of protection of the invention. All equivalent changes or modifications made in accordance with the spirit of the invention shall fall within its scope of protection.

Claims (10)

1. A webpage classification and recognition system based on vertical search and focused crawler technology, characterized in that the system comprises an application presentation module, a data acquisition module and a content analysis module; the data acquisition module collects web page data through Web protocols and then passes the collected page data to the content analysis module; the content analysis module performs HTML parsing on the page data collected by the data acquisition module, extracts the hyperlinks in the page and adds them to the URL queue, obtaining a mapping table between website categories and URLs; the application presentation module accepts keyword searches entered by the user and returns to the user the websites found in the specific field and/or the website category they belong to.
2. A webpage classification and recognition method using the system of claim 1, characterized in that the method comprises the following steps:
(1) create a focused crawler process, which reads the preset URL list of Web navigation websites;
(2) the data acquisition module takes from the URL list a website URL for which data acquisition is required and fetches the page source of that URL; the content analysis module parses the page content using vertical search technology and focused crawler technology, obtains the webpage categories under this website and the website information belonging to each category, and adds them to the Category list; this is repeated until the whole URL list has been traversed; the Category list stores each recognized URL together with the website category it belongs to.
3. The method according to claim 2, characterized in that in step (2) of the method, if the URL list is empty, the traversal ends immediately.
4. The method according to claim 2, characterized in that in step (2) of the method, after taking from the URL list the website URL for which data acquisition is required, the data acquisition module first puts that URL into the unvisited URL list; while the unvisited URL list is not empty, a URL is taken from it, its page source is fetched and parsed by the content analysis module, and the URL is then added to the visited URL list and deleted from the unvisited URL list.
5. The method according to claim 4, characterized in that in step (2) of the method, when the unvisited URL list is empty, the focused crawler process is notified to read the preset URL list of Web navigation websites.
6. The method according to claim 4, characterized in that in the method, if a URL taken from the unvisited URL list has already been visited, processing continues with the next URL in the unvisited URL list.
7. The method according to claim 2, characterized in that the content analysis performed by the content analysis module in step (2) of the method comprises the following steps:
A1) the focused crawler process fetches the page source file of the obtained URL, and then uses regular expressions to extract the structured information of the page, based on the page structure features that the periodic regular-expression learner obtained through pattern learning;
A2) regular expressions are used to extract from the structured information of the page the new links that match the website classification information, and the new links are added to the URL queue;
A3) URLs are taken from the URL queue one at a time and checked against the search strategy of the application presentation module; if a URL satisfies the search strategy, it is added, together with its website category, to the website classification table Category.
8. The method according to claim 7, characterized in that in step A2) of the method, the regular expression extracts new links from the source file according to a breadth-first strategy.
9. The method according to claim 7, characterized in that in step A1) of the method, when the periodic regular-expression learner cannot recognize the page structure features, the method proceeds directly to checking whether the URLs in the URL queue satisfy the search strategy of the application presentation module.
10. The method according to claim 7, characterized in that in step A3) of the method, if a URL does not satisfy the search strategy, the method continues by checking whether the next URL in the URL queue satisfies it.
CN2012100341952A 2012-02-15 2012-02-15 Webpage classification identifying system and method based on vertical search and focused crawler technology Pending CN102591992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100341952A CN102591992A (en) 2012-02-15 2012-02-15 Webpage classification identifying system and method based on vertical search and focused crawler technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100341952A CN102591992A (en) 2012-02-15 2012-02-15 Webpage classification identifying system and method based on vertical search and focused crawler technology

Publications (1)

Publication Number Publication Date
CN102591992A true CN102591992A (en) 2012-07-18

Family

ID=46480627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100341952A Pending CN102591992A (en) 2012-02-15 2012-02-15 Webpage classification identifying system and method based on vertical search and focused crawler technology

Country Status (1)

Country Link
CN (1) CN102591992A (en)

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902703A (en) * 2012-07-19 2013-01-30 中国人民解放军国防科学技术大学 Network sensitive information-oriented screenshot discovery and locking callback method
CN102819591A (en) * 2012-08-07 2012-12-12 北京网康科技有限公司 Content-based web page classification method and system
CN102819591B (en) * 2012-08-07 2016-04-06 北京网康科技有限公司 Content-based web page classification method and system
CN103324761A (en) * 2013-07-11 2013-09-25 广州市尊网商通资讯科技有限公司 Product database forming method based on Internet data and system
CN103324761B (en) * 2013-07-11 2016-11-30 广州市尊网商通资讯科技有限公司 Method and system for forming a product database based on Internet data
CN103744945A (en) * 2013-12-31 2014-04-23 上海伯释信息科技有限公司 Method for rapidly and accurately searching for target book by web crawler technology
CN103778238A (en) * 2014-01-27 2014-05-07 西安交通大学 Method for automatically building classification tree from semi-structured data of Wikipedia
CN103870567A (en) * 2014-03-11 2014-06-18 浪潮集团有限公司 Automatic identifying method for webpage collecting template of vertical search engine in cloud computing
CN104050037A (en) * 2014-06-13 2014-09-17 淮阴工学院 Implementation method for directional crawler based on assigned e-commerce website
CN105656707A (en) * 2014-11-18 2016-06-08 阿里巴巴集团控股有限公司 Method and system for testing web crawler
CN105656707B (en) * 2014-11-18 2019-03-26 阿里巴巴集团控股有限公司 Method and system for testing web crawler
CN106776636A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN106874282A (en) * 2015-12-11 2017-06-20 北京奇虎科技有限公司 Method and device for generating a candidate page set
CN107045507A (en) * 2016-02-05 2017-08-15 北京国双科技有限公司 Webpage crawling method and device
CN107045507B (en) * 2016-02-05 2020-08-21 北京国双科技有限公司 Webpage crawling method and device
CN108376071A (en) * 2016-11-11 2018-08-07 中移(杭州)信息技术有限公司 APP identification method and system
CN108376071B (en) * 2016-11-11 2021-08-24 中移(杭州)信息技术有限公司 APP identification method and system
CN110637316B (en) * 2016-12-22 2021-04-13 奥恩全球运营有限公司,新加坡分公司 System and method for prospective object identification
CN110637316A (en) * 2016-12-22 2019-12-31 奥恩全球运营有限公司,新加坡分公司 System and method for intelligent prospect identification using online resources and neural network processing to classify organizations based on published materials
US11455313B2 (en) 2016-12-22 2022-09-27 Aon Global Operations Se, Singapore Branch Systems and methods for intelligent prospect identification using online resources and neural network processing to classify organizations based on published materials
US10769159B2 (en) 2016-12-22 2020-09-08 Aon Global Operations Plc, Singapore Branch Systems and methods for data mining of historic electronic communication exchanges to identify relationships, patterns, and correlations to deal outcomes
CN106649823A (en) * 2016-12-29 2017-05-10 淮海工学院 Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler
CN107301253A (en) * 2017-08-23 2017-10-27 杭州安恒信息技术有限公司 Method and device for improving multi-site search keyword accuracy
CN110309408B (en) * 2018-03-09 2023-07-14 陈包容 Automatic searching method
CN110309408A (en) * 2018-03-09 2019-10-08 陈包容 Automatic searching method
CN109446396A (en) * 2018-10-17 2019-03-08 珠海市智图数研信息技术有限公司 Intelligent crawler framework system for online business information
CN109597928B (en) * 2018-12-05 2022-12-16 云南电网有限责任公司信息中心 Unstructured text acquisition method supporting user policy configuration and based on Web network
CN109597928A (en) * 2018-12-05 2019-04-09 云南电网有限责任公司信息中心 Unstructured text acquisition method supporting user policy configuration and based on Web network
US10951695B2 (en) 2019-02-14 2021-03-16 Aon Global Operations Se Singapore Branch System and methods for identification of peer entities
CN110781366A (en) * 2019-09-09 2020-02-11 深圳壹账通智能科技有限公司 Webpage data processing method and device, computer equipment and storage medium
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN111881336A (en) * 2020-07-28 2020-11-03 上海应用技术大学 Topic web crawler method and system
CN111931113A (en) * 2020-09-16 2020-11-13 深圳壹账通智能科技有限公司 Data cleaning method and related equipment

Similar Documents

Publication Publication Date Title
CN102591992A (en) Webpage classification identifying system and method based on vertical search and focused crawler technology
CN101520798A (en) Webpage classification technology based on vertical search and focused crawler
CN102622445B (en) User interest perception based webpage push system and webpage push method
CN101630330A (en) Method for webpage classification
CN100405371C (en) Method and system for abstracting new word
CN101452453B Method for website navigation through an input method, and input method system
CN103365924B Method, device and terminal for Internet information search
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN102073725B (en) Method for searching structured data and search engine system for implementing same
CN102073726B (en) Structured data import method and device for search engine system
CN102306201B (en) Method and system for analyzing webpage title
CN106570171A (en) Semantics-based sci-tech information processing method and system
CN103246732B Method and system for extracting online Web news content
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN103294781A (en) Method and equipment used for processing page data
CN104182412A (en) Webpage crawling method and webpage crawling system
CN101231661A Method and system for mining object-level knowledge
CN102054028A (en) Web crawler system with page-rendering function and implementation method thereof
CN102270331A (en) Network shopping navigating method based on visual search
CN101599089A System and method for automatic search and extraction of content update information on video service websites
CN101551800A (en) Marked information generation device, inquiry unit and sharing system
CN104572934B DOM-based method for extracting key webpage content
CN102722499A (en) Search engine and implementation method thereof
CN103530429A (en) Webpage content extracting method
CN101241506A Multi-dimensional search method, device and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120718