CN102591992A - Webpage classification identifying system and method based on vertical search and focused crawler technology - Google Patents
Webpage classification identifying system and method based on vertical search and focused crawler technology
- Publication number
- CN102591992A
- Authority
- CN
- China
- Prior art keywords
- url
- module
- webpage
- website
- data acquisition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a webpage classification identification system based on vertical search and focused crawler technology. The system comprises an application presentation module, a data acquisition module and a content parsing module. The data acquisition module collects web page data through Web protocols and passes the collected page data to the content parsing module; the content parsing module performs HTML (HyperText Markup Language) parsing on the page data collected by the data acquisition module, extracts the hyperlinks in each page and adds them to a URL (uniform resource locator) queue to obtain a correspondence table between website categories and URLs; and the application presentation module receives the keywords entered by a user for searching and returns to the user the websites found in a specific field and/or their website categories. Actual operation and testing during development demonstrated the effect of the webpage classification identification method based on vertical search and a focused crawler, and verified its accuracy.
Description
Technical field
The invention belongs to the technical field of web search engines, and specifically relates to a webpage classification identification system and method based on vertical search and focused crawler technology.
Background technology
With the continuous growth of information, people increasingly depend on search engines. General-purpose search engines such as Baidu and Google offer great convenience, but as user needs diversify and expectations of result quality rise, they cannot meet users' requirements in some specialized fields. Vertical search emerged in response. It is a precise search technology serving a particular professional domain: it is more specialized, its results are more targeted, and by drawing on the domain knowledge of a specific industry or topic it can support queries based on semantic information, thereby meeting users' specialized search needs.
As vertical search engines grow in popularity, the focused crawler, one of their key technologies, becomes increasingly important. A focused crawler is a program that downloads web pages automatically; according to a defined crawl target, it selectively visits pages and related links on the World Wide Web to obtain the required information.
Webpage classification identification with vertical search and focused crawler technology is difficult for the following reasons. First, it is hard for a focused crawler to decide which page in the queue of URLs to be crawled is most likely to contain topic-relevant information. Second, many current open-source crawler systems cannot extract structured information from crawled pages in a targeted way. Third, the content and structure of a given page change frequently, and the revisit strategy of a focused crawler adapts poorly to such changes. Focused crawlers built on conventional open-source technology therefore struggle to identify the categories of web pages accurately, and a different approach is needed. The present invention arises from this.
Summary of the invention
The object of the invention is to provide a webpage classification identification system based on vertical search and focused crawler technology: to establish a classification identification method for navigation websites based on vertical search and focused crawler technology, and to design its identification model and algorithm. By identifying a navigation website, the URLs under each of its categories are obtained, which makes precise website search easier for users; the webpage category of an unknown URL can also be given.
To solve these problems in the prior art, the technical scheme provided by the invention is:
A webpage classification identification system based on vertical search and focused crawler technology, characterized in that the system comprises an application presentation module, a data acquisition module and a content parsing module. The data acquisition module collects web page data through Web protocols and passes the collected page data to the content parsing module; the content parsing module performs HTML parsing on the page data collected by the data acquisition module, extracts the hyperlinks in the page, adds them to the URL queue, and obtains the mapping table between website categories and URLs; the application presentation module accepts keyword searches entered by the user and returns to the user the websites found in the specific field and/or the website categories to which they belong.
Preferably, the system is arranged between the focused crawler process and the Internet, and the focused crawler process automatically crawls navigation-site information from the Internet according to rules.
Another object of the invention is to provide a webpage classification identification method using the above system, characterized in that the method comprises the following steps:
(1) create a focused crawler process, which reads a preset URL list of web navigation sites;
(2) the data acquisition module takes from the URL list a site URL for which data is to be collected and fetches the page source of that URL; the content parsing module parses the page content using vertical search and focused crawler technology, obtains the webpage category information of the site and the website information corresponding to each category, and adds both to the Category list; this is repeated until the whole URL list has been traversed. The Category list stores the identified URLs and the website categories to which they belong.
Preferably, in step (2) of the method, when the URL list is empty, the traversal ends directly.
Preferably, in step (2) of the method, after taking from the URL list the site URL for which data is to be collected, the data acquisition module first puts that site URL into an unvisited-URL list; while the unvisited-URL list is not empty, it takes one URL from that list, fetches the page source of the URL, has the content parsing module parse the source, adds the URL to the visited-URL list, and deletes it from the unvisited-URL list.
Preferably, in step (2) of the method, when the unvisited-URL list is empty, the focused crawler process is notified to read the preset URL list of web navigation sites.
Preferably, when a URL taken from the unvisited-URL list has already been visited, the method proceeds to the next URL in the unvisited-URL list.
Preferably, in step (2) of the method, the content parsing module performs content parsing in the following steps:
A1) the focused crawler process fetches the page source file of the obtained URL and then, according to the page structure features obtained through pattern learning by the periodic regular-expression learner, uses a regular expression to extract the structured information of the page;
A2) a regular expression is used to extract from the structured information of the page the new URLs that match the website category information, and the new URLs are added to the URL queue;
A3) URLs are taken from the URL queue one at a time and checked against the search strategy of the application presentation module; a URL that satisfies the search strategy is added, together with its website category, to the website category table Category.
Preferably, in step A2) of the method, the regular expression extracts new URLs from the source file according to a breadth-first strategy.
Preferably, in step A1) of the method, when the periodic regular-expression learner cannot recognize the page structure features, the method proceeds directly to checking whether the URLs in the URL queue satisfy the search strategy of the application presentation module.
Preferably, in step A3) of the method, if a URL does not satisfy the search strategy, the method goes on to check whether the next URL in the URL queue satisfies it.
Through its study of vertical search and focused crawler technology, the invention solves the following problems: 1) vertical search and a focused crawler are used to obtain from a navigation website the website addresses under each category; 2) targeted, precise search results can be returned for a user's specialized search on a specific industry topic; 3) the webpage category to which an unknown URL belongs on the classification website is obtained through vertical search and the focused crawler.
The webpage classification identification method of the invention, based on vertical search and focused crawler technology, provides a technical framework for effectively identifying the URLs of each category in a navigation website, and designs the identification algorithm in detail. The system is divided into three modules, from bottom to top: the data acquisition module, the content parsing layer and the application presentation layer.
The key steps of the webpage classification identification method of the invention fall into two parts: obtaining the page source and the web page content analysis method. The content analysis method is the core and itself has two main parts: extracting the structured information of the page, and the crawl strategy of the focused crawler. Study of the page source of navigation websites shows that a navigation website consists essentially of two kinds of page: one main directory page and a sub-directory page for each category. The main directory page contains a large number of links pointing to the category sub-pages, and each category's sub-directory page contains links to the websites belonging to that category. The category sub-directory pages of a given navigation website are also highly similar; that is, they share a similar structure, so the structured information of these pages can be summarized through pattern learning in one (or a few) regular expressions. Once the regular expressions representing this page structure are found, they can guide the focused crawler to fetch, as far as possible, only the pages relevant to the categories. Take www.hao123.com as an example: to find all URLs in the "leisure and entertainment" category, one can write the regular expression href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+)), which matches links of the form href="..." in a string, and thereby obtain all URLs of that category.
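The href-matching idea described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the embodiment targets a .NET platform, whose numbered named groups Python's `re` module does not accept, so the sketch uses two ordinary capture groups instead. The function name `extract_hrefs`, the tighter unquoted-value class `[^\s>"]+`, and the sample markup with its example.com addresses are all assumptions made for illustration.

```python
import re
from typing import List

# Group 1 captures a double-quoted href value, group 2 an unquoted one.
# (Assumed Python adaptation of the href pattern discussed in the text.)
HREF_PATTERN = re.compile(r'href\s*=\s*(?:"([^"]*)"|([^\s>"]+))', re.IGNORECASE)


def extract_hrefs(html: str) -> List[str]:
    """Return every href value found in the given HTML fragment, in order."""
    links = []
    for match in HREF_PATTERN.finditer(html):
        quoted, unquoted = match.groups()
        links.append(quoted if quoted is not None else unquoted)
    return links


# Hypothetical category block, standing in for a navigation-site sub-directory page.
SAMPLE = (
    '<div id="leisure">'
    '<a href="http://games.example.com/">Games</a>'
    '<a HREF = "http://music.example.com/">Music</a>'
    '<a href=http://video.example.com/>Video</a>'
    '</div>'
)

print(extract_hrefs(SAMPLE))
```

A regular expression is adequate here only because category pages on one navigation site share a regular structure; for arbitrary HTML a real parser would be the safer choice.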
To cope with the irregular updates of navigation websites and to better extract the structured information of directory pages, the invention provides a periodic learner for URL regular expressions, which adapts to the continual changes of navigation sites. With reference to three existing URL search strategies, the invention also proposes a directed breadth-first search strategy based on page content features. Its basic idea is: during crawling, first extract the structured information of the page in a targeted way according to the page's content features, and then crawl pages from that structured information with a breadth-first strategy. This effectively reduces the number of pages collected, saves network bandwidth, and improves the efficiency of information search.
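The directed breadth-first idea can be illustrated with a small sketch: links are followed only when they appear inside the structured block of a page, and pages are visited in first-in, first-out order. Everything concrete here is an assumption for illustration: the block pattern `<div class="category">`, the function name `directed_bfs`, the `fetch` callback, and the in-memory `PAGES` dictionary that stands in for real HTTP requests.

```python
import re
from collections import deque
from typing import Callable, List

# Hypothetical pattern for the structured block that holds category links.
BLOCK_RE = re.compile(r'<div class="category">(.*?)</div>', re.DOTALL)
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"')


def directed_bfs(start: str, fetch: Callable[[str], str], max_pages: int = 100) -> List[str]:
    """Visit pages breadth-first, following only links found inside category blocks."""
    queue = deque([start])
    seen = {start}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()              # FIFO order gives breadth-first traversal
        visited.append(url)
        html = fetch(url)
        for block in BLOCK_RE.findall(html):    # directed: structured blocks only
            for link in HREF_RE.findall(block):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return visited


# In-memory stand-in for the web; the "ads" block must never be followed.
PAGES = {
    "nav": '<div class="category"><a href="games"></a><a href="music"></a></div>'
           '<div class="ads"><a href="ad"></a></div>',
    "games": '<div class="category"><a href="g1"></a></div>',
    "music": "",
    "g1": "",
}

print(directed_bfs("nav", lambda url: PAGES.get(url, "")))
```

Because the link outside the category block is never queued, fewer pages are fetched than with a plain breadth-first crawl, which is the saving the text describes.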
Compared with schemes in the prior art, the advantages of the invention are:
The URLs crawled by this system cover 98% of the Chinese-site ALEXA TOP 100, 87% of the global-site ALEXA TOP 500, and 56% of local-interest websites. Actual operation and testing during development demonstrated the effect of the webpage classification identification method based on vertical search and a focused crawler, and verified its accuracy. The invention has broad significance and application value for webpage classification identification. It can mainly be applied to vertical search for specific information by specific groups in professional fields, Deep Web search and mining, website structure analysis, analysis of Internet users' interest hot spots, improving the search efficiency of topic search engines, and digital library construction.
Description of drawings
The invention is further described below with reference to the drawings and an embodiment:
Fig. 1 is the overall flowchart of the webpage classification identification method based on vertical search and focused crawler technology; it shows each processing step of identifying webpage categories.
Fig. 2 is the flowchart of the web page content parsing method; it shows each processing step of that method.
Detailed description
The above scheme is further explained below with reference to a specific embodiment. It should be understood that the embodiment is intended to illustrate the invention, not to limit its scope. The implementation conditions used in the embodiment can be further adjusted according to the conditions of the specific vendor; implementation conditions not specified are generally those of routine experiments.
Embodiment
The navigation-website ingestion engine and the broadband network user behavior analysis system developed in this embodiment use a B/S (browser/server) architecture; the development platform is vs2005 + oracle 9i. Users can, as needed, easily integrate it into an existing system that requires website classification. Deployment only requires modifying the configuration file; the system can run on a single PC or on several PCs simultaneously.
Each module of this design and the webpage classification identification method based on vertical search and a focused crawler are described in detail below. The processing flow of the webpage classification identification method is shown in Fig. 1 and proceeds as follows:
(1) read the preset URL list of web navigation sites and check whether the URL list is empty; if it is empty, go to step (8);
(2) take out one site URL and put it into the unvisited-URL list (the UV_URL list);
(3) if the UV_URL list is empty, go to step (1);
(4) take one URL from the UV_URL list and check against table V_URL whether it has been visited; if so, go to step (3);
(5) fetch the page source of the URL and parse the page content using vertical search and focused crawler technology, obtaining the webpage category information of the site and the website information corresponding to each category;
(6) add the webpage category information and the website information corresponding to each category to the Category list;
(7) delete the URL from table UV_URL, add it to V_URL, and go to step (1);
(8) end.
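The flow of steps (1) to (8) can be sketched as a short loop. This is a sketch under stated assumptions rather than the patented implementation: `fetch_source` and `parse_categories` are hypothetical callbacks standing in for the data acquisition and content parsing modules, the toy source format `category:site,site;...` is invented for the demo, and the sketch removes a URL from UV_URL at the moment it is taken out so that an already-visited URL cannot be picked again.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def classify_sites(
    seed_urls: Iterable[str],
    fetch_source: Callable[[str], str],
    parse_categories: Callable[[str], List[Tuple[str, str]]],
) -> Dict[str, str]:
    """Traverse the preset navigation-site list and build the Category mapping."""
    uv_url = deque()                # UV_URL: URLs not yet visited
    v_url = set()                   # V_URL: URLs already visited
    category: Dict[str, str] = {}   # Category: identified URL -> website category
    for site in seed_urls:                  # steps (1)-(2): next preset site URL
        uv_url.append(site)
        while uv_url:                       # step (3): stop when UV_URL is empty
            url = uv_url.popleft()          # step (4): take one URL out of UV_URL
            if url in v_url:
                continue                    # already visited, try the next one
            source = fetch_source(url)      # step (5): fetch and parse the page
            for site_url, name in parse_categories(source):
                category[site_url] = name   # step (6): record URL and category
            v_url.add(url)                  # step (7): mark the URL as visited
    return category                         # step (8): end


# Hypothetical fixtures: a fake fetcher and a parser for a toy source format.
SOURCES = {"nav-a": "games:site1,site2;music:site3", "nav-b": "news:site4"}


def fake_parse(source: str) -> List[Tuple[str, str]]:
    pairs = []
    for section in source.split(";"):
        name, sites = section.split(":")
        pairs.extend((site, name) for site in sites.split(","))
    return pairs


RESULT = classify_sites(["nav-a", "nav-b", "nav-a"], SOURCES.__getitem__, fake_parse)
print(RESULT)
```

The repeated seed `nav-a` is skipped on its second appearance because it is already in V_URL, which is the purpose of the visited table in step (4).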
The webpage classification identification method based on vertical search and a focused crawler requires the following modules: the data acquisition module, the content parsing module and the application presentation module.
Function of the data acquisition module: this module collects web page data through various Web protocols and passes the collected pages to the module above it for further processing.
Interface of the data acquisition module: this module is the interface between the focused crawler and the Internet; its interface to the module above is the page source as string data, which it supplies as input to the upper layer.
Function of the content parsing module: this is the core module of the whole framework. It performs HTML parsing on the pages collected by the module below it (the data acquisition module), extracts the hyperlinks in them and adds them to the URL queue. The URLs given in page links come in several forms: a link may be complete, including protocol, site and path; it may omit part of this; or it may be a relative path. The web page content analysis method is therefore used to extract the structured information of the page, and page URLs are extracted from the structured information with a breadth-first strategy, producing the mapping table Category between website categories and URLs, which serves the webpage category searches of the module above, the application presentation module.
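The different link forms mentioned above (complete URLs, links that omit the protocol or site, and relative paths) can be brought to one absolute form before they enter the URL queue. The sketch below uses the standard-library function `urllib.parse.urljoin`; the helper name `normalize_links`, the decision to keep only http and https links, and the example.com addresses are assumptions made for illustration.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def normalize_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve raw href values against the page URL and keep unique http(s) links."""
    result: List[str] = []
    for href in hrefs:
        absolute = urljoin(page_url, href.strip())
        # Drop non-web schemes such as javascript: or mailto:, and duplicates.
        if absolute.startswith(("http://", "https://")) and absolute not in result:
            result.append(absolute)
    return result


BASE = "http://www.example.com/dir/index.html"
RAW = [
    "http://other.example.org/a",   # complete URL
    "/top.html",                    # site omitted, absolute path
    "sub/page.html",                # relative path
    "//cdn.example.net/x",          # protocol omitted
    "javascript:void(0)",           # not a crawlable link
    "/top.html",                    # duplicate
]

print(normalize_links(BASE, RAW))
```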
Interface of the content parsing module: the interface between this module's webpage classification identification and the application module is a mapping table, namely the correspondence table between website categories and URLs.
The main method of the content parsing module is the web page content analysis method, which has two main parts: extracting the structured information of the page, and the crawl strategy of the focused crawler. The structured information of the page is extracted first; URLs are then crawled with the directed breadth-first search strategy based on page content features.
The processing flow of the web page content parsing method is shown in Fig. 2 and proceeds as follows:
(1) use the focused crawler to fetch the source file of the page;
(2) check whether the page matches the page structure features obtained through pattern learning by the periodic regular-expression learner; if not, go to step (6);
(3) use a regular expression to extract the structured information of the page; this structured information is the content block holding the website category information;
(4) extract the new URLs that meet the requirements from the structured-information block according to the regular expression;
(5) add the new URLs to the URL queue;
(6) check whether the URL queue is empty; if it is, go to step (8);
(7) take out one URL and check whether it satisfies the search strategy; if it does, add the URL to the website category table Category and go to step (1); otherwise, go to step (6);
(8) end.
Here, UV_URL stores the URLs not yet visited; V_URL stores the URLs already visited; Category stores the identified URLs and the website categories to which they belong.
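The content parsing flow of Fig. 2 can be sketched as two nested loops: the outer loop fetches a page and queues the links found in its structured block, and the inner loop drains the URL queue until a URL satisfies the search strategy. The sketch makes several assumptions that the text does not specify: the search strategy is modelled as a hypothetical `classify` callback that returns a category name or `None`, the learned structure feature is modelled as a fixed pattern `<ul id="cats">`, a `seen` set prevents the same URL from being queued twice, and `DEMO_PAGES` stands in for fetched page sources.

```python
import re
from collections import deque
from typing import Callable, Dict, Optional, Pattern


def parse_content(
    start_url: str,
    fetch: Callable[[str], str],
    block_re: Pattern[str],
    link_re: Pattern[str],
    classify: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Build the Category table by following steps (1)-(8) of the parsing flow."""
    category: Dict[str, str] = {}
    queue = deque()
    seen = {start_url}
    current: Optional[str] = start_url
    while current is not None:
        source = fetch(current)                           # step (1)
        block = block_re.search(source)                   # step (2)
        if block:
            for link in link_re.findall(block.group(1)):  # steps (3)-(4)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)                    # step (5)
        current = None
        while queue:                                      # step (6)
            candidate = queue.popleft()                   # step (7)
            name = classify(candidate)
            if name is not None:
                category[candidate] = name
                current = candidate                       # back to step (1)
                break
    return category                                       # step (8)


DEMO_PAGES = {
    "nav": '<ul id="cats"><a href="games/"></a><a href="about"></a><a href="music/"></a></ul>',
    "games/": '<ul id="cats"><a href="games/site1"></a></ul>',
    "music/": "no recognisable structure",
}


def demo_classify(url: str) -> Optional[str]:
    """Hypothetical search strategy: the URL prefix decides the category."""
    for prefix in ("games", "music"):
        if url.startswith(prefix + "/"):
            return prefix
    return None


TABLE = parse_content(
    "nav",
    lambda url: DEMO_PAGES.get(url, ""),
    re.compile(r'<ul id="cats">(.*?)</ul>', re.DOTALL),
    re.compile(r'href="([^"]*)"'),
    demo_classify,
)
print(TABLE)
```

The page `music/` has no recognisable structure, so the sketch skips extraction for it and moves straight to the queue check, as step (2) prescribes.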
Function of the application presentation module: it handles user input and the feedback of search results. By entering keywords, a user can precisely search for websites in a specific field; for an unknown URL, the user can also look up the website category to which it belongs.
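The two queries offered by the application presentation module can be sketched on top of a simple in-memory mapping. The class name `CategoryTable`, its method names, and the rule that a keyword matches when it occurs in the category name are assumptions for illustration; the embodiment itself stores the table in a database.

```python
from typing import Dict, List, Optional


class CategoryTable:
    """Minimal stand-in for the Category mapping between URLs and categories."""

    def __init__(self) -> None:
        self._by_url: Dict[str, str] = {}

    def add(self, url: str, category: str) -> None:
        self._by_url[url] = category

    def search(self, keyword: str) -> List[str]:
        """Return the URLs whose category name contains the keyword."""
        needle = keyword.lower()
        return sorted(u for u, c in self._by_url.items() if needle in c.lower())

    def category_of(self, url: str) -> Optional[str]:
        """Return the category of a URL, or None if it has not been identified."""
        return self._by_url.get(url)


table = CategoryTable()
table.add("http://games.example.com/", "Leisure and entertainment")
table.add("http://music.example.com/", "Leisure and entertainment")
table.add("http://news.example.com/", "News")

print(table.search("leisure"))
print(table.category_of("http://news.example.com/"))
```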
The above example merely illustrates the technical concept and features of the invention. Its purpose is to enable those skilled in the art to understand and implement the invention; it does not limit the scope of protection of the invention. All equivalent changes or modifications made according to the spirit of the invention shall fall within its scope of protection.
Claims (10)
1. A webpage classification identification system based on vertical search and focused crawler technology, characterized in that the system comprises an application presentation module, a data acquisition module and a content parsing module; the data acquisition module collects web page data through Web protocols and passes the collected page data to the content parsing module; the content parsing module performs HTML parsing on the page data collected by the data acquisition module, extracts the hyperlinks in the page, adds them to the URL queue, and obtains the mapping table between website categories and URLs; the application presentation module accepts keyword searches entered by the user and returns to the user the websites found in the specific field and/or the website categories to which they belong.
2. A webpage classification identification method using the system of claim 1, characterized in that the method comprises the following steps:
(1) create a focused crawler process, which reads a preset URL list of web navigation sites;
(2) the data acquisition module takes from the URL list a site URL for which data is to be collected and fetches the page source of that URL; the content parsing module parses the page content using vertical search and focused crawler technology, obtains the webpage category information of the site and the website information corresponding to each category, and adds both to the Category list; this is repeated until the whole URL list has been traversed; the Category list stores the identified URLs and the website categories to which they belong.
3. The method according to claim 2, characterized in that in step (2), when the URL list is empty, the traversal ends directly.
4. The method according to claim 2, characterized in that in step (2), after taking from the URL list the site URL for which data is to be collected, the data acquisition module first puts that site URL into an unvisited-URL list; while the unvisited-URL list is not empty, it takes one URL from that list, fetches the page source of the URL, has the content parsing module parse the source, adds the URL to the visited-URL list, and deletes it from the unvisited-URL list.
5. The method according to claim 4, characterized in that in step (2), when the unvisited-URL list is empty, the focused crawler process is notified to read the preset URL list of web navigation sites.
6. The method according to claim 4, characterized in that when a URL taken from the unvisited-URL list has already been visited, the method proceeds to the next URL in the unvisited-URL list.
7. The method according to claim 2, characterized in that in step (2) the content parsing module performs content parsing in the following steps:
A1) the focused crawler process fetches the page source file of the obtained URL and then, according to the page structure features obtained through pattern learning by the periodic regular-expression learner, uses a regular expression to extract the structured information of the page;
A2) a regular expression is used to extract from the structured information of the page the new URLs that match the website category information, and the new URLs are added to the URL queue;
A3) URLs are taken from the URL queue one at a time and checked against the search strategy of the application presentation module; a URL that satisfies the search strategy is added, together with its website category, to the website category table Category.
8. The method according to claim 7, characterized in that in step A2) the regular expression extracts new URLs from the source file according to a breadth-first strategy.
9. The method according to claim 7, characterized in that in step A1), when the periodic regular-expression learner cannot recognize the page structure features, the method proceeds directly to checking whether the URLs in the URL queue satisfy the search strategy of the application presentation module.
10. The method according to claim 7, characterized in that in step A3), if a URL does not satisfy the search strategy, the method goes on to check whether the next URL in the URL queue satisfies it.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100341952A CN102591992A (en) | 2012-02-15 | 2012-02-15 | Webpage classification identifying system and method based on vertical search and focused crawler technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100341952A CN102591992A (en) | 2012-02-15 | 2012-02-15 | Webpage classification identifying system and method based on vertical search and focused crawler technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102591992A (en) | 2012-07-18 |
Family
ID=46480627
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012100341952A Pending CN102591992A (en) | 2012-02-15 | 2012-02-15 | Webpage classification identifying system and method based on vertical search and focused crawler technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102591992A (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819591A (en) * | 2012-08-07 | 2012-12-12 | 北京网康科技有限公司 | Content-based web page classification method and system |
CN102902703A (en) * | 2012-07-19 | 2013-01-30 | 中国人民解放军国防科学技术大学 | Network sensitive information-oriented screenshot discovery and locking callback method |
CN103324761A (en) * | 2013-07-11 | 2013-09-25 | 广州市尊网商通资讯科技有限公司 | Product database forming method based on Internet data and system |
CN103744945A (en) * | 2013-12-31 | 2014-04-23 | 上海伯释信息科技有限公司 | Method for rapidly and accurately searching for target book by web crawler technology |
CN103778238A (en) * | 2014-01-27 | 2014-05-07 | 西安交通大学 | Method for automatically building classification tree from semi-structured data of Wikipedia |
CN103870567A (en) * | 2014-03-11 | 2014-06-18 | 浪潮集团有限公司 | Automatic identifying method for webpage collecting template of vertical search engine in cloud computing |
CN104050037A (en) * | 2014-06-13 | 2014-09-17 | 淮阴工学院 | Implementation method for directional crawler based on assigned e-commerce website |
CN105656707A (en) * | 2014-11-18 | 2016-06-08 | 阿里巴巴集团控股有限公司 | Method and system for testing web crawler |
CN103324761B (en) * | 2013-07-11 | 2016-11-30 | 广州市尊网商通资讯科技有限公司 | A kind of based on internet data formation product database method and system |
CN106649823A (en) * | 2016-12-29 | 2017-05-10 | 淮海工学院 | Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler |
CN106776636A (en) * | 2015-11-24 | 2017-05-31 | 北京国双科技有限公司 | Data processing method and device |
CN106874282A (en) * | 2015-12-11 | 2017-06-20 | 北京奇虎科技有限公司 | The generation method and device of candidate page set |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN107301253A (en) * | 2017-08-23 | 2017-10-27 | 杭州安恒信息技术有限公司 | A kind of method and device for improving multi-site search key accuracy |
CN108376071A (en) * | 2016-11-11 | 2018-08-07 | 中移(杭州)信息技术有限公司 | A kind of APP recognition methods and system |
CN109446396A (en) * | 2018-10-17 | 2019-03-08 | 珠海市智图数研信息技术有限公司 | A kind of intelligent crawler frame system of line business information |
CN109597928A (en) * | 2018-12-05 | 2019-04-09 | 云南电网有限责任公司信息中心 | Support the non-structured text acquisition methods based on Web network of subscriber policy configuration |
CN110309408A (en) * | 2018-03-09 | 2019-10-08 | 陈包容 | A method of automation search |
CN110637316A (en) * | 2016-12-22 | 2019-12-31 | 奥恩全球运营有限公司,新加坡分公司 | System and method for intelligent prospective object recognition using online resources and neural network processing to classify tissue based on published material |
CN110781366A (en) * | 2019-09-09 | 2020-02-11 | 深圳壹账通智能科技有限公司 | Webpage data processing method and device, computer equipment and storage medium |
US10769159B2 (en) | 2016-12-22 | 2020-09-08 | Aon Global Operations Plc, Singapore Branch | Systems and methods for data mining of historic electronic communication exchanges to identify relationships, patterns, and correlations to deal outcomes |
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN111881336A (en) * | 2020-07-28 | 2020-11-03 | 上海应用技术大学 | Topic web crawler method and system |
CN111931113A (en) * | 2020-09-16 | 2020-11-13 | 深圳壹账通智能科技有限公司 | Data cleaning method and related equipment |
US10951695B2 (en) | 2019-02-14 | 2021-03-16 | Aon Global Operations Se Singapore Branch | System and methods for identification of peer entities |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101520798A (en) * | 2009-03-06 | 2009-09-02 | 苏州锐创通信有限责任公司 | Webpage classification technology based on vertical search and focused crawler |
CN101630330A (en) * | 2009-08-14 | 2010-01-20 | 苏州锐创通信有限责任公司 | Method for webpage classification |
-
2012
- 2012-02-15 CN CN2012100341952A patent/CN102591992A/en active Pending
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102902703A (en) * | 2012-07-19 | 2013-01-30 | 中国人民解放军国防科学技术大学 | Network sensitive information-oriented screenshot discovery and locking callback method |
CN102819591A (en) * | 2012-08-07 | 2012-12-12 | 北京网康科技有限公司 | Content-based web page classification method and system |
CN102819591B (en) * | 2012-08-07 | 2016-04-06 | 北京网康科技有限公司 | A kind of content-based Web page classification method and system |
CN103324761A (en) * | 2013-07-11 | 2013-09-25 | 广州市尊网商通资讯科技有限公司 | Product database forming method based on Internet data and system |
CN103324761B (en) * | 2013-07-11 | 2016-11-30 | 广州市尊网商通资讯科技有限公司 | A kind of based on internet data formation product database method and system |
CN103744945A (en) * | 2013-12-31 | 2014-04-23 | 上海伯释信息科技有限公司 | Method for rapidly and accurately searching for target book by web crawler technology |
CN103778238A (en) * | 2014-01-27 | 2014-05-07 | 西安交通大学 | Method for automatically building classification tree from semi-structured data of Wikipedia |
CN103870567A (en) * | 2014-03-11 | 2014-06-18 | 浪潮集团有限公司 | Automatic identifying method for webpage collecting template of vertical search engine in cloud computing |
CN104050037A (en) * | 2014-06-13 | 2014-09-17 | 淮阴工学院 | Implementation method for directional crawler based on assigned e-commerce website |
CN105656707A (en) * | 2014-11-18 | 2016-06-08 | 阿里巴巴集团控股有限公司 | Method and system for testing web crawler |
CN105656707B (en) * | 2014-11-18 | 2019-03-26 | 阿里巴巴集团控股有限公司 | A kind of method and system of test network crawler |
CN106776636A (en) * | 2015-11-24 | 2017-05-31 | 北京国双科技有限公司 | Data processing method and device |
CN106874282A (en) * | 2015-12-11 | 2017-06-20 | 北京奇虎科技有限公司 | The generation method and device of candidate page set |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN107045507B (en) * | 2016-02-05 | 2020-08-21 | 北京国双科技有限公司 | Webpage crawling method and device |
CN108376071A (en) * | 2016-11-11 | 2018-08-07 | 中移(杭州)信息技术有限公司 | A kind of APP recognition methods and system |
CN108376071B (en) * | 2016-11-11 | 2021-08-24 | 中移(杭州)信息技术有限公司 | APP identification method and system |
CN110637316B (en) * | 2016-12-22 | 2021-04-13 | 奥恩全球运营有限公司,新加坡分公司 | System and method for prospective object identification |
CN110637316A (en) * | 2016-12-22 | 2019-12-31 | 奥恩全球运营有限公司,新加坡分公司 | System and method for intelligent prospective object recognition using online resources and neural network processing to classify organizations based on published material |
US11455313B2 (en) | 2016-12-22 | 2022-09-27 | Aon Global Operations Se, Singapore Branch | Systems and methods for intelligent prospect identification using online resources and neural network processing to classify organizations based on published materials |
US10769159B2 (en) | 2016-12-22 | 2020-09-08 | Aon Global Operations Plc, Singapore Branch | Systems and methods for data mining of historic electronic communication exchanges to identify relationships, patterns, and correlations to deal outcomes |
CN106649823A (en) * | 2016-12-29 | 2017-05-10 | 淮海工学院 | Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler |
CN107301253A (en) * | 2017-08-23 | 2017-10-27 | 杭州安恒信息技术有限公司 | A kind of method and device for improving multi-site search key accuracy |
CN110309408B (en) * | 2018-03-09 | 2023-07-14 | 陈包容 | Automatic searching method |
CN110309408A (en) * | 2018-03-09 | 2019-10-08 | 陈包容 | A method of automation search |
CN109446396A (en) * | 2018-10-17 | 2019-03-08 | 珠海市智图数研信息技术有限公司 | A kind of intelligent crawler frame system of line business information |
CN109597928B (en) * | 2018-12-05 | 2022-12-16 | 云南电网有限责任公司信息中心 | Unstructured text acquisition method supporting user policy configuration and based on Web network |
CN109597928A (en) * | 2018-12-05 | 2019-04-09 | 云南电网有限责任公司信息中心 | Support the non-structured text acquisition methods based on Web network of subscriber policy configuration |
US10951695B2 (en) | 2019-02-14 | 2021-03-16 | Aon Global Operations Se Singapore Branch | System and methods for identification of peer entities |
CN110781366A (en) * | 2019-09-09 | 2020-02-11 | 深圳壹账通智能科技有限公司 | Webpage data processing method and device, computer equipment and storage medium |
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN111881336A (en) * | 2020-07-28 | 2020-11-03 | 上海应用技术大学 | Topic web crawler method and system |
CN111931113A (en) * | 2020-09-16 | 2020-11-13 | 深圳壹账通智能科技有限公司 | Data cleaning method and related equipment |
Similar Documents
Publication | Title |
---|---|
CN102591992A (en) | Webpage classification identifying system and method based on vertical search and focused crawler technology |
CN101520798A (en) | Webpage classification technology based on vertical search and focused crawler |
CN102622445B (en) | User interest perception based webpage push system and webpage push method |
CN101630330A (en) | Method for webpage classification |
CN100405371C (en) | Method and system for abstracting new word |
CN101452453B (en) | A kind of method of input method Web side navigation and a kind of input method system |
CN103365924B (en) | A kind of method of internet information search, device and terminal |
CN101908071B (en) | Method and device thereof for improving search efficiency of search engine |
CN102073725B (en) | Method for searching structured data and search engine system for implementing same |
CN102073726B (en) | Structured data import method and device for search engine system |
CN102306201B (en) | Method and system for analyzing webpage title |
CN106570171A (en) | Semantics-based sci-tech information processing method and system |
CN103246732B (en) | A kind of abstracting method of online Web news content and system |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method |
CN103294781A (en) | Method and equipment used for processing page data |
CN104182412A (en) | Webpage crawling method and webpage crawling system |
CN101231661A (en) | Method and system for digging object grade knowledge |
CN102054028A (en) | Web crawler system with page-rendering function and implementation method thereof |
CN102270331A (en) | Network shopping navigating method based on visual search |
CN101599089A (en) | The automatic search of update information on content of video service website and extraction system and method |
CN101551800A (en) | Marked information generation device, inquiry unit and sharing system |
CN104572934B (en) | A kind of webpage key content abstracting method based on DOM |
CN102722499A (en) | Search engine and implementation method thereof |
CN103530429A (en) | Webpage content extracting method |
CN101241506A (en) | Many dimensions search method and device and system |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120718 |