CN106649823A - Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler - Google Patents

Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler

Info

Publication number
CN106649823A
Authority
CN
China
Prior art keywords
page
webpage
descriptor
url
comprehensive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611247621.5A
Other languages
Chinese (zh)
Inventor
掌明
卢艳宏
杨瑞
樊纪山
王经卓
宋永献
孙巧榆
张金学
洪露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaihai Institute of Technology
Original Assignee
Huaihai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaihai Institute of Technology
Priority to CN201611247621.5A
Publication of CN106649823A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler, and belongs to the technical field of web search engines. The method addresses webpage classification recognition in a subject term vertical search engine over dynamically changing webpages, and mainly studies how to judge whether a dynamically changing webpage is related to a subject term. By computing the subject term relevance of a webpage, URLs highly relevant to the comprehensive subject terms are screened out and placed in the crawl queue; the category information of the webpage is obtained through vertical search and focused crawler techniques; and a webpage classification recognition model and algorithm are designed. By recognizing dynamically changing webpages, URLs of different categories are obtained, which gives users accurate webpage search and can also provide the webpage category of an unknown URL. The method has broad significance and high application value for the classification recognition of dynamic webpages.

Description

Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler
Technical field
The present invention relates to the technical field of web search engines, and in particular to a webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler.
Background art
As vertical search engines become increasingly popular, the focused crawler, their key technology, is becoming more and more important. A focused crawler is a program that downloads webpages automatically: according to a given crawl target, it selectively visits webpages and related links on the World Wide Web to obtain the required information. The main object a crawler processes is the URL; it fetches the required file content from the URL address and then processes it further.
With the rapid growth of the Internet, the amount of information on the network is also growing explosively, and people are particularly concerned with how to obtain useful information from this mass of data. General-purpose search engines offer many conveniences but cannot meet demands for personalization, diversity and precision, so vertical search has attracted wide attention. Vertical search looks for information on a specific industry or topic and is more targeted and purposeful; it provides semantic queries through subject terms and can meet the particular needs of particular users; it is more specialized, its results are more targeted, and it can cover the data of a specific industry or topic with few server resources. As the core component of vertical search, the focused crawler visits related webpages and links on the Internet according to the specified subject terms and captures the required information.
The basic webpage classification recognition method using vertical search and a focused crawler comprises the following steps:
(1) input the comprehensive subject terms to be queried;
(2) create the crawler;
(3) read the URL list of the preset web navigation sites;
(4) judge whether the URL list is empty; if it is empty, go to step (8);
(5) take out one site URL and put it into the list of unvisited URLs (the UVURL list);
(6) judge whether the UVURL list is empty; if it is empty, go to step (3);
(7) take a URL from the UVURL list and judge from the VURL table whether it has been visited; if so, go to step (6);
(8) fetch the page source of the URL, parse the page content using vertical search and focused crawler techniques, and obtain the webpage category information under this site and the corresponding site information in each category;
(9) add the webpage category information and the corresponding site information in each category to the Category list;
(10) delete the URL from the UVURL table, add it to VURL, and go to step (6);
(11) end.
This method has certain difficulties, for the following reasons: it is hard for a focused crawler to select, from the queue of URLs to be crawled, those closely related to the topic; during URL extraction, a web crawler using search strategies such as depth-first or breadth-first easily runs into the "curse of dimensionality"; many existing open-source crawler systems are weak at extracting structured information from crawled pages; and existing focused crawler strategies adapt poorly to dynamic changes in page content and structure. In summary, traditional focused crawler techniques have a low recognition rate for webpages of different categories, and a different approach is needed.
Summary of the invention
1. Technical problem to be solved
The technical problem to be solved by the present invention is to provide a webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler. Through research on vertical search and focused crawler techniques based on comprehensive subject terms, the following problems can be better solved:
(1) The queue of URLs to be crawled is built using hyperlink value and comprehensive subject term relevance value.
(2) Targeted, precise search results can be obtained from a user's dedicated search on specific comprehensive subject terms.
(3) The webpage category to which an unknown URL belongs is obtained through comprehensive subject term vertical search and the focused crawler.
2. Technical solution
To solve the above problems, the present invention adopts the following technical solution:
Observation and analysis of websites reveal the following pattern: a website basically consists of directory pages and content pages. A directory page contains many links pointing to various content pages, while a content page contains the site links belonging to that page's content. Pages of the same category are strongly similar, that is, they have a similar structure, so the structured information of a page can be obtained with regular expressions. To adapt to irregular changes in page content and better extract page structure information that reflects page features, a URL regular expression learner is introduced to adapt to the dynamic changes of webpages and to solve the subject term island problem. Both the URL regular expression of subject-term-related content pages and the regular expression of subject-term-related directory pages must be obtained, and only URLs matching these two classes of regular expressions are crawled. In addition, the present invention proposes a directed depth-first search strategy based on comprehensive subject terms.
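The idea of crawling only URLs that match a directory-page or content-page regular expression can be illustrated with a minimal sketch. The two patterns, the `example.com` URL layout and the function names below are hypothetical; in the described method the patterns would be produced by the URL regular expression learner, whose learning procedure is not specified here.

```python
import re

# Hypothetical patterns for illustration only; the method assumes they are
# learned by the URL regular expression learner rather than written by hand.
DIRECTORY_PATTERN = re.compile(r"^https?://example\.com/category/[\w-]+/?$")
CONTENT_PATTERN = re.compile(r"^https?://example\.com/article/\d+\.html$")


def classify_url(url):
    """Return 'directory', 'content', or None if the URL matches neither pattern."""
    if DIRECTORY_PATTERN.match(url):
        return "directory"
    if CONTENT_PATTERN.match(url):
        return "content"
    return None


def filter_urls(urls):
    """Keep only the URLs that match one of the two patterns."""
    return [u for u in urls if classify_url(u) is not None]
```

URLs that match neither pattern are simply never queued, which is how this kind of filter keeps the crawl focused on structurally similar pages.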
A webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler comprises the following steps:
(1) input the comprehensive subject terms to be queried;
(2) create the crawler;
(3) call the page content parsing algorithm;
(4) read the address search table Search;
(5) judge whether the address search table Search is empty; if it is empty, go to step (15);
(6) take out the first URL in the Search table and put it into the UVURL list;
(7) delete the first URL from the Search table;
(8) judge whether the UVURL list is empty; if it is empty, go to step (4);
(9) if the UVURL list is not empty, take a URL from the UVURL list;
(10) judge from the VURL table whether this URL has been visited; if so, go to step (8);
(11) if the URL has not been visited, fetch the page source corresponding to the URL;
(12) parse the page content using distributed vertical search and focused crawler techniques to obtain the webpage category information of the URL and the corresponding site information;
(13) add the webpage category information and the corresponding site information to the Category list;
(14) delete the URL from the UVURL table, add it to VURL, and go to step (8);
(15) end.
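The control flow of steps (4) to (15) can be sketched as a small loop over the Search, UVURL, VURL and Category tables. This is only an illustration under stated assumptions: `fetch` and `parse` are hypothetical callables standing in for steps (11) and (12), the tables are held in memory, and a URL is removed from UVURL as soon as it is taken out, which simplifies the bookkeeping of steps (10) and (14).

```python
from collections import deque


def crawl(search_table, fetch, parse):
    """Drain the Search table through UVURL and record categories of unvisited URLs.

    fetch(url) returns the page source; parse(source) returns (category, site_info).
    Both callables are hypothetical stand-ins for steps (11) and (12).
    """
    search = deque(search_table)  # address search table Search
    uvurl = deque()               # unvisited URLs (UVURL)
    vurl = set()                  # visited URLs (VURL)
    category = []                 # (url, category, site_info) records

    while search or uvurl:
        if not uvurl:
            # Steps (6)-(7): move the first URL of Search into UVURL.
            uvurl.append(search.popleft())
            continue
        url = uvurl.popleft()     # step (9)
        if url in vurl:           # step (10): already visited, skip it
            continue
        source = fetch(url)       # step (11)
        cat, info = parse(source) # step (12)
        category.append((url, cat, info))  # step (13)
        vurl.add(url)             # step (14)
    return category, vurl
```

Keeping VURL as a set makes the visited check in step (10) a constant-time lookup, which matters once the tables hold many URLs.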
Further, the page content parsing algorithm in step (3) is as follows: by computing subject term relevance, the N pages most relevant to the comprehensive subject terms are obtained, and the page category and corresponding site information are accurately identified through vertical search and the focused crawler. The specific steps are:
1) obtain the source file of the webpage using the focused crawler;
2) judge whether the webpage matches both the regular expression of content pages related to the comprehensive subject terms and the regular expression of directory pages related to the comprehensive subject terms, as obtained by the URL regular expression learner; if it does not match, go to step 9);
3) extract the structured information of the webpage using regular expressions;
4) call the comprehensive subject term relevance calculation method to obtain the page's comprehensive subject term relevance value;
5) read the page's comprehensive subject term relevance R and judge whether it is greater than the set threshold α; if not, discard the page and go to step 1);
6) if the page's comprehensive subject term relevance R is greater than the set threshold α, insert the R value of the page into the relevance table Relevance;
7) extract new links from the structured information of the page using regular expressions;
8) add these new links to the corresponding Relevance table, sorted in descending order of Relevance value;
9) judge whether the Relevance table is empty; if it is empty, go to step 13);
10) take out the first URL of the Relevance table and judge whether it satisfies the search strategy; if not, go to step 9);
11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;
12) go to step 1);
13) end.
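Steps 5) to 11) amount to a threshold filter followed by a relevance-ordered queue. The sketch below is one possible reading, not the patented implementation: it assumes pages have already been fetched and scored, represents the Relevance table as a heap keyed on the parent page's R value, and drops URLs that fail the search strategy. `meets_strategy` is a hypothetical predicate, since the text does not spell out the strategy test.

```python
import heapq


def build_search_table(pages, alpha, meets_strategy):
    """Build the Search table from scored pages.

    pages: iterable of (relevance, links) pairs for pages already fetched and scored.
    alpha: relevance threshold from step 5).
    meets_strategy: hypothetical predicate for the search strategy test in step 10).
    """
    relevance = []  # Relevance table as a max-heap (scores are negated)
    order = 0       # tie-breaker so equal scores keep insertion order
    for r, links in pages:
        if r <= alpha:      # step 5): discard pages not above the threshold
            continue
        for link in links:  # steps 7)-8): queue new links under the page's score
            heapq.heappush(relevance, (-r, order, link))
            order += 1

    search = []
    while relevance:        # steps 9)-11): take URLs in descending relevance
        _, _, url = heapq.heappop(relevance)
        if meets_strategy(url) and url not in search:
            search.append(url)
    return search
```

A heap keeps the table ordered without re-sorting after every insertion, which is equivalent in effect to the descending sort in step 8).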
Further, the comprehensive subject term relevance calculation method in step 4) is as follows: different weights on the comprehensive subject terms express how closely each subject term relates to the pages being searched for; a page feature item library is built from term frequencies; and each feature item is given a different weight according to its position in the page, yielding the relevance between the page and the comprehensive subject terms. The specific steps are:
1. build the comprehensive weight vector q = (q_1, q_2, ..., q_M) of the M subject terms, where q_i is the weight of the i-th subject term in the query expression;
2. obtain the page from which feature items are to be extracted;
3. extract word stems from the page: segment the text and filter the resulting words, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;
4. compute the term frequency of each extracted word;
5. filter out feature items whose term frequency is below the set threshold T, and select n feature items to form the page feature item library (if the page has more than n feature items with term frequency greater than T, select n of them in descending order of term frequency; if it has fewer than n, set the term frequency of the missing feature items to 0), denoted p = (p_1, p_2, ..., p_n);
6. if a feature item in the library is located in a <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These form the feature item weight vector r = (r_1, r_2, ..., r_n);
7. for each of the M subject terms in turn, look up its corresponding p_i in the page feature item library, recording 0 if it is not found; the resulting vector is p' = (p_1', p_2', ..., p_n');
8. compute the comprehensive subject term relevance R of the page using the following formula:
$$R = \sum_{i=1}^{M} P_i \cdot p'_i \cdot r_i$$
9. end.
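The relevance calculation can be sketched in a few lines under explicit assumptions: the page has already been segmented into (word, tag) pairs, the factor P_i in the formula is taken to be the subject term weight q_i from step 1, and a feature item that occurs in several positions takes the largest of its position weights. None of these choices is stated in the text, and the function and parameter names are illustrative only.

```python
from collections import Counter

# Position weights from step 6; any other position gets 1.0.
POSITION_WEIGHTS = {"title": 5.0, "meta": 3.0, "a": 2.0}


def relevance(tokens, query_weights, T=1, n=10):
    """Compute R for one page.

    tokens: list of (word, tag) pairs already extracted from the page.
    query_weights: mapping from subject term to its weight q_i (assumed to be P_i).
    T: term frequency threshold; n: size of the feature item library.
    """
    freq = Counter(word for word, _ in tokens)
    # Step 5: drop items below the threshold, keep the n most frequent.
    library = dict([(w, c) for w, c in freq.most_common() if c >= T][:n])

    # Step 6: position weight per feature item (largest weight wins, an assumption).
    r = {}
    for word, tag in tokens:
        if word in library:
            r[word] = max(r.get(word, 1.0), POSITION_WEIGHTS.get(tag, 1.0))

    # Steps 7-8: a subject term missing from the library contributes 0.
    return sum(q * library.get(term, 0) * r.get(term, 1.0)
               for term, q in query_weights.items())
```

For instance, a page where "crawler" appears twice (once in the title) and "search" appears once in a link, scored against weights 0.6, 0.3 and 0.1 for "crawler", "search" and "engine", gives 0.6·2·5.0 + 0.3·1·2.0 + 0 = 6.6 under these assumptions.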
3. Beneficial effects
When capturing webpage features, the present invention first builds the search table according to the relevance between webpages and the comprehensive subject terms and directionally extracts the structured information of the webpages, then uses a depth-first strategy to crawl, from the structured information, webpages closely related to the subject terms. Finally, the URLs and category information of webpages with high subject term relevance are obtained and placed in the Category table. The method effectively reduces the number of pages collected, saves network bandwidth and improves the efficiency of information search.
The main purpose of the present invention is to establish, for dynamically changing webpages, a webpage classification recognition method based on comprehensive subject term vertical search and focused crawler techniques, and to provide a recognition model and related algorithms. By recognizing dynamically changing webpages, URLs of different categories are obtained, which gives users accurate webpage search and can also provide the webpage category to which an unknown URL belongs.
The present invention has broad significance and high application value for the classification recognition of dynamic webpages. It can mainly be applied to: vertical search for specific information by professionals in a professional field; deep search and mining; effective retrieval and use of hidden network resources; web page analysis; improving the efficiency of searches over multiple subject terms; and building digital libraries.
Brief description of the drawings
Fig. 1 is a flowchart of the webpage classification recognition method using vertical search and a focused crawler based on comprehensive subject terms, in which the UVURL table stores unvisited URLs, the VURL table stores visited URLs, and Category stores identified URLs;
Fig. 2 is a flowchart of the page content parsing method;
Fig. 3 is a flowchart of the method for calculating the relevance between a page and the subject terms.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Embodiment
The present invention proposes a technical architecture that can effectively identify the various kinds of URLs in dynamic webpages, and gives detailed algorithms. The system is divided into three layers, which from top to bottom are: the acquisition layer, the parsing layer and the presentation layer.
1. Webpage data acquisition layer
Function: the main function of this layer is to collect dynamic webpage data and pass it to the layer above for content parsing.
Interface: this layer is the interface between the focused crawler and the network, and is responsible for providing the page source string to the upper layer as input data.
2. Webpage content parsing layer
Function: this layer is the core of the whole design. It parses the content of the pages collected by the webpage data acquisition layer, obtains valid hyperlinks according to subject term relevance weights, and builds the ordered table of URLs to be crawled. Because the URL formats of links in subject-term-related pages are diverse, the page content parsing algorithm is needed to obtain the structured information of webpages and build the related subject term library. The URLs of the webpages to be crawled are obtained with the distributed vertical search method, producing the mapping table Category between comprehensive subject term relevance and URLs, which serves the upper layer's searches for webpage categories.
Interface: the interface between this layer's comprehensive subject term relevance page recognition and the upper layer is a mapping table, namely the table mapping comprehensive subject term relevance to URLs.
Main method of this layer: the page content parsing algorithm, which has three main parts: obtaining the structured information of dynamic webpages; calculating the relevance between a page and the subject terms; and building the relation table of URLs to be crawled together with the focused crawler's specific crawl strategy.
The method for calculating the relevance between a page and the comprehensive subject terms. The specific flow is shown in Fig. 3:
1. build the comprehensive weight vector q = (q_1, q_2, ..., q_M) of the M subject terms, where q_i is the weight of the i-th subject term in the query expression;
2. obtain the page from which feature items are to be extracted;
3. extract word stems from the page: segment the text and filter the resulting words, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;
4. compute the term frequency of each extracted word;
5. filter out feature items whose term frequency is below the set threshold T, and select n feature items to form the page feature item library (if the page has more than n feature items with term frequency greater than T, select n of them in descending order of term frequency; if it has fewer than n, set the term frequency of the missing feature items to 0), denoted p = (p_1, p_2, ..., p_n);
6. if a feature item in the library is located in a <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These form the feature item weight vector r = (r_1, r_2, ..., r_n);
7. for each of the M subject terms in turn, look up its corresponding p_i in the page feature item library, recording 0 if it is not found; the resulting vector is p' = (p_1', p_2', ..., p_n');
8. compute the comprehensive subject term relevance R of the page using the following formula:
$$R = \sum_{i=1}^{M} P_i \cdot p'_i \cdot r_i$$
9. end.
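As a worked illustration of the relevance formula, assume that the factor P_i denotes the subject term weight q_i from step 1 (the description does not define P_i separately) and take hypothetical values M = 3, q = (0.6, 0.3, 0.1), p' = (2, 1, 0) and r = (5.0, 2.0, 1.0): the first subject term occurs twice and appears in <title>, the second occurs once inside <a>, and the third is absent from the feature item library.

$$
\begin{aligned}
R &= \sum_{i=1}^{3} q_i \cdot p'_i \cdot r_i \\
  &= 0.6 \cdot 2 \cdot 5.0 + 0.3 \cdot 1 \cdot 2.0 + 0.1 \cdot 0 \cdot 1.0 \\
  &= 6.0 + 0.6 + 0 = 6.6
\end{aligned}
$$

With a hypothetical threshold α = 1.0, this page would pass the threshold test and its R value would be inserted into the Relevance table.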
The page content parsing algorithm. The specific algorithm flow is shown in Fig. 2:
1) obtain the source file of the webpage using the focused crawler;
2) judge whether the webpage matches both the regular expression of content pages related to the comprehensive subject terms and the regular expression of directory pages related to the comprehensive subject terms, as obtained by the URL regular expression learner; if it does not match, go to step 9);
3) extract the structured information of the webpage using regular expressions;
4) call the comprehensive subject term relevance calculation method to obtain the page's comprehensive subject term relevance value;
5) read the page's comprehensive subject term relevance R and judge whether it is greater than the set threshold α; if not, discard the page and go to step 1);
6) if the page's comprehensive subject term relevance R is greater than the set threshold α, insert the R value of the page into the relevance table Relevance;
7) extract new links from the structured information of the page using regular expressions;
8) add these new links to the corresponding Relevance table, sorted in descending order of Relevance value;
9) judge whether the Relevance table is empty; if it is empty, go to step 13);
10) take out the first URL of the Relevance table and judge whether it satisfies the search strategy; if not, go to step 9);
11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;
12) go to step 1);
13) end.
3. Application presentation layer for webpage classification recognition
Function: provide users with subject term input and search result feedback. By entering multiple subject terms, users can accurately search for sites within a specific scope; the site category to which an unknown URL belongs can also be provided to users.
The webpage classification recognition method using vertical search and focused crawler techniques based on comprehensive subject terms. The method flowchart is shown in Fig. 1:
(1) input the comprehensive subject terms to be queried;
(2) create the crawler;
(3) call the page content parsing algorithm;
(4) read the address search table Search;
(5) judge whether the address search table Search is empty; if it is empty, go to step (15);
(6) take out the first URL in the Search table and put it into the UVURL list;
(7) delete the first URL from the Search table;
(8) judge whether the UVURL list is empty; if it is empty, go to step (4);
(9) if the UVURL list is not empty, take a URL from the UVURL list;
(10) judge from the VURL table whether this URL has been visited; if so, go to step (8);
(11) if the URL has not been visited, fetch the page source corresponding to the URL;
(12) parse the page content using distributed vertical search and focused crawler techniques to obtain the webpage category information of the URL and the corresponding site information;
(13) add the webpage category information and the corresponding site information to the Category list;
(14) delete the URL from the UVURL table, add it to VURL, and go to step (8);
(15) end.
The present invention studies the webpage recognition method in a subject term distributed vertical search engine over dynamically changing webpages. It mainly studies how to judge whether a dynamically changing webpage is related to the subject terms: by calculating the subject term relevance of a page, URLs highly relevant to the comprehensive subject terms are screened out and placed in the queue to be crawled, vertical search and focused crawler techniques are used to obtain the category information of the webpage, and a webpage classification recognition model and algorithm are designed.
Specifically, when capturing webpage features, the present invention first builds the search table according to the relevance between webpages and the comprehensive subject terms and directionally extracts the structured information of the webpages; it then uses a depth-first strategy to crawl, from the structured information, webpages closely related to the subject terms; finally, the URLs and category information of webpages with high subject term relevance are obtained and placed in the Category table. The method effectively reduces the number of pages collected, saves network bandwidth and improves the efficiency of information search.
The search and network user behavior analysis system based on comprehensive subject terms adopts a B/S (browser/server) architecture and uses VS2005 + Oracle 9i as the development environment; users can easily integrate it into existing systems that need site classification. It can run on one or more PCs with only a change to the configuration file. The system was verified at Suzhou Ruichuang Communication Co., Ltd. Its success rate in accurately obtaining URLs highly relevant to the comprehensive subject terms reached 97% on the Alexa Top 100 Chinese sites, its coverage reached 87% on the Alexa Top 500 global sites, and on some special sites the proportion of URLs with high subject term relevance that were obtained reached 53%. Operation and testing at Suzhou Ruichuang Communication Co., Ltd. demonstrated the accuracy of the method.
Those of ordinary skill in the art should understand that the above embodiments are intended only to illustrate the present invention and not to limit it; any changes or modifications to the above embodiments within the spirit of the present invention fall within the scope of the claims of the present invention.

Claims (3)

1. A webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler, characterized in that, after the crawler is created, the address search table Search is obtained by the page content parsing algorithm, with the following specific steps:
(1) obtain the source file of the webpage using the focused crawler;
(2) judge whether the webpage matches the structural features of both related content pages and directory pages; if it does not match, go to step (9);
(3) extract the structured information of the webpage using regular expressions;
(4) call the comprehensive subject term relevance calculation method to obtain the page's comprehensive subject term relevance value, the specific steps of the comprehensive subject term relevance calculation method being:
1. build the comprehensive weight vector q = (q_1, q_2, ..., q_M) of the M subject terms, where q_i is the weight of the i-th subject term in the query expression;
2. obtain the page from which feature items are to be extracted;
3. extract word stems from the page: segment the text and filter the resulting words, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;
4. compute the term frequency of each extracted word;
5. filter out feature items whose term frequency is below the set threshold T, and select n feature items to form the page feature item library, denoted p = (p_1, p_2, ..., p_n);
6. if a feature item in the library is located in a <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These form the feature item weight vector r = (r_1, r_2, ..., r_n);
7. for each of the M subject terms in turn, look up its corresponding p_i in the page feature item library, recording 0 if it is not found; the resulting vector is p' = (p_1', p_2', ..., p_n');
8. compute the comprehensive subject term relevance R of the page using the following formula:
$$R = \sum_{i=1}^{M} P_i \cdot p'_i \cdot r_i$$
(5) read the page's comprehensive subject term relevance R and judge whether it is greater than the set threshold α; if not, discard the page and go to step (1);
(6) if the page's comprehensive subject term relevance R is greater than the set threshold α, insert the R value of the page into the relevance table Relevance;
(7) extract new links from the structured information of the page using regular expressions;
(8) add these new links to the corresponding Relevance table, sorted in descending order of Relevance value;
(9) judge whether the Relevance table is empty; if it is empty, go to step (13);
(10) take out the first URL in the Relevance table and judge whether it satisfies the search strategy; if not, go to step (9);
(11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;
(12) go to step (1);
(13) end;
After the address search table Search is obtained, it is read, and the URLs and category information of webpages with high subject term relevance are then obtained.
2. The webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler according to claim 1, characterized in that a URL regular expression learner is introduced in step (2) to obtain the URL regular expression of subject-term-related content pages and the regular expression of subject-term-related directory pages, and the regular expressions are used to verify whether the webpage matches the structural features of related content pages and directory pages.
3. The webpage classification recognition method based on comprehensive subject term vertical search and a focused crawler according to claim 1, characterized in that, when n feature items are selected to form the page feature item library in sub-step 5. of step (4), if the page has more than n feature items with term frequency greater than T, n feature items are selected in descending order of term frequency; if the page has fewer than n feature items with term frequency greater than T, the term frequency of the missing feature items is set to 0.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611247621.5A CN106649823A (en) 2016-12-29 2016-12-29 Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler

Publications (1)

Publication Number Publication Date
CN106649823A true CN106649823A (en) 2017-05-10

Family

ID=58835878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611247621.5A Pending CN106649823A (en) 2016-12-29 2016-12-29 Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler

Country Status (1)

Country Link
CN (1) CN106649823A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767482A (en) * 2020-05-21 2020-10-13 中国地质大学(武汉) Self-adaptive crawling method for focused web crawler
CN111914201A (en) * 2020-08-07 2020-11-10 腾讯科技(深圳)有限公司 Network page processing method and device
CN112541004A (en) * 2020-12-25 2021-03-23 华南理工大学 Automatic processing method and device for database
CN114443928A (en) * 2022-01-25 2022-05-06 西藏民族大学 Web text data crawler method and system
CN116975410A (en) * 2023-09-22 2023-10-31 北京中关村科金技术有限公司 Webpage data acquisition method and device, electronic equipment and readable storage medium
CN117874319A (en) * 2024-03-11 2024-04-12 江西顶易科技发展有限公司 Search engine-based information mining method and device and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101520798A (en) * 2009-03-06 2009-09-02 苏州锐创通信有限责任公司 Webpage classification technology based on vertical search and focused crawler
CN101630330A (en) * 2009-08-14 2010-01-20 苏州锐创通信有限责任公司 Method for webpage classification
CN102591992A (en) * 2012-02-15 2012-07-18 苏州亚新丰信息技术有限公司 Webpage classification identifying system and method based on vertical search and focused crawler technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Wenlong et al.: "Research on a Nutch-based vertical search engine", Journal of Nankai University (Natural Science Edition) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767482A (en) * 2020-05-21 2020-10-13 中国地质大学(武汉) Self-adaptive crawling method for focused web crawler
CN111767482B (en) * 2020-05-21 2023-06-06 中国地质大学(武汉) Self-adaptive crawling method for focused web crawlers
CN111914201A (en) * 2020-08-07 2020-11-10 腾讯科技(深圳)有限公司 Network page processing method and device
CN111914201B (en) * 2020-08-07 2023-11-07 腾讯科技(深圳)有限公司 Processing method and device of network page
CN112541004A (en) * 2020-12-25 2021-03-23 华南理工大学 Automatic processing method and device for database
CN114443928A (en) * 2022-01-25 2022-05-06 西藏民族大学 Web text data crawler method and system
CN116975410A (en) * 2023-09-22 2023-10-31 北京中关村科金技术有限公司 Webpage data acquisition method and device, electronic equipment and readable storage medium
CN116975410B (en) * 2023-09-22 2023-12-19 北京中关村科金技术有限公司 Webpage data acquisition method and device, electronic equipment and readable storage medium
CN117874319A (en) * 2024-03-11 2024-04-12 江西顶易科技发展有限公司 Search engine-based information mining method and device and computer equipment
CN117874319B (en) * 2024-03-11 2024-05-17 江西顶易科技发展有限公司 Search engine-based information mining method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170510