CN106649823A - Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler - Google Patents
- Publication number
- CN106649823A
- Authority
- CN
- China
- Prior art keywords
- page
- webpage
- descriptor
- url
- comprehensive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler, and belongs to the technical field of web search engines. The method addresses webpage classification and recognition in a subject term vertical search engine for dynamically changing webpages, and mainly studies how to judge whether a dynamically changing webpage is related to the subject terms. By computing the subject term relevance of a webpage, URLs highly relevant to the comprehensive subject terms are screened out and placed in the queue to be crawled; the category information of each webpage is obtained through vertical search and focused crawler techniques; and a webpage classification and recognition model and algorithm are designed. By recognizing dynamically changing webpages, the method obtains URLs of different categories, provides users with accurate webpage search, and can also report the webpage category of an unknown URL. The method is broadly significant and has high application value for the classification and recognition of dynamic webpages.
Description
Technical field
The present invention relates to the technical field of web search engines, and in particular to a webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler.
Background art
With the growing popularity of vertical search engines, the focused crawler, their key technology, is becoming increasingly important. A focused crawler is a program that downloads webpages automatically: guided by a predefined crawl target, it selectively visits webpages and related links on the World Wide Web to obtain the required information. The main object a crawler processes is the URL; it retrieves the required file content from the URL address and then processes it further.
With the rapid growth of the Internet, the amount of information on the network is exploding, and people are particularly concerned with how to obtain useful information from this mass of data. General-purpose search engines offer many conveniences but cannot meet demands for personalized, diversified and precise search, so vertical search has attracted wide attention. Vertical search targets the information of a specific industry or topic and is therefore more focused and purposeful; it provides semantic queries through subject terms and can satisfy the particular needs of particular users; it is more specialized, its results are more targeted, and it can cover the data of a given industry or topic with few server resources. The focused crawler, as the core component of vertical search, visits the relevant webpages and links on the Internet according to the specified subject terms and captures the required information.
The basic webpage classification and recognition method using vertical search and a focused crawler comprises the following steps:
(1) input the comprehensive subject terms to be queried;
(2) create the crawler;
(3) read the URL list of the preset web navigation sites;
(4) check whether the URL list is empty; if it is, go to step (8);
(5) take a site URL from the list and put it into the list of unvisited URLs (the UVURL list);
(6) check whether the UVURL list is empty; if it is, go to step (3);
(7) take a URL from the UVURL list and check against the VURL table whether it has already been visited; if it has, go to step (6);
(8) fetch the page source for the URL, parse the page content using vertical search and focused crawler techniques, and obtain the webpage category information under this site and the corresponding site information in each category;
(9) add the webpage category information and the corresponding site information of each category to the Category list;
(10) delete the URL from the UVURL table, add it to the VURL table, and go to step (6);
(11) end.
This method has certain difficulties, for the following reasons: a focused crawler has trouble selecting, from the queue of URLs to be crawled, those closely related to the topic; during URL extraction a web crawler uses depth-first or breadth-first search strategies, which easily leads to the "curse of dimensionality"; many existing open-source crawler systems are weak at extracting structured information from the pages they crawl; and existing focused crawler strategies adapt poorly to dynamic changes in the content and structure of webpages. In summary, traditional focused crawler techniques achieve a low recognition rate across different categories of webpages, and a different approach is needed.
Summary of the invention
1. Technical problem to be solved
The technical problem to be solved by the present invention is to provide a webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler. Through vertical search and focused crawler techniques based on comprehensive subject terms, the following problems can be addressed:
(1) The queue of URLs to be crawled is built from hyperlink value and comprehensive subject term relevance value.
(2) Targeted, precise search results can be obtained from a dedicated search on the user's specific comprehensive subject terms.
(3) The webpage category to which an unknown URL belongs is obtained through comprehensive subject term vertical search and the focused crawler.
2. Technical solution
To solve the above problems, the present invention adopts the following technical solution.
Observation and analysis of websites reveal the following regularities. A website consists essentially of catalog pages and content pages: a catalog page contains many links pointing to various content pages, while a content page contains the site links belonging to that page's content. Pages of the same category are strongly similar, that is, they share a similar structure, so the structured information of a page can be obtained with regular expressions. To adapt to irregular changes in webpage content and to better extract the structural information that characterizes a page, a URL regular expression learner is introduced; it adapts to dynamic changes in webpages and addresses the subject term island problem. The learner must obtain both the URL regular expression of pages related to the subject terms and the regular expression of catalog pages related to the subject terms, and only URLs matching these two classes of regular expression are crawled. The present invention also proposes a directed depth-first search strategy based on comprehensive subject terms.
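The idea of crawling only URLs that match one of two learned patterns can be illustrated with a minimal Python sketch. The two patterns, the `example.com` URLs, and the function names `classify_url` and `filter_urls` are hypothetical; the text does not say how the learner derives its expressions, so they are simply hard-coded here.

```python
import re

# Hypothetical patterns standing in for the output of the URL regular
# expression learner; a real learner would derive them from sample URLs.
CONTENT_PAGE = re.compile(r"^https?://example\.com/article/\d+\.html$")
CATALOG_PAGE = re.compile(r"^https?://example\.com/category/[a-z-]+/?$")


def classify_url(url):
    """Return 'content', 'catalog', or None when neither pattern matches."""
    if CONTENT_PAGE.match(url):
        return "content"
    if CATALOG_PAGE.match(url):
        return "catalog"
    return None


def filter_urls(urls):
    """Keep only the URLs that match one of the two patterns."""
    return [u for u in urls if classify_url(u) is not None]


if __name__ == "__main__":
    candidates = [
        "http://example.com/article/42.html",
        "https://example.com/category/news/",
        "http://example.com/about",
    ]
    for url in filter_urls(candidates):
        print(url, classify_url(url))
```

Precompiling the patterns keeps the per-URL check cheap, which matters because every extracted link passes through this filter before it can enter a crawl queue.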
A webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler comprises the following steps:
(1) input the comprehensive subject terms to be queried;
(2) create the crawler;
(3) call the page content parsing algorithm;
(4) read the address search table Search;
(5) check whether the address search table Search is empty; if it is, go to step (15);
(6) take the first URL from the Search table and put it into the UVURL list;
(7) delete the first URL from the Search table;
(8) check whether the UVURL list is empty; if it is, go to step (4);
(9) if the UVURL list is not empty, take a URL from it;
(10) check against the VURL table whether this URL has been visited; if it has, go to step (8);
(11) if the URL has not been visited, fetch the page source corresponding to it;
(12) parse the page content using distributed vertical search and focused crawler techniques to obtain the webpage category information and the corresponding site information for the URL;
(13) add the webpage category information and the corresponding site information to the Category list;
(14) delete the URL from the UVURL table, add it to the VURL table, and go to step (8);
(15) end.
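Steps (4) to (15) amount to a loop over three tables, which the following Python sketch tries to make concrete. It is an illustration under assumptions: `fetch` and `parse_page` are hypothetical callables supplied by the caller, the Search table is taken as already built, and "taking" a URL from UVURL is assumed to remove it so that an already visited URL cannot be picked again.

```python
from collections import deque


def classify_pages(search_table, fetch, parse_page):
    """Drain the Search table and return the Category list.

    fetch(url) returns the page source; parse_page(url, source) returns the
    category and site information. Both are hypothetical placeholders.
    """
    search = deque(search_table)   # address search table Search
    uvurl = deque()                # UVURL: URLs not yet visited
    vurl = set()                   # VURL: URLs already visited
    category = []                  # Category list of (url, info) pairs

    while True:
        if not uvurl:                        # step (8): UVURL empty, back to (4)
            if not search:                   # step (5): Search empty, finish
                break
            uvurl.append(search.popleft())   # steps (6)-(7)
            continue
        url = uvurl.popleft()                # step (9); assumed to remove the URL
        if url in vurl:                      # step (10): already visited
            continue
        source = fetch(url)                  # step (11)
        info = parse_page(url, source)       # step (12)
        category.append((url, info))         # step (13)
        vurl.add(url)                        # step (14)
    return category


if __name__ == "__main__":
    pages = {"a": "news", "b": "sport"}
    print(classify_pages(["a", "b", "a"], lambda u: pages[u], lambda u, s: s.upper()))
```

Using a set for VURL makes the visited check constant time, and the duplicate "a" in the usage example is skipped on its second appearance.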
Further, the page content parsing algorithm in step (3) is as follows: by computing subject term relevance, the N pages most relevant to the comprehensive subject terms are obtained, and the category of each page and the corresponding site information are accurately identified through vertical search and the focused crawler. The specific steps are:
1) fetch the source file of the webpage using focused crawler techniques;
2) check whether the webpage matches both the regular expression of content pages related to the comprehensive subject terms and the regular expression of catalog pages related to the comprehensive subject terms, as obtained by the scheduled URL regular expression learner; if it does not match, go to step 9);
3) extract the structured information of the webpage using regular expressions;
4) call the comprehensive subject term relevance calculation method to obtain the comprehensive subject term relevance value of the page;
5) read the comprehensive subject term relevance R of the page and check whether it exceeds the set threshold α; if not, discard the page and go to step 1);
6) if the comprehensive subject term relevance R of the page exceeds the set threshold α, insert the page's R value into the relevance table Relevance;
7) extract new links from the structured information of the page using regular expressions;
8) add these new links to the corresponding Relevance table and sort the table in descending order of Relevance value;
9) check whether the Relevance table is empty; if it is, go to step 13);
10) take the first URL from the Relevance table and check whether it satisfies the search strategy; if it does not, go to step 9);
11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;
12) go to step 1);
13) end.
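Read as a whole, steps 1) to 13) describe a best-first expansion in which the Relevance table acts as a priority queue. The Python sketch below is one loose reading, not the patent's implementation: all callables are hypothetical, new links are assumed to inherit the relevance R of the page they were found on, a URL that fails the search strategy is assumed to be dropped, and `max_pages` is an added safety bound.

```python
import heapq


def build_search_table(start_url, fetch, matches_patterns, relevance,
                       extract_links, meets_strategy, alpha, max_pages=50):
    """Return the Search table built by following the most relevant links first."""
    search = []            # address search table Search
    frontier = []          # Relevance table kept as a heap of (-R, url)
    seen = {start_url}
    url = start_url
    for _ in range(max_pages):
        page = fetch(url)                                  # step 1)
        if matches_patterns(url, page):                    # step 2)
            r = relevance(page)                            # steps 3)-4)
            if r > alpha:                                  # steps 5)-6)
                for link in extract_links(page):           # steps 7)-8)
                    if link not in seen:
                        seen.add(link)
                        # Assumption: a link inherits the R of its parent page.
                        heapq.heappush(frontier, (-r, link))
        url = None
        while frontier:                                    # steps 9)-11)
            _, candidate = heapq.heappop(frontier)
            if meets_strategy(candidate):
                search.append(candidate)
                url = candidate
                break
        if url is None:                                    # step 13)
            break
    return search


if __name__ == "__main__":
    # Toy link graph: page name -> (relevance, outgoing links).
    graph = {"root": (0.9, ["a", "b"]), "a": (0.8, ["c"]),
             "b": (0.1, ["d"]), "c": (0.7, []), "d": (0.9, [])}
    table = build_search_table(
        "root", fetch=lambda u: u, matches_patterns=lambda u, p: True,
        relevance=lambda p: graph[p][0], extract_links=lambda p: graph[p][1],
        meets_strategy=lambda u: True, alpha=0.5)
    print(table)
```

A heap avoids re-sorting the whole table after every insertion, which is the main practical difference from sorting the Relevance table in descending order each time. In the toy graph, page "d" is never queued because its parent "b" falls below the threshold.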
Further, the comprehensive subject term relevance calculation method in step 4) is as follows: different weight values of the comprehensive subject terms reflect how closely each subject term relates to the page being searched; a page feature library is built from word frequencies; and each feature item is given a different weight according to its position in the page, yielding the relevance of the page to the comprehensive subject terms. The specific steps are:
1. Build the weight vector $q = (q_1, q_2, \ldots, q_M)$ of the M comprehensive subject terms, where $q_i$ denotes the weight of the i-th subject term in the query expression;
2. Obtain the page from which feature items are to be extracted;
3. Extract word stems from the page: segment the extracted text into words and filter it, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;
4. Calculate the word frequency of each extracted word;
5. Filter out feature items whose word frequency is below the set threshold T and select n feature items to form the page feature library (if the page has more than n feature items with word frequency above T, select n of them in descending order of word frequency; if it has fewer than n, set the word frequency of each missing feature item to 0), denoted $p = (p_1, p_2, \ldots, p_n)$;
6. If a feature item in the feature library is located in the <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These values form the feature item weight vector $r = (r_1, r_2, \ldots, r_n)$;
7. For each of the M subject terms, look up its corresponding $p_i$ in the page feature library in turn; if it is not found in the feature library, record 0. The resulting vector is $p' = (p_1', p_2', \ldots, p_n')$;
8. Calculate the comprehensive subject term relevance R of the page; its formula is as follows:
9. End.
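Steps 3 to 5 (word extraction, word frequency, and selection of n feature items) can be sketched in Python as follows. This is a simplified illustration: the stop list is hypothetical, stemming and prefix/suffix removal are omitted, items with frequency exactly equal to T are assumed to be kept, and missing items are padded with an empty term of frequency 0.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for "words irrelevant to retrieval".
STOPWORDS = {"the", "a", "of", "and", "to", "in"}


def build_feature_library(text, n, t):
    """Return exactly n (term, frequency) pairs for a page.

    Terms with frequency below t are dropped, the rest are ranked by
    descending frequency, and the list is padded with ("", 0) when fewer
    than n terms remain.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    counts = Counter(words)
    # Assumption: frequency == t is kept (only "less than T" is filtered out).
    kept = [(w, c) for w, c in counts.items() if c >= t]
    kept.sort(key=lambda wc: (-wc[1], wc[0]))   # ties broken alphabetically
    kept = kept[:n]
    kept += [("", 0)] * (n - len(kept))
    return kept


if __name__ == "__main__":
    page_text = "crawler crawler crawler search search the page"
    print(build_feature_library(page_text, n=3, t=2))
```

Padding to a fixed length n keeps every page's feature vector the same size, so vectors from different pages can be compared position by position.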
3. Beneficial effects
When capturing webpage features, the present invention first builds the search table according to each webpage's relevance to the comprehensive subject terms and extracts the structured information of the webpage in a directed manner; it then uses a depth-first strategy to crawl, from the structured information, the webpages closely related to the subject terms; finally it stores the URLs and category information of the webpages with high subject term relevance in the Category table. The method effectively reduces the number of pages collected, saving network bandwidth and improving the efficiency of information search.
The main purpose of the present invention is to establish, for dynamically changing webpages, a webpage classification and recognition method based on comprehensive subject term vertical search and focused crawler techniques, and to provide the recognition model and related algorithms. By recognizing dynamically changing webpages, the method obtains URLs of different categories, provides users with accurate webpage search, and can also report the webpage category to which an unknown URL belongs.
The present invention is of general significance and high application value for the classification and recognition of dynamic webpages. Its main applications include: vertical search for specific information by professionals within their fields; deep search and mining; effective retrieval and use of hidden web resources; web page analysis; improving the efficiency of multi-subject-term search; and building digital libraries.
Brief description of the drawings
Fig. 1 is a flow chart of the webpage classification and recognition method using vertical search and a focused crawler based on comprehensive subject terms, in which the UVURL table stores unvisited URLs, the VURL table stores visited URLs, and the Category table stores identified URLs;
Fig. 2 is a flow chart of the webpage content parsing method;
Fig. 3 is a flow chart of the method for calculating the relevance between a page and the subject terms.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and an embodiment.
Embodiment
The present invention proposes a technical architecture that can effectively recognize the various kinds of URLs in dynamic webpages, and gives detailed algorithms. The system is divided into three layers, in order: the acquisition layer, the parsing layer and the presentation layer.
1. Webpage data acquisition layer
Function: this layer collects dynamic webpage data and passes it to the layer above for content parsing.
Interface: this layer is the interface between the focused crawler and the network, and supplies the layer above with page source strings as input data.
2. Webpage content parsing layer
Function: this layer is the core of the whole design. It parses the content of the pages collected by the webpage data acquisition layer, obtains valid hyperlinks according to subject term relevance weights, and builds the ordered table of URLs to be crawled. Because the URL formats in links on subject-term-related pages are diverse, the webpage content parsing algorithm is needed to obtain the structured information of a webpage and build the related subject term library; the distributed vertical search method then obtains the URLs of the webpages to be crawled and produces the mapping table Category between comprehensive subject term relevance and URL, which serves the upper layer's webpage classification searches.
Interface: the interface between this layer's relevance-based webpage recognition and the layer above is a mapping table, namely the table that maps comprehensive subject term relevance to URLs.
Main method of this layer: the webpage content parsing algorithm, which has three main parts: obtaining the structured information of the dynamic webpage; calculating the relevance between the page and the subject terms; and building the relation table of URLs to be crawled together with the specific crawling strategy of the focused crawler.
The method for calculating the relevance between a page and the comprehensive subject terms. The specific flow is shown in Fig. 3:
1. Build the weight vector $q = (q_1, q_2, \ldots, q_M)$ of the M comprehensive subject terms, where $q_i$ denotes the weight of the i-th subject term in the query expression;
2. Obtain the page from which feature items are to be extracted;
3. Extract word stems from the page: segment the extracted text into words and filter it, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;
4. Calculate the word frequency of each extracted word;
5. Filter out feature items whose word frequency is below the set threshold T and select n feature items to form the page feature library (if the page has more than n feature items with word frequency above T, select n of them in descending order of word frequency; if it has fewer than n, set the word frequency of each missing feature item to 0), denoted $p = (p_1, p_2, \ldots, p_n)$;
6. If a feature item in the feature library is located in the <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These values form the feature item weight vector $r = (r_1, r_2, \ldots, r_n)$;
7. For each of the M subject terms, look up its corresponding $p_i$ in the page feature library in turn; if it is not found in the feature library, record 0. The resulting vector is $p' = (p_1', p_2', \ldots, p_n')$;
8. Calculate the comprehensive subject term relevance R of the page; its formula is as follows:
9. End.
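The tag weights in step 6 and the lookup in step 7 can be combined into a small Python sketch. The formula for R does not survive in this text, so the sketch ASSUMES a plain weighted sum, R = sum over the subject terms of q_i * r_i * p_i'; this is an illustrative assumption, not the patent's formula. The function names and the dictionary-based data layout are likewise hypothetical.

```python
# Tag weights taken from step 6; any other position gets 1.0.
TAG_WEIGHTS = {"title": 5.0, "meta": 3.0, "a": 2.0}


def tag_weight(tag):
    """Weight r of a feature item according to the tag it was found in."""
    return TAG_WEIGHTS.get(tag, 1.0)


def relevance(query_weights, features):
    """Compute an ASSUMED relevance score R for one page.

    query_weights maps each subject term to its weight q_i.
    features maps each term in the page feature library to a
    (frequency, tag) pair. A subject term missing from the library
    contributes 0, as in step 7.
    """
    total = 0.0
    for term, q in query_weights.items():
        freq, tag = features.get(term, (0, None))
        total += q * tag_weight(tag) * freq   # assumed form: q_i * r_i * p_i'
    return total


if __name__ == "__main__":
    query = {"crawler": 0.6, "search": 0.4}
    page = {"crawler": (3, "title"), "search": (2, "p")}
    print(relevance(query, page))
```

Under this assumed form, a subject term found in the title counts five times as much as the same term in body text, which matches the intent of weighting feature items by position; the resulting R would then be compared with the threshold α.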
The webpage content parsing algorithm. The specific algorithm flow is shown in Fig. 2:
1) fetch the source file of the webpage using focused crawler techniques;
2) check whether the webpage matches both the regular expression of content pages related to the comprehensive subject terms and the regular expression of catalog pages related to the comprehensive subject terms, as obtained by the scheduled URL regular expression learner; if it does not match, go to step 9);
3) extract the structured information of the webpage using regular expressions;
4) call the comprehensive subject term relevance calculation method to obtain the comprehensive subject term relevance value of the page;
5) read the comprehensive subject term relevance R of the page and check whether it exceeds the set threshold α; if not, discard the page and go to step 1);
6) if the comprehensive subject term relevance R of the page exceeds the set threshold α, insert the page's R value into the relevance table Relevance;
7) extract new links from the structured information of the page using regular expressions;
8) add these new links to the corresponding Relevance table and sort the table in descending order of Relevance value;
9) check whether the Relevance table is empty; if it is, go to step 13);
10) take the first URL from the Relevance table and check whether it satisfies the search strategy; if it does not, go to step 9);
11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;
12) go to step 1);
13) end.
3. Application presentation layer for webpage classification and recognition
Function: this layer accepts subject term input from the user and returns search results. By entering multiple subject terms, the user can accurately search for web addresses within a specific scope; the layer can also tell the user the website category to which an unknown URL belongs.
The webpage classification and recognition method using vertical search and focused crawler techniques based on comprehensive subject terms. The method flow chart is shown in Fig. 1:
(1) input the comprehensive subject terms to be queried;
(2) create the crawler;
(3) call the page content parsing algorithm;
(4) read the address search table Search;
(5) check whether the address search table Search is empty; if it is, go to step (15);
(6) take the first URL from the Search table and put it into the UVURL list;
(7) delete the first URL from the Search table;
(8) check whether the UVURL list is empty; if it is, go to step (4);
(9) if the UVURL list is not empty, take a URL from it;
(10) check against the VURL table whether this URL has been visited; if it has, go to step (8);
(11) if the URL has not been visited, fetch the page source corresponding to it;
(12) parse the page content using distributed vertical search and focused crawler techniques to obtain the webpage category information and the corresponding site information for the URL;
(13) add the webpage category information and the corresponding site information to the Category list;
(14) delete the URL from the UVURL table, add it to the VURL table, and go to step (8);
(15) end.
The present invention studies webpage recognition in a distributed subject term vertical search engine for dynamically changing webpages. It mainly studies how to judge whether a dynamically changing webpage is related to the subject terms: by calculating the subject term relevance of a page, it screens out the URLs with high relevance to the comprehensive subject terms and places them in the queue to be crawled, obtains the category information of webpages using vertical search and focused crawler techniques, and provides a webpage classification and recognition model and algorithm.
Specifically, when capturing webpage features, the present invention first builds the search table according to each webpage's relevance to the comprehensive subject terms and extracts the structured information of the webpage in a directed manner; it then uses a depth-first strategy to crawl, from the structured information, the webpages closely related to the subject terms; finally it stores the URLs and category information of the webpages with high subject term relevance in the Category table. The method effectively reduces the number of pages collected, saving network bandwidth and improving the efficiency of information search.
The search and network user behavior analysis system based on comprehensive subject terms adopts a B/S (browser/server) architecture and uses VS2005 + Oracle 9i as its development environment; users can easily connect it to an existing system that needs website classification. It can run on one or more PCs after only the configuration file is modified. The system was verified at 苏州锐创通信有限责任公司 (Suzhou Ruichuang Communication Co., Ltd.). Its success rate in accurately obtaining URLs with high relevance to the comprehensive subject terms reaches 97% on the Alexa Top 100 Chinese websites, and it achieves 87% coverage on the Alexa Top 500 global websites; on some special-purpose sites, the proportion of URLs obtained with high subject term relevance reaches 53%. Operation and testing at the same company demonstrated the accuracy of the method.
Those of ordinary skill in the art should understand that the above embodiment is intended only to illustrate the present invention and not to limit it; any changes or modifications to the above embodiment made within the spirit of the present invention fall within the scope of the claims of the present invention.
Claims (3)
1. A webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler, characterized in that, after the crawler is created, the address search table Search is obtained by the page content parsing algorithm, with the following specific steps:
(1) fetch the source file of the webpage using focused crawler techniques;
(2) check whether the webpage matches the structural features of both the related content pages and the related catalog pages; if it does not match, go to step (9);
(3) extract the structured information of the webpage using regular expressions;
(4) call the comprehensive subject term relevance calculation method to obtain the comprehensive subject term relevance value of the page, the specific steps of the comprehensive subject term relevance calculation method being:
1. Build the weight vector $q = (q_1, q_2, \ldots, q_M)$ of the M comprehensive subject terms, where $q_i$ denotes the weight of the i-th subject term in the query expression;
2. Obtain the page from which feature items are to be extracted;
3. Extract word stems from the page: segment the extracted text into words and filter it, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;
4. Calculate the word frequency of each extracted word;
5. Filter out feature items whose word frequency is below the set threshold T and select n feature items to form the page feature library, denoted $p = (p_1, p_2, \ldots, p_n)$;
6. If a feature item in the feature library is located in the <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These values form the feature item weight vector $r = (r_1, r_2, \ldots, r_n)$;
7. For each of the M subject terms, look up its corresponding $p_i$ in the page feature library in turn; if it is not found in the feature library, record 0. The resulting vector is $p' = (p_1', p_2', \ldots, p_n')$;
8. Calculate the comprehensive subject term relevance R of the page; its formula is as follows:
(5) read the comprehensive subject term relevance R of the page and check whether it exceeds the set threshold α; if not, discard the page and go to step (1);
(6) if the comprehensive subject term relevance R of the page exceeds the set threshold α, insert the page's R value into the relevance table Relevance;
(7) extract new links from the structured information of the page using regular expressions;
(8) add these new links to the corresponding Relevance table and sort the table in descending order of Relevance value;
(9) check whether the Relevance table is empty; if it is, go to step (13);
(10) take the first URL from the Relevance table and check whether it satisfies the search strategy; if it does not, go to step (9);
(11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;
(12) go to step (1);
(13) end;
after the address search table Search is obtained, the address search table Search is read, and the URLs and category information of the webpages with high subject term relevance are then obtained.
2. The webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler according to claim 1, characterized in that a URL regular expression learner is introduced in step (2) to obtain the URL regular expression of content pages related to the subject terms and the regular expression of catalog pages related to the subject terms, and the regular expressions are used to verify whether the webpage matches the structural features of the related content pages and catalog pages.
3. The webpage classification and recognition method based on comprehensive subject term vertical search and a focused crawler according to claim 1, characterized in that, when n feature items are selected to form the page feature library in item 5 of step (4), if the page has more than n feature items with word frequency above T, n feature items are selected in descending order of word frequency; if the page has fewer than n feature items with word frequency above T, the word frequency of each missing feature item is set to 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611247621.5A CN106649823A (en) | 2016-12-29 | 2016-12-29 | Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611247621.5A CN106649823A (en) | 2016-12-29 | 2016-12-29 | Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649823A true CN106649823A (en) | 2017-05-10 |
Family
ID=58835878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611247621.5A Pending CN106649823A (en) | 2016-12-29 | 2016-12-29 | Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649823A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101520798A (en) * | 2009-03-06 | 2009-09-02 | 苏州锐创通信有限责任公司 | Webpage classification technology based on vertical search and focused crawler |
CN101630330A (en) * | 2009-08-14 | 2010-01-20 | 苏州锐创通信有限责任公司 | Method for webpage classification |
CN102591992A (en) * | 2012-02-15 | 2012-07-18 | 苏州亚新丰信息技术有限公司 | Webpage classification identifying system and method based on vertical search and focused crawler technology |
Non-Patent Citations (1)
Title |
---|
张文龙 等: "基于Nutch的垂直搜索引擎的研究", 《南开大学学报(自然科学版)》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767482A (en) * | 2020-05-21 | 2020-10-13 | 中国地质大学(武汉) | Self-adaptive crawling method for focused web crawler |
CN111767482B (en) * | 2020-05-21 | 2023-06-06 | 中国地质大学(武汉) | Self-adaptive crawling method for focused web crawlers |
CN111914201A (en) * | 2020-08-07 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Network page processing method and device |
CN111914201B (en) * | 2020-08-07 | 2023-11-07 | 腾讯科技(深圳)有限公司 | Processing method and device of network page |
CN112541004A (en) * | 2020-12-25 | 2021-03-23 | 华南理工大学 | Automatic processing method and device for database |
CN114443928A (en) * | 2022-01-25 | 2022-05-06 | 西藏民族大学 | Web text data crawler method and system |
CN116975410A (en) * | 2023-09-22 | 2023-10-31 | 北京中关村科金技术有限公司 | Webpage data acquisition method and device, electronic equipment and readable storage medium |
CN116975410B (en) * | 2023-09-22 | 2023-12-19 | 北京中关村科金技术有限公司 | Webpage data acquisition method and device, electronic equipment and readable storage medium |
CN117874319A (en) * | 2024-03-11 | 2024-04-12 | 江西顶易科技发展有限公司 | Search engine-based information mining method and device and computer equipment |
CN117874319B (en) * | 2024-03-11 | 2024-05-17 | 江西顶易科技发展有限公司 | Search engine-based information mining method and device and computer equipment |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111353030B (en) | Knowledge question and answer retrieval method and device based on knowledge graph in travel field | |
CN108959270B (en) | Entity linking method based on deep learning | |
CN106649823A (en) | Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN105138558B (en) | The real time individual information collecting method of content is accessed based on user | |
CN101630314B (en) | Semantic query expansion method based on domain knowledge | |
TWI695277B (en) | Automatic website data collection method | |
CN101908071B (en) | Method and device thereof for improving search efficiency of search engine | |
CN103823824B (en) | A kind of method and system that text classification corpus is built automatically by the Internet | |
CN111708740A (en) | Mass search query log calculation analysis system based on cloud platform | |
CN105045875B (en) | Personalized search and device | |
CN105843965B (en) | A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification | |
CN110888991B (en) | Sectional type semantic annotation method under weak annotation environment | |
CN101609450A (en) | Web page classification method based on training set | |
CN103049542A (en) | Domain-oriented network information search method | |
CN103226578A (en) | Method for identifying websites and finely classifying web pages in medical field | |
CN103064984B (en) | The recognition methods of spam page and system | |
CN106484797A (en) | Accident summary abstracting method based on sparse study | |
CN110555154B (en) | Theme-oriented information retrieval method | |
CN103678422A (en) | Web page classification method and device and training method and device of web page classifier | |
CN103246732A (en) | Online Web news content extracting method and system | |
CN109635107A (en) | The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source | |
CN105512333A (en) | Product comment theme searching method based on emotional tendency | |
CN110532450A (en) | A kind of Theme Crawler of Content method based on improvement shark search | |
CN109446399A (en) | A kind of video display entity search method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |