CN108959413A - Topic web page crawling method and topic crawler system - Google Patents

Topic web page crawling method and topic crawler system

Info

Publication number
CN108959413A
Authority
CN
China
Prior art keywords
correlation
degree
link
webpage
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810581858.XA
Other languages
Chinese (zh)
Other versions
CN108959413B (en)
Inventor
彭涛
包铁
徐凯旋
张雪松
王上
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN201810581858.XA
Publication of CN108959413A
Application granted
Publication of CN108959413B
Active (current legal status)
Anticipated expiration

Abstract

This application provides a topic web page crawling method and a topic crawler system. The method includes: obtaining a link that has not been crawled from a first to-be-crawled link set that includes seed links; determining a first degree of correlation and a second degree of correlation for the target webpage corresponding to the obtained link, where the first degree of correlation is the degree of correlation between a target link in the target webpage and a designated topic, and the second degree of correlation is the degree of correlation between the target text content in the target webpage and the designated topic; determining a temperature value of the target webpage according to the first degree of correlation and the second degree of correlation, and storing the content to be displayed of the target webpage; if the temperature value of the target webpage is greater than or equal to a preset temperature value, putting the target link into a second to-be-crawled link set; and, if the first to-be-crawled link set contains no link that has not been obtained, obtaining from the second to-be-crawled link set the uncrawled link with the highest degree of correlation with the designated topic and continuing to crawl. The application enables a user to obtain a large number of webpages related to the designated topic from the network.

Description

Topic web page crawling method and topic crawler system
Technical field
The present invention relates to the technical field of web page crawling, and in particular to a topic web page crawling method and a topic crawler system.
Background
With the rapid development of the Internet, people have entered an era of information explosion in which information of every kind pervades all aspects of life. Search engines emerged to make information easier to obtain: through a search engine, people can quickly retrieve information from many webpages, which improves the efficiency of obtaining information. The search engines in common use today, such as Google and Baidu, are general-purpose search engines that attempt to obtain all resources on the Internet. People's needs vary, however. Sometimes a user wishes to obtain from the network only web page content on a designated topic, and a general-purpose search engine cannot satisfy this individual need.
Summary of the invention
In view of this, the present invention provides a topic web page crawling method and a topic crawler system, so that a user can obtain webpages related to a designated topic based on the method and system. The technical solution is as follows:
A topic web page crawling method, comprising:
obtaining a link that has not been crawled from a first to-be-crawled link set, the first to-be-crawled link set including seed links obtained in advance;
determining a first degree of correlation and a second degree of correlation for the target webpage corresponding to the obtained link, the first degree of correlation being the degree of correlation between a target link in the target webpage and the designated topic, and the second degree of correlation being the degree of correlation between the target text content in the target webpage and the designated topic;
determining a temperature value of the target webpage according to the first degree of correlation and the second degree of correlation, and storing the content to be displayed of the target webpage, wherein the temperature value characterizes the degree of correlation between the target webpage and the designated topic;
if the temperature value of the target webpage is greater than or equal to a first preset temperature value, putting the target link into a second to-be-crawled link set;
if the first to-be-crawled link set contains no link that has not been obtained, obtaining from the second to-be-crawled link set the uncrawled link with the highest degree of correlation with the designated topic, and then performing again the step of determining the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link.
Wherein, determining the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link comprises:
crawling the target webpage from the network according to the obtained link;
extracting the target text content, the target link and the anchor text corresponding to the target link from the target webpage;
determining, based on the anchor text, the degree of correlation between the corresponding target link and the designated topic as the first degree of correlation, and determining the degree of correlation between the target text content and the designated topic as the second degree of correlation.
Wherein, determining the degree of correlation between the target text content and the designated topic comprises:
segmenting the target text content with a bidirectional long short-term memory conditional random field model to obtain multiple words;
determining the topic of the target text content from the multiple words and a pre-established topic determination model;
determining the degree of correlation between the topic of the target text content and the designated topic as the degree of correlation between the target text content and the designated topic.
Wherein, determining, based on the anchor text, the degree of correlation between the corresponding target link and the designated topic comprises:
converting the words in the anchor text into word vectors;
inputting the word vectors into a pre-established topic prediction model and obtaining the prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is trained with the word vectors of topic-labeled training anchor texts as training samples;
determining the degree of correlation between the topic of the anchor text and the designated topic as the degree of correlation between the target link corresponding to the anchor text and the designated topic.
Wherein, the process of pre-establishing the topic prediction model comprises:
obtaining multiple anchor texts labeled with topics to form a training anchor text set;
converting each word of each training anchor text in the training anchor text set into a word vector in turn, to obtain the word vector set corresponding to the training anchor text, wherein the distance between different word vectors characterizes the relevance between their corresponding texts;
training a bidirectional recurrent neural network with the word vector set corresponding to the training anchor text as input, and using the trained bidirectional recurrent neural network as the topic prediction model.
Preferably, the topic web page crawling method further comprises:
deleting from the second to-be-crawled link set the links whose degree of correlation with the designated topic is less than a preset degree of correlation and whose corresponding temperature value is less than a second preset temperature value.
A topic crawler system, comprising: a link obtaining module, a degree of correlation determining module, a temperature value determining module and a link processing module;
the link obtaining module is configured to obtain a link that has not been crawled from a first to-be-crawled link set, the first to-be-crawled link set including seed links obtained in advance;
the degree of correlation determining module is configured to determine a first degree of correlation and a second degree of correlation for the target webpage corresponding to the obtained link, the first degree of correlation being the degree of correlation between a target link in the target webpage and the designated topic, and the second degree of correlation being the degree of correlation between the target text content in the target webpage and the designated topic;
the temperature value determining module is configured to determine a temperature value of the target webpage according to the first degree of correlation and the second degree of correlation, and to store the content to be displayed of the target webpage, wherein the temperature value characterizes the degree of correlation between the target webpage and the designated topic;
the link processing module is configured to put the target link into a second to-be-crawled link set when the temperature value of the target webpage is greater than or equal to a first preset temperature value;
the link obtaining module is further configured to, when the first to-be-crawled link set contains no link that has not been obtained, obtain from the second to-be-crawled link set the uncrawled link with the highest degree of correlation with the designated topic, and then trigger the degree of correlation determining module to determine the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link.
Wherein, the degree of correlation determining module comprises: a web page crawling submodule, a data extracting submodule, a first degree of correlation determining submodule and a second degree of correlation determining submodule;
the web page crawling submodule is configured to crawl the target webpage from the network according to the obtained link;
the data extracting submodule is configured to extract the target text content, the target link and the anchor text corresponding to the target link from the target webpage;
the first degree of correlation determining submodule is configured to determine, based on the anchor text, the degree of correlation between the corresponding target link and the designated topic as the first degree of correlation;
the second degree of correlation determining submodule is configured to determine the degree of correlation between the target text content and the designated topic as the second degree of correlation.
Wherein, the first degree of correlation determining submodule comprises: a conversion submodule, a prediction submodule and a determining submodule;
the conversion submodule is configured to convert the words in the anchor text into word vectors;
the prediction submodule is configured to input the word vectors into a pre-established topic prediction model and obtain the prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is trained with the word vectors of topic-labeled training anchor texts as training samples;
the determining submodule is configured to determine the degree of correlation between the topic of the anchor text and the designated topic as the degree of correlation between the target link corresponding to the anchor text and the designated topic.
Preferably, the topic crawler system further comprises: a link deleting module;
the link deleting module is configured to delete from the second to-be-crawled link set the links whose degree of correlation with the designated topic is less than a preset degree of correlation and whose corresponding temperature value is less than a second preset temperature value.
The above technical solution has the following beneficial effects:
The topic web page crawling method and topic crawler system provided by the present invention first obtain a link that has not been crawled from the first to-be-crawled link set, and then determine, for the target webpage corresponding to the obtained link, the degree of correlation between the target text content and the designated topic and the degree of correlation between the target link and the designated topic. The temperature value of the target webpage is then determined from these degrees of correlation, and the content to be displayed of the target webpage is stored. If the temperature value of the target webpage is greater than or equal to the first preset temperature value, the target link is put into the second to-be-crawled link set. When the first to-be-crawled link set contains no link that has not been obtained, the uncrawled link with the highest degree of correlation with the designated topic is obtained from the second to-be-crawled link set and crawling continues. The method and system provided by the present invention enable a user to obtain a large number of webpages related to the designated topic from the network, which improves the user experience.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.
Fig. 1 is a flow diagram of the topic web page crawling method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the process, in the topic web page crawling method provided by an embodiment of the present invention, of determining the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link;
Fig. 3 is a flow diagram of the process, in the topic web page crawling method provided by an embodiment of the present invention, of determining the degree of correlation between the corresponding target link and the designated topic based on the anchor text;
Fig. 4 is a schematic diagram of word vectors provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the topic prediction model provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram, provided by an embodiment of the present invention, of pooling the n×2m matrix Z to obtain a vector P of dimension 2m;
Fig. 7 is a schematic diagram, provided by an embodiment of the present invention, of segmenting a sentence with the bidirectional long short-term memory conditional random field model;
Fig. 8 is a structural schematic diagram of the topic crawler system provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a topic web page crawling method for crawling webpages related to a designated topic. Fig. 1 shows a flow diagram of the method, which may include:
Step S101: obtain a link that has not been crawled from the first to-be-crawled link set.
The first to-be-crawled link set includes seed links obtained in advance.
A seed link is the starting position of web page crawling; good seed links make it possible to find topic-related webpages quickly.
Step S102: determine the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link, where the first degree of correlation is the degree of correlation between a target link in the target webpage and the designated topic, and the second degree of correlation is the degree of correlation between the target text content in the target webpage and the designated topic.
The target text content in the target webpage is the text contained in the target webpage, such as an article or a news item. A target link is a URL present in the target webpage, through which the corresponding page can be reached.
Step S103: determine the temperature value of the target webpage according to the first degree of correlation and the second degree of correlation, and store the content to be displayed of the target webpage.
The temperature value of the target webpage characterizes the degree of correlation between the target webpage and the designated topic. The higher the temperature value of the target webpage, the higher its degree of correlation with the designated topic; conversely, the lower the temperature value, the lower the degree of correlation.
The content to be displayed of the target webpage may include the title of the target webpage, the text content of the target webpage and the link of the target webpage.
A target webpage contains many HTML tags and other code. This information has no value for display, because users do not care about it, and storing it would waste storage space. Therefore, when the information of the target webpage is stored, only the title, the text content and the link of the target webpage need to be stored.
In one possible implementation, the title, text content and link of the target webpage can be stored in a file in the following format:
<title>the title of webpage</title>
<body>the content of text of webpage</body>
<url>the link of webpage</url>
When the content is stored in a database, three fields can be set: title, body and url.
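The two storage options described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the table name `pages`, the use of SQLite, the choice of `url` as primary key and the XML escaping of field values are all assumptions made for the example.

```python
import sqlite3
from xml.sax.saxutils import escape


def format_page_record(title: str, body: str, url: str) -> str:
    """Render one page in the three-tag file format shown above.

    Escaping special characters is an assumption; the source does not
    say how markup characters inside the fields are handled.
    """
    return (
        f"<title>{escape(title)}</title>\n"
        f"<body>{escape(body)}</body>\n"
        f"<url>{escape(url)}</url>\n"
    )


def open_page_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with the three fields: title, body, url."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(title TEXT, body TEXT, url TEXT PRIMARY KEY)"
    )
    return conn


def save_page(conn: sqlite3.Connection, title: str, body: str, url: str) -> None:
    """Insert a page, overwriting any earlier row with the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (title, body, url) VALUES (?, ?, ?)",
        (title, body, url),
    )
    conn.commit()
```

Keying the table on `url` means that a page reached twice is stored once; whether the original system deduplicates in this way is not stated in the source.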
Step S104: if the temperature value of the target webpage is greater than or equal to the first preset temperature value, put the target link into the second to-be-crawled link set.
It should be noted that a temperature value greater than or equal to the first preset temperature value shows that the degree of correlation between the target webpage and the designated topic is relatively high. Correspondingly, the degree of correlation between the target links contained in the target webpage and the designated topic may also be relatively high, so the target links contained in the target webpage are put into the second to-be-crawled link set.
The links in the second to-be-crawled link set are not seed links; they are links extracted from webpages obtained on the basis of the seed links. Webpages obtained on the basis of the seed links may include webpages obtained directly or indirectly through the seed links.
If the temperature value of the target webpage is less than the first preset temperature value, the degree of correlation between the target webpage and the designated topic is relatively low, and correspondingly the degree of correlation between the links contained in the target webpage and the designated topic is also relatively low. The links in the target webpage are therefore not crawled further.
Step S105: if the first to-be-crawled link set contains no link that has not been obtained, obtain from the second to-be-crawled link set the uncrawled link with the highest degree of correlation with the designated topic, and then perform step S102.
In one possible implementation, obtaining the uncrawled link with the highest degree of correlation with the designated topic from the second to-be-crawled link set may include: sorting the links in the second to-be-crawled link set by the degree of correlation between each uncrawled link and the designated topic, determining from the sorted order the link with the highest degree of correlation with the designated topic, and obtaining that link from the second to-be-crawled link set.
It should be noted that if the first to-be-crawled link set still contains links that have not been obtained, uncrawled links are obtained from the first to-be-crawled link set first, and step S102 and the subsequent steps are performed. After all links in the first to-be-crawled link set have been crawled, the uncrawled link with the highest degree of correlation with the designated topic is obtained from the second to-be-crawled link set and crawled. When the second to-be-crawled link set contains no link that has not been obtained, the web page crawling process for the designated topic ends.
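The scheduling across the two link sets can be sketched as follows. This is a simplified illustration under stated assumptions: `process_page` is a hypothetical callback standing in for steps S102 and S103, a heap replaces the explicit sorting step, and the `seen` set used for deduplication is an addition that the source does not describe.

```python
import heapq
from collections import deque
from typing import Callable, Iterable, List, Tuple

# (target link, degree of correlation with the designated topic)
ScoredLinks = List[Tuple[str, float]]


def crawl(
    seed_links: Iterable[str],
    process_page: Callable[[str], Tuple[float, ScoredLinks]],
    temperature_threshold: float,
) -> List[str]:
    """Return the links in the order in which they were crawled.

    process_page(link) is hypothetical: it is assumed to fetch the page,
    store its content, and return the page's temperature value together
    with the page's target links and their degrees of correlation.
    """
    first_set = deque(seed_links)             # seed links, crawled first
    second_set: List[Tuple[float, str]] = []  # max-heap via negated score
    seen = set(first_set)
    crawled: List[str] = []

    while first_set or second_set:
        if first_set:
            link = first_set.popleft()
        else:
            # Highest degree of correlation first.
            _, link = heapq.heappop(second_set)
        crawled.append(link)

        temperature, scored_links = process_page(link)
        # Only pages that are hot enough contribute their links.
        if temperature >= temperature_threshold:
            for target, relevance in scored_links:
                if target not in seen:
                    seen.add(target)
                    heapq.heappush(second_set, (-relevance, target))
    return crawled
```

A heap gives the same "highest degree of correlation first" order as sorting the whole set before each pick, without re-sorting every time a link is added.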
The Korean-oriented web page crawling method provided by the embodiment of the present invention first obtains a link that has not been crawled from the first to-be-crawled link set, and then determines, for the target webpage corresponding to the obtained link, the degree of correlation between the target text content and the designated topic and the degree of correlation between the target link and the designated topic. The temperature value of the target webpage is then determined from these degrees of correlation, and the content to be displayed of the target webpage is stored. If the temperature value of the target webpage is greater than or equal to the first preset temperature value, the target link is put into the second to-be-crawled link set. When the first to-be-crawled link set contains no link that has not been obtained, the uncrawled link with the highest degree of correlation with the designated topic is obtained from the second to-be-crawled link set and crawling continues. The web page crawling method provided by the embodiment of the present invention enables a user to obtain a large number of webpages related to the designated topic from the network, which improves the user experience.
It should be noted that the topic web page crawling method provided by the above embodiment is applicable to topic web page crawling in multiple languages, for example Chinese, Korean and Japanese.
Korean is the main language of China's ethnic Korean population and is also the official language of South Korea and North Korea, so it has considerable research significance, in the following respects. First, the ethnic Korean people of China have Korean as their mother tongue and may use Korean when searching for information, so a search engine for specific content should be provided to make their lives more convenient. Second, South Korea and North Korea are both neighbors of China and have had many ties with China since ancient times; in today's constantly changing international community, people should pay attention to the development and situation of surrounding countries as well as to the development of their own. For these reasons, this embodiment takes Korean as the example when introducing the steps of the method provided by the above embodiment. It should be noted that when the above method is oriented to Korean, multiple links (for example, 163) can be selected in advance from the NAVER website (https://www.naver.com/) as seed links.
Step S102 in the above embodiment, determining the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link, is introduced below.
Fig. 2 shows a flow diagram of the process of determining the first degree of correlation and the second degree of correlation for the target webpage corresponding to the obtained link, which may include:
Step S201: crawl the corresponding target webpage from the network according to the obtained link.
Step S202: extract the target text content, the target links and the anchor text corresponding to each target link from the target webpage.
Anchor text is the text corresponding to a link URL in a webpage. It is usually contained in an a tag, organized in the following form:
<a href="URL">anchor Text</a>
The anchor text corresponding to a target link contains a brief summary of the webpage that the link points to, so it can be used to predict the topic of the target link. The process of predicting the topic of the corresponding target link based on the anchor text is explained later.
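Extraction of target links and their anchor text, as in step S202, can be sketched with the Python standard library's `html.parser`. The class and function names are hypothetical, and the sketch ignores details a production crawler would need, such as resolving relative URLs or handling nested a tags.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class AnchorExtractor(HTMLParser):
    """Collect (URL, anchor text) pairs from <a href="..."> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text is collected only while inside an <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html: str) -> List[Tuple[str, str]]:
    """Return every (URL, anchor text) pair found in the HTML string."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Text inside nested inline elements, such as bold text within the link, is kept as part of the anchor text because all character data between the opening and closing a tag is concatenated.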
Step S203: determine, based on the anchor text, the degree of correlation between the corresponding target link and the designated topic as the first degree of correlation, and determine the degree of correlation between the target text content and the designated topic as the second degree of correlation.
The process of determining the degree of correlation between the corresponding target link and the designated topic based on the anchor text is described below. Fig. 3 shows a flow diagram of this process, which may include:
Step S301: convert the words in the anchor text into word vectors.
It should be noted that the prior art includes vectorization by one-hot encoding, which represents each item with 0s and 1s, but the vectors obtained in this way are sparse and of relatively high dimension. In view of this, the embodiment of the present invention may use the word-embedding method for vectorization: the approach that the Word2vec model applies to term vectors is extended to the representation of individual characters, yielding a representation vector of a certain dimension for each character (the word vector referred to in this embodiment).
The Word2vec model generates a representation from the relationship between an item and its context, so the resulting representation can contain certain semantic information; the representation is obtained by training the model on these relationships. Korean article information from wiki can be used for training, and the resulting vectorized representation is similar to Fig. 4.
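The contrast between the two vectorization approaches can be made concrete with a minimal sketch. The function names and the tiny embedding table used in the example are hypothetical; in practice the dense vectors would come from a trained Word2vec-style model rather than being written by hand.

```python
from typing import Dict, Iterable, List


def build_vocabulary(texts: Iterable[str]) -> Dict[str, int]:
    """Assign an index to every distinct character seen in the texts."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: index for index, ch in enumerate(chars)}


def one_hot(ch: str, vocabulary: Dict[str, int]) -> List[int]:
    """One-hot vector: as long as the vocabulary, with a single 1."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[ch]] = 1
    return vector


def embed(text: str, table: Dict[str, List[float]]) -> List[List[float]]:
    """Dense alternative: look each character up in a trained table.

    The table is assumed to exist already; training it is out of scope.
    """
    return [table[ch] for ch in text]
```

A one-hot vector grows with the vocabulary and is almost entirely zeros, whereas the dense vectors keep a fixed, much smaller dimension regardless of vocabulary size.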
Step S302: input the word vectors into the pre-established topic prediction model and obtain the prediction result output by the topic prediction model.
The prediction result indicates the topic of the anchor text, and the topic prediction model is trained with the word vectors of topic-labeled training anchor texts as training samples.
The process of pre-establishing the topic prediction model may include: obtaining multiple anchor texts labeled with topics to form a training anchor text set; converting each word of each training anchor text in the training anchor text set into a word vector in turn, to obtain the word vector set corresponding to the training anchor text, where the distance between different word vectors characterizes the relevance between their corresponding texts; and training a bidirectional recurrent neural network with the word vector set corresponding to the training anchor text as input, and using the trained bidirectional recurrent neural network as the topic prediction model.
Fig. 5 shows a schematic diagram of the topic prediction model. In this embodiment, the bidirectional recurrent neural network may be a bidirectional LSTM network. To make full use of the network output, the model splices the forward output $h_f$ and the backward output $h_b$ for each input character to obtain $h_i$. If the dimension of $h_f$ is the number of hidden units $m$, the dimension of $h_i$ is $2m$:
$$h_i = [h_f, h_b] \tag{1}$$
For an anchor text of length $n$, the outputs obtained after the bidirectional LSTM computation and splicing form an $n \times 2m$ matrix $Z$:
$$Z = \{h_0, h_1, h_2, \ldots, h_{n-1}\} \tag{2}$$
For the resulting matrix $Z$, the pooling method used in convolutional networks is applied to compress and extract the features in $Z$, giving a vector $P$ of dimension $2m$, as shown in Fig. 6.
A pooling method processes part of a matrix with a specific operation, compressing and transforming the matrix and extracting useful features from it for subsequent analysis. In one possible implementation, max pooling can be applied to each row of the matrix, namely:
$$k_i = \max_{0 \le j < n} h_{ij} \tag{3}$$
Pooling is followed by a fully connected network containing h units. Through the network computation, the output is converted into probability values for the corresponding topics.
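The splicing of formula (1) and the max pooling of formula (3) can be sketched in plain Python as follows. The sketch assumes that pooling takes the maximum over the n positions for each of the 2m feature dimensions, which matches the stated output dimension of 2m. The `softmax` function is an assumption: the source says only that the output is converted into probability values and does not name the function used.

```python
import math
from typing import List


def splice(h_f: List[float], h_b: List[float]) -> List[float]:
    """Formula (1): concatenate forward and backward outputs (m + m = 2m)."""
    return h_f + h_b


def max_pool(z: List[List[float]]) -> List[float]:
    """Reduce an n x 2m matrix to a 2m vector.

    Each output component is the maximum of one feature dimension
    over all n positions.
    """
    return [max(column) for column in zip(*z)]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities (assumed output conversion)."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With m = 2 and n = 2, splicing produces two rows of four values each, and `max_pool` returns one four-dimensional vector regardless of the anchor text length, which is what lets a fixed-size fully connected layer follow.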
Step S303: determine the degree of correlation between the topic of the anchor text and the designated topic as the degree of correlation between the target link corresponding to the anchor text and the designated topic.
When predicting the topic of the anchor text, this embodiment uses the full content of the anchor text, which eliminates interference from other information, and uses the word vector representation, which can carry more semantic information and is therefore better suited to predicting the topic of the anchor text. The recurrent neural network used by the topic prediction model can model character string information well. Moreover, because the recurrent neural network is bidirectional, it obtains information from both the forward and backward directions of an anchor text, so more information is available for judging the topic.
The process of determining the degree of correlation between the target text content and the designated topic is described below.
In one possible implementation, determining the degree of correlation between the target text content and the designated topic may include: segmenting the target text content with a bidirectional long short-term memory conditional random field model to obtain multiple words; determining the topic of the target text content from the multiple words and a pre-established topic determination model; and determining the degree of correlation between the topic of the target text content and the designated topic as the degree of correlation between the target text content and the designated topic.
Once the above representation has been obtained, the word vectors can be used for Korean word segmentation. When segmenting the target text content, this embodiment regards segmentation as a process of forming words from characters and assigns each character in a word one of the following labels: B indicates that the character is at the beginning of a word; M indicates that the character is inside a word; S indicates that the character is a word by itself; E indicates that the character is at the end of a word. This embodiment combines the long short-term memory unit (LSTM) of recurrent neural networks with a conditional random field (CRF) to form a bidirectional long short-term memory conditional random field model that performs Korean word segmentation. Fig. 7 shows a schematic diagram of segmenting a Korean sentence with this model. As Fig. 7 shows, a sentence or character string is used as input; word vectors are obtained by lookup in the word vector table and are then input into the bidirectional LSTM network; after the output of the bidirectional LSTM network is obtained, the CRF in the model decodes the relationship between characters and labels to obtain the final segmentation result.
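The last stage of this process, turning a B/M/S/E label sequence into words, can be sketched as follows. The function name is hypothetical, and the sketch covers only the conversion from labels to words; producing the labels is the job of the trained bidirectional LSTM and CRF, which are not reproduced here. Recovering gracefully from an inconsistent label sequence is an added assumption.

```python
from typing import List


def decode_bmes(chars: str, labels: List[str]) -> List[str]:
    """Group characters into words according to their B/M/S/E labels."""
    if len(chars) != len(labels):
        raise ValueError("one label is required per character")

    words: List[str] = []
    current = ""
    for ch, label in zip(chars, labels):
        if label == "S":            # a single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif label == "B":          # start of a new word
            if current:
                words.append(current)
            current = ch
        elif label == "M":          # inside the current word
            current += ch
        elif label == "E":          # end of the current word
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown label: {label!r}")
    if current:                     # unterminated word at end of input
        words.append(current)
    return words
```

For example, the characters `abcdef` with labels B, E, S, B, M, E are grouped into the three words `ab`, `c` and `def`.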
To determine the topic of the target text content, a topic decision model is constructed in addition to performing word segmentation. In one possible implementation, a support vector machine (SVM) can be used to build the topic decision model: topic features are obtained with a feature extraction method, the features are weighted with the TF-IDF method to obtain webpage feature vectors, and the topic decision model is trained on these vectors. When the topic of the target text content needs to be determined, the trained topic decision model is applied to it directly.
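The TF-IDF weighting step can be sketched as follows. The text does not state which TF-IDF variant is used, so this sketch assumes term frequency normalized by document length and an unsmoothed inverse document frequency log(N/df); the function name `tfidf_vectors` is hypothetical, and the SVM training on the resulting vectors is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight the terms of each segmented document by TF-IDF.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    vectors: List[Dict[str, float]] = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Each returned dictionary is a sparse webpage feature vector; a term that occurs in every document receives weight 0 under this variant, which is why smoothed variants are often preferred in practice.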
Another embodiment of the present invention describes step S103 of the preceding embodiment: determining the temperature value of the target webpage according to the first relevance and the second relevance.
It should be noted that topic-related pages on the Internet are not necessarily connected to one another directly; they may be linked indirectly through other webpages, and these links form a "tunnel". Tunnel traversal lets the topic crawler try to cross these links in order to discover more topic-related webpages; for a topic crawler, it is a method oriented toward future reward.
In this embodiment, a different temperature value is calculated for each webpage, and Newton's law of cooling is used to adjust temperature values dynamically during crawling, which gives the topic crawler a certain tunnel traversal capability.
In this embodiment, the temperature value of a webpage is calculated from the relevance of the text content in the webpage to the specified topic and the relevance of the links in the webpage to the specified topic. Specifically, the temperature value $T_i$ of a webpage is calculated according to formula (4):
In formula (4), $T_{i-1}$ denotes the temperature of the parent webpage, $\delta$ denotes the temperature decay rate, and $t_i$ denotes the temperature of the webpage itself; the temperature value of a webpage is thus a combination of the temperature passed down from its parent webpage and the temperature of the current webpage itself. Formula (4) shows that calculating temperature values is an iterative process: the original parent webpages are the webpages corresponding to the seed links, and their temperature can be set to a fixed value.
The original temperature $t_i$ of the webpage itself is calculated according to formula (5):
In formula (5), $w(\mathrm{content})$ denotes the relevance of the text content of the webpage to the specified topic, and $w(lk)$ denotes the relevance of the links in the webpage to the specified topic.
The temperature decay rate $\delta$ is calculated according to formula (6):
$$\delta = e^{-u\tau} \qquad (6)$$
In formula (6), $u$ denotes the cooling coefficient, which can be set to a constant, and $\tau$ denotes the time interval, which can also be set to the current depth of the webpage.
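Formula (6) can be evaluated directly, as in the following minimal sketch. The function name `decay_rate` and the cooling coefficient 0.5 used in the loop are illustrative assumptions; the time interval is taken to be the crawl depth, which the text allows as one option.

```python
import math


def decay_rate(u: float, tau: float) -> float:
    """Formula (6): delta = e^(-u * tau)."""
    return math.exp(-u * tau)


# With the depth used as tau, the decay rate shrinks as the crawler moves
# further away from the seed pages (u = 0.5 is an arbitrary example value).
for depth in range(4):
    print(depth, round(decay_rate(0.5, depth), 4))
```

A larger cooling coefficient makes inherited temperature fade faster, so the crawler gives up on unpromising tunnels after fewer hops.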
During tunnel traversal, webpages have different temperature values and therefore different crawling capabilities: the links of a high-temperature webpage are likely to be more relevant to the specified topic and get more chances to gather further webpage information, while the opposite holds for low-temperature webpages. So that the topic crawler can stop following a link, when the temperature of a webpage falls below the first preset temperature value, the links corresponding to that webpage are abandoned and not crawled further.
To improve crawling efficiency and avoid crawling links with low relevance to the specified topic, the method provided by this embodiment of the present invention may further include: deleting, from the second to-be-crawled link set, links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.
It should be noted that the temperature value corresponding to a link is the temperature value of the webpage containing that link. That is, when a webpage contains multiple links, all of these links have the same temperature value, namely that of the webpage containing them.
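The pruning step, together with the rule that a link inherits the temperature of the webpage containing it, can be sketched as follows. The class `QueuedLink`, the function `prune_frontier`, and the threshold values in the usage example are hypothetical names and numbers chosen for illustration; the sketch reads the deletion condition as requiring both the relevance and the temperature to fall below their thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedLink:
    url: str
    relevance: float    # relevance of the link to the specified topic
    temperature: float  # temperature of the webpage that contains the link


def prune_frontier(links: List[QueuedLink],
                   min_relevance: float,
                   min_temperature: float) -> List[QueuedLink]:
    """Drop links that are both weakly relevant and found on a cold page."""
    return [
        link for link in links
        if not (link.relevance < min_relevance
                and link.temperature < min_temperature)
    ]


# Links extracted from the same page share that page's temperature.
page_temperature = 0.2
frontier = [
    QueuedLink("http://example.com/a", 0.9, page_temperature),
    QueuedLink("http://example.com/b", 0.1, page_temperature),
    QueuedLink("http://example.com/c", 0.1, 0.8),
]
kept = prune_frontier(frontier, min_relevance=0.3, min_temperature=0.5)
```

Under this reading, a weakly relevant link survives if it was found on a sufficiently hot page, which preserves the tunnel traversal behavior described above.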
After all links in the first to-be-crawled link set and the second to-be-crawled link set have been crawled, the stored to-be-displayed content of each target webpage is displayed. Specifically, the to-be-displayed content is shown in descending order of the relevance of each target webpage's text content to the specified topic; that is, in the webpage content shown to the user, webpages whose text content is more relevant to the specified topic are ranked first.
An embodiment of the present invention also provides a topic crawler system. Referring to Fig. 8, which shows a schematic structural diagram of the system, the system may include: a link acquisition module 801, a relevance determination module 802, a temperature value determination module 803, and a link processing module 804.
The link acquisition module 801 is configured to obtain an uncrawled link from a first to-be-crawled link set, where the first to-be-crawled link set includes seed links obtained in advance.
The relevance determination module 802 is configured to determine the first relevance and the second relevance corresponding to the target webpage of the obtained link.
Here, the first relevance is the relevance of the target link in the target webpage to the specified topic, and the second relevance is the relevance of the target text content in the target webpage to the specified topic.
The temperature value determination module 803 is configured to determine the temperature value of the target webpage according to the first relevance and the second relevance, and to store the to-be-displayed content of the target webpage.
Here, the temperature value characterizes the relevance of the target webpage to the specified topic.
The link processing module 804 is configured to put the target link into a second to-be-crawled link set when the temperature value of the target webpage is greater than or equal to the first preset temperature value.
The link acquisition module 801 is further configured to, when no unobtained link remains in the first to-be-crawled link set, obtain from the second to-be-crawled link set the uncrawled link with the highest relevance to the specified topic, and then trigger the relevance determination module 802 to determine the first relevance and the second relevance corresponding to the target webpage of the obtained link.
The topic crawler system provided by this embodiment of the present invention first obtains an uncrawled link from the first to-be-crawled link set, then determines the relevance of the target text content in the corresponding target webpage to the specified topic and the relevance of the target link to the specified topic, then determines the temperature value of the target webpage from these relevances and stores the to-be-displayed content of the target webpage. If the temperature value of the target webpage is greater than or equal to the first preset temperature value, the target link is put into the second to-be-crawled link set. When no unobtained link remains in the first to-be-crawled link set, the uncrawled link with the highest relevance to the specified topic is obtained from the second to-be-crawled link set and crawling continues. The topic crawler system provided by this embodiment thus allows a user to obtain a large number of webpages related to the specified topic from the network, which improves the user experience.
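The overall flow summarized above can be sketched as a loop over two frontiers. This is a simplified illustration under stated assumptions, not the claimed implementation: the function `crawl`, the callable `analyze` (which stands in for fetching a page and computing its text relevance, its links with their relevances, and its temperature), and the `max_pages` cap are all hypothetical. The first set is modeled as a FIFO queue and the second as a priority queue ordered by link relevance.

```python
import heapq
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# (text relevance, [(link, link relevance), ...], page temperature)
PageResult = Tuple[float, List[Tuple[str, float]], float]


def crawl(seeds: Iterable[str],
          analyze: Callable[[str], PageResult],
          first_threshold: float,
          max_pages: int = 100) -> List[str]:
    """Crawl seeds first, then the most relevant queued links.

    Returns the crawled URLs ordered by descending text relevance.
    """
    first = deque(seeds)                   # first to-be-crawled link set
    second: List[Tuple[float, str]] = []   # second set, max-heap via negation
    seen = set(first)
    stored: Dict[str, float] = {}          # url -> text relevance

    while (first or second) and len(stored) < max_pages:
        if first:
            url = first.popleft()
        else:
            _, url = heapq.heappop(second)  # highest link relevance first
        text_rel, links, temperature = analyze(url)
        stored[url] = text_rel
        # Only sufficiently hot pages contribute links to the second set.
        if temperature >= first_threshold:
            for link, link_rel in links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(second, (-link_rel, link))

    # Display order: most relevant text content first.
    return sorted(stored, key=stored.get, reverse=True)
```

Using a heap for the second set makes "obtain the uncrawled link with the highest relevance" a logarithmic-time operation; the `seen` set prevents a link from being queued twice.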
In the topic crawler system provided by the above embodiment, the relevance determination module 802 may include: a webpage crawling submodule, a data extraction submodule, a first relevance determination submodule, and a second relevance determination submodule.
The webpage crawling submodule is configured to crawl the target webpage from the network according to the obtained link.
The data extraction submodule is configured to extract, from the target webpage, the target text content, the target link, and the anchor text corresponding to the target link.
The first relevance determination submodule is configured to determine, based on the anchor text, the relevance of the corresponding target link to the specified topic as the first relevance.
The second relevance determination submodule is configured to determine the relevance of the target text content to the specified topic as the second relevance.
Further, the first relevance determination submodule may include: a conversion submodule, a prediction submodule, and a determination submodule.
The conversion submodule is configured to convert the characters in the anchor text into character vectors.
The prediction submodule is configured to input the character vectors into a pre-established topic prediction model and obtain the prediction result output by the topic prediction model, where the prediction result indicates the topic of the anchor text, and the topic prediction model is trained using, as training samples, the character vectors corresponding to training anchor texts labeled with topics.
The determination submodule is configured to determine the relevance of the topic of the anchor text to the specified topic, as the relevance of the target link corresponding to the anchor text to the specified topic.
The topic crawler system provided by the above embodiment may further include a link deletion module.
The link deletion module is configured to delete, from the second to-be-crawled link set, links whose relevance to the specified topic is less than the preset relevance and whose corresponding temperature value is less than the second preset temperature value.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
In the embodiments provided in this application, it should be understood that the disclosed method, apparatus, and device may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A topic webpage crawling method, characterized by comprising:
obtaining an uncrawled link from a first to-be-crawled link set, wherein the first to-be-crawled link set includes seed links obtained in advance;
determining a first relevance and a second relevance corresponding to the target webpage of the obtained link, wherein the first relevance is the relevance of a target link in the target webpage to a specified topic, and the second relevance is the relevance of target text content in the target webpage to the specified topic;
determining a temperature value of the target webpage according to the first relevance and the second relevance, and storing to-be-displayed content of the target webpage, wherein the temperature value characterizes the relevance of the target webpage to the specified topic;
if the temperature value of the target webpage is greater than or equal to a first preset temperature value, putting the target link into a second to-be-crawled link set;
if no unobtained link remains in the first to-be-crawled link set, obtaining from the second to-be-crawled link set the uncrawled link with the highest relevance to the specified topic, and then performing the step of determining the first relevance and the second relevance corresponding to the target webpage of the obtained link.
2. The topic webpage crawling method according to claim 1, characterized in that determining the first relevance and the second relevance corresponding to the target webpage of the obtained link comprises:
crawling the target webpage from the network according to the obtained link;
extracting, from the target webpage, the target text content, the target link, and the anchor text corresponding to the target link;
determining, based on the anchor text, the relevance of the corresponding target link to the specified topic as the first relevance, and determining the relevance of the target text content to the specified topic as the second relevance.
3. The topic webpage crawling method according to claim 2, characterized in that determining the relevance of the target text content to the specified topic comprises:
segmenting the target text content using a bidirectional long short-term memory conditional random field model to obtain multiple words;
determining the topic of the target text content from the multiple words and a pre-established topic decision model;
determining the relevance of the topic of the target text content to the specified topic as the relevance of the target text content to the specified topic.
4. The topic webpage crawling method according to claim 2, characterized in that determining, based on the anchor text, the relevance of the corresponding target link to the specified topic comprises:
converting the characters in the anchor text into character vectors;
inputting the character vectors into a pre-established topic prediction model and obtaining the prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is trained using, as training samples, the character vectors corresponding to training anchor texts labeled with topics;
determining the relevance of the topic of the anchor text to the specified topic, as the relevance of the target link corresponding to the anchor text to the specified topic.
5. The topic webpage crawling method according to claim 3, characterized in that the process of pre-establishing the topic prediction model comprises:
obtaining multiple anchor texts labeled with topics to form a training anchor text set;
converting, in turn, each character of each training anchor text in the training anchor text set into a character vector to obtain the character vector set corresponding to that training anchor text, wherein the distance between different character vectors characterizes the association between their corresponding texts;
taking the character vector set corresponding to the training anchor text as input, training a bidirectional recurrent neural network, and using the trained bidirectional recurrent neural network as the topic prediction model.
6. The topic webpage crawling method according to claim 1, characterized by further comprising:
deleting, from the second to-be-crawled link set, links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.
7. A topic crawler system, characterized by comprising: a link acquisition module, a relevance determination module, a temperature value determination module, and a link processing module;
the link acquisition module is configured to obtain an uncrawled link from a first to-be-crawled link set, wherein the first to-be-crawled link set includes seed links obtained in advance;
the relevance determination module is configured to determine a first relevance and a second relevance corresponding to the target webpage of the obtained link, wherein the first relevance is the relevance of a target link in the target webpage to a specified topic, and the second relevance is the relevance of target text content in the target webpage to the specified topic;
the temperature value determination module is configured to determine a temperature value of the target webpage according to the first relevance and the second relevance, and to store to-be-displayed content of the target webpage, wherein the temperature value characterizes the relevance of the target webpage to the specified topic;
the link processing module is configured to put the target link into a second to-be-crawled link set when the temperature value of the target webpage is greater than or equal to a first preset temperature value;
the link acquisition module is further configured to, when no unobtained link remains in the first to-be-crawled link set, obtain from the second to-be-crawled link set the uncrawled link with the highest relevance to the specified topic, and then trigger the relevance determination module to determine the first relevance and the second relevance corresponding to the target webpage of the obtained link.
8. The topic crawler system according to claim 7, characterized in that the relevance determination module comprises: a webpage crawling submodule, a data extraction submodule, a first relevance determination submodule, and a second relevance determination submodule;
the webpage crawling submodule is configured to crawl the target webpage from the network according to the obtained link;
the data extraction submodule is configured to extract, from the target webpage, the target text content, the target link, and the anchor text corresponding to the target link;
the first relevance determination submodule is configured to determine, based on the anchor text, the relevance of the corresponding target link to the specified topic as the first relevance;
the second relevance determination submodule is configured to determine the relevance of the target text content to the specified topic as the second relevance.
9. The topic crawler system according to claim 8, characterized in that the first relevance determination submodule comprises: a conversion submodule, a prediction submodule, and a determination submodule;
the conversion submodule is configured to convert the characters in the anchor text into character vectors;
the prediction submodule is configured to input the character vectors into a pre-established topic prediction model and obtain the prediction result output by the topic prediction model, wherein the prediction result indicates the topic of the anchor text, and the topic prediction model is trained using, as training samples, the character vectors corresponding to training anchor texts labeled with topics;
the determination submodule is configured to determine the relevance of the topic of the anchor text to the specified topic, as the relevance of the target link corresponding to the anchor text to the specified topic.
10. The topic crawler system according to claim 7, characterized by further comprising: a link deletion module;
the link deletion module is configured to delete, from the second to-be-crawled link set, links whose relevance to the specified topic is less than a preset relevance and whose corresponding temperature value is less than a second preset temperature value.