CN105528422A

CN105528422A - Focused crawler processing method and apparatus

Info

Publication number: CN105528422A
Application number: CN201510890437.1A
Authority: CN
Inventors: 张晨; 邵小亮; 谢隆飞; 王全礼
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2016-04-27
Anticipated expiration: 2035-12-07
Also published as: CN105528422B

Abstract

The present invention provides a focused crawler processing method and apparatus. After a web document is obtained, the method extracts from it at least web page title feature information, keyword feature information in the meta-information, description feature information in the meta-information, and web page body text feature information; performs topic relevance analysis on the web document based on this feature information to obtain a classification result; and, when the web document is stored in a web document set based on the classification result, trains a topic classifier based on the increment of web documents in the set. Thus, during crawling, the topic classification model associated with the focused crawler is also trained, which brings the model closer to the search topic. When the focused crawler then crawls based on the topic classification model, the crawled content is more relevant to the search topic, improving both the precision and the recall of crawling.

Description

A focused crawler processing method and apparatus

Technical field

The invention belongs to the technical field of web crawlers, and in particular relates to a focused crawler processing method and apparatus.

Background art

A web crawler is a program that "browses the network automatically", in other words a kind of network robot. Web crawlers are now widely used by Internet search engines and similar websites. A crawler can automatically collect the content of all pages accessible to it in a search engine or website, so that users can retrieve the information they need more quickly. The collected page content can also be processed further by the search engine or website, for example so that the search engine or website can be trained on it.

The focused crawler was developed on the basis of the web crawler. A focused crawler is a kind of web crawler with a topic discrimination module that crawls information on the Internet relevant to a search topic. Current focused crawlers are mainly built on keywords or regular expressions, and content crawled this way suffers from low recall.

Summary of the invention

In view of this, the object of the present invention is to provide a focused crawler processing method for improving recall. The technical solution is as follows:

The invention provides a focused crawler processing method, the method comprising:

obtaining the web document corresponding to a uniform resource locator (URL) in a to-be-crawled queue;

extracting feature information from the web document, wherein the feature information comprises at least web page title feature information, keyword feature information in the meta-information, description feature information in the meta-information, and web page body text feature information;

performing topic relevance classification on the web document based on the feature information to obtain a classification result;

determining, based on the classification result, whether to store the web document in a web document set; and

when the web document is stored in the web document set based on the classification result, training the topic classification model associated with the focused crawler based on the increment of web documents in the web document set.

Preferably, after the web document corresponding to the URL in the to-be-crawled queue is obtained, the method further comprises: judging whether the page corresponding to the URL is a navigation page;

if so, parsing the navigation page to obtain the URLs it contains, and writing the obtained URLs into the to-be-crawled queue;

if not, triggering the step of extracting feature information from the web document.
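The navigation-page branch above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how a navigation page is recognized or how features are extracted, so `is_navigation_page` and `extract_features` are hypothetical callables supplied by the caller, and `LinkCollector` and `process_page` are names introduced here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def process_page(url, html, crawl_queue, is_navigation_page, extract_features):
    """Expand navigation pages into the queue; featurize every other page."""
    if is_navigation_page(url, html):  # hypothetical predicate
        collector = LinkCollector(url)
        collector.feed(html)
        crawl_queue.extend(collector.links)  # write discovered URLs back
        return None
    return extract_features(html)  # hypothetical feature extractor
```

Passing the two decisions in as callables keeps the routing logic independent of whichever navigation-page heuristic and feature extractor are actually used.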

Preferably, extracting feature information from the web document comprises:

performing word segmentation on the title of the web document to obtain a first segmentation result, and obtaining a 1-tuple set of the title based on the first segmentation result;

using a first feature function to judge the relation between each word in the title and the 1-tuple set of the title, obtaining a title feature vector, wherein the title feature vector indicates the relation between each word in the title and the 1-tuple set;

performing word segmentation on the keyword information in the meta-information of the web document to obtain a second segmentation result, and obtaining a 1-tuple set of the keyword information based on the second segmentation result;

using a second feature function to judge the relation between each keyword in the keyword information and the 1-tuple set of the keyword information, obtaining a keyword feature vector, wherein the keyword feature vector indicates the relation between each keyword in the keyword information and the 1-tuple set of the keyword information;

performing word segmentation on the description meta-information in the meta-information of the web document to obtain a third segmentation result, and obtaining a 1-tuple set of the description meta-information based on the third segmentation result;

using a third feature function to judge the relation between each page description word in the description meta-information and the 1-tuple set of the description meta-information, obtaining a description feature vector, wherein the description feature vector indicates the relation between each page description word in the description meta-information and the 1-tuple set of the description meta-information;

processing the body text of the web document to obtain a 1-tuple set of the body text and a 2-tuple set of the body text;

using a fourth feature function to judge the relation between each keyword in the body text and the 1-tuple set of the body text, obtaining a first feature vector of the body text, wherein the first feature vector indicates the relation between each keyword in the body text and the 1-tuple set of the body text;

using a fifth feature function to judge the relation between each keyword in the body text and the 2-tuple set of the body text, obtaining a second feature vector of the body text, wherein the second feature vector indicates the relation between each keyword in the body text and the 2-tuple set of the body text.

Preferably, determining, based on the classification result, whether to store the web document in the web document set comprises:

when the classification result indicates that the web document is relevant to the search topic, judging whether the topic relevance probability of the web document is greater than a topic relevance probability threshold, wherein the search topic is the topic crawled by the focused crawler;

when the topic relevance probability of the web document is judged to be greater than the topic relevance probability threshold, storing the web document in the web document set;

when the classification result indicates that the web document is not relevant to the search topic, judging whether the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is less than a topic-relevant ratio threshold, wherein the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of topic-irrelevant documents is the number of web documents not relevant to the search topic;

when the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is judged to be less than the topic-relevant ratio threshold, storing the web document in the web document set.
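The storage decision above can be sketched as a single function. The function name, parameter names, and default threshold values are assumptions made for illustration; the text gives no concrete thresholds. The handling of an empty topic-irrelevant count is also an assumption, since the text does not define the ratio in that case.

```python
def should_store(is_relevant, relevance_prob, n_relevant, n_irrelevant,
                 prob_threshold=0.8, ratio_threshold=1.0):
    """Decide whether a classified web document joins the web document set.

    prob_threshold and ratio_threshold are illustrative defaults only.
    """
    if is_relevant:
        # Relevant documents are stored only when the classifier is confident.
        return relevance_prob > prob_threshold
    if n_irrelevant == 0:
        # Assumption: with no irrelevant documents the ratio is treated as
        # infinite, so it can never be below the threshold.
        ratio = float("inf")
    else:
        ratio = n_relevant / n_irrelevant
    # Irrelevant documents are stored only while the relevant-to-irrelevant
    # ratio is below the threshold, as the text states.
    return ratio < ratio_threshold
```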

Preferably, training the topic classification model associated with the focused crawler based on the increment of web documents in the web document set, when the web document is stored in the web document set based on the classification result, comprises:

when the web document is stored in the web document set, incrementing an increment counter by one, wherein the initial value of the increment counter is 0 and the increment counter is automatically incremented by one each time a web document is stored in the web document set;

judging whether the value of the increment counter is greater than an increment threshold, and if so, retraining the topic classification model and resetting the increment counter to its initial value.
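A minimal sketch of the counter-driven retraining described above, assuming the retraining routine is provided as a callback. The class name `IncrementalTrainer` and the `retrain` callback are hypothetical; the text does not specify how the model itself is trained.

```python
class IncrementalTrainer:
    """Retrains once the number of newly stored documents exceeds a threshold."""

    def __init__(self, retrain, increment_threshold):
        self.retrain = retrain  # hypothetical callback taking the document set
        self.increment_threshold = increment_threshold
        self.counter = 0        # increment counter, initial value 0
        self.documents = []     # the web document set

    def store(self, document):
        """Store one document; return True if this triggered retraining."""
        self.documents.append(document)
        self.counter += 1
        if self.counter > self.increment_threshold:
            self.retrain(self.documents)
            self.counter = 0    # reset to the initial value
            return True
        return False
```

With an increment threshold of 2, for example, retraining fires on the third stored document, because the comparison is strictly "greater than".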

The present invention also provides a focused crawler processing apparatus, the apparatus comprising:

an obtaining unit, for obtaining the web document corresponding to a URL in the to-be-crawled queue;

an extraction unit, for extracting feature information from the web document, wherein the feature information comprises at least web page title feature information, keyword feature information in the meta-information, description feature information in the meta-information, and web page body text feature information;

a classification unit, for performing topic relevance classification on the web document based on the feature information to obtain a classification result;

a judging unit, for determining, based on the classification result, whether to store the web document in a web document set;

a training unit, for training the topic classification model associated with the focused crawler based on the increment of web documents in the web document set when the web document is stored in the web document set based on the classification result.

Preferably, the apparatus further comprises a page judging unit, for judging whether the page corresponding to the URL is a navigation page; if so, triggering the obtaining unit to parse the navigation page, obtain the URLs in the navigation page, and write the obtained URLs into the to-be-crawled queue; if not, triggering the extraction unit.

Preferably, the extraction unit comprises:

a first segmentation subunit, for performing word segmentation on the title of the web document to obtain a first segmentation result, and obtaining a 1-tuple set of the title based on the first segmentation result;

a title feature vector obtaining subunit, for using a first feature function to judge the relation between each word in the title and the 1-tuple set of the title, obtaining a title feature vector, wherein the title feature vector indicates the relation between each word in the title and the 1-tuple set;

a second segmentation subunit, for performing word segmentation on the keyword information in the meta-information of the web document to obtain a second segmentation result, and obtaining a 1-tuple set of the keyword information based on the second segmentation result;

a keyword feature vector obtaining subunit, for using a second feature function to judge the relation between each keyword in the keyword information and the 1-tuple set of the keyword information, obtaining a keyword feature vector, wherein the keyword feature vector indicates the relation between each keyword in the keyword information and the 1-tuple set of the keyword information;

a third segmentation subunit, for performing word segmentation on the description meta-information in the meta-information of the web document to obtain a third segmentation result, and obtaining a 1-tuple set of the description meta-information based on the third segmentation result;

a description feature vector obtaining subunit, for using a third feature function to judge the relation between each page description word in the description meta-information and the 1-tuple set of the description meta-information, obtaining a description feature vector, wherein the description feature vector indicates the relation between each page description word in the description meta-information and the 1-tuple set of the description meta-information;

a fourth segmentation subunit, for processing the body text of the web document to obtain a 1-tuple set of the body text and a 2-tuple set of the body text;

a first feature vector obtaining subunit, for using a fourth feature function to judge the relation between each keyword in the body text and the 1-tuple set of the body text, obtaining a first feature vector of the body text, wherein the first feature vector indicates the relation between each keyword in the body text and the 1-tuple set of the body text;

a second feature vector obtaining subunit, for using a fifth feature function to judge the relation between each keyword in the body text and the 2-tuple set of the body text, obtaining a second feature vector of the body text, wherein the second feature vector indicates the relation between each keyword in the body text and the 2-tuple set of the body text.

Preferably, the judging unit comprises:

a first judging subunit, for judging, when the classification result indicates that the web document is relevant to the search topic, whether the topic relevance probability of the web document is greater than a topic relevance probability threshold, wherein the search topic is the topic crawled by the focused crawler;

a first storing subunit, for storing the web document in the web document set when the topic relevance probability of the web document is judged to be greater than the topic relevance probability threshold;

a second judging subunit, for judging, when the classification result indicates that the web document is not relevant to the search topic, whether the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is less than a topic-relevant ratio threshold, wherein the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of topic-irrelevant documents is the number of web documents not relevant to the search topic;

a second storing subunit, for storing the web document in the web document set when the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is judged to be less than the topic-relevant ratio threshold.

Preferably, the training unit comprises:

a counter, for incrementing an increment counter by one when the web document is stored in the web document set, wherein the initial value of the increment counter is 0 and the increment counter is automatically incremented by one each time a web document is stored in the web document set;

a judging subunit, for judging whether the value of the increment counter is greater than an increment threshold;

a training subunit, for retraining the topic classification model and resetting the increment counter to its initial value when the value of the increment counter is greater than the increment threshold.

Compared with the prior art, the technical solution provided by the invention has the following advantages:

In the technical solution provided by the invention, after a web document is obtained, at least web page title feature information, keyword feature information in the meta-information, description feature information in the meta-information, and body text feature information are extracted from it. Topic relevance analysis is performed on the web document based on this feature information to obtain a classification result, and when the web document is stored in the web document set based on the classification result, the topic classifier is trained based on the increment of web documents in the set. Thus, during crawling with the focused crawler, the topic classification model associated with the crawler is also trained, bringing the model on which the crawler relies closer to the search topic. When the focused crawler then crawls based on the topic classification model, the crawled content is more relevant to the search topic, which improves the precision and recall of crawling.

Moreover, when the embodiment of the invention trains the topic classification model, the feature information used is collected automatically by the focused crawler during crawling, which reduces the manual labeling workload compared with training the topic classification model on manually labeled data. In addition, each time the topic classification model is retrained, the web documents newly added to the web document set are fed into the model as training inputs. Because the training inputs grow, a new topic classification model is obtained, and new topics can be identified based on it.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the focused crawler processing method provided by an embodiment of the invention;

Fig. 2 is a sub-flowchart of the focused crawler processing method provided by an embodiment of the invention;

Fig. 3 is another sub-flowchart of the focused crawler processing method provided by an embodiment of the invention;

Fig. 4 is another flowchart of the focused crawler processing method provided by an embodiment of the invention;

Fig. 5 is a schematic structural diagram of the focused crawler processing apparatus provided by an embodiment of the invention;

Fig. 6 is a schematic structural diagram of the extraction unit in the focused crawler processing apparatus provided by an embodiment of the invention;

Fig. 7 is a schematic structural diagram of the judging unit in the focused crawler processing apparatus provided by an embodiment of the invention;

Fig. 8 is another schematic structural diagram of the focused crawler apparatus provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Referring to Fig. 1, which shows a flowchart of the focused crawler processing method provided by an embodiment of the invention, the method may comprise the following steps:

101: Obtain the web document corresponding to a uniform resource locator (URL) in the to-be-crawled queue. In this embodiment, the focused crawler may request page resources using existing techniques, and may likewise use existing techniques to parse the URLs in each request and add them to the to-be-crawled queue.

For example, the focused crawler uses Apache HttpClient, from an open-source Hypertext Transfer Protocol (HTTP) toolkit, to request page resources, where Apache HttpClient is wrapped with the native multithreading package provided in the Java Development Kit (JDK) to obtain a tool that issues page resource requests in parallel. The native multithreading package provided in the JDK is also used for parsing, and the parsed URLs are added to the to-be-crawled queue.
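The parallel page requests of step 101 can be sketched as follows. The embodiment names Apache HttpClient and the JDK's multithreading package; this sketch instead uses Python's standard-library thread pool as a stand-in, and the `fetch` callable, `fetch_batch` name, and batch parameters are assumptions for illustration. In practice `fetch` could wrap `urllib.request.urlopen`.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def fetch_batch(crawl_queue, fetch, batch_size=4, max_workers=4):
    """Pop up to batch_size URLs from the queue and fetch them in parallel.

    Returns a list of (url, document) pairs in queue order.
    """
    count = min(batch_size, len(crawl_queue))
    urls = [crawl_queue.popleft() for _ in range(count)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of its input iterable.
        documents = list(pool.map(fetch, urls))
    return list(zip(urls, documents))
```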

102: Extract feature information from the web document, wherein the feature information comprises at least web page title feature information, keyword feature information in the meta-information, description feature information in the meta-information, and web page body text feature information.

That is, the embodiment extracts feature information at least from the title, the meta-information, and the body text of the web document. These three parts, especially the title and the meta-information, can indicate the topic of the web document, so the feature information extracted from them fits the topic more closely.

Meta-information is the summary information about the web document, such as keywords and subject, contained in the meta tags of the HyperText Markup Language (HTML) file corresponding to the web document. The various kinds of feature information in the meta-information can be extracted from the content of the meta tags. This embodiment uses the keyword information and the description meta-information in the meta-information: keyword feature information is extracted from the keyword information, and description feature information is extracted from the description meta-information.

Feature information is extracted from the keyword information and the description meta-information for the following reason. The keyword information is summarizing keyword text written by the web page developer; it contains at least the key topics of the web document, in a form such as: <meta name="keywords" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe">. The description meta-information is the description given in a meta tag; its content is similar to the keyword information and likewise summarizes the main subject of the web page, for example: <meta name="Description" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe. Poland is the largest economy in Central and Eastern Europe.">. Because the keyword information and the description meta-information are highly relevant to the topic of the web document, features are preferably extracted from them.
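Reading the title and the two meta tags out of a page can be sketched with Python's standard `html.parser` module. The class and function names are introduced here for illustration, and the text does not prescribe any particular HTML parser.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute values keep their case, so normalize the name.
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return (title, keywords, description) from an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    return (parser.title.strip(),
            parser.meta.get("keywords", ""),
            parser.meta.get("description", ""))
```

The name comparison is lowercased because the two examples in the text spell the attribute value differently ("keywords" and "Description").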

In this embodiment, the process of obtaining the above feature information is shown in Fig. 2 and may comprise the following steps:

201: Perform word segmentation on the title of the web document to obtain a first segmentation result, and obtain a 1-tuple set of the title based on the first segmentation result. It will be understood that the title is a piece of descriptive text that summarizes the web document, and it is a good indicator of whether the web document is relevant to the topic.

The title may be segmented using an existing word segmentation technique, for example the Stanford word segmenter, to obtain the first segmentation result, which is the set of words produced by segmenting the title. The 1-tuple set is then obtained according to the order in which the words of the first segmentation result appear in the title.

For example, after segmentation of "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe", the 1-tuple set obtained is: (Poland), (applies), (to join), (AIIB), (first), (Central and Eastern Europe), (applicant country).

Note that after segmentation, following the order in which words appear in the document, each segmentation result can be regarded as a time series in which each word occupies a time t. Each tuple in the 1-tuple set is the word at the current time t, w(t). Likewise, each 2-tuple in the 2-tuple set is the combination of the words at time t-1 and time t, (w(t-1), w(t)). For "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe", the 2-tuple set is (Poland, applies), (applies, to join), (to join, AIIB), (AIIB, first), (first, Central and Eastern Europe), (Central and Eastern Europe, applicant country).
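The construction of the 1-tuples w(t) and 2-tuples (w(t-1), w(t)) from a segmented word sequence can be sketched as below. The function names are introduced here for illustration, and the input is assumed to be an already segmented list of words, since the segmenter itself is outside this sketch.

```python
def one_tuples(words):
    """Each 1-tuple is the word at position t: (w(t),)."""
    return [(w,) for w in words]


def two_tuples(words):
    """Each 2-tuple pairs the words at positions t-1 and t: (w(t-1), w(t))."""
    return [(words[t - 1], words[t]) for t in range(1, len(words))]


# Segmented form of the example title, in order of appearance.
title_words = ["Poland", "applies", "to join", "AIIB", "first",
               "Central and Eastern Europe", "applicant country"]
title_1tuples = one_tuples(title_words)
title_2tuples = two_tuples(title_words)
```

A sequence of n words yields n 1-tuples and n-1 2-tuples, which matches the seven 1-tuples and six 2-tuples of the example title.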

202: Use the first feature function to judge the relation between each word in the title and the 1-tuple set of the title, obtaining a title feature vector, which indicates the relation between each word in the title and the 1-tuple set. In this embodiment, the first feature function has the following form:

As can be seen from the first feature function, when a word w in the title belongs to a tuple in the 1-tuple set, the feature value is 1; otherwise the feature value is 0. The first feature function thus yields a title feature vector composed of 0s and 1s.
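Consistent with this description, the first feature function can be written as an indicator function. The symbols $f_1$, $w$, and $U_{\text{title}}$ (the 1-tuple set of the title) are notation assumed here for illustration only:

```latex
f_1(w) =
\begin{cases}
1, & \text{if } (w) \in U_{\text{title}} \\
0, & \text{otherwise}
\end{cases}
```

Evaluating $f_1$ once per candidate word gives one 0/1 component of the title feature vector; how the candidate words are chosen is not stated in the text.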

203: Perform word segmentation on the keyword information in the meta-information of the web document to obtain a second segmentation result, and obtain a 1-tuple set of the keyword information based on the second segmentation result. The keyword information has a form such as <meta name="keywords" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe">. To segment it, first extract the information in the content attribute, "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe", then perform word segmentation to obtain the 1-tuple set. For this example, the 1-tuple set built is: (Poland), (applies), (to join), (AIIB), (first), (Central and Eastern Europe), (applicant country).

204: Use the second feature function to judge the relation between each keyword in the keyword information and the 1-tuple set of the keyword information, obtaining a keyword feature vector, which indicates the relation between each keyword in the keyword information and the 1-tuple set of the keyword information. In this embodiment, the second feature function has the following form:

The second feature function yields a keyword feature vector composed of 0s and 1s.

205: Perform word segmentation on the description meta-information in the meta-information of the web document to obtain a third segmentation result, and obtain a 1-tuple set of the description meta-information based on the third segmentation result. The description meta-information has a form such as: <meta name="Description" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe. Poland is the largest economy in Central and Eastern Europe.">. To segment it, first extract the information in the content attribute, "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe. Poland is the largest economy in Central and Eastern Europe.", then perform word segmentation to obtain the 1-tuple set.

206: Use the third feature function to judge the relation between each page description word in the description meta-information and the 1-tuple set of the description meta-information, obtaining a description feature vector, which indicates the relation between each page description word in the description meta-information and the 1-tuple set of the description meta-information. In this embodiment, the third feature function has the following form:

The third feature function yields a description feature vector composed of 0s and 1s.

207: Process the body text of the web document to obtain a 1-tuple set and a 2-tuple set of the body text. After a web document is obtained, the body text must be extracted from it. For example, the body text can be extracted with CXExtractor, the open-source code for the open-source algorithm "General web page body text extraction based on line-block distribution function" proposed by Harbin Institute of Technology, China.

After the body text is extracted, a series of preprocessing steps can be applied to it, such as using regular expressions to filter out or replace the special characters in the body text, as shown in Table 1:

Table 1: Special characters in web page body text

After the special characters are removed, the body text is segmented, and stop words are removed from the segmentation result according to a stop-word list, so that the final segmentation result of the body text contains neither special characters nor stop words.

According to the order in which the words of the body text segmentation result appear in the body text, the 1-tuples and 2-tuples of the words are built. The 1-tuples form the 1-tuple set of the body text, and the 2-tuples form the 2-tuple set of the body text.

For example, if part of the body text is "Poland is willing to join the China-led AIIB with the status of a founding member state", the 1-tuple set is: (Poland), (willing), (founding), (member state), (status), (join), (China), (led), (AIIB); and the 2-tuple set is: (Poland, willing), (willing, founding), (founding, member state), (member state, status), (status, join), (join, China), (China, led), (led, AIIB).
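The body text preprocessing of step 207 can be sketched as below. The contents of Table 1 and the stop-word list are not available in the text, so `SPECIAL_CHARS` and `STOP_WORDS` are hypothetical stand-ins, and whitespace splitting replaces a real word segmenter purely for illustration.

```python
import re

# Hypothetical pattern: HTML entities such as &nbsp; plus tabs and newlines.
SPECIAL_CHARS = re.compile(r"&[a-zA-Z]+;|[\t\r\n]+")

# Hypothetical stop-word list; a real list would be loaded from a file.
STOP_WORDS = {"the", "a", "with", "to", "of"}


def preprocess(text, stop_words=STOP_WORDS):
    """Strip special characters, split into words, and drop stop words."""
    cleaned = SPECIAL_CHARS.sub(" ", text)
    tokens = cleaned.split()  # stand-in for a real word segmenter
    return [t for t in tokens if t.lower() not in stop_words]


body_words = preprocess("Poland&nbsp;is willing\tto join the AIIB")
```

The resulting word list is what the 1-tuple and 2-tuple sets of the body text would then be built from.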

208: use fourth feature function, the relation of one tuple-set of each keyword in Web page text and Web page text is judged, obtain the first eigenvector of Web page text, the first eigenvector of Web page text is used to indicate the relation of a tuple-set of each keyword and Web page text in Web page text.In embodiments of the present invention, the form of fourth feature function is as follows:

With the fourth feature function, a first feature vector consisting of 0s and 1s is obtained.

209: Use the fifth feature function to judge the relation between each keyword in the body text and the 2-tuple set of the body text, obtaining the second feature vector of the body text. The second feature vector indicates the relation between each keyword in the body text and the 2-tuple set of the body text. In this embodiment of the present invention, the form of the fifth feature function is as follows:

With the fifth feature function, a second feature vector consisting of 0s and 1s is obtained.
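The exact forms of the fourth and fifth feature functions are not reproduced in this text. One plausible reading, offered here only as an assumption, is a membership indicator: for each entry of a fixed topic vocabulary, the vector holds 1 if the entry occurs in the tuple set and 0 otherwise. The name `indicator_vector` and the `vocabulary` parameter are hypothetical.

```python
def indicator_vector(vocabulary, tuple_set):
    """Build a 0/1 vector: 1 where a vocabulary entry occurs in the tuple set.

    Assumption: `vocabulary` is a fixed, ordered list of tuples (1-tuples for
    the first feature vector, 2-tuples for the second). The patent's own
    feature functions may differ from this indicator form.
    """
    members = set(tuple_set)
    return [1 if entry in members else 0 for entry in vocabulary]
```

Under this reading, every web document yields vectors of the same length, which is what allows them to be concatenated and fed to a classifier.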

103: Perform topic relevance classification on the web document based on the feature information to obtain a classification result. The classification result indicates whether the web document is relevant to the search topic of the focused crawler; the search topic may be a topic entered by the user. In this embodiment of the present invention, a topic classification model may be used for this judgment, as follows:

The feature information above, namely the title feature vector, the keyword feature vector, the description feature vector, the first feature vector and the second feature vector, is concatenated into a total feature vector of one row and multiple columns. The total feature vector is then input to the topic classification model, and the output of the model is the classification result, which indicates whether the web document is relevant to the search topic. Specifically, an output of 1 means that the web document is relevant to the search topic of the focused crawler, and an output of 0 means that it is not.
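The concatenation and classification step can be sketched as follows. This is a minimal illustration that assumes a logistic regression model with already trained `weights` and `bias`; the text does not state which model type the topic classifier uses at this point, and the `cutoff` of 0.5 is likewise an assumption.

```python
import math


def total_feature_vector(*vectors):
    """Concatenate the per-field 0/1 vectors into one row vector."""
    return [value for vector in vectors for value in vector]


def classify(features, weights, bias, cutoff=0.5):
    """Return (label, probability); label 1 means relevant to the search topic.

    Assumption: a logistic model, p = 1 / (1 + exp(-(bias + w . x))).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return (1 if probability >= cutoff else 0), probability
```

Returning the probability together with the label is convenient because the storage decision that follows compares this probability with a threshold.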

104: Based on the classification result, determine whether to store the web document in the web document set. In this embodiment of the present invention, the web document set stores multiple web documents, which can further serve as training data for the topic classification model. In other words, this embodiment uses a semi-supervised learning approach to extend the topic classification model: when the focused crawler makes predictions with the topic classification model, web documents can be handled according to the classification result, enlarging the pool of training documents in the web document set.

The way web documents are stored is shown in Fig. 3 and can comprise the following steps:

301: Judge whether the classification result indicates that the web document is relevant to the search topic; if so, perform step 302; if not, perform step 305.

302: When the classification result indicates that the web document is relevant to the search topic, judge whether the topic relevance probability of the web document is greater than the topic relevance probability threshold; if so, perform step 303; if not, perform step 304.

303: When the topic relevance probability of the web document is greater than the topic relevance probability threshold, store the web document in the web document set.

304: When the topic relevance probability of the web document is less than or equal to the topic relevance probability threshold, discard the web document.

305: When the classification result indicates that the web document is not relevant to the search topic, judge whether the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is less than the topic-relevant ratio threshold; if so, perform step 306; if not, perform step 307. Here the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of topic-irrelevant documents is the number of web documents not relevant to the search topic.

306: When the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is less than the topic-relevant ratio threshold, store the web document in the web document set.

307: When the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is greater than or equal to the topic-relevant ratio threshold, discard the web document.
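Steps 301 to 307 amount to a single decision function, sketched below. The function and parameter names are assumptions introduced for illustration. The ratio test is written without a division so that an empty set of topic-irrelevant documents does not raise an error; in that case the document is discarded, which is one possible reading of step 307 and is not specified by the text.

```python
def should_store(label, relevance_prob, n_relevant, n_irrelevant,
                 prob_threshold, ratio_threshold):
    """Decide whether a classified web document joins the web document set."""
    if label == 1:
        # Steps 302-304: keep only confidently relevant documents.
        return relevance_prob > prob_threshold
    # Steps 305-307: n_relevant / n_irrelevant < ratio_threshold,
    # rewritten as a multiplication to avoid dividing by zero.
    return n_relevant < ratio_threshold * n_irrelevant
```

For example, with a probability threshold of 0.8, a relevant document with probability 0.9 is stored, while one with probability exactly 0.8 is discarded because step 304 covers the "less than or equal" case.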

As this storage procedure shows, the embodiment of the present invention decides whether to store a web document in the web document set based on the topic relevance probability threshold and the topic-relevant ratio threshold. In particular, among the documents classified as topic-relevant, only those whose topic relevance probability exceeds the threshold are stored, so that the topic-relevant documents used to train the topic classification model are closer to the search topic, which improves the accuracy of the model.

The topic relevance probability threshold and the topic-relevant ratio threshold are empirical values set manually. Different values can be chosen for different application scenarios; the embodiment of the present invention does not limit their specific values.

105: When the web document is stored in the web document set based on the classification result, train the topic classification model associated with the focused crawler according to the increment of web documents in the web document set. One feasible approach is as follows: each time a web document is stored in the web document set, an increment counter, whose initial value is 0, is incremented by one. It is then judged whether the value of the increment counter is greater than an increment threshold; if so, the topic classification model is retrained and the increment counter is reset to its initial value.

That is, an increment counter tracks how many web documents have been added to the web document set, and when that increment exceeds the increment threshold, the topic classification model needs to be retrained. Retraining uses the web documents stored in the web document set, with the feature information of each document extracted automatically in the manner shown in Fig. 2. Because retraining relies on web documents that were labeled automatically, it reduces the workload of manually labeling data.
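The counter-driven retraining can be sketched as follows. This is a minimal illustration; the class name `IncrementalTrainer` and the `retrain` callable are assumptions standing in for whatever training routine the topic classification model uses.

```python
class IncrementalTrainer:
    """Retrain the topic model once enough new documents have been stored."""

    def __init__(self, increment_threshold, retrain):
        self.increment_threshold = increment_threshold
        self.retrain = retrain   # hypothetical callable: retrain(documents)
        self.counter = 0         # initial value of the increment counter
        self.documents = []      # the web document set

    def store(self, document):
        """Store one document; return True if this store triggered retraining."""
        self.documents.append(document)
        self.counter += 1
        if self.counter > self.increment_threshold:
            self.retrain(self.documents)
            self.counter = 0     # reset to the initial value
            return True
        return False
```

Because the comparison is "greater than", a threshold of N triggers retraining on the (N+1)-th stored document after the last reset.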

It should be noted that the first time the topic classification model is trained, the user needs to manually label a small number of web documents. These manually labeled web documents serve as the training data in the initial web document set, and subsequent training is based on the web documents stored in the web document set. The manually labeled web documents may come from pages crawled at random from the internet by a general-purpose web crawler, from pages collected manually from the internet, or from an open-source web page corpus.

As can be seen from the above technical solution, after obtaining a web document, the focused crawler processing method provided by the embodiment of the present invention extracts from it at least the web page title feature information, the keyword feature information in the meta information, the description feature information in the meta information, and the body text feature information. It performs topic relevance analysis on the web document based on this feature information to obtain a classification result and, when the web document is stored in the web document set based on the classification result, trains the topic classifier according to the increment of web documents in the web document set. The topic classification model associated with the focused crawler can therefore be trained during crawling, bringing the model that the crawler relies on closer to the search topic. As a result, the content crawled with the topic classification model is more relevant to the search topic, which improves the precision and recall of crawling.

Moreover, the feature information used to train the topic classification model is collected automatically by the focused crawler during crawling, which reduces the manual labeling workload compared with training the model on manually labeled data.

Referring to Fig. 4, which shows another flowchart of the focused crawler processing method provided by the embodiment of the present invention, the method can comprise the following steps:

401: Obtain the web document corresponding to a URL in the crawl queue.

402: Judge whether the page corresponding to the URL is a navigation page; if so, perform step 403; if not, perform step 404. In this embodiment of the present invention, existing techniques, such as a logistic regression model, can be used to judge whether the page is a navigation page.

403: Parse the navigation page to obtain the URLs in it, and write the obtained URLs into the crawl queue.

It can be understood that web pages are divided by function into navigation pages and content pages. A navigation page contains no substantive content, only a series of anchor texts used for navigation; a content page contains substantive content and few anchor texts. Therefore, once the page corresponding to a URL is judged to be a navigation page, the URLs of content pages need to be obtained from it and added to the crawl queue, so that the web documents corresponding to those URLs can be classified.

404: Extract feature information from the web document, where the feature information comprises at least the web page title feature information, the keyword feature information in the meta information, the description feature information in the meta information, and the body text feature information.

405: Perform topic relevance classification on the web document based on the feature information to obtain a classification result.

406: Based on the classification result, determine whether to store the web document in the web document set.

407: When the web document is stored in the web document set based on the classification result, train the topic classification model associated with the focused crawler according to the increment of web documents in the web document set.

Steps 404 to 407 are implemented in the same way as steps 102 to 105 above and are not described again here.
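The control flow of steps 401 to 407 can be sketched as a queue-driven loop. This is a minimal illustration under stated assumptions: `fetch`, `is_navigation_page`, `extract_links` and `handle_content_page` are hypothetical callables supplied by the caller, and the `seen` set and the `max_pages` limit are additions for the sketch that the text does not mention.

```python
from collections import deque


def crawl(seed_urls, fetch, is_navigation_page, extract_links,
          handle_content_page, max_pages=1000):
    """Process URLs from the crawl queue; return the number of pages fetched."""
    queue = deque(seed_urls)
    seen = set(seed_urls)        # assumption: skip URLs already queued
    processed = 0
    while queue and processed < max_pages:
        url = queue.popleft()
        page = fetch(url)                         # step 401
        processed += 1
        if is_navigation_page(page):              # step 402
            for link in extract_links(page):      # step 403
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        else:
            handle_content_page(url, page)        # steps 404 to 407
    return processed
```

Navigation pages only feed the queue, so feature extraction, classification and storage run on content pages alone.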

As can be seen from the above technical solution, the focused crawler processing method provided by the embodiment of the present invention can judge whether the page corresponding to a URL is a navigation page. When it is, feature extraction and classification are not performed on it, which reduces the amount of data processed.

The precision and recall of the focused crawler processing method provided above were verified. The initial web document set consisted of 1000 web documents collected manually from the internet, each labeled by hand, as the training data of the topic classification model. The topic relevance probability threshold was set to 0.8; the topic-relevant ratio threshold was set to 0.75; the increment threshold was 1000; and the gradient descent step size used when training the parameters of the logistic regression model was 0.05. The application environment of the method was as follows:

Central processing unit (CPU): Intel E5-2620;

Random access memory (RAM): 64 GB;

Operating system: Windows 7 Ultimate SP1;

Java virtual machine environment: JDK 1.6;

Network bandwidth: 100 Mbps;

In this application environment, 10 threads were used for web page requests, 2 threads for page URL parsing, and 1 thread for body text extraction.

Based on the above application environment, the run results of the focused crawler are shown in Table 2:

Table 2: Run result statistics

Item	Quantity
Web pages crawled	370,561

A random sample of 100 pages from the run results was evaluated; its confusion matrix is shown in Table 3:

Table 3: Confusion matrix

From the confusion matrix, the precision is 87% and the recall is 82.1%.
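Precision and recall follow from the confusion matrix by the standard definitions: precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the counts of true positives, false positives and false negatives. The sketch below only illustrates these definitions; the cell values of Table 3 are not reproduced in this text, so no counts are assumed here.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from confusion matrix counts.

    Returns 0.0 for a measure whose denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```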

For brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Referring to Fig. 5, which shows a schematic structural diagram of the focused crawler processing apparatus provided by the embodiment of the present invention, the apparatus can comprise: an acquiring unit 11, an extraction unit 12, a classification unit 13, a judging unit 14 and a training unit 15.

Acquiring unit 11 is configured to obtain the web document corresponding to a URL in the crawl queue. In this embodiment of the present invention, acquiring unit 11 uses existing techniques to request page resources for the focused crawler, and uses existing techniques to parse the URLs obtained from each request and add them to the crawl queue.

Extraction unit 12 is configured to extract feature information from the web document, where the feature information comprises at least the web page title feature information, the keyword feature information in the meta information, the description feature information in the meta information, and the body text feature information. That is, this embodiment extracts feature information at least from the web page title, the meta information and the body text of the web document. These three parts, especially the title and the meta information, can indicate the topic of the web document, so the feature information extracted from them fits the topic more closely.

Preferably, extraction unit 12 can adopt the structure shown in Fig. 6, comprising: a first word segmentation subunit 121, a title feature vector obtaining subunit 122, a second word segmentation subunit 123, a keyword feature vector obtaining subunit 124, a third word segmentation subunit 125, a description feature vector obtaining subunit 126, a fourth word segmentation subunit 127, a first feature vector obtaining subunit 128 and a second feature vector obtaining subunit 129.

First word segmentation subunit 121 is configured to segment the title of the web document into words to obtain a first segmentation result and, based on the first segmentation result, obtain the 1-tuple set of the title.

Title feature vector obtaining subunit 122 is configured to use the first feature function to judge the relation between each word in the title and the 1-tuple set of the title, obtaining the title feature vector, which indicates the relation between each word in the title and the 1-tuple set. In this embodiment of the present invention, the form of the first feature function is as follows:

Second word segmentation subunit 123 is configured to segment the keyword information of the meta information in the web document into words to obtain a second segmentation result and, based on the second segmentation result, obtain the 1-tuple set of the keyword information.

Keyword feature vector obtaining subunit 124 is configured to use the second feature function to judge the relation between each keyword in the keyword information and the 1-tuple set of the keyword information, obtaining the keyword feature vector, which indicates the relation between each keyword in the keyword information and the 1-tuple set of the keyword information. In this embodiment of the present invention, the form of the second feature function is as follows:

Third word segmentation subunit 125 is configured to segment the description meta information of the meta information in the web document into words to obtain a third segmentation result and, based on the third segmentation result, obtain the 1-tuple set of the description meta information.

Description feature vector obtaining subunit 126 is configured to use the third feature function to judge the relation between each web page description word in the description meta information and the 1-tuple set of the description meta information, obtaining the description feature vector, which indicates the relation between each web page description word in the description meta information and the 1-tuple set of the description meta information. In this embodiment of the present invention, the form of the third feature function is as follows:

Fourth word segmentation subunit 127 is configured to process the body text of the web document to obtain the 1-tuple set and the 2-tuple set of the body text; for the specific processing, refer to the description in the method embodiments.

First feature vector obtaining subunit 128 is configured to use the fourth feature function to judge the relation between each keyword in the body text and the 1-tuple set of the body text, obtaining the first feature vector of the body text, which indicates the relation between each keyword in the body text and the 1-tuple set of the body text. In this embodiment of the present invention, the form of the fourth feature function is as follows:

Second feature vector obtaining subunit 129 is configured to use the fifth feature function to judge the relation between each keyword in the body text and the 2-tuple set of the body text, obtaining the second feature vector of the body text, which indicates the relation between each keyword in the body text and the 2-tuple set of the body text. In this embodiment of the present invention, the form of the fifth feature function is as follows:

Classification unit 13 is configured to perform topic relevance classification on the web document based on the feature information to obtain a classification result. The classification result indicates whether the web document is relevant to the search topic of the focused crawler; the search topic may be a topic entered by the user. In this embodiment of the present invention, a topic classification model may be used for this judgment, as follows:

Judging unit 14 is configured to determine, based on the classification result, whether to store the web document in the web document set. In this embodiment of the present invention, the web document set stores multiple web documents, which can further serve as training data for the topic classification model. In other words, this embodiment uses a semi-supervised learning approach to extend the topic classification model: when the focused crawler makes predictions with the topic classification model, web documents can be handled according to the classification result, enlarging the pool of training documents in the web document set.

Judging unit 14 can adopt the structure shown in Fig. 7 to determine whether to store a web document, and can specifically comprise: a first judging subunit 141, a first storing subunit 142, a second judging subunit 143 and a second storing subunit 144.

First judging subunit 141 is configured to judge, when the classification result indicates that the web document is relevant to the search topic, whether the topic relevance probability of the web document is greater than the topic relevance probability threshold, where the search topic is the topic that the focused crawler crawls.

First storing subunit 142 is configured to store the web document in the web document set when the topic relevance probability of the web document is greater than the topic relevance probability threshold.

Second judging subunit 143 is configured to judge, when the classification result indicates that the web document is not relevant to the search topic, whether the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is less than the topic-relevant ratio threshold. Here the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of topic-irrelevant documents is the number of web documents not relevant to the search topic.

Second storing subunit 144 is configured to store the web document in the web document set when the ratio of the number of topic-relevant documents to the number of topic-irrelevant documents in the web document set is less than the topic-relevant ratio threshold.

Training unit 15 is configured to train, when the web document is stored in the web document set based on the classification result, the topic classification model associated with the focused crawler according to the increment of web documents in the web document set.

In this embodiment of the present invention, training unit 15 can comprise a counter, a judging subunit and a training subunit. The counter is configured to increment an increment counter by one each time a web document is stored in the web document set, where the initial value of the increment counter is 0.

The judging subunit is configured to judge whether the value of the increment counter is greater than the increment threshold.

The training subunit is configured to retrain the topic classification model when the value of the increment counter is greater than the increment threshold, and to reset the increment counter to its initial value.

As can be seen from the above technical solution, after obtaining a web document, the focused crawler processing apparatus provided by the embodiment of the present invention extracts from it at least the web page title feature information, the keyword feature information in the meta information, the description feature information in the meta information, and the body text feature information. It performs topic relevance analysis on the web document based on this feature information to obtain a classification result and, when the web document is stored in the web document set based on the classification result, trains the topic classifier according to the increment of web documents in the web document set. The topic classification model associated with the focused crawler can therefore be trained during crawling, bringing the model that the crawler relies on closer to the search topic. As a result, the content crawled with the topic classification model is more relevant to the search topic, which improves the precision and recall of crawling.

Referring to Fig. 8, which shows another schematic structural diagram of the focused crawler processing apparatus provided by the embodiment of the present invention, the apparatus can, on the basis of Fig. 5, further comprise a page judging unit 16, configured to judge whether the page corresponding to a URL is a navigation page. If so, it triggers acquiring unit 11 to parse the navigation page, obtain the URLs in it, and write the obtained URLs into the crawl queue; if not, it triggers extraction unit 12.

As can be seen from the above technical solution, the focused crawler processing apparatus provided by the embodiment of the present invention can judge whether the page corresponding to a URL is a navigation page. When it is, feature extraction and classification are not performed on it, which reduces the amount of data processed.

It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments can be cross-referenced. Because the apparatus embodiments are substantially similar to the method embodiments, they are described more briefly; for relevant details, refer to the description of the method embodiments.

Finally, it should also be noted that in this document relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements comprises not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises it.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A focused crawler processing method, characterized in that the method comprises:

2. The method according to claim 1, characterized in that, after obtaining the web document corresponding to a uniform resource locator in the crawl queue, the method further comprises: judging whether the page corresponding to the uniform resource locator is a navigation page;

3. The method according to claim 1 or 2, characterized in that extracting feature information from the web document comprises:

4. The method according to claim 3, characterized in that determining, based on the classification result, whether to store the web document in the web document set comprises:

5. The method according to claim 4, characterized in that, when the web document is stored in the web document set based on the classification result, training the topic classification model associated with the focused crawler according to the increment of web documents in the web document set comprises:

6. A focused crawler processing apparatus, characterized in that the apparatus comprises:

7. The apparatus according to claim 6, characterized in that the apparatus further comprises: a page judging unit, configured to judge whether the page corresponding to the uniform resource locator is a navigation page; if so, to trigger the acquiring unit to parse the navigation page, obtain the uniform resource locators in the navigation page, and write the obtained uniform resource locators into the crawl queue; and if not, to trigger the extraction unit.

8. The apparatus according to claim 6 or 7, characterized in that the extraction unit comprises:

9. The apparatus according to claim 8, characterized in that the judging unit comprises:

10. The apparatus according to claim 9, characterized in that the training unit comprises: