CN102831131A

CN102831131A - Method and device for establishing labeling webpage linguistic corpus

Info

Publication number: CN102831131A
Application number: CN2011101720928A
Authority: CN
Inventors: 付雷; 夏迎炬; 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-06-16
Filing date: 2011-06-16
Publication date: 2012-12-19
Anticipated expiration: 2031-06-16
Also published as: CN102831131B

Abstract

Embodiments of the invention disclose a method and a device for building a labeled web page corpus. The method comprises: generating an initial seed labeled corpus from pre-selected initial seed web pages; obtaining a preset number of related seed web pages from a search engine according to keywords of the initial seed labeled corpus; labeling the related seed web pages according to the initial seed labeled corpus to obtain a related seed labeled corpus; and judging whether the related seed labeled corpus and the initial seed labeled corpus satisfy a preset condition. If so, the related seed labeled corpus and the initial seed labeled corpus are combined into the labeled web page corpus; if not, the related seed labeled corpus is used as the initial seed labeled corpus and the step of obtaining a preset number of related seed web pages from the search engine is performed again. With the embodiments of the invention, a large-scale standard labeled corpus can be formed from a small amount of seed labeled data.

Description

Method and device for building a labeled web page corpus

Technical field

The present invention relates generally to the field of internet data processing, and in particular to a method and device for building a labeled web page corpus.

Background technology

The data resources of the internet are extremely rich and offer a potential data source for many data-intensive applications. However, web pages on the internet are structurally complex, and the main content of a page is often buried in noise such as advertisements or navigation. To use the internet as a data source for research, the various kinds of information in a web page must therefore be separated and classified; that is, the content of the web page must be labeled.

Web page corpora with fine-grained labels are fundamental to many applications, such as web search, web page classification, and web content extraction. Fine-grained labeling here means assigning the text that appears in a web page to one of eight categories: title, author, time, body text, comment, advertisement, peer link, and other. A corpus labeled in this way can serve as training data for applications such as content extraction or clustering and classification, and can also be used in the preprocessing stage of applications such as retrieval, thereby improving retrieval precision.

In the traditional approach to building a labeled web page corpus, web pages are generally labeled by hand: a technician inspects the full content of a web page and labels each part of it according to what is found.

However, because the number of web pages on the internet is effectively unlimited, manual labeling demands enormous effort from technicians. Moreover, parts of different web pages are sometimes similar, so technicians end up repeatedly labeling identical content. This wastes human resources and makes it hard to grow the corpus to a large scale.

Summary of the invention

In view of this, embodiments of the invention provide a method and device for building a labeled web page corpus that, given a small amount of seed labeled data, iteratively expand the labeled data to form a large-scale standard labeled corpus.

According to one aspect of the embodiments of the invention, a method for building a labeled web page corpus is provided, comprising: generating an initial seed labeled corpus from pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types and the initial seed labeled corpus consists of seed web pages in which the body text and title have been labeled; obtaining a preset number of related seed web pages from a search engine according to keywords of the initial seed labeled corpus; labeling the related seed web pages according to the initial seed labeled corpus to obtain a related seed labeled corpus; and judging whether the related seed labeled corpus and the initial seed labeled corpus satisfy a preset condition; if so, combining the related seed labeled corpus and the initial seed labeled corpus into the labeled web page corpus; if not, using the related seed labeled corpus as the initial seed labeled corpus and performing again the step of obtaining a preset number of related seed web pages from the search engine.

According to another aspect of the embodiments of the invention, a device for building a labeled web page corpus is provided, comprising: a generation module, configured to generate an initial seed labeled corpus from pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types and the initial seed labeled corpus consists of seed web pages in which the body text and title have been labeled; an acquisition module, configured to obtain a preset number of related seed web pages from a search engine according to keywords of the initial seed labeled corpus; a labeling module, configured to label the related seed web pages according to the initial seed labeled corpus to obtain a related seed labeled corpus; a judging module, configured to judge whether the related seed labeled corpus and the initial seed labeled corpus satisfy a preset condition; a combining module, configured to combine the related seed labeled corpus and the initial seed labeled corpus into the labeled web page corpus when the result of the judging module is yes; and a trigger module, configured to use the related seed labeled corpus as the initial seed labeled corpus and trigger the acquisition module when the result of the judging module is no.

In addition, according to a further aspect of the invention, a storage medium is provided. The storage medium comprises machine-readable program code which, when executed on an information processing device, causes the information processing device to carry out the above method for building a labeled web page corpus.

In addition, according to yet another aspect of the invention, a program product is provided. The program product comprises machine-executable instructions which, when executed on an information processing device, cause the information processing device to carry out the above method for building a labeled web page corpus.

With the above method according to the embodiments of the invention, given a small amount of seed labeled data, the labeled data can be expanded iteratively to form a large-scale standard labeled corpus. The method does not require identical web page content to be labeled repeatedly by hand, so it saves human resources and material cost while allowing the labeled web page corpus to reach a larger scale.

Other aspects of the embodiments of the invention are given in the description below, where the detailed description serves to fully disclose preferred embodiments of the invention without limiting them.

Description of drawings

The above and other objects and advantages of the embodiments of the invention are further described below with reference to specific embodiments and the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.

Fig. 1 is a flowchart of method embodiment 1 provided by the embodiments of the invention;

Fig. 2 is a flowchart of step S102 in method embodiment 1;

Fig. 3 is a flowchart of step S103 in method embodiment 1;

Fig. 4 is a flowchart of step S301 in method embodiment 1;

Fig. 5 is a flowchart of step S302 in method embodiment 1;

Fig. 6 is a flowchart of method embodiment 2 provided by the embodiments of the invention;

Fig. 7 is a flowchart of method embodiment 3 provided by the embodiments of the invention;

Fig. 8 is a schematic diagram of device embodiment 1 provided by the embodiments of the invention;

Fig. 9 is a schematic diagram of acquisition module 802 in device embodiment 1;

Fig. 10 is a schematic diagram of labeling module 803 in device embodiment 1;

Fig. 11 is a schematic diagram of first labeling submodule 1001 in device embodiment 1;

Fig. 12 is a schematic diagram of second labeling submodule 1002 in device embodiment 1;

Fig. 13 is a schematic diagram of device embodiment 2 provided by the embodiments of the invention;

Fig. 14 is a schematic diagram of device embodiment 3 provided by the embodiments of the invention;

Fig. 15 is a block diagram of an exemplary structure of a personal computer serving as the information processing device used in the embodiments of the invention.

Detailed description of embodiments

Embodiments of the invention are described below with reference to the accompanying drawings.

The embodiments of the invention provide a solution to the problems of the prior art. Specifically, referring to Fig. 1, method embodiment 1 for building a labeled web page corpus can comprise:

S101: generate an initial seed labeled corpus from pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types and the initial seed labeled corpus consists of seed web pages in which the body text and title have been labeled.

In practice, a number of web pages of different types can be chosen in advance. For example, a small sample of pages is chosen from each of several types such as blogs, news, and finance; the number of pages per type is not limited and may be, for example, 100 per type. The types may vary with circumstances: the category scheme of the Sina website, for instance, differs from that of Sohu.com. Differences in how types are classified do not affect the implementation of the embodiments, so the embodiments do not limit the number or types of the pre-selected initial seed web pages.

The selected pages serve as initial seed web pages and require manual fine-grained labeling: at minimum the title and body text must be labeled, and the remaining parts are labeled as other. The remaining parts may further be labeled in detail as author, time, comment, advertisement, and peer link. The labeled initial seed web pages form the initial seed labeled corpus.

S102: obtain a preset number of related seed web pages from a search engine according to keywords of the initial seed labeled corpus.

Because the initial seed web pages are only a small number of representative pages chosen from different types, other related seed web pages must be expanded from them. On the internet, an article published on an initial seed web page is very likely to be reposted. If the content of an initial seed web page in the initial seed labeled corpus is reposted by another web page that is not in that corpus, the body content of the two pages should in theory be roughly the same and highly similar. The labels on the initial page can therefore be used later to label the related reposting page.

To find the reposting pages related to the initial seed web pages, this step can use a search engine as an auxiliary tool and search with the keywords of the initial seed labeled corpus.

Referring to Fig. 2, in practice S102 can be implemented as follows:

S201: use a word segmentation tool to segment the title and body text of the initial seed web page, to obtain the initial keywords of the initial seed web page.

In the embodiments of the invention, a word segmentation tool (for example, the shareware ICTCLAS) can be used to segment the labeled title and body text of the initial seed web page, yielding the initial keywords of the initial seed web page.

S202: calculate the weight of each initial keyword according to its part of speech, word frequency, and position information.

In this step the weight of each initial keyword is calculated from its part of speech, word frequency, and position. A concrete formula can be as follows:

Weight(W) = Position(W) + Freq(W) + Pos(W). That is, the weight of initial keyword W equals the sum of its position value, word frequency value, and part-of-speech value. Position(W) is the position value: for example, if W appears in both the title and the body text, Position(W) can be 3; if W appears only in the title, Position(W) can be 2; if W appears only in the body text, Position(W) can be 1. Freq(W) is the word frequency value, i.e. the number of occurrences of W divided by the total number of occurrences of all words in the document. Pos(W) is the part-of-speech value: for example, if W is a noun or noun phrase, Pos(W) can be 1, and otherwise 0.

The specific values above are only examples intended to help those skilled in the art understand the embodiments; in practice the way the weight is calculated can be adapted to actual conditions or user requirements.

S203: select several keywords whose weight is greater than a preset threshold as the final keywords.

After the weight of each initial keyword is obtained, the N keywords with the highest weights are selected as the final keywords submitted to the search engine. The value of N affects the number of results the search engine returns and is generally 5 to 15.
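Steps S202 and S203 can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function names, the `(word, tag)` token format, and the noun tag names in `NOUN_TAGS` are assumptions, as is treating the title plus body text as "the document" when computing Freq(W). A real word segmentation tool would supply the tokens.

```python
from collections import Counter

# Assumed part-of-speech tags that count as "noun or noun phrase".
NOUN_TAGS = {"n", "np"}


def position_value(word, title_words, body_words):
    """Position(W): 3 if in title and body, 2 if title only, 1 if body only."""
    in_title = word in title_words
    in_body = word in body_words
    if in_title and in_body:
        return 3
    if in_title:
        return 2
    if in_body:
        return 1
    return 0


def keyword_weights(title_tokens, body_tokens):
    """Weight(W) = Position(W) + Freq(W) + Pos(W) for every word.

    Both arguments are lists of (word, tag) pairs, as a segmenter might
    produce (the pair format is an assumption of this sketch).
    """
    all_tokens = title_tokens + body_tokens
    counts = Counter(word for word, _ in all_tokens)
    total = sum(counts.values())
    title_words = {word for word, _ in title_tokens}
    body_words = {word for word, _ in body_tokens}
    first_tag = {}
    for word, tag in all_tokens:
        first_tag.setdefault(word, tag)  # keep the first tag seen per word

    weights = {}
    for word, count in counts.items():
        position = position_value(word, title_words, body_words)
        freq = count / total
        pos = 1 if first_tag[word] in NOUN_TAGS else 0
        weights[word] = position + freq + pos
    return weights


def top_keywords(weights, n=10):
    """Return the n highest-weighted words (ties broken alphabetically)."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:n]
```

The default `n=10` simply falls inside the 5 to 15 range mentioned in the text; a threshold-based selection, as worded in S203, would filter `weights` by value instead of slicing the sorted list.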

S204: submit the final keywords to a search engine to obtain retrieved seed web pages.

The final keywords are submitted to a search engine, which can be Google, Baidu, Sogou, or the like. The search engine returns search results for the submitted keywords; these results are the reposting pages related to the initial seed web page. The choice of search engine does not affect the implementation of the embodiments, so the invention does not limit the specific search engine used.

S103: label the related seed web pages according to the initial seed labeled corpus, to obtain a related seed labeled corpus.

Once the related seed web pages are obtained, they can be labeled with reference to the initial seed labeled corpus. Specifically, only three categories need be labeled on a related seed web page: body text, title, and other. The body text can be labeled first, then the title, and finally other.

Referring to Fig. 3, in practice step S103 can comprise:

S301: label the body text of the related seed web page according to a subset of the body text labeled in the initial seed labeled corpus.

This step first labels the body text of the related seed web page. Referring to Fig. 4, in practice step S301 can be implemented as follows:

S401: extract a subset of the body text from the initial seed labeled corpus; the subset is the content of any one or more parts of the body text of the initial seed labeled web page.

The subset in this step can be any one paragraph or any one sentence of the body text of the initial seed labeled web page, or the content of any several paragraphs or sentences. For example, the subset can be the entire body text of the initial seed labeled web page, the first and last paragraphs of the body text, or the longest paragraph of the body text.

Taking the case where the subset is the entire body text of the initial seed labeled web page as an example, the boundary text of the body is used to locate the body part of the related seed web page. In a concrete implementation, the beginning text and ending text of the body are taken from the part of the initial seed page labeled "body text". If the total length of the body is L, the first L/5 and the last L/5 of the body can be taken as the beginning text and ending text respectively. Text of other lengths can also be chosen as the beginning text and ending text according to actual conditions.
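The L/5 rule described above can be sketched as a small Python helper. The function name `boundary_texts` and the `divisor` parameter are assumptions introduced for illustration; the source only says that a length of L/5, or some other length, may be used.

```python
def boundary_texts(body, divisor=5):
    """Return (beginning_text, ending_text) of a labeled body.

    Each piece is len(body) // divisor characters long (at least one
    character when the body is non-empty). The default divisor of 5
    follows the L/5 example; other values are equally valid.
    """
    length = max(1, len(body) // divisor)
    return body[:length], body[-length:]
```

For very short bodies the two pieces may overlap, which is harmless for matching but worth keeping in mind when choosing the divisor.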

S402: according to the initial beginning part and initial ending part of the subset, find the corresponding related beginning part and related ending part in the related seed web page.

After the beginning text and ending text of the body of the initial seed labeled web page are obtained, they are matched against the body of the related seed page. The matching positions in the related seed web page are called the related beginning part and the related ending part.

S403: perform content extraction on the related seed web page with a content extraction tool, to obtain an extracted beginning part and an extracted ending part.

This step uses a content extraction tool to extract the body part directly from the related seed web page. The beginning text and ending text of the directly extracted body are called the extracted beginning part and the extracted ending part respectively. Any existing content extraction method can be used; the embodiments do not limit how the extraction is done.

S404: judge whether the related beginning part is identical to the extracted beginning part and whether the related ending part is identical to the extracted ending part; if so, go to step S405; if not, go to step S406.

This step judges whether the related beginning part obtained above is identical to the extracted beginning part, and at the same time whether the related ending part is identical to the extracted ending part. If both are identical, the body of the initial seed web page fully matches the body of the related seed web page, and in step S405 the content between the related beginning part and the related ending part can be labeled "body text" directly. If not, the body of the related seed page is larger than the body of the initial seed page, and in step S406 the directly extracted content of the related seed page, i.e. the content between the extracted beginning part and the extracted ending part, is labeled "body text".

S405: label the content between the related beginning part and the related ending part as body text.

S406: label the content between the extracted beginning part and the extracted ending part as body text.

Note that if, while labeling the body text, the beginning text and ending text of the initial seed page body cannot both be found in the related seed page, the related seed page is not labeled at all; that is, the related seed page is discarded and the next related seed page is tried.
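The decision flow of S402 to S406, including the discard case, can be sketched as follows. This is an illustration under stated assumptions: pages are treated as plain strings, `extract_body` is a hypothetical stand-in for whatever content extraction tool is used, and "identical beginning and ending parts" is interpreted as the extracted body starting and ending with the seed's boundary texts.

```python
def label_body(page_text, begin_text, end_text, extract_body):
    """Return the text to label as body, or None to discard the page.

    page_text    -- plain text of the related seed page (assumed format)
    begin_text   -- beginning text of the seed page's labeled body
    end_text     -- ending text of the seed page's labeled body
    extract_body -- hypothetical callable: page text -> extracted body text
    """
    # S402: locate the seed's boundary texts in the related page.
    start = page_text.find(begin_text)
    end = page_text.rfind(end_text)
    if start == -1 or end == -1 or end < start:
        return None  # boundaries not both found: discard this page

    matched_body = page_text[start:end + len(end_text)]

    # S403: run the content extraction tool on the same page.
    extracted_body = extract_body(page_text)

    # S404: compare the boundary parts of both candidates.
    if extracted_body.startswith(begin_text) and extracted_body.endswith(end_text):
        return matched_body   # S405: label the matched region
    return extracted_body     # S406: label the extracted region
```

Returning text rather than character offsets keeps the sketch short; an implementation that labels DOM nodes would return node ranges instead.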

S302: label the title of the related seed web page according to the title labeled in the initial seed labeled corpus.

The title of the related seed web page is labeled on the principle of exact, whole-string matching. Referring to Fig. 5, in practice step S302 can be implemented as follows:

S501: judge whether the title of the related seed web page is identical to the title of the initial seed web page; if so, go to step S502; if not, go to step S503.

S502: label the title of the related seed web page.

S503: end the title labeling process for the related seed web page.

As can be seen, the title of the related seed web page is labeled by whole-string matching. If the complete title text can be found in the related seed page, i.e. the title of the related seed web page is exactly the same as the title of the initial seed web page, and the title text found does not lie inside the body content already labeled, the title text found is labeled "title"; otherwise the title labeling process for the related seed web page ends.
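The whole-string title match can be sketched as below. The function name and the use of character offsets are assumptions of this sketch. It also checks every exact occurrence of the title until it finds one outside the labeled body, which goes slightly beyond the source; the source only requires that the matched title text not lie within the labeled body.

```python
def label_title(page_text, seed_title, body_span=None):
    """Return (start, end) offsets of the title in page_text, or None.

    seed_title -- title labeled on the initial seed page
    body_span  -- (start, end) offsets already labeled as body, if any
    """
    pos = page_text.find(seed_title)
    while pos != -1:
        end = pos + len(seed_title)
        inside_body = (
            body_span is not None
            and pos >= body_span[0]
            and end <= body_span[1]
        )
        if not inside_body:
            return (pos, end)  # S502: label this occurrence as the title
        pos = page_text.find(seed_title, pos + 1)
    return None  # S503: no usable match, title labeling ends
```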

S303: label the content of the related seed web page that has not been labeled as other.

When an article is reposted, information such as the author and publication time is often not preserved, whereas the body text and title usually are. When expanding the initial seed corpus, therefore, only the body text and title of the related seed page are labeled, and all remaining unlabeled parts are labeled "other".
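Once the title and body have been located, assigning "other" to everything else can be sketched as follows. The offset-based representation, the function name `label_page`, and the assumption that the title and body spans do not overlap are all choices made for this illustration.

```python
def label_page(page_text, title_span=None, body_span=None):
    """Split page_text into (label, text) segments.

    title_span and body_span are (start, end) offsets or None; they are
    assumed not to overlap. All text outside them is labeled "other".
    """
    labeled = [
        (span, name)
        for span, name in ((title_span, "title"), (body_span, "body"))
        if span is not None
    ]
    labeled.sort()  # process spans in document order

    segments = []
    cursor = 0
    for (start, end), name in labeled:
        if start > cursor:
            segments.append(("other", page_text[cursor:start]))
        segments.append((name, page_text[start:end]))
        cursor = end
    if cursor < len(page_text):
        segments.append(("other", page_text[cursor:]))
    return segments
```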

S104: judge whether the related seed labeled corpus and the initial seed labeled corpus satisfy a preset condition; if so, go to step S105; if not, go to step S106.

In practice, the preset condition in this step can be either of the following: first, no seed web page can be effectively expanded any further, i.e. the search finds no other web pages that are not already in the labeled web page corpus; second, the corpus reaches the scale set by the user (for example, stopping once it reaches 100M).

Accordingly, this step judges whether the related seed labeled corpus and the initial seed labeled corpus have all already been expanded, or whether they have reached the preset scale.

S105: combine the related seed labeled corpus and the initial seed labeled corpus into the labeled web page corpus.

If the preset condition is satisfied, the related seed labeled corpus obtained is combined with the initial seed labeled corpus to form the labeled web page corpus.

S106: use the related seed labeled corpus as the initial seed labeled corpus, and return to step S102.

If the preset condition is not satisfied, the newly labeled related seed corpus is used in turn as the initial seed labeled corpus and is expanded again following steps S102 to S106, until all the labeled data obtained through expansion satisfies the preset condition and the final labeled web page corpus is built.
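The overall S102 to S106 loop can be sketched in Python as follows. Everything named here is an assumption for illustration: pages are dictionaries with a `"url"` key, `search_related` and `label_pages` are hypothetical callables standing in for S102 and S103, corpus scale is measured in pages rather than bytes, and `max_rounds` is an extra safety guard that the source does not mention. Accumulating every round's labeled pages into one corpus is also an interpretation of the text.

```python
def build_corpus(seed_corpus, search_related, label_pages, max_size, max_rounds=100):
    """Iteratively expand a seed labeled corpus.

    seed_corpus    -- list of labeled seed pages (dicts with a "url" key)
    search_related -- hypothetical callable: current seeds -> candidate pages
    label_pages    -- hypothetical callable: (seeds, candidates) -> labeled pages
    max_size       -- stop once the corpus holds this many pages
    """
    corpus = list(seed_corpus)
    seen = {page["url"] for page in corpus}
    current = list(seed_corpus)

    for _ in range(max_rounds):
        # S102: fetch related pages, keeping only ones not seen before.
        candidates = []
        for page in search_related(current):
            if page["url"] not in seen:
                seen.add(page["url"])
                candidates.append(page)

        # S103: label the new pages using the current seeds.
        labeled = label_pages(current, candidates)

        # S104, first condition: nothing new could be expanded.
        if not labeled:
            break
        corpus.extend(labeled)

        # S104, second condition: the user-set scale has been reached.
        if len(corpus) >= max_size:
            break

        # S106: the newly labeled pages become the next round's seeds.
        current = labeled

    # S105: the accumulated pages form the labeled web page corpus.
    return corpus
```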

The problems of the prior art and the corresponding solution have been described in detail above. With the method for building a labeled web page corpus in the embodiments of the invention, given a small amount of seed labeled data, the labeled data can be expanded iteratively to form a large-scale standard labeled corpus. The method does not require identical web page content to be labeled repeatedly by hand, so it saves human resources and material cost while allowing the labeled web page corpus to reach a larger scale. Further, when the labeled web page corpus built in this way is used in an application such as retrieval, retrieval precision can be improved, which avoids degrading the performance of internet servers.

Referring to Fig. 6, the embodiments of the invention provide method embodiment 2, another method for building a labeled web page corpus, which can comprise:

S601: generate an initial seed labeled corpus from pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types and the initial seed labeled corpus consists of seed web pages in which the body text and title have been labeled.

Parts of this embodiment that are implemented in the same way as in embodiment 1 can be understood by reference to embodiment 1 and are not described again in detail here.

S602: obtain a preset number of related seed web pages from a search engine according to keywords of the initial seed labeled corpus.

S603: use a vector space model to calculate the similarity between the initial seed web page and each retrieved seed web page.

This embodiment differs from the previous one in that, after the preset number of pages is obtained directly from the search engine, a vector space model (or another existing similarity calculation method) can be used to calculate the similarity between the initial seed page and each retrieved seed page.

S604: take the retrieved seed web pages whose similarity is greater than a preset threshold as the related seed web pages.

After the similarity is calculated, the retrieved seed web pages whose similarity exceeds a certain threshold are taken as the related seed pages. This embodiment adds the similarity step instead of simply taking the first M pages as the related seed web pages for the following reason: the ranking of a search engine's results does not directly reveal the similarity between a retrieved seed page and the query terms, because the ranking is determined by the search engine's specific ranking algorithm.

The ranking algorithms of today's search engines involve many factors (for example, Baidu's paid ranking), and similarity is only one of them. The similarity between web pages is therefore calculated again on the directly obtained search results, to find accurately the related seed web pages of the initial seed web page, which in this embodiment are the reposting pages. Specifically, the similarity can be obtained by segmenting each page with a word segmentation tool and then calculating the cosine of the angle between the page's vector and the seed page's vector; this is the vector space model.
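The cosine calculation and threshold filter of S603 and S604 can be sketched as follows. The sketch assumes pages have already been segmented into token lists and uses raw term counts as vector components; the source does not specify a term weighting scheme, and the default threshold of 0.8 is only a placeholder.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm_a * norm_b)


def select_related(seed_tokens, retrieved, threshold=0.8):
    """Keep retrieved pages whose similarity to the seed exceeds threshold.

    retrieved is a list of (url, tokens) pairs; the pair format and the
    default threshold are assumptions of this sketch.
    """
    return [
        url
        for url, tokens in retrieved
        if cosine_similarity(seed_tokens, tokens) > threshold
    ]
```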

S605: label the body text of the related seed web page according to a subset of the body text labeled in the initial seed labeled corpus.

S606: label the title of the related seed web page according to the title labeled in the initial seed labeled corpus.

S607: label the content of the related seed web page that has not been labeled as other, to obtain a related seed labeled corpus.

S608: judge whether the related seed labeled corpus and the initial seed labeled corpus satisfy a preset condition; if so, go to step S609; if not, go to step S610.

S609: combine the related seed labeled corpus and the initial seed labeled corpus into the labeled web page corpus.

S610: use the related seed labeled corpus as the initial seed labeled corpus, and perform step S602 again.

In summary, this embodiment not only achieves the goal of embodiment 1, building a large-scale labeled web page corpus, but also refines the search engine's results, so the labeled web page corpus it builds is more accurate and effective, which in turn makes downstream applications more effective and accurate.

Referring to Fig. 7, the embodiments of the invention provide method embodiment 3, another method for building a labeled web page corpus, which can comprise:

S701: generate an initial seed labeled corpus from pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types and the initial seed labeled corpus consists of seed web pages in which the body text and title have been labeled.

Parts of this embodiment that are implemented in the same way as in embodiments 1 and 2 can be understood by reference to those embodiments and are not described again in detail here.

S702: obtain a preset number of related seed web pages from a search engine according to keywords of the initial seed labeled corpus.

S703: label the related seed web pages according to the initial seed labeled corpus, to obtain a related seed labeled corpus.

S704: judge whether the related seed labeled corpus and the initial seed labeled corpus satisfy a preset condition; if so, go to step S705; if not, go to step S706.

S705: combine the related seed labeled corpus and the initial seed labeled corpus into the labeled web page corpus, and go to step S707.

S706: use the related seed labeled corpus as the initial seed labeled corpus, and return to step S702.

S707: Train, according to the annotated web page corpus, an extraction model for extracting web page content.

In this embodiment, after the annotated web page corpus has been built in the manner described above, the corpus can further serve as the basis for training an extraction model for extracting web page content.

S708: Extract the title and body content of a target web page according to the extraction model.

The title and body content of the target web page can be extracted with the trained extraction model, so that the title and body text of the target web page are obtained accurately. The method of this embodiment can therefore improve the precision and accuracy of content extraction.
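The loop formed by steps S702 to S706 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the names `build_corpus`, `fetch_related`, `annotate`, `target_size`, and `max_rounds` are hypothetical, the preset condition is assumed to be a simple size test, and the round cap is an added safeguard that the text does not mention. The search engine and the annotation step are passed in as callables because the text does not fix how they are realized.

```python
from typing import Callable, Dict, List

Page = Dict[str, str]  # hypothetical record: {"title": ..., "body": ...}


def build_corpus(
    seed: List[Page],
    fetch_related: Callable[[List[Page]], List[str]],
    annotate: Callable[[List[Page], List[str]], List[Page]],
    target_size: int,
    max_rounds: int = 10,
) -> List[Page]:
    """Grow an annotated corpus from a small annotated seed set."""
    current = list(seed)
    for _ in range(max_rounds):
        pages = fetch_related(current)        # S702: query the search engine
        related = annotate(current, pages)    # S703: annotate the new pages
        combined = current + related
        if len(combined) >= target_size:      # S704: assumed "preset scale" test
            return combined                   # S705: combine into the corpus
        current = related                     # S706: related data becomes the seed
    return current  # round cap reached (assumption, not part of the text)
```

Following the wording of S706 literally, the related data replaces the previous seed data on each round rather than being accumulated with it; a variant that accumulates every round would be an equally plausible reading.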

Of course, once the annotated web page corpus has been built, it can also be used to build an index; that is, a search engine server can conveniently build an index by referring to the titles and body content in the corpus. Because there are many possible subsequent applications, they are not enumerated one by one in the embodiments of the present invention; those skilled in the art can implement them in combination with existing applications.

Corresponding to method embodiment 1 for building an annotated web page corpus provided by the embodiments of the present invention, an embodiment of the present invention further provides device embodiment 1 for building an annotated web page corpus. Referring to Fig. 8, the device may include:

Generation module 801, configured to generate initial seed annotated web page data for pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types, and the initial seed annotated web page data are seed web pages in which the body text and the title have been annotated.

Acquisition module 802, configured to obtain a preset number of related seed web pages from a search engine according to keywords of the initial seed annotated web page data.

Referring to Fig. 9, the acquisition module 802 may specifically include:

Word segmentation submodule 901, configured to segment the title and body text of the initial seed web pages with a word segmentation tool to obtain initial keywords of the initial seed web pages.

First calculation submodule 902, configured to calculate a weight value for each initial keyword according to the part of speech, word frequency, and word position information of the initial keywords.

First selection submodule 903, configured to select several keywords whose weight values are greater than a preset threshold as final keywords.

Retrieval submodule 904, configured to submit the final keywords to a search engine to obtain retrieved seed web pages.
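The weighting and selection performed by submodules 902 and 903 can be sketched as follows. The text names the three inputs (part of speech, word frequency, word position) but not the formula, so the product used here, the `POS_WEIGHT` table, and the `TITLE_BOOST` factor are all assumptions made for illustration. The sketch also assumes that a word segmentation tool has already produced `(word, part_of_speech)` pairs.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part of speech), as produced by a segmenter

POS_WEIGHT = {"noun": 1.0, "verb": 0.5}  # assumed values; other tags get 0.0
TITLE_BOOST = 2.0                        # assumed bonus for words in the title


def keyword_weights(title: List[Token], body: List[Token]) -> Dict[str, float]:
    """Weight each word by part of speech, relative frequency, and position."""
    tokens = title + body
    freq = Counter(word for word, _ in tokens)
    total = sum(freq.values())
    in_title = {word for word, _ in title}
    pos_of: Dict[str, str] = {}
    for word, pos in tokens:
        pos_of.setdefault(word, pos)  # keep the first tag seen for each word
    weights = {}
    for word, count in freq.items():
        boost = TITLE_BOOST if word in in_title else 1.0
        weights[word] = POS_WEIGHT.get(pos_of[word], 0.0) * (count / total) * boost
    return weights


def select_keywords(weights: Dict[str, float], threshold: float) -> List[str]:
    """Keep the words whose weight exceeds the threshold, heaviest first."""
    kept = [w for w, v in weights.items() if v > threshold]
    return sorted(kept, key=lambda w: -weights[w])
```

The selected keywords would then be submitted to the search engine by the retrieval submodule 904.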

Annotation module 803, configured to annotate the related seed web pages according to the initial seed annotated web page data to obtain related seed annotated web page data.

Referring to Fig. 10, the annotation module 803 may include:

First annotation submodule 1001, configured to annotate the body text of the related seed web pages according to a subset of the body text annotated in the initial seed annotated web page data.

Referring to Fig. 11, the first annotation submodule 1001 may include:

Subset extraction submodule 1101, configured to extract a subset of the body text from the initial seed annotated web page data, where the subset is the content of any one or more parts of the body text of the initial seed annotated web page.

Search submodule 1102, configured to search the related seed web page for a corresponding related start portion and related end portion according to the initial start portion and initial end portion of the subset.

Content extraction submodule 1103, configured to perform content extraction on the related seed web page with a content extraction tool to obtain an extracted start portion and an extracted end portion.

First judgment submodule 1104, configured to judge whether the related start portion is identical to the extracted start portion and whether the related end portion is identical to the extracted end portion.

Fourth annotation submodule 1105, configured to annotate the content between the related start portion and the related end portion as body text when the result of the first judgment submodule is yes.

Fifth annotation submodule 1106, configured to annotate the content between the extracted start portion and the extracted end portion as body text when the result of the first judgment submodule is no.
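The decision made by submodules 1102 to 1106 can be sketched in Python under simplifying assumptions. Here the start and end portions are modeled as literal marker strings, spans are character offsets into the page source, and the content extraction tool is a hypothetical callable `extract_span` that returns a `(start, end)` offset pair; the text does not specify any of these representations.

```python
from typing import Callable, Optional, Tuple

Span = Tuple[int, int]  # character offsets [start, end) into the page source


def find_span(page: str, start_marker: str, end_marker: str) -> Optional[Span]:
    """Locate the region that begins at start_marker and ends after end_marker."""
    s = page.find(start_marker)
    if s == -1:
        return None
    e = page.find(end_marker, s + len(start_marker))
    if e == -1:
        return None
    return s, e + len(end_marker)


def label_body(
    page: str,
    start_marker: str,
    end_marker: str,
    extract_span: Callable[[str], Span],
) -> Tuple[str, bool]:
    """Return the text annotated as body and whether both sources agreed."""
    related = find_span(page, start_marker, end_marker)  # search submodule
    extracted = extract_span(page)                        # content extraction tool
    agreed = related == extracted                         # first judgment submodule
    chosen = related if agreed else extracted             # fourth or fifth submodule
    return page[chosen[0]:chosen[1]], agreed
```

Under this offset model the two branches select the same text whenever they agree, so the returned flag mainly records whether the seed-derived markers confirmed the extraction tool's result.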

Second annotation submodule 1002, configured to annotate the title of the related seed web pages according to the title annotated in the initial seed annotated web page data.

Referring to Fig. 12, the second annotation submodule 1002 may specifically include:

Second judgment submodule 1201, configured to judge whether the title of a related seed web page is consistent with the title of the initial seed web page.

Sixth annotation submodule 1202, configured to annotate the title of the related seed web page when the result of the second judgment submodule is yes.

Ending submodule 1203, configured to end the title annotation process for the related seed web page when the result of the second judgment submodule is no.

Third annotation submodule 1003, configured to annotate the content of the related seed web pages that has not been annotated as "other".

Judgment module 804, configured to judge whether the related seed annotated web page data and the initial seed annotated web page data satisfy a preset condition.

The judgment module 804 may specifically be configured to judge whether the related seed annotated web page data and the initial seed annotated web page data have both been expanded, or to judge whether the related seed annotated web page data and the initial seed annotated web page data have reached a predefined scale.

Combination module 805, configured to combine the related seed annotated web page data and the initial seed annotated web page data into the annotated web page corpus when the result of the judgment module is yes.

Trigger module 806, configured to use the related seed annotated web page data as the initial seed annotated web page data and trigger the acquisition module 802 when the result of the judgment module is no.

With the device for building an annotated web page corpus in the embodiments of the present invention, given a small amount of seed annotated data, the scale of the annotated data can be enlarged through repeated iteration to form large-scale standard annotated data. Moreover, this way of building the corpus does not require manual, repetitive annotation of identical web page content; it saves human resources and material cost while allowing the annotated web page corpus to reach a larger scale.

Corresponding to method embodiment 2 for building an annotated web page corpus provided by the embodiments of the present invention, an embodiment of the present invention further provides device embodiment 2 for building an annotated web page corpus. Referring to Fig. 13, the device may include:

Acquisition module 802, configured to obtain a preset number of retrieved seed web pages from a search engine according to keywords of the initial seed annotated web page data.

Second calculation submodule 1301, configured to calculate the similarity between the initial seed web pages and the retrieved seed web pages using a vector space model.

Second selection submodule 1302, configured to select several retrieved seed web pages whose similarity values are greater than a preset threshold as the related seed web pages.
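The filtering done by submodules 1301 and 1302 can be sketched with cosine similarity over term-count vectors, which is one common form of the vector space model. The text does not say which term weighting is used, so raw counts are an assumption here (TF-IDF would be another option), and the function names and the dictionary layout of `candidates` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine of the angle between two term-count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as unrelated
    return dot / (norm_a * norm_b)


def select_related(
    seed_tokens: List[str],
    candidates: Dict[str, List[str]],
    threshold: float,
) -> List[str]:
    """Keep the retrieved pages whose similarity to the seed exceeds the threshold."""
    return [
        name
        for name, tokens in candidates.items()
        if cosine_similarity(seed_tokens, tokens) > threshold
    ]
```

Only the pages returned by `select_related` would be passed on to the annotation module as related seed web pages.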

In summary, building the annotated web page corpus with the device of this embodiment not only achieves the purpose of building a large-scale annotated web page corpus, but also optimizes the retrieval results returned by the search engine. The annotated web page corpus built in this embodiment is therefore more accurate and effective, which in turn makes subsequent applications more effective and accurate.

Corresponding to method embodiment 3 for building an annotated web page corpus provided by the embodiments of the present invention, an embodiment of the present invention further provides device embodiment 3 for building an annotated web page corpus. Referring to Fig. 14, the device may include:

Trigger module 806, configured to use the related seed annotated web page data as the initial seed annotated web page data and trigger the acquisition module 802 when the result of the judgment module is no.

Training module 1401, configured to train, according to the annotated web page corpus, an extraction model for extracting web page content.

Extraction module 1402, configured to extract the title and body content of a target web page according to the extraction model.

In addition, it should be noted that the above series of processes and devices may also be implemented by software and/or firmware. In that case, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose personal computer 1500 shown in Fig. 15, and the computer can perform various functions when various programs are installed.

In Fig. 15, a central processing unit (CPU) 1501 performs various processes according to a program stored in a read-only memory (ROM) 1502 or a program loaded from a storage section 1508 into a random-access memory (RAM) 1503. The RAM 1503 also stores, as needed, data required when the CPU 1501 performs the various processes.

The CPU 1501, the ROM 1502, and the RAM 1503 are connected to one another via a bus 1504. An input/output interface 1505 is also connected to the bus 1504.

The following components are connected to the input/output interface 1505: an input section 1506, including a keyboard, a mouse, and the like; an output section 1507, including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 1508, including a hard disk and the like; and a communication section 1509, including a network interface card such as a LAN card, a modem, and the like. The communication section 1509 performs communication processing via a network such as the Internet.

A drive 1510 is also connected to the input/output interface 1505 as needed. A removable medium 1511, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 1510 as needed, so that a computer program read from it is installed into the storage section 1508 as needed.

When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1511.

Those skilled in the art will understand that such a storage medium is not limited to the removable medium 1511 shown in Fig. 15, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1511 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1502, a hard disk included in the storage section 1508, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.

It should also be pointed out that the steps of the above series of processes may naturally be performed chronologically in the order described, but need not necessarily be performed in that order. Some steps may be performed in parallel or independently of one another.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise", "include", or any other variant thereof in the embodiments of the present invention are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

With respect to the implementations including the above embodiments, the following supplementary notes are also disclosed:

Supplementary note 1. A method for building an annotated web page corpus, comprising:

generating initial seed annotated web page data for pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types, and the initial seed annotated web page data are seed web pages in which the body text and the title have been annotated;

obtaining a preset number of related seed web pages from a search engine according to keywords of the initial seed annotated web page data;

annotating the related seed web pages according to the initial seed annotated web page data to obtain related seed annotated web page data; and

judging whether the related seed annotated web page data and the initial seed annotated web page data satisfy a preset condition; if so, combining the related seed annotated web page data and the initial seed annotated web page data into the annotated web page corpus; if not, using the related seed annotated web page data as the initial seed annotated web page data and performing again the step of obtaining a preset number of related seed web pages from the search engine.

2. The method according to supplementary note 1, wherein the step of annotating the related seed web pages according to the initial seed annotated web page data comprises:

annotating the body text of the related seed web pages according to a subset of the body text annotated in the initial seed annotated web page data;

annotating the title of the related seed web pages according to the title annotated in the initial seed annotated web page data; and

annotating the content of the related seed web pages that has not been annotated as "other".

3. The method according to supplementary note 2, wherein the step of annotating the body text of the related seed web pages comprises:

extracting a subset of the body text from the initial seed annotated web page data, where the subset is the content of any one or more parts of the body text of the initial seed annotated web page;

searching the related seed web page for a corresponding related start portion and related end portion according to the initial start portion and initial end portion of the subset;

performing content extraction on the related seed web page with a content extraction tool to obtain an extracted start portion and an extracted end portion; and

judging whether the related start portion is identical to the extracted start portion and whether the related end portion is identical to the extracted end portion; if so, annotating the content between the related start portion and the related end portion as body text; if not, annotating the content between the extracted start portion and the extracted end portion as body text.

4. The method according to supplementary note 2, wherein the step of annotating the title of the related seed web pages comprises:

judging whether the title of a related seed web page is consistent with the title of the initial seed web page; if so, annotating the title of the related seed web page; if not, ending the title annotation process for the related seed web page.

5. The method according to supplementary note 1, wherein the step of obtaining a preset number of related seed web pages from a search engine according to keywords of the initial seed annotated web page data comprises:

segmenting the title and body text of the initial seed web pages with a word segmentation tool to obtain initial keywords of the initial seed web pages;

calculating a weight value for each initial keyword according to the part of speech, word frequency, and word position information of the initial keywords;

selecting several keywords whose weight values are greater than a preset threshold as final keywords; and

submitting the final keywords to a search engine to obtain retrieved seed web pages.

6. The method according to supplementary note 5, further comprising, after the step of submitting the final keywords to a search engine to obtain retrieved seed web pages:

calculating the similarity between the initial seed web pages and the retrieved seed web pages using a vector space model; and

taking several retrieved seed web pages whose similarity values are greater than a preset threshold as the related seed web pages.

7. The method according to supplementary note 1, wherein the step of judging whether the related seed annotated web page data and the initial seed annotated web page data satisfy a preset condition comprises:

judging whether the related seed annotated web page data and the initial seed annotated web page data have both been expanded; or

judging whether the related seed annotated web page data and the initial seed annotated web page data have reached a predefined scale.

8. The method according to supplementary note 1, further comprising, after the annotated web page corpus is obtained:

training, according to the annotated web page corpus, an extraction model for extracting web page content; and

extracting the title and body content of a target web page according to the extraction model.

9. A device for building an annotated web page corpus, comprising:

a generation module, configured to generate initial seed annotated web page data for pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types, and the initial seed annotated web page data are seed web pages in which the body text and the title have been annotated;

an acquisition module, configured to obtain a preset number of related seed web pages from a search engine according to keywords of the initial seed annotated web page data;

an annotation module, configured to annotate the related seed web pages according to the initial seed annotated web page data to obtain related seed annotated web page data;

a judgment module, configured to judge whether the related seed annotated web page data and the initial seed annotated web page data satisfy a preset condition;

a combination module, configured to combine the related seed annotated web page data and the initial seed annotated web page data into the annotated web page corpus when the result of the judgment module is yes; and

a trigger module, configured to use the related seed annotated web page data as the initial seed annotated web page data and trigger the acquisition module when the result of the judgment module is no.

10. The device according to supplementary note 9, wherein the annotation module comprises:

a first annotation submodule, configured to annotate the body text of the related seed web pages according to a subset of the body text annotated in the initial seed annotated web page data;

a second annotation submodule, configured to annotate the title of the related seed web pages according to the title annotated in the initial seed annotated web page data; and

a third annotation submodule, configured to annotate the content of the related seed web pages that has not been annotated as "other".

11. The device according to supplementary note 10, wherein the first annotation submodule comprises:

a subset extraction submodule, configured to extract a subset of the body text from the initial seed annotated web page data, where the subset is the content of any one or more parts of the body text of the initial seed annotated web page;

a search submodule, configured to search the related seed web page for a corresponding related start portion and related end portion according to the initial start portion and initial end portion of the subset;

a content extraction submodule, configured to perform content extraction on the related seed web page with a content extraction tool to obtain an extracted start portion and an extracted end portion;

a first judgment submodule, configured to judge whether the related start portion is identical to the extracted start portion and whether the related end portion is identical to the extracted end portion;

a fourth annotation submodule, configured to annotate the content between the related start portion and the related end portion as body text when the result of the first judgment submodule is yes; and

a fifth annotation submodule, configured to annotate the content between the extracted start portion and the extracted end portion as body text when the result of the first judgment submodule is no.

12. The device according to supplementary note 10, wherein the second annotation submodule comprises:

a second judgment submodule, configured to judge whether the title of a related seed web page is consistent with the title of the initial seed web page;

a sixth annotation submodule, configured to annotate the title of the related seed web page when the result of the second judgment submodule is yes; and

an ending submodule, configured to end the title annotation process for the related seed web page when the result of the second judgment submodule is no.

13. The device according to supplementary note 9, wherein the acquisition module comprises:

a word segmentation submodule, configured to segment the title and body text of the initial seed web pages with a word segmentation tool to obtain initial keywords of the initial seed web pages;

a first calculation submodule, configured to calculate a weight value for each initial keyword according to the part of speech, word frequency, and word position information of the initial keywords;

a first selection submodule, configured to select several keywords whose weight values are greater than a preset threshold as final keywords; and

a retrieval submodule, configured to submit the final keywords to a search engine to obtain retrieved seed web pages.

14. The device according to supplementary note 13, further comprising:

a second calculation submodule, configured to calculate the similarity between the initial seed web pages and the retrieved seed web pages using a vector space model; and

a second selection submodule, configured to select several retrieved seed web pages whose similarity values are greater than a preset threshold as the related seed web pages.

15. The device according to supplementary note 9, wherein the judgment module is configured to:

judge whether the related seed annotated web page data and the initial seed annotated web page data have both been expanded; or judge whether the related seed annotated web page data and the initial seed annotated web page data have reached a predefined scale.

16. The device according to supplementary note 9, further comprising:

a training module, configured to train, according to the annotated web page corpus, an extraction model for extracting web page content; and

an extraction module, configured to extract the title and body content of a target web page according to the extraction model.

Claims

1. A method for building an annotated web page corpus, comprising:

judging whether the related seed annotated web page data and the initial seed annotated web page data satisfy a preset condition; if so, combining the related seed annotated web page data and the initial seed annotated web page data into the annotated web page corpus; if not, using the related seed annotated web page data as the initial seed annotated web page data and performing again the step of obtaining a preset number of related seed web pages from a search engine according to keywords of the initial seed annotated web page data.

2. The method according to claim 1, wherein the step of annotating the related seed web pages according to the initial seed annotated web page data comprises:

3. The method according to claim 2, wherein the step of annotating the body text of the related seed web pages comprises:

4. The method according to claim 2, wherein the step of annotating the title of the related seed web pages comprises:

5. The method according to claim 1, wherein the step of judging whether the related seed annotated web page data and the initial seed annotated web page data satisfy a preset condition comprises:

6. A device for building an annotated web page corpus, comprising:

7. The device according to claim 6, wherein the annotation module comprises:

8. The device according to claim 7, wherein the first annotation submodule comprises:

9. The device according to claim 7, wherein the second annotation submodule comprises:

a sixth annotation submodule, configured to annotate the title of the related seed web page when the result of the second judgment submodule is yes; and

10. The device according to claim 6, wherein the judgment module is configured to judge whether the related seed annotated web page data and the initial seed annotated web page data have both been expanded, or to judge whether the related seed annotated web page data and the initial seed annotated web page data have reached a predefined scale.