CN102831131B

CN102831131B - Method and device for building a labeled webpage corpus

Info

Publication number: CN102831131B
Application number: CN201110172092.8A
Authority: CN
Inventors: 付雷; 夏迎炬; 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-06-16
Filing date: 2011-06-16
Publication date: 2015-02-11
Anticipated expiration: 2031-06-16
Also published as: CN102831131A

Abstract

An embodiment of the invention discloses a method and a device for building a labeled webpage corpus. The method comprises: generating initial seed labeled webpage data from pre-selected initial seed webpages; obtaining a preset number of relevant seed webpages from a search engine according to keywords of the initial seed labeled webpage data; labeling the relevant seed webpages according to the initial seed labeled webpage data to obtain relevant seed labeled webpage data; and judging whether the relevant seed labeled webpage data and the initial seed labeled webpage data satisfy a preset condition. If so, the relevant seed labeled webpage data and the initial seed labeled webpage data are combined into the labeled webpage corpus; if not, the relevant seed labeled webpage data is used as the initial seed labeled webpage data and the step of obtaining a preset number of relevant seed webpages from the search engine is executed again. With this embodiment, large-scale standard labeled data can be formed from only a small amount of seed labeled data.

Description

Method and device for building a labeled webpage corpus

Technical field

The present invention relates generally to the field of Internet data processing, and in particular to a method and a device for building a labeled webpage corpus.

Background technology

The Internet is extremely rich in data resources and offers a potential data source for many data-intensive applications. However, the structure of webpages on the Internet is complex, and the body content of a webpage is often buried in noise such as advertisements and navigation. To use this huge data source for research, the various kinds of information in a webpage must therefore be separated and classified; that is, the content of the webpage must be labeled.

Webpage data with fine-grained labels matters greatly for many applications, such as web search, webpage classification, and webpage content extraction. Fine-grained labeling here means dividing the text that appears in a webpage into eight classes: title, author, time, body text, comment, advertisement, related link, and other. Data labeled in this way can serve as training data for applications such as content extraction or clustering and classification, and can also be used in the preprocessing stage of applications such as retrieval, thereby improving retrieval precision.

The traditional way to build a labeled webpage corpus is generally to label webpages manually: a technician inspects the entire content of a webpage and labels each part of it according to what was found.

However, because the number of webpages on the Internet is practically unlimited, manual labeling demands enormous effort from technicians. Moreover, parts of different webpages are sometimes similar or identical, so technicians end up labeling the same content repeatedly, which wastes human resources and makes it hard to scale up the corpus.

Summary of the invention

In view of this, embodiments of the present invention provide a method and a device for building a labeled webpage corpus that, given a small amount of seed labeled data, repeatedly expand the labeled data in a loop to form large-scale standard labeled data.

According to one aspect of the embodiments of the present invention, a method for building a labeled webpage corpus is provided, comprising: generating initial seed labeled webpage data from pre-selected initial seed webpages, where the initial seed webpages are a set of webpages of different types and the initial seed labeled webpage data consists of seed webpages in which the body text and title have been labeled; obtaining a preset number of relevant seed webpages from a search engine according to keywords of the initial seed labeled webpage data; labeling the relevant seed webpages according to the initial seed labeled webpage data to obtain relevant seed labeled webpage data; and judging whether the relevant seed labeled webpage data and the initial seed labeled webpage data satisfy a preset condition; if so, combining the relevant seed labeled webpage data and the initial seed labeled webpage data into the labeled webpage corpus; if not, using the relevant seed labeled webpage data as the initial seed labeled webpage data and executing again the step of obtaining a preset number of relevant seed webpages from the search engine.

According to another aspect of the embodiments of the present invention, a device for building a labeled webpage corpus is provided, comprising: a generation module, configured to generate initial seed labeled webpage data from pre-selected initial seed webpages, where the initial seed webpages are a set of webpages of different types and the initial seed labeled webpage data consists of seed webpages in which the body text and title have been labeled; an acquisition module, configured to obtain a preset number of relevant seed webpages from a search engine according to keywords of the initial seed labeled webpage data; a labeling module, configured to label the relevant seed webpages according to the initial seed labeled webpage data to obtain relevant seed labeled webpage data; a judging module, configured to judge whether the relevant seed labeled webpage data and the initial seed labeled webpage data satisfy a preset condition; a combining module, configured to combine the relevant seed labeled webpage data and the initial seed labeled webpage data into the labeled webpage corpus when the result of the judging module is yes; and a triggering module, configured to use the relevant seed labeled webpage data as the initial seed labeled webpage data and trigger the acquisition module when the result of the judging module is no.

In addition, according to a further aspect of the present invention, a storage medium is provided. The storage medium contains machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the above method for building a labeled webpage corpus according to the present invention.

In addition, according to yet another aspect of the present invention, a program product is provided. The program product contains machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the above method for building a labeled webpage corpus according to the present invention.

With the above method of the embodiments of the present invention, given a small amount of seed labeled data, the labeled data can be expanded repeatedly in a loop to form large-scale standard labeled data. The method requires no repeated manual labeling of identical webpage content, which saves human and material resources and allows the labeled webpage corpus to reach a larger scale.

Other aspects of the embodiments of the present invention are given in the description below, in which preferred embodiments are described in detail to fully disclose the embodiments of the invention without limiting them.

Brief description of the drawings

The above and other objects and advantages of the embodiments of the present invention are further described below with reference to specific embodiments and the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.

Fig. 1 is a flowchart of method embodiment 1 provided by the embodiments of the present invention;

Fig. 2 is a flowchart of step S102 in method embodiment 1;

Fig. 3 is a flowchart of step S103 in method embodiment 1;

Fig. 4 is a flowchart of step S301 in method embodiment 1;

Fig. 5 is a flowchart of step S302 in method embodiment 1;

Fig. 6 is a flowchart of method embodiment 2 provided by the embodiments of the present invention;

Fig. 7 is a flowchart of method embodiment 3 provided by the embodiments of the present invention;

Fig. 8 is a schematic diagram of device embodiment 1 provided by the embodiments of the present invention;

Fig. 9 is a schematic diagram of the acquisition module 802 in device embodiment 1;

Fig. 10 is a schematic diagram of the labeling module 803 in device embodiment 1;

Fig. 11 is a schematic diagram of the first labeling submodule 1001 in device embodiment 1;

Fig. 12 is a schematic diagram of the second labeling submodule 1002 in device embodiment 1;

Fig. 13 is a schematic diagram of device embodiment 2 provided by the embodiments of the present invention;

Fig. 14 is a schematic diagram of device embodiment 3 provided by the embodiments of the present invention;

Fig. 15 is a block diagram of an example structure of a personal computer serving as the information processing device used in embodiments of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described below with reference to the accompanying drawings.

The embodiments of the present invention provide a solution to the problems of the prior art. Specifically, referring to Fig. 1, method embodiment 1 for building a labeled webpage corpus may comprise:

S101: generate initial seed labeled webpage data from pre-selected initial seed webpages, where the initial seed webpages are a set of webpages of different types and the initial seed labeled webpage data consists of seed webpages in which the body text and title have been labeled.

In practice, a number of webpages of different types may be chosen in advance, for example a small sample of webpages from each of the blog, news, and finance categories. The number of webpages per type is not limited; for example, 100 webpages may be chosen per type. The types may vary with the actual situation; for example, the category scheme of the Sina website differs from that of Sohu.com. Such differences in classification do not affect the implementation, so the embodiments of the present invention do not limit the number or types of the pre-selected initial seed webpages.

The selected webpages serve as the initial seed webpages and must be labeled manually in a fine-grained manner. At minimum the title and body text must be labeled, with all other parts labeled as other; optionally, the other parts may be further labeled as author, time, comment, advertisement, and related links. The labeled initial seed webpages form the initial seed labeled webpage data.

S102: obtain a preset number of relevant seed webpages from a search engine according to keywords of the initial seed labeled webpage data.

Because the initial seed webpages are only a small number of representative webpages of different types, further relevant seed webpages must be obtained by expanding from them. On the Internet, an article published on an initial seed webpage is very likely to be reposted. If the content of an initial seed webpage in the initial seed labeled webpage data has been reposted by another webpage that is not in that data, the body content of the two pages should in theory be roughly the same, that is, highly similar, so the labels of the initial page can later be used to label the reposting page.

This step finds the pages that repost the initial seed webpages. A search engine may be used as an auxiliary tool, searching with the keywords of the initial seed labeled webpage data.

As shown in Fig. 2, S102 may be implemented as follows in practice:

S201: segment the title and body text of the initial seed webpage with a word segmentation tool to obtain the initial keywords of the initial seed webpage.

In the embodiments of the present invention, a word segmentation tool (for example, the shareware ICTCLAS) may be used to segment the labeled title and body text of the initial seed webpage, yielding its initial keywords.

S202: calculate the weight of each initial keyword from its part of speech, word frequency, and position information.

In this step, the weight of each initial keyword is calculated from its part of speech, word frequency, and position. A specific formula may be as follows:

Weight(W) = Position(W) + Freq(W) + Pos(W)

That is, the weight of an initial keyword W equals the sum of its position value, word frequency value, and part-of-speech value. Position(W) is the position value: for example, if W appears in both the title and the body text, Position(W) may be 3; if W appears only in the title, Position(W) may be 2; if W appears only in the body text, Position(W) may be 1. Freq(W) is the word frequency value, namely the number of occurrences of W divided by the total number of occurrences of all words in the document. Pos(W) is the part-of-speech value: for example, Pos(W) may be 1 if W is a noun or noun phrase, and 0 otherwise.

These specific values are only examples given to help those skilled in the art understand the embodiments of the present invention; in practice, the way the weight is calculated may be adapted to actual conditions or user requirements.
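As an illustration only, the weight formula might be sketched in Python as follows. The function name `keyword_weights`, the use of pre-segmented token lists, and the `noun_tokens` set (standing in for the part-of-speech output of a segmentation tool) are assumptions made for this sketch; the "document" used for Freq(W) is assumed to be the title plus the body text.

```python
from collections import Counter


def keyword_weights(title_tokens, body_tokens, noun_tokens):
    """Compute Weight(W) = Position(W) + Freq(W) + Pos(W) for each word.

    title_tokens, body_tokens: lists of already segmented words.
    noun_tokens: set of words tagged as nouns or noun phrases
                 (assumed to come from the segmentation tool).
    """
    counts = Counter(title_tokens) + Counter(body_tokens)
    total = sum(counts.values())  # occurrences of all words in the document
    title_set, body_set = set(title_tokens), set(body_tokens)

    weights = {}
    for word, n in counts.items():
        if word in title_set and word in body_set:
            position = 3  # appears in both title and body text
        elif word in title_set:
            position = 2  # appears only in the title
        else:
            position = 1  # appears only in the body text
        freq = n / total
        pos = 1 if word in noun_tokens else 0
        weights[word] = position + freq + pos
    return weights
```

The example values 3, 2, 1 and 1, 0 are the ones given in the text; they would be parameters in a real implementation.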

S203: select the keywords whose weight exceeds a preset threshold as the final keywords.

After the weight of each initial keyword is obtained, the N keywords with the largest weights are selected as the final keywords to be submitted to the search engine. The value of N affects the number of results the search engine returns and may generally be 5 to 15.
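The selection in S203 can be sketched as below. This is a minimal sketch, not the patented implementation: `select_final_keywords` is a hypothetical name, and combining an optional threshold with a top-N cut is an assumption that covers both wordings used in the text.

```python
import heapq


def select_final_keywords(weights, n=10, threshold=None):
    """Return up to n keywords with the largest weights, highest first.

    weights: dict mapping keyword -> weight.
    threshold: if given, only keywords whose weight exceeds it are kept.
    """
    items = list(weights.items())
    if threshold is not None:
        items = [(word, score) for word, score in items if score > threshold]
    top = heapq.nlargest(n, items, key=lambda kv: kv[1])
    return [word for word, _ in top]
```

The default `n=10` sits inside the 5 to 15 range mentioned above and is only a placeholder.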

S204: retrieve with the final keywords through a search engine to obtain retrieved seed webpages.

The final keywords are submitted to a search engine such as Google, Baidu, or Sogou, which returns search results for them; these results are the reposting pages related to the initial seed webpage. The choice of search engine does not affect the implementation either, so the present invention does not limit the specific search engine used.

S103: label the relevant seed webpages according to the initial seed labeled webpage data to obtain relevant seed labeled webpage data.

After the relevant seed webpages are obtained, they can be labeled with reference to the initial seed labeled webpage data. Specifically, only three classes need be labeled on a relevant seed webpage: body text, title, and other. The order may be body text first, then title, and finally other.

As shown in Fig. 3, step S103 may in practice comprise:

S301: label the body text of the relevant seed webpage according to a subset of the body text labeled in the initial seed labeled webpage data.

This step first labels the body text of the relevant seed webpage. As shown in Fig. 4, step S301 may be implemented as follows in practice:

S401: extract a subset of the body text from the initial seed labeled webpage data; the subset is the content of any one or more parts of the body text of the initial seed labeled webpage.

The subset in this step may be any one paragraph or sentence of the body text of the initial seed labeled webpage, or the content of several paragraphs or sentences. For example, the subset may be the entire body text, the first and last paragraphs of the body text, or the longest paragraph of the body text.

Take the case where the subset is the entire body text of the initial seed labeled webpage. In this case the boundary text of the body is used to locate the body part of the relevant seed webpage. Concretely, the beginning text and ending text of the body can be taken from the part of the initial seed page labeled "body text". If the total length of the body part is L, the first and last L/5 of the text may be taken as the beginning text and ending text, respectively. Other lengths may of course be chosen according to actual conditions.
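A minimal sketch of taking the boundary text, assuming the body is available as a plain string and length is measured in characters; the name `boundary_texts` and the `fraction` parameter are hypothetical.

```python
def boundary_texts(body, fraction=5):
    """Return (beginning_text, ending_text) of the labeled body text.

    Each slice is len(body) // fraction characters long (L/5 by default),
    with a minimum of one character so short bodies still yield a boundary.
    """
    n = max(1, len(body) // fraction)
    return body[:n], body[-n:]
```

For a ten-character body the call returns its first two and last two characters.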

S402: search the relevant seed webpage for the relevant beginning part and relevant ending part that correspond to the initial beginning part and initial ending part of the subset.

After the beginning text and ending text of the body of the initial seed labeled webpage are obtained, they are matched against the relevant seed webpage. The matching positions in the relevant seed webpage are called the relevant beginning part and the relevant ending part.

S403: extract content from the relevant seed webpage with a content extraction tool to obtain an extracted beginning part and an extracted ending part.

In this step, a content extraction tool extracts the body part directly from the relevant seed webpage. The beginning text and ending text of the body obtained in this way are called the extracted beginning part and the extracted ending part. Any existing content extraction method may be used; the embodiments of the present invention do not limit how the extraction is done.

S404: judge whether the relevant beginning part is identical to the extracted beginning part and the relevant ending part is identical to the extracted ending part; if so, go to step S405; if not, go to step S406.

This step judges whether the relevant beginning part obtained above is identical to the extracted beginning part and, at the same time, whether the relevant ending part is identical to the extracted ending part. If both are identical, the body text of the initial seed webpage matches the body content of the relevant seed webpage exactly, and the content between the relevant beginning part and the relevant ending part is labeled "body text" in step S405. If not, the body content of the relevant seed webpage contains more than that of the initial seed page, and the content extracted directly from the relevant seed webpage, that is, the content between the extracted beginning part and the extracted ending part, is labeled "body text" in step S406.

S405: label the content between the relevant beginning part and the relevant ending part as body text.

S406: label the content between the extracted beginning part and the extracted ending part as body text.

Note that if, while labeling the body text, the beginning text and ending text of the initial seed page's body cannot both be found in the relevant seed webpage, no labeling is performed on that webpage: it is discarded and the next relevant seed webpage is tried.
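The decision flow of S402 to S406 might look like the following sketch. It assumes the webpage has already been reduced to plain text and that `extracted_body` was produced by some content extraction tool; the function name `label_body` and the use of exact substring matching are assumptions of this sketch, not details fixed by the text.

```python
def label_body(related_text, seed_begin, seed_end, extracted_body):
    """Return the text to label as body text, or None to discard the page.

    related_text:   plain text of the relevant seed webpage.
    seed_begin:     beginning text of the seed page's labeled body.
    seed_end:       ending text of the seed page's labeled body.
    extracted_body: body text returned by a content extraction tool.
    """
    # S402: locate the seed boundaries in the relevant seed webpage.
    start = related_text.find(seed_begin)
    if start == -1:
        return None
    end = related_text.rfind(seed_end, start + len(seed_begin))
    if end == -1:
        return None  # both boundaries must be found, otherwise discard

    matched = related_text[start:end + len(seed_end)]

    # S404: compare the matched boundaries with the extracted boundaries.
    if extracted_body.startswith(seed_begin) and extracted_body.endswith(seed_end):
        return matched         # S405: boundaries agree
    return extracted_body      # S406: fall back to the extracted body
```

Because the matched boundaries are exact copies of the seed boundaries, comparing them with the extracted boundaries reduces to the `startswith` and `endswith` checks shown.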

S302: label the title of the relevant seed webpage according to the title labeled in the initial seed labeled webpage data.

The title of the relevant seed webpage is labeled on a full-match basis. As shown in Fig. 5, step S302 may be implemented as follows in practice:

S501: judge whether the title of the relevant seed webpage is identical to the title of the initial seed webpage; if so, go to step S502; if not, go to step S503.

S502: label the title of the relevant seed webpage.

S503: end the title labeling process for the relevant seed webpage.

As can be seen, the title of the relevant seed webpage is labeled by whole matching. If the complete title text is found in the relevant seed webpage, that is, the title of the relevant seed webpage is exactly the same as that of the initial seed webpage, and the title text is not located inside the already labeled body content, the title text found is labeled "title"; otherwise the title labeling process for the relevant seed webpage ends.
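The whole-match rule for titles can be sketched as follows, assuming the page is plain text and the labeled body is given as a half-open character span; `label_title` and `body_span` are hypothetical names.

```python
def label_title(related_text, seed_title, body_span):
    """Return the (start, end) span to label as title, or None.

    related_text: plain text of the relevant seed webpage.
    seed_title:   title labeled on the initial seed webpage.
    body_span:    (start, end) of the content already labeled as body text.
    """
    if not seed_title:
        return None
    pos = related_text.find(seed_title)
    while pos != -1:
        end = pos + len(seed_title)
        # Accept only an occurrence that lies outside the labeled body.
        if end <= body_span[0] or pos >= body_span[1]:
            return (pos, end)
        pos = related_text.find(seed_title, pos + 1)
    return None  # no whole match outside the body: stop title labeling
```

Scanning past occurrences inside the body is a design choice of this sketch; the text only requires that the labeled title not lie inside the body content.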

S303: label the content not yet labeled in the relevant seed webpage as other.

When an article is reposted, information such as its author and publication time is often not retained, whereas the body text and title usually are. When expanding the initial seed data, therefore, only the body text and title of a relevant seed webpage are labeled, and after that all remaining unlabeled parts are labeled "other".

S104: judge whether the relevant seed labeled webpage data and the initial seed labeled webpage data satisfy a preset condition; if so, go to step S105; if not, go to step S106.

In practice, the preset condition in this step may be either of the following: first, no seed webpage can be effectively expanded any further, that is, searching no longer finds webpages that are not already in the labeled webpage corpus; second, the corpus has reached the scale set by the user (for example, stop once it reaches 100M).

Accordingly, this step judges whether the relevant seed labeled webpage data and the initial seed labeled webpage data have all been fully expanded, or whether they have reached the preset scale.

S105: combine the relevant seed labeled webpage data and the initial seed labeled webpage data into the labeled webpage corpus.

If the preset condition is satisfied, the relevant seed labeled webpage data obtained and the initial seed labeled webpage data are combined into the labeled webpage corpus.

S106: use the relevant seed labeled webpage data as the initial seed labeled webpage data and return to step S102.

If the preset condition is not satisfied, the relevant seed labeled webpage data just produced is used as new initial seed labeled webpage data and expanded again through steps S102 to S106, until all the expanded labeled webpage data satisfies the preset condition and the final labeled webpage corpus is built.
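The overall loop of S102 to S106 can be summarized in a short sketch. Everything named here is an assumption for illustration: `fetch_related` stands for keyword extraction plus search, `label_page` stands for the labeling of S103, pages are keyed by URL, and the corpus scale is measured as a page count rather than a data size.

```python
def build_corpus(initial_seeds, fetch_related, label_page, max_pages):
    """Expand seed labeled pages until no new page is found or the scale is reached.

    initial_seeds: dict mapping URL -> labeled seed page.
    fetch_related: callable(seed) -> iterable of (url, raw_page)   (S102).
    label_page:    callable(seed, raw_page) -> labeled page or None (S103).
    max_pages:     preset corpus scale, counted in pages.
    """
    corpus = dict(initial_seeds)
    frontier = dict(initial_seeds)  # pages used as seeds in this round

    # S104: stop when nothing new was labeled or the preset scale is reached.
    while frontier and len(corpus) < max_pages:
        new_pages = {}
        for seed in frontier.values():
            for url, raw_page in fetch_related(seed):
                if url in corpus or url in new_pages:
                    continue  # already labeled, nothing to expand
                labeled = label_page(seed, raw_page)
                if labeled is not None:  # None means the page was discarded
                    new_pages[url] = labeled
        corpus.update(new_pages)
        frontier = new_pages  # S106: newly labeled pages become the seeds

    return corpus  # S105: seeds and expansions combined
```

The scale check runs once per round, so the final corpus may exceed `max_pages` by the pages added in the last round.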

The problems in the prior art and the corresponding solution have been described in detail above. With the method for building a labeled webpage corpus in the embodiments of the present invention, given a small amount of seed labeled data, the labeled data can be expanded repeatedly in a loop to form large-scale standard labeled data. The method requires no repeated manual labeling of identical webpage content, which saves human and material resources and allows the labeled webpage corpus to reach a larger scale. Further, applications built on data labeled in this way, such as retrieval, can achieve higher retrieval precision, which avoids degrading the performance of Internet servers.

Specifically, referring to Fig. 6, the embodiments of the present invention provide another method embodiment 2 for building a labeled webpage corpus, which may comprise:

S601: generate initial seed labeled webpage data from pre-selected initial seed webpages, where the initial seed webpages are a set of webpages of different types and the initial seed labeled webpage data consists of seed webpages in which the body text and title have been labeled.

Parts of this embodiment that are similar to embodiment 1 can be understood by cross-reference and are not described in detail again here.

S602: obtain a preset number of relevant seed webpages from a search engine according to keywords of the initial seed labeled webpage data.

S603: calculate the similarity between the initial seed webpage and each retrieved seed webpage using a vector space model.

This embodiment differs from the previous one in that, after the preset number of webpages is obtained directly from the search engine, a vector space model (or another existing similarity calculation method) may be used to calculate the similarity between the initial seed page and each retrieved seed webpage.

S604: take the retrieved seed webpages whose similarity exceeds a preset threshold as the relevant seed webpages.

After the similarity is calculated, the retrieved seed webpages whose similarity exceeds a certain threshold are taken as relevant seed webpages. This embodiment does not simply take the first M result pages as relevant seed webpages but adds a similarity calculation step, mainly because the ranking of a search engine's results does not directly reveal how similar a retrieved webpage is to the query terms; the ranking is determined by the search engine's specific ranking algorithm.

The ranking algorithms of current search engines involve many factors (for example, Baidu's paid ranking), and similarity is only one of them. In this embodiment, therefore, the similarity between webpages is recalculated for the search results obtained directly, so as to find accurately the relevant seed webpages of the initial seed webpage, that is, the reposting webpages. Specifically, the similarity can be obtained by segmenting the retrieved webpages with a word segmentation tool and then calculating the cosine (cos) of the angle between each of their vectors and the seed page vector; this is the vector space model.
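A minimal sketch of the cosine computation and the threshold filter of S603 and S604 follows. It assumes pages are already segmented into token lists and uses raw term counts as vector components; the function names and the threshold value 0.8 are hypothetical, since the text does not fix a weighting scheme or a threshold.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm_a * norm_b)


def filter_relevant(seed_tokens, retrieved, threshold=0.8):
    """Keep the URLs of retrieved pages whose similarity exceeds the threshold.

    retrieved: iterable of (url, tokens) pairs returned by the search step.
    """
    return [url for url, tokens in retrieved
            if cosine_similarity(seed_tokens, tokens) > threshold]
```

A weighted scheme such as TF-IDF could replace the raw counts without changing the structure of the filter.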

S605: label the body text of the relevant seed webpage according to a subset of the body text labeled in the initial seed labeled webpage data.

S606: label the title of the relevant seed webpage according to the title labeled in the initial seed labeled webpage data.

S607: label the content not yet labeled in the relevant seed webpage as other, obtaining relevant seed labeled webpage data.

S608: judge whether the relevant seed labeled webpage data and the initial seed labeled webpage data satisfy a preset condition; if so, go to step S609; if not, go to step S610.

S609: combine the relevant seed labeled webpage data and the initial seed labeled webpage data into the labeled webpage corpus.

S610: use the relevant seed labeled webpage data as the initial seed labeled webpage data and execute step S602 again.

In short, this embodiment not only achieves the goal of embodiment 1, building a large-scale labeled webpage corpus, but also refines the search engine's retrieval results, so the labeled webpage corpus it builds is more accurate and effective, which in turn makes subsequent applications more effective and accurate.

Referring to Fig. 7, the embodiments of the present invention provide another method embodiment 3 for building a labeled webpage corpus, which may comprise:

S701: generate initial seed labeled webpage data from pre-selected initial seed webpages, where the initial seed webpages are a set of webpages of different types and the initial seed labeled webpage data consists of seed webpages in which the body text and title have been labeled.

Parts of this embodiment that are similar to embodiments 1 and 2 can be understood by cross-reference and are not described in detail again here.

S702: obtain a preset number of relevant seed webpages from a search engine according to keywords of the initial seed labeled webpage data.

S703: label the relevant seed webpages according to the initial seed labeled webpage data to obtain relevant seed labeled webpage data.

S704: judge whether the relevant seed labeled webpage data and the initial seed labeled webpage data satisfy a preset condition; if so, go to step S705; if not, go to step S706.

S705: combine the relevant seed labeled webpage data and the initial seed labeled webpage data into the labeled webpage corpus, and go to step S707.

S706: use the relevant seed labeled webpage data as the initial seed labeled webpage data and return to step S702.

S707: train an extraction model for extracting webpage content from the labeled webpage corpus.

In this embodiment, after the labeled webpage corpus has been built in the manner described above, an extraction model for extracting webpage content can also be trained on it.

S708: extract the title and body content of a target webpage with the extraction model.

The trained extraction model can extract the title and body content of a target webpage, so that the title and body text of the target webpage are obtained accurately. The method of this embodiment can therefore improve the precision and accuracy of content extraction.

Of course, once the labeled webpage corpus has been built, it can also be used for indexing: a search engine server can conveniently build an index by referring to the titles and body content in the corpus. Because there are many possible subsequent applications, they are not enumerated one by one here; those skilled in the art can implement them in combination with applications in the prior art.

Corresponding to Method Embodiment 1 for building an annotated web page corpus provided by the embodiments of the present invention, the embodiments of the present invention further provide Device Embodiment 1 for building an annotated web page corpus. Referring to Fig. 8, the device may comprise:

Generation module 801, configured to generate an initial seed annotated web page corpus for pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types, and the initial seed annotated web page corpus consists of seed web pages in which the body text and the title have been annotated.

Acquisition module 802, configured to obtain a preset number of relevant seed web pages from a search engine according to keywords of the initial seed annotated web page corpus.

Referring to Fig. 9, the acquisition module 802 may specifically comprise:

Word segmentation sub-module 901, configured to segment the title and body text of the initial seed web pages with a word segmentation tool to obtain initial keywords of the initial seed web pages.

First calculation sub-module 902, configured to calculate a weight value for each initial keyword according to the part of speech, word frequency and word position information of the initial keywords.

First selection sub-module 903, configured to select the keywords whose weight values are greater than a preset threshold as final keywords.

Retrieval sub-module 904, configured to submit the final keywords to the search engine to obtain retrieved seed web pages.
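The keyword selection performed by sub-modules 902 and 903 can be sketched as follows. This is only an illustration under stated assumptions: the text names part of speech, word frequency and word position as the inputs but gives no formula here, so the product form, the weight tables `POS_WEIGHT` and `POSITION_WEIGHT`, and the function name `select_keywords` are all hypothetical.

```python
from collections import Counter

# Hypothetical part-of-speech weights (assumption, not from the source text).
POS_WEIGHT = {"noun": 1.0, "verb": 0.6, "adj": 0.4}
# Hypothetical position weights: words in the title count more than body words.
POSITION_WEIGHT = {"title": 2.0, "body": 1.0}


def select_keywords(tokens, threshold):
    """tokens: list of (word, pos_tag, position) from some word segmentation tool.

    Returns the words whose weight exceeds `threshold`, highest weight first.
    """
    freq = Counter(word for word, _, _ in tokens)
    total = sum(freq.values())
    weights = {}
    for word, pos_tag, position in tokens:
        # Assumed combination: part-of-speech weight * position weight * relative frequency.
        w = (POS_WEIGHT.get(pos_tag, 0.1)
             * POSITION_WEIGHT.get(position, 1.0)
             * freq[word] / total)
        # A word seen in several positions keeps its best weight.
        weights[word] = max(weights.get(word, 0.0), w)
    selected = [w for w in weights if weights[w] > threshold]
    return sorted(selected, key=lambda w: -weights[w])


tokens = [
    ("corpus", "noun", "title"),
    ("build", "verb", "title"),
    ("corpus", "noun", "body"),
    ("the", "other", "body"),
]
keywords = select_keywords(tokens, threshold=0.25)
```

The selected keywords would then be passed to the retrieval step; the threshold value used above is arbitrary.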

Annotation module 803, configured to annotate the relevant seed web pages according to the initial seed annotated web page corpus to obtain a relevant seed annotated web page corpus.

Referring to Fig. 10, the annotation module 803 may comprise:

First annotation sub-module 1001, configured to annotate the body text of the relevant seed web pages according to a subset of the body text annotated in the initial seed annotated web page corpus.

Referring to Fig. 11, the first annotation sub-module 1001 may comprise:

Subset extraction sub-module 1101, configured to extract a subset of the body text from the initial seed annotated web page corpus, where the subset is the content of any one or more parts of the body text of the initial seed annotated web pages.

Search sub-module 1102, configured to locate, in the relevant seed web pages, the corresponding relevant start portion and relevant end portion according to the initial start portion and initial end portion of the subset.

Content extraction sub-module 1103, configured to perform content extraction on the relevant seed web pages with a content extraction tool to obtain an extracted start portion and an extracted end portion.

First judgment sub-module 1104, configured to judge whether the relevant start portion is identical to the extracted start portion and whether the relevant end portion is identical to the extracted end portion.

Fourth annotation sub-module 1105, configured to label the content between the relevant start portion and the relevant end portion as body text when the result of the first judgment sub-module is yes.

Fifth annotation sub-module 1106, configured to label the content between the extracted start portion and the extracted end portion as body text when the result of the first judgment sub-module is no.

Second annotation sub-module 1002, configured to annotate the title of the relevant seed web pages according to the title annotated in the initial seed annotated web page corpus.

Referring to Fig. 12, the second annotation sub-module 1002 may specifically comprise:

Second judgment sub-module 1201, configured to judge whether the title of a relevant seed web page is consistent with the title of the initial seed web page.

Sixth annotation sub-module 1202, configured to annotate the title of the relevant seed web page when the result of the second judgment sub-module is yes.

Termination sub-module 1203, configured to end the title annotation process for the relevant seed web page when the result of the second judgment sub-module is no.

Third annotation sub-module 1003, configured to label the content of the relevant seed web pages that has not been annotated as "other".
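The decision logic of the three annotation sub-modules can be sketched as below. The sketch makes several assumptions that are not stated in the text: a page is modeled as a list of text blocks, "consistent" titles are treated as equal strings, the start and end blocks themselves are included in the body, and `extract_span` is a hypothetical stand-in for the content extraction tool.

```python
def annotate_page(blocks, page_title, seed_title, seed_start, seed_end, extract_span):
    """Label each text block of a relevant seed page as 'title', 'body' or 'other'.

    blocks       : text blocks of the page, in document order
    seed_start   : first block of the body subset taken from the seed corpus
    seed_end     : last block of that subset
    extract_span : stand-in for a content extraction tool, returns (start_idx, end_idx)
    """
    labels = ["other"] * len(blocks)  # content left unlabeled stays 'other'

    # Body text: look up the seed subset boundaries in the relevant page.
    rel_start = blocks.index(seed_start) if seed_start in blocks else None
    rel_end = blocks.index(seed_end) if seed_end in blocks else None
    ext_start, ext_end = extract_span(blocks)

    if (rel_start, rel_end) == (ext_start, ext_end):
        start, end = rel_start, rel_end   # both sources agree
    else:
        start, end = ext_start, ext_end   # otherwise use the extraction tool's span
    for i in range(start, end + 1):
        labels[i] = "body"

    # Title: label it only when it is consistent with the seed title.
    if page_title == seed_title and page_title in blocks:
        labels[blocks.index(page_title)] = "title"
    return labels


blocks = ["Nav", "Big news", "First para", "Middle", "Last para", "Footer"]
labels = annotate_page(blocks, "Big news", "Big news",
                       "First para", "Last para", lambda b: (2, 4))
```

In this toy call both sources agree on blocks 2 to 4, so those blocks are labeled as body text, the matching title is labeled, and the navigation and footer blocks remain "other".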

Judgment module 804, configured to judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus satisfy a preset condition.

The judgment module 804 may specifically be configured to judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus are both expanded; or to judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus have reached a preset scale.

Combination module 805, configured to combine the relevant seed annotated web page corpus and the initial seed annotated web page corpus into the annotated web page corpus when the result of the judgment module is yes.

Trigger module 806, configured to, when the result of the judgment module is no, use the relevant seed annotated web page corpus as the initial seed annotated web page corpus and trigger the acquisition module 802.

With the device for building an annotated web page corpus in this embodiment of the present invention, given only a small amount of seed annotated corpus data, the scale of the annotated corpus can be expanded iteratively to form a large-scale standard annotated corpus. Building the corpus in this way does not require the same web page content to be annotated manually over and over, so it saves human resources and material costs and also allows the annotated web page corpus to reach a larger scale.
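The interplay of modules 802 to 806 amounts to a bootstrapping loop, which might be sketched as follows. The callables `extract_keywords`, `search` and `annotate` are hypothetical stand-ins supplied by the caller, only the "preset scale" alternative of the preset condition is modeled, and the `max_rounds` bound is an added safety assumption rather than something the text requires.

```python
def build_corpus(seed_corpus, extract_keywords, search, annotate,
                 target_size, max_rounds=10):
    """Grow an annotated corpus from a small seed corpus."""
    initial = list(seed_corpus)
    for _ in range(max_rounds):
        keywords = extract_keywords(initial)   # acquisition module: keywords
        pages = search(keywords)               # acquisition module: retrieval
        relevant = annotate(pages, initial)    # annotation module
        # Judgment module: here, the combined corpus reaches a preset scale.
        if len(initial) + len(relevant) >= target_size:
            return initial + relevant          # combination module
        # Trigger module: the relevant corpus becomes the new initial corpus.
        initial = relevant
    return initial                             # safety exit, not part of the source


def fake_search(keywords):
    # Toy search engine: two result pages per keyword.
    return [f"{k}-page{i}" for k in keywords for i in range(2)]


corpus = build_corpus(
    seed_corpus=["seed"],
    extract_keywords=lambda docs: docs[:2],
    search=fake_search,
    annotate=lambda pages, seeds: [p + "#labeled" for p in pages],
    target_size=5,
)
```

With these toy stand-ins the first round yields two labeled pages (combined size 3, below the target), and the second round yields four more, at which point the loop returns the combined six entries.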

Corresponding to Method Embodiment 2 for building an annotated web page corpus provided by the embodiments of the present invention, the embodiments of the present invention further provide Device Embodiment 2 for building an annotated web page corpus. Referring to Fig. 13, the device may comprise:

Acquisition module 802, configured to obtain a preset number of retrieved seed web pages from a search engine according to keywords of the initial seed annotated web page corpus.

Second calculation sub-module 1301, configured to calculate the similarity between the initial seed web pages and the retrieved seed web pages using a vector space model.

Second selection sub-module 1302, configured to select the retrieved seed web pages whose similarity values are greater than a preset threshold as the relevant seed web pages.
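The similarity filtering of sub-modules 1301 and 1302 can be illustrated with a small vector space model sketch. It assumes raw term-frequency vectors compared by cosine similarity; the text only names the vector space model, so the term weighting, the function names and the threshold value are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_relevant(seed_tokens, retrieved, threshold):
    """retrieved: dict mapping a page id to its token list.

    Keeps the pages whose similarity to the seed page exceeds `threshold`.
    """
    return [pid for pid, toks in retrieved.items()
            if cosine_similarity(seed_tokens, toks) > threshold]


seed_tokens = ["corpus", "web", "page", "label"]
retrieved = {
    "a": ["web", "page", "corpus", "build"],
    "b": ["football", "score"],
}
relevant_ids = filter_relevant(seed_tokens, retrieved, threshold=0.5)
```

Page "a" shares three of four terms with the seed page and passes the threshold, while page "b" shares none and is dropped.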

In short, building an annotated web page corpus with the device of this embodiment not only achieves the goal of establishing a large-scale annotated web page corpus; because the search results returned by the search engine are also refined in this embodiment, the resulting annotated web page corpus is more accurate and effective, which in turn makes subsequent applications more effective and accurate.

Corresponding to Method Embodiment 3 for building an annotated web page corpus provided by the embodiments of the present invention, the embodiments of the present invention further provide Device Embodiment 3 for building an annotated web page corpus. Referring to Fig. 14, the device may comprise:

Trigger module 806, configured to, when the result of the judgment module is no, use the relevant seed annotated web page corpus as the initial seed annotated web page corpus and trigger the acquisition module 802.

Training module 1401, configured to train, according to the annotated web page corpus, an extraction model for extracting web page content.

Extraction module 1402, configured to extract the title and body text of a target web page according to the extraction model.

In addition, it should be noted that the above series of processes and devices may also be implemented by software and/or firmware. In that case, a program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure, for example the general-purpose personal computer 1500 shown in Fig. 15, and the computer can perform various functions when the various programs are installed.

In Fig. 15, a central processing unit (CPU) 1501 performs various processes according to a program stored in a read-only memory (ROM) 1502 or a program loaded from a storage section 1508 into a random access memory (RAM) 1503. The RAM 1503 also stores, as needed, the data required when the CPU 1501 performs the various processes.

The CPU 1501, the ROM 1502 and the RAM 1503 are connected to one another via a bus 1504. An input/output interface 1505 is also connected to the bus 1504.

The following components are connected to the input/output interface 1505: an input section 1506, including a keyboard, a mouse and the like; an output section 1507, including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; the storage section 1508, including a hard disk and the like; and a communication section 1509, including a network interface card such as a LAN card, a modem and the like. The communication section 1509 performs communication processing via a network such as the Internet.

A drive 1510 is also connected to the input/output interface 1505 as needed. A removable medium 1511, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 1510 as needed, so that a computer program read from it is installed into the storage section 1508 as needed.

When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1511.

Those skilled in the art will understand that this storage medium is not limited to the removable medium 1511 shown in Fig. 15, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1511 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 1502, a hard disk included in the storage section 1508, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.

It should also be pointed out that the steps of the above series of processes may naturally be performed chronologically in the order described, but they need not be. Some steps may be performed in parallel or independently of one another.

Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise", "include" or any other variant thereof in the embodiments of the present invention are intended to cover a non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.

With regard to the implementations including the above embodiments, the following remarks are also disclosed:

Remark 1. A method for building an annotated web page corpus, comprising:

generating an initial seed annotated web page corpus for pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types, and the initial seed annotated web page corpus consists of seed web pages in which the body text and the title have been annotated;

obtaining a preset number of relevant seed web pages from a search engine according to keywords of the initial seed annotated web page corpus;

annotating the relevant seed web pages according to the initial seed annotated web page corpus to obtain a relevant seed annotated web page corpus; and

judging whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus satisfy a preset condition; if so, combining the relevant seed annotated web page corpus and the initial seed annotated web page corpus into the annotated web page corpus; if not, using the relevant seed annotated web page corpus as the initial seed annotated web page corpus and performing again the step of obtaining a preset number of relevant seed web pages from the search engine.

2. The method according to Remark 1, wherein the step of annotating the relevant seed web pages according to the initial seed annotated web page corpus comprises:

annotating the body text of the relevant seed web pages according to a subset of the body text annotated in the initial seed annotated web page corpus;

annotating the title of the relevant seed web pages according to the title annotated in the initial seed annotated web page corpus; and

labeling the content of the relevant seed web pages that has not been annotated as "other".

3. The method according to Remark 2, wherein the step of annotating the body text of the relevant seed web pages comprises:

extracting a subset of the body text from the initial seed annotated web page corpus, where the subset is the content of any one or more parts of the body text of the initial seed annotated web pages;

locating, in the relevant seed web pages, the corresponding relevant start portion and relevant end portion according to the initial start portion and initial end portion of the subset;

performing content extraction on the relevant seed web pages with a content extraction tool to obtain an extracted start portion and an extracted end portion; and

judging whether the relevant start portion is identical to the extracted start portion and whether the relevant end portion is identical to the extracted end portion; if so, labeling the content between the relevant start portion and the relevant end portion as body text; if not, labeling the content between the extracted start portion and the extracted end portion as body text.

4. The method according to Remark 2, wherein the step of annotating the title of the relevant seed web pages comprises:

judging whether the title of a relevant seed web page is consistent with the title of the initial seed web page; if so, annotating the title of the relevant seed web page; if not, ending the title annotation process for the relevant seed web page.

5. The method according to Remark 1, wherein the step of obtaining a preset number of relevant seed web pages from a search engine according to keywords of the initial seed annotated web page corpus comprises:

segmenting the title and body text of the initial seed web pages with a word segmentation tool to obtain initial keywords of the initial seed web pages;

calculating a weight value for each initial keyword according to the part of speech, word frequency and word position information of the initial keywords;

selecting the keywords whose weight values are greater than a preset threshold as final keywords; and

submitting the final keywords to the search engine to obtain retrieved seed web pages.

6. The method according to Remark 5, further comprising, after submitting the final keywords to the search engine to obtain retrieved seed web pages:

calculating the similarity between the initial seed web pages and the retrieved seed web pages using a vector space model; and

taking the retrieved seed web pages whose similarity values are greater than a preset threshold as the relevant seed web pages.

7. The method according to Remark 1, wherein the step of judging whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus satisfy a preset condition comprises:

judging whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus are both expanded; or

judging whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus have reached a preset scale.

8. The method according to Remark 1, further comprising, after the annotated web page corpus is obtained:

training, according to the annotated web page corpus, an extraction model for extracting web page content; and

extracting the title and body text of a target web page according to the extraction model.

9. A device for building an annotated web page corpus, comprising:

a generation module, configured to generate an initial seed annotated web page corpus for pre-selected initial seed web pages, where the initial seed web pages are a set of web pages of different types, and the initial seed annotated web page corpus consists of seed web pages in which the body text and the title have been annotated;

an acquisition module, configured to obtain a preset number of relevant seed web pages from a search engine according to keywords of the initial seed annotated web page corpus;

an annotation module, configured to annotate the relevant seed web pages according to the initial seed annotated web page corpus to obtain a relevant seed annotated web page corpus;

a judgment module, configured to judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus satisfy a preset condition;

a combination module, configured to combine the relevant seed annotated web page corpus and the initial seed annotated web page corpus into the annotated web page corpus when the result of the judgment module is yes; and

a trigger module, configured to, when the result of the judgment module is no, use the relevant seed annotated web page corpus as the initial seed annotated web page corpus and trigger the acquisition module.

10. The device according to Remark 9, wherein the annotation module comprises:

a first annotation sub-module, configured to annotate the body text of the relevant seed web pages according to a subset of the body text annotated in the initial seed annotated web page corpus;

a second annotation sub-module, configured to annotate the title of the relevant seed web pages according to the title annotated in the initial seed annotated web page corpus; and

a third annotation sub-module, configured to label the content of the relevant seed web pages that has not been annotated as "other".

11. The device according to Remark 10, wherein the first annotation sub-module comprises:

a subset extraction sub-module, configured to extract a subset of the body text from the initial seed annotated web page corpus, where the subset is the content of any one or more parts of the body text of the initial seed annotated web pages;

a search sub-module, configured to locate, in the relevant seed web pages, the corresponding relevant start portion and relevant end portion according to the initial start portion and initial end portion of the subset;

a content extraction sub-module, configured to perform content extraction on the relevant seed web pages with a content extraction tool to obtain an extracted start portion and an extracted end portion;

a first judgment sub-module, configured to judge whether the relevant start portion is identical to the extracted start portion and whether the relevant end portion is identical to the extracted end portion;

a fourth annotation sub-module, configured to label the content between the relevant start portion and the relevant end portion as body text when the result of the first judgment sub-module is yes; and

a fifth annotation sub-module, configured to label the content between the extracted start portion and the extracted end portion as body text when the result of the first judgment sub-module is no.

12. The device according to Remark 10, wherein the second annotation sub-module comprises:

a second judgment sub-module, configured to judge whether the title of a relevant seed web page is consistent with the title of the initial seed web page;

a sixth annotation sub-module, configured to annotate the title of the relevant seed web page when the result of the second judgment sub-module is yes; and

a termination sub-module, configured to end the title annotation process for the relevant seed web page when the result of the second judgment sub-module is no.

13. The device according to Remark 9, wherein the acquisition module comprises:

a word segmentation sub-module, configured to segment the title and body text of the initial seed web pages with a word segmentation tool to obtain initial keywords of the initial seed web pages;

a first calculation sub-module, configured to calculate a weight value for each initial keyword according to the part of speech, word frequency and word position information of the initial keywords;

a first selection sub-module, configured to select the keywords whose weight values are greater than a preset threshold as final keywords; and

a retrieval sub-module, configured to submit the final keywords to the search engine to obtain retrieved seed web pages.

14. The device according to Remark 13, further comprising:

a second calculation sub-module, configured to calculate the similarity between the initial seed web pages and the retrieved seed web pages using a vector space model; and

a second selection sub-module, configured to select the retrieved seed web pages whose similarity values are greater than a preset threshold as the relevant seed web pages.

15. The device according to Remark 9, wherein the judgment module is configured to:

judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus are both expanded; or judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus have reached a preset scale.

16. The device according to Remark 9, further comprising:

a training module, configured to train, according to the annotated web page corpus, an extraction model for extracting web page content; and

an extraction module, configured to extract the title and body text of a target web page according to the extraction model.

Claims

1. A method for building an annotated web page corpus, the annotated web page corpus being used to train an extraction model so that the title and body text of a target web page are extracted according to the extraction model, the method comprising:

judging whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus satisfy a preset condition; if so, combining the relevant seed annotated web page corpus and the initial seed annotated web page corpus into the annotated web page corpus; if not, using the relevant seed annotated web page corpus as the initial seed annotated web page corpus and performing the step of obtaining a preset number of relevant seed web pages from a search engine according to keywords of the initial seed annotated web page corpus.

2. The method according to claim 1, wherein the step of annotating the relevant seed web pages according to the initial seed annotated web page corpus comprises:

labeling the content of the relevant seed web pages that has not been annotated as "other".

3. The method according to claim 2, wherein the step of annotating the body text of the relevant seed web pages comprises:

4. The method according to claim 2, wherein the step of annotating the title of the relevant seed web pages comprises:

5. The method according to claim 1, wherein the step of judging whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus satisfy a preset condition comprises:

6. A device for building an annotated web page corpus, the annotated web page corpus being used to train an extraction model so that the title and body text of a target web page are extracted according to the extraction model, the device comprising:

7. The device according to claim 6, wherein the annotation module comprises:

8. The device according to claim 7, wherein the first annotation sub-module comprises:

9. The device according to claim 7, wherein the second annotation sub-module comprises:

a sixth annotation sub-module, configured to annotate the title of the relevant seed web page when the result of the second judgment sub-module is yes; and

10. The device according to claim 6, wherein the judgment module is configured to: judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus are both expanded; or judge whether the relevant seed annotated web page corpus and the initial seed annotated web page corpus have reached a preset scale.