CN102855324B

CN102855324B - Method and device for automatic extraction of network information

Info

Publication number: CN102855324B
Application number: CN201210335719.1A
Authority: CN
Inventors: 杨俊拯; 温予; 张旸; 黄百宁; 王世平; 葛猛; 孟玲会
Original assignee: BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-09-11
Filing date: 2012-09-11
Publication date: 2015-08-26
Anticipated expiration: 2032-09-11
Also published as: CN102855324A

Abstract

The invention provides a method and device for automatically extracting network information. The method comprises: finding, in a web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of the given information S; generating an information pattern set P′ according to predefined rules, and combining the information pattern set P′ with a regular expression set P to obtain a set P₁; and matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the capture process ending when Ssub == Ssub′. By automatically generating the corresponding regular expression set for different web pages, the invention extracts web page content automatically and saves a great deal of manual work.

Description

Method and device for automatic extraction of network information

Technical field

The present invention relates to a method and device for automatically extracting network information, and belongs to the technical field of network information extraction.

Background art

In the prior art, information presented on web pages is generally described by regular expressions. Because different web pages usually require different regular expressions, extracting network information involves a heavy workload.

Summary of the invention

To solve the problem that existing network information extraction involves a heavy workload, the present invention provides a method and device for automatically extracting network information. To this end, the invention provides the following technical solution:

A method for automatically extracting network information, comprising:

finding, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of the given information S;

generating an information pattern set P′ according to predefined rules, and combining the information pattern set P′ with a regular expression set P to obtain a set P₁;

matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the capture process ending when Ssub == Ssub′.

A device for automatically extracting network information, comprising:

a web page selection unit, for finding, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of the given information S;

a set selection unit, for generating an information pattern set P′ according to predefined rules, and combining the information pattern set P′ with a regular expression set P to obtain a set P₁;

a content capture unit, for matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the capture process ending when Ssub == Ssub′.

The technical solution provided by the invention automatically generates the corresponding regular expression set for different web pages, so that web page content is extracted automatically and a great deal of manual work is saved.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of obtaining information from two web pages, according to a specific embodiment of the present invention;

Fig. 2 is a schematic diagram of obtaining information from n web pages, according to a specific embodiment of the present invention;

Fig. 3 is a schematic flow chart of the method for automatically extracting network information, according to a specific embodiment of the present invention;

Fig. 4 is a schematic flow chart of generating the information pattern set P′, according to a specific embodiment of the present invention;

Fig. 5 is a schematic flow chart of verifying the regular expression set, according to a specific embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the device for automatically extracting network information, according to a specific embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.

The principle of the technical solution provided by this embodiment is that web pages of different types can contain the same information, although the same information is expressed differently on different websites. In the music field, for example, the Internet has many websites and forums that contain music information. Their page structures and presentation generally differ, yet they contain a great deal of information of the same kinds, such as song titles, singer names and albums. For one kind of information, on web pages of the same type (denoted urlpattern1), the information can be represented by a regular expression (prefix1 info suffix1), and the collection of extracted values is denoted V1. Web pages of a different type (urlpattern2) have a different regular expression (prefix2 info suffix2), and the collection of values from that website is denoted V2. The intersection of V1 and V2 is then non-empty, and the values in V1 and V2 describe the same kind of information. By analogy, if there are n different types of web pages, there are at most n value sets and at most n regular expressions. The specific logic is shown in Figs. 1 and 2.

Therefore, given a partial set of the given information (for example a sample of 10 to 100 items), denoted Ssub, an information collection S′ can be obtained from the web page collection W. Coverage is defined as |S ∩ S′| / |S| and accuracy as |S ∩ S′| / |S′|. For web page content extraction, accuracy matters more than coverage: if accuracy is too low the result is meaningless for most applications, whereas low coverage can be compensated for by a massive number of web pages. The technical solution provided by this embodiment is therefore aimed at improving the accuracy of web page content extraction. It is described in detail below with reference to the drawings. As shown in Fig. 3, the method for automatically extracting network information comprises:

Step 31: find, in the web page collection W related to the given information S, the web pages W′ that contain elements of the subset Ssub of the given information S.

Specifically, for the subset Ssub of the given information S, the elements of Ssub are enumerable, and a regular expression collection is defined. The web page collection W related to the given information S is first traversed, and the web pages W′ that contain elements of the subset Ssub are found in W.
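Step 31 can be illustrated with a minimal Python sketch. It assumes that the web page collection W is held in memory as a mapping from URL to page text and that "contains an element of Ssub" means a plain substring match; the function name `select_pages` and the sample URLs are hypothetical and do not come from the patent.

```python
def select_pages(W, Ssub):
    """Return the pages of W whose text contains at least one element of Ssub."""
    return {url: html for url, html in W.items()
            if any(s in html for s in Ssub)}


# Hypothetical sample data: two pages and a one-element seed set.
W = {
    "http://example.com/a": "<li>Yesterday</li><li>Hey Jude</li>",
    "http://example.com/b": "<p>No songs here</p>",
}
Ssub = {"Yesterday"}

W_prime = select_pages(W, Ssub)  # only the first page contains a seed element
```

A real crawler would fetch and decode pages rather than hold them in a dictionary; the sketch only shows the selection criterion.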

Step 32: generate the information pattern set P′ according to predefined rules, and combine the information pattern set P′ with the regular expression set P to obtain the set P₁.

The information pattern set P′ is generated according to predefined rules such that W′ => Ssub. The generation process of P′ is shown in Fig. 4 and may specifically comprise:

First, the pattern of a regular expression is defined as p = prefix info suffix, and the following components of a regular expression are defined for use in the subsequent steps: the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, the Chinese character set ChineseSet and the web page tag set MetaSet. The content of info is expressed using NumberSet, EnglishSet, SpecialSet and ChineseSet, and the content of prefix and suffix is expressed using MetaSet;

traverse the subset Ssub of the given information S, take an element s, and find the position of s in a web page w;

search backward from that position to the first web page tag preceding s, denoted prefix; search forward to the first web page tag following s, denoted suffix;

generate the set of regular expressions for element s on web page w by describing the content between prefix and suffix according to the description rules of NumberSet, EnglishSet, SpecialSet and ChineseSet;

generate the regular expression set of Ssub on web page w from the regular expression sets of the elements s on web page w, recorded as P′ = p1, p2, …, pn.
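The pattern generation steps above can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the character classes chosen for NumberSet, EnglishSet and ChineseSet, the treatment of every other character as an escaped SpecialSet literal, the tag pattern used for MetaSet, and the names `char_class`, `describe` and `patterns_for` are all assumptions.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumed stand-in for MetaSet (web page tags)


def char_class(ch):
    """Map a character to an assumed class for NumberSet, EnglishSet or ChineseSet."""
    if "0" <= ch <= "9":
        return "[0-9]"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "[A-Za-z]"
    if "\u4e00" <= ch <= "\u9fff":
        return "[\u4e00-\u9fff]"
    return None  # treated as SpecialSet: kept as an escaped literal


def describe(text):
    """Describe text as a sequence of character-class runs."""
    parts = []
    for ch in text:
        cls = char_class(ch)
        piece = cls + "+" if cls else re.escape(ch)
        if cls and parts and parts[-1] == piece:
            continue  # same class as the previous character: extend the run
        parts.append(piece)
    return "".join(parts)


def patterns_for(s, html):
    """Return one 'prefix(info)suffix' regex per occurrence of s in html."""
    found = set()
    for m in re.finditer(re.escape(s), html):
        before = list(TAG.finditer(html, 0, m.start()))  # tags preceding s
        after = TAG.search(html, m.end())                # first tag following s
        if not before or after is None:
            continue
        info = html[before[-1].end():after.start()]
        found.add(re.escape(before[-1].group()) + "(" + describe(info) + ")"
                  + re.escape(after.group()))
    return found


page = "<ul><li>Yesterday</li><li>Help</li></ul>"
patterns = patterns_for("Yesterday", page)
extracted = {v for p in patterns for v in re.findall(p, page)}
```

In this example the single seed "Yesterday" yields a pattern that also extracts "Help" from the same list, which is the generalisation the method relies on. A pattern built this way is only as general as the seed it was derived from; a title containing a space, for instance, would need a seed that also contains one.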

Step 33: match the set P₁ against all web pages in the web page collection W related to the given information to obtain the set Ssub′; the capture process ends when Ssub == Ssub′.

Specifically, the set P₁ is matched against all web pages in the web page collection W related to the given information to obtain the set Ssub′. If Ssub > Ssub′, Ssub is set to Ssub′ and step 31 is executed again; the capture process ends when Ssub == Ssub′.
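Steps 31 to 33 together form a loop that stops at a fixed point. The sketch below shows one possible shape of that loop under stated assumptions: P starts empty, P₁ is taken as the union of P and P′, the loop repeats whenever Ssub′ differs from Ssub, and a `max_rounds` guard is added for safety. The helper `toy_patterns` is a deliberately simple, hypothetical stand-in for the pattern generation of step 32.

```python
import re


def toy_patterns(pages, Ssub):
    """Hypothetical pattern generator: reuse the tag pair that directly encloses a seed."""
    out = set()
    for html in pages:
        for s in Ssub:
            m = re.search(r"(<\w+>)" + re.escape(s) + r"(</\w+>)", html)
            if m:
                out.add(re.escape(m.group(1)) + r"([\w ]+)" + re.escape(m.group(2)))
    return out


def capture(W, Ssub, generate_patterns, max_rounds=10):
    """Repeat steps 31-33 until the extracted set stops changing."""
    P, P1, Ssub = set(), set(), set(Ssub)
    for _ in range(max_rounds):
        W_prime = [html for html in W if any(s in html for s in Ssub)]  # step 31
        P1 = P | generate_patterns(W_prime, Ssub)                       # step 32
        Ssub_new = {v for p in P1 for html in W for v in re.findall(p, html)}  # step 33
        if Ssub_new == Ssub:
            break  # Ssub == Ssub': capture ends
        P, Ssub = P1, Ssub_new
    return Ssub, P1


W = ["<li>Yesterday</li><li>Hey Jude</li>",
     "<b>Hey Jude</b><b>Let It Be</b>"]
result, learned = capture(W, {"Yesterday"}, toy_patterns)
```

Starting from one seed, the first round learns the `<li>` pattern and picks up "Hey Jude"; that new element makes the second page eligible, where the `<b>` pattern is learned and "Let It Be" is found. The third round changes nothing, so the loop ends.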

Further, this embodiment may also include a process of verifying the regular expression set which, as shown in Fig. 5, may specifically comprise:

combine each web page W′ with the subset Ssub of the given information (taking their product) to obtain the regular expression collection Tt = T1, T2, …, Tn;

traverse the regular expression collection Tt to obtain a regular expression collection T₁, traverse T₁, and match any regular expression p ∈ Tn against the web pages W′ to obtain the set S of values;

if S-Ssub ≠ Φ, discard that expression (this step removes regular expressions that also match other content); if S-Ssub = Φ, the element count Scount for the subset Ssub of the given information is set equal to the number of elements in the set S;

traverse the regular expression collection Tt; for any Tn ∈ Tt, if Tn contains more than one regular expression, keep the regular expression in Tn with the largest Scount and discard the rest (this step selects, among several expressions for the same match, the one that matches the most);

traverse the regular expression collection Tt and compare any two Tn; if their regular expressions are identical, discard either one (this step removes duplicate regular expressions);

the remaining regular expressions form a set, denoted P′ = p1, p2, …, pn.
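The verification steps can be sketched as a filter over groups of candidate expressions. The sketch assumes each Tn is a list of regular expression strings with one capturing group, and that the value set is collected with `re.findall` over the pages W′; the function name `verify` and the sample data are hypothetical.

```python
import re


def verify(groups, pages, Ssub):
    """Keep, per group, the best expression that matches only elements of Ssub."""
    kept = []
    for Tn in groups:
        best, best_count = None, 0
        for p in Tn:
            values = {v for html in pages for v in re.findall(p, html)}
            if values - Ssub:
                continue  # also matches other content: discard
            if len(values) > best_count:  # Scount = number of matched values
                best, best_count = p, len(values)
        if best is not None and best not in kept:  # drop duplicates across groups
            kept.append(best)
    return kept


pages = ["<li>Yesterday</li><li>Help</li><p>Login</p>"]
Ssub = {"Yesterday", "Help"}
groups = [
    [r"<li>([A-Za-z]+)</li>", r"<\w+>([A-Za-z]+)</\w+>"],  # the second also matches "Login"
    [r"<li>([A-Za-z]+)</li>", r"<li>(Y[a-z]+)</li>"],      # the second matches fewer values
]
P_prime = verify(groups, pages, Ssub)
```

Here the over-general expression is rejected because it extracts "Login", the narrower expression loses to the one with the larger Scount, and the duplicate winner from the second group is dropped, leaving a single expression.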

With the technical solution provided by this embodiment, the corresponding regular expression set is generated automatically for different web pages, so that web page content is extracted automatically, a great deal of manual work is saved, and the correctness of the regular expressions can be verified.
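The coverage and accuracy measures defined in the principle discussion, |S ∩ S′| / |S| and |S ∩ S′| / |S′|, could be used to evaluate a capture run. The following Python sketch makes them concrete; the function names and the sample sets are illustrative assumptions, and returning 0.0 for an empty denominator is a convention chosen here, not something the patent specifies.

```python
def coverage(S, S_prime):
    """|S ∩ S'| / |S|: share of the given information that was extracted."""
    return len(S & S_prime) / len(S) if S else 0.0


def accuracy(S, S_prime):
    """|S ∩ S'| / |S'|: share of the extracted values that are correct."""
    return len(S & S_prime) / len(S_prime) if S_prime else 0.0


# Hypothetical data: four known titles, three extracted values, one of them wrong.
S = {"Song A", "Song B", "Song C", "Song D"}
S_prime = {"Song A", "Song B", "Advert"}

cov = coverage(S, S_prime)  # 2 of 4 known titles were found
acc = accuracy(S, S_prime)  # 2 of 3 extracted values are real titles
```

The example reflects the trade-off stated in the description: the missing titles can be recovered from other pages, whereas the wrong value "Advert" directly lowers the usefulness of the result.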

It should be noted that a person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the instruction of a program. The program may be stored in a computer-readable storage medium, such as a read-only memory (ROM), a magnetic disk or an optical disc.

A specific embodiment of the present invention also provides a device for automatically extracting network information which, as shown in Fig. 6, comprises:

a web page selection unit 61, for finding, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of the given information S;

a set selection unit 62, for generating an information pattern set P′ according to predefined rules, and combining the information pattern set P′ with a regular expression set P to obtain a set P₁;

a content capture unit 63, for matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the capture process ending when Ssub == Ssub′.

Optionally, the set selection unit 62 comprises a traversal subunit, a backtracking subunit, a regular expression set description subunit and a regular expression set generation subunit. The traversal subunit is used to traverse the subset Ssub of the given information S, take an element s, and find the position of s in a web page w. The backtracking subunit is used to search backward to the first web page tag preceding s, denoted prefix, and to search forward to the first web page tag following s, denoted suffix. The regular expression set description subunit is used to generate the set of regular expressions for element s on web page w by describing the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet. The regular expression set generation subunit is used to generate the regular expression set of Ssub on web page w from the regular expression sets of the elements s on web page w, recorded as P′ = p1, p2, …, pn.

Optionally, the device may also comprise a verification unit, which comprises a multiplication subunit, a matching subunit, an element count determination subunit, a first screening subunit, a second screening subunit and a regular expression set determination subunit. The multiplication subunit is used to combine each web page W′ with the subset Ssub of the given information (taking their product) to obtain the regular expression collection Tt = T1, T2, …, Tn. The matching subunit is used to traverse the regular expression collection Tt to obtain a regular expression collection T₁, traverse T₁, and match any regular expression p ∈ Tn against the web pages W′ to obtain the set S of values. The element count determination subunit is used to discard the expression if S-Ssub ≠ Φ and, if S-Ssub = Φ, to set the element count Scount for the subset Ssub of the given information equal to the number of elements in the set S. The first screening subunit is used to traverse the regular expression collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, to keep the regular expression in Tn with the largest Scount and discard the rest. The second screening subunit is used to traverse the regular expression collection Tt, compare any two Tn and, if their regular expressions are identical, discard either one. The regular expression set determination subunit is used to form a set from the remaining regular expressions, denoted P′ = p1, p2, …, pn.

The specific implementation of the processing functions of the units of the above device has been described in the method embodiment above and is not repeated here.

It should be noted that, in the above device embodiment, the units are divided according to functional logic, but the division is not limited to the one described as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for distinguishing them from one another and do not limit the protection scope of the present invention.

The above is only a preferred embodiment of the present invention, but the protection scope of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the embodiments of the invention shall fall within the protection scope of the invention. The protection scope of the invention is therefore defined by the protection scope of the claims.

Claims

1. A method for automatically extracting network information, characterized by comprising:

generating an information pattern set P′ according to predefined rules, and combining the information pattern set P′ with a regular expression set P to obtain a set P₁; wherein, regular expression collection; said generating an information pattern set P′ according to predefined rules comprises: traversing the subset Ssub of the given information S, taking an element s, and finding the position of s in a web page w; searching backward to the first web page tag preceding s, denoted prefix; searching forward to the first web page tag following s, denoted suffix; generating the set of regular expressions for element s on web page w by describing the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet; and generating the information pattern set of Ssub on web page w from the regular expression sets of the elements s on web page w, recorded as P′ = p1, p2, …, pn;

matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the capture process ending when Ssub == Ssub′.

2. The method according to claim 1, characterized in that the method further comprises verifying the regular expression set, said verifying comprising:

traversing the regular expression collection Tt to obtain a regular expression collection T1, traversing T1, and matching any regular expression p ∈ Tn against the web pages W′ to obtain the set F of values;

if F-Ssub ≠ Φ, discarding that expression; if F-Ssub = Φ, setting the element count Scount for the subset Ssub of the given information equal to the number of elements in the set F;

traversing the regular expression collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeping the regular expression in Tn with the largest Scount and discarding the rest;

traversing the regular expression collection Tt, comparing any two Tn and, if their regular expressions are identical, discarding either one;

3. A device for automatically extracting network information, characterized by comprising:

a set selection unit, for generating an information pattern set P′ according to predefined rules, and combining the information pattern set P′ with a regular expression set P to obtain a set P₁; wherein, regular expression collection

a content capture unit, for matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the capture process ending when Ssub == Ssub′;

wherein the set selection unit comprises:

a traversal subunit, for traversing the subset Ssub of the given information S, taking an element s, and finding the position of s in a web page w;

a backtracking subunit, for searching backward to the first web page tag preceding s, denoted prefix, and searching forward to the first web page tag following s, denoted suffix;

a regular expression set description subunit, for generating the set of regular expressions for element s on web page w by describing the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet;

a regular expression set generation subunit, for generating the regular expression set of Ssub on web page w from the regular expression sets of the elements s on web page w, recorded as P′ = p1, p2, …, pn.

4. The device according to claim 3, characterized in that the device further comprises a verification unit, said verification unit comprising:

a multiplication subunit, for combining each web page W′ with the subset Ssub of the given information (taking their product) to obtain the regular expression collection Tt = T1, T2, …, Tn;

a matching subunit, for traversing the regular expression collection Tt to obtain a regular expression collection T1, traversing T1, and matching any regular expression p ∈ Tn against the web pages W′ to obtain the set F of values;

an element count determination subunit, for discarding the expression if F-Ssub ≠ Φ and, if F-Ssub = Φ, setting the element count Scount for the subset Ssub of the given information equal to the number of elements in the set F;

a first screening subunit, for traversing the regular expression collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeping the regular expression in Tn with the largest Scount and discarding the rest;

a second screening subunit, for traversing the regular expression collection Tt, comparing any two Tn and, if their regular expressions are identical, discarding either one;

a regular expression set determination subunit, for forming a set from the remaining regular expressions, denoted P′ = p1, p2, …, pn.