CN102855324A

CN102855324A - Automatic extracting method and device for network information

Info

Publication number: CN102855324A
Application number: CN2012103357191A
Authority: CN
Inventors: 杨俊拯; 温予; 张旸; 黄百宁; 王世平; 葛猛; 孟玲会
Original assignee: BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-09-11
Filing date: 2012-09-11
Publication date: 2013-01-02
Anticipated expiration: 2032-09-11
Also published as: CN102855324B

Abstract

The invention provides a method and a device for automatically extracting network information. The method comprises: finding, in a web page collection W related to given information S, the web pages W' that contain elements of a subset Ssub of the given information S; generating an information pattern set P' according to a predetermined rule, and taking the union of the information pattern set P' and a regular expression set P to obtain a set P₁; and matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub', the crawling process ending when Ssub == Ssub'. Because the method and device generate a regular expression set suited to each kind of web page, web page content can be extracted automatically, which saves considerable work.

Description

Method and device for automatically extracting network information

Technical field

The present invention relates to a method and a device for automatically extracting network information, and belongs to the technical field of network information extraction.

Background technology

In the prior art, information presented on web pages is generally described by regular expressions. Different web pages usually require different regular expressions, so extracting network information involves a large amount of work.

Summary of the invention

To solve the problem that existing network information extraction involves a large amount of work, the present invention provides a method and a device for automatically extracting network information. To this end, the invention provides the following technical solutions:

A method for automatically extracting network information comprises:

finding, from the web page collection W related to given information S, the web pages W' that contain elements of a subset Ssub of the given information S;

generating an information pattern set P' according to a predetermined rule, and taking the union of the information pattern set P' and a regular expression set P to obtain a set P₁;

matching the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub', the crawling process ending when Ssub == Ssub'.

A device for automatically extracting network information comprises:

a web page selection unit, used to find, from the web page collection W related to given information S, the web pages W' that contain elements of a subset Ssub of the given information S;

a set selection unit, used to generate an information pattern set P' according to a predetermined rule, and to take the union of the information pattern set P' and a regular expression set P to obtain a set P₁;

a content crawling unit, used to match the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub', the crawling process ending when Ssub == Ssub'.

By generating a regular expression set suited to each kind of web page, the technical solution provided by the invention extracts web page content automatically and saves considerable work.

Description of drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below obviously show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of obtaining information from two web pages, provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of obtaining information from n web pages, provided by an embodiment of the invention;

Fig. 3 is a flowchart of the method for automatically extracting network information, provided by an embodiment of the invention;

Fig. 4 is a flowchart of generating the information pattern set P', provided by an embodiment of the invention;

Fig. 5 is a flowchart of verifying the regular expression set, provided by an embodiment of the invention;

Fig. 6 is a structural diagram of the device for automatically extracting network information, provided by an embodiment of the invention.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The principle of the technical solution provided by this embodiment is as follows. Web pages of different types can contain the same information, but the same information is expressed differently on different websites. In the music field, for example, the Internet hosts many music information websites, forums, and the like. Their page structures and presentation generally differ, yet they contain a great deal of information of the same kinds, such as song titles, singer names, and albums. For one kind of information, web pages of the same type (denoted urlpattern1) can be described by a regular expression (prefix1 info suffix1), and the collection of extracted values is denoted V1. Web pages of a different type (urlpattern2) have a different regular expression (prefix2 info suffix2), and the collection of values from that website is denoted V2. The intersection of V1 and V2 is then non-empty, which indicates that the values in V1 and V2 describe the same information. By extension, if there are n different types of web pages, there are at most n value sets and at most n regular expressions. This logic is illustrated in Figs. 1 and 2.

Therefore, given a partial set of the given information (a sample of, for example, 10 to 100 items), denoted Ssub, an information collection S' can be obtained from the web page collection W. Coverage is defined as |S ∩ S'| / |S| and accuracy as |S ∩ S'| / |S'|. For web page content extraction, accuracy matters more than coverage: if accuracy is too low, the result is meaningless for most applications, whereas low coverage can be compensated for by a massive number of web pages. The technical solution provided by this embodiment is therefore aimed at improving the accuracy of web page content extraction. It is described in detail below with reference to the drawings. As shown in Fig. 3, the method for automatically extracting network information comprises:

Step 31: find, from the web page collection W related to given information S, the web pages W' that contain elements of the subset Ssub of the given information S.

Specifically, for the subset Ssub of the given information S, the elements of Ssub are enumerable, and the regular expression set P is defined.

First, the web page collection W related to the given information S is traversed, and the web pages W' that contain elements of the subset Ssub of the given information S are found in W.
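Step 31 can be illustrated with a minimal Python sketch. It assumes the web page collection W is held as a dictionary from URL to page text and that "contains an element" means a plain substring test; the function name `select_pages` and the sample data are hypothetical and do not come from the patent.

```python
def select_pages(pages, s_sub):
    """Return W': the pages of W whose text contains at least one element of Ssub.

    pages -- dict mapping URL to page text (assumed representation of W)
    s_sub -- iterable of known values (the subset Ssub)
    """
    return {url: html for url, html in pages.items()
            if any(item in html for item in s_sub)}


# Hypothetical sample data
W = {
    "http://a.example/1": "<b>Hey Jude</b>",
    "http://b.example/1": "<i>Imagine</i>",
    "http://c.example/1": "<p>No songs here</p>",
}
W_prime = select_pages(W, {"Hey Jude", "Imagine"})
```

Pages that mention none of the sample values are skipped, so pattern generation in Step 32 only runs where a known value gives an anchor.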

Step 32: generate an information pattern set P' according to a predetermined rule, and take the union of the information pattern set P' and the regular expression set P to obtain a set P₁.

The information pattern set P' is generated according to the predetermined rule, letting W' = Ssub. As shown in Fig. 4, the generation of the information pattern set P' may specifically comprise:

First, the pattern of a regular expression is defined as p = prefix info suffix, and the following sets serve as the components of regular expressions: the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, the Chinese character set ChineseSet, and the web page tag set MetaSet. The info part of a regular expression is expressed using NumberSet, EnglishSet, SpecialSet, and ChineseSet, and the prefix and suffix parts are expressed using MetaSet;

The subset Ssub of the given information S is traversed to find an element s, and the position of the element s in a web page w is located;

Backtracking from that position toward the start of the page, the first web page tag found is denoted prefix; searching toward the end of the page, the first web page tag found is denoted suffix;

The content between prefix and suffix is described using the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, and the Chinese character set ChineseSet according to the description rule, generating the regular expression set of the element s on the web page w;

From the regular expression sets of the elements s on the web page w, the regular expression set of Ssub on the web page w is generated and recorded as P' = {p1, p2, ..., pn}.
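The generation steps above can be sketched in Python as follows. This is only one possible reading of the description rule, which the text does not spell out: the character classes standing in for NumberSet, EnglishSet, ChineseSet, and SpecialSet, the run-collapsing in `info_pattern`, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical stand-ins for NumberSet, EnglishSet and ChineseSet; every other
# non-tag character is treated as belonging to SpecialSet.
NUMBER, ENGLISH, CHINESE = "[0-9]", "[A-Za-z]", "[\u4e00-\u9fff]"
SPECIAL = "[^0-9A-Za-z\u4e00-\u9fff<>]"


def char_class(ch):
    """Map one character to the class that describes it."""
    if ch.isascii() and ch.isdigit():
        return NUMBER
    if ch.isascii() and ch.isalpha():
        return ENGLISH
    if "\u4e00" <= ch <= "\u9fff":
        return CHINESE
    return SPECIAL


def info_pattern(text):
    """Describe text as a sequence of character-class runs, e.g. letters, space, letters."""
    runs = []
    for ch in text:
        cls = char_class(ch)
        if not runs or runs[-1] != cls:
            runs.append(cls)
    return "".join(cls + "+" for cls in runs)


def generate_pattern(html, s):
    """Build 'prefix info suffix' for element s on page html, or None if it cannot be built."""
    pos = html.find(s)
    if pos < 0:
        return None
    close_prev = html.rfind(">", 0, pos)              # end of the nearest tag before s
    open_prev = html.rfind("<", 0, close_prev + 1)    # start of that tag
    open_next = html.find("<", pos + len(s))          # start of the nearest tag after s
    close_next = html.find(">", open_next) if open_next >= 0 else -1
    if close_prev < 0 or open_prev < 0 or close_next < 0:
        return None
    prefix = html[open_prev:close_prev + 1]
    middle = html[close_prev + 1:open_next]           # content between prefix and suffix
    suffix = html[open_next:close_next + 1]
    return re.escape(prefix) + "(" + info_pattern(middle) + ")" + re.escape(suffix)


page = '<li><span class="song">Hey Jude</span></li>'
pattern = generate_pattern(page, "Hey Jude")
```

A pattern generated from one known title then matches other values with the same shape between the same tags, for example a different two-word title inside `<span class="song">...</span>`.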

Step 33: match the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub', the crawling process ending when Ssub == Ssub'.

Specifically, the set P₁ is matched against all web pages in the web page collection W related to the given information to obtain the set Ssub'. If Ssub ≠ Ssub', Ssub is set to Ssub' and step 31 is executed again; the crawling process ends when Ssub == Ssub'.
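The loop formed by steps 31 to 33 can be sketched as below. The pattern generator is passed in as a parameter, and the toy generator `tag_pattern`, the `max_rounds` safety cap, and the sample pages are assumptions added for illustration; the patent itself only states the termination condition Ssub == Ssub'.

```python
import re


def extract_until_stable(pages, seed, make_pattern, max_rounds=10):
    """Repeat steps 31-33 until the extracted set stops changing."""
    s_sub, patterns = set(seed), set()       # Ssub and the regular expression set P
    for _ in range(max_rounds):              # safety cap: an assumption, not in the source
        # Step 31: W' = pages containing at least one element of Ssub
        w_prime = [html for html in pages if any(s in html for s in s_sub)]
        # Step 32: generate P' and take the union P1 = P' ∪ P
        p_new = {make_pattern(html, s) for html in w_prime for s in s_sub if s in html}
        patterns |= p_new - {None}
        # Step 33: match P1 against every page in W to obtain Ssub'
        s_next = {m for p in patterns for html in pages for m in re.findall(p, html)}
        if s_next == s_sub:                  # Ssub == Ssub': crawling ends
            break
        s_sub = s_next
    return s_sub, patterns


def tag_pattern(html, s):
    """Toy generator (hypothetical): reuse the tags that directly enclose s."""
    m = re.search(r"(<[^<>]+>)" + re.escape(s) + r"(</[^<>]+>)", html)
    if m is None:
        return None
    return re.escape(m.group(1)) + r"([^<>]+)" + re.escape(m.group(2))


pages = ["<b>Hey Jude</b><b>Yesterday</b>", "<i>Yesterday</i><i>Imagine</i>"]
found, learned = extract_until_stable(pages, {"Hey Jude"}, tag_pattern)
```

In this sample, the seed value yields a pattern for the first page, the newly found value "Yesterday" yields a second pattern on the other page, and the loop stops once a further round adds nothing.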

Further, this embodiment may also include a process of verifying the regular expression set which, as shown in Fig. 5, may specifically comprise:

The web pages W' are multiplied by the subset Ssub of the given information to obtain the collection of regular expression sets Tt = {T1, T2, ..., Tn};

The collection Tt is traversed to obtain a regular expression set Tn; Tn is traversed, and any regular expression p ∈ Tn is matched against the web pages W' to obtain a value set S;

If S - Ssub ≠ Φ, the expression is discarded (this step removes regular expressions that also match other content); if S - Ssub = Φ, the count Scount of elements of the subset Ssub of the given information is set to the number of elements in the set S;

The collection Tt is traversed, and for any Tn ∈ Tt, if Tn contains more than one regular expression, the regular expression with the largest Scount in Tn is kept and the rest are discarded (where several expressions match the same content, this step keeps the one that matches the most);

The collection Tt is traversed and any two Tn are compared; if their regular expressions are identical, either one is discarded (this step removes duplicate regular expressions);

The remaining regular expressions form a set, denoted P' = {p1, p2, ..., pn}.
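The verification procedure can be sketched as follows, assuming the collection Tt is already available as a list of lists of regular expression strings, each with one capturing group. The function name `verify` and the sample patterns are hypothetical, and ties in Scount are broken by keeping the first candidate, which the source does not specify.

```python
import re


def verify(tt, pages, s_sub):
    """Filter the collection Tt down to the verified pattern set P'."""
    kept = []
    for tn in tt:
        best, best_count = None, 0
        for p in tn:
            values = {m for html in pages for m in re.findall(p, html)}  # value set S
            if values - s_sub:          # S - Ssub is non-empty: p also matches other content
                continue
            if len(values) > best_count:    # keep the expression with the largest Scount
                best, best_count = p, len(values)
        if best is not None and best not in kept:   # drop duplicates across different Tn
            kept.append(best)
    return kept


pages = ["<b>Hey Jude</b><b>Yesterday</b><i>Menu</i>"]
s_sub = {"Hey Jude", "Yesterday"}
tt = [
    ["<b>(Hey Jude)</b>", "<b>([^<>]+)</b>", "<[bi]>([^<>]+)</[bi]>"],
    ["<b>([^<>]+)</b>"],
]
p_prime = verify(tt, pages, s_sub)
```

Here the third candidate in the first set is rejected because it also captures "Menu", the second wins over the first because it matches two known values instead of one, and the copy in the second set is removed as a duplicate.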

By generating a regular expression set suited to each kind of web page, the technical solution provided by this embodiment extracts web page content automatically, saves considerable work, and can verify the correctness of the regular expressions.

It should be noted that a person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

An embodiment of the invention also provides a device for automatically extracting network information which, as shown in Fig. 6, comprises:

a web page selection unit 61, used to find, from the web page collection W related to given information S, the web pages W' that contain elements of the subset Ssub of the given information S;

a set selection unit 62, used to generate an information pattern set P' according to a predetermined rule, and to take the union of the information pattern set P' and the regular expression set P to obtain a set P₁;

a content crawling unit 63, used to match the set P₁ against all web pages in the web page collection W related to the given information to obtain a set Ssub', the crawling process ending when Ssub == Ssub'.

Optionally, the set selection unit 62 comprises a traversal subunit, a backtracking subunit, a regular expression set description subunit, and a regular expression set generation subunit. The traversal subunit is used to traverse the subset Ssub of the given information S, find an element s, and locate the position of the element s in a web page w. The backtracking subunit is used to backtrack toward the start of the page and denote the first web page tag found as prefix, and to search toward the end of the page and denote the first web page tag found as suffix. The regular expression set description subunit is used to describe the content between prefix and suffix using the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, and the Chinese character set ChineseSet according to the description rule, generating the regular expression set of the element s on the web page w. The regular expression set generation subunit is used to generate, from the regular expression sets of the elements s on the web page w, the regular expression set of Ssub on the web page w, recorded as P' = {p1, p2, ..., pn}.

Optionally, the device may also comprise a verification unit, which comprises a product subunit, a matching subunit, an element count determination subunit, a first screening subunit, a second screening subunit, and a regular expression set determination subunit. The product subunit is used to multiply the web pages W' by the subset Ssub of the given information to obtain the collection of regular expression sets Tt = {T1, T2, ..., Tn}. The matching subunit is used to traverse the collection Tt to obtain a regular expression set Tn, traverse Tn, and match any regular expression p ∈ Tn against the web pages W' to obtain a value set S. The element count determination subunit is used to discard the expression if S - Ssub ≠ Φ and, if S - Ssub = Φ, to set the count Scount of elements of the subset Ssub of the given information to the number of elements in the set S. The first screening subunit is used to traverse the collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keep the regular expression with the largest Scount in Tn and discard the rest. The second screening subunit is used to traverse the collection Tt, compare any two Tn, and discard either one if their regular expressions are identical. The regular expression set determination subunit is used to form the remaining regular expressions into a set, denoted P' = {p1, p2, ..., pn}.

The specific implementation of the processing functions of the units in the above device has been described in the preceding method embodiment and is not repeated here.

It should be noted that in the above device embodiment the units are divided only according to functional logic; the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the invention.

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the embodiments of the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A method for automatically extracting network information, characterized by comprising:

2. The method according to claim 1, characterized in that generating the information pattern set P' according to the predetermined rule comprises:

generating, from the regular expression sets of the elements s on the web page w, the regular expression set of Ssub on the web page w, recorded as P' = {p1, p2, ..., pn}.

3. The method according to claim 1, characterized in that the method further comprises verifying the regular expression set, and verifying the regular expression set comprises:

if S - Ssub ≠ Φ, discarding the expression; if S - Ssub = Φ, setting the count Scount of elements of the subset Ssub of the given information to the number of elements in the set S;

traversing the collection of regular expression sets Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeping the regular expression with the largest Scount in Tn and discarding the rest;

traversing the collection Tt, comparing any two Tn, and discarding either one if their regular expressions are identical;

forming the remaining regular expressions into a set, denoted P' = {p1, p2, ..., pn}.

4. A device for automatically extracting network information, characterized by comprising:

5. The device according to claim 4, characterized in that the set selection unit comprises:

a traversal subunit, used to traverse the subset Ssub of the given information S, find an element s, and locate the position of the element s in a web page w;

a backtracking subunit, used to backtrack toward the start of the page and denote the first web page tag found as prefix, and to search toward the end of the page and denote the first web page tag found as suffix;

a regular expression set description subunit, used to describe the content between prefix and suffix using the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, and the Chinese character set ChineseSet according to the description rule, generating the regular expression set of the element s on the web page w;

a regular expression set generation subunit, used to generate, from the regular expression sets of the elements s on the web page w, the regular expression set of Ssub on the web page w, recorded as P' = {p1, p2, ..., pn}.

6. The device according to claim 4, characterized in that the device further comprises a verification unit, and the verification unit comprises:

a product subunit, used to multiply the web pages W' by the subset Ssub of the given information to obtain the collection of regular expression sets Tt = {T1, T2, ..., Tn};

a matching subunit, used to traverse the collection Tt to obtain a regular expression set Tn, traverse Tn, and match any regular expression p ∈ Tn against the web pages W' to obtain a value set S;

an element count determination subunit, used to discard the expression if S - Ssub ≠ Φ and, if S - Ssub = Φ, to set the count Scount of elements of the subset Ssub of the given information to the number of elements in the set S;

a first screening subunit, used to traverse the collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keep the regular expression with the largest Scount in Tn and discard the rest;

a second screening subunit, used to traverse the collection Tt, compare any two Tn, and discard either one if their regular expressions are identical;

a regular expression set determination subunit, used to form the remaining regular expressions into a set, denoted P' = {p1, p2, ..., pn}.