CN102855324B - Method and device for automatic extraction of network information - Google Patents
- Publication number
- CN102855324B (application CN201210335719.1A)
- Authority
- CN
- China
- Prior art keywords
- regular expression
- webpage
- ssub
- intersection
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention provides a method and device for automatically extracting network information. The method comprises: finding, in the web page collection W related to the given information S, the web pages W' that contain elements of a subset Ssub of the given information S; generating an information pattern set P' according to predefined rules, and merging the information pattern set P' with the regular expression set P to obtain the set P1; and matching the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub'. By automatically generating the corresponding regular expression set for different web pages, the invention extracts web page content automatically and saves a large amount of work.
Description
Technical field
The present invention relates to a method and device for automatically extracting network information, and belongs to the technical field of network information extraction.
Background
In the prior art, information presented on a web page is generally described by a regular expression. Different web pages usually require different regular expressions, so extracting network information involves a heavy workload.
Summary of the invention
To address the heavy workload of existing network information extraction, the present invention provides a method and device for automatically extracting network information. To this end, the invention provides the following technical solutions:
A method for automatically extracting network information, comprising:
Finding, in the web page collection W related to the given information S, the web pages W' that contain elements of the subset Ssub of the given information S;
Generating an information pattern set P' according to predefined rules, and merging the information pattern set P' with the regular expression set P to obtain the set P1;
Matching the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub'.
A device for automatically extracting network information, comprising:
A web page selection unit, for finding, in the web page collection W related to the given information S, the web pages W' that contain elements of the subset Ssub of the given information S;
A set selection unit, for generating an information pattern set P' according to predefined rules, and merging the information pattern set P' with the regular expression set P to obtain the set P1;
A content capture unit, for matching the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub'.
The technical solution provided by the invention automatically generates the corresponding regular expression set for different web pages, so that web page content is extracted automatically and a large amount of work is saved.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of obtaining information from two web pages, provided by the specific embodiment of the present invention;
Fig. 2 is a schematic diagram of obtaining information from n web pages, provided by the specific embodiment of the present invention;
Fig. 3 is a flow chart of the method for automatically extracting network information, provided by the specific embodiment of the present invention;
Fig. 4 is a flow chart of generating the information pattern set P', provided by the specific embodiment of the present invention;
Fig. 5 is a flow chart of verifying the regular expression set, provided by the specific embodiment of the present invention;
Fig. 6 is a structural diagram of the automatic extraction device, provided by the specific embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The principle of the technical solution of this embodiment is as follows. Web pages of different types may contain the same information, but the same information is expressed differently on different websites. In the music field, for example, the Internet hosts many websites and forums that carry music information. Their page structures and presentation generally differ, yet they contain a great deal of information of the same kinds, such as song titles, singer names and albums. For one kind of information, web pages of the same type (denoted urlpattern1) can be described by a regular expression (prefix1 info suffix1), and the collection of extracted values is denoted V1. Web pages of a different type (urlpattern2) have a different regular expression (prefix2 info suffix2), and the collection of values from that website is denoted V2. The intersection of V1 and V2 is then non-empty, and the values in V1 and V2 describe the same kind of information. By analogy, if there are n different types of web pages, there are at most n value sets and at most n regular expressions. The logic is shown in Figs. 1 and 2.
Therefore, given a partial set of the information (for example, a sample of 10 to 100 items), denoted Ssub, an information collection S' can be obtained from the web page collection W. Coverage is defined as |S ∩ S'| / |S| and accuracy as |S ∩ S'| / |S'|. For web page content extraction, accuracy matters more than coverage: if accuracy is too low, the result is meaningless for most applications, whereas low coverage can be compensated by a very large number of web pages. The technical solution of this embodiment is therefore aimed at improving the accuracy of web page content extraction. It is described in detail below with reference to the drawings. As shown in Fig. 3, the method for automatically extracting network information comprises:
Step 31: find, in the web page collection W related to the given information S, the web pages W' that contain elements of the subset Ssub of the given information S.
Specifically, for the subset Ssub of the given information S, the elements of Ssub are enumerable, and a regular expression collection P is defined. The web page collection W related to the given information S is first traversed, and the web pages W' that contain elements of the subset Ssub are found in W.
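Step 31 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_pages_with_samples`, the representation of W as a dictionary from URL to page text, and plain substring containment as the test for "contains an element of Ssub" are all hypothetical choices, not details given in the patent.

```python
def find_pages_with_samples(pages, ssub):
    """Return W': the pages of W whose text contains at least one element of Ssub.

    pages: dict mapping a URL to the page text (assumed representation of W).
    ssub:  set of sample strings (the subset Ssub of the given information S).
    """
    return {
        url: html
        for url, html in pages.items()
        if any(sample in html for sample in ssub)  # assumed containment test
    }


# Hypothetical example data.
W = {
    "http://example.com/a": "<li>Song A</li><li>Song B</li>",
    "http://example.com/b": "<p>No music here</p>",
}
Ssub = {"Song A"}
W_prime = find_pages_with_samples(W, Ssub)
```

Only the first page contains a sample, so only that page is kept for pattern generation in the next step.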
Step 32: generate an information pattern set P' according to predefined rules, and merge the information pattern set P' with the regular expression set P to obtain the set P1.
The information pattern set P' is generated according to predefined rules such that W' => Ssub. The generation process of P' is shown in Fig. 4 and may specifically comprise:
First, the pattern of a regular expression is defined as p = prefix info suffix, and the following sets are defined as the building blocks of regular expressions: the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, the Chinese character set ChineseSet and the web page tag set MetaSet. The info part of a regular expression is expressed using NumberSet, EnglishSet, SpecialSet and ChineseSet, and the prefix and suffix parts are expressed using MetaSet;
Traverse the subset Ssub of the given information S, take an element s, and find the position of s in the web page w;
From that position, trace toward the start of the page to the first web page tag, which is recorded as prefix; trace toward the end of the page to the first web page tag, which is recorded as suffix;
Describe the content between prefix and suffix according to the description rules of NumberSet, EnglishSet, SpecialSet and ChineseSet, generating the regular expression set of element s on web page w;
Generate the regular expression set of Ssub on web page w from the regular expression sets of the elements s on w, recorded as P' = p1, p2, ..., pn.
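The generation steps above might look roughly like the following Python sketch. It is an assumption-laden illustration rather than the patented implementation: the helper names (`describe_info`, `generate_pattern`, `generate_pattern_set`), the use of a generic HTML tag regex as a stand-in for MetaSet, and the choice to describe info as a single character class built from the character kinds observed are all hypothetical.

```python
import re
import string
from typing import Optional

TAG = re.compile(r"<[^>]+>")  # assumed stand-in for MetaSet: any HTML tag


def _is_chinese(c: str) -> bool:
    return "\u4e00" <= c <= "\u9fff"


def describe_info(text: str) -> str:
    """Build one character class covering the kinds of characters seen in text."""
    parts = []
    if any(c in string.digits for c in text):
        parts.append("0-9")                       # NumberSet
    if any(c in string.ascii_letters for c in text):
        parts.append("A-Za-z")                    # EnglishSet
    if any(_is_chinese(c) for c in text):
        parts.append("\u4e00-\u9fff")             # ChineseSet
    specials = sorted(
        c for c in set(text)
        if c not in string.digits
        and c not in string.ascii_letters
        and not _is_chinese(c)
    )
    parts.extend(re.escape(c) for c in specials)  # SpecialSet
    return "[" + "".join(parts) + "]+"


def generate_pattern(html: str, s: str) -> Optional[str]:
    """Return 'prefix(info)suffix' for sample s on one page, or None if not found."""
    pos = html.find(s)
    if pos < 0:
        return None
    before = list(TAG.finditer(html, 0, pos))   # tags ending before s
    after = TAG.search(html, pos + len(s))      # first tag after s
    if not before or after is None:
        return None
    prefix = before[-1]                         # nearest preceding tag
    info = describe_info(html[prefix.end():after.start()])
    return re.escape(prefix.group()) + "(" + info + ")" + re.escape(after.group())


def generate_pattern_set(html: str, ssub) -> set:
    """P' for one page: the patterns generated from every sample found on it."""
    return {p for p in (generate_pattern(html, s) for s in ssub) if p}


# Hypothetical example page.
page = '<ul><li class="song">Hey Jude</li><li class="song">Let It Be</li></ul>'
patterns = generate_pattern_set(page, {"Hey Jude"})
```

In this example a single sample yields one pattern anchored on the surrounding `<li class="song">` and `</li>` tags, and that pattern also matches the second song title on the same page.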
Step 33: match the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub'.
Specifically, the set P1 is matched against all web pages in the web page collection W related to the given information to obtain the set Ssub'. If Ssub > Ssub', Ssub is set to Ssub' and step 31 is executed again; the capture process ends when Ssub == Ssub'.
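Steps 31 to 33 together form a loop that runs until the sample set stops changing. The sketch below shows one possible reading in Python, with several assumptions made explicit: the merge of P' and P is taken to be a set union, any difference between Ssub and Ssub' triggers another round, the info part is simplified to `[^<]+`, and a `max_rounds` guard is added that the patent does not mention. The names `patterns_for` and `extract` are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumed stand-in for a web page tag


def patterns_for(html, ssub):
    """Simplified stand-in for step 32: one prefix/info/suffix pattern per sample."""
    found = set()
    for s in ssub:
        pos = html.find(s)
        if pos < 0:
            continue
        before = list(TAG.finditer(html, 0, pos))
        after = TAG.search(html, pos + len(s))
        if before and after:
            # Assumption: info is simplified to "any text up to the next tag".
            found.add(re.escape(before[-1].group()) + "([^<]+)" + re.escape(after.group()))
    return found


def extract(pages, ssub, max_rounds=10):
    """Iterate steps 31-33 until Ssub == Ssub' (or the assumed round limit is hit)."""
    p = set()                                   # regular expression collection P
    for _ in range(max_rounds):
        # Step 31: pages that contain at least one current sample.
        w_prime = [h for h in pages if any(s in h for s in ssub)]
        # Step 32: merge P' into P (read here as a set union) to obtain P1.
        p1 = p.union(*(patterns_for(h, ssub) for h in w_prime))
        # Step 33: match P1 against every page in W to obtain Ssub'.
        ssub_new = {v for h in pages for pat in p1 for v in re.findall(pat, h)}
        if ssub_new == ssub:
            break
        ssub, p = ssub_new, p1
    return ssub


# Hypothetical example: two page types sharing one song title.
pages = [
    "<ul><li>Hey Jude</li><li>Let It Be</li></ul>",
    "<div><b>Let It Be</b><b>Yesterday</b></div>",
]
result = extract(pages, {"Hey Jude"})
```

Starting from one sample, the first round learns the `<li>` pattern and picks up "Let It Be"; that value then appears on the second page type, so the next round learns the `<b>` pattern and picks up "Yesterday"; the following round changes nothing and the loop stops.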
Further, this embodiment may also include a process of verifying the regular expression set which, as shown in Fig. 5, may specifically comprise:
Each web page W' is multiplied with the subset Ssub of the given information to obtain the regular expression collection Tt = T1, T2, ..., Tn;
Traverse the regular expression collection Tt to obtain a regular expression collection T1; traverse T1, and match each regular expression p ∈ Tn against the web page W' to obtain the value set S;
If S - Ssub ≠ Φ, discard the expression (this step removes regular expressions that also match other content); if S - Ssub = Φ, the count Scount of elements of the subset Ssub of the given information is set equal to the number of elements in the set S;
Traverse the regular expression collection Tt; for any Tn ∈ Tt, if Tn contains more than one regular expression, keep the regular expression with the largest Scount in Tn and discard the rest (where several expressions match the same content, this step keeps the one that matches the most);
Traverse the regular expression collection Tt and compare any two Tn; if their regular expressions are identical, discard either one (this step removes duplicate regular expressions);
The remaining regular expressions form a set, recorded as P' = p1, p2, ..., pn.
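The verification procedure can be approximated by the following Python sketch. The data layout is an assumption: each group Tn is modelled as a pair of a page and the candidate regular expressions generated for it, and the function name `verify` is hypothetical. The three filters follow the order of the text: drop expressions that match values outside Ssub, keep the expression with the largest Scount in each group, then remove duplicates across groups.

```python
import re


def verify(groups, ssub):
    """Filter candidate regular expressions.

    groups: list of (page_html, [candidate patterns]) pairs, one per Tn (assumed layout).
    ssub:   set of known-good sample values.
    Returns the remaining patterns in first-seen order.
    """
    kept = []
    for html, candidates in groups:
        scored = []
        for pat in candidates:
            values = set(re.findall(pat, html))
            if values - ssub:
                continue                       # S - Ssub is non-empty: matches other content
            scored.append((len(values), pat))  # Scount for this expression
        if not scored:
            continue
        best = max(scored, key=lambda item: item[0])[1]  # largest Scount in this Tn
        if best not in kept:                   # drop duplicates across groups
            kept.append(best)
    return kept


# Hypothetical example data.
html = "<li>Hey Jude</li><li>Let It Be</li><p>About us</p>"
ssub = {"Hey Jude", "Let It Be"}
groups = [
    (html, ["<li>([^<]+)</li>", ">([^<]+)<", "<li>(Hey Jude)</li>"]),
    (html, ["<li>([^<]+)</li>"]),
]
verified = verify(groups, ssub)
```

Here the over-general pattern `>([^<]+)<` is rejected because it also captures "About us", the literal pattern loses to the `<li>` pattern on Scount, and the second group's identical pattern is removed as a duplicate.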
The technical solution of this embodiment automatically generates the corresponding regular expression set for different web pages, so that web page content is extracted automatically and a large amount of work is saved; the correctness of the regular expressions can also be verified.
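The coverage and accuracy measures defined earlier in this embodiment can be computed directly from the two sets. The following Python sketch assumes that S and S' are available as sets of strings and that an empty denominator should yield 0.0; both the function names and that convention are illustrative choices, not part of the patent.

```python
def coverage(s, s_extracted):
    """|S ∩ S'| / |S|: the share of the given information that was recovered."""
    return len(s & s_extracted) / len(s) if s else 0.0


def accuracy(s, s_extracted):
    """|S ∩ S'| / |S'|: the share of the extracted values that are correct."""
    return len(s & s_extracted) / len(s_extracted) if s_extracted else 0.0


# Hypothetical example: 4 known values, 3 extracted, 2 of them correct.
S = {"a", "b", "c", "d"}
S_prime = {"a", "b", "x"}
cov = coverage(S, S_prime)   # 2 / 4
acc = accuracy(S, S_prime)   # 2 / 3
```

Because the embodiment prioritises accuracy, a run like this one, where one extracted value in three is wrong, is the case the verification procedure is meant to reduce.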
It should be noted that those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware instructed by a program. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The specific embodiment of the present invention also provides a device for automatically extracting network information which, as shown in Fig. 6, comprises:
A web page selection unit 61, for finding, in the web page collection W related to the given information S, the web pages W' that contain elements of the subset Ssub of the given information S;
A set selection unit 62, for generating an information pattern set P' according to predefined rules, and merging the information pattern set P' with the regular expression set P to obtain the set P1;
A content capture unit 63, for matching the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub'.
Optionally, the set selection unit 62 comprises a traversal subunit, a backtracking subunit, a regular set description subunit and a regular set generation subunit. The traversal subunit traverses the subset Ssub of the given information S, takes an element s, and finds the position of s in the web page w. The backtracking subunit traces from that position toward the start of the page to the first web page tag, which is recorded as prefix, and toward the end of the page to the first web page tag, which is recorded as suffix. The regular set description subunit describes the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, generating the regular expression set of element s on web page w. The regular set generation subunit generates the regular expression set of Ssub on web page w from the regular expression sets of the elements s on w, recorded as P' = p1, p2, ..., pn.
Optionally, the device may also comprise a verification unit, which comprises a multiplication subunit, a matching subunit, an element count determination subunit, a first screening subunit, a second screening subunit and a regular set determination subunit. The multiplication subunit multiplies each web page W' with the subset Ssub of the given information to obtain the regular expression collection Tt = T1, T2, ..., Tn. The matching subunit traverses the regular expression collection Tt to obtain a regular expression collection T1, traverses T1, and matches each regular expression p ∈ Tn against the web page W' to obtain the value set S. The element count determination subunit discards the expression if S - Ssub ≠ Φ; if S - Ssub = Φ, the count Scount of elements of the subset Ssub of the given information is set equal to the number of elements in the set S. The first screening subunit traverses the regular expression collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeps the regular expression with the largest Scount in Tn and discards the rest. The second screening subunit traverses the regular expression collection Tt, compares any two Tn, and discards either one if their regular expressions are identical. The regular set determination subunit forms a set from the remaining regular expressions, recorded as P' = p1, p2, ..., pn.
The specific implementation of the processing functions of the units of the above device has been described in the preceding method embodiment and is not repeated here.
The technical solution of this embodiment automatically generates the corresponding regular expression set for different web pages, so that web page content is extracted automatically and a large amount of work is saved; the correctness of the regular expressions can also be verified.
It should be noted that the units of the above device embodiment are divided according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the present invention.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the embodiments of the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.
Claims (4)
1. A method for automatically extracting network information, characterized by comprising:
Finding, in the web page collection W related to the given information S, the web pages W' that contain elements of the subset Ssub of the given information S;
Generating an information pattern set P' according to predefined rules, and merging the information pattern set P' with the regular expression set P to obtain the set P1; wherein the regular expression collection P is defined, and generating the information pattern set P' according to predefined rules comprises: traversing the subset Ssub of the given information S, taking an element s, and finding the position of s in the web page w; tracing from that position toward the start of the page to the first web page tag, which is recorded as prefix; tracing toward the end of the page to the first web page tag, which is recorded as suffix; describing the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, generating the regular expression set of element s on web page w; and generating the information pattern set of Ssub on web page w from the regular expression sets of the elements s on w, recorded as P' = p1, p2, ..., pn;
Matching the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub'.
2. The method according to claim 1, characterized in that the method further comprises verifying the regular expression set, and verifying the regular expression set comprises:
Multiplying each web page W' with the subset Ssub of the given information to obtain the regular expression collection Tt = T1, T2, ..., Tn;
Traversing the regular expression collection Tt to obtain a regular expression collection T1, traversing T1, and matching each regular expression p ∈ Tn against the web page W' to obtain the value set F;
If F - Ssub ≠ Φ, discarding the expression; if F - Ssub = Φ, setting the count Scount of elements of the subset Ssub of the given information equal to the number of elements in the set F;
Traversing the regular expression collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeping the regular expression with the largest Scount in Tn and discarding the rest;
Traversing the regular expression collection Tt, comparing any two Tn, and discarding either one if their regular expressions are identical;
Forming a set from the remaining regular expressions, recorded as P' = p1, p2, ..., pn.
3. A device for automatically extracting network information, characterized by comprising:
A web page selection unit, for finding, in the web page collection W related to the given information S, the web pages W' that contain elements of the subset Ssub of the given information S;
A set selection unit, for generating an information pattern set P' according to predefined rules, and merging the information pattern set P' with the regular expression set P to obtain the set P1; wherein the regular expression collection P is defined;
A content capture unit, for matching the set P1 against all web pages in the web page collection W related to the given information to obtain the set Ssub', the capture process ending when Ssub == Ssub';
Wherein the set selection unit comprises:
A traversal subunit, for traversing the subset Ssub of the given information S, taking an element s, and finding the position of s in the web page w;
A backtracking subunit, for tracing from that position toward the start of the page to the first web page tag, which is recorded as prefix, and toward the end of the page to the first web page tag, which is recorded as suffix;
A regular set description subunit, for describing the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, generating the regular expression set of element s on web page w;
A regular set generation subunit, for generating the regular expression set of Ssub on web page w from the regular expression sets of the elements s on w, recorded as P' = p1, p2, ..., pn.
4. The device according to claim 3, characterized in that the device further comprises a verification unit, and the verification unit comprises:
A multiplication subunit, for multiplying each web page W' with the subset Ssub of the given information to obtain the regular expression collection Tt = T1, T2, ..., Tn;
A matching subunit, for traversing the regular expression collection Tt to obtain a regular expression collection T1, traversing T1, and matching each regular expression p ∈ Tn against the web page W' to obtain the value set F;
An element count determination subunit, for discarding the expression if F - Ssub ≠ Φ and, if F - Ssub = Φ, setting the count Scount of elements of the subset Ssub of the given information equal to the number of elements in the set F;
A first screening subunit, for traversing the regular expression collection Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeping the regular expression with the largest Scount in Tn and discarding the rest;
A second screening subunit, for traversing the regular expression collection Tt, comparing any two Tn, and discarding either one if their regular expressions are identical;
A regular set determination subunit, for forming a set from the remaining regular expressions, recorded as P' = p1, p2, ..., pn.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210335719.1A CN102855324B (en) | 2012-09-11 | 2012-09-11 | Method and device for automatic extraction of network information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102855324A (en) | 2013-01-02 |
CN102855324B (en) | 2015-08-26 |
Family
ID=47401912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210335719.1A Expired - Fee Related CN102855324B (en) | Method and device for automatic extraction of network information | 2012-09-11 | 2012-09-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102855324B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20150826 Termination date: 20160911 |