CN102855324B - Method and device for automatic extraction of network information - Google Patents

Method and device for automatic extraction of network information

Publication number
CN102855324B
CN102855324B CN201210335719.1A CN201210335719A CN102855324B CN 102855324 B CN102855324 B CN 102855324B CN 201210335719 A CN201210335719 A CN 201210335719A CN 102855324 B CN102855324 B CN 102855324B
Authority
CN
China
Prior art keywords
regular expression
webpage
ssub
intersection
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210335719.1A
Other languages
Chinese (zh)
Other versions
CN102855324A (en)
Inventor
杨俊拯
温予
张旸
黄百宁
王世平
葛猛
孟玲会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING YUNHONG DAOYUAN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201210335719.1A
Publication of CN102855324A
Application granted
Publication of CN102855324B
Legal status: Expired - Fee Related

Abstract

The invention provides a method and a device for automatically extracting network information. The method comprises: finding, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of S; generating an information pattern set P′ according to predefined rules, and combining P′ with the regular expression set P to obtain a set P1; and matching P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the crawling process ending when Ssub == Ssub′. By automatically generating the corresponding regular expression sets for different web pages, the invention extracts web page content automatically and saves a great deal of work.

Description

Method and device for automatic extraction of network information
Technical field
The present invention relates to a method and a device for automatically extracting network information, and belongs to the technical field of network information extraction.
Background art
In the prior art, information presented on web pages is generally described with regular expressions. Different web pages usually require different regular expressions, so extracting network information involves a heavy workload.
Summary of the invention
To solve the heavy workload of existing network information extraction, the present invention provides a method and a device for automatically extracting network information. To this end, the invention provides the following technical solutions:
A method for automatically extracting network information, comprising:
Finding, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of S;
Generating an information pattern set P′ according to predefined rules, and combining P′ with the regular expression set P to obtain a set P1;
Matching the set P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the crawling process ending when Ssub == Ssub′.
A device for automatically extracting network information, comprising:
A web page selection unit, configured to find, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of S;
A set selection unit, configured to generate an information pattern set P′ according to predefined rules and to combine P′ with the regular expression set P to obtain a set P1;
A content crawling unit, configured to match the set P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the crawling process ending when Ssub == Ssub′.
With the technical solution provided by the invention, the corresponding regular expression sets are generated automatically for different web pages, so web page content is extracted automatically and a great deal of work is saved.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of obtaining information from two web pages according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of obtaining information from n web pages according to an embodiment of the present invention;
Fig. 3 is a flow chart of the method for automatically extracting network information according to an embodiment of the present invention;
Fig. 4 is a flow chart of generating the information pattern set P′ according to an embodiment of the present invention;
Fig. 5 is a flow chart of verifying the regular expression set according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the automatic extraction device according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
The principle behind the technical solution of this embodiment is that web pages of different types can contain the same information, expressed differently on different websites. In the music field, for example, the Internet has many websites and forums that carry music information. Their page structures and presentation generally differ, yet they contain a great deal of information of the same kind, such as song titles, singer names, and albums. For one kind of information, web pages of the same type (denoted urlpattern1) can be described by a regular expression (prefix1 info suffix1), and the collection of values it records is denoted V1. Web pages of a different type (urlpattern2) have a different regular expression (prefix2 info suffix2), and the collection of values from that website is denoted V2. The intersection of V1 and V2 is then not empty, and the values in V1 and V2 describe the same kind of information. By analogy, if there are n different types of web pages, there are at most n value sets and at most n regular expressions. The logic is shown in Figs. 1 and 2.
Therefore, for a partial set of the given information (for example, a sample of 10 to 100 items), denoted Ssub, an information collection S′ can be obtained from the web page collection W. Coverage is defined as |S ∩ S′| / |S| and accuracy as |S ∩ S′| / |S′|. For web page content extraction, accuracy matters more than coverage: if accuracy is too low, the result is meaningless for most applications, whereas low coverage can be compensated for by a massive number of web pages. The technical solution of this embodiment is therefore aimed at improving the accuracy of web page content extraction. As shown in Fig. 3, the corresponding method for automatically extracting network information comprises:
Step 31: from the web page collection W related to the given information S, find the web pages W′ that contain elements of the subset Ssub of S.
Specifically, for the subset Ssub of the given information S, the elements of Ssub are enumerable, and a regular expression collection P is defined. The web page collection W related to S is first traversed to find the web pages W′ that contain elements of Ssub.
Step 32: generate an information pattern set P′ according to predefined rules, and combine P′ with the regular expression set P to obtain a set P1.
The information pattern set P′ is generated according to predefined rules such that W′ => Ssub. The generation of P′, shown in Fig. 4, may specifically comprise:
First, define the form of a regular expression as p = prefix info suffix, and define the following sets as the components of regular expressions: the number set NumberSet, the English letter set EnglishSet, the special symbol set SpecialSet, the Chinese character set ChineseSet, and the web page tag set MetaSet. The info part of a regular expression is expressed with NumberSet, EnglishSet, SpecialSet and ChineseSet, and the prefix and suffix parts are expressed with MetaSet;
Traverse the subset Ssub of the given information S, take an element s, and find the position of s in a web page w;
From that position, trace toward the start of the page to the first web page tag, recorded as prefix; trace toward the end of the page to the first web page tag, recorded as suffix;
Describe the content between prefix and suffix using NumberSet, EnglishSet, SpecialSet and ChineseSet, generating the regular expression set of element s on web page w;
From the regular expression sets of the elements s on web page w, generate the regular expression set of Ssub on w, recorded as P′ = p1, p2, ... pn.
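The generation procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the character ranges chosen for NumberSet, EnglishSet and ChineseSet, the tag regex standing in for MetaSet, and the helper names `info_pattern` and `generate_patterns` are all hypothetical, and the info part is simplified to a single character class.

```python
import re

# Stand-ins for the patent's component sets; their exact contents are
# assumptions made for this sketch.
NUMBER_SET = "0-9"
ENGLISH_SET = "A-Za-z"
CHINESE_SET = "\u4e00-\u9fff"
META_TAG = re.compile(r"<[^<>]+>")  # MetaSet: any web page tag


def info_pattern(content):
    """Describe content as one character class over the sets it draws on."""
    parts = [rng for rng in (NUMBER_SET, ENGLISH_SET, CHINESE_SET)
             if re.search("[" + rng + "]", content)]
    known = "[" + NUMBER_SET + ENGLISH_SET + CHINESE_SET + "]"
    # SpecialSet: every remaining character, escaped for use inside [...]
    special = sorted({ch for ch in content if not re.match(known, ch)})
    parts += [re.escape(ch) for ch in special]
    return "[" + "".join(parts) + "]+"


def generate_patterns(page, element):
    """Build p = prefix info suffix for each occurrence of element in page."""
    patterns = set()
    start = page.find(element)
    while start != -1:
        end = start + len(element)
        # Nearest tag before the element (prefix) and after it (suffix).
        tags_before = [m for m in META_TAG.finditer(page) if m.end() <= start]
        suffix = META_TAG.search(page, end)
        if tags_before and suffix:
            prefix = tags_before[-1]
            content = page[prefix.end():suffix.start()]
            patterns.add(re.escape(prefix.group())
                         + "(" + info_pattern(content) + ")"
                         + re.escape(suffix.group()))
        start = page.find(element, end)
    return patterns


page = '<li class="song">Yesterday</li><li class="song">Let It Be</li>'
patterns = generate_patterns(page, "Let It Be")
```

Applying the resulting pattern to the same page with `re.findall` returns both song titles, because the space in the seed element widens the info class to letters plus the space character. A seed without spaces, such as "Yesterday", would produce a narrower pattern that matches only single-word titles.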
Step 33: match the set P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′; the crawling process ends when Ssub == Ssub′.
Specifically, the set P1 is matched against all web pages in the web page collection W related to the given information to obtain the set Ssub′. If Ssub > Ssub′, Ssub is set to Ssub′ and step 31 is executed again; the crawling process ends when Ssub == Ssub′.
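Steps 31 to 33 can be read as a loop that runs until the extracted set stops changing. The sketch below is one hedged reading, not the patented implementation: it assumes that the regular expression set P starts empty, that P1 is the union of P and P′, and that the loop repeats whenever Ssub and Ssub′ differ. The function names and the `generate_patterns` callback (any function that takes a page and an element and returns a set of regex strings with one capture group) are hypothetical.

```python
import re


def find_pages(pages, ssub):
    """Step 31: the pages W' that contain at least one element of Ssub."""
    return [w for w in pages if any(s in w for s in ssub)]


def extract(pages, ssub, generate_patterns):
    """Repeat steps 31 to 33 until Ssub == Ssub'."""
    patterns = set()              # regular expression set P (assumed empty)
    ssub = set(ssub)
    while True:
        w_prime = find_pages(pages, ssub)
        # Step 32: build P' from every (page, element) pair.
        p_prime = set()
        for w in w_prime:
            for s in ssub:
                p_prime |= generate_patterns(w, s)
        patterns |= p_prime       # P1 = P combined with P' (union assumed)
        # Step 33: match P1 against every page in W to obtain Ssub'.
        ssub_new = set()
        for w in pages:
            for p in patterns:
                ssub_new.update(re.findall(p, w))
        if ssub_new == ssub:
            return ssub, patterns
        ssub = ssub_new
```

Because the pattern set only grows and the pages are fixed, the set of matched values eventually stops changing, at which point the loop returns.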
Further, this embodiment may also include a process of verifying the regular expression set which, as shown in Fig. 5, may specifically comprise:
Combine each web page W′ with the subset Ssub of the given information (W′ × Ssub) to obtain the regular expression collections Tt = T1, T2, ... Tn;
Traverse Tt to obtain a regular expression collection T1; traverse T1 and match any regular expression p ∈ Tn against the web pages W′ to obtain the set S of matched values;
If S − Ssub ≠ Φ, discard the expression (this step removes regular expressions that also match other content); if S − Ssub = Φ, the count Scount of elements of the subset Ssub of the given information equals the number of elements in the set S;
Traverse Tt; for any Tn ∈ Tt, if Tn contains more than one regular expression, keep the regular expression in Tn with the largest Scount and discard the rest (where several expressions match the same content, this step keeps the one that matches the most);
Traverse Tt and compare any two Tn; if their regular expressions are identical, discard either one (this step removes duplicate regular expressions);
The remaining regular expressions form the set recorded as P′ = p1, p2, ... pn.
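The verification procedure above can be sketched as a filter over candidate expressions. This is an illustrative reading rather than the patented implementation: the `candidates` mapping from a (page index, element) pair to its expression list Tn and the function name `verify` are assumptions, and each expression is assumed to contain exactly one capture group.

```python
import re


def verify(candidates, pages, ssub):
    """Filter candidate regexes, returning the verified list P'.

    candidates maps a (page index, element) pair to its regex list Tn.
    """
    ssub = set(ssub)
    kept = []
    for regexes in candidates.values():
        best, best_count = None, -1
        for p in regexes:
            values = set()
            for w in pages:
                values.update(re.findall(p, w))
            if values - ssub:
                continue              # S - Ssub is not empty: discard p
            if len(values) > best_count:
                best, best_count = p, len(values)   # largest Scount so far
        if best is not None and best not in kept:
            kept.append(best)         # identical expressions are kept once
    return kept
```

Note that the rule S − Ssub = Φ is strict: an expression that matches any value outside Ssub is rejected, which favors accuracy over coverage.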
With the technical solution of this embodiment, the corresponding regular expression sets are generated automatically for different web pages, so web page content is extracted automatically, a great deal of work is saved, and the correctness of the regular expressions can be verified.
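The coverage and accuracy measures defined at the start of this embodiment can be written out as a short sketch. The function names are hypothetical, and the denominators are read as the sizes of S and S′ respectively.

```python
def coverage(s, s_prime):
    """|S ∩ S'| / |S|: share of the given information that was extracted."""
    return len(s & s_prime) / len(s)


def accuracy(s, s_prime):
    """|S ∩ S'| / |S'|: share of the extracted values that are correct."""
    return len(s & s_prime) / len(s_prime)


given = {"a", "b", "c", "d"}      # S
extracted = {"a", "b", "x"}       # S'
print(coverage(given, extracted), accuracy(given, extracted))
```

In this example two of the four given items are found, so coverage is one half, while two of the three extracted values are correct, so accuracy is two thirds.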
It should be noted that a person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the instruction of a program. The program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
An embodiment of the present invention also provides a device for automatically extracting network information which, as shown in Fig. 6, comprises:
A web page selection unit 61, configured to find, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of S;
A set selection unit 62, configured to generate an information pattern set P′ according to predefined rules and to combine P′ with the regular expression set P to obtain a set P1;
A content crawling unit 63, configured to match the set P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the crawling process ending when Ssub == Ssub′.
Optionally, the set selection unit 62 comprises a traversal subunit, a backtracking subunit, a regular expression set description subunit, and a regular expression set generation subunit. The traversal subunit traverses the subset Ssub of the given information S, takes an element s, and finds the position of s in a web page w. The backtracking subunit traces toward the start of the page to the first web page tag, recorded as prefix, and toward the end of the page to the first web page tag, recorded as suffix. The regular expression set description subunit describes the content between prefix and suffix using NumberSet, EnglishSet, SpecialSet and ChineseSet, generating the regular expression set of element s on web page w. The regular expression set generation subunit generates, from the regular expression sets of the elements s on web page w, the regular expression set of Ssub on w, recorded as P′ = p1, p2, ... pn.
Optionally, the device may further comprise a verification unit, which comprises a product subunit, a matching subunit, an element count determination subunit, a first screening subunit, a second screening subunit, and a regular expression set determination subunit. The product subunit combines each web page W′ with the subset Ssub of the given information (W′ × Ssub) to obtain the regular expression collections Tt = T1, T2, ... Tn. The matching subunit traverses Tt to obtain a regular expression collection T1, traverses T1, and matches any regular expression p ∈ Tn against the web pages W′ to obtain the set S of matched values. The element count determination subunit discards the expression if S − Ssub ≠ Φ; if S − Ssub = Φ, the count Scount of elements of the subset Ssub of the given information equals the number of elements in the set S. The first screening subunit traverses Tt and, for any Tn ∈ Tt containing more than one regular expression, keeps the regular expression with the largest Scount and discards the rest. The second screening subunit traverses Tt, compares any two Tn and, if their regular expressions are identical, discards either one. The regular expression set determination subunit records the set formed by the remaining regular expressions as P′ = p1, p2, ... pn.
The specific implementation of the functions of each unit of the above device has been described in the method embodiment and is not repeated here.
With the technical solution of this embodiment, the corresponding regular expression sets are generated automatically for different web pages, so web page content is extracted automatically, a great deal of work is saved, and the correctness of the regular expressions can be verified.
It should be noted that the units in the above device embodiment are divided according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the invention.
The above is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the embodiments of the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.

Claims (4)

1. A method for automatically extracting network information, characterized by comprising:
Finding, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of S;
Generating an information pattern set P′ according to predefined rules, and combining P′ with the regular expression set P to obtain a set P1; wherein a regular expression collection P is defined, and generating the information pattern set P′ according to predefined rules comprises: traversing the subset Ssub of the given information S, taking an element s, and finding the position of s in a web page w; tracing toward the start of the page to the first web page tag, recorded as prefix; tracing toward the end of the page to the first web page tag, recorded as suffix; describing the content between prefix and suffix using the number set NumberSet, the English letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, generating the regular expression set of element s on web page w; and generating, from the regular expression sets of the elements s on web page w, the information pattern set of Ssub on w, recorded as P′ = p1, p2, ... pn;
Matching the set P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the crawling process ending when Ssub == Ssub′.
2. The method according to claim 1, characterized in that the method further comprises verifying the regular expression set, the verifying comprising:
Combining each web page W′ with the subset Ssub of the given information (W′ × Ssub) to obtain the regular expression collections Tt = T1, T2, ... Tn;
Traversing Tt to obtain a regular expression collection T1, traversing T1, and matching any regular expression p ∈ Tn against the web pages W′ to obtain the set F of matched values;
If F − Ssub ≠ Φ, discarding the expression; if F − Ssub = Φ, the count Scount of elements of the subset Ssub of the given information equals the number of elements in the set F;
Traversing Tt and, for any Tn ∈ Tt containing more than one regular expression, keeping the regular expression in Tn with the largest Scount and discarding the rest;
Traversing Tt, comparing any two Tn and, if their regular expressions are identical, discarding either one;
Recording the set formed by the remaining regular expressions as P′ = p1, p2, ... pn.
3. A device for automatically extracting network information, characterized by comprising:
A web page selection unit, configured to find, in the web page collection W related to given information S, the web pages W′ that contain elements of a subset Ssub of S;
A set selection unit, configured to generate an information pattern set P′ according to predefined rules and to combine P′ with the regular expression set P to obtain a set P1; wherein a regular expression collection P is defined;
A content crawling unit, configured to match the set P1 against all web pages in the web page collection W related to the given information to obtain a set Ssub′, the crawling process ending when Ssub == Ssub′;
Wherein the set selection unit comprises:
A traversal subunit, configured to traverse the subset Ssub of the given information S, take an element s, and find the position of s in a web page w;
A backtracking subunit, configured to trace toward the start of the page to the first web page tag, recorded as prefix, and toward the end of the page to the first web page tag, recorded as suffix;
A regular expression set description subunit, configured to describe the content between prefix and suffix using the number set NumberSet, the English letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, generating the regular expression set of element s on web page w;
A regular expression set generation subunit, configured to generate, from the regular expression sets of the elements s on web page w, the regular expression set of Ssub on w, recorded as P′ = p1, p2, ... pn.
4. The device according to claim 3, characterized in that the device further comprises a verification unit, the verification unit comprising:
A product subunit, configured to combine each web page W′ with the subset Ssub of the given information (W′ × Ssub) to obtain the regular expression collections Tt = T1, T2, ... Tn;
A matching subunit, configured to traverse Tt to obtain a regular expression collection T1, traverse T1, and match any regular expression p ∈ Tn against the web pages W′ to obtain the set F of matched values;
An element count determination subunit, configured to discard the expression if F − Ssub ≠ Φ; if F − Ssub = Φ, the count Scount of elements of the subset Ssub of the given information equals the number of elements in the set F;
A first screening subunit, configured to traverse Tt and, for any Tn ∈ Tt containing more than one regular expression, keep the regular expression in Tn with the largest Scount and discard the rest;
A second screening subunit, configured to traverse Tt, compare any two Tn and, if their regular expressions are identical, discard either one;
A regular expression set determination subunit, configured to record the set formed by the remaining regular expressions as P′ = p1, p2, ... pn.
CN201210335719.1A 2012-09-11 2012-09-11 Method and device for automatic extraction of network information Expired - Fee Related CN102855324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210335719.1A CN102855324B (en) 2012-09-11 2012-09-11 Method and device for automatic extraction of network information

Publications (2)

Publication Number Publication Date
CN102855324A CN102855324A (en) 2013-01-02
CN102855324B true CN102855324B (en) 2015-08-26

Family

ID=47401912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210335719.1A Expired - Fee Related CN102855324B (en) 2012-09-11 2012-09-11 Method and device for automatic extraction of network information

Country Status (1)

Country Link
CN (1) CN102855324B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902578B (en) * 2012-12-27 2017-05-31 中国移动通信集团四川有限公司 A kind of method for abstracting web page information and device
CN105740355B (en) * 2016-01-26 2019-03-26 中国人民解放军国防科学技术大学 Webpage context extraction method and device based on aggregation text density
CN106126684B (en) * 2016-06-29 2019-12-24 联想(北京)有限公司 Method and device for generating network crawler configuration file

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN101727447A (en) * 2008-10-10 2010-06-09 浙江搜富网络技术有限公司 Generation method and device of regular expression based on URL
CN102456050A (en) * 2010-10-27 2012-05-16 中国移动通信集团四川有限公司 Method and device for extracting data from webpage

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7814084B2 (en) * 2007-03-21 2010-10-12 Schmap Inc. Contact information capture and link redirection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727447A (en) * 2008-10-10 2010-06-09 浙江搜富网络技术有限公司 Generation method and device of regular expression based on URL
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN102456050A (en) * 2010-10-27 2012-05-16 中国移动通信集团四川有限公司 Method and device for extracting data from webpage

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
基于正则表达式的大规模网页术语对抽取研究;程岚岚;《情报杂志》;20090216;第27卷(第11期);62-64,68 *
大规模复杂规则匹配技术研究;张树壮等;《高技术通讯》;20110330;第20卷(第12期);1217-1223 *
正则表达式在Web信息抽取中的应用;胡军伟等;《北京信息科技大学学报》;20111231;第26卷(第6期);86-89 *

Also Published As

Publication number Publication date
CN102855324A (en) 2013-01-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150826

Termination date: 20160911