CN102855324A - Automatic extracting method and device for network information - Google Patents
- Publication number
- CN102855324A (application CN201210335719.1A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- regular expression
- ssub
- intersection
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method and a device for automatically extracting network information. The method comprises the following steps: finding, in the webpage collection W related to given information S, the webpages W' that contain elements of a subset Ssub of the given information S; generating an information pattern set P' according to a preset rule, and taking the union of the information pattern set P' and the regular expression set P to obtain a set P1; matching the set P1 against all webpages in the webpage collection W related to the given information to obtain a set Ssub'; and ending the crawl when Ssub == Ssub'. Because the corresponding set of regular expressions is generated from the webpages themselves, webpage content can be extracted automatically, which saves a substantial amount of manual work.
Description
Technical field
The present invention relates to a method and a device for automatically extracting network information, and belongs to the technical field of network information extraction.
Background art
In the prior art, information presented on webpages is generally described by regular expressions. Different webpages usually require different regular expressions, so extracting network information involves a large amount of work.
Summary of the invention
To solve the problem that existing network information extraction involves a large amount of work, the present invention provides a method and a device for automatically extracting network information. To this end, the invention provides the following technical solutions:
A method for automatically extracting network information, comprising:
Finding, in the webpage collection W related to given information S, the webpages W' that contain elements of a subset Ssub of the given information S;
Generating an information pattern set P' according to a predetermined rule, and taking the union of the information pattern set P' and the regular expression set P to obtain a set P1;
Matching the set P1 against all webpages in the webpage collection W related to the given information to obtain a set Ssub', the crawl process ending when Ssub == Ssub'.
A device for automatically extracting network information, comprising:
A webpage selection unit, configured to find, in the webpage collection W related to given information S, the webpages W' that contain elements of a subset Ssub of the given information S;
A set selection unit, configured to generate an information pattern set P' according to a predetermined rule, and to take the union of the information pattern set P' and the regular expression set P to obtain a set P1;
A content crawling unit, configured to match the set P1 against all webpages in the webpage collection W related to the given information to obtain a set Ssub', the crawl process ending when Ssub == Ssub'.
The technical solution provided by the invention generates the corresponding set of regular expressions from the different webpages themselves, so that webpage content is extracted automatically and a substantial amount of manual work is saved.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of obtaining information from two webpages, according to an embodiment of the invention;
Fig. 2 is a schematic diagram of obtaining information from n webpages, according to an embodiment of the invention;
Fig. 3 is a flow chart of the method for automatically extracting network information, according to an embodiment of the invention;
Fig. 4 is a flow chart of generating the information pattern set P', according to an embodiment of the invention;
Fig. 5 is a flow chart of verifying the regular expression set, according to an embodiment of the invention;
Fig. 6 is a structural diagram of the device for automatically extracting network information, according to an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.
The principle of the technical solution provided by this embodiment is as follows. Webpages of different types can contain the same information, although the same information is expressed differently on different websites. For example, in the music field, the Internet has many music information websites, forums and the like. The page structure and presentation of these websites and forums generally differ, but they contain a great deal of information of the same kind, such as song titles, singer names and albums. For one kind of information, on webpages of the same type (denoted urlpattern1) the information can be expressed by a regular expression (prefix1 info suffix1), and the collection of values obtained is denoted V1. Webpages of a different type (urlpattern2) have a different regular expression (prefix2 info suffix2), and the collection of values from that website is denoted V2. The intersection of V1 and V2 is then non-empty, which indicates that the values in V1 and V2 describe the same information. By analogy, if there are n different types of webpages, there are at most n sets of values and at most n regular expressions. The logic is shown in Figs. 1 and 2.
Therefore, for a partial set of the given information (a sample of, for example, 10 to 100 items), denoted Ssub, an information collection S' can be obtained through the webpage collection W. Coverage is defined as |S ∩ S'| / |S| and accuracy as |S ∩ S'| / |S'|. For webpage content extraction, accuracy matters more than coverage: if accuracy is too low, the result is useless for most applications, whereas low coverage can be compensated for by the massive number of webpages available. The technical solution of this embodiment is therefore aimed at improving the accuracy of webpage content extraction. A detailed description follows with reference to the drawings. As shown in Fig. 3, the method for automatically extracting network information comprises:
Specifically, for the subset Ssub of the given information S, the elements of Ssub are enumerable, and a regular expression collection is defined.
First, traverse the webpage collection W related to the given information S, and find in W the webpages W' that contain elements of the subset Ssub of the given information S.
Generate the information pattern set P' according to a predetermined rule, and let W' = Ssub. As shown in Fig. 4, generating the information pattern set P' may specifically comprise:
First, define the form of a regular expression as p = prefix info suffix, and define the following sets as the components of a regular expression: the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet, the Chinese character set ChineseSet and the webpage tag set MetaSet. The info part of a regular expression is expressed using NumberSet, EnglishSet, SpecialSet and ChineseSet, and the prefix and suffix parts are expressed using MetaSet;
Traverse the subset Ssub of the given information S to find an element s, and find the position of element s in a webpage w;
Trace back from that position toward the start of the page to the first webpage tag encountered, recorded as prefix; trace from that position toward the end of the page to the first webpage tag encountered, recorded as suffix;
Describe the content between prefix and suffix according to the description rules of NumberSet, EnglishSet, SpecialSet and ChineseSet, thereby generating the regular expression set of element s on webpage w;
From the regular expression sets of the elements s on webpage w, generate the regular expression set of Ssub on webpage w, recorded as P' = {p1, p2, …, pn}.
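The pattern-generation steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the names `TAG`, `char_class`, `describe` and `generate_pattern` are hypothetical, MetaSet is approximated by "any HTML-like tag", ChineseSet by the CJK Unified Ideographs range, SpecialSet by "everything else except tag brackets", and the "description rule" by collapsing each run of same-class characters into one repeated character class.

```python
import re
import string

# Assumed stand-in for MetaSet: any HTML-like tag.
TAG = re.compile(r"<[^>]+>")

# Assumed regex character classes for the four content sets.
NUMBER = r"[0-9]"                         # NumberSet
ENGLISH = r"[A-Za-z]"                     # EnglishSet
CHINESE = r"[\u4e00-\u9fff]"              # ChineseSet (CJK Unified Ideographs)
SPECIAL = r"[^0-9A-Za-z\u4e00-\u9fff<>]"  # SpecialSet: anything else except < and >


def char_class(ch):
    """Map one character to the class of the set it belongs to."""
    if ch in string.digits:
        return NUMBER
    if ch in string.ascii_letters:
        return ENGLISH
    if "\u4e00" <= ch <= "\u9fff":
        return CHINESE
    return SPECIAL


def describe(info):
    """Describe info as a sequence of character-class runs, e.g. 'ab12' -> letters+ digits+."""
    classes = []
    for ch in info:
        cls = char_class(ch)
        if not classes or classes[-1] != cls:
            classes.append(cls)
    return "".join(cls + "+" for cls in classes)


def generate_pattern(page, s):
    """Build 'prefix (info) suffix' for element s on page, or None if s or a tag is missing."""
    pos = page.find(s)
    if pos < 0:
        return None
    tags_before = list(TAG.finditer(page, 0, pos))  # tags lying wholly before s
    tag_after = TAG.search(page, pos + len(s))      # first tag after s
    if not tags_before or tag_after is None:
        return None
    prefix, suffix = tags_before[-1], tag_after
    info = page[prefix.end():suffix.start()]        # content between prefix and suffix
    return re.escape(prefix.group(0)) + "(" + describe(info) + ")" + re.escape(suffix.group(0))


page = '<li><span class="t">Yesterday</span></li>'
pattern = generate_pattern(page, "Yesterday")
other = '<span class="t">Hello</span><span class="t">Let It Be</span>'
print(re.findall(pattern, other))  # only the all-letters value matches this pattern
```

Under these assumptions a pattern learned from one value generalizes only to values with the same sequence of character classes, which favors accuracy over coverage, in line with the priority stated earlier in this embodiment.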
Specifically, match the set P1 against all webpages in the webpage collection W related to the given information to obtain the set Ssub'. If Ssub ≠ Ssub', let Ssub = Ssub' and re-execute step 31; the crawl process ends when Ssub == Ssub'.
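The iteration described above (find pages, generate patterns, merge them into the pattern set, match, repeat until the value set stops changing) can be sketched as a fixed-point loop. This is a minimal sketch under assumptions, not the patented implementation: `extract`, `simple_generate` and `max_rounds` are hypothetical names, the pattern generator is deliberately simplified to "nearest tag, any non-tag text, nearest tag", and the seed values are always kept in the result so that the set can only grow.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumed stand-in for a webpage tag


def simple_generate(page, s):
    """Simplified generator (assumption): nearest tags around s, info = any text without '<'."""
    pos = page.find(s)
    if pos < 0:
        return None
    before = list(TAG.finditer(page, 0, pos))
    after = TAG.search(page, pos + len(s))
    if not before or after is None:
        return None
    return re.escape(before[-1].group(0)) + r"([^<]+)" + re.escape(after.group(0))


def extract(pages, seed, generate, max_rounds=20):
    """Iterate until the known value set no longer changes (Ssub == Ssub')."""
    patterns = set()   # P, growing into P1 each round
    known = set(seed)  # Ssub
    for _ in range(max_rounds):  # safety bound, an assumption of this sketch
        # Find pages containing known elements and generate P' from them.
        new_patterns = set()
        for w in pages:
            for s in known:
                if s in w:
                    p = generate(w, s)
                    if p is not None:
                        new_patterns.add(p)
        patterns |= new_patterns  # P1 = P union P'
        # Match P1 against every page in W to obtain Ssub'.
        found = set(known)
        for w in pages:
            for p in patterns:
                found.update(m.group(1) for m in re.finditer(p, w))
        if found == known:
            break
        known = found  # Ssub = Ssub', then repeat
    return known, patterns


pages = [
    '<ul><li class="song">Yesterday</li><li class="song">Hey Jude</li></ul>',
    '<p><b>Hey Jude</b> and <b>Let It Be</b></p>',
]
values, patterns = extract(pages, {"Yesterday"}, simple_generate)
print(sorted(values))
```

In this toy run the single seed "Yesterday" yields a pattern for the first page, which discovers "Hey Jude"; that value in turn yields a pattern for the second page, which discovers "Let It Be". The loop stops in the round where no new value appears.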
Further, this embodiment may also include a process of verifying the regular expression set which, as shown in Fig. 5, may specifically comprise:
Take the product of each webpage W' and the subset Ssub of the given information to obtain the collection of regular expression collections Tt = {T1, T2, …, Tn};
Traverse Tt to obtain one regular expression collection Tn; traverse Tn, and match any regular expression p ∈ Tn against the webpage W' to obtain the set S of extracted values;
If S − Ssub ≠ ∅, discard the expression (this step removes regular expressions that also match other content); if S − Ssub = ∅, record Scount, the number of elements of the subset Ssub of the given information that appear in the set S;
Traverse Tt; for any Tn ∈ Tt, if Tn contains more than one regular expression, keep the regular expression with the largest Scount in Tn and discard the rest (where several expressions produce the same match, this step keeps the one that matches the most);
Traverse Tt and compare any two Tn; if their regular expressions are identical, discard either one (this step removes duplicate regular expressions);
Form the remaining regular expressions into a set, recorded as P' = {p1, p2, …, pn}.
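The verification steps above can be sketched as a filter over candidate expressions. This is a sketch under assumptions rather than the patented implementation: `verify` is a hypothetical name, each Tn is modeled as a `(page, list_of_regexes)` pair, every regex is assumed to have exactly one capturing group, and expressions that match nothing are dropped.

```python
import re


def verify(candidates, seed):
    """Filter candidate regexes.

    candidates: list of (page, regexes) pairs, one pair per Tn (assumed modeling).
    seed: the known values Ssub.
    """
    kept = []
    for page, regexes in candidates:
        best, best_count = None, 0
        for p in regexes:
            values = {m.group(1) for m in re.finditer(p, page)}  # the set S
            if values - seed:
                # S - Ssub is non-empty: the expression also matches other content.
                continue
            scount = len(values & seed)  # Scount
            if scount > best_count:
                best, best_count = p, scount  # keep the expression that matches the most
        if best is not None and best not in kept:  # drop duplicates across different Tn
            kept.append(best)
    return kept


seed = {"Yesterday", "Hey Jude"}
page_a = "<li>Yesterday</li><li>Hey Jude</li><p>About us</p>"
page_b = "<li>Hey Jude</li>"
candidates = [
    (page_a, [r"<li>([^<]+)</li>", r">([^<]+)<", r"<li>(Yesterday)</li>"]),
    (page_b, [r"<li>([^<]+)</li>"]),
]
print(verify(candidates, seed))
```

In this example the overly general expression `>([^<]+)<` is rejected because it also captures "About us", the overly specific `<li>(Yesterday)</li>` loses to the expression that covers both seed values, and the repeated expression from the second page is removed as a duplicate.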
With the technical solution provided by this embodiment, the corresponding set of regular expressions is generated from the different webpages themselves, so that webpage content is extracted automatically, a substantial amount of manual work is saved, and the correctness of the regular expressions can be verified.
It should be noted that a person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
An embodiment of the invention also provides a device for automatically extracting network information which, as shown in Fig. 6, comprises:
A webpage selection unit 61, configured to find, in the webpage collection W related to given information S, the webpages W' that contain elements of the subset Ssub of the given information S;
Optionally, the set selection unit 62 comprises a traversal subunit, a backtracking subunit, a regular expression set description subunit and a regular expression set generation subunit. The traversal subunit is configured to traverse the subset Ssub of the given information S to find an element s, and to find the position of element s in a webpage w. The backtracking subunit is configured to trace back toward the start of the page to the first webpage tag encountered, recorded as prefix, and to trace toward the end of the page to the first webpage tag encountered, recorded as suffix. The regular expression set description subunit is configured to describe the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, thereby generating the regular expression set of element s on webpage w. The regular expression set generation subunit is configured to generate, from the regular expression sets of the elements s on webpage w, the regular expression set of Ssub on webpage w, recorded as P' = {p1, p2, …, pn}.
Optionally, the device may further comprise a verification unit comprising a product subunit, a matching subunit, an element count determination subunit, a first screening subunit, a second screening subunit and a regular expression set determination subunit. The product subunit is configured to take the product of each webpage W' and the subset Ssub of the given information to obtain the collection of regular expression collections Tt = {T1, T2, …, Tn}. The matching subunit is configured to traverse Tt to obtain one regular expression collection Tn, to traverse Tn, and to match any regular expression p ∈ Tn against the webpage W' to obtain the set S of extracted values. The element count determination subunit is configured to discard the expression if S − Ssub ≠ ∅ and, if S − Ssub = ∅, to record Scount, the number of elements of the subset Ssub of the given information that appear in the set S. The first screening subunit is configured to traverse Tt and, for any Tn ∈ Tt that contains more than one regular expression, to keep the regular expression with the largest Scount in Tn and discard the rest. The second screening subunit is configured to traverse Tt, compare any two Tn and, if their regular expressions are identical, discard either one. The regular expression set determination subunit is configured to form the remaining regular expressions into a set, recorded as P' = {p1, p2, …, pn}.
The specific implementation of the processing functions of the units in the above device has been described in the method embodiments and is not repeated here.
With the technical solution provided by this embodiment, the corresponding set of regular expressions is generated from the different webpages themselves, so that webpage content is extracted automatically, a substantial amount of manual work is saved, and the correctness of the regular expressions can be verified.
It should be noted that, in the above device embodiment, the units are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the invention.
The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the embodiments of the invention shall fall within the scope of protection of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (6)
1. A method for automatically extracting network information, characterized by comprising:
Finding, in the webpage collection W related to given information S, the webpages W' that contain elements of a subset Ssub of the given information S;
Generating an information pattern set P' according to a predetermined rule, and taking the union of the information pattern set P' and the regular expression set P to obtain a set P1;
Matching the set P1 against all webpages in the webpage collection W related to the given information to obtain a set Ssub', the crawl process ending when Ssub == Ssub'.
2. The method according to claim 1, characterized in that generating the information pattern set P' according to a predetermined rule comprises:
Traversing the subset Ssub of the given information S to find an element s, and finding the position of element s in a webpage w;
Tracing back toward the start of the page to the first webpage tag encountered, recorded as prefix; tracing toward the end of the page to the first webpage tag encountered, recorded as suffix;
Describing the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, thereby generating the regular expression set of element s on webpage w;
Generating, from the regular expression sets of the elements s on webpage w, the regular expression set of Ssub on webpage w, recorded as P' = {p1, p2, …, pn}.
3. The method according to claim 1, characterized in that the method further comprises verifying the regular expression set, the verifying comprising:
Taking the product of each webpage W' and the subset Ssub of the given information to obtain the collection of regular expression collections Tt = {T1, T2, …, Tn};
Traversing Tt to obtain one regular expression collection Tn; traversing Tn, and matching any regular expression p ∈ Tn against the webpage W' to obtain the set S of extracted values;
If S − Ssub ≠ ∅, discarding the expression; if S − Ssub = ∅, recording Scount, the number of elements of the subset Ssub of the given information that appear in the set S;
Traversing Tt and, for any Tn ∈ Tt that contains more than one regular expression, keeping the regular expression with the largest Scount in Tn and discarding the rest;
Traversing Tt, comparing any two Tn and, if their regular expressions are identical, discarding either one;
Forming the remaining regular expressions into a set, recorded as P' = {p1, p2, …, pn}.
4. A device for automatically extracting network information, characterized by comprising:
A webpage selection unit, configured to find, in the webpage collection W related to given information S, the webpages W' that contain elements of a subset Ssub of the given information S;
A set selection unit, configured to generate an information pattern set P' according to a predetermined rule, and to take the union of the information pattern set P' and the regular expression set P to obtain a set P1;
A content crawling unit, configured to match the set P1 against all webpages in the webpage collection W related to the given information to obtain a set Ssub', the crawl process ending when Ssub == Ssub'.
5. The device according to claim 4, characterized in that the set selection unit comprises:
A traversal subunit, configured to traverse the subset Ssub of the given information S to find an element s, and to find the position of element s in a webpage w;
A backtracking subunit, configured to trace back toward the start of the page to the first webpage tag encountered, recorded as prefix, and to trace toward the end of the page to the first webpage tag encountered, recorded as suffix;
A regular expression set description subunit, configured to describe the content between prefix and suffix according to the description rules of the digit set NumberSet, the letter set EnglishSet, the special symbol set SpecialSet and the Chinese character set ChineseSet, thereby generating the regular expression set of element s on webpage w;
A regular expression set generation subunit, configured to generate, from the regular expression sets of the elements s on webpage w, the regular expression set of Ssub on webpage w, recorded as P' = {p1, p2, …, pn}.
6. The device according to claim 4, characterized in that the device further comprises a verification unit, the verification unit comprising:
A product subunit, configured to take the product of each webpage W' and the subset Ssub of the given information to obtain the collection of regular expression collections Tt = {T1, T2, …, Tn};
A matching subunit, configured to traverse Tt to obtain one regular expression collection Tn, to traverse Tn, and to match any regular expression p ∈ Tn against the webpage W' to obtain the set S of extracted values;
An element count determination subunit, configured to discard the expression if S − Ssub ≠ ∅ and, if S − Ssub = ∅, to record Scount, the number of elements of the subset Ssub of the given information that appear in the set S;
A first screening subunit, configured to traverse Tt and, for any Tn ∈ Tt that contains more than one regular expression, to keep the regular expression with the largest Scount in Tn and discard the rest;
A second screening subunit, configured to traverse Tt, compare any two Tn and, if their regular expressions are identical, discard either one;
A regular expression set determination subunit, configured to form the remaining regular expressions into a set, recorded as P' = {p1, p2, …, pn}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210335719.1A CN102855324B (en) | 2012-09-11 | 2012-09-11 | Automatic extracting method and device for network information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210335719.1A CN102855324B (en) | 2012-09-11 | 2012-09-11 | Automatic extracting method and device for network information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102855324A true CN102855324A (en) | 2013-01-02 |
CN102855324B CN102855324B (en) | 2015-08-26 |
Family
ID=47401912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210335719.1A Expired - Fee Related CN102855324B (en) | 2012-09-11 | 2012-09-11 | Automatic extracting method and device for network information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102855324B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740355A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Aggregated text density based webpage body text extraction method and apparatus |
CN106126684A (en) * | 2016-06-29 | 2016-11-16 | 联想(北京)有限公司 | A kind of method and device generating web crawlers configuration file |
CN103902578B (en) * | 2012-12-27 | 2017-05-31 | 中国移动通信集团四川有限公司 | A kind of method for abstracting web page information and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080243793A1 (en) * | 2007-03-21 | 2008-10-02 | Paul Hallett | Contact Information Capture and Link Redirection |
CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
CN101727447A (en) * | 2008-10-10 | 2010-06-09 | 浙江搜富网络技术有限公司 | Generation method and device of regular expression based on URL |
CN102456050A (en) * | 2010-10-27 | 2012-05-16 | 中国移动通信集团四川有限公司 | Method and device for extracting data from webpage |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080243793A1 (en) * | 2007-03-21 | 2008-10-02 | Paul Hallett | Contact Information Capture and Link Redirection |
CN101727447A (en) * | 2008-10-10 | 2010-06-09 | 浙江搜富网络技术有限公司 | Generation method and device of regular expression based on URL |
CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
CN102456050A (en) * | 2010-10-27 | 2012-05-16 | 中国移动通信集团四川有限公司 | Method and device for extracting data from webpage |
Non-Patent Citations (3)
Title |
---|
张树壮等: "大规模复杂规则匹配技术研究", 《高技术通讯》, vol. 20, no. 12, 30 March 2011 (2011-03-30), pages 1217 - 1223 * |
程岚岚: "基于正则表达式的大规模网页术语对抽取研究", 《情报杂志》, vol. 27, no. 11, 16 February 2009 (2009-02-16) * |
胡军伟等: "正则表达式在Web信息抽取中的应用", 《北京信息科技大学学报》, vol. 26, no. 6, 31 December 2011 (2011-12-31), pages 86 - 89 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902578B (en) * | 2012-12-27 | 2017-05-31 | 中国移动通信集团四川有限公司 | A kind of method for abstracting web page information and device |
CN105740355A (en) * | 2016-01-26 | 2016-07-06 | 中国人民解放军国防科学技术大学 | Aggregated text density based webpage body text extraction method and apparatus |
CN105740355B (en) * | 2016-01-26 | 2019-03-26 | 中国人民解放军国防科学技术大学 | Webpage context extraction method and device based on aggregation text density |
CN106126684A (en) * | 2016-06-29 | 2016-11-16 | 联想(北京)有限公司 | A kind of method and device generating web crawlers configuration file |
CN106126684B (en) * | 2016-06-29 | 2019-12-24 | 联想(北京)有限公司 | Method and device for generating network crawler configuration file |
Also Published As
Publication number | Publication date |
---|---|
CN102855324B (en) | 2015-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8732199B2 (en) | System, method, and computer readable media for identifying a user-initiated log file record in a log file | |
JP2016524229A (en) | Search recommendation method and apparatus | |
CN104462547A (en) | Configurable webpage data acquisition method and system | |
CN106649413A (en) | Grouping method and device for webpage tabs | |
CN105117402A (en) | Log data fragmentation method based on segment order-preserving Hash and log data fragmentation device based on segment order-preserving Hash | |
CA2977847A1 (en) | Automated extraction tools and their use in social content tagging systems | |
CN108874379B (en) | Page processing method and device | |
CN103823892A (en) | Method and device of determining webpage clustering mode | |
CN106933916B (en) | JSON character string processing method and device | |
CN103853654A (en) | Method and device for selecting webpage testing paths | |
CN105138538A (en) | Cross-domain knowledge discovery-oriented topic mining method | |
CN103902618A (en) | File search method and device | |
CN105069063A (en) | Picture searching method and apparatus | |
CN108804472A (en) | A kind of webpage content extraction method, device and server | |
CN102855324A (en) | Automatic extracting method and device for network information | |
CN104765823A (en) | Method and device for collecting website data | |
CN104166648A (en) | Recommendation data excavation method and device based on labels | |
CN106462933A (en) | Using content structure to socially connect users | |
US20110078635A1 (en) | Relationship map generator | |
CN106202050A (en) | Subject information acquisition methods, device and electronic equipment | |
CN109145307A (en) | User portrait recognition method, pushing method, device, equipment and storage medium | |
CN103999079A (en) | Aligning annotation of fields of documents | |
CN104156458B (en) | The extracting method and device of a kind of information | |
CN109558381A (en) | A kind of data processing method and device | |
CN106484746A (en) | The analysis method of website transformation event and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20150826 Termination date: 20160911 |