CN101464905A - Web page information extraction system and method

Info

Publication number: CN101464905A
Application number: CNA2009100765483A
Authority: CN
Inventors: 吴博; 王宇; 张刚; 丁国栋; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2009-01-08
Filing date: 2009-01-08
Publication date: 2009-06-24
Anticipated expiration: 2029-01-08
Also published as: CN101464905B

Abstract

The invention relates to a system and method for extracting web page information. The system comprises a template generation module, a web page homogenization module, an automatic labeling module, a wrapper file generation module and an online extraction module. The template generation module selects web pages to be automatically labeled from a web page collection, classifies them according to training web pages labeled by a user, and generates a web page template for each class. The web page homogenization module masks the differences between the web pages to be automatically labeled and the web page template of the class to which they belong. The automatic labeling module analyzes the training web pages of a class to generate a first wrapper file, and uses the first wrapper file to automatically label the web pages to be automatically labeled in that class, thereby generating new training web pages. The wrapper file generation module analyzes all the training web pages and generates a second wrapper file. The online extraction module applies the second wrapper file to extract information from the web pages in the collection that were not selected. The invention can generate multiple templates corresponding to heterogeneous web pages, and can extract multiple records from a web page as well as multiple attributes of each record.

Description

A system and method for web page information extraction

Technical field

The invention belongs to the field of network information processing, and in particular relates to a system and method for web page information extraction.

Background technology

Current web page extraction techniques can be divided, according to their field of application, into domain-specific extraction techniques and general-purpose extraction techniques.

Domain-specific web page extraction techniques usually make assumptions in advance about the content to be extracted, for example extracting the body text of news pages, or extracting a particular attribute of a page such as a product price. These methods typically extract from web pages by statistical methods or by heuristic rules derived from the features of the target objects. Because they target specific objects, however, their generality is limited, as are the kinds and amount of information they can extract.

General-purpose web page extraction techniques are divided, according to the degree of automation of the extraction tool, into extraction systems with manually constructed rules, semi-supervised extraction systems, unsupervised extraction systems and supervised extraction systems.

In an extraction system with manually constructed rules, the user hand-codes a wrapper for each website. The wrapper may be written in a general-purpose programming language or in a language designed specifically for extraction. Such tools require the user to have some knowledge of computers and programming, so the cost of this approach is high, and often prohibitive when extracting from a large number of websites and a massive number of web pages.

A semi-supervised extraction system differs from a supervised one in that the user usually does not need to label the data in the web pages precisely for extraction rules to be generated, hence the name. Although such systems do not require the user to label the data in the web pages, they often require the user to do post-processing, for example selecting the target pattern and the data to be extracted, and all such systems are designed to extract data at the record level. Their precision therefore usually cannot meet the requirement of accurately extracting attribute information from web pages.

An unsupervised extraction system does not require the user to label any training data, so no user interface is needed while the wrapper is generated. In a supervised system the data to be extracted is indicated by the user; in an unsupervised system it is determined by the data itself. Unsupervised systems generally assume that a web page is produced by filling a program-generated web page template with data from a back-end database, and that the task is to extract the data of that database. However, such fully automatic extraction tends to extract much information the user does not need while possibly missing some information the user does need, and because the extracted data carries no labels, integrating and interpreting it also becomes difficult.

A supervised extraction system usually takes as input a set of web pages labeled by the user (the training web pages), uses them to generate a wrapper file, and finally uses the wrapper file to extract information from similar web pages. Such a system often does not need specialist programmers; ordinary users, after some simple training, can label the data to be extracted in a graphical user interface. Such systems achieve higher extraction precision, and because the extracted data carries labels it is easy to interpret and integrate. The system described in the present invention is a supervised extraction system.

Information on the Internet is now growing explosively, and web pages are an important carrier of information on the network, so how to extract the required information from web pages is becoming an increasingly important research topic. Web pages, however, are designed to be browsed by users, so the information in them is surrounded by a large amount of markup and formatting, which makes extraction difficult.

A currently popular semi-automatic supervised extraction method with relatively high precision works as follows. Web pages generated from the same web page template are crawled from a website, and several of them are chosen as training web pages, in which the user labels the information to be extracted. The contextual features of the data fields to be extracted are then learned from the training web pages by machine learning, and finally a wrapper file for extraction is generated. The wrapper is then used to extract automatically from the other web pages of the website. This method, however, has the following problems.

First, when crawling a website, current crawlers decide whether web pages are of the same kind by checking whether they lie under the same URL path. However, websites today contain a large number of dynamic URLs, and even web pages under the same URL path may differ greatly in structure. As a result, a wrapper file generated from the training web pages cannot extract the web pages in the collection that were generated from a different web page template.

Second, even when the web pages are generated from the same web page template, they contain many non-template nodes, and the non-template nodes of different pages differ in various ways. A wrapper file generated from only a subset of the pages as training web pages often cannot cover all these differences, so it fails on some of the pages. The traditional remedy is to submit the pages that cannot be extracted correctly to the user, have the user label the data fields in them, and then supply them to the extraction program as training web pages to regenerate the wrapper.

Third, current web page extraction systems face a trade-off among accuracy, degree of automation and the manual intervention required. Systems that achieve higher accuracy with fewer training examples often need a long running time in the extraction stage and cannot meet the needs of online, real-time extraction, whereas systems that are efficient in the extraction stage often need more training web pages and more manual intervention to generate a wrapper file with good precision and recall.

Fourth, websites are updated quickly. After a correct wrapper file has been generated, a redesign of the website means that the wrapper file generated from the old pages can no longer extract the redesigned pages.

Fifth, many current web page extraction techniques target a particular type of website, such as news pages, or can extract only certain attributes of a particular kind of object, such as the price and name of a product.

There is therefore an urgent need for a general-purpose, cross-domain tool that can extract any specified information.

Summary of the invention

To solve the above technical problems, the invention provides a system and method for web page information extraction that can generate multiple templates corresponding to different classes of web pages and can extract multiple attributes from a web page.

The invention discloses a system for web page information extraction, comprising:

A template generation module, used to select web pages to be automatically labeled from a web page collection, classify the web pages to be automatically labeled according to training web pages labeled by a user, and generate the web page template of the class corresponding to each training web page;

A web page homogenization module, used to mask, according to the web page template of a class, the differences between the web pages to be automatically labeled that belong to the class and the web page template of the class;

An automatic labeling module, used to parse the training web pages of the class to generate a first wrapper file, and to automatically label the web pages to be automatically labeled in the class by means of the first wrapper file, so as to generate new training web pages;

A wrapper file generation module, used to parse all the training web pages and generate a second wrapper file;

An online extraction module, used to apply the second wrapper file to extract information from the web pages in the collection that were not selected.

The template generation module is further used to select the web pages to be automatically labeled from the web page collection, and to build the DOM tree of the user-labeled training web page and the DOM tree of each web page to be automatically labeled, the web page template of the class corresponding to the training web page being the DOM tree of the training web page. It computes the similarity between the DOM tree of the training web page and the DOM tree of the web page to be automatically labeled. If the similarity is greater than a preset threshold, the web page to be automatically labeled belongs to the class corresponding to the training web page. Otherwise, the user is prompted to label the web page to be automatically labeled so as to generate a new training web page; the DOM tree of the new training web page is built and serves as the web page template of the class corresponding to the new training web page; and the DOM tree of the new training web page is compared for similarity with the DOM trees of the web pages to be automatically labeled, so as to complete the classification.

The template generation module is also used to simplify the web page template using the DOM trees of the web pages to be automatically labeled in the class.

When simplifying the web page template, the template generation module is further used to identify the repeated data nodes in the web page template, match the web page template against the DOM trees of the web pages to be automatically labeled in the class, and delete the nodes of the web page template that are not matched and are not repeated data nodes.

When identifying the repeated data nodes in the web page template, the template generation module is further used to determine, in the web page template, the common ancestor node of all nodes in the labeled record, find all same-named neighbor nodes of the ancestor node, and judge whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the labeled record's node is greater than a preset threshold. If so, the same-named neighbor node and the labeled record's node are repeated data nodes; otherwise, the search continues one level above the ancestor node, until repeated data nodes are found or the root node of the web page template is reached.

After simplifying the web page template, the template generation module is also used to match, in the web page template, the subtree rooted at each repeated data node other than the labeled record's node against the subtree rooted at the labeled record's node, and to mask the unmatched nodes in the subtrees rooted at the repeated data nodes other than the labeled record's node.

The web page homogenization module is further used to match the DOM tree of each web page to be automatically labeled in a class against the web page template, and to mask the unmatched nodes in the DOM tree of the web page to be automatically labeled.

When parsing the training web pages of the class to generate the first wrapper file, the automatic labeling module is further used to convert each training web page into a token sequence, locate the front and back separators of the labeled data, and determine the left and right rules of each separator.

When parsing the training web pages of the class to generate the first wrapper file, the automatic labeling module is also used to treat each separator as a state of a nondeterministic state machine, and the left and right rules of each separator as the transition rule for jumping from the current state to the next state.

The wrapper file generation module is further used to convert each training web page into a token sequence, locate the front and back separators of the labeled data, and determine the left and right rules of each separator; each separator is a state of a nondeterministic state machine, and the left and right rules of each separator are the transition rule for jumping from the current state to the next state.

The online extraction module is further used to convert each web page in the collection that was not selected into a token sequence, traverse the tokens in the sequence, and judge whether each token satisfies the transition rule of the current state; if so, it jumps to the next state. When the current state and the next state correspond respectively to the front and back separators of an attribute, the web page text between the separator of the current state and the separator of the next state is saved as the value of that attribute.

The invention also discloses a method for web page information extraction, comprising:

Step 1: select web pages to be automatically labeled from a web page collection, classify the web pages to be automatically labeled according to training web pages labeled by a user, and generate the web page template of the class corresponding to each training web page;

Step 2: according to the web page template of a class, mask the differences between the web pages to be automatically labeled that belong to the class and the web page template of the class;

Step 3: parse the training web pages of the class to generate a first wrapper file, and automatically label the web pages to be automatically labeled in the class by means of the first wrapper file, so as to generate new training web pages;

Step 4: parse all the training web pages and generate a second wrapper file;

Step 5: apply the second wrapper file to extract information from the web pages in the collection that were not selected.

Step 1 further comprises:

Step 131: select the web pages to be automatically labeled from the web page collection;

Step 132: build the DOM tree of the user-labeled training web page and the DOM tree of the web page to be automatically labeled, the DOM tree of the training web page serving as the web page template of the class corresponding to the training web page; compute the similarity between the DOM tree of the training web page and the DOM tree of the web page to be automatically labeled; if the similarity is greater than a preset threshold, determine that the web page to be automatically labeled belongs to the class corresponding to the training web page; otherwise, execute step 133;

Step 133: prompt the user to label the web page to be automatically labeled so as to generate a new training web page, and execute step 132.

Step 132 also comprises:

Step 141: simplify the web page template using the DOM trees of the web pages to be automatically labeled in the class.

Step 141 further comprises:

Step 151: identify the repeated data nodes in the web page template;

Step 152: after determining that the web page to be automatically labeled belongs to the class corresponding to the training web page, match the web page template against the DOM tree of the web page to be automatically labeled in the class, and delete the nodes of the web page template that are not matched and are not repeated data nodes.

Step 151 further comprises:

Step 161: determine, in the web page template, the common ancestor node of all nodes in the labeled record, and find all same-named neighbor nodes of the ancestor node;

Step 162: judge whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the labeled record's node is greater than a preset threshold; if so, determine that the same-named neighbor node and the labeled record's node are repeated data nodes; otherwise, continue the search one level above the ancestor node, until repeated data nodes are found or the root node of the web page template is reached.

After step 141, the method also comprises:

Step 171: in the web page template, match the subtree rooted at each repeated data node other than the labeled record's node against the subtree rooted at the labeled record's node;

Step 172: mask the unmatched nodes in the subtrees rooted at the repeated data nodes other than the labeled record's node.

Step 2 further comprises:

Step 181: match the DOM tree of each web page to be automatically labeled in a class against the web page template, and mask the unmatched nodes in the DOM tree of the web page to be automatically labeled.

In step 3, parsing the training web pages of the class to generate the first wrapper file further comprises:

Step 191: convert each training web page into a token sequence, locate the front and back separators of the labeled data, and determine the left and right rules of each separator.

After step 191, the method also comprises:

Treating each separator as a state of a nondeterministic state machine, and the left and right rules of each separator as the transition rule for jumping from the current state to the next state.

Step 4 further comprises:

Step 211: convert each training web page into a token sequence, locate the front and back separators of the labeled data, and determine the left and right rules of each separator;

Step 212: treat each separator as a state of a nondeterministic state machine, the left and right rules of each separator being the transition rule for jumping from the current state to the next state.

Step 5 further comprises:

Step 221: convert each web page that was not selected into a token sequence;

Step 222: traverse the tokens in the token sequence and judge whether each token satisfies the transition rule of the current state; if so, jump to the next state;

Step 223: when the current state and the next state correspond respectively to the front and back separators of an attribute, save the web page text between the separator of the current state and the separator of the next state as the value of that attribute.
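Steps 221 to 223 can be illustrated with a minimal Python sketch. It is a deterministic simplification rather than the nondeterministic state machine described here: each separator is assumed to be a single exact HTML tag instead of a left/right rule, and the names `tokenize`, `extract` and the wrapper layout are hypothetical.

```python
import re

def tokenize(html):
    """Split a page into a flat sequence of tag tokens and text tokens."""
    pieces = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in pieces if p.strip()]

def extract(html, wrapper):
    """wrapper: ordered list of (attribute, front_separator, back_separator).

    Assumption: a separator is one exact tag; the patent uses token-sequence
    rules instead. The state is the index of the attribute being awaited.
    """
    records, record = [], {}
    state, inside, buffer = 0, False, []
    for token in tokenize(html):
        attribute, front, back = wrapper[state]
        if not inside:
            if token == front:          # transition: front separator seen
                inside, buffer = True, []
        elif token == back:             # transition: back separator seen
            record[attribute] = " ".join(buffer)
            inside = False
            state += 1
            if state == len(wrapper):   # all attributes of one record found
                records.append(record)
                record, state = {}, 0
        elif not token.startswith("<"):
            buffer.append(token)        # text between the two separators
    return records
```

Because the state returns to the first attribute after each complete record, the same wrapper extracts several records, each with several attributes, from one page.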

The beneficial effects of the invention are as follows. Through the template generation module, web pages are classified by template, and pages of the same class are clustered together and extracted with the same wrapper, which improves extraction precision. Through the web page homogenization module and the automatic labeling module, the differences among the web pages of a class are masked and the target data in all of these pages is labeled, so that the pages can serve as training web pages for the wrapper file generation module; the differences among these pages are then learned by the wrapper and written into the extraction rules, so that these dissimilar pages can be extracted in the actual extraction stage, which improves the recall of web page extraction. Through the wrapper file generation module and the online extraction module, the context token sequence features of the training web pages are learned to obtain extraction rules based on the contextual features of the data segments, and these rules allow massive amounts of data to be extracted quickly in the extraction stage. Moreover, because the invention relies only on the context sequence features of the target data in a web page, it places no further restrictions on the type of web page or the type of data to be extracted, and can therefore extract from most types of web pages and the data in them.

Description of drawings

Fig. 1 is a structural diagram of the web page information extraction system of the invention;

Fig. 2 is a flowchart of the functions of the template generation module;

Fig. 3 is a flowchart of the web page information extraction method of the invention.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings.

The system architecture of the invention, shown in Fig. 1, comprises:

Template generation module 101, used to select web pages to be automatically labeled from a web page collection, classify the web pages to be automatically labeled according to the training web pages labeled by a user, and generate the web page template of the class corresponding to each training web page.

Web page homogenization module 102, used to mask, according to the web page template of a class, the differences between the web pages to be automatically labeled that belong to the class and the web page template of the class.

Automatic labeling module 103, used to parse the training web pages of the class to generate a first wrapper file, and to automatically label the web pages to be automatically labeled in the class by means of the first wrapper file, so as to generate new training web pages.

Wrapper file generation module 104, used to parse all the training web pages and generate a second wrapper file.

Online extraction module 105, used to apply the second wrapper file to extract information from the web pages in the collection that were not selected.

A wrapper file is a set of extraction rules. Each extraction rule in the wrapper file describes the features of the token sequence surrounding the data to be extracted.

In a specific embodiment, the template generation module 101 is implemented as follows.

The Document Object Model (DOM) is a standard interface specification defined by the W3C. With the DOM, the structure of a web page can be described as a tree, usually called a DOM tree, in which every node is an object. The DOM tree not only describes the structure of the web page but also defines the behavior of the node objects; through the objects' methods and properties, the nodes and content of the tree can easily be manipulated dynamically, for example accessed, modified, added or deleted.
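As an illustration of turning a page into a tree of node objects, the following sketch builds a simplified tag tree with Python's standard `html.parser`. It is not a full W3C DOM implementation; the `Node` class, the `VOID_TAGS` set and `build_dom` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node          # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Each `Node` can then be visited, modified, extended or removed through its `children` list, which is the kind of dynamic manipulation the later steps rely on.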

The implementation flow of the template generation module 101 is shown in Fig. 2.

Step S201: prompt the user to label a training web page; after the user has finished labeling, go to step S202.

In one implementation, the user labels the data to be extracted in a browser with the mouse, through a user interface tool.

Step S202: select the web pages to be automatically labeled from the web page collection; build the DOM tree of the training web page, denoted TempTree, and the DOM tree of each web page to be automatically labeled, denoted Tree.

Step S203: identify the repeated data nodes in TempTree.

A repeated data node is the node of a similar record in the DOM tree. For example, the main post on a forum page may be followed by many replies; the replies are repeated data region nodes.

The repeated data nodes are identified as follows:

In TempTree, determine the common ancestor node of all nodes in the labeled record, and find all same-named neighbor nodes of this ancestor node. Judge whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the labeled record's node is greater than a preset threshold. If so, the same-named neighbor node and the labeled record's node are determined to be repeated data nodes; otherwise, the search continues one level above the ancestor node, until repeated data nodes are found or the root node of TempTree is reached.
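The search for repeated data nodes can be sketched as below. Several points are assumptions made for illustration: neighbor nodes are taken to be siblings with the same tag name, the record's node is taken to be the common ancestor itself, subtree similarity is computed with Simple Tree Matching normalized by the average node count, and the threshold of 0.8 is arbitrary. All class and function names are hypothetical.

```python
class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def size(node):
    return 1 + sum(size(c) for c in node.children)

def matched(a, b):
    """Simple Tree Matching: number of nodes in the best top-down match."""
    if a.tag != b.tag:
        return 0
    rows, cols = len(a.children), len(b.children)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + matched(a.children[i - 1], b.children[j - 1]),
            )
    return 1 + table[rows][cols]

def similarity(a, b):
    return matched(a, b) / ((size(a) + size(b)) / 2)

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def common_ancestor(nodes):
    """Lowest node lying on the root path of every given node."""
    common = path_to_root(nodes[0])
    for node in nodes[1:]:
        on_path = {id(n) for n in path_to_root(node)}
        common = [n for n in common if id(n) in on_path]
    return common[0]

def find_repeated_nodes(labeled_nodes, threshold=0.8):
    candidate = common_ancestor(labeled_nodes)
    while candidate.parent is not None:
        similar = [s for s in candidate.parent.children
                   if s is not candidate and s.tag == candidate.tag
                   and similarity(s, candidate) > threshold]
        if similar:
            return [candidate] + similar
        candidate = candidate.parent     # retry one level up
    return []                            # reached the root without success
```

For a forum page whose replies are `li` items with the same inner structure, labeling the fields of one reply is enough for the search to return every reply item.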

Step S204: obtain a Tree.

Step S205: judge whether there is an unprocessed Tree; if so, execute step S206; otherwise, execute step S210.

Step S206: compute the similarity between TempTree and Tree.

The similarity between Tree and TempTree is computed with a tree matching algorithm. The similarity formula is the edit distance between the two trees divided by the average number of nodes of the two trees.
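A normalized tree similarity of this kind can be sketched as follows. The text names only a tree matching algorithm and a normalization by the average node count; the sketch swaps in Simple Tree Matching and assumes the numerator is the number of matched nodes, so that identical trees score 1.0 and the score can be compared against a threshold. The tuple representation `(tag, children)` and the function names are hypothetical.

```python
def size(tree):
    """Number of nodes in a (tag, children) tree."""
    tag, children = tree
    return 1 + sum(size(child) for child in children)

def simple_tree_matching(a, b):
    """Number of node pairs in the best order-preserving top-down match."""
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    rows, cols = len(kids_a), len(kids_b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return 1 + table[rows][cols]

def similarity(a, b):
    """Matched nodes divided by the average node count of the two trees."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)
```

With this normalization the score lies between 0 and 1, which makes a single preset threshold usable for pages of different sizes.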

Step S207: judge whether the similarity is greater than a preset threshold. If it is, the web page to be automatically labeled is determined to belong to the class corresponding to the training web page, and step S208 is executed; otherwise, step S201 is executed.

Step S208: simplify TempTree.

TempTree is matched against Tree, and the nodes of TempTree that are not matched and are not repeated data nodes are deleted.

The web page template is TempTree, which contains the nodes common to the Trees of the same class together with the repeated data nodes.
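The simplification in step S208 can be sketched as a recursive pruning of the template. How nodes are matched is not specified in detail here, so the sketch assumes children are aligned by their tag sequence using `difflib.SequenceMatcher` from the Python standard library as a stand-in for the tree matching algorithm. The `Node` class, its `repeated` flag and `simplify` are hypothetical.

```python
from difflib import SequenceMatcher

class Node:
    def __init__(self, tag, children=(), repeated=False):
        self.tag = tag
        self.children = list(children)
        self.repeated = repeated   # True for repeated data nodes

def simplify(template, page):
    """Delete template nodes that have no counterpart in the page,
    except repeated data nodes, which are always kept."""
    template_tags = [c.tag for c in template.children]
    page_tags = [c.tag for c in page.children]
    matcher = SequenceMatcher(None, template_tags, page_tags, autojunk=False)
    partner = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            partner[block.a + k] = page.children[block.b + k]
    kept = []
    for index, child in enumerate(template.children):
        if index in partner:
            simplify(child, partner[index])   # prune matched subtrees too
            kept.append(child)
        elif child.repeated:
            kept.append(child)
    template.children = kept
```

Calling `simplify` once for every page of the class leaves a template that holds only the shared nodes plus the repeated data nodes.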

Step S209: normalize TempTree.

Normalization uses the labeled data subtree to reduce the other data subtrees in the training page, so that all data subtrees have a unified structure.

Specifically, the subtree in TempTree rooted at each repeated data node other than the labeled record's node is matched against the subtree rooted at the labeled record's node, and the unmatched nodes in the subtrees rooted at the repeated data nodes other than the labeled record's node are masked.

Masking means telling the program, by attaching a label, that this part of the content is to be ignored.

Step S210: end.

The TempTree finally obtained is the web page template of the class.
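The control flow of steps S201 to S210 can be summarized in a short sketch. It assumes, as a simplification, that each page is compared against every template created so far and assigned to the best one; the similarity function and the user labeling step are passed in as callables, and every name here is hypothetical.

```python
def classify(trees, first_template, similarity, label_by_user, threshold=0.8):
    """Assign each page tree to a template class, creating a new class
    (by asking the user to label the page) when no template is similar."""
    templates = [first_template]
    classes = [[]]                      # classes[k]: pages matching templates[k]
    for tree in trees:
        scores = [similarity(t, tree) for t in templates]
        best = max(range(len(templates)), key=lambda k: scores[k])
        if scores[best] > threshold:
            classes[best].append(tree)  # the S207 "yes" branch
        else:
            # The S207 "no" branch: back to S201, the user labels this page,
            # and it becomes the training page of a new class.
            templates.append(label_by_user(tree))
            classes.append([])
    return templates, classes
```

The pages collected in each class are the ones later homogenized and labeled automatically with that class's first wrapper file.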

The web page homogenization module 102 is implemented as follows.

The DOM tree of each web page to be automatically labeled in a class is matched against the web page template, and the unmatched nodes in the DOM tree of the web page to be automatically labeled are masked, so that the automatic labeling module 103 can apply the wrapper file generated for the web page template to automatically label the web pages to be automatically labeled in the class.
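Masking by attaching a label can be sketched as setting a flag on every page node that has no counterpart in the template. As an assumption for illustration, children are aligned by tag sequence with `difflib.SequenceMatcher` from the Python standard library rather than with the tree matching algorithm of the embodiment; `Node`, `mask_subtree` and `homogenize` are hypothetical names.

```python
from difflib import SequenceMatcher

class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)
        self.masked = False    # the label telling later stages to ignore it

def mask_subtree(node):
    node.masked = True
    for child in node.children:
        mask_subtree(child)

def homogenize(page, template):
    """Mask every page node that is not matched by the template."""
    page_tags = [c.tag for c in page.children]
    template_tags = [c.tag for c in template.children]
    matcher = SequenceMatcher(None, page_tags, template_tags, autojunk=False)
    partner = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            partner[block.a + k] = template.children[block.b + k]
    for index, child in enumerate(page.children):
        if index in partner:
            homogenize(child, partner[index])
        else:
            mask_subtree(child)
```

A tokenizer that skips masked nodes then sees the same token context on every page of the class, which is what lets one wrapper file label them all.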

The automatic labeling module 103 is implemented as follows.

Each training web page is converted into a token sequence, the front and back separators of the labeled data are located, and the left and right rules of each separator are determined. Each separator is treated as a state of a nondeterministic state machine, and the left and right rules of each separator as the transition rule for jumping from the current state to the next state.

Because the web pages to be automatically labeled in a class all have a unified structure, the extraction rules can be applied to label them automatically.

After processing by the homogenization module, the web pages to be automatically labeled in the class are structurally very similar to one another. The context token sequences of the target data in the homogenized web pages share the same features as the context token sequence rules of the target data learned from the manually labeled training web pages. Using the first wrapper file and the training web pages, the left and right separators of each data segment in a web page to be automatically labeled are located; once the separator positions are found, the same labels as those used in manual labeling are inserted at the separator positions, thereby labeling the web page automatically.

The wrapper file generation module 104 is implemented as follows.

The training web pages are converted into token sequences, the front and back separators of the tagged data are located, and the left and right rules of each separator are determined. Each separator is a state of a nondeterministic state machine, and the left and right rules of each separator are the transition rules for jumping from the current state to the next state.

A web page is made up of two kinds of tokens, html tokens and alph tokens: the HTML tags inside the page correspond to html tokens, and the content between tags corresponds to alph tokens.

To generate the second wrapper file, each training page is first converted into a token sequence composed of html tokens and alph tokens, and the positions of the tokens that hold the tagged data in the page are located.

Starting from the position of the front separator of the tagged data, the search proceeds forward (toward the start of the page) until it encounters the first word token or the back separator of the previous tagged data; all tokens encountered during this search form the left rule of the front separator of the tagged data.

A word token is defined as an alph token whose content does not consist entirely of symbols.

Similarly, starting from the back separator of the tagged data, the search proceeds backward (toward the end of the page) until it encounters the first word token or the front separator of the next tagged data; all tokens encountered during this search form the right rule of the back separator of the tagged data.

Starting from the back separator of the tagged data, the search proceeds forward until it encounters the first word token or the front separator of the tagged data; all tokens encountered during this search form the left rule of the back separator of the tagged data.

An html token stores the name of the corresponding HTML tag; for example, html(<div>) records the tag name div. An alph token corresponds to the part of the page between two HTML tags, including parts made up entirely of non-letter symbols. The training pages are serialized according to the above conventions, the front and back separators of the tagged data in each training page are then found, and the token sequences to the left and right of each separator are recorded.
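The serialization convention can be sketched as below. The regular expressions, the `(kind, content)` tuple layout, and the use of `str.isalnum` to decide what counts as a non-symbol character are assumptions for illustration; the description only fixes the two token kinds and the definition of a word token.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, content); kind is "html" or "alph"

PIECE_RE = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def tokenize(page: str) -> List[Token]:
    """Serialize a page into html tokens (tag name only) and alph tokens."""
    tokens: List[Token] = []
    for piece in PIECE_RE.findall(page):
        if piece.startswith("<"):
            m = TAG_NAME_RE.match(piece)
            if m:  # keep only the tag name, dropping any attributes
                tokens.append(("html", "<%s%s>" % (m.group(1), m.group(2).lower())))
        elif piece.strip():
            tokens.append(("alph", piece.strip()))
    return tokens


def is_word_token(token: Token) -> bool:
    """A word token is an alph token that is not made up entirely of symbols."""
    kind, content = token
    return kind == "alph" and any(ch.isalnum() for ch in content)


def show(tokens: List[Token]) -> str:
    """Render tokens in the alph(...)html(...) notation of the description."""
    return "".join("%s(%s)" % (kind, content) for kind, content in tokens)


print(show(tokenize('Author<tr><td><a class="x">--</a>')))
```

Here `--` is an alph token but not a word token, so a search for separator rules would continue past it.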

An example web page fragment is shown below.

Author<tr><td><a></a><a>{AET:author}Daniel{/AET:author}</a><br><div>Farmer

Here, {AET:author} and {/AET:author} are the separators that tag the author information in the page. The left and right sequence rules of the separators are as follows.

The left rule of the front separator:

alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>);

The right rule of the front separator: alph(_);

The left rule of the back separator: alph(_);

The right rule of the back separator: html(</a>)html(<br>)html(<div>)alph(Farmer).
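The four rules above can be reproduced with a short sketch. Two points are assumptions chosen so that the output agrees with the example: the word token that ends a search is itself included in the rule, and alph tokens lying inside the tagged data are written as the wildcard "_". The helper names are hypothetical.

```python
import re

# Separators such as {AET:author}, HTML tags, and text runs.
PIECE_RE = re.compile(r"\{/?AET:\w+\}|<[^>]+>|[^<{]+")


def tokenize(page):
    tokens = []
    for piece in PIECE_RE.findall(page):
        if piece.startswith("{"):
            tokens.append(("sep", piece))
        elif piece.startswith("<"):
            tokens.append(("html", piece))
        elif piece.strip():
            tokens.append(("alph", piece.strip()))
    return tokens


def is_word(token):
    return token[0] == "alph" and any(ch.isalnum() for ch in token[1])


def scan(tokens, start, step, wildcard):
    """Collect tokens from `start` in direction `step` until a word token
    (included) or another separator (excluded) is reached."""
    rule = []
    i = start + step
    while 0 <= i < len(tokens) and tokens[i][0] != "sep":
        kind, content = tokens[i]
        rule.append((kind, "_" if wildcard and kind == "alph" else content))
        if is_word(tokens[i]):
            break
        i += step
    return rule if step > 0 else rule[::-1]


def separator_rules(tokens, front, back):
    return {
        "front_left": scan(tokens, front, -1, wildcard=False),
        "front_right": scan(tokens, front, +1, wildcard=True),
        "back_left": scan(tokens, back, -1, wildcard=True),
        "back_right": scan(tokens, back, +1, wildcard=False),
    }


def show(rule):
    return "".join("%s(%s)" % (kind, content) for kind, content in rule)


fragment = ("Author<tr><td><a></a><a>{AET:author}Daniel{/AET:author}"
            "</a><br><div>Farmer")
tokens = tokenize(fragment)
rules = separator_rules(tokens,
                        tokens.index(("sep", "{AET:author}")),
                        tokens.index(("sep", "{/AET:author}")))
for name, rule in rules.items():
    print(name, show(rule))
```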

To generate a second wrapper file that can extract all other similar web pages, the separator rules must be able to generalize. A wildcard "_" is therefore introduced into the separator rules, meaning that a token matches as long as it belongs to the given token type, regardless of its specific content. For example, the right rule of the {AET:author} separator above is alph(_), so the rule is deemed to match whenever a text containing letters and digits is encountered. In addition, because the number, type and order of the tokens in the context token sequences differ for some pages, an "or" descriptor is introduced: when one state jumps to another, its left rule or right rule may consist of several alternative rules, and the left or right rule is deemed to match as long as any one of them is satisfied. For example, suppose there is another web page fragment of the following form:

Author:<tr><td><a>{AET:author}Alex{/AET:author}</a><br><div>General

The left rule of the front separator:

alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>)|alph(Author)html(<tr>)html(<td>)html(<a>)

The right rule of the front separator: alph(_)

The left rule of the back separator: alph(_)

The right rule of the back separator: html(</a>)html(<br>)html(<div>)alph(_)

Here, not only has a rule been added to the left rule of the front separator, but the last token of the right rule of the back separator has also changed: the original alph(Farmer) has become alph(_). This is because, once the new training page is added, the two tokens General and Farmer appear at the same position, which triggers generalization, and the content of the alph token in the final separator rule is generalized to "_".

In the second fragment, the context to the left of the front separator {AET:author} lacks the <a></a> tags, so the original separator rule cannot extract this example. The page is therefore tagged and used as a training page, and the rules are regenerated; as shown above, the regenerated left rule of the front separator places a new alternative alongside the original

alph(Author)html(<tr>)html(<td>)html(<a>)html(</a>)html(<a>) rule.
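The generalization to "_" and the "or" descriptor discussed above can be sketched as a rule-merging step. The merging policy shown (a new rule is folded into an existing alternative when both have the same length, the same token kinds and the same html tags, and is otherwise appended as a new alternative) is an assumption; the description states the outcome of the merge but not the exact procedure.

```python
def merge_rule(alternatives, new_rule):
    """Fold `new_rule` into a list of alternative rules.

    Rules are lists of (kind, content) pairs. Differing alph contents at the
    same position are generalized to the wildcard "_".
    """
    for idx, old in enumerate(alternatives):
        same_shape = len(old) == len(new_rule) and all(
            old_kind == new_kind and (old_kind == "alph" or old_c == new_c)
            for (old_kind, old_c), (new_kind, new_c) in zip(old, new_rule)
        )
        if same_shape:
            alternatives[idx] = [
                (kind, old_c if old_c == new_c else "_")
                for (kind, old_c), (_, new_c) in zip(old, new_rule)
            ]
            return alternatives
    alternatives.append(list(new_rule))  # a new "or" branch
    return alternatives


def show(alternatives):
    return "|".join(
        "".join("%s(%s)" % (kind, content) for kind, content in rule)
        for rule in alternatives
    )


def html(*tags):
    return [("html", tag) for tag in tags]


back_right = [html("</a>", "<br>", "<div>") + [("alph", "Farmer")]]
merge_rule(back_right, html("</a>", "<br>", "<div>") + [("alph", "General")])
print(show(back_right))

front_left = [[("alph", "Author")] + html("<tr>", "<td>", "<a>", "</a>", "<a>")]
merge_rule(front_left, [("alph", "Author")] + html("<tr>", "<td>", "<a>"))
print(show(front_left))
```

The first merge generalizes the trailing alph token, and the second produces two alternatives joined by "|", matching the rules listed for the two fragments.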

After all training pages have been processed, the second wrapper file that is finally used for extraction is generated. The second wrapper file contains the left and right rules of all attribute separators, the rules for transitions between attributes, and the rules for transitions between records.

The left and right rules corresponding to the end separator of the previous record and the start separator of the next record are the rules for transitions between records.

The left and right rules corresponding to the back separator of the previous attribute and the front separator of the next attribute are the rules for transitions between attributes.

For example, given several reply posts in a forum web page, the author and the content of each reply need to be extracted, and each reply is regarded as one record. Let A and @A denote the front and back separators of the author attribute, C and @C the front and back separators of the reply content attribute, and RB and RE the record-begin and record-end separators. The wrapper file contains the left and right rules for the transition from RB to A, from A to @A, from @A to C, from C to @C, from @C to RE, and from RE to RB. By finding the transition rules between all separators, all records and all attributes in the page are identified.
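The cycle of states in the forum example can be sketched as a transition table. The sketch assumes that the separators have already been recognized in the page (in the described system they are recognized through their left and right rules), so the input is a flat list that mixes separator names and text; the names `TRANSITIONS`, `ATTRIBUTE` and `extract_records` are hypothetical.

```python
# Next expected separator for each state of the forum example.
TRANSITIONS = {"RB": "A", "A": "@A", "@A": "C", "C": "@C", "@C": "RE", "RE": "RB"}

# Front separator -> name of the attribute it opens.
ATTRIBUTE = {"A": "author", "C": "content"}


def extract_records(stream):
    """Walk a stream of separator names and text pieces, collecting records."""
    records, current, buffer = [], {}, []
    state = "RE"  # start as if a record has just ended, so RB is expected
    for item in stream:
        if item == TRANSITIONS[state]:
            if state in ATTRIBUTE:  # jumping from a front to a back separator
                current[ATTRIBUTE[state]] = "".join(buffer)
            if item == "RE":        # the record is complete
                records.append(current)
                current = {}
            state, buffer = item, []
        elif item not in TRANSITIONS:
            buffer.append(item)     # ordinary text between separators
    return records


stream = ["RB", "A", "alice", "@A", "C", "hello", "@C", "RE",
          "RB", "A", "bob", "@A", "C", "hi", "@C", "RE"]
print(extract_records(stream))
```

Each pass through the cycle RB, A, @A, C, @C, RE yields one record with an author and a content value.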

The online extraction module 105 is implemented as follows.


The web page to be extracted is converted into a token sequence; the token types that make up the sequence and the method of generating them are the same as those used when the second wrapper file was generated. The second wrapper file is read in and parsed into a nondeterministic automaton that is kept in memory. The token sequence is then traversed: for each token, it is determined whether the token, in the current state, satisfies a transition rule of that state, and if so, the automaton jumps to the next state. When the current state and the next state correspond respectively to the front and back separators of an attribute, the page text between the separator of the current state and the separator of the next state is saved as the value of the attribute to which the separators belong.
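The traversal can be sketched for a single attribute. The rule layout (a dictionary with `front_left`, `front_right`, `back_left` and `back_right`, each a list of alternative token rules) and the two-state loop are assumptions for illustration; the rule values are those derived for the author example, and a full wrapper would hold one such set per separator.

```python
import re

PIECE_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(page):
    tokens = []
    for piece in PIECE_RE.findall(page):
        if piece.startswith("<"):
            tokens.append(("html", piece))
        elif piece.strip():
            tokens.append(("alph", piece.strip()))
    return tokens


def rule_matches(alternatives, tokens, pos, side):
    """True if any alternative matches the tokens just left or right of `pos`."""
    for rule in alternatives:
        start = pos - len(rule) if side == "left" else pos
        if start < 0:
            continue
        window = tokens[start:start + len(rule)]
        if len(window) == len(rule) and all(
            rk == tk and rc in ("_", tc)  # "_" matches any content of that kind
            for (rk, rc), (tk, tc) in zip(rule, window)
        ):
            return True
    return False


def extract(page, rules):
    """Return the text between the front and back separator positions, or None."""
    tokens = tokenize(page)
    state, start = "front", None
    for pos in range(len(tokens) + 1):
        if state == "front":
            if (rule_matches(rules["front_left"], tokens, pos, "left")
                    and rule_matches(rules["front_right"], tokens, pos, "right")):
                state, start = "back", pos  # jump to the next state
        elif (rule_matches(rules["back_left"], tokens, pos, "left")
                and rule_matches(rules["back_right"], tokens, pos, "right")):
            return "".join(content for _, content in tokens[start:pos])
    return None


def html(*tags):
    return [("html", tag) for tag in tags]


AUTHOR_RULES = {
    "front_left": [[("alph", "Author")] + html("<tr>", "<td>", "<a>", "</a>", "<a>"),
                   [("alph", "Author")] + html("<tr>", "<td>", "<a>")],
    "front_right": [[("alph", "_")]],
    "back_left": [[("alph", "_")]],
    "back_right": [html("</a>", "<br>", "<div>") + [("alph", "_")]],
}

print(extract("Author<tr><td><a>Carol</a><br><div>Nurse", AUTHOR_RULES))
```

Because the left rule of the front separator has two alternatives, pages of both fragment shapes are handled by the same rule set.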

The flow chart of the web page information extraction method of the present invention is shown in Figure 3.

Step S301: select the web pages to be automatically tagged from the web page collection, classify the pages to be automatically tagged according to the training pages tagged by the user, and at the same time generate the web page template of the category corresponding to each training page.

Step S301 further comprises:

Step 311: perform the operation of selecting the web pages to be automatically tagged from the web page collection.

Step 312: construct the DOM tree of the training page tagged by the user and the DOM tree of the page to be automatically tagged; the DOM tree of the training page is the web page template of the category corresponding to the training page.

Step 313: in the DOM tree of the training page, determine the common ancestor node of all nodes in the tagged record, and search for all same-named neighbor nodes of the ancestor node. Judge whether the similarity between the subtree rooted at a same-named neighbor node and the subtree rooted at the node in the tagged record is greater than a preset threshold; if so, determine that the same-named neighbor node and the node in the tagged record are repeating data nodes; otherwise, continue the search at the level above the ancestor node, until a repeating data node is found or the root node of the DOM tree of the training page is reached.

Step 314: compute the similarity between the DOM tree of the training page and the DOM tree of the page to be automatically tagged. If the similarity is greater than a preset threshold, determine that the page to be automatically tagged belongs to the category corresponding to the training page, and match the web page template of that category against the DOM tree of the page to be automatically tagged: delete the nodes in the web page template that are not matched and are not repeating data nodes; match each subtree in the web page template that is rooted at a repeating data node other than the node in the tagged record against the subtree rooted at the node in the tagged record; and mask the nodes that are not matched in the subtrees rooted at the repeating data nodes other than the node in the tagged record. Otherwise, execute step 315.

Step 315: prompt the user to tag the page to be automatically tagged so as to generate a new training page, then execute step 312.
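Steps 313 to 315 can be illustrated with a minimal sketch. The similarity measure (a `difflib.SequenceMatcher` ratio over preorder tag lists), the threshold value 0.8, and the policy of choosing the best-scoring template are all assumptions standing in for whatever tree similarity the system actually uses; the `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(eq=False)
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def preorder_tags(node):
    tags = [node.tag]
    for child in node.children:
        tags.extend(preorder_tags(child))
    return tags


def similarity(a, b):
    """Assumed measure: sequence similarity of the two preorder tag lists."""
    return SequenceMatcher(None, preorder_tags(a), preorder_tags(b)).ratio()


def find_repeating_nodes(record_root, threshold=0.8):
    """Step 313: climb from the record's ancestor until similar same-named
    siblings are found or the root of the tree is reached."""
    node = record_root
    while node.parent is not None:
        repeats = [s for s in node.parent.children
                   if s is not node and s.tag == node.tag
                   and similarity(s, node) > threshold]
        if repeats:
            return [node] + repeats
        node = node.parent
    return []


def classify(page_root, templates, threshold=0.8):
    """Steps 314 and 315: return the matching category name, or None when
    the user has to tag the page manually."""
    best, best_score = None, threshold
    for name, template_root in templates.items():
        score = similarity(template_root, page_root)
        if score > best_score:
            best, best_score = name, score
    return best


def build_list(n_items):
    root = Node("ul")
    for _ in range(n_items):
        item = root.add(Node("li"))
        item.add(Node("a"))
        item.add(Node("span"))
    return root


training = build_list(3)
print(len(find_repeating_nodes(training.children[0])))
print(classify(build_list(4), {"list": training}))
print(classify(Node("table"), {"list": training}))
```

A page that returns `None` from `classify` corresponds to the branch in which the user is prompted to tag the page and a new template is built.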

Step S302: according to the web page template, mask the differences between each page to be automatically tagged and the web page template of the category to which it belongs.

Step S302 further comprises: within a category, matching the DOM tree of the page to be automatically tagged against the web page template, and masking the nodes in the DOM tree of the page to be automatically tagged that are not matched.

Step S303: parse the training pages corresponding to the category to generate a first wrapper file, and use the first wrapper file to automatically tag the pages to be tagged in that category, so as to generate new training pages.

Step S303 further comprises: converting the training pages into token sequences, locating the front and back separators of the tagged data, and determining the left and right rules of each separator; treating each separator as a state of a nondeterministic state machine, with the left and right rules of each separator serving as the transition rules for jumping from the current state to the next state; and performing automatic tagging according to the first wrapper file.

Step S304: parse all training pages to generate a second wrapper file.

Described step S304 further comprises:

Step 341: convert the training pages into token sequences, locate the front and back separators of the tagged data, and determine the left and right rules of each separator;

Step 342: each separator is a state of a nondeterministic state machine, and the left and right rules of each separator are the transition rules for jumping from the current state to the next state.

Step S305: apply the second wrapper file to extract information from the web pages in the collection that were not selected.

Step S305 further comprises: converting each web page that was not selected into a token sequence; traversing the tokens in the sequence and judging whether each token satisfies a transition rule of the current state, and if so, jumping to the next state; and, when the current state and the next state correspond respectively to the front and back separators of an attribute, saving the page text between the separator of the current state and the separator of the next state as the value of the attribute to which the separators belong.

Those skilled in the art may make various modifications to the above content without departing from the spirit and scope of the present invention as defined by the claims. The scope of the present invention is therefore not limited to the above description, but is determined by the scope of the claims.

Claims

1. A web page information extraction system, characterized by comprising:

2. The web page information extraction system as claimed in claim 1, characterized in that,

3. The web page information extraction system as claimed in claim 2, characterized in that,

4. The web page information extraction system as claimed in claim 3, characterized in that,

5. The web page information extraction system as claimed in claim 4, characterized in that,

6. The web page information extraction system as claimed in claim 5, characterized in that,

7. The web page information extraction system as claimed in claim 2, characterized in that,

8. The web page information extraction system as claimed in claim 1, characterized in that,

9. The web page information extraction system as claimed in claim 8, characterized in that,

10. The web page information extraction system as claimed in claim 1, characterized in that,

11. The web page information extraction system as claimed in claim 10, characterized in that,

12. the method for a Web page information extraction is characterized in that, comprising:

Step 4 is resolved all training webpages, generates the second wrapper file;

13. the method for Web page information extraction as claimed in claim 12 is characterized in that,

Described step 1 further is:

14. the method for Web page information extraction as claimed in claim 13 is characterized in that,

Described step 132 also comprises:

15. the method for Web page information extraction as claimed in claim 14 is characterized in that,

Described step 141 further is:

Step 151 identifies the repeating data node in the described web page template;

16. the method for Web page information extraction as claimed in claim 15 is characterized in that,

Described step 151 further is

17. the method for Web page information extraction as claimed in claim 16 is characterized in that,

Also comprise after the described step 141:

18. the method for Web page information extraction as claimed in claim 13 is characterized in that,

Described step 2 further is,

19. the method for Web page information extraction as claimed in claim 12 is characterized in that,

20. the method for Web page information extraction as claimed in claim 19 is characterized in that,

Also comprise after the described step 191:

21. the method for Web page information extraction as claimed in claim 12 is characterized in that,

Described step 4 further is,

22. the method for Web page information extraction as claimed in claim 21 is characterized in that,

Described step 5 further is: