CN102591931B

CN102591931B - Recognition and extraction method for webpage data records based on tree weight

Info

Publication number: CN102591931B
Application number: CN201110438187.XA
Authority: CN
Inventors: 尹建伟; 彭勇; 杨弈锦; 邓水光; 李莹; 吴健; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2011-12-23
Filing date: 2011-12-23
Publication date: 2015-03-18
Anticipated expiration: 2031-12-23
Also published as: CN102591931A

Abstract

The invention discloses a recognition and extraction method for webpage data records based on a tree weight. The method comprises the following steps of: processing and transforming a webpage; recognizing the data records; aligning and extracting the data records; storing the data; and processing and converting the extracted webpage to a tag tree structure according to the characteristics of the tree structure of the contents on an HTML (hypertext markup language) webpage, assigning weights for each of tree nodes from bottom to top, so that the nodes in different layers have different weights, then, identifying a data record area according to a similar sub-tree set and the position consistency, and then, aligning the tree according to a tag tree set including the data records so as to generate a reference tree as an extraction template, so that the effects of high efficiency and high accuracy can be achieved.

Description

Recognition and extraction method for webpage data records based on tree weights

Technical field

The present invention relates to the field of information extraction, and in particular to a method for recognizing and extracting webpage data records based on tree weights.

Background technology

As the Internet develops, the amount of information on it grows exponentially over time. Web pages contain more and more resources and information that people care about, but as the volume grows, finding information becomes harder. Because information is scattered arbitrarily, users can only rely on full-text search to find what they need, yet the pages that contain the needed information are also flooded with irrelevant content such as advertisements and links, so the useful information cannot be obtained quickly and intuitively. Obtaining information manually is also inefficient. The information a user needs often has to be gathered from several different sources, and because the websites containing it differ in structure, each website must be queried and analyzed by hand, and the results must finally be organized into the required form and saved to a database for use by subsequent services. This process is tedious and very inefficient. How to extract the content that interests users from massive numbers of HTML documents accurately and efficiently therefore becomes increasingly important. The method for recognizing and extracting webpage data records based on tree weights is proposed against this background.

Web information extraction is the application of information extraction technology to the Internet. It extracts specific information from the massive structured, semi-structured, or free-form HTML text distributed across the Internet and converts it into a unified structured representation. Web information extraction differs from conventional information extraction in several characteristics: massive data volume, structural differences, dynamic change, unstructured data, and a lack of semantic information.

In web information extraction, the program responsible for extracting information from web pages is called a wrapper: a program that extracts information from semi-structured HTML text, converts it into unified structured information, and stores it. A wrapper automates information extraction, which is particularly useful when facing massive amounts of web information. According to the principle by which the wrapper is generated, the approaches can be divided into machine-based methods, natural language understanding methods, ontology-based methods, HTML-structure-based methods, and so on. Methods based on HTML structural features are the most studied and the best developed in current web information extraction. They make full use of the structural features of HTML text to extract data. Before extraction, the HTML text is first converted into a corresponding tag tree; extraction rules are then generated automatically or semi-automatically and applied to the tag tree to extract data. Research on this approach nevertheless still has some problems, as follows:

1. Most current research uses the tag tree of the entire HTML text of a page as the template, which is clearly inefficient. In practice many text nodes also contain HTML tags: text within a text node may carry decorative tags such as color or font, some text carries hyperlinks, and the content may also include pictures, tables, and the like. All of these are irrelevant information.

2. Some text nodes recur across the set of web pages. Such nodes should not be extracted as key content but should be treated as part of the template.

3. With the tree comparison methods described in existing extraction methods, the target page sets are generally small; for large-scale page sets, efficiency is very low.

4. Page structure comparison usually requires two or more pages; with only a single page, no template can be extracted.

These aspects have not been addressed, owing to two defining characteristics of web pages: complexity and massive scale. On such complex and massive page sets, extraction must achieve high precision and recall while the extraction time must also remain short, which is essential in practice. A highly accurate and efficient web content extraction method is therefore urgently needed.

Summary of the invention

The present invention mainly addresses the low accuracy and efficiency of existing web data recognition and extraction methods, which cannot effectively extract the needed information from large numbers of web pages. It proposes a method for recognizing and extracting webpage data records based on tree weights. Exploiting the tree structure of HTML page content, the crawled page is processed and converted into a tag tree; a weight is assigned to each tree node from the bottom up, so that nodes at different levels have different weights; data record regions are then identified from sets of similar subtrees and their positional continuity; and a tree alignment operation over the set of tag trees containing data records generates a benchmark tree that serves as the extraction template. This yields high efficiency and high accuracy.

To solve the above technical problems, the technical solution of the present invention is as follows:

1. A method for recognizing and extracting webpage data records based on tree weights, comprising the following steps:

(1) web page processing and conversion;

(2) data record recognition;

(3) data record alignment and extraction;

(4) data storage.

The web page processing and conversion comprises the following steps:

11) classifying the tags of the crawled web page according to their function and then constructing a tag tree;

12) assigning a weight to each tree node of the tag tree according to the following formula:

W = \lambda^{depth} + \sum_{i=1}^{n} SubW_{i}

where SubW_i is the weight of the i-th of the node's n child subtrees, λ is the weight adjustment parameter, and depth is the depth of the tree. A node whose weight is 0 is regarded as an irrelevant node. If a tree node is an irrelevant tag node, where irrelevant tags include hyperlink tags and tags describing display properties, then λ = 0 for that node. If a tree node is a leaf node, its weight is W = 1 when it is a text node or picture node and W = 0 for any other type.

The data record recognition comprises the following step:

13) The tag tree weighted in step 12) is fed as an input tree into the data record recognition module. The module first accesses the template tree library and searches it by comparing against the essential subtree set weights of each template tree. If the input tree contains the essential subtree set weights of a template tree, data record recognition is performed on the input tree using that template tree, yielding the corresponding data record regions. If no corresponding template tree is found in the template tree library, adaptive data record recognition is performed: data record regions are identified by judging the similar subtrees in the tree's own contiguous regions, yielding the corresponding data record regions.

The data record alignment and extraction comprises the following step:

14) From the array of data record regions obtained in step 13), the tag tree with the largest weight is taken as the benchmark tree T_b. For each remaining tag tree T_i in the array, all nodes under T_i that can be aligned with the benchmark tree T_b are found by comparing weights first and tags second, in descending order of weight. If, for a node T_i[j], there is a node T_b[k] under the benchmark tree T_b whose weight is greater than or equal to the threshold K and whose node tag is consistent, the node T_i[j] is considered alignable. If no alignable node exists, an insertion operation inserts the node T_i[j] into the benchmark tree T_b, thereby adjusting T_b. The adjusted benchmark tree T_b is used to align the other tag trees in the array, finally producing the final benchmark tree T_b.

The data storage comprises the following step:

15) The set of tag trees is matched against the template tree to obtain the information, and the result is saved in database form.

Further, in step 11) the HTML tags are divided into three classes according to their function: first, tags that plan the page layout, which provide the content information regions; second, tags that describe display properties, which control how content is displayed; third, tags related to hyperlinks.

Further, before weights are assigned to each tree node in step 12), the web page is denoised. The denoising step prunes the tag tree, which comprises: setting leaf node tags to irrelevant tags; setting the parent node tags of adjacent text or picture nodes to irrelevant tags; and setting the parent node tag of a text or picture node without siblings to an irrelevant tag.

Further, the data record recognition in step 13) requires comparing tag trees to judge their degree of similarity. The comparison method is as follows: tag tree T1 and tag tree T2 are deemed similar if the subtree set of T1 and the subtree set of T2 have an intersection of equal weights, that intersection contains a subtree whose weight is greater than the threshold K, and the equal-weight subtrees obey an order relation, namely that when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k.

Further, the insertion position of the insertion operation in step 14) is determined as follows. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent node in tag tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the benchmark tree T_b, then the sequence T_i[j] ... T_i[m] can be uniquely inserted between those two adjacent sibling nodes under T_b. If the sequence has only one adjacent sibling node k, on its left, under its common parent node in T_i, and node k aligns with the rightmost node under T_b, then the sequence can be uniquely inserted at the rightmost position, to the right of node k under T_b. If the sequence has only one adjacent sibling node k, on its right, under its common parent node in T_i, and node k aligns with the leftmost node under T_b, then the sequence can be uniquely inserted at the leftmost position, to the left of node k under T_b. If the position under T_b of an unaligned node k of tag tree T_i cannot be uniquely determined, no insertion is performed, and T_i is instead placed in the temporary data record array.

The beneficial effect of the present invention is as follows: exploiting the tree structure of HTML page content, the crawled page is processed and converted into a tag tree; a weight is assigned to each tree node from the bottom up, so that nodes at different levels have different weights; data record regions are then identified from sets of similar subtrees and their positional continuity; and a tree alignment operation over the set of tag trees containing data records generates a benchmark tree that serves as the extraction template. This yields high efficiency and high accuracy.

Brief description of the drawings

Fig. 1 is the architecture diagram of the present invention;

Fig. 2 is the tag tree corresponding to an HTML page;

Fig. 3 is the data record recognition flow for weighted tag trees;

Fig. 4 shows two trees that are judged to be similar;

Fig. 5 shows the two trees of insertion case 1;

Fig. 6 shows the two trees of insertion case 2;

Fig. 7 shows the two trees of insertion case 3;

Fig. 8 is the basic flow of data record alignment.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

The present invention proposes a method for recognizing and extracting webpage data records based on tree weights, which mainly comprises four parts: web page processing and conversion, data record recognition, data record alignment and extraction, and data storage. The architecture of the present invention is shown in Fig. 1.

Web page processing and conversion mainly converts the web pages crawled from different data sources into a form that a computer can process easily. The usual approach converts a page, by its tags, into a tree composed of tags; in addition, a web page processing and conversion method based on weighted tag trees is introduced to improve recognition efficiency.
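The conversion of a page into a tag tree can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the implementation of the invention: the dictionary node layout, the `#text` pseudo-tag, the `VOID_TAGS` set, and the names `TagTreeBuilder`, `build_tag_tree`, and `classify` are assumptions made for the example, and the three tag sets only mirror the tag classes named in this description.

```python
from html.parser import HTMLParser

# Tag classes named in the description; the exact sets are illustrative.
LAYOUT_TAGS = {"div", "p", "td", "tr", "table"}
DISPLAY_TAGS = {"b", "i", "strong", "h1", "h2"}
LINK_TAGS = {"a", "base"}
# Elements that never have a closing tag (assumed subset, not exhaustive).
VOID_TAGS = {"br", "img", "hr", "base", "input", "meta", "link"}


def classify(tag):
    """Return the class of a tag: layout, display, link, or other."""
    if tag in LAYOUT_TAGS:
        return "layout"
    if tag in DISPLAY_TAGS:
        return "display"
    if tag in LINK_TAGS:
        return "link"
    return "other"


class TagTreeBuilder(HTMLParser):
    """Builds a tree of {"tag", "children"} dicts while parsing."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text
            self.stack[-1]["children"].append(
                {"tag": "#text", "text": text, "children": []}
            )


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Keeping an explicit stack of open elements lets the builder tolerate the unclosed tags that are common in real pages; a production system would more likely rely on a full HTML5 parser.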

For a preprocessed web page, the page content is built into a tag tree; Fig. 2 shows the tag tree of an HTML page. According to their function, HTML tags can be divided into three classes:

1. Tags that plan the page layout, which provide the content information regions, such as <div>, <p>, <td>, <tr>, <table>, </table>, </tr>, </td>, </p>, </div>, etc.

2. Tags that describe display properties, which control how content is displayed, such as <b>, <i>, <strong>, <h1>, <h2>, </h2>, </h1>, </strong>, </i>, </b>, etc.

3. Tags related to hyperlinks, such as <a>, <base>, etc.

The web page processing and conversion method based on weighted tag trees defines a weight for each tree node:

W = \lambda^{depth} + \sum_{i=1}^{n} SubW_{i}

where SubW_i is the weight of the i-th of the node's n child subtrees, λ is the weight adjustment parameter, and depth is the depth of the tree. A node whose weight is 0 is regarded as an irrelevant node. If a tree node is a leaf node, its weight is W = 1 when it is a text node or picture node and W = 0 for any other type. In addition, if a tree node is an irrelevant tag node, such as <a>, <b>, <i>, <strong>, <h1>, <h2>, etc., then λ for that node is 0.

The overall framework of the tree node weight assignment method is recursive, which is dictated by the structure of the tag tree. Weights are assigned to each node recursively; the assignment flow for each node comprises accumulating the weights of its child nodes, computing the current depth, and handling leaf nodes and irrelevant tags. Web page noise is processed before the assignment. This noise includes advertisement links, navigation links, copyright information, and so on. To remove its influence, the tag tree is pruned before weights are assigned, that is, irrelevant-tag leaf nodes and irrelevant-tag tree nodes are handled according to the following strategies:

1. Leaf node tags are set to irrelevant tags.

2. The parent node tags of adjacent text or picture nodes are all set to irrelevant tags.

3. The parent node tag of a text or picture node without siblings is set to an irrelevant tag.

Data record recognition mainly takes the set of processed and converted, weighted tag trees and applies the weight-based data record recognition method to identify the data record regions in them. The flow is shown in Fig. 3. After the tag tree set is fed into the data record recognition module, for each input tag tree the module first accesses the template tree library and searches it by comparing against the essential subtree set weights of each template tree. If the input tree contains the essential subtree set weights, data record recognition is performed on it using that template tree. If no corresponding template tree is found in the template tree library, adaptive data record recognition is performed: the tree's own subtrees are compared to identify the data record regions.

Data record recognition requires comparing tag trees to judge their degree of similarity. The comparison methods generally used compare the node tags of the tree structure and do not include the content of text nodes.

The comparison method used by the present invention is a weight-based tree comparison method. The similarity of tag tree T1 and tag tree T2 is defined as follows: T1 and T2 are similar if the subtree set of T1 and the subtree set of T2 have an intersection of equal weights, that intersection contains a subtree whose weight is greater than the threshold K, and the equal-weight subtrees obey an order relation, namely that when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k.

For example, the tag trees shown in Fig. 4 are similar, whereas the tree comparison methods of the prior art might judge them dissimilar because of thresholds and similar factors. Tree T1 and tree T2 both contain a subtree a with weight 11121 and a subtree c with weight 1421, that is, subtrees of identical weight, and their order is consistent, so the two trees are similar under the judgment strategy described here. Under other methods, the two trees might be judged unequal because of the subtree d that exists under T2 and because subtree b under T1 does not match subtree k under T2.
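The similarity test can be sketched in Python as follows. This is one possible reading, not the definitive procedure: each tree is represented only by the left-to-right list of its child subtree weights, the order relation is checked on the first occurrence of each shared weight, and the function name `similar` is hypothetical. The weights 11121 and 1421 come from the Fig. 4 example; the weights 37, 52, and 9 used for subtrees b, k, and d are invented for illustration.

```python
def similar(weights1, weights2, k):
    """Judge two trees similar from the weights of their child subtrees.

    weights1, weights2: child subtree weights in document order.
    k: the threshold K; at least one shared weight must exceed it.
    """
    common = set(weights1) & set(weights2)
    # The intersection must contain a subtree weight greater than K.
    if not any(w > k for w in common):
        return False

    def shared_order(weights):
        # Shared weights in order of first appearance (assumed reading
        # of the order relation when weights repeat).
        seen = []
        for w in weights:
            if w in common and w not in seen:
                seen.append(w)
        return seen

    # Equal-weight subtrees must appear in the same relative order.
    return shared_order(weights1) == shared_order(weights2)


t1 = [11121, 37, 1421]        # subtrees a, b, c
t2 = [11121, 52, 1421, 9]     # subtrees a, k, c, d
print(similar(t1, t2, k=100))
```

Because only weights are compared, the unmatched subtrees b, k, and d do not prevent the two trees from being judged similar, which is the behavior the Fig. 4 example describes.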

In the weight-based data record recognition method, as a tag tree is traversed, the weight of each traversed node is added to a duplicate-free set. The template library is then traversed, and each template tree is judged against this weight set and the essential subtree set weights of the template tree; if a match exists, that template tree is selected and data record recognition is performed recursively. If no corresponding template tree is found in the template tree library, the precondition for template-based recognition evaluates to false and the adaptive data record recognition flow starts: data record regions are identified by judging the similar subtrees in the tree's own contiguous regions, yielding the corresponding data records.

Data record alignment and extraction mainly uses a tree alignment method based on weighted tag trees to improve alignment efficiency. The basic flow of data record alignment is shown in Fig. 8. The input of data record alignment and extraction is the array of data record regions returned by data record recognition for the tag tree set. Each data record in the array is aligned to a benchmark tree, which is generated by the tree alignment method; after each node of each data record in the array has been aligned with the benchmark tree, the data of interest to the user are extracted.

The basic idea of the weight-based tree alignment method is to first choose the tree with the largest weight in the data record array as the benchmark tree T_b. The largest weight is chosen because that tree has the deepest or widest tree structure, so other trees can be aligned with it more easily. Then, for each tree T_i in the record array, the present invention tries to find all nodes under T_i that can be aligned with the benchmark tree T_b, comparing weights first and tags second, in descending order of weight. If, for a node T_i[j], there is a node T_b[k] under T_b such that either the weights are greater than the threshold K and equal, or the weights are less than K and equal and the node tags are consistent, then T_i[j] is considered alignable. If no alignable node exists, an insertion operation inserts T_i[j] into the benchmark tree T_b, thereby adjusting T_b. The adjusted benchmark tree T_b is used to align the trees of the other data records in the data record array.

The tree alignment method must align many tag trees, and the alignment is realized by comparing two tag trees at a time. The comparison of two trees proceeds as follows:

After the benchmark tree T_b (or template tree T_t, likewise below) has been aligned with a tree T_i, some nodes in T_i correspond to nodes in T_b, and the weights of these nodes are equal. The unaligned nodes, that is, the nodes with unequal weights, are inserted into the benchmark tree T_b to adjust it, because these nodes may contain optional data items. When a node T_i[j] is inserted into T_b, several situations are possible, all depending on whether T_i[j] can be inserted unambiguously at some position under T_b. In practice, when inserting an unaligned node T_i[j], the whole run of adjacent unaligned nodes T_i[j] ... T_i[m] can be inserted in place of the single node T_i[j] to improve efficiency. Without loss of generality, assume that the parent node of T_i[j] ... T_i[m] has an aligned node under T_b; the nodes T_i[j] ... T_i[m] are then to be inserted under that aligned parent node in T_b. These nodes cannot be inserted in arbitrary order; their position must be determined first. The insertion position of the node sequence T_i[j] ... T_i[m] can be uniquely determined in the following cases:

1. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent node in tag tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the benchmark tree T_b, then the sequence T_i[j] ... T_i[m] can be uniquely inserted between those two adjacent sibling nodes under T_b. Fig. 5 shows simplified diagrams of tag tree T_i and benchmark tree T_b: the consecutive adjacent nodes b and c under T_i can be inserted between node a and node d under T_b, because nodes a and d under T_i have corresponding aligned nodes under T_b. The benchmark tree T_b after insertion is shown in the lower half of Fig. 5. The nodes a, b, c, d, e in the figure need not be simple text nodes; each may also be a tree.

2. As shown in Fig. 6, if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its left, under its common parent node in tag tree T_i, and node k aligns with the rightmost node under the benchmark tree T_b, then the sequence T_i[j] ... T_i[m] can be uniquely inserted at the rightmost position, to the right of node k under T_b.

3. As shown in Fig. 7, if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its right, under its common parent node in tag tree T_i, and node k aligns with the leftmost node under the benchmark tree T_b, then the sequence T_i[j] ... T_i[m] can be uniquely inserted at the leftmost position, to the left of node k under T_b.

In addition, if the position under T_b of an unaligned node k of tag tree T_i cannot be uniquely determined, the node is not inserted; instead, tag tree T_i is placed in the temporary data record array. Fig. 7 illustrates this situation, in which node k could be placed either between node a and node b or between node b and node c.
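The three insertion cases and the ambiguous case can be sketched as follows. The sketch makes simplifying assumptions that are not part of the description: the children of the aligned parent are represented as lists of labels, a node of T_i is taken to align with the node of T_b that carries the same label, and labels are unique among siblings. The function name `insert_position` is hypothetical.

```python
def insert_position(ti, tb, j, m):
    """Return the index in tb at which the unaligned run ti[j..m]
    (inclusive) can be uniquely inserted, or None if it is ambiguous.

    ti, tb: sibling labels under the aligned parent in T_i and T_b.
    """
    left = ti[j - 1] if j > 0 else None
    right = ti[m + 1] if m + 1 < len(ti) else None

    if left is not None and right is not None:
        # Case 1: both neighbours aligned and adjacent in T_b.
        if left in tb and right in tb:
            if tb.index(right) == tb.index(left) + 1:
                return tb.index(right)
        return None
    if left is not None:
        # Case 2: only a left neighbour, aligned with T_b's rightmost node.
        return len(tb) if tb and tb[-1] == left else None
    if right is not None:
        # Case 3: only a right neighbour, aligned with T_b's leftmost node.
        return 0 if tb and tb[0] == right else None
    return None


# Fig. 5 style example: b and c go between a and d.
ti = ["a", "b", "c", "d"]
tb = ["a", "d", "e"]
pos = insert_position(ti, tb, 1, 2)
if pos is not None:
    tb[pos:pos] = ti[1:3]   # splice the whole run in one step
print(tb)
```

Returning `None` rather than guessing a position corresponds to deferring the tree to the temporary data record array, so that it can be retried once the benchmark tree has been adjusted by other trees.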

Data storage mainly matches the tag tree set against the template tree to obtain the relevant information, and saves the results as individual database records in the corresponding database for use by subsequent services such as querying.
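Saving matched records row by row might look like the following sketch, which uses Python's standard `sqlite3` module. The description does not specify a database or schema, so the table name `records`, the text-only columns, and the function name `store_records` are assumptions; each record is assumed to be a dictionary keyed by the field names derived from the template tree, with optional fields simply absent.

```python
import sqlite3


def store_records(conn, fields, records):
    """Save extracted records, one database row per data record.

    fields: column names derived from the template tree (assumed to be
            trusted identifiers, since they are interpolated into SQL).
    records: dicts mapping field name to extracted text; fields that a
             record lacks are stored as NULL.
    """
    columns = ", ".join(f'"{f}" TEXT' for f in fields)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS records "
        f"(id INTEGER PRIMARY KEY, {columns})"
    )
    names = ", ".join(f'"{f}"' for f in fields)
    marks = ", ".join("?" for _ in fields)
    conn.executemany(
        f"INSERT INTO records ({names}) VALUES ({marks})",
        [tuple(rec.get(f) for f in fields) for rec in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_records(
    conn,
    ["title", "price"],
    [{"title": "A", "price": "1"}, {"title": "B"}],
)
print(conn.execute("SELECT title, price FROM records ORDER BY id").fetchall())
```

Storing missing fields as NULL matches the role of optional data items in the benchmark tree: a record that lacks an optional node still produces a complete row.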

The present invention is further described below in the form of code.

1) Web page processing and conversion

The overall framework of the tree node weight assignment method is recursive, which is dictated by the structure of the tag tree. The method assigns a weight to each node recursively; the assignment flow for each node comprises accumulating the weights of its child nodes, computing the current depth, and handling leaf nodes and irrelevant tags.

The flow of the tree node weight assignment method is recursive; the specific method is as follows:

In this method, line 1 first checks whether the current tree is a non-leaf node. If it is, line 2 initializes the depth and weight of the tree, and lines 3 to 7 traverse all child nodes of the current tree and accumulate their weights; the weight of each subtree is obtained by a recursive call to the same procedure, and the depth of the current tree is set to the maximum depth among all subtrees plus 1. Line 10 checks whether the current tree is a leaf node that is a text node, and lines 11 to 12 assign the depth and weight of that text leaf node. Line 14 checks whether the current tree is a leaf node that is not a text node, and lines 15 to 16 assign the depth and weight of that non-text leaf node. Line 18 checks whether the accumulated sum of the child node weights is 0; if so, the weight of the current tree is 0. Line 20 checks whether the tag of the current tree is in the irrelevant tag array; if so, the weight is left unchanged (line 21); otherwise, the weight of the current tree is the sum of the subtree weights plus the weight adjustment parameter raised to the power of the depth (line 22). When the flow ends, every tree node has been assigned a weight W.
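A minimal Python sketch of the recursive assignment described above follows. It is an illustration under stated assumptions rather than the original listing: leaf nodes are given depth 0, text and picture leaves are recognized by the hypothetical tags `#text` and `img`, the irrelevant tag set is taken from the examples in this description, and the value of λ (`lam`) is a free parameter because the description does not fix it.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Irrelevant tags as exemplified in the description (hyperlink and
# display-property tags); the exact set is an assumption.
IRRELEVANT_TAGS = {"a", "b", "i", "strong", "h1", "h2"}
# Leaf tags that carry content: text and picture nodes (assumed names).
CONTENT_LEAVES = {"#text", "img"}


@dataclass
class Node:
    tag: str
    children: list[Node] = field(default_factory=list)
    depth: int = 0
    weight: float = 0


def assign_weights(node, lam=10):
    """Assign depth and weight bottom-up; returns the node's weight."""
    if node.children:
        node.depth = 0
        child_sum = 0
        for child in node.children:
            child_sum += assign_weights(child, lam)
            node.depth = max(node.depth, child.depth + 1)
        if child_sum == 0:
            node.weight = 0                  # nothing relevant below
        elif node.tag in IRRELEVANT_TAGS:
            node.weight = child_sum          # lambda = 0: no own term
        else:
            node.weight = child_sum + lam ** node.depth
    else:
        node.depth = 0                       # assumed leaf depth
        node.weight = 1 if node.tag in CONTENT_LEAVES else 0
    return node.weight


page = Node("div", [
    Node("p", [Node("#text")]),
    Node("p", [Node("a", [Node("#text")])]),
    Node("br"),
])
print(assign_weights(page, lam=10))
```

With λ = 10, the first paragraph receives 1 + 10^1 = 11, the linked paragraph receives 1 + 10^2 = 101 because the `<a>` node only passes its child's weight through, and the root receives 11 + 101 + 0 + 10^3 = 1112, so nodes at different levels end up with clearly separated weights.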

2) Data record recognition

First, as a tag tree is traversed, the weight of each traversed node is added to a duplicate-free set. The template library is then traversed, and each template tree is judged against this weight set and the essential subtree set weights of the template tree; if a match exists, that template tree is selected to perform data record recognition.

The flow of this recognition method is recursive; the specific method is as follows:

Whether a subtree is a data record region is determined by comparing the weights of each subtree of the tree with the selected template tree. In the recognition code, line 1 checks whether the current tree has child nodes, and line 2 traverses all child nodes of the current tree. Line 4 calls the Compare function to judge whether the current subtree contains the selected template tree or the essential subtrees of the selected template tree; if so, the subtree is added to the data record array arrayT (line 5); otherwise, the child node is visited by a recursive call (line 6). The recognition function can call the Compare function directly to judge whether the current subtree is a data record region. The judgment conditions are as follows:

1) If the weights are equal, return true.

2) If the weights are unequal but the input tree contains all of the template tree's essential subtree set weights, return true.

3) In all other cases, return false.
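The recursive recognition and the Compare conditions can be sketched as follows. The sketch assumes that node weights have already been assigned, that the selected template tree is summarized by its root weight and a set of essential subtree weights, and that "contains" in condition 2 refers to the weights of the candidate's direct children; this last point is one possible reading, chosen so that an enclosing list container is not itself reported as a record. The names `Node`, `compare`, and `identify` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    weight: float
    children: list = field(default_factory=list)


def compare(subtree, template_weight, essential_weights):
    """Judge whether a subtree is a data record region."""
    # Condition 1: same weight as the template tree.
    if subtree.weight == template_weight:
        return True
    # Condition 2: all essential subtree weights are present among the
    # subtree's direct children (assumed reading of "contains").
    child_weights = {child.weight for child in subtree.children}
    # Condition 3: everything else is rejected.
    return bool(essential_weights) and essential_weights <= child_weights


def identify(tree, template_weight, essential_weights, array_t):
    """Collect data record subtrees into array_t, descending on a miss."""
    for child in tree.children:
        if compare(child, template_weight, essential_weights):
            array_t.append(child)
        else:
            identify(child, template_weight, essential_weights, array_t)
    return array_t


rec1 = Node("li", 112, [Node("p", 11), Node("p", 101)])
rec2 = Node("li", 123, [Node("p", 11), Node("p", 101), Node("p", 11)])
rec3 = Node("li", 11, [Node("#text", 1)])
page = Node("body", 1500, [Node("div", 50), Node("ul", 1246, [rec1, rec2, rec3])])
records = identify(page, 112, {11, 101}, [])
print([r.weight for r in records])
```

Here `rec1` matches by equal weight, `rec2` matches because it contains both essential subtree weights despite an extra optional node, and `rec3` is rejected.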

If no corresponding template tree is found in the template tree library, the precondition for template-based recognition evaluates to false and the adaptive data record recognition flow starts: data record regions are identified by judging the similar subtrees in the tree's own contiguous regions, yielding the corresponding data records.

3) Weight-based tree alignment

To align two trees, the subtrees of the two tag trees are first compared by weight in descending order. If subtrees with equal weights exist, those subtrees need no adjustment. Otherwise, an alignment judgment is made on the subtrees with unequal weights: the runs of consecutive nodes in the tag tree that fail to align are found, an insertion operation is performed on the benchmark tree T_b, and T_b is adjusted. This process continues until no unaligned run of consecutive nodes remains or T_b can no longer be adjusted. If T_b cannot be adjusted, there are nodes whose insertion position cannot be uniquely determined, and the alignment of this tree is not completed. If the alignment is not completed, the tag tree is added to the temporary record array.

To align many trees, the following strategy is adopted: trees are aligned in order of tree node weight, with larger weights aligned first, so that each alignment contributes the most and adjusts the benchmark tree the most, which reduces the number of subsequent alignments and improves alignment efficiency. The specific method is as follows:

The method sorts the result record array arrayT produced by the data record recognition module in descending order (line 2) and takes the tree with the largest weight as the benchmark tree T_b (line 3). It creates a record array temparrayT that holds unaligned trees (line 5). The loop tests the length of the record array arrayT (line 6) and empties the temporary record array temparrayT (line 7). It then iterates over arrayT (line 8): the tree T with the current largest weight is taken from arrayT (line 9) and aligned with T_b by calling the function TreeAlign (line 10); if the alignment fails, T is added to temparrayT (line 11); finally, T is removed from arrayT (line 12). After the iteration over arrayT ends, a clone of temparrayT is assigned to arrayT and the next loop test begins (line 14).
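The multi-tree alignment loop can be sketched as follows. The pairwise alignment is passed in as a callable standing in for TreeAlign, whose internals are not reproduced here; `align_all`, `toy_align`, and the use of label lists as trees are hypothetical. The sketch also adds one guard that the description does not mention: the loop stops when a full pass aligns no tree, since otherwise trees that can never be aligned would keep it running.

```python
def align_all(array_t, tree_align, weight):
    """Align every tree to the benchmark tree, heaviest first.

    array_t: trees returned by data record recognition.
    tree_align(base, tree): returns True on success and may adjust base.
    weight(tree): the tree's weight, used for ordering.
    Returns (benchmark tree, trees that could not be aligned).
    """
    pending = sorted(array_t, key=weight, reverse=True)
    if not pending:
        return None, []
    base = pending.pop(0)              # largest weight becomes T_b
    while pending:
        failed = []                    # plays the role of temparrayT
        for tree in pending:
            if not tree_align(base, tree):
                failed.append(tree)
        if len(failed) == len(pending):
            return base, failed        # no progress in this pass: stop
        pending = failed               # retry against the adjusted T_b
    return base, []


def toy_align(base, tree):
    """Toy stand-in: trees are label lists; alignment needs a shared label."""
    if not set(base) & set(tree):
        return False
    base.extend([x for x in tree if x not in base])
    return True


trees = [["a", "b", "c"], ["x", "y"], ["c", "x"], ["q"]]
base, leftover = align_all(trees, toy_align, weight=len)
print(base, leftover)
```

In the toy run, the tree `["x", "y"]` fails in the first pass but succeeds in the second, after `["c", "x"]` has extended the benchmark tree, which is the reason deferred trees are retried rather than discarded.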

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the inventive concept, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for recognizing and extracting webpage data records based on tree weights, characterized in that it comprises the following steps:

(1) webpage processing and conversion;

(2) data record identification;

(3) data record alignment and extraction;

(4) data storage;

Said webpage processing and conversion comprises the following steps:

$$W = \lambda^{\mathrm{depth}} + \sum_{i=1}^{n} \mathrm{SubW}_i$$

where SubW_i is the weight of a child tree node, λ is the weight adjustment parameter, and depth is the depth of the tree. If the weight of a node is 0, the node is considered an irrelevant tag node; if a tree node is an irrelevant tag node, the λ corresponding to that tree node is 0. If a tree node is a leaf node and is a text node or a picture node, its weight is W = 1; a leaf node of any other type has weight W = 0. Said irrelevant tags comprise hyperlink tags and tags that describe display properties;
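A minimal sketch of this bottom-up weight assignment is given below. Several points are assumptions made only for illustration: the `Node` class, the tag sets `IRRELEVANT_TAGS` and `CONTENT_LEAVES`, the default λ = 2, and in particular the reading of depth as the height of the subtree rooted at the node (leaf = 0). For an irrelevant tag node the λ term is simply dropped, which also avoids the Python convention that `0 ** 0 == 1`.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative tag sets (assumptions, not taken from the claims).
IRRELEVANT_TAGS = {"a", "b", "i", "font"}  # hyperlink / display tags
CONTENT_LEAVES = {"#text", "img"}          # text and picture leaves


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    weight: float = 0.0
    height: int = 0


def assign_weights(node: Node, lam: float = 2.0) -> float:
    """Assign W = lam**depth + sum(SubW_i) from the leaves upward."""
    if not node.children:
        # Leaf: text or picture nodes weigh 1, everything else 0.
        node.height = 0
        node.weight = 1.0 if node.tag in CONTENT_LEAVES else 0.0
        return node.weight
    sub_total = sum(assign_weights(child, lam) for child in node.children)
    node.height = 1 + max(child.height for child in node.children)
    # Irrelevant tag nodes contribute no term of their own (lambda = 0).
    own = 0.0 if node.tag in IRRELEVANT_TAGS else lam ** node.height
    node.weight = own + sub_total
    return node.weight


page = Node("div", [
    Node("ul", [
        Node("li", [Node("#text")]),
        Node("li", [Node("img")]),
    ]),
    Node("a", [Node("#text")]),
])
total = assign_weights(page)
```

With these assumptions each `li` weighs 2 + 1 = 3, the `ul` weighs 4 + 6 = 10, the hyperlink weighs 1, and the root weighs 8 + 10 + 1 = 19, so nodes in different layers receive clearly different weights.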

Said data record identification comprises the following steps:

Said data record alignment and extraction comprises the following steps:

14) the tag tree with the maximum weight in the data record area array obtained in step 13) is taken as the reference tree T_b. The object of data record alignment and extraction is mainly the array of tag tree sets containing data record areas that is returned by data record identification. The tree with the maximum weight is selected because it has the deepest or widest tree structure, so the other trees can be aligned with it more easily. Then, for each remaining tag tree T_i in the data record area array, all nodes under T_i that can be aligned with said reference tree T_b are found, comparing weight first and tag second, in descending order of weight. If, for a node T_i[j], there is a node T_b[k] under the reference tree T_b whose weight is greater than or equal to the threshold K and whose node tag is consistent, node T_i[j] is considered alignable. If no alignable node exists, an insertion operation is performed: node T_i[j] is inserted into said reference tree T_b so that T_b is adjusted. The adjusted reference tree T_b is used to align the other tag trees in the data record area array, and the final reference tree T_b is produced at the end;

Said data storage comprises the following steps:

15) the tag tree set is matched against the attribute-labeled template tree produced by the alignment in step 14) to obtain the information, and the result is saved in a database. To store the data aligned in step 14), a database table is built from the labeled attributes, so that during data alignment each tag tree in the tag tree set corresponds to one record in the database table;
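The storage step can be illustrated with Python's standard `sqlite3` module. This is only a sketch under stated assumptions: the function name `store_records`, the table name, the attribute names, and the choice of SQLite with TEXT columns are all hypothetical, since the text only says that a table is built from the labeled attributes and that each tag tree becomes one record. The attribute names are assumed to be trusted identifiers.

```python
import sqlite3
from typing import Dict, List, Optional


def store_records(
    conn: sqlite3.Connection,
    table: str,
    attributes: List[str],
    records: List[Dict[str, Optional[str]]],
) -> None:
    """Create one column per labeled attribute and one row per tag tree."""
    columns = ", ".join(f'"{name}" TEXT' for name in attributes)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in attributes)
    rows = [tuple(rec.get(name) for name in attributes) for rec in records]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_records(
    conn,
    "records",
    ["title", "price"],
    [{"title": "A", "price": "1"}, {"title": "B"}],  # second tree lacks a price
)
stored = conn.execute(
    'SELECT title, price FROM records ORDER BY title'
).fetchall()
```

A tag tree that has no node for some attribute of the template simply yields NULL in that column, which keeps one row per data record.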

In said step 11), the HTML tags are divided into three classes according to their function: the first class is tags that plan the page layout, which delimit the content information areas; the second class is tags that describe display properties, which determine how the content is displayed; the third class is hyperlink-related tags;

Before weights are assigned to each tree node in step 12), the webpage is denoised. Said denoising step prunes the tag tree: leaf node tags are set as irrelevant tags, the parent node tags of adjacent text or picture nodes are set as irrelevant tags, and the parent node tag of a text or picture node without siblings is set as an irrelevant tag;

In said step 13), data record identification requires comparing tag trees to judge their degree of similarity. The comparison method is as follows: if the subtree set of tag tree T1 and the subtree set of tag tree T2 have an intersection of subtrees with equal weights, that intersection must contain a subtree whose weight is greater than the threshold K, and the equal-weight subtrees must also preserve order, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k; tag tree T1 and tag tree T2 are then considered similar;
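One possible reading of this similarity test is sketched below in Python. It is an interpretation, not the claimed procedure: each tree is reduced to the list of its subtree weights in document order, the order condition is checked with a longest-common-subsequence match (a technique substituted here for illustration), and the names `lcs_length` and `similar` are hypothetical. The trees are treated as similar when some shared weight exceeds K and every shared weight can be matched without crossing.

```python
from collections import Counter
from typing import Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence of two weight lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similar(w1: Sequence[int], w2: Sequence[int], k: int) -> bool:
    """Similarity test on the subtree weights of two tag trees."""
    common = Counter(w1) & Counter(w2)  # equal-weight intersection
    # At least one shared subtree weight must exceed the threshold K.
    if not any(weight > k for weight in common):
        return False
    # Order condition: all shared weights must match in the same order.
    return lcs_length(w1, w2) == sum(common.values())


same_order = similar([3, 10, 1], [3, 10, 5], k=4)
swapped = similar([3, 10], [10, 3], k=4)
too_light = similar([3, 1], [3, 2], k=4)
```

Under this reading the first pair is similar, the second fails the order condition, and the third shares only a subtree lighter than K.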

In said step 14), the insertion position of said insertion operation is determined as follows: if the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent node in tag tree T_i, one on its left and one on its right, and both sibling nodes have corresponding aligned nodes under the reference tree T_b, then the node sequence T_i[j] ... T_i[m] can be uniquely inserted between those two adjacent sibling nodes under the reference tree T_b; if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its left, under its common parent node in tag tree T_i, and node k corresponds to the rightmost node under the reference tree T_b, then the node sequence can be uniquely inserted to the right of node k under the reference tree T_b; if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its right, under its common parent node in tag tree T_i, and node k corresponds to the leftmost node under the reference tree T_b, then the node sequence can be uniquely inserted to the left of node k under the reference tree T_b; if the position under the reference tree T_b of a non-aligned node k of tag tree T_i cannot be uniquely determined, the insertion is not performed, and tag tree T_i is instead put into the temporary data record array.
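The three insertion cases and the fallback can be written as a small Python function. This is a sketch under assumptions: the children of the matching parent in the reference tree are modelled as a list, `left` and `right` are the list indices of the aligned counterparts of the run's left and right siblings (or `None` when the sibling is absent), and the names `insertion_index` and `insert_run` are hypothetical. The "between" case is read as requiring the two counterparts to be adjacent in T_b, and a run with no aligned sibling at all is treated as ambiguous.

```python
from typing import List, Optional


def insertion_index(
    n_base_children: int, left: Optional[int], right: Optional[int]
) -> Optional[int]:
    """Index at which an unaligned run may be inserted, or None."""
    if left is not None and right is not None:
        # Unique only if the two counterparts are adjacent in T_b.
        return right if right == left + 1 else None
    if left is not None:
        # Only a left sibling: it must be the rightmost child in T_b.
        return left + 1 if left == n_base_children - 1 else None
    if right is not None:
        # Only a right sibling: it must be the leftmost child in T_b.
        return 0 if right == 0 else None
    return None  # assumption: no aligned sibling means ambiguous


def insert_run(
    base_children: List[str],
    run: List[str],
    left: Optional[int],
    right: Optional[int],
) -> bool:
    """Insert the run in place; return False if the tree must be deferred."""
    index = insertion_index(len(base_children), left, right)
    if index is None:
        return False
    base_children[index:index] = run
    return True


base = ["h3", "p", "span"]
ok_between = insert_run(base, ["img"], left=0, right=1)
ok_ambiguous = insert_run(base, ["em"], left=0, right=3)
```

After the first call the reference children become `["h3", "img", "p", "span"]`; the second call is rejected because other nodes sit between the two counterparts, so the caller would move the tag tree to the temporary data record array.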