CN102591931A

CN102591931A - Recognition and extraction method for webpage data records based on tree weight

Info

Publication number: CN102591931A
Application number: CN201110438187XA
Authority: CN
Inventors: 尹建伟; 彭勇; 杨弈锦; 邓水光; 李莹; 吴健; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2011-12-23
Filing date: 2011-12-23
Publication date: 2012-07-18
Anticipated expiration: 2031-12-23
Also published as: CN102591931B

Abstract

The invention discloses a method for recognizing and extracting webpage data records based on tree weights. The method comprises four steps: webpage processing and conversion; data record recognition; data record alignment and extraction; and data storage. Exploiting the tree structure of HTML (hypertext markup language) webpage content, each crawled webpage is processed and converted into a tag tree, and a weight is assigned to every tree node from the bottom up so that nodes at different levels carry different weights. Data record regions are then identified from sets of similar subtrees and their positional continuity. Finally, the tag trees containing data records are aligned to generate a reference tree that serves as the extraction template, so that high efficiency and high accuracy can be achieved.

Description

Method for recognizing and extracting webpage data records based on tree weights

Technical field

The present invention relates to the field of information extraction, and in particular to a method for recognizing and extracting webpage data records based on tree weights.

Background technology

Information on the Internet has grown exponentially over time, and webpages contain more and more of the resources and information that people care about. As the volume of information grows, however, finding information becomes difficult. Because information is scattered unpredictably, users can only fall back on full-text search to find what they need. Yet the webpages that contain the required information are also flooded with irrelevant content such as advertisements and links, so the useful information cannot be obtained quickly and directly, and the current reliance on manual collection is inefficient. The information a user needs often has to be gathered from several different sources. Because the websites holding this information differ in structure, each site must be queried, analyzed and processed by hand, and the results must finally be arranged into the required form and saved in a database for use by downstream services. This process is tedious and very inefficient. How to extract the content of interest to the user from a massive number of HTML documents accurately and efficiently is therefore increasingly important, and it is against this background that the method for recognizing and extracting webpage data records based on tree weights is proposed.

Webpage information extraction is the application of information extraction technology to the Internet. It extracts specific information from the massive body of structured, semi-structured or free-form HTML text distributed across the Internet and converts it into a unified structured representation. Unlike conventional information extraction, Internet webpage information extraction is characterized by massive data volume, structural heterogeneity, dynamic change, unstructured data and a lack of semantic information.

In webpage information extraction, the program responsible for extracting information from a webpage is called a wrapper: a program that extracts information from semi-structured HTML text, converts it into unified structured information and stores it. A wrapper automates information extraction and is particularly useful when dealing with massive amounts of webpage information. According to the principle by which they are generated, wrappers can be divided into machine-learning methods, natural-language-understanding methods, ontology-based methods, HTML-based methods and so on. Methods based on HTML structural features are the most studied in current webpage information extraction and are also the best developed. They make full use of the structural features of HTML text to extract data: before extraction, the HTML text is converted into a corresponding tag tree; extraction rules are then generated automatically or semi-automatically and applied to the tag tree to extract the data. Research on this approach still faces several problems, as follows:

1. Most current research uses the tag tree corresponding to the entire HTML text of a webpage as the template, which is clearly inefficient. In practice many text nodes also contain HTML tags: much of the text carries decorative tags such as color or font tags, some text carries hyperlinks, and the content may also contain pictures, tables and the like. All of this is irrelevant information.

2. Some text nodes occur repeatedly across a set of webpages. Such nodes should not be extracted as key content but should be treated as part of the template.

3. The tree comparison methods described in existing extraction methods generally target small sets of webpages; on large sets their efficiency is very low.

4. Approaches that compare page structure usually need two or more pages; with only a single page, no template can be extracted.

These issues go unaddressed because of two defining characteristics of webpages: their complexity and their sheer volume. On a webpage collection that is both complex and massive, extraction must achieve high precision and high recall while also running quickly, which is essential in practice. A webpage content extraction method that is both highly accurate and highly efficient is therefore urgently needed.

Summary of the invention

The present invention addresses the low accuracy and low efficiency of existing webpage data recognition and extraction methods, which cannot effectively extract the required information from large numbers of webpages. It proposes a method for recognizing and extracting webpage data records based on tree weights. Exploiting the tree structure of HTML webpage content, each crawled webpage is processed and converted into a tag tree, and a weight is assigned to each tree node from the bottom up so that nodes at different levels carry different weights. Data record regions are then identified from sets of similar subtrees and their positional continuity, and the tag trees containing data records are aligned to generate a reference tree that serves as the extraction template. The method achieves high efficiency and high accuracy.

To solve the above technical problem, the technical solution of the present invention is as follows:

1. A method for recognizing and extracting webpage data records based on tree weights, comprising the following steps:

(1) webpage processing and conversion;

(2) data record recognition;

(3) data record alignment and extraction;

(4) data storage;

The webpage processing and conversion comprises the following steps:

11) classifying the tags of the crawled webpage according to their function and then constructing a tag tree;

12) assigning a weight to each tree node of the tag tree according to the following formula:

$$W = \lambda^{depth} + \sum_{i=1}^{n} SubW_{i}$$

where SubW_i is the weight of the i-th subtree node, λ is the weight adjustment parameter and depth is the depth of the tree. A node whose weight is 0 is regarded as an irrelevant node. If the tree node is an irrelevant-tag node (irrelevant tags comprise hyperlink tags and tags describing display properties), the λ for that tree is set to 0. If the tree node is a leaf node, its weight is W = 1 when it is a text node or picture node and W = 0 for any other type;

The data record recognition comprises the following steps:

13) the tag tree weighted in step 12) is fed as an input tree to the data record recognition module. The module first accesses the template tree library and searches it by comparing against the weights of the essential subtree set of each template tree in the library. If the input tree contains the weights of the essential subtree set, data record recognition is performed on the input tree using that template tree, yielding the corresponding data record region. If no corresponding template tree is found in the library, adaptive data record recognition is performed: data record regions are identified by judging the similar subtrees within the tree's own contiguous regions, yielding the corresponding data record region;

The data record alignment and extraction comprises the following steps:

14) the tag tree with the largest weight in the data record region array obtained in step 13) is taken as the reference tree T_b. For each remaining tag tree T_i in the array, all nodes under T_i that can be aligned with the reference tree T_b are sought, comparing weights first and tags second and proceeding in descending order of weight. If, for a node T_i[j], the reference tree T_b contains a node T_b[k] whose weight is greater than or equal to the threshold K and whose node tag matches, node T_i[j] is considered alignable. If no alignable node exists, an insertion operation inserts node T_i[j] into the reference tree T_b, thereby adjusting T_b. The adjusted reference tree T_b is then used to align the other tag trees in the data record region array, and the final reference tree T_b is produced at the end;

The data storage comprises the following step:

15) the tag tree set is matched against the template tree to obtain the information, and the results are saved in database form.

Further, step 11) divides HTML tags into three classes according to their function: the first class is tags that plan the page layout, which delimit the content information regions; the second class is tags that describe display properties, which specify how content is displayed; the third class is hyperlink-related tags.
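The three-way classification can be sketched as a simple lookup. This is a minimal illustration, not the patented implementation: the tag sets contain only the example tags listed in the detailed description, and the names `TagClass`, `classify_tag` and `is_irrelevant`, as well as the `OTHER` fallback, are assumptions introduced here.

```python
from enum import Enum


class TagClass(Enum):
    LAYOUT = "layout"        # plans page layout, delimits content regions
    DISPLAY = "display"      # describes how content is displayed
    HYPERLINK = "hyperlink"  # hyperlink-related
    OTHER = "other"          # assumption: fallback for tags the text does not list


# Example tags only, taken from the description; the sets are not exhaustive.
LAYOUT_TAGS = {"div", "p", "td", "tr", "table"}
DISPLAY_TAGS = {"b", "i", "strong", "h1", "h2"}
HYPERLINK_TAGS = {"a", "base"}


def classify_tag(tag: str) -> TagClass:
    """Map a tag such as '<div>' or 'A' to one of the three classes."""
    name = tag.strip("<>/ ").lower()
    if name in LAYOUT_TAGS:
        return TagClass.LAYOUT
    if name in DISPLAY_TAGS:
        return TagClass.DISPLAY
    if name in HYPERLINK_TAGS:
        return TagClass.HYPERLINK
    return TagClass.OTHER


def is_irrelevant(tag: str) -> bool:
    """Irrelevant tags are the display-property and hyperlink classes."""
    return classify_tag(tag) in (TagClass.DISPLAY, TagClass.HYPERLINK)
```

A helper like `is_irrelevant` would be the natural place to decide when λ is set to 0 in the weight formula.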

Further, before step 12) assigns a weight to each tree node, the webpage is denoised. The denoising step prunes the tag tree and covers: leaf nodes whose tag is an irrelevant tag; adjacent text or picture nodes whose parent node tags are irrelevant tags; and text or picture nodes without siblings whose parent node tag is an irrelevant tag.

Further, in step 13) data record recognition needs to compare tag trees to judge their degree of similarity. The comparison method is as follows: tag tree T1 is deemed similar to tag tree T2 if the subtree set of T1 and the subtree set of T2 have an intersection of equal weights that contains a subtree whose weight is greater than the threshold K, and the equal-weight subtrees preserve order, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j ≤ t if and only if i ≤ k.

Further, the insertion position for the insertion operation of step 14) is determined as follows. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent in tag tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the reference tree T_b, then the sequence can be uniquely inserted between those two adjacent sibling nodes under T_b. If the sequence has only a left adjacent sibling node k under its common parent in T_i, and node k is aligned with the rightmost node under T_b, then the sequence can be uniquely inserted at the rightmost position under T_b, to the right of node k. If the sequence has only a right adjacent sibling node k under its common parent in T_i, and node k is aligned with the leftmost node under T_b, then the sequence can be uniquely inserted at the leftmost position under T_b, to the left of node k. If the position under T_b of a non-aligned node k of tag tree T_i cannot be uniquely determined, no insertion is performed and T_i is instead placed in a temporary data record array.

The beneficial effect of the present invention is as follows. Exploiting the tree structure of HTML webpage content, each crawled webpage is processed and converted into a tag tree, and a weight is assigned to each tree node from the bottom up so that nodes at different levels carry different weights. Data record regions are then identified from sets of similar subtrees and their positional continuity, and the tag trees containing data records are aligned to generate a reference tree that serves as the extraction template, achieving high efficiency and high accuracy.

Description of drawings

Fig. 1 is the architecture diagram of the present invention;

Fig. 2 is the tag tree corresponding to an HTML page;

Fig. 3 is the data record recognition flow based on the weighted tag tree;

Fig. 4 shows two trees judged to be similar;

Fig. 5 shows the two trees of insertion case 1;

Fig. 6 shows the two trees of insertion case 2;

Fig. 7 shows the two trees of insertion case 3;

Fig. 8 is the basic flow of data record alignment.

Embodiment

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

The present invention proposes a method for recognizing and extracting webpage data records based on tree weights, which mainly comprises four parts: webpage processing and conversion, data record recognition, data record alignment and extraction, and data storage. The architecture of the invention is shown in Fig. 1.

Webpage processing and conversion mainly converts webpages crawled from different data sources into a form that a computer can process easily. The usual approach converts a webpage into a tree of tags; the present invention additionally introduces a webpage processing and conversion method based on the weighted tag tree to improve recognition efficiency.

For a preprocessed webpage, the page content is built into a tag tree; Fig. 2 shows the tag tree of an HTML page. According to their function, HTML tags can be divided into three classes:

1. Tags that plan the page layout, which delimit the content information regions, such as <div>, <p>, <td>, <tr> and <table>.

2. Tags that describe display properties, which specify how content is displayed, such as <b>, <i>, <strong>, <h1> and <h2>.

3. Hyperlink-related tags, such as <a> and <base>.

The webpage processing and conversion method based on the weighted tag tree defines a weight for each tree node:

$$W = \lambda^{depth} + \sum_{i=1}^{n} SubW_{i}$$

where SubW_i is the weight of the i-th subtree node, λ is the weight adjustment parameter and depth is the depth of the tree. A node whose weight is 0 is regarded as an irrelevant node. If the tree node is a leaf node, its weight is W = 1 when it is a text node or picture node and W = 0 for any other type. In addition, for a tree whose node carries an irrelevant tag such as <a>, <b>, <i>, <strong>, <h1> or <h2>, the corresponding λ is 0.
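As a small worked instance of the formula, consider a <div> with two <p> children that each hold one text leaf. The values below rest on assumptions not fixed by the text: λ = 2, a leaf depth of 0, and a parent depth equal to the largest child depth plus 1.

```latex
\begin{aligned}
W_{\text{text}} &= 1 \\
W_{p}   &= \lambda^{1} + W_{\text{text}} = 2 + 1 = 3 \\
W_{div} &= \lambda^{2} + W_{p} + W_{p} = 4 + 3 + 3 = 10
\end{aligned}
```

If one of the <p> elements were instead an irrelevant tag such as <b>, its λ would be 0 and its weight would reduce to the weight of its text leaf, 1, so the node would add no weight of its own.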

The overall framework of the tree node weight assignment method is recursive, as dictated by the structure of the tag tree. Each node is assigned its weight recursively; the assignment flow for a node comprises accumulating the weights of its subtree nodes, computing the current depth, and handling leaf nodes and irrelevant tags. Webpage noise can be handled before assignment. Such noise includes advertisement links, navigation links, copyright information and the like. To remove its influence, the tag tree is pruned before weights are assigned, that is, irrelevant-tag leaf nodes and irrelevant-tag tree nodes are operated on. The strategy covers the following cases:

1. a leaf node whose tag is an irrelevant tag;

2. adjacent text or picture nodes whose parent node tags are all irrelevant tags;

3. a text or picture node without siblings whose parent node tag is an irrelevant tag.

Data record recognition takes the tag tree set produced by processing and conversion and applies the data record recognition method based on the weighted tag tree to identify data record regions in the weighted tag trees. The recognition flow is shown in Fig. 3. After the tag tree set is fed to the data record recognition module, for each input tag tree the module first accesses the template tree library and searches it by comparing against the weights of the essential subtree set of each template tree. If the input tree contains the weights of the essential subtree set, data record recognition can be performed on it using that template tree. If no corresponding template tree is found in the library, adaptive data record recognition is performed: the tree's own subtrees are compared with one another to identify the data record regions.

Recognizing data records requires comparing tag trees to judge their degree of similarity. The comparison methods in common use all compare the node tags of the tree structure and generally disregard text nodes and the content they contain.

The comparison method used by the present invention is a weight-based tree comparison. Tag tree T1 and tag tree T2 are defined as similar if the subtree set of T1 and the subtree set of T2 have an intersection of equal weights that contains a subtree whose weight is greater than the threshold K, and the equal-weight subtrees preserve order: when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j ≤ t if and only if i ≤ k.

For example, the tag trees shown in Fig. 4 are similar, yet the tree comparison methods of the prior art may judge them dissimilar because of thresholds and similar factors. Tree T1 and tree T2 both contain a subtree a with weight 11121 and a subtree c with weight 1421, and these subtrees appear in the same order, so under the criterion used here the two trees are similar. Other methods might judge the two trees unequal because subtree d exists only under tree T2 and because subtree b under tree T1 differs from subtree k under tree T2.
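The similarity criterion can be sketched as follows, under stated assumptions: each tree is reduced to the list of its subtree weights in document order, weights are distinct within one tree, and the function name `similar`, the threshold value and the weights given to subtrees b, k and d are hypothetical. Only the weights 11121 and 1421 come from the Fig. 4 example.

```python
from typing import List


def similar(w1: List[int], w2: List[int], k: int) -> bool:
    """Weight-based similarity of two trees given their subtree weights in order.

    Assumes the weights within each list are distinct.
    """
    common = set(w1) & set(w2)
    # The equal-weight intersection must contain a subtree heavier than K.
    if not any(w > k for w in common):
        return False
    # Equal-weight subtrees must appear in the same relative order in both trees.
    seq1 = [w for w in w1 if w in common]
    seq2 = [w for w in w2 if w in common]
    return seq1 == seq2


# T1 has subtrees a, b, c; T2 has subtrees a, k, c, d (b, k, d weights are made up).
t1 = [11121, 37, 1421]
t2 = [11121, 52, 1421, 8]
print(similar(t1, t2, k=100))
```

Because only the shared weights are compared, the extra subtree d and the mismatch between b and k do not affect the outcome, which is the point the Fig. 4 example makes.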

In the weight-based data record recognition method, as each tag tree is traversed, the weight of every node visited is added to a set without duplicates. The template library is then traversed and this weight set is checked against the weights of the essential subtree set of each template tree; if a match exists, that template tree is used to perform data record recognition recursively. If no corresponding template tree is found in the template tree library, the precondition for template-based recognition is not met, and the adaptive data record recognition flow begins: data record regions are identified by judging the similar subtrees within the tree's own contiguous regions, yielding the corresponding data records.

Data record alignment and extraction relies on a weight-based tag tree alignment method to improve alignment efficiency. The basic flow of data record alignment is shown in Fig. 8. The objects of alignment and extraction are the arrays containing data record regions that are returned after the tag tree set passes through data record recognition. Each data record in the array is aligned to a reference tree, which is generated by the tree alignment method; once every node of every data record in the array has been aligned with the reference tree, the data of interest to the user is extracted.

The basic idea of the weight-based tree alignment method is first to select the tree with the largest weight in the data record array as the reference tree T_b. The largest-weight tree is chosen because it has the deepest or widest tree structure, which makes it easier for the other trees to align with it. Then, for every tree T_i in the record array, the invention tries to find all nodes under T_i that can be aligned with the reference tree T_b, comparing weights first and tags second and proceeding in descending order of weight. A node T_i[j] is considered alignable if the reference tree T_b contains a node T_b[k] such that either the weights are greater than the threshold K and equal, or the weights are less than K and equal and the node tags match. If no alignable node exists, an insertion operation inserts node T_i[j] into the reference tree T_b, thereby adjusting T_b. The adjusted reference tree T_b is then used to align the trees of the other data records in the data record array.

The tree alignment method has to align many tag trees, and it does so by comparing two tag trees at a time. The comparison of two trees proceeds as follows:

After the reference tree T_b (or the template tree T_t; the same applies below) has been aligned with a tree T_i, some nodes in T_i correspond to nodes in T_b, and the weights of these nodes are equal. Nodes that are not aligned, that is, nodes whose weights are unequal, are inserted into the reference tree T_b to adjust it, because such nodes may contain optional data items. Several situations can arise when node T_i[j] is inserted into T_b, all depending on whether T_i[j] can be inserted at some position under T_b without ambiguity. In practice, when inserting a non-aligned node T_i[j], the whole run of adjacent non-aligned nodes T_i[j] ... T_i[m] can be handled in place of the single node T_i[j] to improve efficiency. Without loss of generality, suppose the parent node of T_i[j] ... T_i[m] has an aligned node under T_b; the nodes T_i[j] ... T_i[m] are now to be inserted under that corresponding aligned parent in T_b. These nodes can be inserted without regard to order, but the insertion position must be determined first. The insertion position of the node sequence T_i[j] ... T_i[m] can be uniquely determined in the following cases:

1. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent in tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the reference tree T_b, then the sequence can be uniquely inserted between those two adjacent sibling nodes under T_b. Fig. 5 shows simplified diagrams of tag tree T_i and reference tree T_b: the consecutive adjacent nodes b and c under T_i can be inserted between nodes a and d under T_b, because nodes a and d under T_i have corresponding aligned nodes under T_b. The reference tree T_b after insertion is shown in the lower half of Fig. 5. The nodes a, b, c, d and e in the figure need not be simple text nodes; each may also be a tree.

2. As shown in Fig. 6, if the node sequence T_i[j] ... T_i[m] has only a left adjacent sibling node k under its common parent in tag tree T_i, and node k is aligned with the rightmost node under the reference tree T_b, then the sequence can be uniquely inserted at the rightmost position under T_b, to the right of node k.

3. As shown in Fig. 7, if the node sequence T_i[j] ... T_i[m] has only a right adjacent sibling node k under its common parent in tag tree T_i, and node k is aligned with the leftmost node under the reference tree T_b, then the sequence can be uniquely inserted at the leftmost position under T_b, to the left of node k.

Otherwise, if the position under T_b of a non-aligned node k of tag tree T_i cannot be uniquely determined, the node is not inserted and tag tree T_i is instead placed in a temporary data record array. Fig. 7 illustrates this situation: node k could be placed either between node a and node b or between node b and node c.
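The three unambiguous cases and the ambiguous fallback can be sketched on a flat list of sibling labels. This is an illustrative simplification: siblings under the aligned parent in T_b are represented as strings, "aligned" is taken to mean "same label", and the names `insertion_index` and `insert_run` are hypothetical.

```python
from typing import List, Optional


def insertion_index(ref_children: List[str],
                    left: Optional[str],
                    right: Optional[str]) -> Optional[int]:
    """Return where a run of unaligned nodes goes under T_b, or None if ambiguous.

    `left` and `right` are the run's adjacent siblings in T_i (None if absent).
    """
    li = ref_children.index(left) if left in ref_children else None
    ri = ref_children.index(right) if right in ref_children else None
    # Case 1: both neighbours are aligned and adjacent in the reference tree.
    if li is not None and ri is not None:
        return ri if ri == li + 1 else None
    # Case 2: only a left neighbour, aligned with the rightmost reference node.
    if right is None and li is not None:
        return len(ref_children) if li == len(ref_children) - 1 else None
    # Case 3: only a right neighbour, aligned with the leftmost reference node.
    if left is None and ri is not None:
        return 0 if ri == 0 else None
    return None


def insert_run(ref_children: List[str], run: List[str],
               left: Optional[str], right: Optional[str]) -> Optional[List[str]]:
    """Return the adjusted sibling list, or None when the tree must be deferred."""
    idx = insertion_index(ref_children, left, right)
    if idx is None:
        return None
    return ref_children[:idx] + run + ref_children[idx:]


# Fig. 5 style example: b and c go between a and d.
print(insert_run(["a", "d", "e"], ["b", "c"], left="a", right="d"))
```

A `None` result corresponds to placing the tag tree in the temporary data record array so that it can be retried after the reference tree has been adjusted by other trees.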

Data storage matches the tag tree set against the template tree to obtain the relevant information and saves the results in the corresponding database as individual database records, for use by subsequent services such as queries.

The present invention is further explained below in the form of code.

1) Webpage processing and conversion process

The overall framework of the tree node weight assignment method is recursive, as dictated by the structure of the tag tree. The method assigns a weight to each node recursively; the flow for each node comprises accumulating subtree node weights, computing the current depth, and handling leaf nodes and irrelevant tags.

The tree node weight assignment flow is recursive; the specific method is as follows:

In this method, line 1 first checks whether the current tree is a non-leaf node. If it is, line 2 initializes the tree depth and weight, and lines 3 to 7 traverse all child nodes of the current tree and accumulate the subtree node weights; the weight of each subtree is itself obtained by a recursive call, and the depth of the current tree is set to the maximum depth among all subtrees plus 1. Line 10 checks whether the current tree is a leaf node and a text node, and lines 11 to 12 assign the depth and weight of the current text leaf node. Line 14 checks whether the current tree is a leaf node and a non-text node, and lines 15 to 16 assign the depth and weight of the current non-text leaf node. Line 18 checks whether the accumulated subtree node weight is 0; if so, the weight of the current tree is 0. Line 20 checks whether the tag of the current tree is in the useless-tag array; if so, the weight is left unchanged (line 21); otherwise the weight of the current tree is the accumulated subtree weight plus the weight adjustment parameter raised to the power depth (line 22). When the flow ends, every tree node has been assigned a weight W.
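The flow described above can be sketched in Python. This is a reconstruction under assumptions, not the original listing: the `Node` class, the `#text` and `img` tag names for text and picture leaves, the leaf depth of 0 and the default λ = 2 are all introduced here for illustration, and the line numbers in the description do not map onto this sketch.

```python
from dataclasses import dataclass, field
from typing import List

# Irrelevant tags named in the description; for these, lambda is taken as 0.
IRRELEVANT_TAGS = {"a", "b", "i", "strong", "h1", "h2"}
# Hypothetical tag names standing for text and picture leaf nodes.
CONTENT_LEAF_TAGS = {"#text", "img"}


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    depth: int = 0
    weight: float = 0.0


def assign_weights(node: Node, lam: float = 2.0) -> float:
    """Assign depth and weight to every node, bottom-up, and return node.weight."""
    if not node.children:
        # Leaf: text or picture nodes weigh 1, every other leaf weighs 0.
        node.depth = 0  # assumption: the source does not state the leaf depth
        node.weight = 1.0 if node.tag in CONTENT_LEAF_TAGS else 0.0
        return node.weight

    # Non-leaf: accumulate subtree weights recursively.
    sub_total = 0.0
    for child in node.children:
        sub_total += assign_weights(child, lam)
    node.depth = max(child.depth for child in node.children) + 1

    if sub_total == 0:
        node.weight = 0.0                       # nothing useful below this node
    elif node.tag in IRRELEVANT_TAGS:
        node.weight = sub_total                 # lambda = 0: weight left unchanged
    else:
        node.weight = sub_total + lam ** node.depth
    return node.weight


# <div><p>text</p><p><b>text</b></p></div>
page = Node("div", [
    Node("p", [Node("#text")]),
    Node("p", [Node("b", [Node("#text")])]),
])
assign_weights(page)
print(page.weight, [child.weight for child in page.children])
```

Storing the depth on each node lets the parent take the maximum child depth plus 1, as the description requires, without a second traversal.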

2) Data record recognition process

First, as each tag tree is traversed, the weight of every node visited is added to a set without duplicates. The template library is then traversed and this weight set is checked against the weights of the essential subtree set of each template tree; if a match exists, that template tree is used for data record recognition.

The recognition flow is recursive; the specific method is as follows:

Whether a subtree is a data record region is determined by a weight-based comparison of the subtree with the selected template tree. In the Identify code, line 1 checks whether the current tree has subtree nodes; line 2 traverses all subtree nodes of the current tree; line 4 calls the Compare function to judge whether the current subtree contains the selected template tree or the essential subtrees of the selected template tree. If the result is true, the subtree is added to the data record array arrayT (line 5); otherwise the subtree's nodes are visited by a recursive call (line 6). The recognition function calls the Compare function directly to judge whether the current subtree is a data record region. The conditions are as follows:

1) if the weights are equal, return true;

2) if the weights are unequal but the input tree contains all the subtree-set weights of the template tree, return true;

3) in all other cases, return false.
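The Identify and Compare steps can be sketched as below. This is a hedged reconstruction rather than the original listing: the `Node` class, the representation of a template as a root weight plus a set of essential subtree weights, and the example weights are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    tag: str
    weight: int                       # assumed to be assigned beforehand
    children: List["Node"] = field(default_factory=list)


def subtree_weights(node: Node) -> Set[int]:
    """Collect the weights of all nodes in the tree, without duplicates."""
    weights = {node.weight}
    for child in node.children:
        weights |= subtree_weights(child)
    return weights


def compare(tree: Node, template_weight: int, essential: Set[int]) -> bool:
    if tree.weight == template_weight:          # condition 1: equal weights
        return True
    if essential and essential <= subtree_weights(tree):
        return True                             # condition 2: all essential weights present
    return False                                # condition 3: everything else


def identify(tree: Node, template_weight: int, essential: Set[int],
             array_t: List[Node]) -> List[Node]:
    """Append every subtree recognized as a data record to array_t."""
    for child in tree.children:
        if compare(child, template_weight, essential):
            array_t.append(child)
        else:
            identify(child, template_weight, essential, array_t)
    return array_t


# Hypothetical region with three items; the template has weight 7 and essential weights {3, 1}.
region = Node("ul", 20, [
    Node("li", 7, [Node("p", 3, [Node("#text", 1)])]),
    Node("li", 9, [Node("p", 3, [Node("#text", 1)]), Node("span", 2, [Node("#text", 1)])]),
    Node("li", 4, [Node("span", 2, [Node("#text", 1)])]),
])
records = identify(region, 7, {3, 1}, [])
print([r.weight for r in records])
```

Note that under condition 2 any ancestor that contains a matching record also contains the essential weights, so in this sketch `identify` is applied to the node whose children are the candidate records.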

If no corresponding template tree is found in the template tree library, the precondition for template-based recognition is not met, and the adaptive data record recognition flow begins: data record regions are identified by judging the similar subtrees within the tree's own contiguous regions, yielding the corresponding data records.

3) Weight-based tree alignment process

To align two trees, the subtrees of the two tag trees are first examined in descending order of weight. A subtree with an equal-weight counterpart needs no adjustment. Subtrees with unequal weights undergo the alignment judgment: the run of consecutive non-aligned nodes in the corresponding tag tree is found and inserted into the reference tree T_b, adjusting T_b. Processing continues until no unaligned consecutive nodes remain or T_b can no longer be adjusted. If T_b cannot be adjusted, there are nodes whose insertion position cannot be uniquely determined, and the alignment of this tree is not completed. If the alignment is not completed, the tag tree is added to the temporary record array.

To align many trees, the following strategy is adopted: trees are aligned in order of tree node weight, with larger weights aligned first. Each alignment thus has the greatest chance of success and the greatest adjustment effect, which reduces the number of subsequent alignments and improves alignment efficiency. The specific method is as follows:

The method sorts the result record array arrayT produced by the data record recognition module in descending order (line 2) and takes the tree with the largest weight as the reference tree T_b (line 3). It creates a record array temparrayT to hold unaligned trees (line 5). A loop tests the length of the record array arrayT (line 6); the temporary record array temparrayT is cleared (line 7); the record array arrayT is traversed (line 8); the tree T with the current largest weight is taken from arrayT (line 9), and the function TreeAlign is called to align it with tree T_b (line 10). If alignment fails, tree T is added to the temporary record array temparrayT (line 11). Finally tree T is removed from arrayT (line 12). After the traversal of arrayT finishes, a clone of temparrayT is assigned to arrayT and the next loop test begins (line 14).
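The multi-tree loop can be sketched as follows. It is an illustration under assumptions: `align_all` is a hypothetical name, `tree_align` is passed in as a callable returning a success flag and the adjusted reference tree, and the stop condition when a full pass aligns nothing is added here so the loop always terminates (the description does not spell out that case). The `toy_align` function is only a stand-in so the example runs; it is not the TreeAlign of the method.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def align_all(array_t: List[T],
              weight: Callable[[T], float],
              tree_align: Callable[[T, T], Tuple[bool, T]]) -> Tuple[T, List[T]]:
    """Align trees against a reference tree, heaviest first.

    Returns the final reference tree and the trees that never aligned.
    """
    if not array_t:
        raise ValueError("array_t must contain at least one tree")
    pending = sorted(array_t, key=weight, reverse=True)
    t_b = pending.pop(0)                  # heaviest tree becomes the reference
    while pending:
        temp_array_t: List[T] = []        # trees that fail in this pass
        progressed = False
        for tree in pending:
            ok, t_b = tree_align(t_b, tree)
            if ok:
                progressed = True
            else:
                temp_array_t.append(tree)
        pending = temp_array_t            # retry failures against the adjusted reference
        if not progressed:
            break                         # assumption: stop when a pass aligns nothing
    return t_b, pending


def toy_align(ref: List[str], tree: List[str]) -> Tuple[bool, List[str]]:
    """Stand-in aligner: succeeds if the tree's first label is already in ref."""
    if tree[0] not in ref:
        return False, ref
    return True, ref + [label for label in tree if label not in ref]


trees = [["a", "b", "c", "d"], ["e", "f"], ["d", "e"], ["x", "y"]]
reference, unaligned = align_all(trees, weight=len, tree_align=toy_align)
print(reference, unaligned)
```

In the example, the tree `["e", "f"]` fails in the first pass but succeeds in the second, after `["d", "e"]` has extended the reference tree, which is the reason failed trees are kept in the temporary array and retried.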

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the concept of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for identifying and extracting web data records based on tree weights, characterized in that it comprises the following steps:

(1) web page processing and conversion;

(2) data record identification;

(3) data record alignment and extraction;

(4) data storage;

Said web page processing and conversion comprises the following steps:

W = λ^{depth} + Σ_{i = 1}^{n} Sub W_{i}
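For illustration only, the weight formula can be evaluated bottom-up as in the Python sketch below. The symbols are not defined at this point in the text, so the sketch assumes that λ is a constant base, that depth is the depth of the node with the root at depth 0, and that SubW_i is the weight of the i-th of the node's n child subtrees. The names Node and assign_weights and the default value of λ are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node (hypothetical minimal representation)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    weight: float = 0.0


def assign_weights(node: Node, lam: float = 2.0, depth: int = 0) -> float:
    """Set node.weight = lam**depth + sum of child subtree weights.

    Children are processed first, so weights are assigned bottom-up.
    Assumption: depth is counted from the root, which has depth 0.
    """
    child_sum = sum(assign_weights(c, lam, depth + 1) for c in node.children)
    node.weight = lam ** depth + child_sum
    return node.weight
```

With these assumptions, two subtrees with the same shape at the same depth receive the same weight, which is what the later similarity comparison relies on.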

Said data record identification comprises the following steps:

Said data record alignment and extraction comprises the following steps:

Said data storage comprises the following steps:

2. The method for identifying and extracting web data records based on tree weights according to claim 1, characterized in that said step 11) divides HTML tags into three classes according to their function: the first class is layout tags, which plan the page layout and provide the regions that hold content information; the second class is display-property tags, which describe how their content is displayed; the third class is hyperlink-related tags.

3. The method for identifying and extracting web data records based on tree weights according to claim 1, characterized in that, before step 12) is used to assign a weight to each tree node, the web page is denoised; said denoising step prunes the tag tree and comprises: treating a leaf-node tag as an irrelevant tag; treating the parent-node tag of adjacent text or picture nodes as an irrelevant tag; and treating the parent-node tag of a text or picture node that has no siblings as an irrelevant tag.

4. The method for identifying and extracting web data records based on tree weights according to claim 1, characterized in that said step 13) compares tag trees to judge their degree of similarity for data record identification, and the comparison method adopted is: if the subtree set of tag tree T1 and the subtree set of tag tree T2 have an intersection of subtrees with equal weights, the subtrees so matched exceed a threshold K, and the equal-weight subtrees preserve their order, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k, then tag tree T1 is deemed similar to tag tree T2.
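One possible reading of this comparison can be sketched in Python, for illustration only. The sketch assumes that each tag tree is summarized by the weights of its subtrees in sibling order, that "exceed a threshold K" means the number of matched subtrees is greater than K, and that the order condition is enforced by taking a longest common subsequence of the two weight lists. The function names are hypothetical, and the longest-common-subsequence formulation is a substitute technique chosen for the sketch, not one named in the text.

```python
from typing import Sequence


def ordered_match_count(w1: Sequence[float], w2: Sequence[float]) -> int:
    """Largest number of equal-weight pairs that keep the same order.

    Computed as the length of the longest common subsequence of the two
    weight lists, so if (i, j) and (k, t) are both matched then
    i <= k exactly when j <= t.
    """
    rows, cols = len(w1), len(w2)
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if w1[i - 1] == w2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[rows][cols]


def is_similar(w1: Sequence[float], w2: Sequence[float], k: int) -> bool:
    """Assumed reading: similar when more than k subtrees match in order."""
    return ordered_match_count(w1, w2) > k
```

Exact equality of floating-point weights is used here for brevity; an implementation with non-integer λ might compare with a tolerance instead.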

5. The method for identifying and extracting web data records based on tree weights according to claim 1, characterized in that the insertion position of the insertion operation of said step 14) is determined by the following steps: if the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under the common parent node in tag tree T_i, one immediately to its left and one immediately to its right, and both sibling nodes have corresponding aligned nodes under the reference tree T_b, then the node sequence T_i[j] ... T_i[m] can be uniquely inserted between these two adjacent sibling nodes under the reference tree T_b; if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its left, under the common parent node in tag tree T_i, and node k is aligned with the rightmost node under the reference tree T_b, then the node sequence T_i[j] ... T_i[m] can be uniquely inserted immediately to the right of node k under the reference tree T_b; if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its right, under the common parent node in tag tree T_i, and node k is aligned with the leftmost node under the reference tree T_b, then the node sequence T_i[j] ... T_i[m] can be uniquely inserted immediately to the left of node k under the reference tree T_b; if the position under the reference tree T_b of an unaligned node k of tag tree T_i cannot be uniquely determined, no insertion is performed, and tag tree T_i is instead placed in the temporary data record array.
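The three cases for a unique insertion position can be sketched in Python, for illustration only. The sketch assumes that the children of the common parent in T_i and in T_b are addressed by index, that the existing alignment is given as a mapping from T_i child index to T_b child index, and that in the two-sibling case the two aligned nodes must be adjacent in T_b for the position to be unique. The function name and parameters are hypothetical.

```python
from typing import Dict, Optional


def insertion_index(
    run_start: int,
    run_end: int,
    ti_len: int,
    tb_len: int,
    aligned: Dict[int, int],
) -> Optional[int]:
    """Index in T_b's child list at which to insert the unaligned run.

    The run is T_i children run_start..run_end (inclusive). Returns None
    when the position is not unique, in which case the caller would put
    T_i into the temporary record array instead of inserting.
    """
    left = run_start - 1 if run_start > 0 else None
    right = run_end + 1 if run_end + 1 < ti_len else None
    left_b = aligned.get(left) if left is not None else None
    right_b = aligned.get(right) if right is not None else None

    if left is not None and right is not None:
        # Case 1: both neighbours are aligned and adjacent in T_b.
        if left_b is not None and right_b is not None and right_b == left_b + 1:
            return right_b
        return None
    if left is not None:
        # Case 2: only a left neighbour, aligned with T_b's rightmost child.
        return tb_len if left_b == tb_len - 1 else None
    if right is not None:
        # Case 3: only a right neighbour, aligned with T_b's leftmost child.
        return 0 if right_b == 0 else None
    return None  # the run has no sibling at all: position is not determined
```

The returned value is suitable for Python's list.insert on the T_b child list, applied once per node of the run in reverse order or by slice assignment.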