CN102591931A - Recognition and extraction method for webpage data records based on tree weight - Google Patents

Recognition and extraction method for webpage data records based on tree weight

Info

Publication number
CN102591931A
Authority
CN
China
Prior art keywords
tree
node
weights
label
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201110438187XA
Other languages
Chinese (zh)
Other versions
CN102591931B (en)
Inventor
尹建伟
彭勇
杨弈锦
邓水光
李莹
吴健
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201110438187.XA
Publication of CN102591931A
Application granted
Publication of CN102591931B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method for recognizing and extracting web page data records based on tree weights. The method comprises the following steps: processing and converting a web page; recognizing data records; aligning and extracting the data records; and storing the data. Following the tree structure of the content of an HTML (HyperText Markup Language) web page, the crawled page is processed and converted into a tag tree, and a weight is assigned to each tree node from the bottom up so that nodes at different levels have different weights. Data record regions are then identified from sets of similar subtrees and their positional continuity, and the set of tag trees containing the data records is aligned to generate a reference tree that serves as the extraction template, achieving high efficiency and high accuracy.

Description

Method for recognizing and extracting web page data records based on tree weights
Technical field
The present invention relates to the field of information extraction, and in particular to a method for recognizing and extracting web page data records based on tree weights.
Background art
As the Internet develops, the amount of information on it grows exponentially over time, and web pages contain more and more resources and information of interest to people. As the volume of information increases, however, finding information becomes difficult. Because information is organized arbitrarily, users can only rely on full-text search to find what they need, yet the pages that contain the needed information are also flooded with irrelevant material such as advertisements and links, so the useful information cannot be obtained quickly and intuitively. Obtaining information manually is also inefficient. The information a user needs often has to be collected from several different sources; because the websites that contain it differ in structure, each website must be queried and analyzed by hand, and the results must finally be organized into the required form and saved to a database for subsequent services. This process is tedious and inefficient. How to extract the content a user is interested in from massive numbers of HTML documents accurately and efficiently is therefore increasingly important, and it is against this background that the method for recognizing and extracting web page data records based on tree weights is proposed.
Web page information extraction is the application of information extraction technology to the Internet. It extracts specific information from the massive structured, semi-structured, or free-form HTML text distributed across the Internet and converts it into a unified structured representation. Compared with conventional information extraction, web page information extraction is characterized by massive data, structural differences, dynamic change, unstructured data, and a lack of semantic information.
The program responsible for extracting information from web pages is called a wrapper. A wrapper extracts information from semi-structured HTML text, converts it into unified structured information, and stores it. It automates information extraction and is particularly useful for massive amounts of web information. By the principle used to generate them, wrappers can be divided into machine-based methods, natural language understanding methods, ontology-based methods, HTML structure methods, and so on. Methods based on HTML structural features are currently the most studied and the best developed technique in web page information extraction. They make full use of the structural features of HTML text for data extraction. Before extraction, the HTML text is first converted into a corresponding tag tree; extraction rules are then generated automatically or semi-automatically and applied to the tag tree to extract the data. Research on this approach still faces some problems, as follows:
1. Most current research uses the tag tree of the entire HTML text of a page as the template, which is clearly inefficient. In practice many text nodes may also contain HTML tags: much of the text in a text node carries decorative tags such as color or font, some text carries hyperlinks, and the content may also include pictures, tables, and so on. All of these are irrelevant information.
2. Some text nodes appear repeatedly across the set of web pages. Such nodes should not be extracted as key content but should be treated as part of the template.
3. For the tree comparison methods described in existing extraction methods, the target page set is generally small; on large-scale page sets their efficiency can be very low.
4. Comparing page structures usually requires two or more pages; with only a single page, no template can be extracted.
These aspects are overlooked because of two characteristics of web pages: their complexity and their sheer number. On such a complex and massive page set, extraction must achieve both high precision and high recall while also running quickly, which is essential in practice. A highly accurate and efficient method for extracting web page content is therefore urgently needed.
Summary of the invention
The present invention mainly addresses the low accuracy and efficiency of existing web data recognition and extraction methods, which cannot effectively extract the needed information from large numbers of web pages. It proposes a method for recognizing and extracting web page data records based on tree weights. Following the tree structure of HTML page content, the crawled page is processed and converted into a tag tree; a weight is assigned to each tree node from the bottom up, so that nodes at different levels have different weights; data record regions are then identified from sets of similar subtrees and their positional continuity; finally, a tree alignment operation over the set of tag trees containing data records generates a reference tree that serves as the extraction template. The method achieves high efficiency and high accuracy.
To solve the above technical problem, the technical solution of the present invention is as follows:
1. A method for recognizing and extracting web page data records based on tree weights, comprising the following steps:
(1) web page processing and conversion;
(2) data record recognition;
(3) data record alignment and extraction;
(4) data storage;
The web page processing and conversion comprises the following steps:
11) classifying the tags of the crawled web page according to their function and then constructing a tag tree;
12) assigning a weight to each tree node of the tag tree according to the following formula:
$$W = \lambda^{\mathit{depth}} + \sum_{i=1}^{n} \mathit{SubW}_i$$
where SubW_i is the weight of the i-th subtree node, λ is the weight adjustment parameter, and depth is the depth of the tree. If the weight of a node is 0, the node is regarded as an irrelevant node. If a tree node is an irrelevant-tag node (irrelevant tags include hyperlink tags and tags that describe display properties), then λ = 0 for the corresponding tree. If a tree node is a leaf node, its weight is W = 1 when it is a text node or a picture node and W = 0 for any other type;
The data record recognition comprises the following step:
13) the weighted tag tree from step 12) is fed as the input tree to the data record module. The module first accesses the template tree library and searches it by comparing against the essential subtree set weights of each template tree in the library. If the input tree contains the essential subtree set weights, that template tree can be used to recognize data records in the input tree and obtain the corresponding data record regions. If no corresponding template tree is found in the library, adaptive data record recognition is performed: data record regions are identified by judging similar subtrees in the tree's own contiguous regions, yielding the corresponding data record regions;
The data record alignment and extraction comprises the following step:
14) the tag tree with the largest weight in the data record region array obtained in step 13) is taken as the reference tree T_b. For each remaining tag tree T_i in the array, nodes are considered by weight first and tag second, in descending order of weight, to find all nodes under T_i that can be aligned with the reference tree T_b. If, for a node T_i[j], the reference tree T_b contains a node T_b[k] with the same tag whose weight is greater than or equal to the threshold K, node T_i[j] is considered alignable. If no alignable node exists, an insert operation inserts node T_i[j] into the reference tree T_b, thereby adjusting T_b. The adjusted reference tree T_b is used to align the other tag trees in the data record region array, finally producing the final reference tree T_b;
The data storage comprises the following step:
15) matching the tag tree set against the template tree to obtain the information, and saving the result in the form of database records.
Further, step 11) divides HTML tags into three classes by function: first, tags that plan the page layout, which provide the content information regions; second, tags that describe display properties, which determine how content is displayed; third, hyperlink-related tags.
Further, before step 12) assigns weights to the tree nodes, the web page is denoised. The denoising step prunes the tag tree and covers the following cases: a leaf node whose tag is an irrelevant tag; adjacent text or picture nodes whose parent tag is an irrelevant tag; and a text or picture node without siblings whose parent tag is an irrelevant tag.
Further, data record recognition in step 13) must compare tag trees to judge their degree of similarity. The comparison method is as follows: if the subtree set of tag tree T1 and the subtree set of tag tree T2 have an intersection of equal weights that includes a subtree with weight greater than the threshold K, and the equal-weight subtrees preserve order, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k, then tag tree T1 is deemed similar to tag tree T2.
Further, the insertion position for the insert operation in step 14) is determined as follows. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent in tag tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the reference tree T_b, then the sequence can be uniquely inserted between those two adjacent sibling nodes under T_b. If the sequence has only one adjacent sibling node k, on its left, under its common parent in T_i, and node k is aligned with the rightmost node under T_b, then the sequence can be uniquely inserted to the right of node k under T_b. If the sequence has only one adjacent sibling node k, on its right, under its common parent in T_i, and node k is aligned with the leftmost node under T_b, then the sequence can be uniquely inserted to the left of node k under T_b. If the position under T_b of an unaligned node k of tag tree T_i cannot be uniquely determined, no insertion is performed, and tag tree T_i is instead placed in a temporary data record array.
The beneficial effects of the present invention are as follows. Following the tree structure of HTML page content, the crawled page is processed and converted into a tag tree; a weight is assigned to each tree node from the bottom up, so that nodes at different levels have different weights; data record regions are then identified from sets of similar subtrees and their positional continuity; and a tree alignment operation over the set of tag trees containing data records generates a reference tree that serves as the extraction template. The method achieves high efficiency and high accuracy.
Description of drawings
Fig. 1 is an architecture diagram of the present invention;
Fig. 2 is the tag tree corresponding to an HTML page;
Fig. 3 is the data record recognition flow based on the weighted tag tree;
Fig. 4 shows two trees that are judged to be similar;
Fig. 5 shows the two trees of insertion case 1;
Fig. 6 shows the two trees of insertion case 2;
Fig. 7 shows the two trees of insertion case 3;
Fig. 8 is the basic flow of data record alignment.
Embodiment
The present invention is further described below with reference to the drawings and specific embodiments.
The present invention proposes a method for recognizing and extracting web page data records based on tree weights, which mainly comprises four parts: web page processing and conversion, data record recognition, data record alignment and extraction, and data storage. The architecture of the invention is shown in Fig. 1.
Web page processing and conversion mainly converts pages crawled from different data sources into a form that a computer can process easily. The usual way of processing a page is to convert it, via its tags, into a tree of tags; the invention additionally introduces a processing and conversion method based on a weighted tag tree to improve recognition efficiency.
For a preprocessed page, the page content is built into a tag tree. Fig. 2 shows the tag tree of an HTML page. HTML tags can be divided into three classes according to their function:
1. Tags that plan the page layout and provide the content information regions, such as <div>, <p>, <td>, <tr>, <table> and their closing tags </table>, </tr>, </td>, </p>, </div>.
2. Tags that describe display properties, that is, how content is displayed, such as <b>, <i>, <strong>, <h1>, <h2> and their closing tags </h2>, </h1>, </strong>, </i>, </b>.
3. Hyperlink-related tags, such as <a> and <base>.
The processing and conversion method based on the weighted tag tree defines a weight for each tree node:
$$W = \lambda^{\mathit{depth}} + \sum_{i=1}^{n} \mathit{SubW}_i$$
where SubW_i is the weight of the i-th subtree node, λ is the weight adjustment parameter, and depth is the depth of the tree. If the weight of a node is 0, the node is regarded as an irrelevant node. If a tree node is a leaf node, its weight is W = 1 when it is a text node or a picture node and W = 0 for any other type. In addition, for a tree whose node carries an irrelevant tag, such as <a>, <b>, <i>, <strong>, <h1>, or <h2>, the corresponding λ is 0.
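As a small worked illustration of the weight formula, consider a <tr> element containing two <td> cells, each holding one text node. The values below assume λ = 10 and that depth is counted upward from the leaves (a leaf has depth 0); neither assumption is fixed by the text above, so the numbers are illustrative only.

```latex
\begin{align*}
W_{\text{text}} &= 1 \\
W_{\text{td}} &= \lambda^{1} + W_{\text{text}} = 10 + 1 = 11 \\
W_{\text{tr}} &= \lambda^{2} + 2\,W_{\text{td}} = 100 + 22 = 122
\end{align*}
```

With a base such as 10, the leading digits of a weight reflect the depth of the subtree and the trailing digits summarize its children, which is why nodes at different levels end up with clearly different weights.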
The overall framework of the node weight assignment method is recursive, which follows from the structure of the tag tree. Each node is assigned a weight recursively; the assignment for a node includes accumulating the subtree node weights, computing the current depth, and handling leaf nodes and irrelevant tags. Before assignment, page noise is handled. Noise data includes advertisement links, navigation links, copyright information, and so on. To remove its effect, the tag tree is pruned before weights are assigned, that is, irrelevant-tag leaf nodes and irrelevant-tag tree nodes are handled, in the following cases:
1. A leaf node whose tag is an irrelevant tag.
2. Adjacent text or picture nodes whose parent tags are all irrelevant tags.
3. A text or picture node without siblings whose parent tag is an irrelevant tag.
Data record recognition operates on the tag tree set produced by processing and conversion: the data record recognition method based on the weighted tag tree identifies the data record regions in the weighted tag trees. The recognition flow is shown in Fig. 3. After the tag tree set is fed to the data record recognition module, for each input tag tree the module first accesses the template tree library and searches it by comparing against the essential subtree set weights of each template tree in the library. If the input tree contains the essential subtree set weights, that template tree can be used to recognize data records in the input tree. If no corresponding template tree is found in the library, adaptive data record recognition is performed: the tree's own subtrees are compared to identify the data record regions.
Recognizing data records requires comparing tag trees to judge their degree of similarity. Existing comparison methods all compare the node tags of the tree structure and generally ignore text nodes and their content.
The comparison method used by the present invention is a weight-based tree comparison. Tag tree T1 and tag tree T2 are defined to be similar as follows: the subtree set of T1 and the subtree set of T2 have an intersection of equal weights that includes a subtree with weight greater than the threshold K, and the equal-weight subtrees preserve order, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k.
For example, the tag trees shown in Fig. 4 are similar, although the tree comparison methods of the prior art might judge them dissimilar because of thresholds and similar factors. Tree T1 and tree T2 both have subtree a with weight 11121 and subtree c with weight 1421 as equal-weight subtrees, in the same order, so under the criterion used here the two trees are similar. Other methods might judge the two trees unequal because subtree d exists under T2 and because subtree b under T1 is inconsistent with subtree k under T2.
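The similarity criterion can be sketched in a few lines of Python. This is a simplified illustration, not the patented procedure: each tree is reduced to the list of its child subtree weights, and order is checked on the first occurrence of each shared weight. The function name `is_similar` and the weights given to subtrees b, k, and d are hypothetical; only 11121 and 1421 come from the example above.

```python
from typing import List


def is_similar(w1: List[int], w2: List[int], k: int) -> bool:
    """Weight-based similarity of two trees given their child subtree weights.

    Simplifying assumption: order consistency is checked on the first
    occurrence of each shared weight in the two lists.
    """
    common = set(w1) & set(w2)
    # At least one shared subtree weight must exceed the threshold K.
    if not any(w > k for w in common):
        return False

    def first_order(ws: List[int]) -> List[int]:
        seen: List[int] = []
        for w in ws:
            if w in common and w not in seen:
                seen.append(w)
        return seen

    # Shared weights must appear in the same relative order in both trees.
    return first_order(w1) == first_order(w2)


# T1 has subtrees a, b, c; T2 has subtrees a, k, c, d (b, k, d weights made up).
t1 = [11121, 111, 1421]
t2 = [11121, 112, 1421, 12]
print(is_similar(t1, t2, k=1000))  # True
```

Because only equal weights are compared, the extra subtree d and the mismatch between b and k do not prevent the two trees from being judged similar.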
In the weight-based data record recognition method, while a tag tree is traversed, the weight of each traversed node is added to a set without duplicates. The template library is then traversed and each template tree's essential subtree set weights are checked against these weights; if a match exists, that template tree is used to recognize data records recursively. If no corresponding template tree is found in the template tree library, the precondition for template-based recognition is judged not to hold, and the adaptive data record recognition flow begins: data record regions are identified by judging similar subtrees in the tree's own contiguous regions, yielding the corresponding data records.
Data record alignment and extraction relies on a tree alignment method based on the weighted tag tree to improve alignment efficiency. The basic flow of data record alignment is shown in Fig. 8. The objects of alignment and extraction are the arrays of data record regions returned after the tag tree set has passed through data record recognition. Each data record in the array is aligned to a reference tree, which is generated by the tree alignment method; after every node of every data record in the array has been aligned with the reference tree, the data of interest to the user is extracted.
The basic idea of the weight-based tree alignment method is to first pick the tree with the largest weight in the data record array as the reference tree T_b. The largest weight is chosen because that tree has the deepest or widest structure, which makes it easier for the other trees to align with it. Then, for each tree T_i in the record array, nodes are considered by weight first and tag second, in descending order of weight, in an attempt to find all nodes under T_i that can be aligned with the reference tree T_b. If, for a node T_i[j], the reference tree T_b contains a node T_b[k] whose weight is greater than the threshold K and equal, or whose weight is less than K and equal with a matching node tag, node T_i[j] is considered alignable. If no alignable node exists, an insert operation inserts node T_i[j] into the reference tree T_b, thereby adjusting T_b. The adjusted reference tree T_b is used to align the trees of the other data records in the array.
The tree alignment method must align many tag trees, and the alignment proceeds by comparing two tag trees at a time. The comparison of two trees works as follows:
After the reference tree T_b (or a template tree T_t; the same applies below) has been aligned with a tree T_i, some nodes in T_i correspond to nodes in T_b, and the weights of these nodes are equal. The unaligned nodes, that is, those whose weights are not equal, are inserted into the reference tree T_b to adjust it, because they may contain optional data items. Several situations can arise when a node T_i[j] is inserted into T_b, all depending on whether T_i[j] can be inserted at some position under T_b without ambiguity. In practice, when an unaligned node T_i[j] is inserted, the whole run of adjacent unaligned nodes T_i[j] ... T_i[m] can be inserted in place of the single node T_i[j] to improve efficiency. Without loss of generality, assume the parent of T_i[j] ... T_i[m] has an aligned node under T_b; the nodes T_i[j] ... T_i[m] are to be inserted under that aligned parent in T_b. These nodes should not be inserted at an arbitrary position; the position must be determined first. The insertion position of the node sequence T_i[j] ... T_i[m] can be uniquely determined in the following cases:
1. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent in tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the reference tree T_b, then the sequence can be uniquely inserted between those two adjacent sibling nodes under T_b. Fig. 5 shows simplified diagrams of tag tree T_i and reference tree T_b: the consecutive adjacent nodes b and c under T_i can be inserted between node a and node d under T_b, because nodes a and d under T_i have corresponding aligned nodes under T_b. The reference tree T_b after insertion is shown in the lower half of Fig. 5. The nodes a, b, c, d, and e in the figure need not be simple text nodes; each may also be a tree.
2. As shown in Fig. 6, if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its left, under its common parent in tag tree T_i, and node k is aligned with the rightmost node under the reference tree T_b, then the sequence can be uniquely inserted to the right of node k under T_b.
3. As shown in Fig. 7, if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its right, under its common parent in tag tree T_i, and node k is aligned with the leftmost node under the reference tree T_b, then the sequence can be uniquely inserted to the left of node k under T_b.
In addition, if the position under T_b of an unaligned node k of tag tree T_i cannot be uniquely determined, the node is not inserted, and tag tree T_i is instead placed in a temporary data record array. Fig. 7 shows this situation: node k could be placed either between node a and node b or between node b and node c.
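The three unambiguous cases and the ambiguous fallback can be sketched as follows. This is a minimal sketch under stated assumptions: sibling lists are plain lists of labels, a node is treated as aligned when the same label occurs in the reference list, and labels are unique among siblings. The function name `insertion_index` and the sample labels are hypothetical.

```python
from typing import List, Optional


def insertion_index(ref: List[str], rec: List[str], j: int, m: int) -> Optional[int]:
    """Index in `ref` where the unaligned run rec[j..m] can be inserted,
    or None when the position is not uniquely determined."""
    left = rec[j - 1] if j > 0 else None
    right = rec[m + 1] if m + 1 < len(rec) else None
    li = ref.index(left) if left in ref else None
    ri = ref.index(right) if right in ref else None

    if left is not None and right is not None:
        # Case 1: both neighbours aligned and adjacent in the reference tree.
        if li is not None and ri is not None and ri == li + 1:
            return ri
        return None
    if left is not None:
        # Case 2: only a left neighbour, aligned with the rightmost reference node.
        return len(ref) if li == len(ref) - 1 else None
    if right is not None:
        # Case 3: only a right neighbour, aligned with the leftmost reference node.
        return 0 if ri == 0 else None
    return None


ref = ["a", "d", "e"]
rec = ["a", "b", "c", "d"]
pos = insertion_index(ref, rec, 1, 2)
if pos is not None:
    ref[pos:pos] = rec[1:3]  # insert the run b, c between a and d
print(ref)  # ['a', 'b', 'c', 'd', 'e']
```

When the function returns None, the caller would leave the reference list untouched and defer the record, mirroring the temporary data record array described above.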
Data storage mainly matches the tag tree set against the template tree to obtain the relevant information and saves the results to the corresponding database as individual database records, for use by services such as subsequent queries.
The present invention is further described below in the form of code.
1) Web page processing and conversion
The overall framework of the node weight assignment method is recursive, which follows from the structure of the tag tree. The method assigns a weight to each node recursively; the assignment for a node includes accumulating the subtree node weights, computing the current depth, and handling leaf nodes and irrelevant tags.
The flow of the node weight assignment method is recursive; the method is as follows:
[Code listing: tree node weight assignment method (Figures BDA0000124294050000111 and BDA0000124294050000121, images not reproduced in this text)]
In this method, line 1 first checks whether the current tree is a non-leaf node. If it is, line 2 initializes the depth and weight of the tree, and lines 3 to 7 traverse all child nodes of the current tree and accumulate the subtree node weights; the weight of each subtree is obtained by a recursive call to the same procedure, and the depth of the current tree is set to the maximum depth among its subtrees plus one. Line 10 checks whether the current tree is a leaf node and a text node, and lines 11 to 12 assign the depth and weight of the current text leaf node. Line 14 checks whether the current tree is a leaf node and a non-text node, and lines 15 to 16 assign the depth and weight of the current non-text leaf node. Line 18 checks whether the accumulated subtree node weights are 0; if so, the weight of the current tree is 0. Line 20 checks whether the tag of the current tree is in the array of useless tags: if so, the weight is left unchanged (line 21); if not, the weight of the current tree is the accumulated subtree weight plus the weight adjustment parameter raised to the power depth (line 22). When the procedure finishes, every tree node has been assigned a weight W.
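Since the original listing is not reproduced here, the following Python sketch reconstructs the described flow as an illustration only. The `Node` class, the `"#text"` and `"img"` markers for text and picture leaves, the default λ = 10, the leaf depth of 0, and the contents of `IRRELEVANT_TAGS` are assumptions made for this example rather than details taken from the listing.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed set of irrelevant (display-property and hyperlink) tags.
IRRELEVANT_TAGS = {"a", "base", "b", "i", "strong", "h1", "h2"}


@dataclass
class Node:
    tag: str  # "#text" for text leaves, "img" for pictures, else an HTML tag
    children: List["Node"] = field(default_factory=list)
    depth: int = 0
    weight: int = 0


def assign_weights(node: Node, lam: int = 10) -> int:
    """Recursively assign depth and weight to every node; returns node.weight."""
    if node.children:
        node.depth = 0
        sub_total = 0
        for child in node.children:
            sub_total += assign_weights(child, lam)
            node.depth = max(node.depth, child.depth + 1)
        if sub_total == 0:
            node.weight = 0                      # no useful content below
        elif node.tag in IRRELEVANT_TAGS:
            node.weight = sub_total              # lambda = 0: subtree sum only
        else:
            node.weight = lam ** node.depth + sub_total
    else:
        node.depth = 0
        node.weight = 1 if node.tag in ("#text", "img") else 0
    return node.weight


row = Node("tr", [Node("td", [Node("#text")]), Node("td", [Node("#text")])])
print(assign_weights(row))  # 122
```

Treating irrelevant tags with an explicit branch, instead of evaluating λ^depth with λ = 0, keeps the sketch clear of the 0 ** 0 == 1 corner case in Python.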
2) Data record recognition
First, while a tag tree is traversed, the weight of each traversed node is added to a set without duplicates. The template library is then traversed and each template tree's essential subtree set weights are checked against these weights; if a match exists, that template tree is used for data record recognition.
The flow of this recognition method is recursive; the method is as follows:
[Code listing: Identify method for data record recognition (Figure BDA0000124294050000131, image not reproduced in this text)]
Whether each subtree is a data record region is determined by comparing its weights with the selected template tree. In the Identify listing, line 1 checks whether the current tree has subtree nodes; line 2 traverses all subtree nodes of the current tree; line 4 calls the Compare function to judge whether the current subtree contains the selected template tree or the essential subtrees of the selected template tree. If the result is true, the subtree is added to the data record array arrayT (line 5); otherwise the procedure recursively visits the subtree's nodes (line 6). The recognition function calls Compare directly to judge whether the current subtree is a data record region, under the following conditions:
1) If the weights are equal, return true.
2) If the weights are unequal but the input tree contains all of the template tree's subtree set weights, return true.
3) In all other cases, return false.
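The Identify and Compare steps described above can be sketched as follows. The original listing is not available in this text, so this is a reconstruction under assumptions: the `Node` and `Template` classes, the `collect_weights` helper, and the representation of the essential subtree set weights as a non-empty set of integers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    weight: int
    children: List["Node"] = field(default_factory=list)


@dataclass
class Template:
    weight: int
    essential_weights: Set[int]  # assumed non-empty


def collect_weights(node: Node) -> Set[int]:
    """Duplicate-free set of the weights of all nodes in the subtree."""
    weights = {node.weight}
    for child in node.children:
        weights |= collect_weights(child)
    return weights


def compare(node: Node, tpl: Template) -> bool:
    if node.weight == tpl.weight:                            # condition 1
        return True
    return tpl.essential_weights <= collect_weights(node)    # conditions 2 and 3


def identify(tree: Node, tpl: Template, array_t: List[Node]) -> List[Node]:
    """Collect matching subtrees as data record regions, recursing otherwise."""
    for sub in tree.children:
        if compare(sub, tpl):
            array_t.append(sub)
        else:
            identify(sub, tpl, array_t)
    return array_t


tpl = Template(weight=122, essential_weights={11, 7})
a = Node(122, [Node(11), Node(11)])
d = Node(122, [Node(11), Node(11)])
b = Node(40, [d, Node(3)])
f = Node(130, [Node(11), Node(7)])
records = identify(Node(999, [a, b, f]), tpl, [])
print([r.weight for r in records])  # [122, 122, 130]
```

In this toy input, a and d match by equal weight, f matches because it contains every essential weight, and b is descended into because it satisfies neither condition.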
If in the template treebank, can not find corresponding template tree; Prerequisite being judged as not of then utilizing the template tree to discern; Begin to carry out adaptive recognition data record flow process, identify data recording area, obtain corresponding data recording through judgement to the similar subtree of self continuum.
3) based on the tree alignment procedures of weights
Alignment procedures for two trees; At first two tag tree are carried out subtree is carried out the judgement from big to small based on weights; If exist subtree that weights equate then this subtree need not to adjust; Otherwise the subtree to weights do not wait is aimed at judgement, finds out the continuous nodes sequence of failing to aim in the corresponding tag tree, carries out benchmark tree T bThe operation of inserting, adjustment benchmark tree T b, handle always and do not have unjustified continuous nodes or can't adjust benchmark tree T bTill.If can't adjust benchmark tree T b, meaning exists the insertion position can't well-determined node, and then the alignment function of this tree is not accomplished.If alignment function is not accomplished then this tag tree is joined in the blotter array
Alignment of multiple trees: Multiple trees are aligned with the following strategy. Trees are ordered by tree node weight, and those with larger weights are aligned first. In this way each alignment achieves the greatest possible degree of alignment and adjustment, which reduces the number of subsequent alignments and improves efficiency. The specific method is as follows:
[Figure BDA0000124294050000141: code listing for the multi-tree alignment method]
The method sorts the result record array arrayT produced by the data record recognition module in descending order of weight (line 2) and takes the tree with the largest weight as the reference tree T_b (line 3). It creates a temporary record array temparrayT to hold trees that have not been aligned (line 5). A loop tests the length of arrayT (line 6); inside it, temparrayT is cleared (line 7) and arrayT is traversed (line 8). The tree T with the current largest weight is taken from arrayT (line 9), and the function TreeAlign is called to align it with T_b (line 10). If alignment fails, T is added to temparrayT (line 11). Finally, T is removed from arrayT (line 12). After the traversal of arrayT finishes, a clone of temparrayT is assigned to arrayT and the next loop test begins (line 14).
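The loop described above can be approximated by the following Python sketch. It is not the listing from the figure: the function name `align_all`, the callables `weight_of` and `tree_align`, and their signatures are hypothetical. The sketch also stops when a full pass aligns no further tree, which is an assumption added here so that the loop always terminates; the text above does not state a termination rule for trees that never align.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def align_all(
    array_t: List[T],
    weight_of: Callable[[T], float],
    tree_align: Callable[[T, T], bool],
) -> Tuple[Optional[T], List[T]]:
    """Align every tree against the heaviest one.

    tree_align(tree, reference) is assumed to return True on success and
    to adjust `reference` in place when it inserts nodes.
    Returns the final reference tree and the trees that never aligned.
    """
    # Sort by weight, largest first, and take the heaviest as reference.
    pending = sorted(array_t, key=weight_of, reverse=True)
    if not pending:
        return None, []
    reference = pending.pop(0)

    while pending:
        temp_array_t: List[T] = []          # trees that failed this pass
        for tree in pending:                # heaviest remaining tree first
            if not tree_align(tree, reference):
                temp_array_t.append(tree)
        if len(temp_array_t) == len(pending):
            break                           # no progress (assumed stop rule)
        pending = list(temp_array_t)        # retry failures next pass
    return reference, pending
```

A tree that fails in one pass may succeed in a later pass, because trees aligned in between can add the nodes it needs to the reference tree. That is the reason failed trees are retried rather than discarded.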
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the concept of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (5)

1. A method for recognizing and extracting web page data records based on tree weights, characterized in that it comprises the following steps:
(1) web page processing and transformation;
(2) data record recognition;
(3) data record alignment and extraction;
(4) data storage;
The web page processing and transformation comprises the following steps:
11) classifying the tags of the crawled web page according to their function and then constructing a tag tree;
12) assigning a weight to each tree node of the tag tree according to the following formula:
$$W = \lambda \, depth + \sum_{i=1}^{n} SubW_i$$
where SubW_i is the weight of a child subtree node, λ is the weight adjustment parameter, and depth is the depth of the tree. If the weight of a node is 0, the node is considered an irrelevant node. If the tree node is an irrelevant tag node, where irrelevant tags include hyperlink tags and tags describing display properties, then λ = 0 is set for it. If the tree node is a leaf node, its weight is W = 1 if it is a text node or picture node, and W = 0 if it is of any other type;
The data record recognition comprises the following steps:
13) inputting the weighted tag tree from step 12) into the data record module as the input tree; the data record module first accesses the template tree library and searches it by comparing against the essential subtree set weights of the template trees in the library; if the input tree contains the essential subtree set weights, the input tree can be recognized with that template tree to obtain the corresponding data record area; if no corresponding template tree can be found in the template tree library, adaptive data record recognition is performed, in which the data record area is identified by judging the similar subtrees of the tree's own contiguous regions, to obtain the corresponding data record area;
The data record alignment and extraction comprises the following steps:
14) taking the tag tree with the largest weight in the data record area array obtained in step 13) as the reference tree T_b; for each remaining tag tree T_i in the data record area array, finding all nodes of T_i that can be aligned under the reference tree T_b, comparing tags first and then weights, in descending order of weight; if, for a node T_i[j], there exists under the reference tree T_b a node T_b[k] whose weight is greater than or equal to the threshold K and whose tag is the same, the node T_i[j] is considered alignable; if no alignable node exists, an insertion operation is performed to insert the node T_i[j] into the reference tree T_b, thereby adjusting T_b; the adjusted reference tree T_b is used to align the other tag trees in the data record area array, finally producing the final reference tree T_b;
The data storage comprises the following steps:
15) matching the tag tree set against the template tree to obtain the information, and saving the result in the form of a database.
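As an illustration of the bottom-up weight assignment in step 12), the following Python sketch computes W for every node of a small tag tree. Several points are assumptions made for the sketch rather than statements of the claim: the term "λ depth" is read as the product of λ and depth, depth is taken to be the node's own depth counted from 1 at the root, and the `TagNode` class and the `IRRELEVANT_TAGS` set are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical examples of hyperlink and display-property tags.
IRRELEVANT_TAGS = {"a", "b", "i", "font"}


@dataclass
class TagNode:
    tag: str
    kind: str = "element"              # "element", "text" or "image"
    children: List["TagNode"] = field(default_factory=list)
    weight: float = 0.0


def assign_weights(node: TagNode, lam: float = 1.0, depth: int = 1) -> float:
    """Assign W to `node` and all descendants; return the node's weight."""
    if not node.children:
        # Leaf rule: text and picture leaves weigh 1, other leaves 0.
        node.weight = 1.0 if node.kind in ("text", "image") else 0.0
        return node.weight
    # Irrelevant tags contribute nothing of their own (lambda = 0).
    node_lam = 0.0 if node.tag in IRRELEVANT_TAGS else lam
    child_sum = sum(assign_weights(c, lam, depth + 1) for c in node.children)
    # Assumed reading of the formula: W = lambda * depth + sum of SubW_i.
    node.weight = node_lam * depth + child_sum
    return node.weight
```

For a `div` at depth 1 containing a `p` with one text leaf, an `a` with one text leaf, and an empty `br`, this reading gives weights of 3 for `p`, 1 for `a`, 0 for `br`, and 5 for the `div`.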
2. The method for recognizing and extracting web page data records based on tree weights according to claim 1, characterized in that step 11) divides HTML tags into three classes according to their function: the first class is tags that plan the page layout, which provide the content information areas; the second class is tags that describe display properties, which describe how content is displayed; the third class is tags related to hyperlinks.
3. The method for recognizing and extracting web page data records based on tree weights according to claim 1, characterized in that, before weights are assigned to each tree node in step 12), the web page is denoised; the denoising step prunes the tag tree and comprises: treating leaf node tags as irrelevant tags, treating the parent node tag of adjacent text or picture nodes as an irrelevant tag, and treating the parent node tag of a text or picture node that has no siblings as an irrelevant tag.
4. The method for recognizing and extracting web page data records based on tree weights according to claim 1, characterized in that, for data record recognition, step 13) needs to compare tag trees to judge their degree of similarity, using the following comparison method: if the subtree set of tag tree T1 and the subtree set of tag tree T2 have an intersection of subtrees with equal weights, subtrees exceeding the threshold K exist, and the equal-weight subtree sets have an ordering relation, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k, then tag tree T1 is considered similar to tag tree T2.
5. The method for recognizing and extracting web page data records based on tree weights according to claim 1, characterized in that the insertion position of the insertion operation in step 14) is determined as follows: if the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent node in tag tree T_i, one on the far left and one on the far right, and both sibling nodes have corresponding aligned nodes under the reference tree T_b, then the node sequence T_i[j] ... T_i[m] can be uniquely inserted between those two adjacent sibling nodes under the reference tree T_b; if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its left, under its common parent node in tag tree T_i, and the node aligned with k is the rightmost node under the reference tree T_b, then the node sequence can be uniquely inserted at the rightmost position, after node k, under the reference tree T_b; if the node sequence T_i[j] ... T_i[m] has only one adjacent sibling node k, on its right, under its common parent node in tag tree T_i, and the node aligned with k is the leftmost node under the reference tree T_b, then the node sequence can be uniquely inserted at the leftmost position, before node k, under the reference tree T_b; if the position under the reference tree T_b of an unaligned node k of tag tree T_i cannot be uniquely determined, no insertion is performed and tag tree T_i is instead placed in the temporary data record array.
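The three insertion cases of claim 5 can be illustrated with a short Python sketch over one list of sibling nodes in the reference tree. The function names and the index-based interface are hypothetical. The sketch assumes that the unaligned run is maximal, so that any neighbour it has is an aligned node, and it assumes that in the two-neighbour case the aligned nodes must be directly adjacent in T_b for the position to be unique.

```python
from typing import List, Optional


def insertion_index(ref_len: int, left: Optional[int],
                    right: Optional[int]) -> Optional[int]:
    """Return the index in T_b's sibling list at which to insert the run.

    left / right are the T_b indices of the nodes aligned with the run's
    left and right neighbours in T_i, or None if that neighbour is absent.
    None is returned when the position is not uniquely determined.
    """
    if left is not None and right is not None:
        # Case 1: both neighbours aligned; unique only if adjacent in T_b.
        return right if right == left + 1 else None
    if left is not None:
        # Case 2: only a left neighbour; it must be T_b's rightmost node.
        return left + 1 if left == ref_len - 1 else None
    if right is not None:
        # Case 3: only a right neighbour; it must be T_b's leftmost node.
        return 0 if right == 0 else None
    return None


def insert_run(ref_children: List[str], run: List[str],
               left: Optional[int], right: Optional[int]) -> bool:
    """Insert `run` into `ref_children` in place; False if ambiguous."""
    pos = insertion_index(len(ref_children), left, right)
    if pos is None:
        return False          # caller would move T_i to the temporary array
    ref_children[pos:pos] = run
    return True
```

When `insert_run` returns False, the reference list is left unchanged, which corresponds to placing T_i in the temporary data record array for a later pass.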

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110438187.XA CN102591931B (en) 2011-12-23 2011-12-23 Recognition and extraction method for webpage data records based on tree weight

Publications (2)

Publication Number Publication Date
CN102591931A true CN102591931A (en) 2012-07-18
CN102591931B CN102591931B (en) 2015-03-18

Family

ID=46480573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110438187.XA Active CN102591931B (en) 2011-12-23 2011-12-23 Recognition and extraction method for webpage data records based on tree weight

Country Status (1)

Country Link
CN (1) CN102591931B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346405A (en) * 2013-08-08 2015-02-11 阿里巴巴集团控股有限公司 Method and device for extracting information from webpage
CN105786951A (en) * 2015-12-31 2016-07-20 北京金山安全软件有限公司 Method and device for extracting content blocks in webpage and server
CN108874934A (en) * 2018-06-01 2018-11-23 百度在线网络技术(北京)有限公司 Page body extracting method and device
WO2020249020A1 (en) * 2019-06-14 2020-12-17 中国建设银行股份有限公司 Method and system for capturing structured web page data
CN115344571A (en) * 2022-05-20 2022-11-15 药渡经纬信息科技(北京)有限公司 Universal data acquisition and analysis method, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080154926A1 (en) * 2002-12-16 2008-06-26 Newman Paula S System And Method For Clustering Nodes Of A Tree Structure
CN101833554A (en) * 2009-03-09 2010-09-15 富士通株式会社 Method and equipment for producing extraction template and method and equipment for extracting content on web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄健斌 et al.: "Recognition and Extraction of Dynamic Data Regions in Web Pages", 《计算机工程》 (Computer Engineering) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346405A (en) * 2013-08-08 2015-02-11 阿里巴巴集团控股有限公司 Method and device for extracting information from webpage
CN104346405B (en) * 2013-08-08 2018-05-22 阿里巴巴集团控股有限公司 A kind of method and device of the Extracting Information from webpage
CN105786951A (en) * 2015-12-31 2016-07-20 北京金山安全软件有限公司 Method and device for extracting content blocks in webpage and server
CN108874934A (en) * 2018-06-01 2018-11-23 百度在线网络技术(北京)有限公司 Page body extracting method and device
CN108874934B (en) * 2018-06-01 2021-11-30 百度在线网络技术(北京)有限公司 Page text extraction method and device
WO2020249020A1 (en) * 2019-06-14 2020-12-17 中国建设银行股份有限公司 Method and system for capturing structured web page data
CN115344571A (en) * 2022-05-20 2022-11-15 药渡经纬信息科技(北京)有限公司 Universal data acquisition and analysis method, system and storage medium
CN115344571B (en) * 2022-05-20 2023-05-23 药渡经纬信息科技(北京)有限公司 Universal data acquisition and analysis method, system and storage medium

Also Published As

Publication number Publication date
CN102591931B (en) 2015-03-18

Similar Documents

Publication Publication Date Title
Gatterbauer et al. Towards domain-independent information extraction from web tables
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN104537116B (en) A kind of books searching method based on label
CN101388022B (en) Web portrait search method for fusing text semantic and vision content
CN101968819B (en) Audio/video intelligent catalog information acquisition method facing to wide area network
CN102651003B (en) Cross-language searching method and device
CN103927397B (en) Recognition method for Web page link blocks based on block tree
Zheng et al. Template-independent news extraction based on visual consistency
CN103559234B (en) System and method for automated semantic annotation of RESTful Web services
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN107577788B (en) E-commerce website topic crawler method for automatically structuring data
CN103440232A (en) Automatic sScientific paper standardization automatic detecting and editing method
CN101515287A (en) Automatic generating method of wrapper of complex page
CN102681994A (en) Webpage information extracting method and system
Ji et al. Tag tree template for Web information and schema extraction
CN102591931A (en) Recognition and extraction method for webpage data records based on tree weight
CN103440233A (en) Automatic sScientific paper standardization automatic detecting and editing system
CN101763395A (en) Method for automatically generating webpage by adopting artificial intelligence technology
CN100568221C (en) A kind of method of newspaper layout being carried out the words reading sequence recovery
CN100590623C (en) System and method for abstraction of Web data based on vision
CN103761312B (en) Information extraction system and method for multi-recording webpage
Maynard et al. Change management for metadata evolution
CN105426490A (en) Tree structure based indexing method
CN103544167A (en) Backward word segmentation method and device based on Chinese retrieval
CN113282793A (en) Web table data semantic extraction and RDF construction method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant