CN102591931B - Recognition and extraction method for webpage data records based on tree weight - Google Patents


Info

Publication number
CN102591931B
Authority
CN
China
Prior art keywords
tree
node
weights
label
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110438187.XA
Other languages
Chinese (zh)
Other versions
CN102591931A (en)
Inventor
尹建伟
彭勇
杨弈锦
邓水光
李莹
吴健
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201110438187.XA priority Critical patent/CN102591931B/en
Publication of CN102591931A publication Critical patent/CN102591931A/en
Application granted granted Critical
Publication of CN102591931B publication Critical patent/CN102591931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a recognition and extraction method for webpage data records based on a tree weight. The method comprises the following steps of: processing and transforming a webpage; recognizing the data records; aligning and extracting the data records; storing the data; and processing and converting the extracted webpage to a tag tree structure according to the characteristics of the tree structure of the contents on an HTML (hypertext markup language) webpage, assigning weights for each of tree nodes from bottom to top, so that the nodes in different layers have different weights, then, identifying a data record area according to a similar sub-tree set and the position consistency, and then, aligning the tree according to a tag tree set including the data records so as to generate a reference tree as an extraction template, so that the effects of high efficiency and high accuracy can be achieved.

Description

Recognition and extraction method for webpage data records based on tree weight
Technical field
The present invention relates to the field of information extraction, and in particular to a method for recognizing and extracting webpage data records based on tree weights.
Background technology
With the exponential growth of information on the Internet, webpages contain more and more resources and information of interest to people, but as the volume of information grows, finding information becomes difficult. Because information is scattered unpredictably, users can only rely on full-text search to find what they need, yet the pages that contain the needed information are flooded with irrelevant content such as advertisements and links, so the useful information cannot be obtained quickly and intuitively. Obtaining information manually is also inefficient: the information a user needs often has to be gathered from several different sources, and because the websites that hold it differ in structure, each site must be queried and analyzed by hand, and the results must finally be organized into the required form and saved to a database for use by subsequent services. This process is tedious and slow. Extracting the content a user is interested in from massive numbers of HTML documents accurately and efficiently is therefore becoming more and more important, and the tree-weight-based method for recognizing and extracting webpage data records is proposed against this background.
Webpage information extraction is the application of information extraction technology to the Internet. It extracts specific information from the massive structured, semi-structured, or free-form HTML text distributed on the Internet and converts it into a unified structured representation. Compared with conventional information extraction, webpage information extraction is characterized by massive data volume, structural differences between sources, dynamic change, unstructured data, and a lack of semantic information.
In webpage information extraction, the program responsible for extracting information from webpages is called a wrapper: a program that extracts information from semi-structured HTML text, converts it into unified structured information, and stores it. A wrapper automates information extraction, which is particularly useful when facing massive amounts of webpage information. According to the principle by which the wrapper is generated, approaches can be divided into machine learning methods, natural language understanding methods, ontology-based methods, HTML-structure-based methods, and so on. Methods based on HTML structural features are the most studied and the best developed in current webpage information extraction. They make full use of the structural features of HTML text: before extraction, the HTML text is first converted into a corresponding tag tree, extraction rules are then generated automatically or semi-automatically, and the rules are applied to the tag tree to extract data. Research on this approach still has several problems:
1. Most current research uses the tag tree of the entire HTML text of a webpage as the template, which is clearly inefficient. In practice many text nodes also contain HTML tags: text may carry decorative tags such as color or font, some text carries hyperlinks, and the content may include pictures, tables, and so on. All of this is irrelevant information.
2. Some text nodes are repeated across the webpage set. Such nodes should not be extracted as key content; they should be treated as part of the template.
3. The tree comparison methods used in existing extraction methods are generally applied to small sets of target webpages; on large sets their efficiency is very low.
4. Structure comparison between pages usually needs two or more pages; if only a single page is available, no template can be extracted.
These aspects have been neglected because of two characteristics of webpages: their complexity and their sheer number. On such a complex and massive webpage set, extraction must achieve high precision and recall while also running fast, which is essential in practice. A highly accurate and highly efficient webpage content extraction method is therefore urgently needed.
Summary of the invention
The present invention addresses the low accuracy and efficiency of existing webpage data recognition and extraction methods, which cannot effectively extract the needed information from a large number of webpages, by proposing a method for recognizing and extracting webpage data records based on tree weights. Exploiting the tree structure of HTML webpage content, the method converts each crawled webpage into a tag tree, assigns a weight to every tree node from the bottom up so that nodes at different levels have different weights, identifies data record regions from sets of similar subtrees and their positional continuity, and then aligns the tag trees that contain data records to generate a benchmark tree that serves as the extraction template. This yields high efficiency and high accuracy.
To solve the above technical problems, the technical solution of the present invention is as follows:
1. A method for recognizing and extracting webpage data records based on tree weights, comprising the following steps:
(1) webpage processing and conversion;
(2) data record recognition;
(3) data record alignment and extraction;
(4) data storage.
The webpage processing and conversion comprises the following steps:
11) classifying the tags of the crawled webpage by function and then constructing a tag tree;
12) assigning a weight to each tree node of the tag tree according to the following formula:
$$W = \lambda^{\mathit{depth}} + \sum_{i=1}^{n} \mathit{SubW}_i$$
where SubW_i is the weight of the i-th of the n child subtree nodes, λ is a weight adjustment parameter, and depth is the depth of the tree. If the weight of a node is 0, the node is considered an irrelevant node. If a tree node is an irrelevant tag node, where irrelevant tags include hyperlink tags and tags that describe display properties, then λ = 0 for that tree. If a tree node is a leaf node, its weight is W = 1 when it is a text node or a picture node and W = 0 for any other type;
The data record recognition comprises the following step:
13) the tag tree weighted in step 12) is fed as an input tree to the data record recognition module. The module first accesses the template tree library and searches it by comparing against the essential subtree-set weights of each template tree. If the input tree contains the essential subtree-set weights of a template tree, the input tree can be processed with that template tree to recognize data records and obtain the corresponding data record regions. If no matching template tree is found in the library, adaptive data record recognition is performed: data record regions are identified by judging the similarity of the input tree's own contiguous subtrees, yielding the corresponding data record regions;
The data record alignment and extraction comprises the following step:
14) from the array of data record regions obtained in step 13), the tag tree with the largest weight is taken as the benchmark tree T_b. For every remaining tag tree T_i in the array, all nodes under T_i that can be aligned with the benchmark tree T_b are found by comparing weights first and tags second, in descending order of weight. If for a node T_i[j] there is a node T_b[k] under the benchmark tree T_b whose weight is greater than or equal to the threshold K and whose tag is the same, the node T_i[j] is considered alignable. If no alignable node exists, an insertion operation inserts T_i[j] into the benchmark tree T_b, thereby adjusting T_b. The adjusted benchmark tree T_b is used to align the other tag trees in the array, and the final benchmark tree T_b is produced at the end;
The data storage comprises the following step:
15) the tag tree set is matched against the template tree to obtain the information, and the result is saved in database form.
Further, in step 11) HTML tags are divided into three classes by function: first, page layout tags, which delimit content information regions; second, tags that describe display properties, which control how content is displayed; third, hyperlink-related tags.
Further, before weights are assigned to the tree nodes in step 12), the webpage is denoised. The denoising step prunes the tag tree: leaf-node tags are set as irrelevant tags, the parent tags of adjacent text or picture nodes are set as irrelevant tags, and the parent tag of a text or picture node that has no siblings is set as an irrelevant tag.
Further, the data record recognition of step 13) requires comparing tag trees to judge how similar they are. The comparison method is as follows: tag tree T1 and tag tree T2 are considered similar if the subtree set of T1 and the subtree set of T2 have an intersection of equal weights, that intersection contains a subtree whose weight is greater than the threshold K, and the equal-weight subtrees occur in the same order, that is, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k.
Further, the insertion position of the insertion operation in step 14) is determined as follows. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent in tag tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the benchmark tree T_b, then the sequence T_i[j] ... T_i[m] can be inserted unambiguously between those two adjacent sibling nodes under T_b. If the sequence has only a left adjacent sibling node k under its common parent in T_i, and node k is aligned with the rightmost node under T_b, then the sequence can be inserted unambiguously immediately to the right of node k under T_b. If the sequence has only a right adjacent sibling node k under its common parent in T_i, and node k is aligned with the leftmost node under T_b, then the sequence can be inserted unambiguously immediately to the left of node k under T_b. If the position under T_b of a non-aligned node k of tag tree T_i cannot be determined uniquely, no insertion is performed and T_i is placed in a temporary data record array.
The beneficial effects of the present invention are as follows: exploiting the tree structure of HTML webpage content, the method converts each crawled webpage into a tag tree, assigns a weight to every tree node from the bottom up so that nodes at different levels have different weights, identifies data record regions from sets of similar subtrees and their positional continuity, and then aligns the tag trees that contain data records to generate a benchmark tree that serves as the extraction template, achieving high efficiency and high accuracy.
Brief description of the drawings
Fig. 1 is the architecture diagram of the present invention;
Fig. 2 is the tag tree corresponding to an HTML page;
Fig. 3 is the data record recognition flow for weighted tag trees;
Fig. 4 shows two trees that are judged to be similar;
Fig. 5 shows the two trees of insertion case 1;
Fig. 6 shows the two trees of insertion case 2;
Fig. 7 shows the two trees of insertion case 3;
Fig. 8 is the basic flow of data record alignment.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
The present invention proposes a method for recognizing and extracting webpage data records based on tree weights. It consists of four main parts: webpage processing and conversion, data record recognition, data record alignment and extraction, and data storage. The architecture of the invention is shown in Fig. 1.
Webpage processing and conversion turns webpages crawled from different data sources into a form that a computer can process easily. The usual way to process a webpage is to convert it, by its tags, into a tree made of tags; in addition, a webpage processing and conversion method based on weighted tag trees is introduced here to improve recognition efficiency.
After preprocessing, the webpage content is built into a tag tree; Fig. 2 shows the tag tree of an HTML page. HTML tags can be divided into three classes by function:
1. Page layout tags, which delimit content information regions, such as <div>, <p>, <td>, <tr>, and <table>, together with their closing tags.
2. Tags that describe display properties, which control how content is displayed, such as <b>, <i>, <strong>, <h1>, and <h2>, together with their closing tags.
3. Hyperlink-related tags, such as <a> and <base>.
The webpage processing and conversion method based on weighted tag trees defines a weight for each tree node:
$$W = \lambda^{\mathit{depth}} + \sum_{i=1}^{n} \mathit{SubW}_i$$
where SubW_i is the weight of the i-th of the n child subtree nodes, λ is a weight adjustment parameter, and depth is the depth of the tree. If the weight of a node is 0, the node is considered an irrelevant node. If a tree node is a leaf node, its weight is W = 1 when it is a text node or a picture node and W = 0 for any other type. In addition, if the root of a tree is an irrelevant tag node such as <a>, <b>, <i>, <strong>, <h1>, or <h2>, then λ for that tree is 0.
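As a worked illustration of the formula, assume λ = 10 and that a leaf node has depth 0; both values are assumptions made for this example and are not stated in the source. A <p> node holding one text leaf then has depth 1, a <div> holding that <p> and one picture leaf has depth 2, and an <a> node holding one text leaf is an irrelevant tag whose λ term vanishes:

$$
\begin{aligned}
W_{p} &= \lambda^{1} + 1 = 10 + 1 = 11,\\
W_{div} &= \lambda^{2} + (W_{p} + 1) = 100 + 12 = 112,\\
W_{a} &= 0 + 1 = 1.
\end{aligned}
$$

With a value of λ larger than the typical number of children, the leading digits of a weight reflect how deep the subtree is and the trailing digits reflect how much content it holds, which is why nodes at different levels end up with clearly different weights.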
The overall framework of the tree node weight assignment method is recursive, which follows from the structure of the tag tree. Weights are assigned to every node recursively; for each node the assignment includes accumulating the weights of the child subtree nodes, computing the current depth, and handling leaf nodes and irrelevant tags. Webpage noise is processed before the assignment is carried out. Noise data includes advertisement links, navigation links, copyright information, and the like. To remove its influence, the tag tree is pruned before weights are assigned, that is, irrelevant-tag leaf nodes and irrelevant-tag tree nodes are handled according to the following strategy:
1. Leaf-node tags are set as irrelevant tags.
2. The parent tags of adjacent text or picture nodes are all set as irrelevant tags.
3. The parent tag of a text or picture node that has no siblings is set as an irrelevant tag.
Data record recognition operates on the tag tree set produced by processing and conversion: the weight-based data record recognition method identifies the data record regions in these weighted tag trees. The recognition flow is shown in Fig. 3. After the tag tree set is fed to the data record recognition module, for each input tag tree the module first accesses the template tree library and searches it by comparing against the essential subtree-set weights of each template tree. If the input tree contains the essential subtree-set weights of a template tree, the input tree can be processed with that template tree to recognize data records. If no matching template tree is found in the library, adaptive data record recognition is performed: the input tree's own subtrees are compared with one another to identify the data record regions.
Data record recognition requires comparing tag trees to judge how similar they are. The usual comparison methods compare the node tags of the tree structures and generally ignore the content of text nodes.
The comparison method used in the present invention is a weight-based tree comparison. Tag tree T1 and tag tree T2 are defined as similar if the subtree set of T1 and the subtree set of T2 have an intersection of equal weights, that intersection contains a subtree whose weight is greater than the threshold K, and the equal-weight subtrees occur in the same order: when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k.
The tag trees shown in Fig. 4, for example, are similar under this definition, although the tree comparison methods of the prior art might judge them dissimilar because of their thresholds and other factors. Trees T1 and T2 both contain a subtree a with weight 11121 and a subtree c with weight 1421, and the two subtrees occur in the same order, so under the decision strategy described here the two trees are similar. Other methods might judge the two trees unequal because subtree d exists only under T2 and because subtree b under T1 does not match subtree k under T2.
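The similarity test can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patent's implementation: each tree is represented only by the ordered list of its child subtree weights, duplicate weights are handled by looking at first occurrences only, and the function names are hypothetical. The weights 11121 and 1421 come from the Fig. 4 discussion; the weights used for subtrees b, k, and d are invented for illustration.

```python
def first_occurrence_order(weights, common):
    """Return the shared weights in the order they first appear in `weights`."""
    seen = []
    for w in weights:
        if w in common and w not in seen:
            seen.append(w)
    return seen


def similar(weights_t1, weights_t2, k):
    """Weight-based similarity of two trees given their child subtree weights.

    The trees are similar when they share at least one subtree weight
    greater than the threshold k and the shared weights occur in the
    same relative order in both trees.
    """
    common = set(weights_t1) & set(weights_t2)
    if not any(w > k for w in common):
        return False
    return (first_occurrence_order(weights_t1, common)
            == first_occurrence_order(weights_t2, common))


# T1 has subtrees a, b, c; T2 has subtrees a, k, c, d (b, k, d are made up).
t1 = [11121, 212, 1421]
t2 = [11121, 323, 1421, 111]
print(similar(t1, t2, k=1000))
```

Because only weights are compared, the check needs no node-by-node tag comparison, which is where the efficiency claimed for large page sets would come from.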
In the weight-based data record recognition method, while a tag tree is traversed the weight of every visited node is added to a set without duplicates. The template library is then traversed, and each template tree is checked against this weight set using the template's weight and its essential subtree-set weights. If a matching template tree exists, it is selected and data record recognition is carried out recursively with it. If no matching template tree is found in the template tree library, the precondition for template-based recognition is not met, and the adaptive data record recognition flow starts: data record regions are identified by judging the similarity of the tree's own contiguous subtrees, yielding the corresponding data records.
Data record alignment and extraction uses a tree alignment method for weighted tag trees to improve alignment efficiency. The basic flow of data record alignment is shown in Fig. 8. Its input is the array of data record regions returned for the tag tree set by data record recognition. Every data record in the array is aligned to a benchmark tree; the benchmark tree is generated by the tree alignment method, and after each node of each data record in the array has been aligned with the benchmark tree, the data of interest to the user is extracted.
The basic idea of the weight-based tree alignment method is to first select the tree with the largest weight in the data record array as the benchmark tree T_b. The largest weight is chosen because that tree has the deepest or the widest structure, so the other trees can be aligned to it more easily. Then, for every tree T_i in the record array, the method tries to find all nodes under T_i that can be aligned with the benchmark tree T_b, comparing weights first and tags second, in descending order of weight. If for a node T_i[j] there is a node T_b[k] under the benchmark tree T_b whose weight is greater than the threshold K and equal to that of T_i[j], or whose weight is less than K and equal and whose tag is also the same, the node T_i[j] is considered alignable. If no alignable node exists, an insertion operation inserts T_i[j] into the benchmark tree T_b, thereby adjusting T_b. The adjusted benchmark tree T_b is used to align the trees of the other data records in the array.
The tree alignment method has to align many tag trees, and it does so by comparing two tag trees at a time. The comparison of two trees works as follows:
After the benchmark tree T_b (or the template tree T_t, likewise below) has been aligned with a tree T_i, some nodes of T_i correspond to nodes of T_b, and the weights of these nodes are equal. Nodes that are not aligned, that is, nodes whose weights are not equal, are inserted into the benchmark tree T_b to adjust it, because these nodes may contain optional data items. Several situations can arise when a node T_i[j] is inserted into T_b, all depending on whether T_i[j] can be inserted unambiguously at some position under T_b. In practice, when a non-aligned node T_i[j] is to be inserted, the whole run of adjacent non-aligned nodes T_i[j] ... T_i[m] can be inserted in place of the single node T_i[j] to improve efficiency. Without loss of generality, assume that the parent of T_i[j] ... T_i[m] has an aligned node under T_b, so that the nodes T_i[j] ... T_i[m] are to be inserted under the corresponding aligned parent in T_b. These nodes cannot be inserted in arbitrary order; their position must be determined first. The insertion position of the node sequence T_i[j] ... T_i[m] is uniquely determined in the following cases:
1. If the node sequence T_i[j] ... T_i[m] has two adjacent sibling nodes under its common parent in tree T_i, one on its left and one on its right, and both siblings have corresponding aligned nodes under the benchmark tree T_b, then the sequence can be inserted unambiguously between those two adjacent sibling nodes under T_b. Fig. 5 shows simplified diagrams of tag tree T_i and benchmark tree T_b: the consecutive adjacent nodes b and c under T_i can be inserted between nodes a and d under T_b, because nodes a and d under T_i have corresponding aligned nodes under T_b. The benchmark tree T_b after insertion is shown in the lower half of Fig. 5. Nodes a, b, c, d, and e in the figure need not be simple text nodes; each may also be a tree.
2. As shown in Fig. 6, if the node sequence T_i[j] ... T_i[m] has only a left adjacent sibling node k under its common parent in tag tree T_i, and node k is aligned with the rightmost node under the benchmark tree T_b, then the sequence can be inserted unambiguously immediately to the right of node k under T_b.
3. As shown in Fig. 7, if the node sequence T_i[j] ... T_i[m] has only a right adjacent sibling node k under its common parent in tag tree T_i, and node k is aligned with the leftmost node under the benchmark tree T_b, then the sequence can be inserted unambiguously immediately to the left of node k under T_b.
In addition, if the position under T_b of a non-aligned node k of tag tree T_i cannot be determined uniquely, the node is not inserted; instead, tag tree T_i is placed in a temporary data record array. Fig. 7 illustrates this situation: node k could be placed either between node a and node b or between node b and node c.
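The three insertion cases and the ambiguous fallback can be sketched as a single position lookup. This is a minimal Python sketch under stated assumptions, not the patent's code: the children of the aligned parent in T_i are described by a hypothetical list `aligned`, where `aligned[p]` is the index of the benchmark child that child p of T_i is aligned with, or `None` if it is not aligned; `j` and `m` bound the run of non-aligned nodes; `n_ref` is the number of children under the corresponding parent in T_b.

```python
from typing import Optional, Sequence


def insertion_index(aligned: Sequence[Optional[int]], j: int, m: int,
                    n_ref: int) -> Optional[int]:
    """Return the index in the benchmark children where the run
    aligned[j..m] should be inserted, or None if the position is ambiguous."""
    has_left = j > 0
    has_right = m + 1 < len(aligned)
    left = aligned[j - 1] if has_left else None
    right = aligned[m + 1] if has_right else None

    if has_left and has_right:
        # Case 1: both neighbours are aligned to adjacent benchmark siblings.
        if left is not None and right is not None and right == left + 1:
            return right
        return None
    if has_left:
        # Case 2: only a left neighbour, aligned to the rightmost benchmark node.
        return n_ref if left == n_ref - 1 else None
    if has_right:
        # Case 3: only a right neighbour, aligned to the leftmost benchmark node.
        return 0 if right == 0 else None
    # No aligned neighbour at all: treated as ambiguous in this sketch.
    return None


# Fig. 5 style example: T_i children a, b, c, d; benchmark children a, d, e.
print(insertion_index([0, None, None, 1], j=1, m=2, n_ref=3))
```

A `None` result corresponds to the situation in which the tag tree is deferred to the temporary data record array instead of being inserted.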
Data storage matches the tag tree set against the template tree to obtain the relevant information, and saves the result as database records in the corresponding database for use by subsequent query services and the like.
The present invention is further described below in the form of code.
1) Webpage processing and conversion
The overall framework of the tree node weight assignment method is recursive, which follows from the structure of the tag tree. The method assigns weights to every node recursively; for each node the assignment includes accumulating the weights of the child subtree nodes, computing the current depth, and handling leaf nodes and irrelevant tags.
The tree node weight assignment procedure is recursive and works as follows:
In the listing, line 1 checks whether the current tree is a non-leaf node. If it is, line 2 initializes the depth and weight of the tree, and lines 3 to 7 traverse all child nodes of the current tree, accumulating the weights of the child subtree nodes; the weight of each subtree is obtained by a recursive call to the same procedure, and the depth of the current tree is set to the maximum depth among its subtrees plus 1. Line 10 handles the case where the current tree is a leaf node and a text node; lines 11 to 12 assign the depth and weight of this text leaf node. Line 14 handles the case where the current tree is a leaf node and not a text node; lines 15 to 16 assign the depth and weight of this non-text leaf node. Line 18 checks whether the accumulated weight of the child subtree nodes is 0; if so, the weight of the current tree is 0. Line 20 checks whether the tag of the current tree is in the array of irrelevant tags; if so, the weight stays unchanged (line 21), otherwise the weight of the current tree is the sum of the subtree weights plus the weight adjustment parameter raised to the power of the depth (line 22). When the procedure ends, every tree node has been assigned a weight W.
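The recursive assignment described above can be sketched in Python as follows. This is a minimal sketch reconstructed from the walkthrough, so its line numbers do not correspond to those of the original listing. The `Node` class, the tag sets, the leaf depth of 0, and the default λ = 10 are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tag sets; the source names <a>, <b>, <i>, <strong>, <h1>, <h2>
# as examples of irrelevant tags.
IRRELEVANT_TAGS = {"a", "base", "b", "i", "strong", "h1", "h2"}
CONTENT_LEAVES = {"#text", "img"}  # text and picture leaf nodes


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    depth: int = 0
    weight: int = 0


def assign_weights(node: Node, lam: int = 10) -> int:
    """Assign depth and weight bottom-up and return the node's weight."""
    if node.children:
        node.depth = 0
        total = 0
        for child in node.children:
            total += assign_weights(child, lam)          # recurse first
            node.depth = max(node.depth, child.depth + 1)
        if total == 0:
            node.weight = 0                 # no content below: irrelevant node
        elif node.tag in IRRELEVANT_TAGS:
            node.weight = total             # lambda term dropped (lambda = 0)
        else:
            node.weight = total + lam ** node.depth
    else:
        node.depth = 0                      # assumed depth of a leaf
        node.weight = 1 if node.tag in CONTENT_LEAVES else 0
    return node.weight


page = Node("div", [
    Node("p", [Node("#text")]),
    Node("p", [Node("b", [Node("#text")])]),
    Node("br"),
])
assign_weights(page)
print(page.weight, [child.weight for child in page.children])
```

In this example the <b> wrapper adds depth but no λ term of its own, and the empty <br> leaf contributes nothing, so only content-bearing structure raises the weight.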
2) Data record recognition
First, while a tag tree is traversed, the weight of every visited node is added to a set without duplicates. The template library is then traversed, and each template tree is checked using its weight and its essential subtree-set weights. If a matching template tree exists, it is selected and used for data record recognition.
The recognition procedure is recursive and works as follows:
Whether a subtree is a data record region is decided by comparing the weights of each subtree with those of the selected template tree. In the recognition listing, line 1 checks whether the current tree has child nodes and line 2 traverses all child nodes of the current tree. Line 4 calls the Compare function to check whether the current subtree contains the selected template tree or the essential subtrees of the selected template tree. If the result is true, the subtree is added to the data record array arrayT (line 5); otherwise the procedure recurses into the child nodes (line 6). The recognition function calls Compare directly to decide whether the current subtree is a data record region; the conditions are as follows:
1) if the weights are equal, return true;
2) if the weights are not equal but the input tree contains all the subtree-set weights of the template tree, return true;
3) in all other cases, return false.
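The recognition pass and the three Compare conditions can be sketched as below. This is a minimal Python sketch, not the original listing: the `Node` class and function names are hypothetical, the template is reduced to its root weight plus a set of essential subtree weights, and "contains the subtree-set weights" is interpreted as a check against the weights of the direct child subtrees, which is an assumption. The weights in the example are hand-picked for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    weight: int
    children: list = field(default_factory=list)


def compare(tree, template_weight, essential):
    """Return True if `tree` looks like a data record of the template."""
    if tree.weight == template_weight:           # condition 1
        return True
    child_weights = {c.weight for c in tree.children}
    # Condition 2: all essential subtree weights are present; else condition 3.
    return bool(essential) and essential <= child_weights


def recognize(tree, template_weight, essential, found):
    """Collect matching subtrees top-down; descend only where no match is found."""
    for child in tree.children:
        if compare(child, template_weight, essential):
            found.append(child)
        else:
            recognize(child, template_weight, essential, found)
    return found


r1 = Node("tr", 212, [Node("td", 11), Node("td", 101)])
r2 = Node("tr", 223, [Node("td", 11), Node("td", 101), Node("td", 11)])
table = Node("table", 1435, [r1, r2])
nav = Node("div", 11, [Node("#text", 1)])
body = Node("body", 11446, [nav, table])

records = recognize(body, template_weight=212, essential={11, 101}, found=[])
print([r.weight for r in records])
```

The second record is accepted even though its weight differs from the template's, because it still carries every essential subtree weight; this is what lets one template cover records with optional fields.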
If no matching template tree is found in the template tree library, the precondition for template-based recognition is not met, and the adaptive data record recognition flow starts: data record regions are identified by judging the similarity of the tree's own contiguous subtrees, yielding the corresponding data records.
3) Weight-based tree alignment process
To align two trees, the subtrees of the two tag trees are first compared by weight in descending order. A subtree that has an equal-weight counterpart needs no adjustment. Subtrees whose weights differ are checked for alignment: the continuous node sequences in the tag tree that fail to align are located and inserted into the benchmark tree T_b, which adjusts T_b. This continues until no unaligned continuous node sequence remains or T_b can no longer be adjusted. If T_b cannot be adjusted, there are nodes whose insertion position cannot be uniquely determined, and the alignment of this tree is not complete. In that case the tag tree is added to the temporary record array.
To align multiple trees, the following strategy is used: trees are aligned in descending order of tree-node weight, so that trees with larger weights are aligned first. Each alignment then has the highest chance of success and adjusts the benchmark tree the most, which reduces the number of subsequent alignments and improves alignment efficiency. The specific method is as follows:
The method sorts the result record array arrayT produced by the data record identification module in descending order (line 2) and takes the tree with the maximum weight as the benchmark tree T_b (line 3). It creates a temporary record array temparrayT that holds the trees that failed to align (line 5). A loop tests the length of arrayT (line 6); inside the loop, temparrayT is emptied (line 7) and arrayT is traversed (line 8). The tree T with the current maximum weight is taken from arrayT (line 9), and the function TreeAlign is called to align it with T_b (line 10). If the alignment fails, T is added to temparrayT (line 11). Finally, T is removed from arrayT (line 12). When the traversal of arrayT ends, a clone of temparrayT is assigned to arrayT and the next loop test begins (line 14).
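The multi-tree loop described above can be sketched as follows. This is only an illustration under stated assumptions: `align_all` is a hypothetical name, `tree_align(base, tree)` stands in for TreeAlign and is assumed to adjust `base` in place and return whether the alignment succeeded, and the early exit when a full pass aligns nothing is a guard added for the example, since the text does not say how the loop ends when some trees can never be aligned.

```python
def align_all(array_t, tree_align, weight=lambda t: t.weight):
    """Align every tree in array_t against the maximum-weight tree.

    Returns (benchmark_tree, trees_that_could_not_be_aligned).
    """
    # Sort in descending order of weight; the heaviest tree is the benchmark.
    pending = sorted(array_t, key=weight, reverse=True)
    if not pending:
        return None, []
    base = pending.pop(0)

    while pending:
        deferred = []                      # plays the role of temparrayT
        for tree in pending:               # heaviest remaining tree first
            if not tree_align(base, tree):
                deferred.append(tree)      # retry after base has grown
        if len(deferred) == len(pending):  # assumed guard: no progress
            return base, deferred
        pending = deferred                 # next round works on the failures
    return base, []
```

A tree that fails in one pass can succeed in a later pass, because the benchmark tree has absorbed nodes from other trees in the meantime.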
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the inventive concept, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (1)

1. A method for recognizing and extracting webpage data records based on tree weights, characterized by comprising the following steps:
(1) webpage processing and transformation;
(2) data record identification;
(3) data record alignment and extraction;
(4) data storage;
The webpage processing and transformation comprises the following steps:
11) classifying the tags of the crawled webpage according to their function and then constructing a tag tree;
12) assigning a weight to each tree node of the tag tree according to the following formula:
$$W = \lambda\,\mathrm{depth} + \sum_{i=1}^{n} \mathrm{SubW}_i$$
where SubW_i is the weight of a child node, λ is the weight adjustment parameter, and depth is the depth of the tree; if the weight of a node is 0, the node is regarded as an irrelevant tag node; if a tree node is an irrelevant tag node, the λ corresponding to that tree node is 0; if a tree node is a leaf node and is a text node or a picture node, its weight W = 1, and if it is a leaf node of any other type, its weight W = 0; the irrelevant tags comprise hyperlink tags and tags describing display properties;
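The bottom-up weight assignment of step 12) can be illustrated with the short sketch below. Several points are assumptions made only for the example: the first term of the formula is read as the product of λ and depth, depth is taken as the level of the node counted from the root (root = 1), text leaves carry the hypothetical tag `#text`, and the contents of `IRRELEVANT_TAGS` are sample hyperlink and display tags rather than a list given in the claim.

```python
from dataclasses import dataclass, field

# Illustrative choice of hyperlink and display-property tags (assumption).
IRRELEVANT_TAGS = {"a", "b", "i", "font"}
# Leaf tags treated as text or picture nodes (assumption).
CONTENT_LEAF_TAGS = {"#text", "img"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    weight: float = 0.0


def assign_weights(node, lam=1.0, depth=1):
    """Assign W = lam * depth + sum(child weights), children first."""
    if not node.children:
        # Leaf rule: text or picture leaves weigh 1, any other leaf weighs 0.
        node.weight = 1.0 if node.tag in CONTENT_LEAF_TAGS else 0.0
        return node.weight
    child_sum = sum(assign_weights(c, lam, depth + 1) for c in node.children)
    # For an irrelevant tag node, lambda is 0, so only the children count.
    node_lam = 0.0 if node.tag in IRRELEVANT_TAGS else lam
    node.weight = node_lam * depth + child_sum
    return node.weight
```

With these assumptions and λ = 1, a `ul` holding two `li` items, each containing a text leaf and a link with a text leaf, gives each `li` a weight of 4 and the `ul` a weight of 9, so repeated records end up with equal weights.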
The data record identification comprises the following steps:
13) the tag tree weighted in step 12) is fed as the input tree into the data record identification module; the module first accesses the template tree library and searches it by comparing against the essential-subtree-set weights of the template trees in the library; if the input tree contains the essential-subtree-set weights of a template tree, data record identification is performed on the input tree with that template tree to obtain the corresponding data record region; if no corresponding template tree is found in the template tree library, adaptive data record identification is performed, in which data record regions are identified by judging the similar subtrees within the tree's own contiguous regions, to obtain the corresponding data record region;
The data record alignment and extraction comprises the following steps:
14) the tag tree with the maximum weight in the array of data record regions obtained in step 13) is taken as the benchmark tree T_b; the objects of data record alignment and extraction are mainly the tag tree sets returned by data record identification, namely the array containing the data record regions; the maximum weight is chosen because that tree has the deepest or widest tree structure, so that other trees can be aligned with it more easily; then, for each remaining tag tree T_i in the data record region array, all nodes under T_i that can be aligned with the benchmark tree T_b are found, comparing weights first and tags second, in descending order of weight; if, for a node T_i[j], there is a node T_b[k] under the benchmark tree T_b whose weight is greater than or equal to the threshold K and whose node tag is the same, node T_i[j] is considered alignable; if no alignable node exists, an insertion operation is performed, inserting node T_i[j] into the benchmark tree T_b so that T_b is adjusted; the adjusted benchmark tree T_b is used to align the other tag trees in the data record region array, finally producing the final benchmark tree T_b;
The data storage comprises the following steps:
15) the tag tree set is matched against the attribute-labelled template tree produced by the alignment in step 14) to obtain the information, and the result is saved in a database; to store the data aligned in step 14), a database table is built from the attributes given by the attribute labels, so that during data alignment each tag tree in the tag tree set corresponds to one record in the database table;
In step 11), the HTML tags are divided into three classes according to their function: first, tags that plan the page layout, which provide the content information regions; second, tags that describe display properties, which describe how content is displayed; third, tags related to hyperlinks;
Before weights are assigned to each tree node in step 12), the webpage is denoised; the denoising step prunes the tag tree, which comprises: setting irrelevant tags as leaf node tags, setting the parent tag of adjacent text or picture nodes as an irrelevant tag, and setting the parent tag of a text or picture node without siblings as an irrelevant tag;
In step 13), data record identification requires comparing tag trees to judge their degree of similarity, using the following comparison method: if the subtree set of tag tree T1 and the subtree set of tag tree T2 have an intersection of equal weights, a subtree whose weight is greater than the threshold K must exist in it, and the equal-weight subtrees must also preserve their order, for example, when W[T1[i]] == W[T2[j]] and W[T1[k]] == W[T2[t]], j <= t if and only if i <= k; tag tree T1 is then considered similar to tag tree T2;
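One possible way to check the similarity condition above on two sequences of subtree weights is sketched below. It is an interpretation, not the claimed procedure: the order requirement is formalized as "all shared weights can be matched in the same left-to-right order", tested with a longest-common-subsequence length, and the names `lcs_length` and `similar` are hypothetical.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similar(w1, w2, k):
    """w1, w2: child-subtree weights of T1 and T2 in document order."""
    shared = Counter(w1) & Counter(w2)   # equal-weight intersection
    if not shared:
        return False
    if max(shared) <= k:                 # no shared subtree heavier than K
        return False
    # Order is preserved when every shared weight fits into one
    # order-respecting matching between the two sequences.
    return lcs_length(w1, w2) == sum(shared.values())
```

For example, `similar([5, 1, 3], [5, 3], 2)` is true because the shared weights 5 and 3 occur in the same order in both trees, while `similar([5, 3], [3, 5], 2)` is false because their order is reversed.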
In step 14), the insertion position of the insertion operation is determined as follows: if the node sequence T_i[j]...T_i[m] has two adjacent sibling nodes under its common parent node in tag tree T_i, one on its far left and one on its far right, and both sibling nodes have corresponding aligned nodes under the benchmark tree T_b, then the node sequence T_i[j]...T_i[m] can be uniquely inserted between those two adjacent sibling nodes under T_b; if the node sequence T_i[j]...T_i[m] has only one adjacent sibling node k, on its left, under its common parent node in tag tree T_i, and node k corresponds to the rightmost node under the benchmark tree T_b, then the node sequence can be uniquely inserted at the rightmost position under T_b, to the right of node k; if the node sequence T_i[j]...T_i[m] has only one adjacent sibling node k, on its right, under its common parent node in tag tree T_i, and node k corresponds to the leftmost node under the benchmark tree T_b, then the node sequence can be uniquely inserted at the leftmost position under T_b, to the left of node k; if the position under the benchmark tree T_b of a non-aligned node k of tag tree T_i cannot be uniquely determined, no insertion is performed and tag tree T_i is instead put into the temporary data record array.
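The three insertion cases above can be condensed into a small helper, shown here as a sketch. The function name `insertion_index` and its interface are hypothetical: `left` and `right` are the positions, in the sibling list under T_b, of the nodes aligned with the sequence's left and right neighbors, `None` means the sequence has no neighbor on that side, and neighbors that exist are assumed to be aligned already.

```python
def insertion_index(left, right, base_len):
    """Return the index in T_b's sibling list at which to insert, or None.

    None means the position is ambiguous, so the tag tree should be deferred
    to the temporary data record array instead of being inserted.
    """
    if left is not None and right is not None:
        # Both neighbors aligned: unique only if they are adjacent in T_b.
        return right if right == left + 1 else None
    if left is not None:
        # Only a left neighbor: it must be the rightmost node under T_b.
        return base_len if left == base_len - 1 else None
    if right is not None:
        # Only a right neighbor: it must be the leftmost node under T_b.
        return 0 if right == 0 else None
    return None
```

With a returned index `idx`, the sequence can be spliced into the benchmark tree's sibling list, for example with `siblings[idx:idx] = sequence`.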

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110438187.XA CN102591931B (en) 2011-12-23 2011-12-23 Recognition and extraction method for webpage data records based on tree weight

Publications (2)

Publication Number Publication Date
CN102591931A CN102591931A (en) 2012-07-18
CN102591931B true CN102591931B (en) 2015-03-18

Family

ID=46480573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110438187.XA Active CN102591931B (en) 2011-12-23 2011-12-23 Recognition and extraction method for webpage data records based on tree weight

Country Status (1)

Country Link
CN (1) CN102591931B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346405B (en) * 2013-08-08 2018-05-22 阿里巴巴集团控股有限公司 A kind of method and device of the Extracting Information from webpage
CN105786951A (en) * 2015-12-31 2016-07-20 北京金山安全软件有限公司 Method and device for extracting content blocks in webpage and server
CN108874934B (en) * 2018-06-01 2021-11-30 百度在线网络技术(北京)有限公司 Page text extraction method and device
CN110309394B (en) * 2019-06-14 2021-06-04 中国建设银行股份有限公司 Method and system for capturing webpage structured data
CN115344571B (en) * 2022-05-20 2023-05-23 药渡经纬信息科技(北京)有限公司 Universal data acquisition and analysis method, system and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833554A (en) * 2009-03-09 2010-09-15 富士通株式会社 Method and equipment for producing extraction template and method and equipment for extracting content on web pages

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7340674B2 (en) * 2002-12-16 2008-03-04 Xerox Corporation Method and apparatus for normalizing quoting styles in electronic mail messages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
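"Web网页中动态数据区域的识别与抽取" (Identification and extraction of dynamic data regions in Web pages); Huang Jianbin (黄健斌) et al.; Computer Engineering (《计算机工程》); 2007-06-30; Vol. 33, No. 11; full text *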

Also Published As

Publication number Publication date
CN102591931A (en) 2012-07-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant