CN102375847A - Method and device for forming merge tree for generating document template - Google Patents

Info

Publication number
CN102375847A
Authority
CN
China
Prior art keywords
tree
trees
subtree
node
subtrees
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102607472A
Other languages
Chinese (zh)
Other versions
CN102375847B (en)
Inventor
王新文
夏迎炬
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201010260747.2A priority Critical patent/CN102375847B/en
Publication of CN102375847A publication Critical patent/CN102375847A/en
Application granted granted Critical
Publication of CN102375847B publication Critical patent/CN102375847B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a method and a device for forming a merge tree for generating a document template. The method comprises: a similarity calculating step of, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, calculating the similarity of subtrees on the same layer of the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the trees; a merging step of forming an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and a post-processing step of post-processing the initial merge tree to obtain a merge tree by removing invalid subtrees of the initial merge tree.

Description

Method and device for forming a merge tree for generating a document template
Technical field
The present invention relates generally to the computer field and, more specifically, to a method and a device for forming a merge tree for generating a document template.
Background art
With the rapid development of the Internet and electronic technology, people are no longer restricted by geography and can easily exchange all kinds of information online. With the participation of a large number of users, the web pages of websites (such as forums, blogs, and product catalogue sites) contain a large amount of useful information, which is valuable both to individuals and to enterprises.
To obtain this useful information, the web pages of a website must be downloaded and then analyzed for extraction.
Most web pages of the same website share a similar structure and composition. If a template of these pages is available, extracting the useful information after removing the noise becomes simple and accurate. How to generate a correct template is therefore the key issue.
Templates were originally generated by hand. Because websites are numerous and their templates change, generating and maintaining templates over the long term is time-consuming and laborious.
Summary of the invention
In view of the above, an object of the present invention is to propose a method that forms a merge tree by comparing and merging a plurality of trees parsed from a plurality of pages, so as to improve the accuracy of templates generated from the merge tree.
Another object of the present invention is to propose a method that generates the template of a website's pages by generalizing and extracting from the merge tree according to node features, so that template production becomes simple.
According to one aspect of the present invention, a method of forming a merge tree for generating a document template is provided, comprising the following steps:
a similarity calculating step of, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, calculating the similarity of subtrees located on the same layer of the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging step of forming an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing step of post-processing the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
According to another aspect of the present invention, a device for forming a merge tree for generating a document template is provided, comprising:
a similarity calculating unit configured to, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, calculate the similarity of subtrees located on the same layer of the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging unit configured to form an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing unit configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
One benefit obtainable with the method and device according to embodiments of the invention is that merging a plurality of trees parsed from a plurality of pages into a merge tree used for template generation can improve the accuracy of the generated template. Further, generalizing and extracting from the merge tree according to node features reduces the risk that small variations in a page lead to a wrong template. In addition, accuracy in different situations can be improved by changing certain parameters. Another obtainable benefit is that generalizing over a plurality of pages clearly identifies the nodes in a template path that tend to change; adding this node-change information to the path template reduces the time later spent on information extraction and increases its accuracy, which makes template generation more flexible. A further obtainable benefit is that generalizing and extracting information paths according to node features makes template production automatic and simple, and comparing extraction results with previously stored results makes it possible to detect in time that a template has changed and to modify it.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be readily understood from the following description of preferred embodiments, which illustrate the gist of the invention and its use, and from the accompanying drawings. The components in the drawings are not necessarily drawn to scale and merely illustrate the principles of the invention. To make some parts of the invention easier to illustrate and describe, the corresponding parts in the drawings may be enlarged, that is, drawn larger relative to other parts than they would be in an exemplary device according to the invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals.
Fig. 1 is a general flowchart of a method of forming a merge tree for generating a document template according to an embodiment of the invention;
Fig. 2 is a general flowchart of a concrete example applying the method shown in Fig. 1;
Fig. 3 is a general flowchart of the similarity calculating step applied in a concrete example of the method of the embodiment shown in Fig. 1;
Fig. 4 is a general flowchart of the post-processing step applied in a concrete example of the method of the embodiment shown in Fig. 1;
Fig. 5 is a simplified block diagram of a device for forming a merge tree for generating a document template according to an embodiment of the invention; and
Fig. 6 is a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the invention.
Detailed description of embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. Elements and features described in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that, for clarity, the drawings and the description omit the representation and description of components and processing that are irrelevant to the invention or known to those of ordinary skill in the art.
Fig. 1 shows a general flowchart of a method 100 of forming a merge tree for generating a document template according to an embodiment of the invention. As shown in Fig. 1, the method starts at step S110. In similarity calculating step S120, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, the similarity of subtrees on the same layer of the two trees under comparison is calculated, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold and the common root node of those similar subtrees. In merging step S130, an initial merge tree is formed from the similar subtrees extracted from all the trees, the root node of the initial merge tree being the common root node of the similar subtrees of all the trees. Then, in post-processing step S140, the initial merge tree is post-processed to obtain the merge tree by removing its invalid subtrees.
A concrete example of the method according to the embodiment of Fig. 1 is described in detail below with reference to Figs. 2 to 4.
Fig. 2 shows a general flowchart of a concrete example applying the method shown in Fig. 1. As shown in Fig. 2, in step S212, n pages of a website, for example n topic pages, are obtained and parsed into n DOM (Document Object Model) trees (n > 1). Preferably, the n topic pages of the same website whose URLs (uniform resource locators) are most similar are chosen, so as to improve the accuracy of the generated template.
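The text does not say how URL similarity is measured. As a minimal sketch, assuming the similarity is the ratio reported by Python's standard `difflib.SequenceMatcher` and that a page is ranked by its mean similarity to all other candidates, the selection of n pages could look like this (the function name and the sample URLs are hypothetical):

```python
from difflib import SequenceMatcher

def pick_similar_urls(urls, n):
    """Return the n URLs with the highest mean similarity to all other URLs."""
    def mean_similarity(u):
        others = [v for v in urls if v is not u]
        return sum(SequenceMatcher(None, u, v).ratio() for v in others) / len(others)
    return sorted(urls, key=mean_similarity, reverse=True)[:n]

urls = [
    "http://forum.example.com/thread/101",
    "http://forum.example.com/thread/102",
    "http://forum.example.com/thread/245",
    "http://forum.example.com/about",
]
chosen = pick_similar_urls(urls, 3)  # the three thread pages outrank the "about" page
```

Any other string similarity measure could be substituted; the point is only that structurally alike URLs tend to come from the same page template.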
In one example, the source code of the web pages can be downloaded in real time by a page download program. In another example, the source code of downloaded web pages can be saved as temporary web page files and stored in advance in a storage device (for example, a computer hard disk). The source code of a web page may be in HTML format. By parsing the source code, the DOM tree structure of the web page can be built using DOM technology. The DOM tree structure of a web page may contain one or more nodes.
Then, in step S214, the DOM trees are preprocessed according to certain rules, for example by pruning, to remove common interference items, that is, nodes that are useless for forming the DOM merge tree, such as comment nodes, script nodes, frame nodes, picture nodes, and data presentation nodes.
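The pruning in step S214 can be illustrated with a small sketch. It assumes a well-formed page parsed with Python's `xml.etree.ElementTree` (whose default parser already discards comments) and a hypothetical set of noise tag names; the actual pruning rules are not specified in the text.

```python
import xml.etree.ElementTree as ET

# Tag names treated as noise; this particular set is an assumption of the sketch.
NOISE_TAGS = {"script", "style", "iframe", "frame", "img"}

def prune(element):
    """Remove noise children in place, then recurse into the children that remain."""
    for child in list(element):
        if child.tag in NOISE_TAGS:
            element.remove(child)
        else:
            prune(child)

page = ET.fromstring(
    "<html><body><script>track()</script>"
    "<div><img src='a.png'/><p>post text</p></div></body></html>"
)
prune(page)
remaining = [el.tag for el in page.iter()]
```

Real-world HTML is rarely well-formed XML, so a production version would need a tolerant HTML parser in front of this step.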
Then, in similarity calculating step S220, the DOM trees are compared layer by layer to find similar subtrees on the same layer. The flow of the similarity calculating step in the method of forming a merge tree for generating a template according to an embodiment of the invention is illustrated below with reference to Fig. 3.
In similarity calculating step S320 shown in Fig. 3, first, in step S322, subtrees with identical root nodes are selected in order, one from each of any two of the n DOM trees, as the two subtrees to be compared; all leaf nodes of the two subtrees are taken out and a leaf node list is formed for each subtree. In step S324, for each leaf node, a path string from the leaf node to the root node of the subtree containing it is formed from the names of all parent nodes of the leaf node (that is, all nodes on the way up to the subtree root), and in step S325 the number PN of parent nodes of each leaf node is recorded. Next, in step S326, the identical paths (identical path strings) among the paths with identical leaf node names in the two subtrees are found, and the number LN of times each identical path occurs in each of the two subtrees is determined. The process proceeds to step S327, where the leaf node lists of the two subtrees are merged by combining leaf nodes with identical path strings, producing a total leaf node list. The number of leaf nodes in the total leaf node list is N, that is, the number of leaf nodes with distinct path strings (distinct paths) across the two subtrees. Then, in step S329, the similarity A of the two subtrees is calculated according to formula (1) below.
$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right)} \times \log \max PN_i \qquad (1)$$
where $PN_i$ denotes the number of parent nodes of the i-th leaf node in the total leaf node list, $\max PN_i$ denotes the maximum number of parent nodes over all leaf nodes of the two subtrees, i is an integer (i = 1, ..., N), and $LN_{i1}$ and $LN_{i2}$ denote the number of occurrences, in the first and the second of the two subtrees respectively, of the path identical to the path of the i-th leaf node. When $LN_{i1}$ and $LN_{i2}$ are equal for every leaf node (i = 1, ..., N), the numerator in formula (1) is set to a predetermined value, for example a value in the range 0.1 to 0.5.
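The quantities that formula (1) consumes can be gathered as in the following sketch. It is an illustration under assumptions: trees are represented as hypothetical `(tag, children)` tuples, path strings are written from the subtree root down to the leaf (the direction does not affect which paths are equal), and the sketch stops short of evaluating the formula itself.

```python
from collections import Counter

def leaf_paths(tree, ancestors=()):
    """Yield one path string per leaf, e.g. 'div/ul/li', from subtree root to leaf."""
    tag, children = tree
    path = ancestors + (tag,)
    if not children:
        yield "/".join(path)
    for child in children:
        yield from leaf_paths(child, path)

def formula_inputs(subtree1, subtree2):
    """Return (rows, max_pn), one row (path, PN_i, LN_i1, LN_i2) per distinct path."""
    ln1 = Counter(leaf_paths(subtree1))
    ln2 = Counter(leaf_paths(subtree2))
    total = sorted(set(ln1) | set(ln2))      # total leaf node list, N = len(total)
    rows = [(p, p.count("/"), ln1[p], ln2[p]) for p in total]
    return rows, max(pn for _, pn, _, _ in rows)

t1 = ("div", [("ul", [("li", []), ("li", [])]), ("p", [])])
t2 = ("div", [("ul", [("li", [])]), ("span", [])])
rows, max_pn = formula_inputs(t1, t2)
```

Here `PN_i` is taken as the number of ancestors on the path, and a path that is absent from one subtree simply gets a count of 0 for that subtree.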
If the similarity A is greater than or equal to a predetermined similarity threshold (the first threshold), the two subtrees under comparison are determined to be similar subtrees; otherwise they are dissimilar subtrees. When the two subtrees are judged similar, the next subtrees with identical root nodes are selected in order from each of the two DOM trees under comparison and compared, until the subtrees of all layers of the two DOM trees have been compared. When the two subtrees are judged dissimilar, the next subtree of the second DOM tree that has the same root node as the current subtree under comparison is selected as the subtree to be compared and is compared with the current subtree of the first DOM tree to judge whether they are similar. When the selected subtree of the second DOM tree is the last subtree on a layer, subtrees on that layer can be selected cyclically, so that an earlier subtree that was determined to be dissimilar when compared with other subtrees of the first DOM tree is selected as the subtree to be compared.
In this way, the subtrees of each layer of the two DOM trees are compared to extract the similar subtrees whose similarity is greater than or equal to the predetermined first threshold and the common root node of those similar subtrees.
Next, in merging step S230, the trees are merged according to the similar subtrees to generate a first DOM merge tree. Specifically, the similar subtrees of the two DOM trees are merged according to certain rules, while subtrees of the two DOM trees under comparison that were determined to be dissimilar are retained unchanged in the first DOM merge tree. In one embodiment, the similar subtrees of two DOM trees can be merged as follows: the corresponding root node of the similar subtrees is used as the root node of the merged tree, and the weight of this node is updated at the same time, for example by adding the respective weights of the root nodes of the two subtrees and the similarity to obtain the weight of the root node of the merged tree, the initial weight being set, for example, to 0; the corresponding parameters of the nodes (identical parameters need to be kept only once) and the corresponding text content are also merged. After the root nodes are merged, the child subtrees of the current two subtrees are merged in turn in the same way, and this is repeated until all nodes have been merged.
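The weight update and recursive merge described above can be sketched as follows. This is a simplified illustration, not the patented procedure: the `Node` class is hypothetical, children are assumed to correspond when their tags match, a single similarity value is reused at every level (the text computes one per subtree pair), and text content is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    weight: float = 0.0                      # initial weight 0, as in the text
    params: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def merge(a, b, similarity):
    """Merge two similar subtrees whose roots correspond; returns a new Node."""
    merged = Node(a.tag, a.weight + b.weight + similarity, {**a.params, **b.params})
    unmatched = list(b.children)
    for child in a.children:
        # Assumption: children correspond when their tags match (first match wins).
        partner = next((c for c in unmatched if c.tag == child.tag), None)
        if partner is None:
            merged.children.append(child)    # no counterpart: keep unchanged
        else:
            unmatched.remove(partner)
            merged.children.append(merge(child, partner, similarity))
    merged.children.extend(unmatched)        # leftovers from the second tree
    return merged

a = Node("div", children=[Node("p"), Node("ul")])
b = Node("div", children=[Node("p"), Node("span")])
m = merge(a, b, similarity=0.8)
```

Merging the parameter dictionaries with `{**a.params, **b.params}` keeps each identical parameter once, which mirrors the remark that identical parameters need to be kept only once.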
When the number of DOM trees is greater than 2 (n > 2), the process returns to similarity calculating step S220, where another DOM tree is selected from the n DOM trees and compared with the first DOM merge tree, so as to extract the similar subtrees and their common root node from these two trees under comparison. The process then proceeds to merging step S230, where the extracted similar subtrees and their common root node are used to form a second DOM merge tree. The similarity calculating step and the merging step are repeated in this way until all of the n DOM trees have been selected. This process generates the initial merge tree. The path part of the initial merge tree consists of the common path part of all the DOM trees. In other words, the initial merge tree takes the deepest common root node as its parent node and adds the subtrees extracted from the other trees as its subtrees.
In an alternative embodiment, every two of the n DOM trees can first be compared in similarity calculating step S220, and the n DOM trees can be merged in pairs in merging step S230 to obtain first merge trees. When the number of DOM trees is greater than 2 (n > 2), the process returns to similarity calculating step S220, where every two of the first merge trees obtained by the pairwise merging are compared, and the first merge trees are merged in pairs in merging step S230. The similarity calculating step and the merging step are repeated in this way until a single merge tree is finally obtained, which serves as the initial merge tree.
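The two merge orders described above differ only in how the trees are scheduled. The sketch below makes that visible with a stand-in `merge` that merely records which inputs were combined; the function names and the handling of an odd tree out are assumptions of the sketch.

```python
from functools import reduce

def merge(left, right):
    """Stand-in for 'compare, then merge'; it only records which inputs were combined."""
    return (left, right)

def merge_sequentially(trees):
    """Fold the trees one by one into a running merge tree."""
    return reduce(merge, trees)

def merge_pairwise(trees):
    """Merge neighbours in rounds until a single tree is left."""
    while len(trees) > 1:
        merged = [merge(trees[i], trees[i + 1]) for i in range(0, len(trees) - 1, 2)]
        if len(trees) % 2:               # an odd tree out advances to the next round
            merged.append(trees[-1])
        trees = merged
    return trees[0]

trees = ["t1", "t2", "t3", "t4"]
sequential = merge_sequentially(trees)
pairwise = merge_pairwise(trees)
```

With four trees the sequential order nests as `((("t1", "t2"), "t3"), "t4")`, while the pairwise order gives `(("t1", "t2"), ("t3", "t4"))`.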
Next, the process proceeds to post-processing step S240, where the initial merge tree is post-processed according to certain rules and features to obtain the DOM merge tree by removing invalid subtrees. The flow of the post-processing step in the method of forming a merge tree for generating a document template according to an embodiment of the invention is illustrated below with reference to Fig. 4.
In post-processing step S440 shown in Fig. 4, first, in step S442, a node of the initial merge tree is taken and all of its attributes and its weight are obtained. In one embodiment, the weight is the sum of all the similarities obtained in the similarity calculating step for every comparison involving the subtree rooted at this node. In another embodiment, the weight is the normalized value obtained by normalizing this sum of similarities. The attributes are, for example, the text content and the original parameters of the node. After interfering nodes are removed according to their attributes in step S444, the process proceeds to step S446, where the initial merge tree is processed according to how each weight compares with a cutoff threshold (the second threshold) and a window threshold (the third threshold). Specifically, for a node whose weight is less than the cutoff threshold, the subtree rooted at that node is discarded, whereas a node whose weight is greater than the window threshold is retained. Nodes whose weight is greater than or equal to the cutoff threshold but less than or equal to the window threshold are verified against certain features in step S448. The features used are, for example, text features, parameter features, and node type features. The process then proceeds to step S449: a node that passes verification has its weight increased and is retained, while for a node that fails verification the subtree rooted at that node is discarded, so as to eliminate misjudgments and/or erroneous statistics.
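The three-way decision on node weights can be sketched as below. The dictionary node layout, the `verify` callback, and the size of the weight bonus are hypothetical; the text only says that verified nodes have their weight increased.

```python
def triage(weight, cutoff, window):
    """Classify a node by weight: 'drop', 'verify', or 'keep'."""
    if weight < cutoff:
        return "drop"
    if weight > window:
        return "keep"
    return "verify"                      # cutoff <= weight <= window

def postprocess(node, cutoff, window, verify, bonus=0.1):
    """Return a pruned copy of node, or None if its whole subtree is discarded."""
    verdict = triage(node["weight"], cutoff, window)
    if verdict == "drop":
        return None
    if verdict == "verify":
        if not verify(node):
            return None
        node = {**node, "weight": node["weight"] + bonus}   # verified: raise weight
    kids = [postprocess(c, cutoff, window, verify, bonus) for c in node["children"]]
    return {**node, "children": [k for k in kids if k is not None]}

tree = {"tag": "div", "weight": 0.9, "children": [
    {"tag": "p", "text": "hello", "weight": 0.5, "children": []},
    {"tag": "span", "text": "", "weight": 0.5, "children": []},
    {"tag": "a", "weight": 0.1, "children": []},
]}
result = postprocess(tree, cutoff=0.3, window=0.7,
                     verify=lambda n: bool(n.get("text")))
```

In this toy run the `div` is kept outright, the `p` passes verification because it carries text, the empty `span` fails verification, and the `a` falls below the cutoff.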
In the above embodiment, post-processing to remove invalid subtrees is performed after the initial merge tree has been formed from all the DOM trees. In another embodiment, when the number of DOM trees is large, additional post-processing for removing invalid subtrees can be inserted into the process of forming the initial merge tree, to simplify subsequent processing. For example, the (n/2-1)-th merge tree formed from n/2 DOM trees is post-processed to remove invalid subtrees, and the similarity calculating step and the merging step are then carried out with the remaining n/2 DOM trees.
The process then proceeds to step S250. In step S250, effective information nodes are first selected from the DOM merge tree. For example, effective information nodes can be extracted as follows: all leaf nodes are extracted from the subtrees of the DOM merge tree, their paths are generated, and the total number of paths is obtained; the leaf nodes are first classified by path to obtain identical paths, and the nodes of each identical path are then classified by their text content; for each identical path, the ratio of the number of text content classes to the total number of paths is calculated. The larger the ratio, the greater the variation, and the more likely the path is an effective information path; otherwise the path is framework junk information.
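The ratio used to separate effective information paths from framework content can be sketched as follows. The sketch assumes that the "total number of paths" counts every leaf occurrence and that leaves arrive as hypothetical `(path, text)` pairs.

```python
from collections import defaultdict

def variation_ratios(leaves):
    """leaves: (path, text) pairs, one per leaf node of the merge tree."""
    total_paths = len(leaves)                 # assumption: every leaf occurrence counts
    texts_by_path = defaultdict(set)
    for path, text in leaves:
        texts_by_path[path].add(text)         # classify by path, then by text content
    return {p: len(texts) / total_paths for p, texts in texts_by_path.items()}

leaves = [
    ("html/body/div/p", "First post"),
    ("html/body/div/p", "Second post"),
    ("html/body/div/p", "Third post"),
    ("html/body/div/a", "Home"),
    ("html/body/div/a", "Home"),
]
ratios = variation_ratios(leaves)
```

The post path has three distinct texts out of five leaves (ratio 0.6) and is the likelier information path, while the navigation link repeats the same text (ratio 0.2) and looks like framework content.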
Next, the DOM merge tree is generalized and extracted according to the features of the selected nodes to select the required information paths. The features may include at least one of parameter features, node type features, and text features. In one embodiment, the information path of a node may take the form of a string. The information path may contain information about the route from the root node of the corresponding DOM tree structure to a given node, specifically, information about each node that must be passed through to reach that node from the root node. In one example, the information may include the name (such as the tag) and the serial number of a node. The serial number indicates the position of the node in the corresponding DOM tree structure. In one example, the serial number can indicate the position of the node within its layer of the corresponding DOM tree structure. Generalizing and extracting from a DOM tree according to node features is a technique well known to those skilled in the art, and its details are not repeated here.
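An information path made of tag names and serial numbers could be built as in the sketch below. The `tag[index]` notation and the choice of the index as the position among siblings are assumptions; the text does not fix a concrete string format.

```python
import xml.etree.ElementTree as ET

def information_paths(root):
    """Map each element to a string such as 'html[0]/body[0]/div[1]'."""
    paths = {}
    def walk(element, prefix, index):
        step = f"{element.tag}[{index}]"
        path = f"{prefix}/{step}" if prefix else step
        paths[element] = path
        for i, child in enumerate(element):
            walk(child, path, i)          # index = position among siblings
    walk(root, "", 0)
    return paths

root = ET.fromstring("<html><body><div>menu</div><div><p>post</p></div></body></html>")
paths = information_paths(root)
post_path = paths[root.find("./body/div/p")]
```

For the paragraph inside the second `div`, the sketch produces `html[0]/body[0]/div[1]/p[0]`, which names every node passed on the way from the root.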
Afterwards, the selected information paths are saved, thereby generating the page template. Once the information paths of all the selected nodes have been generated, a page template containing the information paths of these nodes is obtained. In one example, the generated template may be in XML (Extensible Markup Language) file format.
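Saving the selected paths as an XML template might look like the following sketch. The element and attribute names (`template`, `field`, `site`, `name`) and the path strings are invented for illustration; the text only states that the template may be an XML file.

```python
import xml.etree.ElementTree as ET

def build_template(site, info_paths):
    """Serialize the chosen information paths; element names here are made up."""
    template = ET.Element("template", {"site": site})
    for name, path in info_paths.items():
        ET.SubElement(template, "field", {"name": name}).text = path
    return ET.tostring(template, encoding="unicode")

xml_text = build_template(
    "forum.example.com",
    {"title": "html[0]/body[0]/h1[0]", "post": "html[0]/body[0]/div[1]/p[0]"},
)
```

The resulting string can be written to a file and parsed back later to drive extraction from new pages of the same site.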
The process then proceeds to step S260, where the generated template is used to extract information from a plurality of pages, and in step S262 the accuracy of the information extraction is compared with a predetermined threshold. If the accuracy is greater than the predetermined threshold, the generated template is judged to be correct and the process ends. Otherwise, the DOM merge tree is judged to be wrong and a parameter is changed, for example the similarity threshold, the cutoff threshold, or the window threshold, and the process returns to similarity calculating step S220 to determine the similar subtrees again.
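The feedback loop of steps S260 and S262 can be sketched as a retry loop. The text does not say how the parameters are changed, so lowering the similarity threshold by a fixed step is purely an assumption here, as are the callback names and the round limit.

```python
def tune_until_accurate(build_template, measure_accuracy, required,
                        thresholds, step=0.05, max_rounds=10):
    """Rebuild the template with adjusted thresholds until accuracy exceeds `required`."""
    for _ in range(max_rounds):
        template = build_template(**thresholds)
        if measure_accuracy(template) > required:
            return template, thresholds
        # Assumption: relax the similarity threshold a little each round.
        thresholds = {**thresholds, "similarity": thresholds["similarity"] - step}
    return None, thresholds

# Toy stand-ins: the "template" just remembers the threshold it was built with,
# and accuracy improves as the similarity threshold is lowered.
template, used = tune_until_accurate(
    build_template=lambda similarity, cutoff, window: {"similarity": similarity},
    measure_accuracy=lambda t: 1.0 - t["similarity"],
    required=0.22,
    thresholds={"similarity": 0.9, "cutoff": 0.3, "window": 0.7},
)
```

In practice `build_template` would rerun the similarity, merging, and post-processing steps, and `measure_accuracy` would compare extraction results against known-good data.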
In the embodiments of the method of forming a merge tree for generating a document template described above, a Document Object Model (DOM) tree structure is used for illustration. Those skilled in the art will appreciate that a page can also be parsed into another type of tree from whose nodes the required features can be extracted, for example a tag tree.
Further embodiments of the present invention also provide a device for forming a merge tree for generating a document template. Fig. 5 shows a simplified block diagram of this device 500. As shown in the figure, the device 500 comprises: a similarity calculating unit 520 configured to, when each tree of n trees (for example, DOM trees) parsed from n pages is compared with another tree, calculate the similarity of subtrees on the same layer of the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold and the common root node of those similar subtrees; a merging unit 530 configured to form an initial merge tree from the similar subtrees extracted from all the trees, the root node of the initial merge tree being the common root node of the similar subtrees of all the trees; and a post-processing unit 540 configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
When the number of trees is greater than 2 (n > 2), the processing performed by the similarity calculating unit and the processing performed by the merging unit are carried out in a loop. In one embodiment, the similarity calculating unit 520 compares the first tree and the second tree of the n trees to extract the similar subtrees and their common root node from the first and second trees, and the merging unit 530 uses the extracted similar subtrees and common root node to form a first merge tree. Then, the similarity calculating unit 520 compares the third tree of the n trees with the first merge tree to extract similar subtrees and a common root node, and the merging unit 530 uses them to form a second merge tree. This continues in a loop. Finally, the similarity calculating unit 520 compares the n-th tree of the n trees with the (n-2)-th merge tree to extract similar subtrees and a common root node, and the merging unit 530 uses them to form the (n-1)-th merge tree, which serves as the initial merge tree.
The device 500 shown in Fig. 5 and its constituent units 520 to 540 can be configured to perform the various operations described above with reference to Figs. 1 to 4. For further details of these operations, reference may be made to the embodiments, implementations, and examples described above; they are not described in detail here.
The foregoing detailed description has set forth various implementations of the device and/or method according to embodiments of the invention by means of block diagrams, flowcharts, and/or examples. Where such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be apparent to those skilled in the art that each function and/or operation in them can be implemented, individually and/or jointly, by a wide range of hardware, software, firmware, or virtually any combination thereof. In one embodiment, several portions of the subject matter described in this specification may be implemented by application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or other integrated forms. However, those skilled in the art will recognize that some aspects of the implementations described in this specification can be equivalently implemented, in whole or in part, in integrated circuits, as one or more computer programs running on one or more computers (for example, as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (for example, on one or more microprocessors), as firmware, or as virtually any combination thereof, and that, in light of this disclosure, designing the circuitry and/or writing the code for the software and/or firmware is well within the skill of those skilled in the art.
For example, each constituent module, unit, and subunit of the device 500 can be configured by software, firmware, hardware, or any combination thereof. When implemented by software or firmware, a program constituting the software can be installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 600 shown in Fig. 6), and the computer can perform various functions when various programs are installed.
Fig. 6 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the invention.
In Fig. 6, a central processing unit (CPU) 601 performs various processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, data required when the CPU 601 performs the various processing. The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are also connected to the input/output interface 605: an input section 606 (including a keyboard, a mouse, and the like), an output section 607 (including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a loudspeaker), a storage section 608 (including a hard disk and the like), and a communication section 609 (including a network interface card such as a LAN card, a modem, and the like). The communication section 609 performs communication processing via a network such as the Internet. A drive 610 may also be connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
When the above series of processing is implemented by software, the program constituting the software can be installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 611 include magnetic disks (including floppy disks), optical discs (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)), and semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk contained in the storage section 608, or the like, which stores the program and is distributed to the user together with the device containing it.
Accordingly, the present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above methods according to embodiments of the invention can be performed. Correspondingly, the various storage media listed above for carrying such a program product are also included in the disclosure of the invention.
In the above description of specific embodiments of the invention, features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components.
In addition, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
As can be seen from the above description of the embodiments, the technical solutions covered by the present invention include, but are not limited to, the content of the following remarks:
Remark 1. A method of forming a merge tree for generating a document template, comprising the following steps:
a similarity calculation step of, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, calculating the similarity of subtrees located at the same layer in the two compared trees, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging step of forming an initial merge tree using the similar subtrees extracted from all of the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all of the trees; and
a post-processing step of post-processing the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
Remark 2. The method according to Remark 1, wherein the similarity calculation step and the merging step comprise:
comparing a first tree of the plurality of trees with a second tree of the plurality of trees, so as to form a first merge tree using the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
comparing an n-th tree with an (n-2)-th merge tree, so as to form an (n-1)-th merge tree using the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, wherein n is an integer greater than or equal to 3.
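The incremental scheme of this remark can be read as a left fold over the parsed trees. The following is a minimal sketch under stated assumptions, not the patented implementation: each tree is modelled as a set of root-to-leaf path strings, and the hypothetical helper `merge_pair` uses plain set intersection as a placeholder for the similarity-based extraction of similar subtrees. The names `merge_pair`, `merge_all` and the sample paths are illustrative only.

```python
from functools import reduce


def merge_pair(tree_a, tree_b):
    """Hypothetical stand-in for comparing two trees and keeping what is similar.

    Each tree is modelled as a set of root-to-leaf path strings and only the
    shared paths are kept. The method described in the text uses a similarity
    measure and a first threshold; intersection is only a placeholder.
    """
    return tree_a & tree_b


def merge_all(trees):
    """Fold the trees from left to right.

    The first and second trees give the first merge tree; for n >= 3 the
    n-th tree is compared with the (n-2)-th merge tree, which gives the
    (n-1)-th merge tree.
    """
    if len(trees) < 2:
        raise ValueError("at least two trees are needed")
    return reduce(merge_pair, trees)


# Three illustrative pages that share a heading and a paragraph.
pages = [
    {"html/body/div/h1", "html/body/div/p", "html/body/ul/li"},
    {"html/body/div/h1", "html/body/div/p", "html/body/table/tr"},
    {"html/body/div/h1", "html/body/div/p", "html/body/span"},
]
merged = merge_all(pages)
```

With N input trees the fold performs N-1 pairwise comparisons, and only the latest merge tree has to be kept in memory between steps.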
Remark 3. The method according to Remark 1 or 2, wherein the similarity calculation step comprises: selecting in order, from each of the two compared trees, subtrees having the same root node as the two subtrees to be compared; forming, for each of the two subtrees, the paths from all of its leaf nodes to the root node of that subtree; determining, for each of the two subtrees, the number of identical paths among the paths whose leaf nodes have the same name; and calculating the similarity A of the two subtrees according to the following formula:

$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right) \times \log \max PN_i}$$

where $N$ denotes the total number of leaf nodes with mutually different paths in the two subtrees, $PN_i$ denotes the number of parent nodes of the i-th leaf node, $\max PN_i$ denotes the maximum of the number of parent nodes over all leaf nodes of the two subtrees, $i = 1, \ldots, N$, and $LN_{i1}$ and $LN_{i2}$ denote, for the i-th leaf node, the number of identical paths in the first and the second subtree respectively.
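The quantities that enter the similarity formula can be gathered by enumerating root-to-leaf paths and counting repeated paths in each subtree. The sketch below is illustrative only: the `(name, children)` tuple representation, the function names, and the reading of the number of parent nodes of a leaf as the number of ancestors on its path are all assumptions, and the sketch stops at collecting the inputs rather than evaluating A.

```python
from collections import Counter


def leaf_paths(node, prefix=()):
    """Yield one root-to-leaf path (a tuple of node names) per leaf.

    A node is a (name, children) pair; this representation is an assumption.
    """
    name, children = node
    path = prefix + (name,)
    if not children:
        yield path
    else:
        for child in children:
            yield from leaf_paths(child, path)


def path_statistics(subtree_1, subtree_2):
    """Return {path: (PN, LN1, LN2)} for every distinct path.

    PN is taken here as the number of ancestors of the leaf on that path
    (an assumed reading); LN1 and LN2 count how often the same path occurs
    in the first and in the second subtree.
    """
    counts_1 = Counter(leaf_paths(subtree_1))
    counts_2 = Counter(leaf_paths(subtree_2))
    stats = {}
    for path in sorted(set(counts_1) | set(counts_2)):
        stats[path] = (len(path) - 1, counts_1[path], counts_2[path])
    return stats


# Two illustrative subtrees with the same root node name.
t1 = ("div", [("ul", [("li", []), ("li", [])]), ("p", [])])
t2 = ("div", [("ul", [("li", [])]), ("p", []), ("span", [])])
stats = path_statistics(t1, t2)
```

Here `len(stats)` plays the role of N, and each value supplies one triple of inputs for the corresponding term of the sums.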
Remark 4. The formation method according to any one of Remarks 1-3, wherein the post-processing step comprises processing those subtrees of the initial merge tree whose root node has a similarity-related weight falling within a threshold range that is equal to or greater than a predetermined second threshold and equal to or less than a predetermined third threshold, so as to eliminate misjudgments and/or erroneous accumulation.
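Selecting the subtrees that fall inside the threshold band can be sketched as a simple tree walk. This is a hedged illustration: the `Node` class, the numeric weights and the threshold values 0.4 and 0.6 are assumptions, and the sketch only identifies the candidate subtrees, since the remark does not spell out how they are subsequently handled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    weight: float  # similarity-related weight of the subtree rooted here (assumed numeric)
    children: list = field(default_factory=list)


def subtrees_in_band(root, second_threshold, third_threshold):
    """Collect the roots of subtrees whose weight w satisfies
    second_threshold <= w <= third_threshold."""
    selected = []
    stack = [root]
    while stack:
        node = stack.pop()
        if second_threshold <= node.weight <= third_threshold:
            selected.append(node)
        stack.extend(node.children)
    return selected


# Illustrative initial merge tree: only "div" lies inside the band [0.4, 0.6].
tree = Node("body", 0.95, [
    Node("div", 0.55, [Node("p", 0.9)]),
    Node("ul", 0.2),
])
flagged = [n.name for n in subtrees_in_band(tree, 0.4, 0.6)]
```

Both bounds are inclusive, matching the wording "equal to or greater than" and "equal to or less than".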
Remark 5. The formation method according to any one of Remarks 1-4, further comprising a step of preprocessing the plurality of parsed trees to remove nodes that play no role in forming the merge tree.
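A preprocessing pass of this kind might be sketched as a recursive prune. Which nodes count as playing no role is not stated in this remark, so the `IGNORED` set below (script, style and noscript elements) is purely an assumption for illustration, as are the `(name, children)` representation and the function name `prune`.

```python
# Assumed list of node names that contribute nothing to the merge tree.
IGNORED = frozenset({"script", "style", "noscript"})


def prune(node, ignored=IGNORED):
    """Return a copy of the tree without ignored nodes and their descendants.

    A node is a (name, children) pair; the input tree is left unchanged.
    """
    name, children = node
    kept = [prune(child, ignored) for child in children if child[0] not in ignored]
    return (name, kept)


page = ("html", [
    ("head", [("style", [])]),
    ("body", [("script", []), ("div", [("p", [])])]),
])
cleaned = prune(page)
```

Pruning before comparison reduces the number of paths that the similarity calculation has to enumerate.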
Remark 6. The formation method according to any one of Remarks 1-5, further comprising an induction and extraction step after the post-processing step, for performing induction and extraction processing on the merge tree according to the features of the nodes of the merge tree, so as to select the required information paths.
Remark 7. The formation method according to Remark 6, wherein the features comprise at least one of a parameter feature, a node type feature and a text feature.
Remark 8. The formation method according to Remark 6 or 7, further comprising a step of generating the document template according to the selected required information paths.
Remark 9. A device for forming a merge tree for generating a document template, comprising:
a similarity calculation unit configured to, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, calculate the similarity of subtrees located at the same layer in the two compared trees, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging unit configured to form an initial merge tree using the similar subtrees extracted from all of the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all of the trees; and
a post-processing unit configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
Remark 10. The device according to Remark 9, wherein the similarity calculation unit and the merging unit comprise a cyclic similarity calculation subunit and a cyclic merging subunit, which are configured to:
compare a first tree of the plurality of trees with a second tree of the plurality of trees, so as to form a first merge tree using the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
compare an n-th tree with an (n-2)-th merge tree, so as to form an (n-1)-th merge tree using the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, wherein n is an integer greater than or equal to 3.
Remark 11. The device according to Remark 9 or 10, wherein the similarity calculation unit comprises: a subtree selection subunit configured to select in order, from each of the two compared trees, subtrees having the same root node as the two subtrees to be compared; a path forming subunit configured to form, for each of the two subtrees, the paths from all of its leaf nodes to the root node of that subtree; a determining subunit configured to determine, for each of the two subtrees, the number of identical paths among the paths whose leaf nodes have the same name; and a calculating subunit configured to calculate the similarity A of the two subtrees according to the following formula:

$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right) \times \log \max PN_i}$$

where $N$ denotes the total number of leaf nodes with mutually different paths in the two subtrees, $PN_i$ denotes the number of parent nodes of the i-th leaf node, $\max PN_i$ denotes the maximum of the number of parent nodes over all leaf nodes of the two subtrees, $i = 1, \ldots, N$, and $LN_{i1}$ and $LN_{i2}$ denote, for the i-th leaf node, the number of identical paths in the first and the second subtree respectively.
Remark 12. The device according to any one of Remarks 9-11, wherein the post-processing unit is configured to process those similar subtrees of the initial merge tree whose root node has a similarity-related weight falling within a threshold range that is equal to or greater than a predetermined second threshold and equal to or less than a predetermined third threshold, so as to eliminate misjudgments and/or erroneous accumulation.
Remark 13. The device according to any one of Remarks 9-12, further comprising a preprocessing unit configured to preprocess the plurality of parsed trees to remove nodes that play no role in forming the merge tree.
Remark 14. The device according to any one of Remarks 9-13, further comprising an induction and extraction unit configured to perform induction and extraction processing on the merge tree according to the features of the nodes of the merge tree, so as to select the required information paths.
Remark 15. The device according to Remark 14, wherein the features comprise at least one of a parameter feature, a node type feature and a text feature.
Remark 16. The device according to Remark 14 or 15, further comprising a template generation unit configured to generate the document template according to the selected required information paths.
Remark 17. A program product storing machine-readable instruction code which, when read and executed by a machine, can perform the method according to any one of claims 1-8.
Remark 18. A storage medium carrying the program product according to Remark 17.
Although the present invention has been disclosed above through the description of specific embodiments, it should be understood that those skilled in the art can devise various modifications, improvements or equivalents of the invention within the spirit and scope of the appended claims. Such modifications, improvements or equivalents should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A method of forming a merge tree for generating a document template, comprising the following steps:
a similarity calculation step of, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, calculating the similarity of subtrees located at the same layer in the two compared trees, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging step of forming an initial merge tree using the similar subtrees extracted from all of the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all of the trees; and
a post-processing step of post-processing the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
2. The method according to claim 1, wherein the similarity calculation step and the merging step comprise:
comparing a first tree of the plurality of trees with a second tree of the plurality of trees, so as to form a first merge tree using the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
comparing an n-th tree with an (n-2)-th merge tree, so as to form an (n-1)-th merge tree using the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, wherein n is an integer greater than or equal to 3.
3. The method according to claim 1 or 2, wherein the similarity calculation step comprises: selecting in order, from each of the two compared trees, subtrees having the same root node as the two subtrees to be compared; forming, for each of the two subtrees, the paths from all of its leaf nodes to the root node of that subtree; determining, for each of the two subtrees, the number of identical paths among the paths whose leaf nodes have the same name; and calculating the similarity A of the two subtrees according to the following formula:

$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right) \times \log \max PN_i}$$

where $N$ denotes the total number of leaf nodes with mutually different paths in the two subtrees, $PN_i$ denotes the number of parent nodes of the i-th leaf node, $\max PN_i$ denotes the maximum of the number of parent nodes over all leaf nodes of the two subtrees, $i = 1, \ldots, N$, and $LN_{i1}$ and $LN_{i2}$ denote, for the i-th leaf node, the number of identical paths in the first and the second subtree respectively.
4. The formation method according to any one of claims 1-3, wherein the post-processing step comprises processing those subtrees of the initial merge tree whose root node has a similarity-related weight falling within a threshold range that is equal to or greater than a predetermined second threshold and equal to or less than a predetermined third threshold, so as to eliminate misjudgments and/or erroneous accumulation.
5. The formation method according to any one of claims 1-4, further comprising a step of preprocessing the plurality of parsed trees to remove nodes that play no role in forming the merge tree.
6. The formation method according to any one of claims 1-5, further comprising an induction and extraction step after the post-processing step, for performing induction and extraction processing on the merge tree according to the features of the nodes of the merge tree, so as to select the required information paths.
7. The formation method according to claim 6, wherein the features comprise at least one of a parameter feature, a node type feature and a text feature.
8. The formation method according to claim 6 or 7, further comprising a step of generating the document template according to the selected required information paths.
9. A device for forming a merge tree for generating a document template, comprising:
a similarity calculation unit configured to, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, calculate the similarity of subtrees located at the same layer in the two compared trees, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging unit configured to form an initial merge tree using the similar subtrees extracted from all of the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all of the trees; and
a post-processing unit configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
10. The device according to claim 9, wherein the similarity calculation unit and the merging unit comprise a cyclic similarity calculation subunit and a cyclic merging subunit, which are configured to:
compare a first tree of the plurality of trees with a second tree of the plurality of trees, so as to form a first merge tree using the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
compare an n-th tree with an (n-2)-th merge tree, so as to form an (n-1)-th merge tree using the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, wherein n is an integer greater than or equal to 3.
CN201010260747.2A 2010-08-17 2010-08-17 Method and device for forming merge tree for generating document template Expired - Fee Related CN102375847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010260747.2A CN102375847B (en) 2010-08-17 2010-08-17 Method and device for forming merge tree for generating document template

Publications (2)

Publication Number Publication Date
CN102375847A true CN102375847A (en) 2012-03-14
CN102375847B CN102375847B (en) 2014-06-04

Family

ID=45794469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010260747.2A Expired - Fee Related CN102375847B (en) 2010-08-17 2010-08-17 Method and device for forming merge tree for generating document template

Country Status (1)

Country Link
CN (1) CN102375847B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902582A (en) * 2012-12-27 2014-07-02 中国移动通信集团湖北有限公司 Data warehouse redundancy reduction method and device
CN104636481A (en) * 2015-02-16 2015-05-20 浪潮集团有限公司 Webpage template extracting method and device
CN105531704A (en) * 2013-12-10 2016-04-27 株式会社日立制作所 Data processing method and data processing server
CN106815235A (en) * 2015-11-27 2017-06-09 广州市动景计算机科技有限公司 Super web page template generation method, device and page data transmission method
CN107423391A (en) * 2017-07-24 2017-12-01 福州大学 The information extracting method of Web page structural data
CN109726376A (en) * 2018-12-21 2019-05-07 上海众源网络有限公司 A kind of generation method of standard form, device and electronic equipment
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
WO2020063031A1 (en) * 2018-09-29 2020-04-02 Oppo广东移动通信有限公司 Method and apparatus for processing structured data, and storage medium and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090276506A1 (en) * 2008-05-02 2009-11-05 Yahoo! Inc. Generating document templates that are robust to structural variations
CN101694668A (en) * 2009-09-29 2010-04-14 百度在线网络技术(北京)有限公司 Method and device for confirming web structure similarity
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902582A (en) * 2012-12-27 2014-07-02 中国移动通信集团湖北有限公司 Data warehouse redundancy reduction method and device
CN103902582B (en) * 2012-12-27 2017-08-11 中国移动通信集团湖北有限公司 A kind of method and apparatus for reducing data warehouse data redundancy
CN105531704A (en) * 2013-12-10 2016-04-27 株式会社日立制作所 Data processing method and data processing server
CN104636481A (en) * 2015-02-16 2015-05-20 浪潮集团有限公司 Webpage template extracting method and device
CN106815235A (en) * 2015-11-27 2017-06-09 广州市动景计算机科技有限公司 Super web page template generation method, device and page data transmission method
CN107423391A (en) * 2017-07-24 2017-12-01 福州大学 The information extracting method of Web page structural data
CN107423391B (en) * 2017-07-24 2020-11-03 福州大学 Information extraction method of webpage structured data
WO2020063031A1 (en) * 2018-09-29 2020-04-02 Oppo广东移动通信有限公司 Method and apparatus for processing structured data, and storage medium and electronic device
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
CN109948123B (en) * 2018-11-27 2023-06-02 创新先进技术有限公司 Image merging method and device
CN109726376A (en) * 2018-12-21 2019-05-07 上海众源网络有限公司 A kind of generation method of standard form, device and electronic equipment

Also Published As

Publication number Publication date
CN102375847B (en) 2014-06-04

Similar Documents

Publication Publication Date Title
CN102375847B (en) Method and device for forming merge tree for generating document template
Sun et al. Dom based content extraction via text density
US7941420B2 (en) Method for organizing structurally similar web pages from a web site
US7660804B2 (en) Joint optimization of wrapper generation and template detection
CN102253937B (en) Method and related device for acquiring information of interest in webpages
CN102609474B (en) A kind of visit information supplying method and system
CN106776881A (en) A kind of realm information commending system and method based on microblog
US20090216708A1 (en) Structural clustering and template identification for electronic documents
CN105095444A (en) Information acquisition method and device
CN102163203A (en) Method and device for downloading web pages
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN103294781A (en) Method and equipment used for processing page data
US20120102390A1 (en) Method and apparatus for generating widget
CN102646095B (en) Object classifying method and system based on webpage classification information
CN102915361B (en) Webpage text extracting method based on character distribution characteristic
CN103246732A (en) Online Web news content extracting method and system
CN107590236B (en) Big data acquisition method and system for building construction enterprises
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN103095849B (en) A method and a system of spervised web service finding based on attribution forecast and error correction of quality of service (QoS)
CN103853770B (en) The method and system of model content in a kind of extraction forum Web pages
CN114398138A (en) Interface generation method and device, computer equipment and storage medium
Nethra et al. WEB CONTENT EXTRACTION USING HYBRID APPROACH.
CN113806647A (en) Method for identifying development framework and related equipment
US20120284224A1 (en) Build of website knowledge tables
CN114528811B (en) Article content extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140604

Termination date: 20180817
