CN102375847B - Method and device for forming merge tree for generating document template - Google Patents


Info

Publication number
CN102375847B
Authority
CN
China
Prior art keywords
tree
subtree
trees
node
subtrees
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010260747.2A
Other languages
Chinese (zh)
Other versions
CN102375847A (en)
Inventor
王新文
夏迎炬
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention relates to a method and a device for forming a merge tree for generating a document template. The method comprises: a similarity calculation step of, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, calculating the similarity of subtrees on the same layer in the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the trees; a merging step of forming an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and a post-processing step of post-processing the initial merge tree to obtain a merge tree by removing invalid subtrees of the initial merge tree.

Description

Method and device for forming a merge tree for generating a document template
Technical field
The present invention relates generally to the field of computers and, more specifically, to a method and a device for forming a merge tree for generating a document template.
Background
With the rapid development of the Internet and electronic technology, people are no longer restricted by geography and can easily exchange all kinds of information online. With the participation of large numbers of users, the web pages of websites (such as forums, blogs, and product catalog sites) contain a large amount of useful information, which is valuable not only to individuals but also to enterprises.
To obtain this useful information, multiple web pages of a website need to be downloaded for further analysis and extraction.
Most web pages of the same website have a similar structure and composition. If templates of these pages are used, removing noise and extracting the useful information becomes simple and accurate. How to generate a correct template is therefore the key issue.
Templates were originally generated by hand. However, because websites are numerous and their templates change, generating templates and maintaining them over the long term is time-consuming and laborious.
Summary of the invention
In view of the above, an object of the present invention is to propose a method of forming a merge tree by comparing and merging a plurality of trees parsed from a plurality of pages, so as to improve the accuracy of templates generated from the merge tree.
Another object of the present invention is to propose a method of generating a template for the web pages of a website by generalizing and extracting from the merge tree according to node features, so as to make template production simple.
According to one aspect of the present invention, a method for forming a merge tree for generating a document template is provided, comprising the following steps:
a similarity calculation step of, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, calculating the similarity of subtrees located on the same layer in the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold and the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging step of forming an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing step of post-processing the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree,
wherein the similarity calculation step and the merging step comprise: comparing a first tree of the plurality of trees with a second tree of the plurality of trees, so as to form a first merge tree from the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
comparing an n-th tree with an (n-2)-th merge tree, so as to form an (n-1)-th merge tree from the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, where n is an integer greater than or equal to 3.
According to another aspect of the present invention, a device for forming a merge tree for generating a document template is provided, comprising:
a similarity calculation unit configured to, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, calculate the similarity of subtrees located on the same layer in the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold and the common root node of those similar subtrees, wherein required features can be extracted from the nodes of the plurality of trees;
a merging unit configured to form an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing unit configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree,
wherein the similarity calculation unit and the merging unit comprise a cyclic similarity calculation subunit and a cyclic merging subunit, which are configured to:
compare a first tree of the plurality of trees with a second tree of the plurality of trees, so as to form a first merge tree from the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
compare an n-th tree with an (n-2)-th merge tree, so as to form an (n-1)-th merge tree from the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, where n is an integer greater than or equal to 3.
One benefit obtainable with the method and device according to embodiments of the present invention is that the accuracy of template generation can be improved, because the merge tree used for generating the template is obtained by merging a plurality of trees parsed from a plurality of pages. Further, generalizing and extracting from the merge tree according to node features reduces the risk that small variations in a page lead to an erroneous template. In addition, accuracy under different conditions can be improved by changing certain parameters. Another obtainable benefit is that, by generalizing over multiple pages, the nodes in a template path that are prone to change can be clearly identified. Adding the change information of these nodes to the path template reduces the time spent on subsequent information extraction and increases its accuracy, which makes template generation more flexible. A further obtainable benefit is that generalizing and extracting information paths according to node features makes template production automatic and simple, and that comparing extraction results with previously stored results allows changes in a template to be detected in time so that the template can be modified.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be more readily understood from the following description of preferred embodiments and the accompanying drawings, which illustrate the gist of the invention and its use. The components in the drawings are not necessarily drawn to scale and serve only to illustrate the principles of the invention. To facilitate illustration and description of some parts of the invention, corresponding parts in the drawings may be enlarged, that is, made larger relative to other components in an exemplary device according to the invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals.
Fig. 1 shows a general flowchart of a method for forming a merge tree for generating a document template according to an embodiment of the present invention;
Fig. 2 shows a general flowchart of a specific example applying the method shown in Fig. 1;
Fig. 3 shows a general flowchart of the similarity calculation step used in a specific example of the method of the embodiment shown in Fig. 1;
Fig. 4 shows a general flowchart of the post-processing step used in a specific example of the method of the embodiment shown in Fig. 1;
Fig. 5 shows a simplified block diagram of a device for forming a merge tree for generating a document template according to an embodiment of the present invention; and
Fig. 6 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. Elements and features described in one drawing or one embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments. Note that, for the sake of clarity, representations and descriptions of components and processing that are unrelated to the invention or known to those of ordinary skill in the art are omitted from the drawings and the description.
Fig. 1 shows a general flowchart of a method 100 for forming a merge tree for generating a document template according to an embodiment of the present invention. As shown in Fig. 1, the method starts at step S110. In the similarity calculation step S120, when each tree of a plurality of trees parsed from a plurality of pages is compared with another tree, the similarity of subtrees located on the same layer in the two trees under comparison is calculated, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold and the common root node of those similar subtrees. In the merging step S130, an initial merge tree is formed from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees. Then, in the post-processing step S140, the initial merge tree is post-processed to obtain the merge tree by removing invalid subtrees of the initial merge tree.
A specific example of the method of the embodiment of Fig. 1 is described in detail below with reference to Figs. 2 to 4.
Fig. 2 shows a general flowchart of a specific example applying the method shown in Fig. 1. As shown in Fig. 2, in step S212, n pages of a website are obtained, for example n topic pages, and the pages are parsed into n DOM (Document Object Model) trees (n>1). Preferably, the n topic pages of the same website whose URLs (uniform resource locators) are most similar are chosen, to improve the accuracy of the generated template.
In one example, the source code of a web page can be downloaded in real time by a page download program. In another example, the downloaded source code can be saved as a temporary web page file and stored in advance in a storage device (for example, a computer hard disk). The source code of the web page can be in HTML format. By parsing the source code, the DOM tree structure of the web page can be built using DOM techniques. The DOM tree structure of a web page can contain one or more nodes.
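The parsing described above can be sketched as follows. This is a minimal illustration only, assuming Python's standard `html.parser` module stands in for the DOM techniques mentioned in the text; the names `Node`, `TreeBuilder`, `VOID_TAGS`, and `parse_page` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag in HTML (a small, non-exhaustive set).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Hypothetical tree node: tag name, attributes, text, and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse_page(html):
    """Parse HTML source code into a tree rooted at a '#document' node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A production system would more likely use a full HTML5 parser; the sketch only shows how page source becomes a tree of nodes carrying tag names, parameters, and text.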
Then, in step S214, the DOM trees are preprocessed according to certain rules, for example by pruning, to remove common interference items, that is, nodes that are useless for forming the DOM merge tree, such as comment nodes, script nodes, frame nodes, picture nodes, and emoticon display nodes.
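A minimal sketch of such pruning is given below. The set of noise tags is an assumption chosen to mirror the node kinds listed above; the patent does not fix a concrete list, and `Node`, `NOISE_TAGS`, and `prune` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


# Assumed noise tags (comments, scripts, frames, pictures); adjust per site.
NOISE_TAGS = {"#comment", "script", "style", "iframe", "frame", "img"}


def prune(node, noise=NOISE_TAGS):
    """Return a copy of the tree without noise nodes and their subtrees."""
    kept = [prune(child, noise) for child in node.children if child.tag not in noise]
    return Node(node.tag, kept)
```

Returning a copy rather than editing in place keeps the original parsed tree available if the parameters are later changed and the process is rerun.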
Then, in the similarity calculation step S220, the DOM trees are compared layer by layer to find similar subtrees on the same layer. The flow of the similarity calculation step of the method for forming a merge tree for generating a template according to an embodiment of the present invention is described below by way of example with reference to Fig. 3.
In the similarity calculation step S320 shown in Fig. 3, first, in step S322, one subtree is chosen in order from each of any two of the n DOM trees, the two subtrees having identical root nodes, as the two subtrees to be compared; all leaf nodes of the two subtrees are taken out and a leaf node list is formed for each subtree. In step S324, for each leaf node, a path string from the root node of the subtree containing the leaf node to the leaf node is formed from the names of all ancestor nodes of the leaf node, and in step S325 the number PN of ancestor nodes of each leaf node is recorded. Next, in step S326, identical paths (identical path strings) are found among the paths whose leaf node names are identical in the two subtrees, and the number LN of times each such path occurs in each of the two subtrees is determined. The process proceeds to step S327, in which the leaf node lists of the two subtrees are merged by merging the leaf nodes whose path strings are identical, so as to generate a total leaf node list. The number of leaf nodes in the total leaf node list is N, which is the number of leaf nodes with distinct path strings (distinct paths) in the two subtrees. Then, in step S329, the similarity A of the two subtrees is calculated according to formula (1) below.
$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right)} \times \log \max PN_i \tag{1}$$
Here $PN_i$ denotes the number of ancestor nodes of the i-th leaf node in the total leaf node list, $\max PN_i$ denotes the maximum number of ancestor nodes over all leaf nodes of the two subtrees, i is an integer (i = 1, ..., N), and $LN_{i1}$ and $LN_{i2}$ denote the number of occurrences of the path of the i-th leaf node in the first and the second of the two subtrees, respectively. When $LN_{i1}$ and $LN_{i2}$ are equal for every leaf node (i = 1, ..., N), the numerator in formula (1) is set to a predetermined value, for example a value in the range 0.1 to 0.5.
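The quantities that feed formula (1) can be gathered as in the following sketch of steps S322 to S327. It only collects the path strings together with PN, LN for the first subtree, and LN for the second subtree; it deliberately does not evaluate formula (1) itself. `Node`, `leaf_paths`, and `path_features` are hypothetical names, and joining tag names with "/" is an assumed path string format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def leaf_paths(node, prefix=()):
    """Yield, for each leaf under node, the tags from the subtree root to the leaf."""
    path = prefix + (node.tag,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


def path_features(subtree_a, subtree_b):
    """Return {path string: (PN, LN1, LN2)} over the total leaf node list."""
    count_a = Counter(leaf_paths(subtree_a))
    count_b = Counter(leaf_paths(subtree_b))
    features = {}
    for path in count_a.keys() | count_b.keys():
        pn = len(path) - 1  # number of ancestors of the leaf inside the subtree
        features["/".join(path)] = (pn, count_a[path], count_b[path])
    return features
```

The number of entries in the returned dictionary corresponds to N, the size of the total leaf node list.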
If the similarity A is greater than or equal to a predetermined similarity threshold (the first threshold), the two subtrees under comparison are determined to be similar subtrees; otherwise they are dissimilar subtrees. When the two subtrees are judged to be similar, the next subtrees with identical root nodes are chosen in order from each of the two DOM trees under comparison and compared, until the subtrees of all layers of the two DOM trees have been compared. When the two subtrees are judged to be dissimilar, the next subtree with the same root node is selected as the current subtree of the second DOM tree and compared with the current subtree of the first DOM tree to determine whether they are similar. If the selected subtree of the second DOM tree is the last subtree of a layer, selection can wrap around within that layer, so that subtrees that precede it and were determined to be dissimilar when compared with other subtrees of the first DOM tree are selected as the subtree to be compared.
In this way, the subtrees of every layer of the two DOM trees are compared, so as to extract the similar subtrees whose similarity is greater than or equal to the predetermined first threshold and the common root node of those similar subtrees.
Next, in the merging step S230, the trees are merged according to the similar subtrees to generate a first DOM merge tree. Specifically, the similar subtrees of the two DOM trees are merged according to certain rules, and the subtrees of the two DOM trees under comparison that were determined to be dissimilar are retained unchanged in the first DOM merge tree. In one embodiment, the similar subtrees of two DOM trees can be merged as follows. The corresponding root nodes of the similar subtrees are taken as the root node of the merge tree, and the weight of this node is updated at the same time; for example, the weights of the root nodes of the two subtrees and the similarity are added to obtain the weight of the root node of the merge tree, the initial value of a weight being, for example, 0. The corresponding parameters (identical parameters are retained only once) and the corresponding text content of the nodes are also merged. After the root nodes have been merged, the subtrees of the current two subtrees are merged one after another in the same way, and this is repeated until all nodes have been merged.
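The merging rule described above can be sketched as follows, under stated assumptions: the `similarity` callable is supplied by the caller (it stands in for formula (1)), child subtrees are paired greedily by tag name, and parameters are modeled as a set of (name, value) pairs so that identical ones are kept once. None of these choices, nor the names `Node` and `merge`, are prescribed by the patent.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove targets the exact node
class Node:
    tag: str
    attrs: set = field(default_factory=set)   # (name, value) pairs
    texts: set = field(default_factory=set)   # text content seen at this node
    weight: float = 0.0                        # initial weight is 0
    children: list = field(default_factory=list)


def merge(a, b, similarity, threshold):
    """Merge two subtrees whose roots are already known to correspond."""
    merged = Node(
        tag=a.tag,
        attrs=a.attrs | b.attrs,   # identical parameters are kept only once
        texts=a.texts | b.texts,
        weight=a.weight + b.weight + similarity(a, b),
    )
    unmatched = list(b.children)
    for child in a.children:
        partner = next(
            (c for c in unmatched
             if c.tag == child.tag and similarity(child, c) >= threshold),
            None,
        )
        if partner is None:
            merged.children.append(child)  # dissimilar subtree: keep unchanged
        else:
            unmatched.remove(partner)
            merged.children.append(merge(child, partner, similarity, threshold))
    merged.children.extend(unmatched)      # dissimilar subtrees of the second tree
    return merged
```

Because weights accumulate similarities across merges, nodes that recur in many pages end up with larger weights, which is what the later post-processing relies on.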
When the number of DOM trees is greater than 2 (n>2), the process returns to the similarity calculation step S220, and another DOM tree is chosen from the n DOM trees and compared with the first DOM merge tree, so as to extract the similar subtrees and the common root node of those similar subtrees from the two trees under comparison. Next, the process proceeds to the merging step S230, in which a second DOM merge tree is formed from the extracted similar subtrees and their common root node. The similarity calculation step and the merging step are repeated in this way until every DOM tree of the n DOM trees has been chosen. This process generates the initial merge tree. The path portion of the initial merge tree consists of the common path portion of all the DOM trees. That is, the initial merge tree takes the deepest common root node as its parent node and adds the subtrees extracted from the other trees as its subtrees.
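The sequential loop described above amounts to a fold over the list of trees. The sketch below assumes a caller-supplied `merge_two` function that performs one round of similarity calculation and merging; both `build_initial_merge_tree` and `merge_two` are hypothetical names.

```python
def build_initial_merge_tree(trees, merge_two):
    """Fold n trees into one: tree k is merged with the (k-2)-th merge tree."""
    if len(trees) < 2:
        raise ValueError("at least two trees are required")
    merged = merge_two(trees[0], trees[1])   # first merge tree
    for tree in trees[2:]:                    # trees 3 .. n
        merged = merge_two(tree, merged)      # yields the next merge tree
    return merged                             # the (n-1)-th merge tree
```

With n trees this performs exactly n-1 merges, and the last result serves as the initial merge tree.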
In an alternative embodiment, every two DOM trees of the n DOM trees can first be compared in the similarity calculation step S220, and the n DOM trees can be merged in pairs in the merging step S230 to obtain first merge trees. When the number of DOM trees is greater than 2 (n>2), the process returns to the similarity calculation step S220, every two of the first merge trees obtained by pairwise merging are compared, and the first merge trees are merged in pairs in the merging step S230. The similarity calculation step and the merging step are repeated in this way until a single merge tree is finally obtained, which serves as the initial merge tree.
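The pairwise alternative can be sketched as a tournament-style reduction. As before, `merge_two` is a caller-supplied stand-in for one round of similarity calculation and merging. Carrying an unpaired tree over to the next round when the count is odd is an assumption, since the text does not say how odd counts are handled.

```python
def merge_pairwise(trees, merge_two):
    """Merge trees two by two, round after round, until one tree remains."""
    level = list(trees)
    if not level:
        raise ValueError("at least one tree is required")
    while len(level) > 1:
        merged = [merge_two(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # assumed: an unpaired tree moves up as is
            merged.append(level[-1])
        level = merged
    return level[0]
```

Compared with the sequential loop, this variant needs about log2(n) rounds, and the merges within one round are independent of each other.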
Next, the process proceeds to the post-processing step S240, in which the initial merge tree is post-processed according to certain rules and features, so as to obtain the DOM merge tree by removing invalid subtrees. The flow of the post-processing step of the method for forming a merge tree for generating a document template according to an embodiment of the present invention is described below by way of example with reference to Fig. 4.
In the post-processing step S440 shown in Fig. 4, first, in step S442, a node of the initial merge tree is taken and all attributes and the weight of the node are obtained. In one embodiment, the weight is the sum of all similarities obtained in the similarity calculation step for all comparisons involving the subtree rooted at this node. In another embodiment, the weight is the value obtained by normalizing that sum. The attributes are, for example, the text content and the original parameters of the node. In step S444, interfering nodes are removed according to their attributes. The process then proceeds to step S446, in which the initial merge tree is processed according to how each weight compares with a truncation threshold (the second threshold) and a window threshold (the third threshold). Specifically, for a node whose weight is less than the truncation threshold, the subtree rooted at the node is discarded, and a node whose weight is greater than the window threshold is retained. Nodes whose weight is greater than or equal to the truncation threshold but less than or equal to the window threshold are verified in step S448 according to certain features, for example text features, parameter features, and node type features. The process then proceeds to step S449: a node that passes verification has its weight increased and is retained, while for a node that fails verification the subtree rooted at the node is discarded, so as to eliminate misjudgments and/or accumulated errors.
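The threshold logic of steps S446 to S449 can be sketched as below. The `verify` callable stands in for the feature-based verification, and the size of the weight increase (`boost`) is an assumption, as the text only says the weight is increased. `Node` and `postprocess` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    weight: float
    children: list = field(default_factory=list)


def postprocess(node, cutoff, window, verify, boost=1.0):
    """Return a pruned copy of node, or None if its whole subtree is discarded.

    cutoff is the truncation (second) threshold, window the third threshold.
    """
    weight = node.weight
    if weight < cutoff:
        return None                    # below truncation threshold: drop subtree
    if weight <= window:               # in the uncertain band: verify by features
        if not verify(node):
            return None
        weight += boost                # verified nodes get a higher weight
    kept = [postprocess(c, cutoff, window, verify, boost) for c in node.children]
    return Node(node.tag, weight, [c for c in kept if c is not None])
```

For example, with `cutoff=1.0` and `window=3.0`, a node of weight 0.5 is dropped outright, a node of weight 4.0 is kept as is, and a node of weight 2.0 survives only if `verify` accepts it.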
In the above embodiment, the post-processing that removes invalid subtrees is performed after the initial merge tree has been formed using all the DOM trees. In another embodiment, when the number of DOM trees is large, additional post-processing for removing invalid subtrees can be inserted into the process of forming the initial merge tree, to simplify subsequent processing. For example, the (n/2-1)-th merge tree formed from n/2 DOM trees is post-processed to remove invalid subtrees, and the similarity calculation step and the merging step are then continued with the remaining n/2 DOM trees.
Then the process proceeds to step S250. In step S250, effective information nodes are first chosen from the DOM merge tree. Effective information nodes can be extracted, for example, as follows: all leaf nodes are extracted from the subtrees of the DOM merge tree, paths are generated, and the total number of paths is obtained; the paths are first classified to obtain identical paths, and each group of identical paths is then classified according to the text content of all corresponding nodes; for each identical path, the ratio of the number of text content classes to the total number of paths is calculated. The larger the ratio, the greater the variation and the more likely the path is an effective information path; otherwise it is useless framework information.
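The ratio described above can be computed as in the following sketch. It assumes each leaf occurrence is given as a `(path, text)` pair and that the "total number of paths" is the total number of leaf occurrences; `variation_ratios` is a hypothetical name.

```python
from collections import defaultdict


def variation_ratios(leaves):
    """Map each path to (number of distinct texts on that path) / (total paths).

    leaves: iterable of (path, text) pairs, one pair per leaf occurrence.
    """
    leaves = list(leaves)
    total = len(leaves)
    texts_by_path = defaultdict(set)
    for path, text in leaves:
        texts_by_path[path].add(text)
    return {path: len(texts) / total for path, texts in texts_by_path.items()}
```

A navigation label that reads the same on every page yields a small ratio, whereas a post body that differs from page to page yields a large one and is the likelier effective information path.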
Next, the DOM merge tree is generalized and extracted from according to the features of the chosen nodes, so as to choose the required information paths. The features can include at least one of parameter features, node type features, and text features. In one embodiment, the information path of a node can take the form of a character string. The information path can contain information about the route from the root node of the corresponding DOM tree structure to a given node, specifically information on each node that must be passed through to reach that node from the root node. In one example, this information can include the name (such as the tag) and the sequence number of a node. The sequence number indicates the position of the node in the corresponding DOM tree structure. In one example, the sequence number can indicate the position of the node within its layer of the corresponding DOM tree structure. Generalizing and extracting from a DOM tree according to node features is a technique well known to those skilled in the art, and its details are not repeated here.
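One possible string form of such an information path is sketched below. The XPath-like `tag[index]` syntax, and the choice to count the sequence number among siblings with the same tag, are assumptions made for illustration; the patent only says that a path carries node names and sequence numbers. `Node` and `info_path` are hypothetical names.

```python
class Node:
    """Hypothetical tree node that registers itself with its parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def info_path(node):
    """Build a path string from the root to node, e.g. 'html[1]/body[1]/div[2]'."""
    steps = []
    while node is not None:
        siblings = node.parent.children if node.parent is not None else [node]
        same_tag = [s for s in siblings if s.tag == node.tag]
        # 1-based position of this node among siblings with the same tag
        index = next(i for i, s in enumerate(same_tag, 1) if s is node)
        steps.append(f"{node.tag}[{index}]")
        node = node.parent
    return "/".join(reversed(steps))
```

Storing the sequence number alongside the tag lets the template distinguish, for instance, the second `div` under `body` from the first.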
The chosen information paths are then saved, thereby generating the page template. Once the information paths of all chosen nodes have been generated, a page template containing the information paths of these nodes has been obtained. In one example, the generated template can be in XML (Extensible Markup Language) file format.
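Saving the chosen information paths as XML could look like the sketch below, which uses Python's standard `xml.etree.ElementTree`. The element and attribute names (`template`, `field`, `site`, `name`) are hypothetical; the patent does not define a schema for the template file.

```python
import xml.etree.ElementTree as ET


def build_template(site, info_paths):
    """Serialize {field name: information path} into an XML template string."""
    root = ET.Element("template", {"site": site})
    for name, path in info_paths.items():
        field_el = ET.SubElement(root, "field", {"name": name})
        field_el.text = path
    return ET.tostring(root, encoding="unicode")
```

For instance, `build_template("example.com", {"title": "html[1]/body[1]/h1[1]"})` produces a single `template` element containing one `field` element whose text is the information path.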
Then the process proceeds to step S260, in which the generated template is used to extract information from multiple pages, and in step S262 the accuracy of the extracted information is compared with a predetermined threshold. If the accuracy is greater than the predetermined threshold, the generated template is judged to be correct and the process ends. Otherwise the DOM merge tree is judged to be erroneous, a parameter is changed, for example the similarity threshold, the truncation threshold, or the window threshold, and the process returns to the similarity calculation step S220 to redetermine the similar subtrees.
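The feedback loop of steps S260 and S262 can be sketched as follows. The callables `build_template` and `evaluate`, and the idea of trying a predefined list of parameter sets in order, are assumptions for illustration; the patent only states that parameters are changed and the process is repeated.

```python
def generate_until_accurate(build_template, evaluate, parameter_sets, min_accuracy):
    """Try parameter sets in order until the template's accuracy exceeds the threshold.

    build_template(**params) returns a template; evaluate(template) returns
    the extraction accuracy measured on the sample pages.
    """
    for params in parameter_sets:
        template = build_template(**params)
        if evaluate(template) > min_accuracy:
            return template, params      # correct template found
    return None, None                    # no parameter set was good enough
```

Each entry of `parameter_sets` would typically hold values for the similarity threshold, the truncation threshold, and the window threshold.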
The embodiments of the method for forming a merge tree for generating a document template have been described above using the Document Object Model (DOM) tree structure. Those skilled in the art will understand that a page can also be parsed into another type of tree from whose nodes the required features can be extracted, for example a tag tree.
Furthermore, other embodiments of the present invention provide a device for forming a merge tree for generating a document template. Fig. 5 shows a simplified block diagram of this device 500. As shown in the figure, the device 500 comprises: a similarity calculation unit 520 configured to, when each tree of n trees (for example, DOM trees) parsed from n pages is compared with another tree, calculate the similarity of subtrees located on the same layer in the two trees under comparison, so as to extract from the two trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold and the common root node of those similar subtrees; a merging unit 530 configured to form an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and a post-processing unit 540 configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees of the initial merge tree.
When the number of trees is greater than 2 (n>2), the processing of the similarity calculation unit and the processing of the merging unit are performed cyclically. In one embodiment, the similarity calculation unit 520 compares the first tree and the second tree of the n trees to extract the similar subtrees and their common root node from the first and second trees, and the merging unit 530 forms a first merge tree from the extracted similar subtrees and their common root node. Then the similarity calculation unit 520 compares the third tree of the n trees with the first merge tree to extract similar subtrees and a common root node, and the merging unit 530 forms a second merge tree from the extracted similar subtrees and common root node. This is repeated. Finally, the similarity calculation unit 520 compares the n-th tree of the n trees with the (n-2)-th merge tree to extract similar subtrees and a common root node, and the merging unit 530 forms an (n-1)-th merge tree from the extracted similar subtrees and common root node, which serves as the initial merge tree.
The device 500 shown in Fig. 5 and its units 520-530 can be configured to perform the various operations described above with reference to Figs. 1 to 4. For further details of these operations, refer to the embodiments and examples described above; they are not described again here.
Different implementations of the device and/or method according to embodiments of the present invention have been set out in the detailed description above by means of block diagrams, flowcharts, and/or embodiments. Where these block diagrams, flowcharts, and/or embodiments contain one or more functions and/or operations, it will be apparent to those skilled in the art that each function and/or operation can be implemented individually and/or jointly by various hardware, software, firmware, or virtually any combination thereof. In one embodiment, several parts of the subject matter described in this specification can be implemented by an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), or another integrated form. However, those skilled in the art will recognize that some aspects of the embodiments described in this specification can be equivalently implemented, in whole or in part, in integrated circuits, as one or more computer programs running on one or more computers (for example, on one or more computer systems), as one or more programs running on one or more processors (for example, on one or more microprocessors), as firmware, or as virtually any combination thereof, and that, in light of the content disclosed in this specification, designing the circuits and/or writing the code for the software and/or firmware of the present disclosure is well within the abilities of those skilled in the art.
For example, all modules, units, and subunits in the device 500 described above can be configured by software, firmware, hardware, or any combination thereof. When implemented by software or firmware, a program constituting the software can be installed from a storage medium or a network onto a computer with a dedicated hardware structure (for example, the general-purpose computer 600 shown in Fig. 6), and the computer can perform various functions when the various programs are installed.
Fig. 6 shows and can be used for implementing according to the schematic block diagram of the computing machine of the method and apparatus of the embodiment of the present invention.
In Fig. 6, CPU (central processing unit) (CPU) 601 carries out various processing according to the program of storage in ROM (read-only memory) (ROM) 602 or from the program that storage area 608 is loaded into random access memory (RAM) 603.In RAM603, also store as required data required in the time that CPU601 carries out various processing etc.CPU601, ROM602 and RAM603 are connected to each other via bus 604.Input/output interface 605 is also connected to bus 604.
Following parts are also connected to input/output interface 605: importation 606 (comprising keyboard, mouse etc.), output 607 (comprise display, such as cathode-ray tube (CRT) (CRT), liquid crystal display (LCD) etc., and loudspeaker etc.), storage area 608 (comprising hard disk etc.), communications portion 609 (comprising network interface unit such as LAN card, modulator-demodular unit etc.).Communications portion 609 is via for example the Internet executive communication processing of network.As required, driver 610 also can be connected to input/output interface 605.Detachable media 611 for example disk, CD, magneto-optic disk, semiconductor memory etc. can be installed on driver 610 as required, and the computer program of therefrom reading is installed in storage area 608 as required.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that the storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, which stores the program and is distributed to the user together with the device that contains it.
Accordingly, the present invention also proposes a program product storing machine-readable instruction code. When read and executed by a machine, the instruction code can carry out the above methods according to embodiments of the present invention. Correspondingly, the various storage media listed above for carrying such a program product are also included in the disclosure of the present invention.
In the above description of specific embodiments of the present invention, a feature described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprise/include", as used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in another chronological order, in parallel, or independently. The order of execution of the methods described in this specification therefore does not limit the technical scope of the present invention.
As can be seen from the above description of embodiments of the present invention, the technical solutions covered by the present invention include, but are not limited to, the content of the following remarks:
Remark 1. A method for forming a merge tree used to generate a document template, comprising the following steps:
a similarity calculation step of calculating, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, the similarity of subtrees located at the same layer in the two trees being compared, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein the required features can be extracted from the nodes of the plurality of trees;
a merging step of forming an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing step of post-processing the initial merge tree to obtain the merge tree by removing invalid subtrees from the initial merge tree.
Remark 2. The method according to Remark 1, wherein the similarity calculation step and the merging step comprise:
comparing the first tree of the plurality of trees with the second tree of the plurality of trees, so as to form a first merge tree from the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
comparing the n-th tree with the (n-2)-th merge tree, so as to form the (n-1)-th merge tree from the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, where n is an integer greater than or equal to 3.
Remark 3. The method according to Remark 1 or 2, wherein the similarity calculation step comprises: selecting in turn, from each of the two compared trees, one subtree with an identical root node, as the two subtrees to be compared; forming the paths from all leaf nodes of each of the two subtrees to the root node of that subtree; determining, for each of the two subtrees, the number of identical paths among the paths whose leaf nodes have the same name; and calculating the similarity A of the two subtrees according to the following formula:
$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right)} \times \log \max PN_i$$
where N denotes the total number of leaf nodes with mutually distinct paths in the two subtrees, $PN_i$ denotes the number of parent nodes of the i-th leaf node, $\max PN_i$ denotes the maximum number of parent nodes over all leaf nodes of the two subtrees, $i = 1, \ldots, N$, and $LN_{i1}$ and $LN_{i2}$ denote the numbers of identical paths among the paths for the i-th leaf node in the first and second subtree, respectively.
Remark 4. The method according to any one of Remarks 1-3, wherein the post-processing step comprises processing those subtrees of the initial merge tree whose root node has a similarity-related weight falling within a threshold range that is greater than or equal to a predetermined second threshold and less than or equal to a predetermined third threshold, so as to eliminate misjudgments and/or accumulated errors.
Remark 5. The method according to any one of Remarks 1-4, further comprising a step of preprocessing the plurality of parsed trees to remove nodes that are of no use in forming the merge tree.
Remark 6. The method according to any one of Remarks 1-5, further comprising, after the post-processing step, an induction and extraction step of performing induction and extraction processing on the merge tree according to the features of its nodes, so as to select the required information paths.
Remark 7. The method according to Remark 6, wherein the features comprise at least one of a parameter feature, a node type feature, and a text feature.
Remark 8. The method according to Remark 6 or 7, further comprising a step of generating the document template from the selected required information paths.
Remark 9. A device for forming a merge tree used to generate a document template, comprising:
a similarity calculation unit configured to calculate, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, the similarity of subtrees located at the same layer in the two trees being compared, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein the required features can be extracted from the nodes of the plurality of trees;
a merging unit configured to form an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing unit configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees from the initial merge tree.
Remark 10. The device according to Remark 9, wherein the similarity calculation unit and the merging unit comprise a similarity loop calculation subunit and a loop merging subunit, which are configured to:
compare the first tree of the plurality of trees with the second tree of the plurality of trees, so as to form a first merge tree from the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
compare the n-th tree with the (n-2)-th merge tree, so as to form the (n-1)-th merge tree from the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, where n is an integer greater than or equal to 3.
Remark 11. The device according to Remark 9 or 10, wherein the similarity calculation unit comprises: a subtree selection subunit configured to select in turn, from each of the two compared trees, one subtree with an identical root node, as the two subtrees to be compared; a path forming subunit configured to form the paths from all leaf nodes of each of the two subtrees to the root node of that subtree; a determining subunit configured to determine, for each of the two subtrees, the number of identical paths among the paths whose leaf nodes have the same name; and a calculation subunit configured to calculate the similarity A of the two subtrees according to the following formula:
$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right)} \times \log \max PN_i$$
where N denotes the total number of leaf nodes with mutually distinct paths in the two subtrees, $PN_i$ denotes the number of parent nodes of the i-th leaf node, $\max PN_i$ denotes the maximum number of parent nodes over all leaf nodes of the two subtrees, $i = 1, \ldots, N$, and $LN_{i1}$ and $LN_{i2}$ denote the numbers of identical paths among the paths for the i-th leaf node in the first and second subtree, respectively.
Remark 12. The device according to any one of Remarks 9-11, wherein the post-processing unit is configured to process those similar subtrees of the initial merge tree whose root node has a similarity-related weight falling within a threshold range that is greater than or equal to a predetermined second threshold and less than or equal to a predetermined third threshold, so as to eliminate misjudgments and/or accumulated errors.
Remark 13. The device according to any one of Remarks 9-12, further comprising a preprocessing unit configured to preprocess the plurality of parsed trees to remove nodes that are of no use in forming the merge tree.
Remark 14. The device according to any one of Remarks 9-13, further comprising an induction and extraction unit configured to perform induction and extraction processing on the merge tree according to the features of its nodes, so as to select the required information paths.
Remark 15. The device according to Remark 14, wherein the features comprise at least one of a parameter feature, a node type feature, and a text feature.
Remark 16. The device according to Remark 14 or 15, further comprising a template generation unit configured to generate the document template from the selected required information paths.
Remark 17. A program product storing machine-readable instruction code which, when read and executed by a machine, can carry out the method according to any one of Remarks 1-8.
Remark 18. A storage medium carrying the program product according to Remark 17.
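The path-forming and path-counting steps named in Remarks 3 and 11 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the function names, and the representation of a path as a tuple of node names are all assumptions made for illustration. The sketch only produces the per-path counts that play the role of $LN_{i1}$ and $LN_{i2}$; it does not evaluate the similarity formula itself.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a name (e.g. an HTML tag) plus child nodes."""
    name: str
    children: list[Node] = field(default_factory=list)


def leaf_paths(root: Node) -> list[tuple[str, ...]]:
    """Return one path per leaf, listed from the leaf up to the subtree root."""
    paths: list[tuple[str, ...]] = []

    def walk(node: Node, ancestors: tuple[str, ...]) -> None:
        trail = (node.name,) + ancestors  # current node first, root last
        if not node.children:
            paths.append(trail)
        for child in node.children:
            walk(child, trail)

    walk(root, ())
    return paths


def paired_counts(a: Node, b: Node) -> dict[tuple[str, ...], tuple[int, int]]:
    """For every distinct path in either subtree, give its count in a and in b."""
    count_a, count_b = Counter(leaf_paths(a)), Counter(leaf_paths(b))
    return {p: (count_a[p], count_b[p]) for p in sorted(set(count_a) | set(count_b))}


# Two small subtrees that share the root name "ul" (made-up example data).
t1 = Node("ul", [Node("li", [Node("a")]), Node("li", [Node("a")])])
t2 = Node("ul", [Node("li", [Node("a")]), Node("li", [Node("span")])])
counts = paired_counts(t1, t2)
```

Under this representation, the number of parent nodes of a leaf would be `len(path) - 1`, which is one possible reading of $PN_i$.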
Although the present invention has been disclosed above through the description of specific embodiments, it should be understood that those skilled in the art can devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.
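The threshold-range selection described in Remarks 4 and 12 can be sketched as follows. This is only an illustration under stated assumptions: the `MergeNode` class, its `weight` field, and the function name are hypothetical, and the text does not say what processing is applied to the selected subtrees, so the sketch stops at selecting them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MergeNode:
    """Hypothetical merge-tree node carrying a similarity-related weight."""
    name: str
    weight: float
    children: list[MergeNode] = field(default_factory=list)


def subtrees_in_range(root: MergeNode, second_threshold: float,
                      third_threshold: float) -> list[MergeNode]:
    """Collect subtrees whose root weight lies in [second_threshold, third_threshold]."""
    selected = []
    stack = [root]
    while stack:
        node = stack.pop()
        if second_threshold <= node.weight <= third_threshold:
            selected.append(node)
        stack.extend(node.children)  # keep descending in every case
    return selected


# Made-up weights and thresholds, for illustration only.
tree = MergeNode("body", 0.9, [
    MergeNode("nav", 0.4),
    MergeNode("content", 0.6, [MergeNode("p", 0.95)]),
    MergeNode("ads", 0.1),
])
borderline = sorted(n.name for n in subtrees_in_range(tree, 0.3, 0.7))
```

In this example only the subtrees rooted at `nav` and `content` fall inside the range and would be passed on for further checking.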

Claims (8)

1. A method for forming a merge tree used to generate a document template, comprising the following steps:
a similarity calculation step of calculating, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, the similarity of subtrees located at the same layer in the two trees being compared, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein the required features can be extracted from the nodes of the plurality of trees;
a merging step of forming an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing step of post-processing the initial merge tree to obtain the merge tree by removing invalid subtrees from the initial merge tree,
wherein the similarity calculation step and the merging step comprise:
comparing the first tree of the plurality of trees with the second tree of the plurality of trees, so as to form a first merge tree from the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
comparing the n-th tree with the (n-2)-th merge tree, so as to form the (n-1)-th merge tree from the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, where n is an integer greater than or equal to 3.
2. The method according to claim 1, wherein the similarity calculation step comprises: selecting in turn, from each of the two compared trees, one subtree with an identical root node, as the two subtrees to be compared; forming the paths from all leaf nodes of each of the two subtrees to the root node of that subtree; determining, for each of the two subtrees, the number of identical paths among the paths whose leaf nodes have the same name; and calculating the similarity A of the two subtrees according to the following formula:
$$A = \frac{\sum_{i=1}^{N} \left| PN_i \left( LN_{i1} + LN_{i2} \right) \right|}{\sum_{i=1}^{N} PN_i \left( LN_{i1} - LN_{i2} \right)} \times \log \max PN_i$$
where N denotes the total number of leaf nodes with mutually distinct paths in the two subtrees, $PN_i$ denotes the number of parent nodes of the i-th leaf node, $\max PN_i$ denotes the maximum number of parent nodes over all leaf nodes of the two subtrees, $i = 1, \ldots, N$, and $LN_{i1}$ and $LN_{i2}$ denote the numbers of identical paths among the paths for the i-th leaf node in the first and second subtree, respectively.
3. The method according to claim 1 or 2, wherein the post-processing step comprises processing those subtrees of the initial merge tree whose root node has a similarity-related weight falling within a threshold range that is greater than or equal to a predetermined second threshold and less than or equal to a predetermined third threshold, so as to eliminate misjudgments and/or accumulated errors.
4. The method according to claim 1 or 2, further comprising a step of preprocessing the plurality of parsed trees to remove nodes that are of no use in forming the merge tree.
5. The method according to claim 1 or 2, further comprising, after the post-processing step, an induction and extraction step of performing induction and extraction processing on the merge tree according to the features of its nodes, so as to select the required information paths.
6. The method according to claim 5, wherein the features comprise at least one of a parameter feature, a node type feature, and a text feature.
7. The method according to claim 5, further comprising a step of generating the document template from the selected required information paths.
8. A device for forming a merge tree used to generate a document template, comprising:
a similarity calculation unit configured to calculate, when each of a plurality of trees parsed from a plurality of pages is compared with another tree, the similarity of subtrees located at the same layer in the two trees being compared, so as to extract from the two compared trees the similar subtrees whose similarity is greater than or equal to a predetermined first threshold, together with the common root node of those similar subtrees, wherein the required features can be extracted from the nodes of the plurality of trees;
a merging unit configured to form an initial merge tree from the similar subtrees extracted from all the trees, wherein the root node of the initial merge tree is the common root node of the similar subtrees of all the trees; and
a post-processing unit configured to post-process the initial merge tree to obtain the merge tree by removing invalid subtrees from the initial merge tree,
wherein the similarity calculation unit and the merging unit comprise a similarity loop calculation subunit and a loop merging subunit, which are configured to:
compare the first tree of the plurality of trees with the second tree of the plurality of trees, so as to form a first merge tree from the similar subtrees extracted from the first and second trees and the common root node of those similar subtrees; and
compare the n-th tree with the (n-2)-th merge tree, so as to form the (n-1)-th merge tree from the similar subtrees extracted from the n-th tree and the (n-2)-th merge tree and the common root node of those similar subtrees, where n is an integer greater than or equal to 3.
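The iterative comparison recited in claims 1 and 8 (tree 1 with tree 2 gives the first merge tree; for n ≥ 3, the n-th tree with the (n-2)-th merge tree gives the (n-1)-th merge tree) has the shape of a left fold. The Python sketch below shows only that control flow. The names `iterative_merge` and `toy_merge` are hypothetical, and the set intersection used as the pairwise operation is a deliberately simple stand-in, not the similarity-based subtree extraction the claims describe.

```python
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def iterative_merge(trees: Sequence[T], merge_pair: Callable[[T, T], T]) -> T:
    """Fold a sequence of trees into a single merge tree.

    Tree 1 and tree 2 yield the first merge tree; for n >= 3, tree n is
    combined with merge tree n-2 to yield merge tree n-1.
    """
    if len(trees) < 2:
        raise ValueError("at least two trees are needed")
    return reduce(merge_pair, trees)


def toy_merge(left: frozenset, right: frozenset) -> frozenset:
    """Stand-in for compare-and-merge: keep only the paths both inputs share."""
    return left & right


# Each page is modelled as a set of root-to-node paths (made-up data).
pages = [
    frozenset({"html/body/h1", "html/body/ul/li", "html/body/div/ad"}),
    frozenset({"html/body/h1", "html/body/ul/li", "html/body/p"}),
    frozenset({"html/body/h1", "html/body/ul/li", "html/body/table"}),
]
template_paths = iterative_merge(pages, toy_merge)
```

Passing the pairwise operation in as a parameter keeps the loop independent of how similarity is actually computed, so a real comparison routine could replace `toy_merge` without changing the fold.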
CN201010260747.2A 2010-08-17 2010-08-17 Method and device for forming merge tree for generating document template Expired - Fee Related CN102375847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010260747.2A CN102375847B (en) 2010-08-17 2010-08-17 Method and device for forming merge tree for generating document template

Publications (2)

Publication Number Publication Date
CN102375847A CN102375847A (en) 2012-03-14
CN102375847B true CN102375847B (en) 2014-06-04

Family

ID=45794469

Country Status (1)

Country Link
CN (1) CN102375847B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902582B (en) * 2012-12-27 2017-08-11 中国移动通信集团湖北有限公司 A kind of method and apparatus for reducing data warehouse data redundancy
JP6173896B2 (en) * 2013-12-10 2017-08-02 株式会社日立製作所 Data processing method and data processing server
CN104636481A (en) * 2015-02-16 2015-05-20 浪潮集团有限公司 Webpage template extracting method and device
CN106815235B (en) * 2015-11-27 2020-06-19 阿里巴巴(中国)有限公司 Super webpage template generation method and device and page data transmission method
CN107423391B (en) * 2017-07-24 2020-11-03 福州大学 Information extraction method of webpage structured data
CN109445784B (en) * 2018-09-29 2020-08-14 Oppo广东移动通信有限公司 Method and device for processing structure data, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694668A (en) * 2009-09-29 2010-04-14 百度在线网络技术(北京)有限公司 Method and device for confirming web structure similarity
CN101727498A (en) * 2010-01-15 2010-06-09 西安交通大学 Automatic extraction method of web page information based on WEB structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7668942B2 (en) * 2008-05-02 2010-02-23 Yahoo! Inc. Generating document templates that are robust to structural variations

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant
C14 Grant of patent or utility model
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140604

Termination date: 20180817
