CN101876995A - Method for calculating similarity of XML documents

Info

Publication number: CN101876995A
Application number: CN2009102449033A
Authority: CN
Inventors: 汪陈应; 袁晓洁; 廉鑫; 林伟坚
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2009-12-18
Filing date: 2009-12-18
Publication date: 2010-11-03

Abstract

The invention belongs to the technical field of databases and establishes an XML document constraint model called the bidirectional path constraint (BPC) model. Based on this model, the invention discloses a new method for calculating the similarity of XML documents. The bidirectional path constraint of each node captures the structural information of an XML document more completely, so that the similarity of XML documents is measured more accurately. The N-Gram idea, which is well established in natural language processing, is introduced, and an N-Gram-based partition mode is applied to the similarity calculation of path constraints. Extraction and manipulation of the N-Gram information are simplified by representing labels as positive integers and combining them in a positional number base. The method can be used for XML document classification, clustering, schema extraction and similar tasks.

Description

A method for calculating the similarity of XML documents

[Technical field]

The invention belongs to the technical field of databases and specifically relates to a method for calculating the similarity of XML documents.

[Background art]

The Extensible Markup Language (XML) has become the standard format for representing and exchanging data on the Web. With the promotion and application of XML-related standards, many industries use XML as a meta-language to define their own domain-specific sublanguages for storing and sharing the data of their field. As a result, large numbers of XML documents keep emerging in every field, and how to mine knowledge from them has become a pressing problem. XML data mining is an important application of knowledge discovery, and similarity calculation plays a fundamental role in it.

XML document mining is divided into content mining and structure mining and can be used for the extraction and integration of XML data and for other applications. Because XML documents are semi-structured data, structure mining is particularly important. Classification and clustering are methods generally adopted in data mining; XML document similarity is the basis of both and is a key factor affecting the mining result.

At present there are two main classes of methods for calculating XML document similarity: methods based on tree edit distance and methods based on frequent paths. Methods based on tree edit distance are widely used. They first represent an XML document as an ordered labelled tree, for example a DOM tree, and then measure the similarity of the XML document trees by the tree edit distance. There are three classic tree edit distance algorithms, Selkow, Chawathe and Dalamagas, but the time complexity of tree edit distance algorithms is generally high. Methods based on frequent paths can calculate document similarity quickly but discard all non-frequent paths and therefore lose a large amount of structural information, so their accuracy is relatively low.

[Summary of the invention]

The object of the invention is to remedy the above shortcomings of the prior art by proposing a new method for calculating the similarity of XML documents. The method uses the BPC model to extract the structural information of an XML document, introduces weights to reflect the structural hierarchy, and, based on the N-Gram partition mode, reduces the time complexity of XML document similarity calculation by extracting the N-Gram information of each path in a single scan.

The method for calculating the similarity of XML documents provided by the invention comprises the following steps:

Step 1: define an XML document as an XML document tree;

Step 2: establish the bidirectional path constraint (BPC) model: the BPC of a node is defined on the basis of the document tree of step 1, and the set of BPCs of all nodes contained in an XML document is called the bidirectional path constraint model;

Step 3: use the N-Gram-based partition mode to calculate the similarity between two ancestor path constraints (or between two child path constraints), referred to as the path constraint similarity;

Step 4: calculate the BPC similarity of two nodes from the path constraint similarity obtained in step 3, and take this BPC similarity as the similarity of the two nodes;

Step 5: finally, take the sum of all node similarities in the documents, weighted according to the structural level of the nodes, as the similarity of the two documents.

The concrete calculation process of the invention is as follows:

1. XML document tree

An XML document is defined as an XML document tree, as follows:

Definition 1. XML document tree: an XML document tree is represented as a 6-tuple T = (V, v_0, E, Σ, P, lab), where:

1) V is the set of all nodes in the document tree;

2) v_0 is the root node of the document tree;

3) E_a defines the set of parent-child constraints, E_a = {(u, v) | u ∈ V ∧ v ∈ V, u is the parent node of v}; E_s defines the set of sibling constraints, E_s = {(u, v) | u ∈ V ∧ v ∈ V, v is the right sibling of u}; E denotes the constraint set, i.e. E = E_a ∪ E_s;

4) Σ is the set of node labels in the document tree;

5) P_A defines the ancestor path constraints, P_A = {(v_0, v_1, …, v_n) | (v_i, v_{i+1}) ∈ E_a, 0 ≤ i < n} ∪ {v_0}; P_S defines the child path constraints, P_S = {(v_1, …, v_n) | (v_i, v_{i+1}) ∈ E_s, 0 < i < n, v_1 and v_n are respectively the first and last child of their parent node} ∪ {v_1 | v_1 is the only child of its parent node}; P denotes the set of path constraints, i.e. P = P_A ∪ P_S, and

P \subseteq V \cup V^{2} \cup \dots \cup V^{|V|};

6) the function lab returns the label of a node, i.e. for v ∈ V, lab(v) ∈ Σ.

It should be noted that the method is concerned with structural similarity; content similarity is already handled well by traditional information retrieval techniques, so every text node is uniformly treated as a node whose label is #text. In addition, an attribute node is regarded as a special kind of element node. Fig. 1 shows an example document tree.
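The document tree of Definition 1 could be built from parsed XML roughly as in the following sketch. It is only an illustration under stated assumptions: the `Node` class, `build_tree`, `preorder` and the sample XML string are hypothetical names and data (Fig. 1 is not reproduced here), and giving each attribute node a `#text` child for its value is an assumption, since the text only says that attributes are treated as special element nodes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    label: str  # lab(v)
    # Parent link (parent-child constraint E_a); excluded from repr/eq to avoid cycles.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    # Children in left-to-right order, which encodes the sibling constraints E_s.
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str) -> "Node":
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def build_tree(elem: ET.Element, parent: Optional[Node] = None) -> Node:
    """Convert an ElementTree element into a labelled ordered tree."""
    node = Node(elem.tag) if parent is None else parent.add(elem.tag)
    # Assumption: attributes come first and carry their value as a "#text" child.
    for name in elem.attrib:
        node.add(name).add("#text")
    # Every non-blank piece of text becomes a node labelled "#text".
    if elem.text and elem.text.strip():
        node.add("#text")
    for child in elem:
        build_tree(child, node)
        if child.tail and child.tail.strip():
            node.add("#text")
    return node


def preorder(node: Node) -> Iterator[Node]:
    """Visit the nodes in document order."""
    yield node
    for child in node.children:
        yield from preorder(child)


# Hypothetical sample document, loosely modelled on the labels used in Table 1.
XML = "<BOOK><ISBN>7-302</ISBN><SECTION><TITLE/>intro<FIGURE><CAPTION/></FIGURE></SECTION></BOOK>"
root = build_tree(ET.fromstring(XML))
print([n.label for n in preorder(root)])
```

Keeping the children in an ordered list is what later allows the child path constraint of a node to be read off directly.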

2. BPC of a node

Definition 2. BPC of a node: P_A(e) denotes the ancestor path constraint of node e, P_A(e) = (v_0, v_1, …, e) ∈ P_A; P_S(e) denotes the child path constraint of node e, P_S(e) = (u_1, …, u_n) ∈ P_S with (e, u_i) ∈ E_a; cons(e) denotes the BPC of the node, cons(e) = (P_A(e), P_S(e)), e ∈ V. For a leaf node of the document tree, P_S(e) is empty and is denoted by ε.

Methods based on tree edit distance usually extract only ancestor path constraints. The BPC used in the invention adds the child path constraint to the ancestor path constraint. The structural information of the XML document is thereby captured more fully, which can improve the accuracy of clustering based on document similarity.
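Definition 2 can be illustrated with a short sketch that enumerates cons(e) for every node. The nested-tuple tree encoding, the function name `bpcs` and the sample tree are assumptions made for this example; the empty tuple stands in for ε.

```python
from typing import Iterator, Tuple

Tree = Tuple[str, list]  # (label, [child subtrees]); hypothetical encoding
BPC = Tuple[Tuple[str, ...], Tuple[str, ...]]


def bpcs(tree: Tree, ancestors: Tuple[str, ...] = ()) -> Iterator[Tuple[str, BPC]]:
    """Yield (label, cons(e)) for every node e in document order."""
    label, children = tree
    p_a = ancestors + (label,)                   # P_A(e): root ... e
    p_s = tuple(child[0] for child in children)  # P_S(e): first child ... last child
    yield label, (p_a, p_s)
    for child in children:
        yield from bpcs(child, p_a)


# Hypothetical fragment using labels that also appear in Table 1.
section = ("SECTION", [("TITLE", []), ("#text", []), ("FIGURE", [("CAPTION", [])])])
book = ("BOOK", [("ISBN", [("#text", [])]), section])

for label, (p_a, p_s) in bpcs(book):
    print(label, "(" + "→".join(p_a) + ", " + ("→".join(p_s) or "ε") + ")")
```

Each printed line has the same shape as a row of Table 1, for example `SECTION (BOOK→SECTION, TITLE→#text→FIGURE)`.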

3. Calculating the similarity between two path constraints based on the N-Gram idea

Let k be the number of distinct node labels that occur in the two path constraints to be compared. These k node labels are sorted in dictionary order, so that each node label is mapped in turn to a positive integer in [1, k]. A node label given as a string is thus converted into a number, and identical label names receive identical numbers. Each path constraint is finally represented as an ordered integer array.

Definition 3. Partition mode based on the N-Gram idea: an integer array of length n is divided into n sub-arrays, where the i-th sub-array (0 < i ≤ n) stores the extracted i-Gram items. This sub-array is called the i-Gram array for short and contains n−i+1 items, each generated from i consecutive items (a_1, a_2, …, a_i) of the original integer array as follows:

i-GramItem = a_1 × (k+1)^{i-1} + a_2 × (k+1)^{i-2} + … + a_i × (k+1)^0

The base k+1 is introduced to guarantee that the items of each sub-array are unique. The 1-Gram array has n items, the 2-Gram array has n−1 items, …, the (n−1)-Gram array has 2 items and the n-Gram array has 1 item, so all sub-arrays together contain n(n+1)/2 items. To simplify later processing, the n sub-arrays are stored one after another in a single array of length n(n+1)/2.

The two path constraints to be compared are converted into integer arrays by the label mapping; their lengths are n and m respectively, and they are either both ancestor path constraints or both child path constraints of two nodes. According to Definition 3 they are decomposed in turn into the 1-Gram array, the 2-Gram array, …, and the min(n, m)-Gram array.
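The partition of Definition 3 can be sketched directly from the generating formula. The function names `gram_item` and `decompose` are hypothetical; the example path 6 → 3 → 4 → 5 → 3 with k = 6 (base 7) is the one the description of Fig. 2 uses.

```python
from typing import List


def gram_item(window: List[int], k: int) -> int:
    """Encode consecutive labels (a_1, ..., a_i), each in [1, k], in base k+1."""
    value = 0
    for a in window:
        value = value * (k + 1) + a  # Horner form of a_1*(k+1)^(i-1) + ... + a_i
    return value


def decompose(path: List[int], k: int, max_order: int) -> List[List[int]]:
    """Return the 1-Gram, 2-Gram, ..., max_order-Gram arrays of `path`."""
    n = len(path)
    return [
        [gram_item(path[s:s + i], k) for s in range(n - i + 1)]
        for i in range(1, min(max_order, n) + 1)
    ]


grams = decompose([6, 3, 4, 5, 3], k=6, max_order=5)
print(grams[1])  # 2-Gram array: [45, 25, 33, 38]
print(sum(len(g) for g in grams))  # 15 = n(n+1)/2 for n = 5
```

This direct version recomputes every window from scratch; it is meant only to make the definition concrete, not to reproduce the single-scan extraction.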

Definition 4. Number of identical items C of two one-dimensional arrays: the arrays are regarded as sets, and the number of identical items C is the size of the intersection of the two sets.

Let C_i denote the number of identical items of the two i-Gram arrays obtained by decomposing the two path constraints. When a complete item of an i-Gram array matches, all of its sub-items match as well, and the number of matching sub-items implicitly acts as a weight on C_i. The total number of identical items after decomposing the two path constraints is therefore

C = \sum_{i = 1}^{n} C_{i}.

Definition 5. Path constraint similarity: according to the definitions above, the path constraint similarity is

Sim (p_{1}, p_{2}) = \frac{C}{\frac{t (t + 1)}{2}} = \frac{2 C}{t (t + 1)},

where t = max(n, m), and p_1, p_2 ∈ P_A or p_1, p_2 ∈ P_S.
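Definitions 3 to 5 can be combined into one small function, sketched below. The names `gram_set` and `path_similarity` are hypothetical. Two points are assumptions rather than statements of the text: an empty path constraint (ε) is treated as similar only to another empty one, because the formula is undefined for t = 0, and labels are numbered by Python's default string ordering.

```python
from typing import List, Sequence, Set


def gram_set(path: List[int], base: int, max_order: int) -> Set[int]:
    """Collect all i-Gram items (1 <= i <= max_order) of an integer path as a set."""
    items: Set[int] = set()
    for start in range(len(path)):
        value = 0
        for a in path[start:start + max_order]:
            value = value * base + a  # extend the window by one label
            items.add(value)
    return items


def path_similarity(p1: Sequence[str], p2: Sequence[str]) -> float:
    """Sim(p1, p2) = 2C / (t(t+1)) with t = max(n, m)."""
    if not p1 or not p2:
        # Assumption: ε matches only ε.
        return 1.0 if not p1 and not p2 else 0.0
    labels = sorted(set(p1) | set(p2))              # dictionary order
    code = {lab: i + 1 for i, lab in enumerate(labels)}
    base = len(labels) + 1                          # k + 1
    order = min(len(p1), len(p2))
    g1 = gram_set([code[x] for x in p1], base, order)
    g2 = gram_set([code[x] for x in p2], base, order)
    c = len(g1 & g2)                                # identical items, arrays taken as sets
    t = max(len(p1), len(p2))
    return 2 * c / (t * (t + 1))


print(path_similarity(["TITLE", "#text", "FIGURE"], ["TITLE", "#text", "BOLD"]))  # 0.5
```

Because the items are compared as sets, as Definition 4 states, a path that repeats a label contributes each repeated item only once.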

4. BPC similarity

To preserve the original structural information, the invention extracts the BPC of every node of the XML document, but the ancestor path similarity and the child path similarity may influence the BPC similarity to different degrees. An influence factor is introduced to describe how strongly the ancestor path constraint influences the BPC similarity. This factor is set by the implementer. The ancestor path constraint is generally considered to have the larger influence.

Definition 6. BPC similarity: let α be the influence factor of the ancestor path constraint; 1−α is then the influence factor of the child path constraint, with 0 ≤ α ≤ 1. The BPC similarity is

Sim(cons(e), cons(e_0)) = α × Sim(P_A(e), P_A(e_0)) + (1−α) × Sim(P_S(e), P_S(e_0)).
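As a worked illustration of Definition 6, consider the two SECTION nodes listed in Table 1, whose BPCs are (BOOK→SECTION, TITLE→#text→FIGURE) and (BOOK→SECTION, TITLE→#text→BOLD). The value α = 0.6 is the one used in Algorithm 4. The numbering of the child labels (#text = 1, BOLD = 2, FIGURE = 3, TITLE = 4, hence base 5) is an assumption for this example; any consistent numbering yields the same counts.

```latex
\begin{aligned}
P_A:&\quad (1,2)\ \text{vs}\ (1,2):\ C = 3,\ t = 2,\ Sim = \tfrac{2 \cdot 3}{2 \cdot 3} = 1 \\
P_S:&\quad (4,1,3)\ \text{vs}\ (4,1,2):\ C_1 = 2,\ C_2 = 1,\ C_3 = 0,\ C = 3,\ t = 3,\ Sim = \tfrac{2 \cdot 3}{3 \cdot 4} = 0.5 \\
Sim(cons(e), cons(e_0)) &= 0.6 \times 1 + 0.4 \times 0.5 = 0.8
\end{aligned}
```

The identical ancestor paths dominate the result, while the differing last child lowers it from 1 to 0.8.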

5. Document similarity

Definition 7. Document similarity: let D_1 and D_2 be two XML documents with n and m nodes respectively. The BPC similarity between every node of D_1 and every node of D_2 is calculated according to Definition 6 to form a similarity matrix; for each node of D_1, the similarity value of its most similar node in D_2 is then selected. The document similarity is

Sim (D_{1}, D_{2}) = \frac{\sum_{i = 1}^{n} w (v_{i}) \max_{j = 1}^{m} (s_{ij})}{\sum_{i = 1}^{n} w (v_{i})},

where s_ij = Sim(cons(v_i), cons(v_j)), 1 ≤ i ≤ n, 1 ≤ j ≤ m.

In the XML document tree, the closer a node is to the root node, the greater its influence on the document structure. The weight

w (v_{i}) = 2^{- lev (v_{i})}

describes the different influence of nodes at different depths, where lev(v_i) is the level of node v_i and the level of the root node is 0.
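Given a BPC similarity matrix, the weighted sum of Definition 7 reduces to a few lines. In this sketch the function name `document_similarity` and its argument layout are hypothetical, and the matrix values and node levels are made-up numbers chosen only to show the arithmetic.

```python
from typing import Sequence


def document_similarity(s: Sequence[Sequence[float]], levels: Sequence[int]) -> float:
    """s[i][j]: BPC similarity of node i of D1 and node j of D2.
    levels[i]: level of node i of D1 (the root has level 0)."""
    weights = [2.0 ** -lev for lev in levels]   # w(v_i) = 2^(-lev(v_i))
    best = [max(row) for row in s]              # best match in D2 for each node of D1
    return sum(w * b for w, b in zip(weights, best)) / sum(weights)


# Illustrative values only: D1 has a root and two children, D2 has two nodes.
s = [[1.0, 0.2],
     [0.5, 0.8],
     [0.0, 0.4]]
print(document_similarity(s, levels=[0, 1, 1]))  # (1*1.0 + 0.5*0.8 + 0.5*0.4) / 2
```

Dividing by the sum of the weights keeps the result in [0, 1] whenever the matrix entries are in [0, 1].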

Advantages and positive effects of the invention:

The invention proposes a new method for comparing the similarity of XML documents. The method uses the BPC model to extract the structural information of XML documents more fully, which lays the foundation for calculating XML document similarity accurately. Weights are introduced to reflect the structural hierarchy. The N-Gram idea is used in a novel way to simplify the measurement of path similarity, which is both accurate and efficient. Used as the basis of classification and clustering, the method can improve their accuracy.

[Description of drawings]

Fig. 1 shows an XML document and its corresponding XML document tree.

Fig. 2 shows the extraction of N-Gram information from the path constraint 6 → 3 → 4 → 5 → 3 using the N-Gram idea. The figure comprises five steps, (a) to (e); because the largest integer that occurs is 6, base 7 is used during extraction. In the figure:

(a) shows the first 1-Gram item being filled after the first element of the path is scanned.

(b) shows the second 1-Gram item and the first 2-Gram item being filled after the second element of the path is scanned.

(c) shows the third 1-Gram item, the second 2-Gram item and the first 3-Gram item being filled after the third element of the path is scanned.

(d) shows the fourth 1-Gram item, the third 2-Gram item, the second 3-Gram item and the first 4-Gram item being filled after the fourth element of the path is scanned.

(e) shows the fifth 1-Gram item, the fourth 2-Gram item, the third 3-Gram item, the second 4-Gram item and the first 5-Gram item being filled after the fifth element of the path is scanned.

Fig. 3 is a flowchart of the document similarity algorithm.

[Embodiments]

N-Gram (an N-element model) is a language model commonly used in large-vocabulary continuous speech recognition. The model is based on the assumption that the occurrence of the N-th word depends only on the preceding N−1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained directly from a corpus by counting how often N words occur together. The bigram (Bi-Gram) and trigram (Tri-Gram) models are the most commonly used and are widely applied in natural language processing. An N-Gram can be understood as a sequence of N words.

Embodiment 1: the concrete method for building the BPC model from the XML document tree is as follows:

1. The XML document is defined as an XML document tree as proposed by the invention, and the BPC of each node is established on the basis of this document tree. Fig. 1 shows an XML document and its corresponding XML document tree, and Table 1 lists the BPC of each node of the document tree of Fig. 1.

Embodiment 2: the concrete method for calculating document similarity based on the N-Gram idea is as follows:

Algorithm 1. CreateGram: generating an (i+1)-Gram item from two adjacent i-Gram items

Input: item_1, item_2 /* two adjacent i-Gram items, represented as positive integers */

t /* base t */

Output: item /* (i+1)-Gram item, represented as a positive integer */

①. item := item_1 × t + item_2 % t;

②. RETURN item;

③. End of algorithm

This algorithm generates an (i+1)-Gram item from two adjacent i-Gram items. The base t is the number of distinct labels in the two path constraints to be compared, plus 1. For one and the same path constraint, the base t guarantees that, for i ≠ j, the integer range occupied by the i-Gram items and the integer range occupied by the j-Gram items do not intersect.
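Algorithm 1 translates almost literally into code. The sketch below uses the hypothetical name `create_gram` and checks it on the first three labels of the path 6 → 3 → 4 → 5 → 3 in base 7.

```python
def create_gram(item1: int, item2: int, t: int) -> int:
    """Combine two adjacent i-Gram items into one (i+1)-Gram item in base t.

    item1 encodes (a_1, ..., a_i) and item2 encodes (a_2, ..., a_{i+1});
    item2 % t is the last label a_{i+1}, which is appended to item1.
    """
    return item1 * t + item2 % t


two_gram_a = create_gram(6, 3, 7)                     # labels (6, 3)
two_gram_b = create_gram(3, 4, 7)                     # labels (3, 4)
three_gram = create_gram(two_gram_a, two_gram_b, 7)   # labels (6, 3, 4)
print(two_gram_a, two_gram_b, three_gram)             # 45 25 319
```

Only one multiplication, one remainder and one addition are needed per item, which is what makes the single-scan extraction cheap.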

Algorithm 2. PathDecomposition: extracting the N-Gram information of a path constraint

Input: Path[1, 2, …, n] /* path constraint after mapping to a positive-integer array */

t /* base t, same meaning as in Algorithm 1 */

n_0 /* largest N-Gram order to be extracted, i.e. only the k-Gram sub-arrays with k ≤ n_0 are extracted */

Output: NGram /* the extracted N-Gram information */

①. pos[1, 2, …, n];

/* pos[i] records the start position, within the NGram array, of the i-Gram array of the path constraint Path (i = 1, 2, …, Path.Length) */

②. pos[i] := (2ni − 2n + 3i − i²) / 2;

③. FOREACH member IN Path

④.   i := index of member in Path;

⑤.   NGram[i] := member; /* fill the i-th 1-Gram item */

⑥.   j := 2; /* j denotes the j-Gram item to be filled */

⑦.   IF j ≤ i && j ≤ n_0 THEN

⑧.     item_1 := NGram[pos[j-1] + i − j];

⑨.     item_2 := NGram[pos[j-1] + i − j + 1];

⑩.     NGram[pos[j] + i − j] := CreateGram(item_1, item_2, t);

/* fill the (i−j+1)-th j-Gram item from the (j−1)-Gram items; because pos[j] is the 1-based position of the first j-Gram item, the q-th j-Gram item is stored at pos[j] + q − 1 */

      j++;

      GOTO ⑦

    END IF

  END FOREACH

  RETURN NGram;

End of algorithm

The main purpose of this algorithm is to extract, in a single scan of the array Path, all i-Gram items contained in the array and to fill them into the corresponding positions of the NGram array. The length of each i-Gram array is fixed, and the array pos stores the start position of each i-Gram array within NGram. Depending on the scan position i, items are filled as follows:

i = 1: fill the 1st 1-Gram item

i = 2: fill the 2nd 1-Gram item and the 1st 2-Gram item

i = 3: fill the 3rd 1-Gram item, the 2nd 2-Gram item and the 1st 3-Gram item

……

i = n: fill the n-th 1-Gram item, the (n−1)-th 2-Gram item, …, and the 1st n-Gram item

It follows that, once the current scan position i in Path is known and the item to be filled is known to belong to the j-Gram array, its storage position in NGram can be calculated with the help of the array pos. Steps ⑧ to ⑩ of the algorithm call Algorithm 1 and use the (i−j+1)-th and (i−j+2)-th items of the (j−1)-Gram array to generate the (i−j+1)-th item of the j-Gram array. When the scan of the path array Path ends, the corresponding N-Gram information array NGram is completely filled. Fig. 2 shows the N-Gram information filled in when the path constraint 6 → 3 → 4 → 5 → 3 is processed with the N-Gram idea.
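The single-scan extraction of Algorithm 2 might look as follows in Python. This is a sketch under stated assumptions: it uses 0-based list indices, so the hypothetical `offset[j]` equals the 1-based start position of the j-Gram array minus one, and unfilled slots (orders above `n0`) are left as 0.

```python
from typing import List


def path_decomposition(path: List[int], t: int, n0: int) -> List[int]:
    """Fill a flat array with all j-Gram items (j <= n0) in one scan of `path`."""
    n = len(path)
    # offset[j]: 0-based index of the first j-Gram item in the flat array.
    offset = [0] * (n + 2)
    for j in range(1, n + 1):
        offset[j + 1] = offset[j] + (n - j + 1)
    ngram = [0] * offset[n + 1]  # n(n+1)/2 slots
    for i, member in enumerate(path, start=1):
        ngram[offset[1] + i - 1] = member  # i-th 1-Gram item
        for j in range(2, min(i, n0) + 1):
            item1 = ngram[offset[j - 1] + i - j]      # (i-j+1)-th (j-1)-Gram item
            item2 = ngram[offset[j - 1] + i - j + 1]  # (i-j+2)-th (j-1)-Gram item
            # Same combination rule as Algorithm 1: append the last label of item2.
            ngram[offset[j] + i - j] = item1 * t + item2 % t
    return ngram


print(path_decomposition([6, 3, 4, 5, 3], t=7, n0=5))
```

For the example path the flat array holds the five 1-Gram items first, then the four 2-Gram items, and so on down to the single 5-Gram item.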

Algorithm 3. PathSimilarity: calculating the similarity between two path constraints

Input: StringPath_1[1, 2, …, n], StringPath_2[1, 2, …, m] /* path constraints in string form */

Output: pathSim /* path similarity */

①. Dictionary[1, 2, …, k];

/* the array Dictionary is a dictionary of all labels contained in the two input path constraints, sorted in dictionary order; identical strings occupy only one entry; k is the number of distinct node labels in StringPath_1 and StringPath_2 */

②. Path_1 := Mapping(StringPath_1, Dictionary);

/* the function Mapping converts every string in the string array StringPath_1 into its index in Dictionary and returns the resulting integer array */

③. Path_2 := Mapping(StringPath_2, Dictionary);

④. minLength := min(StringPath_1.Length, StringPath_2.Length);

⑤. DecPath_1 := PathDecomposition(Path_1, k+1, minLength);

/* extract the N-Gram information of the path constraint according to Algorithm 2 */

⑥. DecPath_2 := PathDecomposition(Path_2, k+1, minLength);

⑦. pathSim := |DecPath_1 ∩ DecPath_2|;

/* this is the number of identical items C of Definition 4; Definition 5 normalizes the count as 2C / (t(t+1)) with t = max(n, m) */

⑧. RETURN pathSim;

⑨. End of algorithm

The purpose of the algorithm is to calculate the similarity of two path constraints. k is the number of distinct node labels that occur in the two path constraints to be compared; these k labels are sorted in dictionary order, so each node label is mapped in turn to a positive integer in [1, k]. A node label given as a string is thus converted into a number, and identical label names receive identical numbers. Each path constraint is finally represented as an ordered integer array. The base t = k+1 is used, which achieves the purpose for which Algorithm 1 introduces this parameter. Table 2 illustrates the mapping of the strings of the two constraints to be compared, BOOK → SECTION → TITLE and BOOK → SECTION → FIGURE → CAPTION.
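The example of Table 2 can be worked through by hand. With the numbering given there (BOOK = 1, SECTION = 2, TITLE = 3, FIGURE = 4, CAPTION = 5) we have k = 5 and base 6, the two paths become (1, 2, 3) and (1, 2, 4, 5), and min(n, m) = 3. Normalizing the count with Definition 5 is assumed here.

```latex
\begin{aligned}
\text{1-Gram:}&\quad \{1, 2, 3\} \cap \{1, 2, 4, 5\} = \{1, 2\}, & C_1 &= 2 \\
\text{2-Gram:}&\quad \{8, 15\} \cap \{8, 16, 29\} = \{8\}, & C_2 &= 1 \\
\text{3-Gram:}&\quad \{51\} \cap \{52, 101\} = \emptyset, & C_3 &= 0 \\
C &= C_1 + C_2 + C_3 = 3, \qquad t = \max(3, 4) = 4 \\
Sim(p_1, p_2) &= \frac{2 \cdot 3}{4 \cdot 5} = 0.3
\end{aligned}
```

The shared prefix BOOK → SECTION accounts for all three identical items: its two labels and the 2-Gram item 1 × 6 + 2 = 8.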

Algorithm 4. BPCSimilarity: BPC similarity

Input: BPC of node e_1, BPC of node e_2

Output: BPCsim /* BPC similarity, which is also the node similarity */

①. α := 0.6; /* the parameter α is the proportion of the ancestor path constraint within the BPC: the larger α is, the greater the influence of the ancestor path constraint on the BPC similarity and the smaller the influence of the child path constraint; conversely, the smaller α is, the greater the influence of the child path constraint and the smaller the influence of the ancestor path constraint */

②. BPCsim := α × PathSimilarity(P_A(e_1), P_A(e_2)) + (1−α) × PathSimilarity(P_S(e_1), P_S(e_2));

③. RETURN BPCsim;

④. End of algorithm

The purpose of the algorithm is to calculate the BPC similarity of two nodes. An influence factor describes how strongly the ancestor path constraint influences the BPC similarity. This factor must be set according to the concrete application; in general the ancestor path constraint is considered to influence the BPC similarity more than the child path constraint, i.e. α > 0.5.

Algorithm 5. XML document similarity

Input: XML document trees D_1 and D_2

Output: documentSim /* similarity of documents D_1 and D_2 */

①. Traverse the document trees D_1 and D_2 and build the corresponding BPC models;

②. s[n×m];

/* BPC similarity matrix; document D_1 has n nodes and document D_2 has m nodes */

③. s_ij := BPCSimilarity((P_A(e_i), P_S(e_i)), (P_A(e_j), P_S(e_j)));

/* according to Algorithm 4, s_ij stores the similarity between node e_i and node e_j, where node e_i belongs to document D_1 and node e_j belongs to document D_2 */

④. documentSim := \frac{\sum_{i = 1}^{n} w (e_{i}) \max_{j = 1}^{m} (s_{ij})}{\sum_{i = 1}^{n} w (e_{i})}

/* the function w(e) returns the weight of node e, w(e) = 2^{-lev(e)} */

The purpose of the algorithm is to calculate the similarity of two XML documents. Because the BPC similarity matrix is symmetric about its main diagonal, only the upper triangle of the matrix needs to be computed in practice and can then be copied to the lower triangle, which halves the number of calculations. Fig. 3 is a flowchart of the document similarity algorithm.
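The five algorithms can be chained into one compact end-to-end sketch. Everything named here (`grams`, `path_sim`, `nodes`, `document_similarity`, the nested-tuple tree encoding and the two sample documents) is hypothetical. The sketch assumes that path similarities are normalized as in Definition 5, that ε is similar only to ε, and that α = 0.6 as in Algorithm 4; it computes the full matrix row by row rather than exploiting any symmetry.

```python
from typing import Iterator, Sequence, Tuple

Tree = Tuple[str, list]  # (label, [child subtrees]); hypothetical encoding


def grams(path: Sequence[int], base: int, order: int) -> set:
    """All i-Gram items (i <= order) of an integer path, as a set."""
    out = set()
    for s in range(len(path)):
        v = 0
        for a in path[s:s + order]:
            v = v * base + a
            out.add(v)
    return out


def path_sim(p1: Sequence[str], p2: Sequence[str]) -> float:
    if not p1 or not p2:  # assumption: ε is similar only to ε
        return float(not p1 and not p2)
    code = {lab: i + 1 for i, lab in enumerate(sorted(set(p1) | set(p2)))}
    base, order = len(code) + 1, min(len(p1), len(p2))
    c = len(grams([code[x] for x in p1], base, order)
            & grams([code[x] for x in p2], base, order))
    t = max(len(p1), len(p2))
    return 2 * c / (t * (t + 1))


def nodes(tree: Tree, anc: tuple = ()) -> Iterator[Tuple[tuple, tuple, int]]:
    """Yield (P_A, P_S, level) for every node in document order."""
    label, children = tree
    p_a = anc + (label,)
    yield p_a, tuple(c[0] for c in children), len(anc)
    for c in children:
        yield from nodes(c, p_a)


def document_similarity(d1: Tree, d2: Tree, alpha: float = 0.6) -> float:
    n2 = list(nodes(d2))
    num = den = 0.0
    for p_a, p_s, lev in nodes(d1):
        # Best BPC similarity of this D1 node against all D2 nodes.
        best = max(alpha * path_sim(p_a, q_a) + (1 - alpha) * path_sim(p_s, q_s)
                   for q_a, q_s, _ in n2)
        w = 2.0 ** -lev  # w(e) = 2^(-lev(e))
        num += w * best
        den += w
    return num / den


d1 = ("BOOK", [("TITLE", []), ("SECTION", [("TITLE", [])])])
d2 = ("BOOK", [("AUTHOR", [])])
print(document_similarity(d1, d1), document_similarity(d1, d2))
```

Note that the sum runs over the nodes of the first document only, so swapping the arguments can give a different value.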

Table 1. BPC of each node of the XML document tree of Fig. 1

Node	BPC of the node
BOOK	(BOOK, ISBN→SECTION→SECTION)
ISBN	(BOOK→ISBN, #text)
#text	(BOOK→ISBN→#text, ε)
SECTION	(BOOK→SECTION, TITLE→#text→FIGURE)
TITLE	(BOOK→SECTION→TITLE, ε)
#text	(BOOK→SECTION→#text, ε)
FIGURE	(BOOK→SECTION→FIGURE, CAPTION)
CAPTION	(BOOK→SECTION→FIGURE→CAPTION, ε)
SECTION	(BOOK→SECTION, TITLE→#text→BOLD)
TITLE	(BOOK→SECTION→TITLE, ε)
#text	(BOOK→SECTION→#text, ε)
BOLD	(BOOK→SECTION→BOLD, #text)
#text	(BOOK→SECTION→BOLD→#text, ε)

Table 2. Mapping of the strings of the two constraints to be compared, BOOK → SECTION → TITLE and BOOK → SECTION → FIGURE → CAPTION

Label	Number
BOOK	1
SECTION	2
TITLE	3
FIGURE	4
CAPTION	5

Claims

1. A method for calculating the similarity of XML documents, characterized in that the method comprises the following steps:

Step 1: define an XML document as an XML document tree, represented as a 6-tuple;

Step 2: establish the bidirectional path constraint (BPC) model: the BPC of a node is defined on the basis of the document tree of step 1, and the set of BPCs of all nodes contained in an XML document is called the bidirectional path constraint model;

Step 3: use the N-Gram-based partition mode to calculate the similarity between two ancestor path constraints or between two child path constraints, referred to as the path constraint similarity;

Step 4: calculate the BPC similarity of two nodes from the path constraint similarity obtained in step 3, and take this BPC similarity as the similarity of the two nodes;

Step 5: finally, take the sum of all node similarities in the documents, weighted according to the structural level of the nodes, as the similarity of the two documents.

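2. The method according to claim 1, characterized in that the XML document tree of step 1 is defined as follows:

Definition 1. XML document tree: an XML document tree is represented as a 6-tuple T = (V, v_0, E, Σ, P, lab), where:

1) V is the set of all nodes in the document tree;

2) v_0 is the root node of the document tree;

3) E_a defines the set of parent-child constraints, E_a = {(u, v) | u ∈ V ∧ v ∈ V, u is the parent node of v}; E_s defines the set of sibling constraints, E_s = {(u, v) | u ∈ V ∧ v ∈ V, v is the right sibling of u}; E denotes the constraint set, i.e. E = E_a ∪ E_s;

4) Σ is the set of node labels in the document tree;

5) P_A defines the ancestor path constraints, P_A = {(v_0, v_1, …, v_n) | (v_i, v_{i+1}) ∈ E_a, 0 ≤ i < n} ∪ {v_0}; P_S defines the child path constraints, P_S = {(v_1, …, v_n) | (v_i, v_{i+1}) ∈ E_s, 0 < i < n, v_1 and v_n are respectively the first and last child of their parent node} ∪ {v_1 | v_1 is the only child of its parent node}; P denotes the set of path constraints, i.e. P = P_A ∪ P_S, and

P \subseteq V \cup V^{2} \cup \dots \cup V^{|V|};

6) the function lab returns the label of a node, i.e. for v ∈ V, lab(v) ∈ Σ.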

3. The method according to claim 1, characterized in that the BPC of a node in step 2 is defined as follows:

Definition 2. BPC of a node: P_A(e) denotes the ancestor path constraint of node e, P_A(e) = (v_0, v_1, …, e) ∈ P_A; P_S(e) denotes the child path constraint of node e, P_S(e) = (u_1, …, u_n) ∈ P_S with (e, u_i) ∈ E_a; cons(e) denotes the BPC of the node, cons(e) = (P_A(e), P_S(e)), e ∈ V; for a leaf node of the document tree, P_S(e) is empty and is denoted by ε.

4. The method according to claim 1, characterized in that the method of step 3 for calculating the similarity between two path constraints using the N-Gram-based partition mode is as follows:

Let k be the number of distinct node labels that occur in the two path constraints to be compared; these k node labels are sorted in dictionary order, so that each node label is mapped in turn to a positive integer in [1, k]; a node label given as a string is thus converted into a number, and identical label names receive identical numbers; each path constraint is finally represented as an ordered integer array;

Definition 3. Partition mode based on the N-Gram idea: an integer array of length n is divided into n sub-arrays, where the i-th sub-array (0 < i ≤ n) stores the extracted i-Gram items; this sub-array is called the i-Gram array for short and contains n−i+1 items, each generated from i consecutive items (a_1, a_2, …, a_i) of the original integer array as follows:

i-GramItem = a_1 × (k+1)^{i-1} + a_2 × (k+1)^{i-2} + … + a_i × (k+1)^0

The base k+1 is introduced to guarantee that the items of each sub-array are unique; the 1-Gram array has n items, the 2-Gram array has n−1 items, …, the (n−1)-Gram array has 2 items and the n-Gram array has 1 item, so all sub-arrays together contain n(n+1)/2 items; to simplify later processing, the n sub-arrays are stored one after another in a single array of length n(n+1)/2;

The two path constraints to be compared are converted into integer arrays by the label mapping; their lengths are n and m respectively, and they are either both ancestor path constraints or both child path constraints of two nodes; according to Definition 3 they are decomposed in turn into the 1-Gram array, the 2-Gram array, …, and the min(n, m)-Gram array;

Definition 4. Number of identical items C of two one-dimensional arrays: the arrays are regarded as sets, and the number of identical items C is the size of the intersection of the two sets;

Definition 5. Path constraint similarity: according to the definitions above, the path constraint similarity is

Sim (p_{1}, p_{2}) = \frac{C}{\frac{t (t + 1)}{2}} = \frac{2 C}{t (t + 1)},

where t = max(n, m), and p_1, p_2 ∈ P_A or p_1, p_2 ∈ P_S;

Definition 6. BPC similarity: let α be the influence factor of the ancestor path constraint; 1−α is then the influence factor of the child path constraint, with 0 ≤ α ≤ 1, and the BPC similarity is

Sim(cons(e), cons(e_0)) = α × Sim(P_A(e), P_A(e_0)) + (1−α) × Sim(P_S(e), P_S(e_0)).

5. The method according to claim 1, characterized in that the method of step 5 for taking the weighted sum of all node similarities in the documents as the similarity of the two documents is as follows:

Definition 7. Document similarity: let D_1 and D_2 be two XML documents with n and m nodes respectively; the BPC similarity between every node of D_1 and every node of D_2 is calculated according to Definition 6 to form a similarity matrix; for each node of D_1, the similarity value of its most similar node in D_2 is then selected, and the document similarity is

Sim (D_{1}, D_{2}) = \frac{\sum_{i = 1}^{n} w (v_{i}) \max_{j = 1}^{m} (s_{ij})}{\sum_{i = 1}^{n} w (v_{i})},

where s_ij = Sim(cons(v_i), cons(v_j)), 1 ≤ i ≤ n, 1 ≤ j ≤ m;

In the XML document tree, the closer a node is to the root node, the greater its influence on the document structure; the weight

w (v_{i}) = 2^{- lev (v_{i})}

describes the different influence of nodes at different depths, where lev(v_i) is the level of node v_i and the level of the root node is 0.