CN101876995A - Method for calculating similarity of XML documents - Google Patents


Publication number
CN101876995A
Authority
CN
China
Prior art keywords
node
similarity
gram
array
bpc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009102449033A
Other languages
Chinese (zh)
Inventor
汪陈应
袁晓洁
廉鑫
林伟坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2009-12-18
Filing date
2009-12-18
Publication date
2010-11-03
Application filed by Nankai University
Priority to CN2009102449033A
Publication of CN101876995A
Legal status: Pending

Abstract

The invention belongs to the technical field of databases. It establishes a constraint model for XML documents, called the bidirectional path constraint (BPC) model, and on this basis discloses a new method for calculating the similarity of XML documents. The bidirectional path constraint of each node captures the structural information of an XML document more completely, so that document similarity can be measured more accurately. The N-Gram idea, which is well established in natural language processing, is introduced, and an N-Gram-based partition scheme is applied to the similarity calculation of path constraints. Encoding labels as positive integers in a positional number base simplifies the extraction and manipulation of N-Gram information. The method can be used for XML document classification, clustering, schema extraction, and similar tasks.

Description

A method for calculating the similarity of XML documents
[Technical Field]
The invention belongs to the technical field of databases and specifically relates to a method for calculating the similarity of XML documents.
[Background]
Extensible Markup Language (XML) has become the standard format for representing and exchanging data on the Web. With the promotion and adoption of XML-related standards, industries of all kinds use XML as a meta-language to define domain-specific sublanguages for storing and sharing the data of their fields. Against this background, large numbers of XML documents keep emerging in every field, and mining knowledge from such large document collections has become a pressing problem. XML data mining is an important application of knowledge discovery, and similarity calculation plays a fundamental role in it.
XML document mining is divided into content mining and structure mining, and it can be used for extracting and integrating XML data, among other applications. XML documents are semi-structured data, so structure mining is particularly important. Classification and clustering are commonly used data mining methods; XML document similarity is the basis of both and a key factor affecting the mining result.
At present there are two main classes of methods for calculating XML document similarity: methods based on tree edit distance and methods based on frequent paths. Methods based on tree edit distance are widely used. They first represent an XML document as an ordered labeled tree, for example a DOM tree, and then measure the similarity of the document trees by their tree edit distance. There are three classic algorithms based on tree edit distance (Selkow, Chawathe, and Dalamagas), but the time complexity of tree edit distance algorithms is generally high. Methods based on frequent paths compute document similarity quickly, but they discard all non-frequent paths and therefore lose a large amount of structural information, so their accuracy is relatively low.
[Summary of the Invention]
The object of the invention is to remedy the above shortcomings of the prior art by proposing a new method for calculating the similarity of XML documents. The method uses the BPC model to extract the structural information of an XML document, introduces weights that reflect the structural level of each node, and, based on an N-Gram partition scheme, reduces the time complexity of the similarity calculation by scanning each path only once.
The method for calculating the similarity of XML documents provided by the invention comprises the following steps:
Step 1: define the XML document as an XML document tree;
Step 2: establish the bidirectional path constraint (BPC) model: define the BPC of a node on the basis of the document tree of step 1; the set of the BPCs of all nodes contained in an XML document is called the bidirectional path constraint model;
Step 3: use the N-Gram-based partition scheme to calculate the similarity between two ancestor path constraints (or two child path constraints), called the path constraint similarity;
Step 4: from the path constraint similarity obtained in step 3, calculate the BPC similarity of two nodes, and take this BPC similarity as the similarity of the two nodes;
Step 5: finally, sum the node similarities, weighted by the structural level of each node, and take the result as the similarity of the two documents.
The detailed calculation process of the invention is as follows:
1. XML document tree
An XML document is defined as an XML document tree, as follows:
Definition 1 (XML document tree). An XML document tree is represented as a 6-tuple $T = (V, v_0, E, \Sigma, P, lab)$, where:
1) $V$ is the set of all nodes in the document tree;
2) $v_0$ is the root node of the document tree;
3) $E_a$ defines the set of parent-child constraints, $E_a = \{(u, v) \mid u \in V \wedge v \in V,\ u \text{ is the parent node of } v\}$; $E_s$ defines the set of sibling constraints, $E_s = \{(u, v) \mid u \in V \wedge v \in V,\ v \text{ is the right sibling node of } u\}$; $E$ denotes the constraint set, i.e. $E = E_a \cup E_s$;
4) $\Sigma$ is the set of node labels in the document tree;
5) $P_A$ defines the ancestor path constraints, $P_A = \{(v_0, v_1, \ldots, v_n) \mid (v_i, v_{i+1}) \in E_a,\ 0 \le i < n\} \cup \{v_0\}$; $P_S$ defines the child path constraints, $P_S = \{(v_1, \ldots, v_n) \mid (v_i, v_{i+1}) \in E_s,\ 0 < i < n,\ v_1 \text{ and } v_n \text{ are the first and last child nodes of their parent node}\} \cup \{v_1 \mid v_1 \text{ is the only child node of its parent node}\}$; $P$ denotes the set of path constraints, i.e. $P = P_A \cup P_S$, $P \subseteq V \cup V^2 \cup \cdots \cup V^{|V|}$;
6) the function $lab$ returns the label of a node, i.e. for $v \in V$, $lab(v) \in \Sigma$.
Note that the focus here is structural similarity; traditional information retrieval techniques already handle content similarity well, so all text nodes are uniformly treated as nodes whose label is #text. In addition, attribute nodes are treated as a special kind of element node. Fig. 1 shows an example document tree.
2. BPC of a node
Definition 2 (BPC of a node). $P_A(e)$ defines the ancestor path constraint of node $e$, $P_A(e) = (v_0, v_1, \ldots, e) \in P_A$; $P_S(e)$ defines the child path constraint of node $e$, $P_S(e) = (u_1, \ldots, u_n) \in P_S$ with $(e, u_i) \in E_a$; $cons(e)$ defines the BPC of the node, $cons(e) = (P_A(e), P_S(e))$, $e \in V$. For a leaf node of the document tree, $P_S(e)$ is empty and is denoted by $\varepsilon$.
Methods based on tree edit distance usually extract only the ancestor path constraint. The BPC used by the invention adds the child path constraint to the original ancestor path constraint. The structural information of the XML document is thereby captured more fully, which can improve the accuracy of clustering based on document similarity.
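The BPC of every node can be collected in one traversal of the document tree. The following Python sketch is illustrative only: the helper names `to_tree` and `bpc_model`, the sample document, and the choice to model each attribute as a node with a single #text child are assumptions rather than part of the patent text. The sample is constructed so that its node listing has the same shape as Table 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption, not a transcription of Fig. 1).
SAMPLE = (
    '<BOOK ISBN="0-000">'
    "<SECTION><TITLE/>intro<FIGURE><CAPTION/></FIGURE></SECTION>"
    "<SECTION><TITLE/>intro<BOLD>note</BOLD></SECTION>"
    "</BOOK>"
)

def to_tree(elem):
    """Convert an Element into (label, [children]); text becomes '#text' leaves."""
    # Assumption: an attribute is an element-like node with one '#text' child.
    kids = [(name, [("#text", [])]) for name in elem.attrib]
    if elem.text and elem.text.strip():
        kids.append(("#text", []))
    for child in elem:
        kids.append(to_tree(child))
        if child.tail and child.tail.strip():
            kids.append(("#text", []))
    return (elem.tag, kids)

def bpc_model(node, ancestors=()):
    """Yield cons(e) = (ancestor path, child path) for every node, in document order."""
    label, kids = node
    path = ancestors + (label,)
    yield path, tuple(kid[0] for kid in kids)   # () stands for the empty path epsilon
    for kid in kids:
        yield from bpc_model(kid, path)

model = list(bpc_model(to_tree(ET.fromstring(SAMPLE))))
for ancestor_path, child_path in model:
    print("→".join(ancestor_path), "|", "→".join(child_path) or "ε")
```

Each printed line pairs the path from the root down to the node with the left-to-right sequence of its children, which is exactly the pair that Definition 2 calls cons(e).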
3. Calculating the similarity between two path constraints based on the N-Gram idea
Let k be the number of distinct node labels that occur in the two path constraints to be compared. These k node labels are sorted in dictionary order, and each node label is then mapped, in order, to a positive integer in [1, k]. A node label given as a string is thus converted into a number, and identical label names receive the same number. A path constraint therefore finally takes the form of an ordered integer array.
Definition 3 (N-Gram-based partition). An integer array of length n is partitioned into n sub-arrays. The i-th sub-array ($0 < i \le n$) stores the extracted i-Gram items and is called the i-Gram array. It contains n-i+1 items, each generated from i consecutive items $(a_1, a_2, \ldots, a_i)$ of the original integer array as follows:
$\text{i-GramItem} = a_1 \times (k+1)^{i-1} + a_2 \times (k+1)^{i-2} + \cdots + a_i \times (k+1)^{0}$
The base k+1 is introduced to guarantee the uniqueness of the items of each sub-array. The 1-Gram array has n items, the 2-Gram array has n-1 items, ..., the (n-1)-Gram array has 2 items, and the n-Gram array has 1 item, so all sub-arrays together contain $n(n+1)/2$ items. To simplify later processing, the n sub-arrays are stored one after another in a single array of length $n(n+1)/2$.
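As a minimal sketch of Definition 3, the Python code below builds the concatenated 1-Gram to n-Gram arrays directly from the formula, without the single-scan optimisation described later in the embodiment. The function names `gram_item` and `partition` are hypothetical; the input path 6 → 3 → 4 → 5 → 3 with k = 6 (base 7) is the one used for Fig. 2.

```python
def gram_item(window, k):
    """Encode i consecutive labels (each in 1..k) as one integer in base k+1."""
    item = 0
    for a in window:
        # Horner form of a1*(k+1)^(i-1) + a2*(k+1)^(i-2) + ... + ai*(k+1)^0
        item = item * (k + 1) + a
    return item

def partition(path, k):
    """Return the 1-Gram, 2-Gram, ..., n-Gram arrays stored one after another."""
    n = len(path)
    grams = []
    for i in range(1, n + 1):                 # gram length
        for start in range(n - i + 1):        # the i-Gram array has n-i+1 items
            grams.append(gram_item(path[start:start + i], k))
    return grams

path = [6, 3, 4, 5, 3]
grams = partition(path, k=6)
print(grams)
# [6, 3, 4, 5, 3, 45, 25, 33, 38, 319, 180, 234, 2238, 1263, 15669]
print(len(grams) == len(path) * (len(path) + 1) // 2)   # True: n(n+1)/2 items
```

Because every label is at least 1 and the base is k+1, an i-Gram item always has exactly i non-zero digits in base k+1, which is why items of different gram lengths cannot collide.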
The two path constraints to be compared are converted into integer arrays by the label mapping; their lengths are n and m respectively. They are either both ancestor path constraints or both child path constraints of two nodes. According to Definition 3, each is decomposed in turn into its 1-Gram array, 2-Gram array, ..., and min(n, m)-Gram array.
Definition 4 (number of identical items C of two one-dimensional arrays). The arrays are regarded as sets, and the intersection of the two sets gives the number of identical items C.
Let $C_i$ denote the identical items of the two i-Gram arrays obtained by decomposing the two path constraints ($C_i$ is empty for $i > \min(n, m)$). When an item of an i-Gram array matches, all of its sub-items (the shorter grams it contains) also match, and the number of these matched sub-items implicitly acts as the weight of $C_i$. With $C = \bigcup_{i=1}^{n} C_i$, C therefore represents the identical items of the two path constraints after decomposition, and its size is the count used below.
Definition 5 (path constraint similarity). According to the definitions above, the path constraint similarity is given by:
$Sim(p_1, p_2) = \dfrac{C}{t(t+1)/2} = \dfrac{2C}{t(t+1)}$, where $t = \max(n, m)$ and $p_1, p_2 \in P_A$ or $p_1, p_2 \in P_S$.
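As a worked illustration of Definitions 3 to 5, consider the two ancestor path constraints BOOK → SECTION → TITLE and BOOK → SECTION → FIGURE → CAPTION with the label numbering of Table 2 (BOOK = 1, SECTION = 2, TITLE = 3, FIGURE = 4, CAPTION = 5), so that k = 5 and the base is 6. The arithmetic below is a reader's check under those values, not text from the patent.

```latex
\begin{aligned}
p_1 &= (1,2,3),\quad p_2 = (1,2,4,5),\quad n = 3,\ m = 4,\ k+1 = 6 \\
\text{1-Gram:}\quad & \{1,2,3\} \cap \{1,2,4,5\} = \{1,2\} && \Rightarrow |C_1| = 2 \\
\text{2-Gram:}\quad & \{8,15\} \cap \{8,16,29\} = \{8\} && \Rightarrow |C_2| = 1 \\
\text{3-Gram:}\quad & \{51\} \cap \{52,101\} = \emptyset && \Rightarrow |C_3| = 0 \\
C &= 2 + 1 + 0 = 3,\qquad t = \max(3,4) = 4 \\
Sim(p_1,p_2) &= \frac{2 \times 3}{4 \times 5} = 0.3
\end{aligned}
```

The shared prefix BOOK → SECTION contributes both its two single labels and its one 2-Gram, which is how longer common sub-paths receive more weight.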
4. BPC similarity
To preserve the original structural information, the invention extracts the BPC of every node of the XML document, but the ancestor path similarity and the child path similarity may influence the BPC to different degrees. An influence factor is introduced to describe how strongly the ancestor path constraint influences the BPC. This influence factor is set by the programmer. It is generally assumed that the ancestor path constraint has the larger influence on the BPC.
Definition 6 (BPC similarity). Let α be the influence factor of the ancestor path constraint; then 1-α is the influence factor of the child path constraint, with 0 ≤ α ≤ 1. The BPC similarity is given by:
$Sim(cons(e), cons(e_0)) = \alpha \times Sim(P_A(e), P_A(e_0)) + (1-\alpha) \times Sim(P_S(e), P_S(e_0))$.
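For a concrete feel of Definition 6, the following check compares the two SECTION nodes listed in Table 1, whose BPCs are (BOOK → SECTION, TITLE → #text → FIGURE) and (BOOK → SECTION, TITLE → #text → BOLD). The value α = 0.6 is the one used in Algorithm 4 of the embodiment; the arithmetic itself is a reader's illustration.

```latex
\begin{aligned}
Sim(P_A(e), P_A(e_0)) &= \frac{2\,(2+1)}{2 \times 3} = 1
  && \text{identical paths, } t = 2 \\
Sim(P_S(e), P_S(e_0)) &= \frac{2\,(2+1+0)}{3 \times 4} = 0.5
  && \text{TITLE, \#text and (TITLE, \#text) match, } t = 3 \\
Sim(cons(e), cons(e_0)) &= 0.6 \times 1 + (1 - 0.6) \times 0.5 = 0.8
\end{aligned}
```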
5. Document similarity
Definition 7 (document similarity). Given two XML documents $D_1$ and $D_2$ with n and m nodes respectively, the BPC similarity between the BPC of each node of $D_1$ and the BPC of each node of $D_2$ is calculated according to Definition 6 to form a similarity matrix. For each node of $D_1$, the similarity value of the most similar node of $D_2$ is selected, and the document similarity is then given by:
$Sim(D_1, D_2) = \sum_{i=1}^{n} w(v_i) \max_{j=1}^{m}(s_{ij}) \Big/ \sum_{i=1}^{n} w(v_i)$, where $s_{ij} = Sim(cons(v_i), cons(v_j))$, $1 \le i \le n$, $1 \le j \le m$.
In the labeled tree of an XML document, the closer a node is to the root, the larger its influence on the document structure. The weight $w(v_i) = 2^{-lev(v_i)}$ is introduced to describe the different influence of different node depths, where $lev(v_i)$ is the level of node $v_i$ and the level of the root node is 0.
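Definition 7 can be sketched in a few lines once the BPC similarity matrix is available. The function name `document_similarity` and the toy matrix below are assumptions made for illustration; only the formula and the weight 2^(-lev) come from the text.

```python
def document_similarity(s, levels):
    """Weighted best-match similarity of Definition 7.

    s[i][j]   BPC similarity between node i of D1 and node j of D2
    levels[i] level of node i of D1 in its tree (the root has level 0)
    """
    weights = [2.0 ** -lev for lev in levels]              # w(v_i) = 2^(-lev(v_i))
    weighted = sum(w * max(row) for w, row in zip(weights, s))
    return weighted / sum(weights)

# Toy input: D1 has a root and two children, D2 has two nodes (made-up values).
s = [
    [1.0, 0.2],    # root of D1
    [0.5, 0.25],   # first child
    [0.0, 0.5],    # second child
]
print(document_similarity(s, levels=[0, 1, 1]))   # (1*1 + 0.5*0.5 + 0.5*0.5) / 2 = 0.75
```

Note that both sums run over the nodes of D1 only, so swapping the two documents can in general give a different value.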
Advantages and positive effects of the invention:
The invention proposes a new method for comparing the similarity of XML documents. The method uses the BPC model to extract the structural information of an XML document more fully, which lays the foundation for calculating XML document similarity accurately. Weights are introduced to reflect the structural level of each node. The N-Gram idea is used in a novel way to simplify the measurement of path similarity, which makes the method both accurate and efficient. Used as the basis of classification and clustering, it can improve their accuracy.
[Description of Drawings]
Fig. 1 shows an XML document and its corresponding XML document tree.
Fig. 2 shows the extraction of the N-Gram information of the path constraint 6 → 3 → 4 → 5 → 3 using the N-Gram idea. The figure comprises five steps, (a) to (e). Because the largest integer that occurs is 6, base 7 is used during extraction. Specifically:
(a) shows the first 1-Gram item being filled after the first element of the path is scanned.
(b) shows the second 1-Gram item and the first 2-Gram item being filled after the second element of the path is scanned.
(c) shows the third 1-Gram item, the second 2-Gram item, and the first 3-Gram item being filled after the third element of the path is scanned.
(d) shows the fourth 1-Gram item, the third 2-Gram item, the second 3-Gram item, and the first 4-Gram item being filled after the fourth element of the path is scanned.
(e) shows the fifth 1-Gram item, the fourth 2-Gram item, the third 3-Gram item, the second 4-Gram item, and the first 5-Gram item being filled after the fifth element of the path is scanned.
Fig. 3 is a flow chart of the document similarity algorithm.
[Embodiments]
N-Gram (N-ary grammar) is a language model commonly used in large-vocabulary continuous speech recognition. The model rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained directly from a corpus by counting how often N words occur together. The most commonly used variants are the binary Bi-Gram and the ternary Tri-Gram, which are widely used in natural language processing. An N-Gram can be understood as a sequence of N words.
Embodiment 1: the method for constructing the BPC model from the XML document tree is as follows:
1. An XML document is defined as an XML document tree as proposed by the invention, and the BPC model is built for each node on the basis of this document tree. Fig. 1 shows an XML document and its corresponding XML document tree, and Table 1 lists the BPC of each node, taking the document tree of Fig. 1 as an example.
Embodiment 2: the method for calculating document similarity based on the N-Gram idea is as follows:
Algorithm 1. CreateGram: generating an (i+1)-Gram item from two adjacent i-Gram items
Input: item1, item2 /* two adjacent i-Gram items represented as positive integers */
t /* base t */
Output: item /* the (i+1)-Gram item represented as a positive integer */
①. item := item1 × t + item2 % t;
②. RETURN item;
③. End of algorithm
This algorithm generates an (i+1)-Gram item from two adjacent i-Gram items. The base t in the algorithm is the number of distinct labels in the two path constraints to be compared, plus 1. For a given path constraint, introducing the base t guarantees that, for i ≠ j, the integer range of the i-Gram items and the integer range of the j-Gram items do not intersect.
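Algorithm 1 translates almost literally into Python. In the sketch below the function name `create_gram` is a hypothetical rendering of CreateGram, and the sample values come from the path 6 → 3 → 4 → 5 → 3 in base 7.

```python
def create_gram(item1, item2, t):
    """Combine two adjacent i-Gram items into one (i+1)-Gram item (base t).

    item1 encodes labels (a1, ..., ai) and item2 encodes (a2, ..., a(i+1)).
    Shifting item1 by one base-t digit and appending the last digit of item2
    gives the encoding of (a1, ..., a(i+1)).
    """
    return item1 * t + item2 % t

# 2-Gram items of 6 → 3 → 4 in base 7: (6, 3) = 45 and (3, 4) = 25.
print(create_gram(45, 25, 7))   # 319 = 6*49 + 3*7 + 4, the 3-Gram item of (6, 3, 4)
```

Only the last digit of item2 is needed because its leading digits overlap with the trailing digits already contained in item1.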
Algorithm 2. PathDecomposition: extracting the N-Gram information of a path constraint
Input: Path[1, 2, …, n] /* the path constraint after mapping to a positive-integer array */
t /* base t, with the same meaning as in Algorithm 1 */
n0 /* the largest N-Gram to be extracted, i.e. only k-Gram sub-arrays with k ≤ n0 are extracted */
Output: NGram /* the extracted N-Gram information */
①. pos[1, 2, …, n];
/* pos[i] records the start position, within the NGram array, of the i-Gram array of the path constraint Path (i = 1, 2, …, Path.Length) */
②. pos[i] := (2ni - 2n + 3i - i²) / 2;
③. FOREACH member IN Path
④.   i := index of member in Path;
⑤.   NGram[i] := member; /* fill the i-th 1-Gram item */
⑥.   j := 2; /* j indicates the j-Gram item to be filled */
⑦.   IF j ≤ i && j ≤ n0 THEN
⑧.     item1 := NGram[pos[j-1] + i - j];
⑨.     item2 := NGram[pos[j-1] + i - j + 1];
⑩.     NGram[pos[j] + i - j] := CreateGram(item1, item2, t);
/* fill the (i-j+1)-th j-Gram item from the (j-1)-Gram items; since pos[j] is the 1-based start position (pos[1] = 1), the (i-j+1)-th item of a sub-array lies at pos[j] + i - j */
⑪.     j++;
⑫.     GOTO ⑦
⑬.   END IF
⑭. END FOREACH
⑮. RETURN NGram;
⑯. End of algorithm
The main purpose of this algorithm is to scan the array Path once, extract all i-Gram items it contains, and fill them into the corresponding positions of the NGram array. The length of each i-Gram array is fixed, and the array pos stores the start position of each i-Gram array within NGram. Depending on the scan position i, items are filled as follows:
i = 1: fill the 1st 1-Gram item
i = 2: fill the 2nd 1-Gram item and the 1st 2-Gram item
i = 3: fill the 3rd 1-Gram item, the 2nd 2-Gram item, and the 1st 3-Gram item
......
i = n: fill the n-th 1-Gram item, the (n-1)-th 2-Gram item, ..., and the 1st n-Gram item
It follows that, once the current scan position i in Path is known and the item to be filled is known to belong to the j-Gram array, the storage position of that item in NGram can be computed with the help of the array pos. Steps ⑧ to ⑩ of the algorithm use Algorithm 1 to generate the (i-j+1)-th item of the j-Gram array from the (i-j+1)-th and (i-j+2)-th items of the (j-1)-Gram array. When the scan of Path ends, the corresponding N-Gram information array NGram is completely filled. Fig. 2 shows how the N-Gram information of the path constraint 6 → 3 → 4 → 5 → 3 is filled using the N-Gram idea.
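The single-scan filling order described above can be sketched as follows. This is an illustrative Python rendering, not the patent's own code: it uses 0-based offsets (the list `start`) instead of the 1-based pos array, a nested loop instead of GOTO, and the hypothetical names `path_decomposition` and `create_gram`.

```python
def create_gram(item1, item2, t):
    """(i+1)-Gram item from two adjacent i-Gram items, as in Algorithm 1."""
    return item1 * t + item2 % t

def path_decomposition(path, t, n0):
    """Fill the 1-Gram ... min(n, n0)-Gram arrays in one scan over path."""
    n = len(path)
    top = min(n, n0)
    if top < 1:
        return []
    # start[j]: 0-based offset of the j-Gram sub-array inside ngram (j = 1..top).
    start = [0] * (top + 1)
    for j in range(2, top + 1):
        start[j] = start[j - 1] + (n - (j - 1) + 1)   # skip the (j-1)-Gram items
    ngram = [0] * (start[top] + n - top + 1)
    for i, member in enumerate(path, start=1):        # i: 1-based scan position
        ngram[i - 1] = member                         # the i-th 1-Gram item
        for j in range(2, min(i, top) + 1):
            idx = i - j                               # 0-based index of the new j-Gram item
            item1 = ngram[start[j - 1] + idx]         # filled at an earlier position
            item2 = ngram[start[j - 1] + idx + 1]     # filled just now, at position i
            ngram[start[j] + idx] = create_gram(item1, item2, t)
    return ngram

print(path_decomposition([6, 3, 4, 5, 3], t=7, n0=5))
# [6, 3, 4, 5, 3, 45, 25, 33, 38, 319, 180, 234, 2238, 1263, 15669]
```

Each new item is produced with one multiplication and one remainder from two already stored items, so no gram is ever re-encoded from its individual labels.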
Algorithm 3. PathSimilarity: similarity between two path constraints
Input: StringPath1[1, 2, …, n], StringPath2[1, 2, …, m] /* path constraints in string form */
Output: pathSim /* path similarity */
①. Dictionary[1, 2, …, k];
/* the array Dictionary holds all labels contained in the two input path constraints, sorted in dictionary order; identical strings occupy only one entry of the dictionary; k is the number of distinct node labels in StringPath1 and StringPath2 */
②. Path1 := Mapping(StringPath1, Dictionary);
/* the function Mapping returns the integer array obtained by replacing every string of the string array StringPath1 with the index of that string in Dictionary */
③. Path2 := Mapping(StringPath2, Dictionary);
④. minLength := min(StringPath1.Length, StringPath2.Length);
⑤. DecPath1 := PathDecomposition(Path1, k+1, minLength);
/* extract the N-Gram information of the path constraint with Algorithm 2 */
⑥. DecPath2 := PathDecomposition(Path2, k+1, minLength);
⑦. C := |DecPath1 ∩ DecPath2|;
⑧. pathSim := 2 × C / (t × (t + 1)), where t = max(n, m); /* normalisation of Definition 5 */
⑨. RETURN pathSim;
⑩. End of algorithm
The purpose of this algorithm is to calculate the similarity of two path constraints. k is the number of distinct node labels that occur in the two path constraints to be compared; these k node labels are sorted in dictionary order, and each node label is then mapped, in order, to a positive integer in [1, k]. A node label given as a string is thus converted into a number, and identical label names receive the same number. A path constraint therefore finally takes the form of an ordered integer array. The base t = k+1 is used, which serves the purpose for which Algorithm 1 introduces this parameter. Table 2 illustrates the mapping of the strings of two constraints to be compared, BOOK → SECTION → TITLE and BOOK → SECTION → FIGURE → CAPTION.
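Putting the label mapping, the decomposition, and the normalisation of Definition 5 together gives the following sketch. It is illustrative only: the names `gram_set` and `path_similarity` are hypothetical, the grams are encoded directly from the formula of Definition 3 rather than by the single-scan Algorithm 2, and the treatment of empty paths (ε) is an assumption because the text does not define it.

```python
def gram_set(path, base, max_len):
    """Set of all i-Gram items of path for i = 1..max_len, encoded in the given base."""
    items = set()
    for i in range(1, max_len + 1):
        for start in range(len(path) - i + 1):
            value = 0
            for a in path[start:start + i]:
                value = value * base + a
            items.add(value)
    return items

def path_similarity(labels1, labels2):
    """Path constraint similarity 2C / (t(t+1)) with t = max(n, m)."""
    # Assumption (not in the source): two empty paths are identical,
    # and an empty path shares nothing with a non-empty one.
    if not labels1 and not labels2:
        return 1.0
    if not labels1 or not labels2:
        return 0.0
    dictionary = sorted(set(labels1) | set(labels2))          # dictionary order
    code = {label: idx for idx, label in enumerate(dictionary, start=1)}
    path1 = [code[label] for label in labels1]
    path2 = [code[label] for label in labels2]
    base = len(dictionary) + 1                                # t = k + 1 of Algorithm 1
    min_len = min(len(path1), len(path2))
    common = len(gram_set(path1, base, min_len) & gram_set(path2, base, min_len))
    t = max(len(path1), len(path2))
    return 2 * common / (t * (t + 1))

sim = path_similarity(["BOOK", "SECTION", "TITLE"],
                      ["BOOK", "SECTION", "FIGURE", "CAPTION"])
print(round(sim, 6))   # 0.3
```

The result does not depend on which integers the labels receive, only on the mapping being one-to-one, so sorting the labels here gives the same value as any other consistent numbering.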
Algorithm 4. BPCSimilarity: BPC similarity
Input: the BPC of node e1, the BPC of node e2
Output: BPCsim /* the BPC similarity, which is also the node similarity */
①. α := 0.6;
/* the parameter α is the share of the ancestor path constraint in the BPC. The larger α is, the larger the influence of the ancestor path constraint on the BPC similarity and the smaller the influence of the child path constraint; conversely, the smaller α is, the larger the influence of the child path constraint and the smaller the influence of the ancestor path constraint */
②. BPCsim := α × PathSimilarity(P_A(e1), P_A(e2)) + (1-α) × PathSimilarity(P_S(e1), P_S(e2))
③. RETURN BPCsim;
④. End of algorithm
The purpose of this algorithm is to calculate the BPC similarity of two nodes. An influence factor is introduced to describe how strongly the ancestor path constraint influences the BPC similarity. This factor must be set according to the specific application; in general the ancestor path constraint is considered to influence the BPC similarity more than the child path constraint, i.e. α > 0.5.
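A compact sketch of Algorithm 4 follows. To stay short it compares tuples of labels directly instead of base-(k+1) integers, which yields the same match counts because the integer encoding is one-to-one; that substitution, the names `path_similarity` and `bpc_similarity`, and the handling of the empty child path ε are assumptions of this sketch.

```python
def path_similarity(p1, p2):
    """Definition 5 on label sequences, matching n-grams as tuples of labels."""
    if not p1 and not p2:
        return 1.0            # assumption: epsilon compared with epsilon counts as identical
    if not p1 or not p2:
        return 0.0            # assumption: epsilon shares nothing with a non-empty path
    top = min(len(p1), len(p2))

    def grams(p):
        return {tuple(p[s:s + i]) for i in range(1, top + 1)
                for s in range(len(p) - i + 1)}

    t = max(len(p1), len(p2))
    return 2 * len(grams(p1) & grams(p2)) / (t * (t + 1))

def bpc_similarity(bpc1, bpc2, alpha=0.6):
    """alpha * ancestor-path similarity + (1 - alpha) * child-path similarity."""
    (anc1, child1), (anc2, child2) = bpc1, bpc2
    return alpha * path_similarity(anc1, anc2) + (1 - alpha) * path_similarity(child1, child2)

# BPCs of the FIGURE and BOLD nodes as listed in Table 1.
figure = (("BOOK", "SECTION", "FIGURE"), ("CAPTION",))
bold = (("BOOK", "SECTION", "BOLD"), ("#text",))
print(round(bpc_similarity(figure, bold), 6))   # 0.6 * 0.5 + 0.4 * 0.0 = 0.3
```

Raising alpha towards 1 makes the result depend almost entirely on where the nodes sit in the tree, while lowering it emphasises what the nodes contain.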
Algorithm 5. XML document similarity
Input: XML document trees D1 and D2
Output: documentSim /* the similarity of documents D1 and D2 */
①. Traverse the document trees D1 and D2 and build the corresponding BPC models;
②. s[n×m];
/* the BPC similarity matrix; document D1 has n nodes and document D2 has m nodes */
③. s_ij := BPCSimilarity((P_A(e_i), P_S(e_i)), (P_A(e_j), P_S(e_j)));
/* according to Algorithm 4, s_ij stores the similarity between node e_i and node e_j, where node e_i belongs to document D1 and node e_j belongs to document D2 */
④. $documentSim := \sum_{i=1}^{n} w(e_i) \max_{j=1}^{m}(s_{ij}) \Big/ \sum_{i=1}^{n} w(e_i)$
/* the function w(e) returns the weight of node e, $w(e) = 2^{-lev(e)}$ */
The purpose of this algorithm is to calculate the similarity of two XML documents. Because the BPC similarity matrix is symmetric about its main diagonal, in practice only the upper triangle of the matrix needs to be computed and then copied to the lower triangle, which halves the number of computations. Fig. 3 shows the flow chart of the document similarity algorithm.
Table 1. The BPC of each node of the XML document tree in Fig. 1
Node    | BPC of node
BOOK    | (BOOK, ISBN→SECTION→SECTION)
ISBN    | (BOOK→ISBN, #text)
#text   | (BOOK→ISBN→#text, ε)
SECTION | (BOOK→SECTION, TITLE→#text→FIGURE)
TITLE   | (BOOK→SECTION→TITLE, ε)
#text   | (BOOK→SECTION→#text, ε)
FIGURE  | (BOOK→SECTION→FIGURE, CAPTION)
CAPTION | (BOOK→SECTION→FIGURE→CAPTION, ε)
SECTION | (BOOK→SECTION, TITLE→#text→BOLD)
TITLE   | (BOOK→SECTION→TITLE, ε)
#text   | (BOOK→SECTION→#text, ε)
BOLD    | (BOOK→SECTION→BOLD, #text)
#text   | (BOOK→SECTION→BOLD→#text, ε)
Table 2. Example mapping of the strings of two constraints to be compared, BOOK → SECTION → TITLE and BOOK → SECTION → FIGURE → CAPTION
Label   | Integer
BOOK    | 1
SECTION | 2
TITLE   | 3
FIGURE  | 4
CAPTION | 5

Claims (5)

1. A method for calculating the similarity of XML documents, characterized in that the method comprises the following steps:
Step 1: define the XML document as an XML document tree, represented as a 6-tuple;
Step 2: establish the bidirectional path constraint (BPC) model: define the BPC of a node on the basis of the document tree of step 1; the set of the BPCs of all nodes contained in an XML document is called the bidirectional path constraint model;
Step 3: use the N-Gram-based partition scheme to calculate the similarity between two ancestor path constraints or between two child path constraints, called the path constraint similarity;
Step 4: from the path constraint similarity obtained in step 3, calculate the BPC similarity of two nodes, and take this BPC similarity as the similarity of the two nodes;
Step 5: finally, sum the node similarities, weighted by the structural level of each node, and take the result as the similarity of the two documents.
2. The method according to claim 1, characterized in that the XML document tree of step 1 is defined as follows:
Definition 1 (XML document tree). An XML document tree is represented as a 6-tuple $T = (V, v_0, E, \Sigma, P, lab)$, where:
1) $V$ is the set of all nodes in the document tree;
2) $v_0$ is the root node of the document tree;
3) $E_a$ defines the set of parent-child constraints, $E_a = \{(u, v) \mid u \in V \wedge v \in V,\ u \text{ is the parent node of } v\}$; $E_s$ defines the set of sibling constraints, $E_s = \{(u, v) \mid u \in V \wedge v \in V,\ v \text{ is the right sibling node of } u\}$; $E$ denotes the constraint set, i.e. $E = E_a \cup E_s$;
4) $\Sigma$ is the set of node labels in the document tree;
5) $P_A$ defines the ancestor path constraints, $P_A = \{(v_0, v_1, \ldots, v_n) \mid (v_i, v_{i+1}) \in E_a,\ 0 \le i < n\} \cup \{v_0\}$; $P_S$ defines the child path constraints, $P_S = \{(v_1, \ldots, v_n) \mid (v_i, v_{i+1}) \in E_s,\ 0 < i < n,\ v_1 \text{ and } v_n \text{ are the first and last child nodes of their parent node}\} \cup \{v_1 \mid v_1 \text{ is the only child node of its parent node}\}$; $P$ denotes the set of path constraints, i.e. $P = P_A \cup P_S$, $P \subseteq V \cup V^2 \cup \cdots \cup V^{|V|}$;
6) the function $lab$ returns the label of a node, i.e. for $v \in V$, $lab(v) \in \Sigma$.
3. The method according to claim 1, characterized in that the BPC of a node in step 2 is defined as:
Definition 2 (BPC of a node). $P_A(e)$ defines the ancestor path constraint of node $e$, $P_A(e) = (v_0, v_1, \ldots, e) \in P_A$; $P_S(e)$ defines the child path constraint of node $e$, $P_S(e) = (u_1, \ldots, u_n) \in P_S$ with $(e, u_i) \in E_a$; $cons(e)$ defines the BPC of the node, $cons(e) = (P_A(e), P_S(e))$, $e \in V$; for a leaf node of the document tree, $P_S(e)$ is empty and is denoted by $\varepsilon$.
4. The method according to claim 1, characterized in that the method of step 3 for calculating the similarity between two path constraints using the N-Gram-based partition scheme is:
Let k be the number of distinct node labels that occur in the two path constraints to be compared; these k node labels are sorted in dictionary order, and each node label is then mapped, in order, to a positive integer in [1, k]; a node label given as a string is thus converted into a number, and identical label names receive the same number; a path constraint therefore finally takes the form of an ordered integer array;
Definition 3 (N-Gram-based partition): an integer array of length n is partitioned into n sub-arrays, where the i-th sub-array ($0 < i \le n$) stores the extracted i-Gram items and is called the i-Gram array; it contains n-i+1 items, each generated from i consecutive items $(a_1, a_2, \ldots, a_i)$ of the original integer array as follows:
$\text{i-GramItem} = a_1 \times (k+1)^{i-1} + a_2 \times (k+1)^{i-2} + \cdots + a_i \times (k+1)^{0}$
The base k+1 is introduced to guarantee the uniqueness of the items of each sub-array; the 1-Gram array has n items, the 2-Gram array has n-1 items, ..., the (n-1)-Gram array has 2 items, and the n-Gram array has 1 item, so all sub-arrays together contain $n(n+1)/2$ items; to simplify later processing, the n sub-arrays are stored one after another in a single array of length $n(n+1)/2$;
The two path constraints to be compared are converted into integer arrays by the label mapping, with lengths n and m respectively; they are either both ancestor path constraints or both child path constraints of two nodes; according to Definition 3, each is decomposed in turn into its 1-Gram array, 2-Gram array, ..., and min(n, m)-Gram array;
Definition 4 (number of identical items C of two one-dimensional arrays): the arrays are regarded as sets, and the intersection of the two sets gives the number of identical items C;
Definition 5 (path constraint similarity): according to the definitions above, the path constraint similarity is given by:
$Sim(p_1, p_2) = \dfrac{C}{t(t+1)/2} = \dfrac{2C}{t(t+1)}$, where $t = \max(n, m)$ and $p_1, p_2 \in P_A$ or $p_1, p_2 \in P_S$;
Definition 6 (BPC similarity): let α be the influence factor of the ancestor path constraint; then 1-α is the influence factor of the child path constraint, with 0 ≤ α ≤ 1, and the BPC similarity is given by:
$Sim(cons(e), cons(e_0)) = \alpha \times Sim(P_A(e), P_A(e_0)) + (1-\alpha) \times Sim(P_S(e), P_S(e_0))$.
5. The method according to claim 1, characterized in that the method of step 5 for taking the weighted sum of all node similarities as the similarity of the two documents is:
Definition 7 (document similarity): given two XML documents $D_1$ and $D_2$ with n and m nodes respectively, the BPC similarity between the BPC of each node of $D_1$ and the BPC of each node of $D_2$ is calculated according to Definition 6 to form a similarity matrix; for each node of $D_1$, the similarity value of the most similar node of $D_2$ is selected, and the document similarity is then given by:
$Sim(D_1, D_2) = \sum_{i=1}^{n} w(v_i) \max_{j=1}^{m}(s_{ij}) \Big/ \sum_{i=1}^{n} w(v_i)$, where $s_{ij} = Sim(cons(v_i), cons(v_j))$, $1 \le i \le n$, $1 \le j \le m$;
In the XML document tree, the closer a node is to the root, the larger its influence on the document structure; the weight $w(v_i) = 2^{-lev(v_i)}$ is introduced to describe the different influence of different node depths, where $lev(v_i)$ is the level of node $v_i$ and the level of the root node is 0.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN2009102449033A | 2009-12-18 | 2009-12-18 | Method for calculating similarity of XML documents

Publications (1)

Publication Number | Publication Date
CN101876995A | 2010-11-03

Family

ID=43019553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102449033A Pending CN101876995A (en) 2009-12-18 2009-12-18 Method for calculating similarity of XML documents

Country Status (1)

Country Link
CN (1) CN101876995A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043848A (en) * 2010-12-20 2011-05-04 北京大学 XML document tree example query method
US9405750B2 (en) 2011-10-31 2016-08-02 Hewlett-Packard Development Company, L.P. Discrete wavelet transform method for document structure similarity
WO2013063734A1 (en) * 2011-10-31 2013-05-10 Hewlett-Packard Development Company, L.P. Determining document structure similarity using discrete wavelet transformation
CN102622432A (en) * 2012-02-27 2012-08-01 北京工业大学 Measuring method of similarity between extensible markup language (XML) file structure outlines
CN102622432B (en) * 2012-02-27 2013-07-31 北京工业大学 Measuring method of similarity between extensible markup language (XML) file structure outlines
CN102722556A (en) * 2012-05-29 2012-10-10 清华大学 Model comparison method based on similarity measurement
CN102722556B (en) * 2012-05-29 2014-10-22 清华大学 Model comparison method based on similarity measurement
CN102799680A (en) * 2012-07-24 2012-11-28 华北电力大学(保定) XML (extensible markup language) document spectrum clustering method based on affinity propagation
CN104750609A (en) * 2015-03-26 2015-07-01 广东欧珀移动通信有限公司 Method and device for determining interface layout compatibility degree
CN104750609B (en) * 2015-03-26 2018-01-19 广东欧珀移动通信有限公司 Determine the method and device of interface layout compatibility
CN106933824B (en) * 2015-12-29 2021-01-01 伊姆西Ip控股有限责任公司 Method and device for determining document set similar to target document in multiple documents
CN106933824A (en) * 2015-12-29 2017-07-07 伊姆西公司 Method and device for determining document set similar to target document in multiple documents
CN109885657A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 Text similarity calculation method and device and storage medium
CN109885657B (en) * 2019-02-18 2021-04-27 武汉瓯越网视有限公司 Text similarity calculation method and device and storage medium
CN111381188A (en) * 2020-03-18 2020-07-07 华中科技大学 Bridge arm open-circuit fault diagnosis method for two-level three-phase voltage source inverter
CN111815175A (en) * 2020-07-08 2020-10-23 睿智合创(北京)科技有限公司 Five-layer structure XML language interactive application method in nested list form
CN112364604A (en) * 2020-10-26 2021-02-12 南京工程学院 XML document digitization method and system
CN117610536B (en) * 2024-01-23 2024-04-09 南京邮电大学 Automatic judgment method and system for Office operation questions based on XML document similarity
CN117610536A (en) * 2024-01-23 2024-02-27 南京邮电大学 Automatic judgment method and system for Office operation questions based on XML document similarity

Similar Documents

Publication Publication Date Title
CN101876995A (en) Method for calculating similarity of XML documents
CN109284352B (en) Query method for evaluating indefinite-length words and sentences of class documents based on inverted index
CN106294593B (en) Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning
Embley et al. Table-processing paradigms: a research survey
CN101079024B (en) Special word list dynamic generation system and method
CN101866337B (en) Part-of-speech tagging system, and device and method thereof for training part-of-speech tagging model
CN1701323B (en) Digital ink database searching using handwriting feature synthesis
CN103500160B (en) Syntactic analysis method based on sliding semantic string matching
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN107122413A (en) Keyword extraction method and device based on graph model
CN101079025B (en) File correlation computing system and method
CN106156272A (en) Information retrieval method based on multi-source semantic analysis
CN106528583A (en) Method for extracting and comparing web page main body
CN109145260A (en) Text information extraction method
CN105653522B (en) Non-taxonomic relation recognition method for the plant domain
CN105677638B (en) Web information abstracting method
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN106257441A (en) Training method for skip language model based on word frequency
CN102063424A (en) Method for Chinese word segmentation
CN105975475A (en) Chinese phrase string-based fine-grained thematic information extraction method
CN102063493A (en) Content extraction method based on regular expression group and control logic
CN110427488A (en) Document processing method and device
CN112925901A (en) Evaluation resource recommendation method for assisting online questionnaire evaluation and application thereof
CN116719913A (en) Medical question-answering system based on improved named entity recognition and construction method thereof
CN102360436B (en) Identification method for on-line handwritten Tibetan characters based on components

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20101103