CN102622432A

CN102622432A - Method for measuring similarity between Extensible Markup Language (XML) document structural outlines

Info

Publication number: CN102622432A
Application number: CN2012100484439A
Authority: CN
Inventors: 高明霞
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2012-02-27
Filing date: 2012-02-27
Publication date: 2012-08-01
Anticipated expiration: 2032-02-27
Also published as: CN102622432B

Abstract

The invention relates to the technical field of data mining, and in particular to a method for measuring the similarity between Extensible Markup Language (XML) document structural outlines. It aims at fast online clustering of XML data streams from a structural point of view while meeting the strict memory and time requirements of such algorithms. A structural outline of the XML document is also provided. The algorithm parses the XML document with SAX, uses a global name-code index table and a progressive stack technique to turn the document into an outline data structure that can be represented incrementally, the element chain (NodeList), and then calculates the similarity between two element chains with a user-defined formula. Because the method parses the XML document with SAX and obtains layer values with the progressive stack technique, memory consumption while building the structural outline is small. Nearly all memory is spent on storing the clustering results, which are in element-chain form, and the global name-code index table.

Description

A method for measuring similarity between XML document structural outlines

Technical field

The present invention relates to the field of data mining, and specifically to a method for measuring the similarity between structural outlines, suitable for XML document streams or collections that place high demands on memory and time.

Background technology

XML is a self-describing language for data exchange and sharing. Since becoming a W3C Recommendation in February 1998, it has been widely used. Web applications and services that follow this standard produce, during real-time data transmission and exchange, massive streaming XML data that changes continuously over time. Examples include the medical records and examination data of medical institutions, the passenger information of airline agencies, the user requests faced by search services, and the network resources that network monitoring must process. To analyze such data effectively, one possible solution is to cluster similar data according to document structure, content, and semantics.

Existing XML data clustering techniques support only static XML data sets. Their core idea is to treat each XML document as a data point, compute the distance or dissimilarity matrix between the documents to be clustered with a chosen XML document similarity measure, and then complete the clustering task with a traditional clustering technique such as k-means or hierarchical clustering. The XML document similarity measure used is the key to clustering quality. Existing XML document similarity measures fall roughly into two categories: methods based on tree edit distance and methods based on document feature sets.

Methods based on tree edit distance usually model an XML document as a tree or a graph, and the similarity between two XML documents is measured by the edit distance between the two trees or graphs. The basic idea of edit distance is to define the distance between two trees as the minimum cost of converting one tree into the other using edit operations such as deletion, insertion, and pruning. These methods only consider whether nodes differ; they do not account for the fact that nodes in different layers affect similarity differently.

Methods based on document feature sets are more direct: they first propose some way of representing XML document features, and then measure the similarity between XML documents by directly computing the distance between these features. The specific features vary widely: representing XML document features with bitmap index techniques, representing them with the vector space model (VSM), representing XML document structure as a time series, representing document structure features with the layer structure (LevelStructure) corresponding to a simplified labeled tree, and so on. These methods are mainly used for static data and generally require reading and parsing documents multiple times, whereas an XML data stream can be accessed and parsed only once, in the order in which the data arrives.

Recent research on XML data has also extended to XML data streams, but the prior art concentrates mostly on processing and directly querying such streams. For example, Christoph Koch and Stefanie Scherzinger introduced XML Stream Attribute Grammars (XSAGs), a language for continuous queries over XML, and Koch et al. proposed FluXQuery, an optimizing XQuery engine. Further knowledge mining, such as online classification and online clustering, is seldom addressed.

Summary of the invention

The object of the invention is to provide a structural outline of an XML document and a method for measuring the similarity between such structural outlines, so that XML data streams can be clustered quickly online from a structural point of view while satisfying the strict memory and time requirements of this class of algorithm. After parsing the XML document with SAX, the algorithm uses a global name-code index table and the progressive stack technique to convert the document into an outline data structure that can be represented incrementally, the element chain (NodeList), and then calculates the similarity between two element chains with a user-defined formula.

The technique provided by the invention for building the structural outline of an XML document has the following concrete steps:

1) Define a global element name-code index table for the XML document stream (or document set) to be processed, and initialize the table as empty. Each node in the table has two parts: a string holding the name of a distinct element contained in the XML document stream (or document set), and an integer holding the integer code of that element. The coding rule is as follows: when the XML document is parsed with SAX, the integer is the order in which the element's start event first appears in the stream of start events of all distinct elements (only element start events are recorded).

2) Parse the XML document with SAX to obtain the start event of each element, and look the element name up in the global element name-code index table. If the element name is in the table, the code of the element is the integer associated with that name. If the element name is not in the table, the code of the element is the largest integer already in the table plus one, and the element name and its integer code are inserted into the global element name-code index table as a new node.

3) Obtain the layer value of each element with the progressive stack technique. Concretely: while the XML document is parsed with SAX, the document start event activates an empty stack, and push and pop operations follow the elements as they arrive in the XML document; that is, an element's start event and end event correspond to a push and a pop, respectively. The layer value of an element equals the value of its stack pointer; the pointer starts from 0 and increases by one each time.

4) Using the integer codes of the distinct elements and their corresponding layer values, create the XML document structural outline as a partially ordered element chain that can be represented incrementally.

5) The element chain is indexed by the integer codes of the elements and is composable; the combined result must keep only one copy of duplicate elements that have the same name and the same layer. The combination process is as follows: given two element chains a and b, start from the heads of the lists and compare the codes of the first nodes of the two chains. If a = b, go on to compare the layer values of the first nodes; if the layer values are also equal, insert the first node of a into the result chain, otherwise insert the first nodes of both a and b into the result chain; then continue by comparing the next nodes of the two lists. If the codes compare as a > b, insert the first node of chain b into the result chain, and continue by comparing the first node of a with the next node of b. If the codes compare as a < b, insert the first node of chain a into the result chain, and continue by comparing the first node of b with the next node of a.

6) Compare two partially ordered element chains to obtain their common elements and the corresponding layer values. The comparison process is as follows: given two element chains a and b, start comparing from the heads of the lists, with the node as the basic unit of movement. If the element code in a is less than or equal to the element code in b, a moves to its next node; otherwise b moves to its next node, and the comparison continues. During the comparison, the equal element codes and their corresponding layer values are recorded for calculating the similarity between the element chains.
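Steps 1) to 4) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the inventors' implementation: the names `OutlineHandler` and `build_element_chain` are hypothetical, the element chain is assumed to be stored as a sorted list of distinct (code, layer) pairs, and the two sample documents are invented for illustration. The shared dictionary plays the role of the global element name-code index table, and an explicit list serves as the stack.

```python
import xml.sax


class OutlineHandler(xml.sax.ContentHandler):
    """Collects (code, layer) pairs from SAX start/end events (hypothetical helper)."""

    def __init__(self, name_codes):
        super().__init__()
        self.name_codes = name_codes  # global element name -> integer code table
        self.stack = []               # stack of codes for currently open elements
        self.nodes = set()            # distinct (code, layer) pairs seen so far

    def startElement(self, name, attrs):
        code = self.name_codes.get(name)
        if code is None:
            # New name: largest existing code plus one (codes are 1..n).
            code = len(self.name_codes) + 1
            self.name_codes[name] = code
        self.stack.append(code)
        layer = len(self.stack) - 1   # stack pointer, starting at 0 for the root
        self.nodes.add((code, layer))

    def endElement(self, name):
        self.stack.pop()


def build_element_chain(xml_text, name_codes):
    """Parse one document and return its chain sorted by (code, layer)."""
    handler = OutlineHandler(name_codes)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return sorted(handler.nodes)


# Invented sample documents; the table is shared across the whole stream.
doc1 = ("<doc><movie><title>t</title><cast><actor><name>n</name>"
        "</actor></cast></movie></doc>")
doc2 = "<doc><movie><year>1999</year></movie></doc>"
table = {}
chain1 = build_element_chain(doc1, table)
chain2 = build_element_chain(doc2, table)
```

Because the table is shared, an element name that first appears in a later document (here `year`) simply receives the next unused code, which is what allows the outline to be built incrementally over a stream.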

The weighted formula defined by the invention is used to calculate the similarity between two element chains (NodeSim):

{NodeSim}_{1 \leftrightarrow 2} = \frac{{ComWeight}_{1} + {ComWeight}_{2}}{{ObjWeight}_{1} + {ObjWeight}_{2}}

= \frac{\sum_{i = 1}^{M_{1}} {(1 / r)}^{L_{1}^{i}} + \sum_{j = 1}^{M_{2}} {(1 / r)}^{L_{2}^{j}}}{\sum_{k = 1}^{N_{1}} {(1 / r)}^{L_{1}^{k}} + \sum_{k = 1}^{N_{2}} {(1 / r)}^{L_{2}^{k}}}

Here, ComWeight₁ and ComWeight₂ are the accumulated weights of the common elements contained in the first and second element chains, respectively; ObjWeight₁ and ObjWeight₂ are the accumulated weights of all elements contained in the first and second element chains, respectively; N₁ and N₂ are the numbers of elements in the first and second element chains; M₁ and M₂ are the numbers of common elements in the first and second element chains; L_{1}^{i} is the layer number of the i-th common element of the first element chain, and L_{2}^{j} is the layer number of the j-th common element of the second element chain; L_{1}^{k} and L_{2}^{k} are the layer numbers of the k-th element of the first and second element chains, respectively; r is the decrement factor of element weight across layers, designed as a user-defined parameter whose value is greater than 1. According to experimental results, commonly used values of r are 2 and 4.
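The NodeSim formula can be sketched as a small Python function. This is an illustrative sketch only: the function names `chain_weight` and `node_sim`, the representation of a chain as a list of (code, layer) pairs, and the sample chains are assumptions, not part of the original text.

```python
def chain_weight(chain, r=2.0):
    """Sum of (1/r)**layer over the (code, layer) pairs of a chain."""
    return sum((1.0 / r) ** layer for _code, layer in chain)


def node_sim(common_1, common_2, chain_1, chain_2, r=2.0):
    """NodeSim = (ComWeight1 + ComWeight2) / (ObjWeight1 + ObjWeight2)."""
    if r <= 1:
        raise ValueError("the decrement factor r must be greater than 1")
    obj_weight = chain_weight(chain_1, r) + chain_weight(chain_2, r)
    if obj_weight == 0:
        return 0.0  # two empty chains: defined as 0 here by assumption
    com_weight = chain_weight(common_1, r) + chain_weight(common_2, r)
    return com_weight / obj_weight


# Invented example: the chains share only the root element (code 1, layer 0).
chain_a = [(1, 0), (2, 1)]
chain_b = [(1, 0), (3, 1)]
sim = node_sim([(1, 0)], [(1, 0)], chain_a, chain_b, r=2.0)
```

With r = 2 each layer counts half as much as the layer above it, so disagreement near the root lowers the similarity more than disagreement deep in the document.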

The invention parses the XML document with SAX and obtains layer values with the progressive stack technique, so memory consumption while building the structural outline is very small. Nearly all memory is spent on storing the clustering results, which are in element-chain form, and the global name-code index table.

Description of drawings

Fig. 1-(a) Example of an XML document parsed in SAX format

Fig. 1-(b) Example of obtaining layer values with the progressive stack

Fig. 1-(c) Global element name-code index table

Fig. 1-(d) Element chain corresponding to the XML document

Fig. 2 Example of combining two element chains

Fig. 3 Example of comparing two element chains to obtain the common elements

Fig. 4 Common elements obtained after the comparison

Fig. 5-(a) Memory consumption as a function of the number of documents

Fig. 5-(b) Time cost as a function of the number of documents

Embodiment

The invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The XML document in the following example can be a concrete individual sample from an online XML document stream, and the whole process is assumed to start from the first document to be processed. The concrete process of building the element chain corresponding to an XML document and comparing chains to obtain the common elements is as follows:

1) Define a global element name-code index table for the XML document stream (or document set) to be processed, and initialize the table as empty.

As shown in Fig. 1-(a), parsing an XML document fragment with SAX yields a stream of tags, and the start event of each distinct element is marked with an integer giving its order in the stream of element start events. For example, the start event of the first element name, "W4F-DOC", is the first to appear, so it is marked with the integer "1".

2) Following the parsing format, obtain an integer code for the start event of each element. Fig. 1-(c) shows the state when the fifth element is processed after four distinct elements have been processed. The element name "LastName" is first looked up in the global name-code index table; at this point the name is not in the table, so the code of this element is the largest integer already in the table, "4", plus one, and the element name and its integer code are inserted into the global element name-code index table as a new node.

3) Fig. 1-(b) is an example of obtaining the layer value of an element with the progressive stack technique. As shown in the figure, the start event of the fourth element pushes the integer code "4" onto the stack; the corresponding stack pointer is then 3, so the layer value of this element is "3". When the end event of this element occurs, the integer code is popped.

4) Following steps 2) and 3), the integer code and layer value of each distinct element are obtained, and the partially ordered element chain shown in Fig. 1-(d) is created as the structural outline of the XML document in Fig. 1-(a).

5) The element chain is indexed by the integer codes of the elements and is composable; the combined result must keep only one copy of duplicate elements that have the same name and the same layer. Fig. 2 shows the concrete combination process for two element chains a and b. As shown in the figure, the first nodes of a and b have equal codes and equal layer values, so only the first node of a is inserted into the result chain; the third nodes have equal codes but different layer values, so the third nodes of both a and b are inserted into the result chain.

6) To calculate the similarity (NodeSim) of two element chains, the two partially ordered element chains must be compared to obtain their common elements and the corresponding layer values. Fig. 3 is an example of the concrete comparison process for two given element chains a and c. Comparison starts from the heads of the lists. The codes of the first nodes are equal, so a moves to its next node: as the figure shows, after the first comparison a has moved to its second node, which is then compared with the first node of b. The code of the second node of a is greater than the code of the first node of b, so b moves to its second node, and the comparison continues. After this comparison, the common elements of the two element chains are obtained, marked with thick lines in Fig. 4.
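The combination in step 5) and the comparison in step 6) can be sketched in Python as below. This is a sketch under stated assumptions rather than the patented procedure verbatim: chains are assumed to be lists of (code, layer) pairs sorted by code and then layer, the names `merge_chains` and `common_elements` are hypothetical, and the sample chains are invented. `merge_chains` compares whole (code, layer) pairs, which gives the same result as the described procedure when each code occurs once per chain; `common_elements` follows the stated rule (advance a when its code is less than or equal to b's, otherwise advance b) and records a node of b at most once.

```python
def merge_chains(a, b):
    """Combine two sorted chains, keeping one copy of same-code, same-layer nodes."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:            # same code and same layer: keep a's copy only
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:           # smaller code (or same code, smaller layer)
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])            # assumption: leftover nodes are appended as-is
    result.extend(b[j:])
    return result


def common_elements(a, b):
    """Two-pointer walk returning the common nodes of a and of b (matched by code)."""
    common_a, common_b = [], []
    i = j = 0
    last_j = -1                     # index of the last node of b already recorded
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            common_a.append(a[i])
            if j != last_j:
                common_b.append(b[j])
                last_j = j
        if a[i][0] <= b[j][0]:
            i += 1
        else:
            j += 1
    return common_a, common_b


# Invented chains: code 3 appears in both, but at different layers.
a = [(1, 0), (2, 1), (3, 2)]
b = [(1, 0), (3, 3), (4, 2)]
merged = merge_chains(a, b)
shared_a, shared_b = common_elements(a, b)
```

Each loop iteration advances at least one pointer, so both functions finish in time linear in the combined length of the two chains.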

Once the common elements have been obtained, the weighted formula defined by the invention can be used to calculate the similarity between the two element chains. Suppose the weight decrement factor is set to r = 2, so that (1/r) = 0.5. From Fig. 4, the number of elements in element chain a is N₁ = 8 and the number of elements in element chain c is N₂ = 9; the numbers of common elements in chains a and c are equal, M₁ = M₂ = 8. L_{1}^{i} is the layer number of the i-th common element in element chain a, and L_{2}^{j} is the layer number of the j-th common element in element chain c; L_{1}^{k} and L_{2}^{k} are the layer numbers of the k-th element in element chain a and element chain c, respectively. The similarity between the two element chains in Fig. 4 is then calculated as follows:

{NodeSim}_{1 \leftrightarrow 2} = \frac{{ComWeight}_{1} + {ComWeight}_{2}}{{ObjWeight}_{1} + {ObjWeight}_{2}}

= \frac{\sum_{i = 1}^{M_{1}} {(1 / r)}^{L_{1}^{i}} + \sum_{j = 1}^{M_{2}} {(1 / r)}^{L_{2}^{j}}}{\sum_{k = 1}^{N_{1}} {(1 / r)}^{L_{1}^{k}} + \sum_{k = 1}^{N_{2}} {(1 / r)}^{L_{2}^{k}}}

= \frac{({0.5}^{0} + {0.5}^{1} + 2 \times {0.5}^{2} + 2 \times {0.5}^{3} + 2 \times {0.5}^{4}) + ({0.5}^{0} + {0.5}^{1} + 2 \times {0.5}^{2} + 2 \times {0.5}^{3} + 2 \times {0.5}^{4})}{({0.5}^{0} + {0.5}^{1} + 2 \times {0.5}^{2} + 2 \times {0.5}^{3} + 2 \times {0.5}^{4}) + ({0.5}^{0} + {0.5}^{1} + 2 \times {0.5}^{2} + 3 \times {0.5}^{3} + 2 \times {0.5}^{4})}

= 0.974
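The arithmetic above can be checked with a few lines of Python. The layer lists below are read off the terms of the expanded fraction (one element at layer 0, one at layer 1, two at layer 2, two or three at layer 3, two at layer 4); the variable names are illustrative assumptions.

```python
r = 2

# Layers of all elements in chains a and c, taken from the denominator terms.
levels_a = [0, 1, 2, 2, 3, 3, 4, 4]        # N1 = 8
levels_c = [0, 1, 2, 2, 3, 3, 3, 4, 4]     # N2 = 9

# Layers of the common elements, taken from the numerator terms (M1 = M2 = 8).
common_a = [0, 1, 2, 2, 3, 3, 4, 4]
common_c = [0, 1, 2, 2, 3, 3, 4, 4]


def weight(levels):
    """Accumulated weight: sum of (1/r)**layer."""
    return sum((1 / r) ** layer for layer in levels)


com_weight = weight(common_a) + weight(common_c)   # 2.375 + 2.375 = 4.75
obj_weight = weight(levels_a) + weight(levels_c)   # 2.375 + 2.5   = 4.875
node_sim = com_weight / obj_weight
print(round(node_sim, 3))  # prints 0.974
```

The only difference between numerator and denominator is the one extra layer-3 element of chain c, which contributes 0.5^3 = 0.125 to the denominator.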

To evaluate the time and memory performance of the proposed method for measuring similarity between XML document structural outlines, a series of clustering experiments was carried out in which similarity was measured with this method and clustering was then performed with a traditional partitioning method. Under the same conditions, these were compared with the XCLS algorithm, a recent approach that assesses similarity through the layer structure (LevelStructure) and then clusters with a partitioning method. The experimental design and final results are stated below.

Experimental conditions: a Pentium IV PC with 2.4G memory; the programs were implemented in Java; the user-defined parameters were set identically for both methods. Specifically, the weight decrement factor was set to r = 2, so that 1/r = 0.5, the minimum similarity threshold was 0.8, and the maximum number of clusters was 130.

Experimental data: the experiment used a simulated data set of 10419 documents. The data were generated randomly with the XML tool oXygen XML Editor from mature XML Schemas used in traditional industries such as civil aviation and network applications. Document sizes range from several KB to several hundred KB.

Experimental results: Fig. 5 shows the results of one group of comparative experiments. Although both clustering methods use a specific XML document structural outline and a corresponding similarity measure, the results show that the invention is clearly superior to the XCLS algorithm in both time cost and memory consumption. From an analysis of how the two measures operate, the advantages of the invention in time cost and memory consumption can be summarized as follows:

(1) The structural outline of the invention is indexed by ordered integers, so the worst-case time complexity of obtaining the common elements by comparison is O(max{N₁, N₂}), whereas the worst-case time complexity of obtaining the common elements when similarity is assessed through the layer structure (LevelStructure) is O(N₁ × N₂).

(2) Regarding memory consumption, the invention parses the XML document with SAX and obtains layer values with the progressive stack technique, so memory consumption while building the structural outline is very small. Nearly all memory is spent on storing the clustering results in element-chain form and the global name-code index table. The layer structure used in the XCLS algorithm, by contrast, is based on a simplified labeled tree, which can be regarded as a simplified form of the DOM tree; building the outline structure therefore requires building a labeled tree for each document, and when a document is large the memory consumption of this work is large. Although the memory needed to store clustering results in layer-structure form is roughly the same as that needed to store them in element-chain form, the memory consumed during construction differs greatly, so the final results also differ significantly.

Claims

1. A method for measuring the similarity between XML document structural outlines, characterized by the following steps:

1) define a global element name-code index table for the XML document stream or document set to be processed, and initialize the table as empty; each node in the table has two parts: a string holding the name of a distinct element contained in the XML document stream or document set, and an integer holding the integer code of that element; the coding rule is as follows: when the XML document is parsed with SAX, the integer is the order in which the element's start event first appears in the stream of start events of all distinct elements;

2) parse the XML document with SAX to obtain the start event of each element, and look the element name up in the global element name-code index table; if the element name is in the table, the code of the element is the integer associated with that name; if the element name is not in the table, the code of the element is the largest integer already in the table plus one, and the element name and its integer code are inserted into the global element name-code index table as a new node;

3) obtain the layer value of each element with the progressive stack technique; concretely: while the XML document is parsed with SAX, the document start event activates an empty stack, and push and pop operations follow the elements as they arrive in the XML document, that is, an element's start event and end event correspond to a push and a pop, respectively; the layer value of an element equals the value of its stack pointer;

4) using the integer codes of the distinct elements and their layer values, create the XML document structural outline as a partially ordered element chain that can be represented incrementally;

5) the element chain is indexed by the integer codes of the elements and is composable, and the combined result must keep only one copy of duplicate elements that have the same name and the same layer; the combination process is as follows: given two element chains a and b, start from the heads of the lists and compare the codes of the first nodes of the two chains; if a = b, go on to compare the layer values of the first nodes; if the layer values are also equal, insert the first node of a into the result chain, otherwise insert the first nodes of both a and b into the result chain, then continue by comparing the next nodes of the two lists; if the codes compare as a > b, insert the first node of chain b into the result chain, and continue by comparing the first node of a with the next node of b; if the codes compare as a < b, insert the first node of chain a into the result chain, and continue by comparing the first node of b with the next node of a;

6) compare two partially ordered element chains to obtain their common elements and the corresponding layer values; the comparison process is as follows: given two element chains a and b, start comparing from the heads of the lists, with the node as the basic unit of movement; if the element code in a is less than or equal to the element code in b, a moves to its next node; otherwise b moves to its next node, and the comparison continues; during the comparison, the equal element codes and their corresponding layer values are recorded for calculating the similarity between the element chains;

{NodeSim}_{1 \leftrightarrow 2} = \frac{{ComWeight}_{1} + {ComWeight}_{2}}{{ObjWeight}_{1} + {ObjWeight}_{2}}

= \frac{\sum_{i = 1}^{M_{1}} {(1 / r)}^{L_{1}^{i}} + \sum_{j = 1}^{M_{2}} {(1 / r)}^{L_{2}^{j}}}{\sum_{k = 1}^{N_{1}} {(1 / r)}^{L_{1}^{k}} + \sum_{k = 1}^{N_{2}} {(1 / r)}^{L_{2}^{k}}}

where L_{1}^{i} is the layer number of the i-th common element of the first element chain; L_{1}^{k} and L_{2}^{k} are the layer numbers of the k-th element of the first and second element chains, respectively; r is the weight decrement factor, and its value is greater than 1.