CN102521325B

CN102521325B - XML (Extensible Markup Language) structural similarity measuring method based on frequent associated tag sequences

Info

Publication number: CN102521325B
Application number: CN 201110398187
Authority: CN
Inventors: 张利军; 李战怀; 陈群; 李霞
Original assignee: Northwestern Polytechnical University
Current assignee: Shanghai Di'an Technology Co ltd
Priority date: 2011-12-02
Filing date: 2011-12-02
Publication date: 2013-04-24
Anticipated expiration: 2031-12-02
Also published as: CN102521325A

Abstract

The invention discloses an XML (Extensible Markup Language) structural similarity measuring method based on frequent associated tag sequences. The method comprises the following steps: parsing an XML document set C to obtain a tag sequence database (TSDB); mining the set of all frequent tag sequences (FTS) from the TSDB; selecting the set of maximal frequent tag sequences (MFTS) from the FTS; converting the TSDB into a new database TSDB'; mining the set of closed frequent associated tag sequences from TSDB'; representing each document in TSDB' as the set of closed frequent associated tag sequences that it contains; and calculating the structural similarity between any two documents in the document set C. The method can improve the accuracy of clustering results.

Description

XML structural similarity measuring method based on frequent associated tag sequences

Technical field

The invention belongs to the field of data management and relates to a method for measuring the structural similarity of XML documents, in particular to a method that measures XML document structural similarity using, as features, frequent associated tag sequences mined from an XML document set.

Background technology

XML is the de facto standard for representing and exchanging data on the Internet and is widely used. As the number of XML documents keeps growing, storing, filtering, retrieving and managing XML data effectively is becoming increasingly important in the database and information retrieval fields. Many XML processing tasks need to measure the similarity between XML documents, so XML document similarity measurement has become a basic problem of many XML processing techniques and is applied in many areas, such as semi-structured data integration, XML document classification and clustering, and XML retrieval.

Unlike a traditional text document, an XML document contains hierarchical structure in addition to content. How to use this structural information to calculate the structural similarity between XML documents is a key issue in XML similarity computation, and researchers have proposed many different methods for it. Some of them are path-based: they represent the structure of an XML document as a set of paths and then use set or vector operations to calculate the structural similarity between documents.

For example, consider the three XML documents in Fig. 1. In the bag of paths model (called the BOTP model in this specification) proposed in document [1] "Joshi, S., Agrawal, N., Krishnapuram, R., Negi, S.: A Bag of Paths Model for Measuring Structural Similarity in Web Documents. In: Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining (SIGKDD). (2003) 577-582", the structure of a document is represented as a set of paths, where a path is the sequence of nodes from the root node to a leaf node in the corresponding DOM tree. Under this model the three documents in Fig. 1 are represented as listed in the BOTP column of Table 1. It can be seen that the paths "a/b/c" and "a/b", and "a/e/f/g" and "a/h/f/g", in doc1 and doc2 are all regarded as completely different paths. In fact, both pairs of paths partially match and are similar to a large extent. In addition, although the bag of paths model preserves the parent-child relationships between nodes, it ignores their sibling relationships and treats paths as mutually independent. For example, the paths "a/b/c" and "a/d" in doc1 and doc3 are considered independent, although they are in fact siblings and frequently appear together in the same document. Document [1] also proposes another bag of paths model based on XPath (called the BOXP model). This model captures the sibling relationships between some nodes, but not all of them.

Document [2] "Leung, H.P., Chung, F.L., Chan, S.C., Luk, R.: XML Document Clustering Using Common XPath. In: Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration. (2005) 91-96" mines frequent XPaths, called commonXPaths, from the document set and then represents each XML document as a vector of commonXPaths. For example, with a minimum support of 60%, the three documents in Fig. 1 are represented under this model as shown in the third column of Table 1. Although the paths "a/e/f/g" in doc1 and "a/h/f/g" in doc2 are considered similar through the commonXPath "a/*/f/g", the path "a/f/g" in doc3 is still considered dissimilar. In fact, the three paths "a/e/f/g", "a/h/f/g" and "a/f/g" are all very similar. Moreover, when computing similarity from the vectors, document [2] likewise treats paths as independent. For example, all three documents contain the paths "a/b" and "a/d", which are siblings, but document [2] ignores this relationship.

Document [3] "Rafiei, D., Moise, D.L., Sun, D.: Finding Syntactic Similarities Between XML Documents. In: Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA). (2006) 512-516" uses as features not only the full paths from the root node to the leaf nodes but also the subpaths of these full paths. The three documents in Fig. 1 are represented under this model as shown in the fourth column of Table 1. This method still ignores the sibling relationships between nodes when calculating similarity.

Table 1. Different path representations of the XML documents

In summary, existing path-based methods for calculating XML document structural similarity have the following two problems:

1. They cannot handle partial matches between paths well. In the example above, the similarity among "a/e/f/g", "a/h/f/g" and "a/f/g" is not captured.

2. Although they capture the parent-child or ancestor-descendant relationships between nodes, they partly or entirely ignore the sibling relationships between nodes. In the example above, the paths "a/b" and "a/d" are considered independent.

Because these methods do not make full use of the information contained in XML documents, the structural similarity they compute is not accurate enough, and some accuracy is lost when they are applied to XML document clustering or classification.
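To make the first problem concrete, the following minimal sketch compares two documents as plain bags of root-to-leaf paths. The path sets are hypothetical, loosely modeled on the paths quoted above (the actual contents of Fig. 1 and Table 1 are not reproduced in this text), and the Jaccard ratio is only an assumed stand-in for whichever set operation a given path-based method uses.

```python
def jaccard(paths_a, paths_b):
    """Ratio of shared paths to all distinct paths (assumed set-based measure)."""
    return len(paths_a & paths_b) / len(paths_a | paths_b)


# Hypothetical path bags, not the real contents of Fig. 1.
doc1 = {"a/b/c", "a/e/f/g", "a/d"}
doc2 = {"a/b", "a/h/f/g", "a/d"}

# Only "a/d" matches exactly; the near-matches "a/b/c" vs "a/b" and
# "a/e/f/g" vs "a/h/f/g" contribute nothing to the score.
bag_similarity = jaccard(doc1, doc2)
print(bag_similarity)
```

Under these assumptions only one of the five distinct paths is shared, even though the two documents are structurally close.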

Summary of the invention

To overcome the deficiencies of existing path-based XML document structural similarity measures, the present invention introduces the concepts of association mining and sequential pattern mining and proposes a document structural similarity calculation method based on frequent associated tag sequences. The method overcomes the deficiencies of existing path-based methods, and the similarity it calculates is more accurate.

The problem to be solved by the present invention is: given an XML document set C, calculate the structural similarity between any two documents d_i and d_j in it.

The technical solution adopted by the present invention to solve this technical problem comprises the following steps:

1. Preprocessing. Parse all XML documents in the XML document set C and model the structure of each XML document as an ordered tag tree. Each node in the tree represents an element in the document and is labeled with the element name, called a tag. The set of all tags extracted from all documents is called the tag set. The structure of each XML document is represented as a set of tag sequences, which yields the tag sequence database TSDB.

A tag sequence is an ordered list of tags from the tag set. The tags are ordered as they are visited along a path from the root node to a leaf node in the tag tree of the XML document. A tag sequence α can be written formally as <a_1, a_2, ..., a_n>, where each a_i is a tag in the tag set. The number of tags in a tag sequence is called its length, and a tag sequence of length l is called an l-tag sequence.
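The preprocessing step can be sketched as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree` parser, not the patent's own implementation; the function names `tag_sequences` and `build_tsdb` and the dictionary layout of the database are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def tag_sequences(xml_text):
    """Return the set of root-to-leaf tag sequences of one XML document."""
    root = ET.fromstring(xml_text)
    sequences = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            # A leaf closes one root-to-leaf path; the set drops duplicates.
            sequences.add(path)
        for child in children:
            stack.append((child, path + (child.tag,)))
    return sequences


def build_tsdb(documents):
    """Map each document id to its set of tag sequences (assumed TSDB layout)."""
    return {doc_id: tag_sequences(text) for doc_id, text in documents.items()}


# Hypothetical document: the path a/b/c occurs twice but is stored once.
tsdb = build_tsdb({"doc1": "<a><b><c/></b><d/><b><c/></b></a>"})
print(tsdb["doc1"])
```

Storing each document as a set means a repeated path is kept only once per document.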

2. Mining frequent tag sequences. Use a frequent sequential pattern mining algorithm to mine the set FTS of all frequent tag sequences from TSDB.

A frequent tag sequence is defined as follows: for a given minimum support threshold δ (0 < δ ≤ 1), if the support of a tag sequence α in TSDB is greater than or equal to δ, then α is called a frequent tag sequence in TSDB.

The support of a tag sequence α in TSDB is the ratio of the number of documents in TSDB that support α to the total number of documents in TSDB, denoted support(α).

A document supports α if the document contains a tag sequence β such that β contains α.

A tag sequence β: <b_1, b_2, ..., b_n> contains a tag sequence α: <a_1, a_2, ..., a_m> if there exist integers 1 ≤ i_1 < i_2 < ... < i_m ≤ n such that

a_{1} = b_{i_{1}}, a_{2} = b_{i_{2}}, \ldots, a_{m} = b_{i_{m}}

This is denoted \alpha \sqsubseteq \beta. In this case α is also called a sub tag sequence of β, and β a super tag sequence of α.
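The containment and support definitions of step 2 can be sketched directly. The functions `contains`, `support` and `is_frequent` are illustrative names, and the three-document database is hypothetical (it is not the content of Fig. 1); the sketch only checks the definitions and does not mine anything.

```python
def contains(beta, alpha):
    """True if alpha is an order-preserving (not necessarily contiguous) subsequence of beta."""
    remaining = iter(beta)
    # Each membership test consumes the iterator up to the first match,
    # so the tags of alpha must be found in increasing positions of beta.
    return all(tag in remaining for tag in alpha)


def support(alpha, tsdb):
    """Fraction of documents holding at least one tag sequence that contains alpha."""
    supporting = sum(
        1 for sequences in tsdb.values()
        if any(contains(beta, alpha) for beta in sequences)
    )
    return supporting / len(tsdb)


def is_frequent(alpha, tsdb, delta):
    """True if support(alpha) reaches the minimum support threshold delta."""
    return support(alpha, tsdb) >= delta


# Hypothetical tag sequence database with three documents.
tsdb = {
    "doc1": {("a", "b", "c"), ("a", "e", "f", "g"), ("a", "d")},
    "doc2": {("a", "b"), ("a", "h", "f", "g"), ("a", "d")},
    "doc3": {("a", "b", "c"), ("a", "f", "g"), ("a", "d")},
}
print(support(("a", "f", "g"), tsdb), support(("a", "e", "f", "g"), tsdb))
```

In this toy database the sequence <a, f, g> is supported by all three documents because it is contained in <a, e, f, g> and <a, h, f, g>, which is exactly the partial matching that exact path comparison misses.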

3. Maximization. Select the maximal frequent tag sequences from FTS to obtain the maximal frequent tag sequence set MFTS.

A maximal frequent tag sequence is a frequent tag sequence α for which no super tag sequence in TSDB is also frequent.
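A minimal sketch of the maximization step: given the frequent tag sequences as a set of tuples, keep only those not contained in another frequent tag sequence. The quadratic pairwise check and the name `maximal_sequences` are assumptions for illustration; the flow in Fig. 4 may organize the work differently.

```python
def contains(beta, alpha):
    """True if alpha is an order-preserving subsequence of beta."""
    remaining = iter(beta)
    return all(tag in remaining for tag in alpha)


def maximal_sequences(fts):
    """Keep the frequent tag sequences that have no frequent super tag sequence."""
    return {
        alpha for alpha in fts
        if not any(alpha != beta and contains(beta, alpha) for beta in fts)
    }


# Hypothetical frequent tag sequence set.
fts = {("a",), ("a", "f"), ("f", "g"), ("a", "f", "g"), ("a", "d")}
mfts = maximal_sequences(fts)
print(mfts)
```

Because every super tag sequence of a frequent sequence that is itself frequent already appears in FTS, comparing FTS members against each other is enough.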

4. Database conversion. For each tag sequence α of each document in TSDB, if MFTS contains a sub tag sequence of α, replace α with that sub tag sequence; otherwise delete α. Once all tag sequences have been processed, the new database TSDB' is obtained.
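The conversion can be sketched as below. The text does not say what happens when several maximal frequent tag sequences are sub tag sequences of the same α, so this sketch assumes that all of them are kept; that choice, and the name `convert_database`, are assumptions rather than part of the described method.

```python
def contains(beta, alpha):
    """True if alpha is an order-preserving subsequence of beta."""
    remaining = iter(beta)
    return all(tag in remaining for tag in alpha)


def convert_database(tsdb, mfts):
    """Rewrite each document as the maximal frequent tag sequences its paths contain."""
    converted = {}
    for doc_id, sequences in tsdb.items():
        replaced = set()
        for alpha in sequences:
            # Assumption: keep every maximal frequent sub tag sequence of alpha.
            # A sequence with no such sub tag sequence adds nothing, i.e. it is deleted.
            replaced.update(m for m in mfts if contains(alpha, m))
        converted[doc_id] = replaced
    return converted


# Hypothetical input: ("a", "x") has no maximal frequent sub tag sequence.
tsdb = {"doc1": {("a", "e", "f", "g"), ("a", "x"), ("a", "d")}}
mfts = {("a", "f", "g"), ("a", "d")}
tsdb_prime = convert_database(tsdb, mfts)
print(tsdb_prime)
```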

5. Mining closed frequent associated tag sequences. Use a closed frequent itemset mining algorithm to mine the set FATS of all closed frequent associated tag sequences from TSDB'.

An associated tag sequence is a set of tag sequences in which, for any tag sequence α in the set, there is no other tag sequence β in the set such that β contains α or α contains β.

A frequent associated tag sequence is defined as follows: for a given minimum support threshold δ (0 < δ ≤ 1), if the support of an associated tag sequence γ in TSDB' is greater than or equal to δ, then γ is called a frequent associated tag sequence in TSDB'.

The support of an associated tag sequence γ in TSDB' is the ratio of the number of documents in TSDB' that support γ to the total number of documents in TSDB', denoted support(γ).

A document supports an associated tag sequence γ if the document supports every tag sequence α in γ.

A closed frequent associated tag sequence γ is one that is frequent in TSDB' and has no proper superset η with the same support in TSDB'.
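The definitions of step 5 can be illustrated with a brute-force sketch. It treats each document of TSDB' as a transaction whose items are maximal frequent tag sequences and relies on the standard fact that the closed itemsets with non-zero support are exactly the intersections of one or more transactions. This is an illustrative stand-in, not an efficient closed itemset miner, and it assumes that no item contains another (which holds when the items are maximal frequent tag sequences), so every itemset is an associated tag sequence. The function name is hypothetical.

```python
def closed_frequent_associated(tsdb_prime, delta):
    """Return {closed frequent associated tag sequence: support} by brute force."""
    transactions = [frozenset(items) for items in tsdb_prime.values()]
    closed = set()
    for transaction in transactions:
        # Every intersection of previously seen transactions with this one,
        # plus the transaction itself, is a closed itemset.
        closed |= {transaction & c for c in closed} | {transaction}
    total = len(transactions)
    result = {}
    for candidate in closed:
        if not candidate:
            continue  # the empty set carries no structural information
        sup = sum(1 for t in transactions if candidate <= t) / total
        if sup >= delta:
            result[candidate] = sup
    return result


# Hypothetical converted database TSDB'.
A, B = ("a", "f", "g"), ("a", "d")
tsdb_prime = {"doc1": {A, B}, "doc2": {A, B}, "doc3": {A}}
fats = closed_frequent_associated(tsdb_prime, 0.5)
print(fats)
```

Here {B} is frequent but not closed, because its proper superset {A, B} has the same support of 2/3.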

6. Document representation. Each document d_i in TSDB' is represented as the set of closed frequent associated tag sequences that it supports. That is:

d_{i} = \{ fats \mid fats \in FATS \wedge d_{i} \text{ supports } fats \}

d_{j} = \{ fats \mid fats \in FATS \wedge d_{j} \text{ supports } fats \}
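The representation step amounts to a filter over FATS. The sketch below assumes that a document in TSDB' is stored as a set of maximal frequent tag sequences and that it supports a member of FATS when that member is a subset of the document's set; the name `represent` is hypothetical.

```python
def represent(doc_items, fats):
    """Return the closed frequent associated tag sequences supported by one document."""
    items = frozenset(doc_items)
    # Assumption: supporting every tag sequence in f means f is a subset of the document.
    return {f for f in fats if f <= items}


# Hypothetical FATS over three maximal frequent tag sequences.
A, B, C = ("a", "f", "g"), ("a", "d"), ("a", "b", "c")
fats = [frozenset({A}), frozenset({A, B}), frozenset({C})]
d_i = represent({A, B}, fats)
print(d_i)
```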

7. Structural similarity calculation. Use the following formula to calculate the structural similarity sim(d_i, d_j) between any two documents d_i and d_j in the document set C.

sim (d_{i}, d_{j}) = \frac{| d_{i} \cap d_{j} | + | p_{j}^{i} | + | p_{i}^{j} |}{| d_{i} \cup d_{j} |}

Wherein:

The beneficial effects of the invention are as follows. The invention adopts the concepts of sequential patterns and association patterns: it regards an XML document as a set of tag sequences, mines closed frequent associated tag sequences from them, and uses these as document features to calculate the structural similarity between XML documents. Introducing sequential patterns addresses the inability of existing path-based methods to handle partial path matches well, and introducing association patterns compensates, to some extent, for their neglect of sibling relationships between XML document elements. As a result, the similarity between documents calculated by the method of the invention is more accurate. Experimental results on real data sets show that, when applied to XML document clustering, the method improves the accuracy of the clustering results compared with other path-based structural similarity calculation methods.

The present invention is further described below with reference to the accompanying drawings.

Description of drawings

Fig. 1 shows sample XML document structure trees;

Fig. 2 is a flow chart of the XML document structural similarity measuring method;

Fig. 3 is a flow chart of XML document preprocessing;

Fig. 4 is a flow chart of frequent tag sequence maximization;

Fig. 5 is a flow chart of tag sequence database conversion;

Fig. 6 shows the precision of the clustering results.

Embodiment

For a given XML document set C, the specific procedure by which the invention calculates the similarity between any two documents is shown in Fig. 2 and comprises the following steps:

1. Preprocess the document set to obtain the tag sequence database TSDB. The processing flow is shown in Fig. 3. During parsing, identical paths of the same XML document appear only once in TSDB. In the figure, d_TS denotes the set of tag sequences contained in document d, and d.id denotes the identifier of document d.

2. Given a minimum support threshold δ, mine the frequent tag sequence set FTS from TSDB. Many frequent sequential pattern mining algorithms can be used to mine frequent tag sequences. The implementation adopts the PrefixSpan algorithm; for a detailed description of this algorithm see document [4] "Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In: Proceedings of the 17th International Conference on Data Engineering (ICDE). (2001) 215-224".

In the implementation, the minimum support threshold δ is set to 0.3.
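A simplified pattern-growth miner in the spirit of prefix projection is sketched below. It is not the PrefixSpan implementation referred to above: it handles only sequences of single tags, and it counts support per document (a document may hold several tag sequences), which is an adaptation assumed here to match the support definition of this specification. The function name and the toy database are hypothetical.

```python
def mine_frequent_tag_sequences(tsdb, delta):
    """Return {frequent tag sequence: support}, counting support per document."""
    docs = list(tsdb.values())
    seqs = [(d, tuple(s)) for d, doc in enumerate(docs) for s in doc]
    frequent = {}

    def grow(prefix, projected):
        # projected holds (sequence index, start) pairs: the suffix of each
        # sequence that follows the leftmost match of the current prefix.
        docs_by_tag = {}
        for idx, start in projected:
            d, seq = seqs[idx]
            for tag in set(seq[start:]):
                docs_by_tag.setdefault(tag, set()).add(d)
        for tag, supporting in sorted(docs_by_tag.items()):
            if len(supporting) / len(docs) < delta:
                continue
            pattern = prefix + (tag,)
            frequent[pattern] = len(supporting) / len(docs)
            next_projected = [
                (idx, seqs[idx][1].index(tag, start) + 1)
                for idx, start in projected
                if tag in seqs[idx][1][start:]
            ]
            grow(pattern, next_projected)

    grow((), [(i, 0) for i in range(len(seqs))])
    return frequent


# Hypothetical database; delta = 1.0 keeps only sequences supported by every document.
tsdb = {
    "doc1": [("a", "b", "c"), ("a", "e", "f", "g"), ("a", "d")],
    "doc2": [("a", "b"), ("a", "h", "f", "g"), ("a", "d")],
    "doc3": [("a", "b", "c"), ("a", "f", "g"), ("a", "d")],
}
fts = mine_frequent_tag_sequences(tsdb, 1.0)
print(sorted(fts))
```

Growing a pattern one tag at a time and projecting each sequence past its leftmost match avoids enumerating candidate subsequences that no document supports.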

3. Apply maximization to the frequent tag sequence set FTS obtained in the previous step to obtain the maximal frequent tag sequence set MFTS. The maximization flow is shown in Fig. 4, where m denotes the length of the longest frequent tag sequence in FTS.

4. Convert the tag sequence database TSDB into the database TSDB', which is represented with maximal frequent tag sequences. The conversion flow is shown in Fig. 5, where d_FTS denotes the set of frequent tag sequences contained in document d.

5. Mine the closed frequent associated tag sequence set FATS from TSDB'. Many closed frequent itemset mining algorithms can be used for this purpose. The implementation adopts the CLOSET+ algorithm; for a detailed description of this algorithm see document [5].

In the implementation, the minimum support threshold δ is again set to 0.3.

After the closed frequent associated tag sequence set FATS has been obtained through the preceding steps, the structural similarity between any two documents can be calculated according to steps 6 and 7 of the summary of the invention above, where

d_{i} = \{ fats \mid fats \in FATS \wedge d_{i} \text{ supports } fats \},

d_{j} = \{ fats \mid fats \in FATS \wedge d_{j} \text{ supports } fats \},

sim (d_{i}, d_{j}) = \frac{| d_{i} \cap d_{j} | + | p_{j}^{i} | + | p_{i}^{j} |}{| d_{i} \cup d_{j} |},
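The formula can be evaluated with a few set operations. The definitions of the sets p_j^i and p_i^j are not present in this text, so the sketch below does not compute them: it takes them as caller-supplied arguments (hypothetical parameters `p_ij` and `p_ji`) and defaults them to empty sets, in which case the value reduces to the plain ratio of shared to total features. Returning 0.0 for two empty documents is also an assumption.

```python
def similarity(d_i, d_j, p_ij=frozenset(), p_ji=frozenset()):
    """Evaluate sim(d_i, d_j) = (|d_i & d_j| + |p_ij| + |p_ji|) / |d_i | d_j|.

    p_ij and p_ji stand for the two extra sets in the numerator; how they are
    derived is not given in this text, so they must be supplied by the caller.
    """
    union = d_i | d_j
    if not union:
        return 0.0  # assumed convention when neither document has any feature
    return (len(d_i & d_j) + len(p_ij) + len(p_ji)) / len(union)


# Hypothetical feature sets; the strings stand in for members of FATS.
d_i = {"f1", "f2", "f3"}
d_j = {"f2", "f3", "f4"}
print(similarity(d_i, d_j))
```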

To verify that the method of the invention effectively improves the accuracy of path-based similarity calculation, it was compared experimentally with several other path-based similarity calculation methods (BOTP, BOXP, commonXPath and subPath). The experiments use two real data sets. One comes from reference [6] "Kurt, A., Tozal, E.: Classification of XSLT-Generated Web Documents With Support Vector Machines. In: Proceedings of the 1st International Workshop on Knowledge Discovery from XML Documents. (2006) 33-42" and is called the Texas data set. The other comes from the XML version of ACM Sigmod Record (see "Sigmod Record in XML, http://www.sigmod.org/publications/sigmodrecord/xml-edition.") and is called the Sigmod data set. The experiments mainly compare the precision of these methods when applied to XML document clustering. The comparison results are shown in Fig. 6. As can be seen from Fig. 6, on both data sets the precision of the method of the invention is improved to varying degrees compared with the other methods.

Claims

1. An XML structural similarity measuring method based on frequent associated tag sequences, characterized in that it comprises the following steps:

1) Preprocessing: parse all XML documents in the XML document set C and model the structure of each XML document as an ordered tag tree, in which each node represents an element in the document and is labeled with the element name, called a tag; the set of all tags extracted from all documents is called the tag set; the structure of each XML document is represented as a set of tag sequences, yielding the tag sequence database TSDB;

a tag sequence is an ordered list of tags from the tag set, ordered as they are visited along a path from the root node to a leaf node in the tag tree of the XML document; a tag sequence α can be written formally as <a_1, a_2, ..., a_n>, where each a_i is a tag in the tag set; the number of tags in a tag sequence is called its length, and a tag sequence of length l is called an l-tag sequence;

2) Mining frequent tag sequences: use a frequent sequential pattern mining algorithm to mine the set FTS of all frequent tag sequences from TSDB;

a frequent tag sequence is defined as follows: for a given minimum support threshold δ, 0 < δ ≤ 1, if the support of a tag sequence α in TSDB is greater than or equal to δ, then α is called a frequent tag sequence in TSDB;

the support of a tag sequence α in TSDB is the ratio of the number of documents in TSDB that support α to the total number of documents in TSDB, denoted support(α);

a document supports α if the document contains a tag sequence β such that β contains α;

a tag sequence β: <b_1, b_2, ..., b_n> contains a tag sequence α: <a_1, a_2, ..., a_m> if there exist integers 1 ≤ i_1 < i_2 < ... < i_m ≤ n such that

a_{1} = b_{i_{1}}, a_{2} = b_{i_{2}}, \ldots, a_{m} = b_{i_{m}}

which is denoted \alpha \sqsubseteq \beta; α is then also called a sub tag sequence of β, and β a super tag sequence of α;

3) Maximization: select the maximal frequent tag sequences from FTS to obtain the maximal frequent tag sequence set MFTS;

a maximal frequent tag sequence is a frequent tag sequence α for which no super tag sequence in TSDB is also frequent;

4) Database conversion: for each tag sequence α of each document in TSDB, if MFTS contains a sub tag sequence of α, replace α with that sub tag sequence; otherwise delete α; once all tag sequences have been processed, the new database TSDB' is obtained;

5) Mining closed frequent associated tag sequences: use a closed frequent itemset mining algorithm to mine the set FATS of all closed frequent associated tag sequences from TSDB';

an associated tag sequence is a set of tag sequences in which, for any tag sequence α in the set, there is no other tag sequence β in the set such that β contains α or α contains β;

a frequent associated tag sequence is defined as follows: for a given minimum support threshold δ, 0 < δ ≤ 1, if the support of an associated tag sequence γ in TSDB' is greater than or equal to δ, then γ is called a frequent associated tag sequence in TSDB';

the support of an associated tag sequence γ in TSDB' is the ratio of the number of documents in TSDB' that support γ to the total number of documents in TSDB', denoted support(γ);

a document supports an associated tag sequence γ if the document supports every tag sequence α in γ;

a closed frequent associated tag sequence γ is one that is frequent in TSDB' and has no proper superset η with the same support in TSDB';

6) Document representation: each document d_i in TSDB' is represented as the set of closed frequent associated tag sequences that it supports, i.e. d_{i} = \{ fats \mid fats \in FATS \wedge d_{i} \text{ supports } fats \};

7) Structural similarity calculation: use the formula

sim (d_{i}, d_{j}) = \frac{| d_{i} \cap d_{j} | + | p_{j}^{i} | + | p_{i}^{j} |}{| d_{i} \cup d_{j} |}

to calculate the structural similarity sim(d_i, d_j) between any two documents d_i and d_j in the document set C,

Wherein: