CN102722556A - Model comparison method based on similarity measurement

Info

Publication number: CN102722556A
Application number: CN2012101712517A
Authority: CN
Inventors: 覃征; 赵凤飞; 徐哲; 王珍; 徐文华; 任博岩; 胡浩; 李金星; 王瑶
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2012-05-29
Filing date: 2012-05-29
Publication date: 2012-10-10
Anticipated expiration: 2032-05-29
Also published as: CN102722556B

Abstract

The invention discloses a model comparison method based on similarity measurement. The method comprises the following steps: step 10, determining the models to be compared; step 20, obtaining from the models to be compared the nodes that form each model; step 30, calculating the node similarity of the nodes between the models to be compared; step 40, calculating the model similarity of the models to be compared according to that node similarity; and step 50, obtaining the relation between the models to be compared on the basis of the model similarity. The method combines text similarity and label similarity when calculating the node similarity, which overcomes the problem of considering only the text while ignoring the label characteristics of model elements, so that the node similarity better reflects the actual state of the models.

Description

A model comparison method based on similarity measurement

Technical field

The present invention relates to the database field of computer science, and in particular to a comparison method for semi-structured models based on similarity measurement.

Background technology

Version control is the process of identifying and tracking the different versions of a system, so that versions can be distinguished, retrieved, and traced and the relations between them shown. Version comparison is an important module of version control. Its purpose is to give the user a better understanding of the version currently in use: the current version is compared with earlier versions, and the differences between the two versions are presented clearly to the user.

After a long period of development, traditional version comparison tools are relatively mature. Most traditional version comparison methods are line-based, that is, they mark the lines of the compared texts that differ. For model comparison, existing methods directly match the text and structure in the models. Although the related techniques have made considerable progress, the model comparison methods used in current modeling tools are still not fully satisfactory.

Current modeling tools compare two models only in a very simple way: two models are considered to match only when they are identical at the storage level, and even a slight difference between two models may change the comparison result. The models that users build, however, are often based on semantic relations or on structural relations within the model, and these characteristics cannot be recognized by a whole-model comparison tool. Existing model comparison tools therefore still fall short of users' needs. Specifically, the shortcomings of current model comparison methods can be summarized as follows:

(1) When comparing models, they cannot recognize two concepts that are synonymous or semantically similar, so two models with a similar semantic relation are easily treated as different. For example, two models named "protection guided missile" and "defending missile" are the same from the user's point of view, but are treated as two different concepts in model comparison.

(2) They lack an understanding of the relation between two models from heterogeneous data sources. Because several teams may take part in modeling, they can easily understand a given model differently and express the same model in different ways, which directly leads to differences in the structure of the generated models. For example, the book model of a certain library may be expressed as the three models shown in Fig. 1, yet to the user these three representations should be identical.

(3) They are not suited to comparing the models of current mainstream modeling tools. In general modeling tools, models are stored in files as XML, and methods for comparing XML are abundant, but they do not carry over directly because model comparison has characteristics specific to the modeling field.

Therefore, a model comparison method based on similarity measurement is urgently needed to solve the above problems.

Summary of the invention

One of the technical problems to be solved by the present invention is to provide a model comparison method based on similarity measurement that makes the results of model comparison more accurate and objective.

To solve the above technical problem, the invention provides a model comparison method based on similarity measurement, comprising: step 10, determining the models to be compared; step 20, obtaining from the models to be compared the nodes that form each model; step 30, calculating the node similarity of the nodes between the models to be compared; step 40, calculating the model similarity between the models to be compared based on that node similarity; step 50, obtaining the relation between the models to be compared based on the model similarity.

According to a further aspect of the invention, in step 30 the node text similarity and the node label similarity of the nodes between the models to be compared are calculated to obtain the node similarity of those nodes.

According to a further aspect of the invention, the node label similarity of the nodes between the models to be compared is obtained based on the semantic relation between the labels corresponding to the nodes.

According to a further aspect of the invention, the node text similarity of the nodes between the models to be compared is obtained based on the string edit distance between the nodes.

According to a further aspect of the invention, the node text similarity between the models to be compared is obtained with the following expression:

SmaticSim (X, Y) = 1 - \frac{E (X, Y)}{\max (| X |, | Y |)}

where |X| and |Y| denote the lengths of the character strings of node X and node Y respectively, E(X, Y) denotes the string edit distance between node X and node Y, and SmaticSim(X, Y) denotes the node text similarity of node X and node Y.

According to a further aspect of the invention, the node similarity of the nodes between the models to be compared is obtained with the following expression:

NodeSim (X, Y) = ∂ LabSim (X, Y) + (1 - ∂) SmaticSim (X, Y)

where ∂ denotes the combination weight, NodeSim(X, Y) denotes the node similarity of node X and node Y, LabSim(X, Y) denotes the node label similarity between node X and node Y, and SmaticSim(X, Y) denotes the node text similarity between node X and node Y.

According to a further aspect of the invention, step 40 specifically comprises the following steps:

Step 41, calculating, based on the node similarity of the nodes, the path similarity of each path and the level similarity of each level between the models to be compared;

Step 42, obtaining the model similarity between the models to be compared based on the path similarity of each path and the level similarity of each level,

wherein a path is the string formed by the nodes passed through from the root node to a leaf node in the tree structure of a model to be compared.

According to a further aspect of the invention, in step 41 the node similarity of the nodes is used to obtain the path similarity of each path between the models to be compared, based on the longest common subsequence method and/or the stratification method.

According to a further aspect of the invention, in step 41 the level similarity of each level between the models to be compared is obtained with the following expression:

{HorizSim}_{i} (A, B) = \frac{2 \times | trim ({Al}_{i} \cap^{sim} {Bl}_{i}) |}{| {Al}_{i} | + | {Bl}_{i} |}

where trim(Al_i ∩^sim Bl_i) denotes the node set obtained by removing duplicate nodes from the set Al_i ∩^sim Bl_i, |trim(Al_i ∩^sim Bl_i)| denotes the size of that set, and |Al_i| and |Bl_i| denote the numbers of nodes in layer i of model A and model B respectively. The set Al_i ∩^sim Bl_i is defined by the following expression:

(a, b) \in {Al}_{i} \cap^{sim} {Bl}_{i} \Leftrightarrow a \in {Al}_{i}, b \in {Bl}_{i} and NodeSim (a, b) \geq k

where a and b denote a node in layer i of model A and of model B respectively, and k is a preset similarity threshold. Al_i is the node set of layer i of model A, where i is the depth, an integer greater than or equal to 1 and less than or equal to h; Bl_i is the node set of layer i of model B; and NodeSim(a, b) is the node similarity of node a and node b.
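By way of illustration only, the level similarity of one layer can be sketched in Python as follows. The text does not spell out how trim removes duplicate nodes, so the sketch assumes a greedy one-to-one matching in which every node is counted at most once. The function name `horiz_sim`, the callable `node_sim`, and the value returned for two empty layers are all assumptions.

```python
def horiz_sim(layer_a, layer_b, node_sim, k):
    """Level similarity 2 * |trim(...)| / (|Al_i| + |Bl_i|) for one layer.

    layer_a, layer_b: lists of nodes in layer i of models A and B.
    node_sim: callable returning NodeSim(a, b) in [0, 1] (supplied by the caller).
    k: preset similarity threshold.
    """
    if not layer_a and not layer_b:
        return 1.0  # assumption: two empty layers are treated as identical
    used_b = set()  # indices of layer_b nodes that are already matched
    matched = 0
    for a in layer_a:
        for j, b in enumerate(layer_b):
            # Assumed reading of trim: each node takes part in at most one pair.
            if j not in used_b and node_sim(a, b) >= k:
                used_b.add(j)
                matched += 1
                break
    return 2.0 * matched / (len(layer_a) + len(layer_b))


def exact(a, b):
    """Toy node similarity used only for this example."""
    return 1.0 if a == b else 0.0


print(horiz_sim(["beginning", "main body"], ["beginning", "body"], exact, 0.8))
```

A greedy pass is the simplest reading; a maximum bipartite matching would be a stricter alternative if the order of nodes should not influence the count.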

According to a further aspect of the invention, step 42 specifically comprises the following steps:

Step 421, obtaining the model vertical similarity between the models to be compared based on the path similarity of each path;

Step 422, obtaining the model horizontal similarity between the models to be compared based on the level similarity of each level;

Step 423, obtaining the model similarity between the models to be compared based on the model vertical similarity and/or the model horizontal similarity.

According to a further aspect of the invention, in step 421 the model vertical similarity between the models to be compared is obtained with the following expression:

VerticSim (A, B) = \frac{Σ_{P_{1} \in P_{A}} [\max_{P_{2} \in P_{B}} (SimPath (P_{1}, P_{2}))] + Σ_{P_{2} \in P_{B}} [\max_{P_{1} \in P_{A}} (SimPath (P_{1}, P_{2}))]}{| P_{A} | + | P_{B} |}

where |P_A| and |P_B| denote the numbers of paths contained in model A and model B to be compared, P_A and P_B denote the sets of paths of model A and model B, P_1 and P_2 denote any path in P_A and P_B respectively, SimPath(P_1, P_2) denotes the path similarity between path P_1 in model A and path P_2 in model B, and VerticSim(A, B) denotes the model vertical similarity of model A and model B.

According to a further aspect of the invention, in step 422 the model horizontal similarity between the models to be compared is obtained with the following expression:

HierSim (A, B) = Σ_{i = 1}^{h} \frac{{HorizSim}_{i} (A, B) \times r^{i}}{1 + r + r^{2} + \cdots + r^{h}}

where HierSim(A, B) denotes the model horizontal similarity between model A and model B, HorizSim_i(A, B) denotes the level similarity of level i between model A and model B, and r is a discount factor with 0 < r ≤ 1.
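As a minimal sketch under stated assumptions, the discounted sum above can be evaluated in Python as follows. The function name `hier_sim` and the list-based input are assumptions. The denominator follows the expression as printed, so it includes the leading term 1.

```python
def hier_sim(level_sims, r):
    """Model horizontal similarity from per-level similarities.

    level_sims[i - 1] holds HorizSim_i(A, B) for i = 1..h.
    r is the discount factor, 0 < r <= 1.
    """
    if not 0.0 < r <= 1.0:
        raise ValueError("r must satisfy 0 < r <= 1")
    h = len(level_sims)
    # Denominator as printed in the expression: 1 + r + r^2 + ... + r^h.
    denominator = sum(r ** p for p in range(h + 1))
    # Numerator: each level similarity discounted by r^i, with i starting at 1.
    numerator = sum(s * r ** i for i, s in enumerate(level_sims, start=1))
    return numerator / denominator


print(hier_sim([1.0, 0.5], 0.5))
```

Because the numerator starts at r^1 while the denominator starts at r^0, the value stays below 1 even when every level similarity equals 1.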

According to a further aspect of the invention, in step 50 the model similarity is compared with a set threshold to obtain the relation between the models to be compared.

Compared with the prior art, one or more embodiments of the invention can have the following advantage:

Because text similarity and label similarity are combined when the node similarity is calculated, the problem of considering only the text while ignoring the label characteristics of model elements is overcome, and the node similarity better reflects the actual state of the models.

Other features and advantages of the invention will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained through the structure particularly pointed out in the description, the claims, and the drawings.

Description of drawings

The accompanying drawings provide a further understanding of the invention and form part of the description. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a schematic diagram of three models derived from different data sources;

Fig. 2 is a schematic flowchart of the model comparison method based on similarity measurement according to an embodiment of the invention;

Fig. 3(a) and Fig. 3(b) are examples of the storage structure of a model and the tree structure of the model, respectively;

Fig. 4 is a schematic diagram of the composition of the similarities in the model comparison method based on similarity measurement according to an embodiment of the invention, and of the relations between them;

Fig. 5 is a schematic diagram of the composition of the node similarity in the model comparison method based on similarity measurement according to an embodiment of the invention.

Embodiment

Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. Note that, provided they do not conflict, the embodiments of the invention and the features of the embodiments can be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.

In addition, the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

Fig. 2 is a schematic flowchart of the model comparison method based on similarity measurement according to an embodiment of the invention.

Note that all elements of the models in this embodiment are in Extensible Markup Language (XML) format. Besides XML, the method of the invention applies to any semi-structured model representation that can be converted into a tree structure.

In this embodiment, each model file corresponds to an XML file and contains several models, and each model has its own information, as shown in Fig. 1 and Fig. 3. Comparing models therefore becomes comparing fragments of two XML files.

For ease of description, the terms used in this embodiment are explained with Fig. 3 as an example, where Fig. 3(b) is the tree structure of the model, converted from the storage structure of the model in Fig. 3(a):

(1) Node: nodes include element nodes, attribute nodes, and value nodes; for example, "article", "beginning", and "main body" in Fig. 3 are all nodes.

(2) Path of a model: a given model is first parsed into the corresponding XML tree; the string formed by the nodes passed through from the root node to a leaf node (with "/" as the separator between nodes) is called a path. For example, "article/beginning/summary" is a path in the model of Fig. 3.
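The path definition above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is a minimal sketch that assumes only element nodes are considered (attribute nodes and value nodes are ignored). The function name `model_paths` and the sample document are hypothetical and are not the content of Fig. 3.

```python
import xml.etree.ElementTree as ET


def model_paths(xml_text):
    """Return every root-to-leaf path of element tags, joined with '/'."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(element, prefix):
        current = prefix + [element.tag]
        children = list(element)
        if not children:
            # A leaf element closes one complete path.
            paths.append("/".join(current))
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths


# Hypothetical document, loosely shaped like the "article" example in the text.
sample = (
    "<article>"
    "<beginning><summary/><title/></beginning>"
    "<main_body><paragraph/></main_body>"
    "</article>"
)
print(model_paths(sample))
```

The paths are returned in document order, which keeps the result deterministic when two models are later compared path by path.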

From the two models to be compared, it can be seen that the whole abstract tree can be regarded as consisting entirely of nodes and the hierarchical relations between nodes.

Fig. 4 is a schematic diagram of the composition of the similarities in the model comparison method based on similarity measurement according to an embodiment of the invention, and of the relations between them. As shown in Fig. 4, in this embodiment the similarity between models is obtained by calculating the horizontal similarity and/or the vertical similarity between the models. The model vertical similarity is calculated from the path similarity between the models, and the model horizontal similarity is obtained from the level similarity of the models. The node similarity itself is divided into two parts, the node label similarity and the node text similarity, as shown in Fig. 5.

The steps of this embodiment are described in detail below with reference to Fig. 2.

Step S210: determine the models to be compared, denoted model A and model B.

Step S220: obtain from the models to be compared the nodes that form each model.

Step S230: calculate the node similarity of the nodes between the models to be compared.

Specifically, the node text similarity and the node label similarity of the nodes between the models to be compared are calculated to obtain the node similarity of those nodes.

First, the node label similarity of the nodes between the models to be compared is obtained based on the semantic relation between the labels corresponding to the nodes.

In this embodiment a model contains several kinds of elements (that is, nodes), so the similarity between them must be considered in order to compare models more accurately. Each model corresponds to a semi-structured document fragment at the bottom layer, so each label in the underlying XML document, together with the text content of that label, corresponds to one element.

Preferably, in this embodiment the elements of a model have the following 9 kinds of labels: concept, attribute, complex attribute, inheritance, synonym, antonym, reference, aggregation, and custom. The relations between them are quantified according to their practical meaning, as shown in Table 1.

Table 1

For example, if node 1 is "sports" with the label "concept" and node 2 is "football" with the label "reference", the node label similarity between the two nodes obtained from the table above is 0.1.

Then, the node text similarity of the nodes between the models to be compared is obtained based on the string edit distance between the nodes.

The node text similarity measures the similarity of the text strings of the nodes and is realized through the string edit distance, a measure of the similarity between strings. Given two strings S and T, a sequence of deletion, insertion, and replacement operations that converts S into T is called an edit path from S to T, and the length of the shortest edit path is called the edit distance between S and T. Here the edit distance is determined by dynamic programming.

Let the string of node X be X = [x_0 x_1 ... x_i ... x_m] and the string of node Y be Y = [y_0 y_1 ... y_j ... y_n]. Let EDIT(i, j) denote the edit distance between the substring [x_0 x_1 ... x_i] of X and the substring [y_0 y_1 ... y_j] of Y. Let D(i, j) denote the number of operations needed to turn the i-th character X(i) of X into the j-th character Y(j) of Y: if X(i) = Y(j), no operation is needed and D(i, j) = 0; otherwise a replacement is needed and D(i, j) = 1. By dynamic programming, the edit distance E(X, Y) between X and Y is given by the following recurrence:

EDIT (i, j) = 1, if i = 0 and j = 0;

EDIT (i, j) = EDIT (i, j - 1) + 1, if i = 0 and j > 0;

EDIT (i, j) = EDIT (i - 1, j) + 1, if i > 0 and j = 0;

EDIT (i, j) = \min (EDIT (i - 1, j) + 1, EDIT (i, j - 1) + 1, EDIT (i - 1, j - 1) + D (i, j)), if i > 0 and j > 0.

To simplify later calculations, the edit distance between X and Y is normalized to the range 0 to 1, which gives the node text similarity between nodes X and Y:

SmaticSim (X, Y) = 1 - \frac{E (X, Y)}{\max (| X |, | Y |)}

where |X| and |Y| denote the lengths of the character strings of node X and node Y respectively.
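For illustration, the edit distance and the normalized node text similarity can be sketched in Python as follows. The sketch uses the textbook Levenshtein recurrence indexed by prefix lengths, in which the distance between two empty prefixes is 0; this boundary convention is an assumption and differs from the first case of the recurrence as printed above. The function names and the handling of two empty strings are also assumptions.

```python
def edit_distance(x, y):
    """Levenshtein distance between strings x and y by dynamic programming."""
    m, n = len(x), len(y)
    # edit[i][j]: distance between the first i characters of x
    # and the first j characters of y.
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        edit[i][0] = i  # delete i characters
    for j in range(1, n + 1):
        edit[0][j] = j  # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if x[i - 1] == y[j - 1] else 1  # D(i, j) in the text
            edit[i][j] = min(
                edit[i - 1][j] + 1,      # deletion
                edit[i][j - 1] + 1,      # insertion
                edit[i - 1][j - 1] + d,  # match or replacement
            )
    return edit[m][n]


def smatic_sim(x, y):
    """Node text similarity: 1 - E(X, Y) / max(|X|, |Y|)."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - edit_distance(x, y) / longest


print(edit_distance("kitten", "sitting"), smatic_sim("kitten", "sitting"))
```

Since the edit distance never exceeds the length of the longer string, the result always lies between 0 and 1.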

Combining the node text similarity and the node label similarity finally gives the node similarity:

NodeSim (X, Y) = ∂ LabSim (X, Y) + (1 - ∂) SmaticSim (X, Y)

where ∂ denotes the combination weight, which specifies the proportions of LabSim(X, Y) and SmaticSim(X, Y) in the final combined result; NodeSim(X, Y) denotes the similarity of node X and node Y; LabSim(X, Y) denotes the node label similarity between node X and node Y; and SmaticSim(X, Y) denotes the node text similarity between nodes X and Y.
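A minimal sketch of this weighted combination is given below. The function name `node_sim` and the restriction of the weight to the range 0 to 1 are assumptions. In the usage line, the label similarity 0.1 is the value from the example above, while the text similarity 0.25 and the weight 0.5 are made-up inputs chosen only for illustration.

```python
def node_sim(lab_sim, text_sim, weight):
    """NodeSim = weight * LabSim + (1 - weight) * SmaticSim."""
    if not 0.0 <= weight <= 1.0:
        # Assumption: the combination weight is a proportion between 0 and 1.
        raise ValueError("weight must lie between 0 and 1")
    return weight * lab_sim + (1.0 - weight) * text_sim


# Label similarity 0.1 comes from the text; 0.25 and 0.5 are illustrative.
print(node_sim(0.1, 0.25, 0.5))
```

A weight close to 1 lets the label relation dominate, and a weight close to 0 lets the text comparison dominate.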

Step S240: calculate the path similarity of each path between the models to be compared based on the node similarity of the nodes.

First, the XML corresponding to each model is decomposed into a set of paths; then the similarity of the corresponding paths in the two models is calculated. The more matching nodes two paths share, the more similar they are. However, as mentioned above, nodes may be synonyms with different names, and different people may understand the model structure differently, so the condition for nodes to match needs to be relaxed.

In this embodiment, the similarity between any two paths of models A and B is calculated in the following steps.

(1) Preprocess the paths of the two models.

Specifically, the node similarity comparison results are used to find the nodes on the two paths to be compared that have similar semantic labels, and each such pair of nodes is marked to show that they have the same semantic label.

More specifically, for two paths P_1 and P_2, let V_{P_1} and V_{P_2} denote nodes on the two paths respectively. If the node similarity NodeSim(V_{P_1}, V_{P_2}) between the two nodes reaches the preset threshold α, where α is a real number between 0 and 1, then V_{P_1} and V_{P_2} are considered to match each other. In this process, the mutually matching nodes are marked.
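The preprocessing step can be sketched in Python as follows. This is an illustrative sketch only: the comparison `>=` against the threshold is an assumption, since the text does not show whether the threshold itself counts as a match, and the function name `mark_matches`, the toy similarity, and the synonym pairs are hypothetical.

```python
def mark_matches(path1, path2, node_sim, alpha):
    """Return (i, j) index pairs of nodes that match across two paths.

    path1, path2: lists of node names from root to leaf.
    node_sim: callable returning NodeSim for two nodes.
    alpha: preset threshold, a real number between 0 and 1.
    """
    return [
        (i, j)
        for i, a in enumerate(path1)
        for j, b in enumerate(path2)
        if node_sim(a, b) >= alpha  # assumed inclusive comparison
    ]


# Hypothetical synonym pairs standing in for a real node similarity.
SYNONYMS = {frozenset(("beginning", "start")), frozenset(("summary", "abstract"))}


def toy_node_sim(a, b):
    if a == b:
        return 1.0
    return 0.9 if frozenset((a, b)) in SYNONYMS else 0.0


pairs = mark_matches(
    ["article", "beginning", "summary"],
    ["article", "start", "abstract"],
    toy_node_sim,
    0.8,
)
print(pairs)
```

The returned index pairs play the role of the marks: later steps can treat the two nodes of each pair as identical.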

(2) Using the node similarity, obtain the path similarity of each path between the models to be compared based on the longest common subsequence method and/or the stratification method.

Next, the calculation of the path similarity between two models based on the longest common subsequence (LCS) method, hereinafter the LCS path similarity, is described in detail.

To make paths easy to compare, the invention uses the LCS to describe the similarity between two paths.

Formally, given a sequence X = <x_1, x_2, ..., x_m>, another sequence Z = <z_1, z_2, ..., z_k> is a subsequence of X if there exists a strictly increasing sequence of indices <i_1, i_2, ..., i_k> of X such that x_{i_j} = z_j for all j = 1, 2, ..., k.

For example, Z = <B, C, D, B> is a subsequence of X = <A, B, C, B, D, A, B>.

The longest common subsequence is defined as follows: if a sequence S is a subsequence of each of two or more known sequences and is the longest of all sequences satisfying this condition, then S is called the longest common subsequence of the known sequences.

First, two nodes that were marked during preprocessing are directly regarded as identical. Then the LCS method is used to compare the two processed paths. The longer the longest common subsequence of the two model paths, the more the two paths overlap structurally and the more similar they are. In addition, because higher-level nodes represent the structural information of the model better than lower-level nodes, the weight of each node needs to be considered in the comparison.

The path similarity of each path between the models to be compared can therefore be expressed as:

{SP}_{LCS} (P_{1}, P_{2}) = α \times \frac{| LCS (P_{1}, P_{2}) |}{| level (P_{1}) |}

where SP_LCS(P_1, P_2) denotes the similarity of paths P_1 and P_2, |level(P_1)| denotes the number of levels (or nodes) of path P_1, |LCS(P_1, P_2)| denotes the number of nodes in the longest common subsequence of model paths P_1 and P_2, and α is the weight of the common-subsequence nodes in LCS(P_1, P_2).

α is given by the following expression:

α = Π_{n = 1}^{| LCS (P_{1}, P_{2}) |} \frac{| level (P_{1}) | - {level}_{P_{1}} ({LCS}_{n} (P_{1}, P_{2}))}{| level (P_{1}) |}

where |LCS(P_1, P_2)| denotes the number of nodes in the longest common subsequence of model paths P_1 and P_2, |level(P_1)| denotes the number of levels (or nodes) of path P_1, LCS_n(P_1, P_2) denotes the n-th node of the longest common subsequence of P_1 and P_2 (model paths run from top to bottom), and level_{P_1}(LCS_n(P_1, P_2)) denotes the level on path P_1 of the n-th node of the longest common subsequence of P_1 and P_2.

Calculating the path similarity with the LCS method improves the handling of subpaths whose nodes appear in the same order as those of some path, and effectively accounts for containment between paths.
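The two expressions above can be sketched in Python as follows. The sketch assumes that levels on P_1 are counted from 0 at the root, because the text does not state the convention and counting from 1 would make the weight vanish whenever the leaf belongs to the common subsequence. It also assumes that marked nodes have already been made comparable through the `same` predicate. The function names are hypothetical.

```python
def lcs_indices(p1, p2, same=lambda a, b: a == b):
    """Indices in p1 of one longest common subsequence of p1 and p2."""
    m, n = len(p1), len(p2)
    # table[i][j]: LCS length of the suffixes p1[i:] and p2[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if same(p1[i], p2[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    indices, i, j = [], 0, 0
    while i < m and j < n:
        if same(p1[i], p2[j]):
            indices.append(i)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return indices


def sp_lcs(p1, p2):
    """LCS path similarity: alpha * |LCS(P1, P2)| / |level(P1)|."""
    levels = len(p1)  # |level(P1)|, the number of nodes on the path
    if levels == 0:
        return 0.0  # assumption: an empty path has no similarity
    common = lcs_indices(p1, p2)
    alpha = 1.0
    for level in common:  # assumed 0-based level of each LCS node on p1
        alpha *= (levels - level) / levels
    return alpha * len(common) / levels


print(sp_lcs(["article", "beginning", "summary"], ["article", "summary"]))
```

Under this reading the weight shrinks as more and deeper nodes are matched, so the result is best read as a relative score rather than as a value that reaches 1 for identical paths.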

In addition, the path similarity between two models can also be calculated based on the stratification method (hereinafter the stratification path similarity).

Unlike the LCS path similarity, the stratification path similarity does not require similar nodes to appear in a strictly fixed order. A node on path A can select, among all nodes of path B, the node closest to itself, and the two best-matching nodes may lie on different levels; the closer their levels, the higher the similarity. In this step, the similarity of two paths is measured through the relative levels of their nodes. Because a path consists of nodes, the similarity between two nodes on the two paths is measured first.

Specifically, the processing is as follows:

First, calculate the similarity between two mutually matching nodes on the paths.

Preprocessing yields several pairs of mutually matching nodes on the two paths. These matching nodes best reflect the similarity of the two paths, so only they need to be processed when the path similarity is considered.

From the weights of the mutually matching nodes on the two paths and the correlation of their levels, the similarity between two mutually matching nodes on the paths is calculated as follows:

{SV}^{'} (V_{P_{1}}, V_{P_{2}}) = 1 - \frac{| {level}_{P_{1}} (V_{P_{1}}) - {level}_{P_{2}} (V_{P_{2}}) |}{\max (level (P_{1}), level (P_{2}))}

where level(P_1) and level(P_2) denote the numbers of levels of paths P_1 and P_2 respectively, and level_{P_1}(V_{P_1}) and level_{P_2}(V_{P_2}) denote the level of node V_{P_1} on path P_1 and the level of node V_{P_2} on path P_2 respectively.

Then, choose the node weights.

On a model path, a high-level node usually has a greater influence on the whole path than a low-level node, and reflects the structure of the whole document more accurately. Therefore, when the path similarity is calculated, nodes can be given different weights to reflect their importance to the whole path.

For example, given two paths P_1 and P_2 whose mutually matching nodes on P_1 are, in order, {V_0, V_1, ..., V_n}, node V_i is given the weight β^{level(V_i)}, where 0 < β < 1 and level(V_i) denotes the actual level of V_i in P_1. With node weights taken into account, the similarity SV(V_{P_1}, V_{P_2}) between mutually matching nodes on the two paths is finally expressed as:

SV (V_{P_{1}}, V_{P_{2}}) = β^{{level}_{P_{1}} (V_{P_{1}})} \times (1 - \frac{| {level}_{P_{1}} (V_{P_{1}}) - {level}_{P_{2}} (V_{P_{2}}) |}{\max (level (P_{1}), level (P_{2}))})

where level_{P_1}(V_{P_1}) denotes the level of node V_{P_1} on path P_1.
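The weighted node-pair similarity can be sketched in Python as follows. The function name `sv` and its argument names are assumptions, and the usage lines assume levels counted from 0 at the root, which the text does not specify.

```python
def sv(level1, level2, depth1, depth2, beta):
    """Similarity of two matching nodes from their levels on each path.

    level1, level2: levels of the matching nodes on P1 and P2.
    depth1, depth2: numbers of levels of P1 and P2.
    beta: level discount, 0 < beta < 1.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must satisfy 0 < beta < 1")
    # Unweighted part SV': 1 minus the normalized level offset.
    closeness = 1.0 - abs(level1 - level2) / max(depth1, depth2)
    # Weight beta ** level gives higher-level (smaller level) nodes more influence.
    return beta ** level1 * closeness


print(sv(0, 0, 3, 3, 0.5))  # two root nodes on paths of equal depth
print(sv(1, 2, 3, 4, 0.5))  # matching nodes one level apart
```

Because β is smaller than 1, a pair found deep in the path contributes less than an equally well aligned pair near the root.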

Finally, calculate the stratification path similarity.

Given paths P_1 and P_2, for a node V_{P_1} on path P_1, its best-matching node on P_2 is defined as:

The final similarity SP_wl(P_1, P_2) of paths P_1 and P_2 is then:

{SP}_{wl} (P_{1}, P_{2}) = \frac{Σ_{v_{p 1}} {Mv}_{P_{2}} (P_{1})}{| P |}

Finally, the LCS similarity and/or the stratification similarity of each path can be taken as the path similarity between the models to be compared.

In this embodiment, the LCS similarity and the stratification similarity of each path between the models to be compared are preferably combined to obtain the similarity between two model paths.

The final path similarity of two model paths can be written as:

SimPath (P_{1}, P_{2}) = γ {SP}_{LCS} (P_{1}, P_{2}) + (1 - γ) {SP}_{wl} (P_{1}, P_{2})

where γ and 1 - γ denote the weights of SP_LCS(P_1, P_2) and SP_wl(P_1, P_2) respectively.

In the steps above, SP_LCS(P_1, P_2) and SP_wl(P_1, P_2) denote the LCS similarity and the stratification similarity respectively. Each has strengths and weaknesses, and the two are complementary: SP_LCS(P_1, P_2) effectively accounts for containment between paths but lacks an understanding of heterogeneous data sources, whereas SP_wl(P_1, P_2) effectively handles misaligned levels in the models but cannot account for containment between two models. This step therefore combines the similarities obtained by the two methods, giving each a weight, to obtain the final path similarity value.

Step S250: obtain the model similarity between the models to be compared based on the path similarity of each path and the level similarity of each level.

Specifically, this comprises the following substeps:

Step S2501: obtain the model vertical similarity between the models to be compared based on the path similarity of each path;

Step S2502: obtain the model horizontal similarity between the models to be compared based on the level similarity of each level;

Step S2503: obtain the model similarity between the models to be compared based on the model vertical similarity and/or the model horizontal similarity.

Regarding step S2501, in this embodiment the vertical similarity between models measures the similarity between models in the vertical direction, that is, the direction running from the root node of a model step by step to its terminal nodes. Each such progression from the root node to a terminal node can be represented by a path. A model element in the invention can contain several sub-elements, so several paths are involved when the vertical similarity of models is measured. For example, the model in Fig. 3 can be expressed as three paths: article/beginning/summary; article/beginning/title; article/paragraph/main body.

The model vertical similarity considers the hierarchical relations between the node elements of a model, and the hierarchy and the relations between nodes are most directly embodied in the paths of the model. The similarities between the paths contained in models A and B are therefore combined to obtain the vertical similarity between models A and B.

Specifically, given model A and model B, let P_A and P_B denote the sets of paths of A and B respectively. Each path of model A is compared with the paths of model B one by one, and the maximum similarity value is taken as the similarity of that path. The same operation is performed for model B. Finally, the similarities of all paths of A and B are summed and averaged to give the vertical similarity between models A and B, as shown in the following formula:

VerticSim(A, B) = \frac{\sum_{P_{1} \in P_{A}} \max_{P_{2} \in P_{B}} SimPath(P_{1}, P_{2}) + \sum_{P_{2} \in P_{B}} \max_{P_{1} \in P_{A}} SimPath(P_{1}, P_{2})}{|P_{A}| + |P_{B}|}

where |P_A| and |P_B| denote the number of paths contained in model A and model B respectively, P_A and P_B denote the sets of paths in models A and B to be compared, P₁ and P₂ denote any path in P_A and P_B respectively, SimPath(P₁, P₂) denotes the path similarity between path P₁ in model A and path P₂ in model B, and VerticSim(A, B) denotes the vertical model similarity between model A and model B.
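As a non-limiting illustration, the vertical similarity formula above can be sketched in Python as follows. The function name `vertic_sim`, the representation of a path as a tuple of labels, and the stand-in `toy_sim_path` (which only counts labels that are equal position by position) are assumptions made for this sketch and are not part of the method; in practice SimPath would be the weighted path similarity described above. Returning 0.0 when either path set is empty is likewise an assumption, since the formula does not cover that case.

```python
from typing import Callable, Sequence, Tuple

Path = Tuple[str, ...]


def vertic_sim(
    paths_a: Sequence[Path],
    paths_b: Sequence[Path],
    sim_path: Callable[[Path, Path], float],
) -> float:
    """VerticSim(A, B): average of each path's best match in the other model."""
    if not paths_a or not paths_b:
        return 0.0  # assumption: the formula is undefined for an empty path set
    # For every path of A, keep its maximum similarity against the paths of B.
    best_a = sum(max(sim_path(p1, p2) for p2 in paths_b) for p1 in paths_a)
    # Same operation for every path of B against the paths of A.
    best_b = sum(max(sim_path(p1, p2) for p1 in paths_a) for p2 in paths_b)
    return (best_a + best_b) / (len(paths_a) + len(paths_b))


def toy_sim_path(p1: Path, p2: Path) -> float:
    """Hypothetical stand-in for SimPath: share of positions with equal labels."""
    matches = sum(1 for x, y in zip(p1, p2) if x == y)
    return matches / max(len(p1), len(p2))


if __name__ == "__main__":
    model_a = [
        ("article", "beginning", "summary"),
        ("article", "beginning", "title"),
        ("article", "paragraph", "main body"),
    ]
    model_b = [
        ("article", "beginning", "title"),
        ("article", "paragraph", "main body"),
    ]
    # (2/3 + 1 + 1 + 1 + 1) / 5 = 14/15, approximately 0.933
    print(vertic_sim(model_a, model_b, toy_sim_path))
```

Because both directions are summed before dividing by |P_A| + |P_B|, the result is symmetric in A and B whenever the path similarity itself is symmetric.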

Regarding step S2502, it should be noted that, in the embodiments of the present invention, the horizontal similarity between models measures the similarity between models from the perspective of horizontal levels: an upper-level element may have multiple lower-level elements, and the relation between the lower-level elements and the upper-level element affects the similarity comparison. For example, the child elements "beginning" and "main body" under the outermost "article" element of the model shown in Fig. 3 belong to the same horizontal level.

The path similarity described above decomposes the model tree (XML tree) vertically and compares the similarity of the models longitudinally, whereas the horizontal similarity measure decomposes the models level by level and compares the degree of correlation between the two model trees. Here, the horizontal similarity is reflected by measuring the level similarity of the models.

Specifically, the level similarity of each level between the models to be compared is obtained from the node similarity of each node, and the horizontal model similarity between the models to be compared is then obtained from the level similarities.

More specifically, first let the depth of model (tree) A be h, with depth increasing from the root node toward the leaf nodes, so that the root node is at depth 1 and the deepest leaf nodes are at depth h. Decomposing model tree A horizontally yields the node set Al_i of each level, where i is the depth, an integer with 1 ≤ i ≤ h. Likewise, for model tree B the node set of level i is Bl_i. A new operation on two sets is defined first, namely the similar intersection ∩^Sim: the similar intersection of a (a node in the set Al_i) with Bl_i is the set of elements of Bl_i that are approximately equal to a according to a given similarity threshold. The similar intersection is defined as follows:

where a and b denote a node in level i of model A and of model B respectively, and k is a predefined similarity threshold.

The level similarity of model tree A and model tree B at level i is then defined as:

{HorizSim}_{i} (A, B) = \frac{2 \times | trim ({Al}_{i} \cap^{sim} {Bl}_{i}) |}{| {Al}_{i} | + | {Bl}_{i} |}
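As a non-limiting illustration, the level similarity above can be sketched in Python as follows. Several points are assumptions made for this sketch rather than statements of the method: "approximately equal" is taken to mean a node similarity of at least k, the trim operation is interpreted as keeping one-to-one matches only (implemented here with a simple greedy pass), and two empty levels are given the value 0.0. The names `horiz_sim_level` and `node_sim` are hypothetical.

```python
from typing import Callable, List, Sequence


def horiz_sim_level(
    level_a: Sequence[str],
    level_b: Sequence[str],
    node_sim: Callable[[str, str], float],
    k: float,
) -> float:
    """HorizSim_i(A, B) = 2 * |matched nodes| / (|Al_i| + |Bl_i|)."""
    total = len(level_a) + len(level_b)
    if total == 0:
        return 0.0  # assumption: the formula is undefined for two empty levels
    free_b: List[int] = list(range(len(level_b)))  # nodes of Bl_i not matched yet
    matched = 0
    for a in level_a:
        # Assumed reading of the similar intersection: b matches a
        # when node_sim(a, b) >= k.
        hit = next((j for j in free_b if node_sim(a, level_b[j]) >= k), None)
        if hit is not None:
            free_b.remove(hit)  # assumed reading of trim: each b is used once
            matched += 1
    return 2 * matched / total


if __name__ == "__main__":
    def exact(x: str, y: str) -> float:
        # Hypothetical node similarity: 1.0 for equal labels, else 0.0.
        return 1.0 if x == y else 0.0

    # One matching node out of 2 + 3 nodes: 2 * 1 / 5 = 0.4
    print(horiz_sim_level(["beginning", "paragraph"],
                          ["beginning", "body", "appendix"], exact, 0.8))
```

Under these assumptions the expression has the form of a Dice coefficient over the matched nodes of one level, so it is 1 when every node of each level has a distinct counterpart in the other.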

The weighted sum of the per-level similarities is then taken as the total level similarity of model trees A and B. In this embodiment, levels closer to the root node are treated as more important, since they better reflect the nature of the concept. Accordingly, let r be a discount factor with 0 < r ≤ 1; the total level similarity (i.e. the horizontal model similarity) is defined as:

HierSim(A, B) = \sum_{i = 1}^{h} \frac{HorizSim_{i}(A, B) \times r^{i}}{1 + r + r^{2} + \cdots + r^{h}}

With this method, the horizontal model similarity between concept tree (model) A and concept tree (model) B can be calculated. The horizontal model similarity is an important complement to the vertical model similarity and helps measure model similarity more accurately.
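As a non-limiting illustration, the discounted sum of level similarities can be sketched in Python as follows. The function name `hier_sim` and the choice to pass the per-level values as a list (index 0 holding the value for level 1) are assumptions made for this sketch; the weights and the normalizing denominator follow the formula as written.

```python
from typing import Sequence


def hier_sim(level_sims: Sequence[float], r: float) -> float:
    """HierSim(A, B) from per-level similarities HorizSim_1 .. HorizSim_h."""
    if not 0.0 < r <= 1.0:
        raise ValueError("discount factor r is expected to satisfy 0 < r <= 1")
    h = len(level_sims)
    # Denominator as written in the formula: 1 + r + r^2 + ... + r^h.
    denominator = sum(r ** j for j in range(h + 1))
    # Level i (counted from the root, starting at 1) is weighted by r^i.
    numerator = sum(s * r ** i for i, s in enumerate(level_sims, start=1))
    return numerator / denominator


if __name__ == "__main__":
    # (1.0 * 0.5 + 0.5 * 0.25) / (1 + 0.5 + 0.25) = 0.625 / 1.75
    print(hier_sim([1.0, 0.5], r=0.5))
```

One property of the formula as written is that the denominator includes the term 1 (that is, r^0) while the numerator starts at r^1, so the value stays below 1 even when every level similarity equals 1; for example, three levels with r = 1 give 3/4.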

Regarding step S2503, based on the above results, the model similarity between the models to be compared can be obtained from the vertical model similarity and/or the horizontal model similarity.

In this embodiment, the similarity between two models is preferably obtained from both the vertical similarity and the horizontal similarity between them.

It should be noted that, to make the assessment more accurate, the model similarity is assessed from both the horizontal and the vertical perspective, so the two similarities need to be combined. The similarity between models A and B is therefore finally expressed by the following formula:

ModelSim(A, B) = VerticSim(A, B) × p + HierSim(A, B) × (1 − p)

where ModelSim(A, B) denotes the model similarity between A and B, VerticSim(A, B) and HierSim(A, B) denote the vertical model similarity and the horizontal model similarity between model A and model B respectively, and p denotes the weight.
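As a non-limiting illustration, the weighted combination can be sketched in Python as follows. The function name `model_sim` is hypothetical, and restricting the weight p to the interval from 0 to 1 is an assumption of this sketch, since the text only calls p a weight.

```python
def model_sim(vertic: float, hier: float, p: float) -> float:
    """ModelSim(A, B) = VerticSim(A, B) * p + HierSim(A, B) * (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("weight p is assumed to lie in [0, 1]")
    return vertic * p + hier * (1.0 - p)


if __name__ == "__main__":
    # 0.9 * 0.6 + 0.5 * 0.4 = 0.74
    print(model_sim(vertic=0.9, hier=0.5, p=0.6))
```

With p = 1 only the vertical similarity is used and with p = 0 only the horizontal similarity, which corresponds to the "and/or" wording of step S2503.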

Because the model similarity is computed by combining the horizontal similarity and the vertical similarity, the problem of model similarity being reflected in a one-sided way is overcome, making the similarity measurement more accurate, comprehensive and objective.

Step S260: obtain the relation between the models to be compared based on the similarity between the two models.

Specifically, the model similarity is compared with a set threshold to obtain the relation between the models to be compared.

In the version comparison process, for two models A and B being compared, each element in A and B falls into one of four cases: 1) an element is contained in A but not in B; 2) an element is contained in B but not in A; 3) an element is contained in both A and B but differs between them; 4) an element is contained in both A and B and is identical.

Case 4) holds only when the element is identical in A and B, and is therefore relatively easy to determine. Cases 1) and 2) are in essence the same problem: an element exists in one model but not in the other. When files A and B contain elements Ai and Bi respectively, whether to assign Ai to case 1) and Bi to case 2), or to assign Ai and Bi together to case 3), is precisely the problem to be solved through similarity measurement.

In this step, a real-valued similarity threshold d is set, preferably with 0 < d < 1. The similarity between any element in model A and any element in model B can be calculated by the foregoing method and is stored in a similarity matrix whose rows and columns are the elements of A and B respectively. Through this similarity matrix, the similarity of any pair (Ai, Bi) can be looked up at any time. When the similarity is 1, the method assigns Ai and Bi to case 4), i.e. the element is contained in both A and B and is identical. When the similarity is greater than or equal to the threshold d but less than 1, the method assigns Ai and Bi to case 3), i.e. the element is contained in both A and B but differs between them. When the similarity is less than the threshold d, the method assigns Ai and Bi to case 1) and case 2) respectively.
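As a non-limiting illustration, the threshold-based classification can be sketched in Python as follows. The names `Relation`, `classify_pair` and `classify_matrix`, the representation of the similarity matrix as a list of rows, and the use of a floating-point tolerance when testing whether a similarity equals 1 are assumptions made for this sketch.

```python
import math
from enum import Enum
from typing import Dict, Sequence, Tuple


class Relation(Enum):
    IDENTICAL = "case 4: in both A and B, identical"
    DIFFERENT = "case 3: in both A and B, but different"
    UNMATCHED = "cases 1 and 2: Ai only in A, Bi only in B"


def classify_pair(similarity: float, d: float) -> Relation:
    """Classify one (Ai, Bi) pair from its similarity and the threshold d."""
    if not 0.0 < d < 1.0:
        raise ValueError("threshold d is expected to satisfy 0 < d < 1")
    # Assumption: a tolerance is used so rounding error does not hide identity.
    if similarity >= 1.0 or math.isclose(similarity, 1.0):
        return Relation.IDENTICAL
    if similarity >= d:
        return Relation.DIFFERENT
    return Relation.UNMATCHED


def classify_matrix(
    matrix: Sequence[Sequence[float]], d: float
) -> Dict[Tuple[int, int], Relation]:
    """Classify every pair; rows index elements of A, columns elements of B."""
    return {
        (i, j): classify_pair(s, d)
        for i, row in enumerate(matrix)
        for j, s in enumerate(row)
    }


if __name__ == "__main__":
    sims = [[1.0, 0.2], [0.75, 0.1]]
    for pair, relation in classify_matrix(sims, d=0.7).items():
        print(pair, relation.name)
```

A larger d moves more pairs from case 3) into cases 1) and 2), so fewer elements are reported as modified and more as added or removed.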

Through the above steps, this embodiment performs model comparison using similarity calculation between model elements. Compared with traditional model comparison, the present invention is not only more efficient, reduces unnecessary information and is more practical, but also makes the comparison results more intelligent and user-friendly, better meeting users' needs and providing great convenience to users.

Those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into an individual integrated circuit module, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

Although the embodiments disclosed by the present invention are as described above, the content described is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the technical field to which the present invention belongs may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be as defined by the appended claims.

Claims

1. A model comparison method based on similarity measurement, characterized by comprising:

Step 10: determining the models to be compared;

Step 20: obtaining, from said models to be compared, the nodes that make up each model;

Step 30: calculating the node similarity of each node between said models to be compared;

Step 40: calculating the model similarity between said models to be compared based on the node similarity of each node between said models to be compared;

Step 50: obtaining the relation between said models to be compared based on said model similarity.

2. The method according to claim 1, characterized in that, in said step 30,

the node similarity of each node between said models to be compared is obtained by calculating the node text similarity and the node label similarity of each node between said models to be compared.

3. The method according to claim 2, characterized in that

the node label similarity of each node between said models to be compared is obtained based on the semantic relation between the labels corresponding to the nodes.

4. The method according to claim 2, characterized in that

the node text similarity of each node between said models to be compared is obtained based on the string edit distance between the nodes.

5. The method according to claim 4, characterized in that

the node text similarity of each node between said models to be compared is obtained using the following expression:

SmaticSim (X, Y) = 1 - \frac{E (X, Y)}{\max (| X |, | Y |)}

6. The method according to any one of claims 2 to 5, characterized in that

the node similarity of each node between said models to be compared is obtained using the following expression:

NodeSim(X, Y) = ∂ LabSim(X, Y) + (1 − ∂) SmaticSim(X, Y)

where ∂ denotes the combination weight.

7. The method according to claim 1, characterized in that said step 40 specifically comprises the following steps:

8. The method according to claim 7, characterized in that, in said step 41,

the node similarity of each node is used to obtain the path similarity of each path between said models to be compared based on the longest common subsequence method and/or the stratification method.

9. The method according to claim 7 or 8, characterized in that, in said step 41,

the level similarity of each level between the models to be compared is obtained using the following expression:

{HorizSim}_{i} (A, B) = \frac{2 \times | trim ({Al}_{i} \cap^{sim} {Bl}_{i}) |}{| {Al}_{i} | + | {Bl}_{i} |}

10. The method according to claim 7, characterized in that said step 42 specifically comprises the following steps:

11. The method according to claim 10, characterized in that, in said step 421,

the vertical model similarity between the models to be compared is obtained using the following expression:

VerticSim(A, B) = \frac{\sum_{P_{1} \in P_{A}} \max_{P_{2} \in P_{B}} SimPath(P_{1}, P_{2}) + \sum_{P_{2} \in P_{B}} \max_{P_{1} \in P_{A}} SimPath(P_{1}, P_{2})}{|P_{A}| + |P_{B}|}

12. The method according to claim 10, characterized in that, in said step 422,

the horizontal model similarity between the models to be compared is obtained using the following expression:

HierSim(A, B) = \sum_{i = 1}^{h} \frac{HorizSim_{i}(A, B) \times r^{i}}{1 + r + r^{2} + \cdots + r^{h}}

13. The method according to any one of claims 7 to 12, characterized in that, in said step 50,

said model similarity is compared with a set threshold to obtain the relation between the models to be compared.