CN102722556A - Model comparison method based on similarity measurement - Google Patents


Info

Publication number
CN102722556A
Authority
CN
China
Prior art keywords
model
similarity
node
compared
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101712517A
Other languages
Chinese (zh)
Other versions
CN102722556B (en)
Inventor
覃征
赵凤飞
徐哲
王珍
徐文华
任博岩
胡浩
李金星
王瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201210171251.7A
Publication of CN102722556A
Application granted
Publication of CN102722556B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a model comparison method based on similarity measurement. The method comprises the following steps: step 10: determining the models to be compared; step 20: obtaining, from the models to be compared, the nodes that form each model; step 30: calculating the node similarity of the nodes between the models to be compared; step 40: calculating the model similarity of the models to be compared according to the node similarity of the nodes between the models to be compared; and step 50: obtaining the relation between the models to be compared on the basis of the model similarity. When calculating the node similarity, the method combines text similarity with label similarity. This overcomes the problem that only the text of a model element is considered while its label characteristics are ignored, so that the node similarity better reflects the actual situation of the models.

Description

A model comparison method based on similarity measurement
Technical field
The present invention relates to the database field of computer science, and in particular to a comparison method for semi-structured models based on similarity measurement.
Background art
Version control is the process of identifying and tracking the different versions of a system, so that versions can be distinguished, retrieved, and tracked, and the relations between versions can be shown. Version comparison is an important module of version control. Its purpose is to give the user a better understanding of the version currently in use: the current version is compared with earlier versions, and the significant differences between the two versions are presented clearly to the user.
After a long period of development, traditional version comparison tools are relatively mature. Most traditional version comparison methods are line-based, that is, they mark the lines of the compared texts that differ. For model comparison, existing methods directly match the text and structure in the models. Although the related techniques have made considerable progress, the model comparison methods used in current modeling tools remain unsatisfactory.
Current modeling tools compare two models only in a very simple way: two models are considered to match only if they are identical at the storage level, and even a subtle difference between them may change the comparison result. The models that users build, however, are often based on semantic relations or on structural relations within the model, and existing comparison tools cannot recognize these characteristics. Existing model comparison tools therefore still fall short of users' needs. Specifically, the shortcomings of current model comparison methods can be summarized as follows:
(1) When comparing models, they cannot recognize two concepts that are synonymous or semantically similar, so two models with a similar semantic relation are readily treated as different. For example, two models named "protection guided missile" and "defending missile" are the same from the user's point of view, but model comparison treats them as two different concepts.
(2) They lack an understanding of the relation between two models that come from heterogeneous data sources. Because several teams may take part in modeling, the teams may understand a given model differently and express the same model in different ways, which directly leads to differences in the structure of the generated models. For example, the book model of a library may be expressed as any of the three models shown in Fig. 1, yet to the user these three representations should be identical.
(3) They are not suited to comparing the models of current mainstream modeling tools. In general modeling tools, models are stored in files as XML, and many methods for comparing XML already exist. However, because model comparison has characteristics specific to the modeling field, these methods cannot be applied directly.
Therefore, a model comparison method based on similarity measurement is urgently needed to solve the above problems.
Summary of the invention
One technical problem to be solved by the present invention is to provide a model comparison method based on similarity measurement that makes the result of model comparison more accurate and objective.
To solve the above technical problem, the present invention provides a model comparison method based on similarity measurement, comprising: step 10, determining the models to be compared; step 20, obtaining, from the models to be compared, the nodes that form each model; step 30, calculating the node similarity of the nodes between the models to be compared; step 40, calculating the model similarity between the models to be compared based on the node similarity of the nodes between the models to be compared; step 50, obtaining the relation between the models to be compared based on the model similarity.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, in step 30, the node text similarity and the node label similarity of the nodes between the models to be compared are calculated to obtain the node similarity of the nodes between the models to be compared.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, the node label similarity of the nodes between the models to be compared is obtained based on the semantic relation between the labels corresponding to the nodes.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, the node text similarity of the nodes between the models to be compared is obtained based on the string edit distance between the nodes.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, the node text similarity of the nodes between the models to be compared is obtained using the following expression:
$$\mathrm{SmaticSim}(X, Y) = 1 - \frac{E(X, Y)}{\max(|X|, |Y|)}$$
where $|X|$ and $|Y|$ denote the lengths of the character strings of node X and node Y respectively, $E(X, Y)$ denotes the string edit distance between node X and node Y, and $\mathrm{SmaticSim}(X, Y)$ denotes the node text similarity of node X and node Y.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, the node similarity of the nodes between the models to be compared is obtained using the following expression:
$$\mathrm{NodeSim}(X, Y) = \partial\,\mathrm{LabSim}(X, Y) + (1 - \partial)\,\mathrm{SmaticSim}(X, Y)$$
where $\partial$ denotes the combination weight;
Figure BDA00001695350600034
$\mathrm{NodeSim}(X, Y)$ denotes the node similarity of node X and node Y, $\mathrm{LabSim}(X, Y)$ denotes the node label similarity between node X and node Y, and $\mathrm{SmaticSim}(X, Y)$ denotes the node text similarity between node X and node Y.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, step 40 specifically comprises the following steps:
Step 41, calculating, based on the node similarity of the nodes, the path similarity of the paths and the level similarity of the levels between the models to be compared;
Step 42, obtaining the model similarity between the models to be compared based on the path similarity of the paths and the level similarity of the levels,
where a path is the string formed by the nodes passed through from the root node to a leaf node in the tree structure of a model to be compared.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, in step 41, the path similarity of the paths between the models to be compared is obtained from the node similarity of the nodes, using the longest common subsequence method and/or the stratification method.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, in step 41, the level similarity of each level between the models to be compared is obtained using the following expression:
$$\mathrm{HorizSim}_i(A, B) = \frac{2 \times \left|\mathrm{trim}(Al_i \cap_{sim} Bl_i)\right|}{|Al_i| + |Bl_i|}$$
where $\mathrm{trim}(Al_i \cap_{sim} Bl_i)$ denotes the node set obtained by removing duplicate nodes from the set $Al_i \cap_{sim} Bl_i$, $|\mathrm{trim}(Al_i \cap_{sim} Bl_i)|$ denotes the size of that set, and $|Al_i|$ and $|Bl_i|$ denote the number of nodes in layer i of model A and model B respectively. The set $Al_i \cap_{sim} Bl_i$ is defined by the following expression:
$$(a, b) \in Al_i \cap_{sim} Bl_i \iff a \in Al_i,\ b \in Bl_i \ \text{and}\ \mathrm{NodeSim}(a, b) \ge k$$
where a and b denote a node of layer i of model A and model B respectively, and k is a preset similarity threshold. $Al_i$ is the node set of layer i of model A, where i is the depth, an integer greater than or equal to 1 and less than or equal to h; $Bl_i$ is the node set of layer i of model B; and $\mathrm{NodeSim}(a, b)$ is the node similarity of node a and node b.
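By way of illustration only, the level similarity of a single layer may be sketched in Python as follows. The function name `horiz_sim`, the greedy one-to-one pairing used here to realize trim (each node is counted at most once), the value returned for two empty layers, and the exact-match stand-in for NodeSim are all assumptions made for this sketch; the method itself does not prescribe them.

```python
def horiz_sim(layer_a, layer_b, node_sim, k):
    """Level similarity of one layer: 2 * |matched pairs| / (|Al_i| + |Bl_i|)."""
    if not layer_a and not layer_b:
        return 1.0  # assumption: two empty layers are treated as identical
    # Candidate pairs (a, b) whose node similarity reaches the threshold k.
    pairs = [(node_sim(a, b), i, j)
             for i, a in enumerate(layer_a)
             for j, b in enumerate(layer_b)]
    pairs = [p for p in pairs if p[0] >= k]
    pairs.sort(key=lambda p: p[0], reverse=True)
    # Assumed reading of trim: keep each node in at most one pair.
    used_a, used_b, matched = set(), set(), 0
    for _, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched += 1
    return 2 * matched / (len(layer_a) + len(layer_b))


def exact(a, b):
    """Hypothetical stand-in for NodeSim: 1.0 for equal names, else 0.0."""
    return 1.0 if a == b else 0.0


score = horiz_sim(["summary", "title"], ["title", "author", "summary"], exact, k=0.8)
print(score)
```

In this example two of the nodes pair up across layers of sizes 2 and 3, so the score is 2 × 2 / 5.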
According to a further aspect of the invention, in the model comparison method based on similarity measurement, step 42 specifically comprises the following steps:
Step 421, obtaining the vertical model similarity between the models to be compared based on the path similarity of the paths;
Step 422, obtaining the horizontal model similarity between the models to be compared based on the level similarity of the levels;
Step 423, obtaining the model similarity between the models to be compared based on the vertical model similarity and/or the horizontal model similarity.
According to a further aspect of the invention, in the model comparison method based on similarity measurement, in step 421, the vertical model similarity between the models to be compared is obtained using the following expression:
$$\mathrm{VerticSim}(A, B) = \frac{\sum_{P_1 \in P_A}\left[\max_{P_2 \in P_B} \mathrm{SimPath}(P_1, P_2)\right] + \sum_{P_2 \in P_B}\left[\max_{P_1 \in P_A} \mathrm{SimPath}(P_1, P_2)\right]}{|P_A| + |P_B|}$$
where $|P_A|$ and $|P_B|$ denote the number of paths contained in model A and model B to be compared respectively; $P_A$ and $P_B$ denote the sets of paths of model A and model B respectively; $P_1$ and $P_2$ denote any path in $P_A$ and $P_B$ respectively; $\mathrm{SimPath}(P_1, P_2)$ denotes the path similarity of path $P_1$ in model A and path $P_2$ in model B; and $\mathrm{VerticSim}(A, B)$ denotes the vertical model similarity of model A and model B.
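By way of illustration only, the vertical model similarity may be sketched in Python as follows. The function name `vertic_sim`, the value returned when a model has no paths, and the exact-match stand-in for SimPath are assumptions made for this sketch.

```python
def vertic_sim(paths_a, paths_b, sim_path):
    """Average, over all paths of both models, of each path's best match in the other model."""
    if not paths_a or not paths_b:
        return 0.0  # assumption: a model without paths matches nothing
    best_for_a = sum(max(sim_path(p1, p2) for p2 in paths_b) for p1 in paths_a)
    best_for_b = sum(max(sim_path(p1, p2) for p1 in paths_a) for p2 in paths_b)
    return (best_for_a + best_for_b) / (len(paths_a) + len(paths_b))


def exact_path(p1, p2):
    """Hypothetical stand-in for SimPath: 1.0 for identical paths, else 0.0."""
    return 1.0 if p1 == p2 else 0.0


model_a = ["article/beginning/summary", "article/beginning/title",
           "article/paragraph/main_body"]
model_b = ["article/beginning/summary", "article/beginning/title"]
vertical = vertic_sim(model_a, model_b, exact_path)
print(vertical)
```

Here two of the three paths of model A and both paths of model B find a perfect match, so the result is (2 + 2) / (3 + 2).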
According to a further aspect of the invention, in the model comparison method based on similarity measurement, in step 422, the horizontal model similarity between the models to be compared is obtained using the following expression:
$$\mathrm{HierSim}(A, B) = \frac{\sum_{i=1}^{h} \mathrm{HorizSim}_i(A, B) \times r^i}{1 + r + r^2 + \cdots + r^h}$$
where $\mathrm{HierSim}(A, B)$ denotes the horizontal model similarity between model A and model B, $\mathrm{HorizSim}_i(A, B)$ denotes the level similarity of layer i between model A and model B, and r is a discount factor with $0 < r \le 1$.
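By way of illustration only, the discounted combination of level similarities may be sketched in Python as follows. The function name `hier_sim` and the list layout (`level_sims[i - 1]` holds the level similarity of layer i) are assumptions made for this sketch.

```python
def hier_sim(level_sims, r):
    """Discounted sum of level similarities divided by 1 + r + ... + r^h."""
    if not 0 < r <= 1:
        raise ValueError("discount factor r must satisfy 0 < r <= 1")
    h = len(level_sims)
    numerator = sum(s * r ** i for i, s in enumerate(level_sims, start=1))
    denominator = sum(r ** i for i in range(h + 1))  # includes the r^0 = 1 term
    return numerator / denominator


horizontal = hier_sim([1.0, 0.8], r=0.5)
print(horizontal)
```

With level similarities 1.0 and 0.8 and r = 0.5, the numerator is 0.5 + 0.2 and the denominator is 1.75. Note that the expression as written includes the term 1 in the denominator while the sum in the numerator starts at $r^1$, so the sketch follows that form and even identical models score below 1.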
According to a further aspect of the invention, the model comparison method based on similarity measurement is characterized in that, in step 50, the model similarity is compared with a set threshold to obtain the relation between the models to be compared.
Compared with the prior art, one or more embodiments of the present invention can have the following advantages:
When the node similarity is calculated, text similarity is combined with label similarity. This overcomes the problem that only the text is considered while the label characteristics of model elements are ignored, so that the node similarity better reflects the actual situation of the models.
Other features and advantages of the invention will be set forth in the description that follows, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the drawings.
Description of the drawings
The drawings provide a further understanding of the present invention and form part of the description. Together with the embodiments of the invention, they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a schematic diagram of three models derived from different data sources;
Fig. 2 is a schematic flowchart of the model comparison method based on similarity measurement according to an embodiment of the invention;
Fig. 3 (a) and Fig. 3 (b) are example diagrams of the storage structure of a model and the tree structure of the model, respectively;
Fig. 4 is a schematic diagram of the composition of the similarities in the model comparison method based on similarity measurement according to an embodiment of the invention, and of the relations between them;
Fig. 5 is a schematic diagram of the composition of the node similarity in the model comparison method based on similarity measurement according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented. It should be noted that, provided no conflict arises, the embodiments of the invention and the features of the embodiments may be combined with one another, and the resulting technical solutions all fall within the protection scope of the invention.
In addition, the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 2 is a schematic flowchart of the model comparison method based on similarity measurement according to an embodiment of the invention.
It should be noted that all elements of the models in this embodiment are in Extensible Markup Language (XML) format. Besides XML, however, the method of the invention is applicable to any semi-structured model representation that can be converted into a tree structure.
In this embodiment, each model file corresponds to one XML file, and each model file contains several models, each with its own information, as shown in Fig. 1 and Fig. 3. Comparing models is therefore converted into comparing fragments of two XML files.
For ease of description, the terms used in this embodiment are explained with Fig. 3 as an example, where Fig. 3 (b) is the tree structure of a model, converted from the storage structure of the model in Fig. 3 (a):
(1) Node: includes element nodes, attribute nodes, and value nodes. For example, "article", "beginning", and "main body" in Fig. 3 are all nodes.
(2) Path of a model: a given model is first parsed into the corresponding XML tree. The string formed by the nodes passed through from the root node to a leaf node (with "/" as the separator between nodes) is called a path. For example, "article/beginning/summary" is a path of the model in Fig. 3.
The two models to be compared show that the whole abstract tree can be regarded as consisting entirely of nodes and the hierarchical relations between nodes.
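By way of illustration only, the extraction of paths from an XML fragment may be sketched in Python with the standard `xml.etree.ElementTree` module. The function name `model_paths` and the tag names in `SAMPLE` are assumptions made for this sketch (a stand-in for the model of Fig. 3), and only element nodes are traversed; attribute nodes and value nodes are left out for brevity.

```python
import xml.etree.ElementTree as ET


def model_paths(xml_text):
    """Return every root-to-leaf element path as a '/'-joined string."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(element, prefix):
        current = prefix + [element.tag]
        children = list(element)
        if not children:  # leaf element: the path is complete
            paths.append("/".join(current))
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths


# Hypothetical stand-in for the model of Fig. 3 (tag names are assumptions).
SAMPLE = """
<article>
  <beginning><summary/><title/></beginning>
  <paragraph><main_body/></paragraph>
</article>
"""

for path in model_paths(SAMPLE):
    print(path)
```

The sample yields three paths, one per leaf element, in document order.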
Fig. 4 is a schematic diagram of the composition of the similarities in the model comparison method based on similarity measurement according to an embodiment of the invention, and of the relations between them. As shown in Fig. 4, this embodiment obtains the similarity between models by calculating the horizontal similarity and/or the vertical similarity between the models. The vertical model similarity is calculated from the path similarity between the models, and the horizontal model similarity is obtained from the level similarity of the models. The node similarity in turn consists of two parts, the node label similarity and the node text similarity, as shown in Fig. 5.
The steps of this embodiment are described in detail below with reference to Fig. 2.
Step S210: determining the models to be compared, denoted model A and model B.
Step S220: obtaining, from the models to be compared, the nodes that form each model.
Step S230: calculating the node similarity of the nodes between the models to be compared.
Specifically, the node text similarity and the node label similarity of the nodes between the models to be compared are calculated to obtain the node similarity of the nodes between the models to be compared.
First, the node label similarity of the nodes between the models to be compared is obtained based on the semantic relation between the labels corresponding to the nodes.
In the embodiments of the invention, a model contains several kinds of elements (that is, nodes), so the similarity between them must be considered in order to compare models more accurately. Each model corresponds to a semi-structured document block at the bottom layer, so each label in the underlying XML document, together with the text content corresponding to that label, corresponds to one element.
Preferably, in the embodiments of the invention, the elements of a model correspond to the following nine labels: concept, attribute, complex attribute, inheritance, synonym, antonym, reference, aggregation, and custom. The relations between them are quantified according to their practical meaning, as shown in Table 1 below.
Table 1
Figure BDA00001695350600081
For example, if node 1 is "sports" with the label "concept" and node 2 is "football" with the label "reference", the table above gives a node label similarity of 0.1 between the two nodes.
Next, the node text similarity of the nodes between the models to be compared is obtained based on the string edit distance between the nodes.
The node text similarity measures the similarity of the text strings of the nodes and is realized through the string edit distance, a method for measuring the similarity between character strings. Given two character strings S and T, a sequence of deletion, insertion, and replacement operations that converts S into T is called an edit path from S to T, and the number of operations is its length. The length of the shortest edit path is called the edit distance between S and T. Here, the edit distance is determined by dynamic programming.
Let the character string of node X be $X = [x_1 x_2 \ldots x_m]$ and the character string of node Y be $Y = [y_1 y_2 \ldots y_n]$. Let $\mathrm{EDIT}(i, j)$ denote the edit distance between the prefix $[x_1 \ldots x_i]$ of X and the prefix $[y_1 \ldots y_j]$ of Y, where $i = 0$ or $j = 0$ denotes an empty prefix. Let $D(i, j)$ denote the number of operations needed to turn the i-th character $x_i$ of X into the j-th character $y_j$ of Y: if $x_i = y_j$, no operation is needed and $D(i, j) = 0$; otherwise a replacement is needed and $D(i, j) = 1$. By dynamic programming, the edit distance $E(X, Y) = \mathrm{EDIT}(m, n)$ between X and Y follows from:
$$\mathrm{EDIT}(i, j) = \begin{cases} 0, & i = 0 \text{ and } j = 0 \\ \mathrm{EDIT}(i, j-1) + 1, & i = 0 \text{ and } j > 0 \\ \mathrm{EDIT}(i-1, j) + 1, & i > 0 \text{ and } j = 0 \\ \min\{\mathrm{EDIT}(i-1, j) + 1,\ \mathrm{EDIT}(i, j-1) + 1,\ \mathrm{EDIT}(i-1, j-1) + D(i, j)\}, & i > 0 \text{ and } j > 0 \end{cases}$$
To simplify subsequent calculation, the edit distance between X and Y is normalized to the range 0 to 1, which gives the node text similarity between nodes X and Y:
$$\mathrm{SmaticSim}(X, Y) = 1 - \frac{E(X, Y)}{\max(|X|, |Y|)}$$
where $|X|$ and $|Y|$ denote the lengths of the character strings of node X and node Y respectively.
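By way of illustration only, the edit distance and the node text similarity may be sketched in Python as follows. The function names `edit_distance` and `smatic_sim` are assumptions made for this sketch. The sketch uses the standard Levenshtein boundary conditions, in which an empty prefix is at distance 0 from an empty prefix, and it treats two empty strings as identical, which the text does not specify.

```python
def edit_distance(x, y):
    """String edit distance E(X, Y) computed by dynamic programming."""
    m, n = len(x), len(y)
    # edit[i][j] = distance between the first i characters of x
    # and the first j characters of y.
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        edit[i][0] = i  # delete i characters
    for j in range(1, n + 1):
        edit[0][j] = j  # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if x[i - 1] == y[j - 1] else 1  # replacement cost D(i, j)
            edit[i][j] = min(edit[i - 1][j] + 1,
                             edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + d)
    return edit[m][n]


def smatic_sim(x, y):
    """Node text similarity: 1 - E(X, Y) / max(|X|, |Y|)."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return 1.0 - edit_distance(x, y) / longest


print(edit_distance("kitten", "sitting"))
print(smatic_sim("kitten", "sitting"))
```

For "kitten" and "sitting" the edit distance is 3 (two replacements and one insertion), so the text similarity is 1 − 3/7.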
Combining the node text similarity and the node label similarity finally gives the node similarity:
$$\mathrm{NodeSim}(X, Y) = \partial\,\mathrm{LabSim}(X, Y) + (1 - \partial)\,\mathrm{SmaticSim}(X, Y)$$
where $\partial$ denotes the combination weight, which specifies the proportions of $\mathrm{LabSim}(X, Y)$ and $\mathrm{SmaticSim}(X, Y)$ in the final combined result; $\mathrm{NodeSim}(X, Y)$ denotes the similarity of node X and node Y; $\mathrm{LabSim}(X, Y)$ denotes the node label similarity between node X and node Y; and $\mathrm{SmaticSim}(X, Y)$ denotes the node text similarity between node X and node Y.
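By way of illustration only, the weighted combination may be sketched in Python as follows. The names `LABEL_SIM`, `lab_sim`, and `node_sim` are assumptions made for this sketch. Only the concept/reference value of 0.1 is taken from the example in the text; Table 1 itself is not reproduced here, so all other label pairs would have to be supplied. Treating identical labels as fully similar and restricting the weight to the range 0 to 1 are further assumptions.

```python
# Only the concept/reference entry (0.1) comes from the text; all other
# entries of Table 1 are unknown here and must be supplied by the caller.
LABEL_SIM = {frozenset(("concept", "reference")): 0.1}


def lab_sim(label_x, label_y, table=LABEL_SIM):
    """Look up the node label similarity of two labels (order-independent)."""
    if label_x == label_y:
        return 1.0  # assumption: identical labels are fully similar
    return table[frozenset((label_x, label_y))]  # KeyError if the pair is unknown


def node_sim(label_similarity, text_similarity, weight):
    """NodeSim = weight * LabSim + (1 - weight) * SmaticSim."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")  # assumed range
    return weight * label_similarity + (1.0 - weight) * text_similarity


# Hypothetical inputs: a text similarity of 0.5 and a weight of 0.4.
example = node_sim(lab_sim("concept", "reference"), 0.5, weight=0.4)
print(example)
```

With these hypothetical inputs the result is 0.4 × 0.1 + 0.6 × 0.5.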
Step S240: calculating the path similarity of the paths between the models to be compared based on the node similarity of the nodes.
First, the XML branch corresponding to each model is parsed into a set of paths; then the similarity of corresponding paths in the two models is calculated. The more consistent corresponding nodes two paths contain, the more similar the two paths are. However, as mentioned above, nodes may be synonymous but differently named, and different people may understand the model structure inconsistently, so the condition for node consistency needs to be relaxed.
In the embodiments of the invention, the similarity between any two paths of models A and B is calculated by the following steps.
(1) The paths of the two models are preprocessed.
Specifically, using the results of the node similarity comparison, the nodes with similar semantic labels on the two paths to be compared are found, and each such pair of nodes is marked to show that they have the same semantic label.
More specifically, for two paths $P_1$ and $P_2$, let $V_{P_1}$ and $V_{P_2}$ denote a node on each of the two paths respectively. If the node similarity $\mathrm{NodeSim}(V_{P_1}, V_{P_2})$ between the two nodes reaches the preset threshold $\alpha$, where $\alpha$ is a real number between 0 and 1, then $V_{P_1}$ and $V_{P_2}$ are considered to match each other. In this process, the mutually matching nodes are marked.
(2) Using the node similarity, the path similarity of the paths between the models to be compared is obtained by the longest common subsequence method and/or the stratification method.
The calculation of the path similarity between two models by the longest common subsequence (LCS) method, hereinafter the LCS path similarity, is described in detail first.
To make paths easy to compare, the present invention uses the LCS to describe the similarity between two paths.
A subsequence of a given sequence is defined formally as follows. Given a sequence $X = \langle x_1, x_2, \ldots, x_m \rangle$, another sequence $Z = \langle z_1, z_2, \ldots, z_k \rangle$ is a subsequence of X if there exists a strictly increasing sequence of indices $\langle i_1, i_2, \ldots, i_k \rangle$ of X such that $x_{i_j} = z_j$ for all $j = 1, 2, \ldots, k$. For example, $Z = \langle B, C, D, B \rangle$ is a subsequence of $X = \langle A, B, C, B, D, A, B \rangle$.
The longest common subsequence is defined as follows: a sequence S that is a subsequence of each of two or more known sequences, and is the longest of all sequences meeting this condition, is called the longest common subsequence of the known sequences.
First, two nodes that were marked during preprocessing are treated directly as identical nodes. The two processed paths are then compared using the LCS method. The longer the longest common subsequence of the two model paths, the more the two paths overlap structurally, and hence the more similar they are. In addition, since higher-level nodes represent the structural information of a model better than lower-level nodes, the weight of each node needs to be considered during the comparison.
Therefore, the path similarity of the paths between the models to be compared can be expressed as:
$$SP_{LCS}(P_1, P_2) = \alpha \times \frac{|LCS(P_1, P_2)|}{|level(P_1)|}$$
where $SP_{LCS}(P_1, P_2)$ denotes the similarity of paths $P_1$ and $P_2$, $|level(P_1)|$ denotes the number of levels (or nodes) of path $P_1$, $|LCS(P_1, P_2)|$ denotes the number of nodes in the longest common subsequence of model paths $P_1$ and $P_2$, and $\alpha$ is the weight of the common-subsequence nodes in $LCS(P_1, P_2)$.
$\alpha$ can be expressed as:
$$\alpha = \prod_{n=1}^{|LCS(P_1, P_2)|} \frac{|level(P_1)| - level_{P_1}(LCS_n(P_1, P_2))}{|level(P_1)|}$$
where $|LCS(P_1, P_2)|$ denotes the number of nodes in the longest common subsequence of model paths $P_1$ and $P_2$, $|level(P_1)|$ denotes the number of levels (or nodes) of path $P_1$, $LCS_n(P_1, P_2)$ denotes the n-th node of the longest common subsequence of $P_1$ and $P_2$ (in top-down order along the model path), and $level_{P_1}(LCS_n(P_1, P_2))$ denotes the level of that node on path $P_1$.
Calculating the path similarity with the LCS method improves the handling of sub-paths whose node order is consistent with that of a given path, and effectively accounts for containment between paths.
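By way of illustration only, the LCS path similarity may be sketched in Python as follows. The function names `lcs_positions` and `sp_lcs`, the `match` predicate (a stand-in for the marking done in preprocessing), and the numbering of levels from 0 at the root are assumptions made for this sketch; the text does not state how levels are numbered.

```python
def lcs_positions(p1, p2, match):
    """Positions in p1 (0-based) of one longest common subsequence of p1 and p2."""
    m, n = len(p1), len(p2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if match(p1[i - 1], p2[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the matched positions.
    positions, i, j = [], m, n
    while i > 0 and j > 0:
        if match(p1[i - 1], p2[j - 1]):
            positions.append(i - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return positions[::-1]


def sp_lcs(p1, p2, match):
    """SP_LCS = alpha * |LCS| / |level(P1)|, alpha being a product of level weights."""
    levels = len(p1)
    if levels == 0:
        return 0.0
    positions = lcs_positions(p1, p2, match)
    alpha = 1.0
    for level in positions:  # assumption: the root is at level 0
        alpha *= (levels - level) / levels
    return alpha * len(positions) / levels


def same(a, b):
    """Hypothetical match predicate: nodes match when their names are equal."""
    return a == b


lcs_score = sp_lcs(["article", "beginning", "summary"], ["article", "summary"], same)
print(lcs_score)
```

In this example the common subsequence is "article" (level 0) and "summary" (level 2) on a three-level path, so the weight is (3/3) × (1/3) and the score is (1/3) × (2/3). Under this reading of the expression, a product of per-node weights shrinks quickly as the common subsequence grows, so the sketch should be read as one possible interpretation of the expression, not a definitive one.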
In addition, the path similarity between two models can also be calculated by the stratification method (hereinafter the stratification path similarity).
It should be noted that, unlike the LCS path similarity, the stratification path similarity does not require similar nodes to appear in a strictly fixed order. A node on path A may select, among all nodes of path B, the node closest to itself, and the two best-matching nodes may lie on different levels, although the closer their levels, the higher the similarity. In this step, the similarity of two paths is measured through the relative levels of the nodes. Because a path is made up of nodes, the similarity between two nodes on the two paths is measured first.
Specifically, the following processing is required:
First, the similarity between two mutually matching nodes on the paths is calculated.
Preprocessing yields several groups of mutually matching nodes on the two paths. Since the mutually matching nodes best reflect the similarity of the two paths, only these nodes need to be processed when considering the path similarity.
From the weights of the mutually matching nodes on the two paths and the correlation of their levels, the similarity between two mutually matching nodes on the paths is calculated as follows:
$$SV'(V_{P_1}, V_{P_2}) = 1 - \frac{\left|level_{P_1}(V_{P_1}) - level_{P_2}(V_{P_2})\right|}{\max(level(P_1), level(P_2))}$$
where $level(P_1)$ and $level(P_2)$ denote the number of levels of paths $P_1$ and $P_2$ respectively, and $level_{P_1}(V_{P_1})$ and $level_{P_2}(V_{P_2})$ denote the level of node $V_{P_1}$ in path $P_1$ and the level of node $V_{P_2}$ in path $P_2$ respectively.
Next, the node weights are chosen.
Nodes at higher levels of a model path usually influence the whole path more than nodes at lower levels, and reflect the structure of the whole document more accurately. Therefore, when calculating the path similarity, nodes can be given different weights to reflect their importance to the whole path.
For example, given two paths $P_1$ and $P_2$ whose mutually matching nodes on $P_1$ are, in order, $\{V_0, V_1, \ldots, V_n\}$, node $V_i$ is given the weight $\beta^{level(V_i)}$, where $0 < \beta < 1$ and $level(V_i)$ denotes the actual level of $V_i$ in $P_1$. After node weights are taken into account, the similarity $SV(V_{P_1}, V_{P_2})$ between mutually matching nodes on the two paths is finally expressed as:
$$SV(V_{P_1}, V_{P_2}) = \beta^{level_{P_1}(V_{P_1})} \times \left(1 - \frac{\left|level_{P_1}(V_{P_1}) - level_{P_2}(V_{P_2})\right|}{\max(level(P_1), level(P_2))}\right)$$
where $level_{P_1}(V_{P_1})$ denotes the level of node $V_{P_1}$ in path $P_1$.
Finally, the stratification path similarity is calculated.
Given paths $P_1$ and $P_2$, for a node $V_{P_1}$ on path $P_1$, its best matching node on $P_2$ is defined as:
Figure BDA00001695350600124
The final similarity $SP_{wl}(P_1, P_2)$ of paths $P_1$ and $P_2$ is then
$$SP_{wl}(P_1, P_2) = \frac{\sum_{V_{P_1}} Mv_{P_2}(P_1)}{|P|}$$
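By way of illustration only, the stratification path similarity may be sketched in Python as follows. The definition of the best matching node is given only as a figure that is not reproduced here, so this sketch assumes that the best match of a node on $P_1$ is the matching node on $P_2$ with the largest weighted similarity SV, that unmatched nodes contribute 0, that $|P|$ is the number of nodes on $P_1$, and that levels are numbered from 1 at the root. The function name `sp_wl` and the `match` predicate are likewise assumptions.

```python
def sp_wl(p1, p2, match, beta):
    """Sum of each P1 node's best weighted match on P2, divided by the length of P1."""
    if not p1 or not p2:
        return 0.0
    if not 0 < beta < 1:
        raise ValueError("beta must satisfy 0 < beta < 1")
    longest = max(len(p1), len(p2))
    total = 0.0
    for level1, v1 in enumerate(p1, start=1):  # assumption: the root is at level 1
        best = 0.0  # assumption: a node without any match contributes 0
        for level2, v2 in enumerate(p2, start=1):
            if match(v1, v2):
                sv = beta ** level1 * (1 - abs(level1 - level2) / longest)
                best = max(best, sv)
        total += best
    return total / len(p1)  # assumption: |P| is the number of nodes on P1


def same_name(a, b):
    """Hypothetical match predicate: nodes match when their names are equal."""
    return a == b


wl_score = sp_wl(["article", "beginning", "summary"], ["article", "summary"],
                 same_name, beta=0.5)
print(wl_score)
```

In this example "article" matches at the same level and contributes 0.5, "beginning" has no match, and "summary" matches one level apart and contributes 0.125 × (1 − 1/3), giving (1/2 + 1/12) / 3.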
Finally, the LCS method similarity and/or the stratification similarity of each path can be used as the path similarity between the models to be compared.
In the embodiments of the invention, the LCS method similarity and the stratification similarity of the paths between the models to be compared are preferably combined to obtain the similarity between two model paths.
The final similarity of two model paths can be expressed as:
$$\mathrm{SimPath}(P_1, P_2) = \gamma\, SP_{LCS}(P_1, P_2) + (1 - \gamma)\, SP_{wl}(P_1, P_2)$$
where $\gamma$ and $1 - \gamma$ denote the proportions of $SP_{LCS}(P_1, P_2)$ and $SP_{wl}(P_1, P_2)$ respectively.
In the above steps, $SP_{LCS}(P_1, P_2)$ and $SP_{wl}(P_1, P_2)$ denote the LCS method similarity and the stratification similarity respectively. Each has its strengths and weaknesses, and the two are complementary: $SP_{LCS}(P_1, P_2)$ effectively accounts for containment between paths but does not handle the heterogeneity of heterogeneous data sources, whereas $SP_{wl}(P_1, P_2)$ effectively handles misaligned levels in the model hierarchy but cannot account for containment between two models. This step therefore combines the similarities obtained by the two methods, giving each a weight, to obtain the final path similarity value.
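By way of illustration only, the weighted combination of the two path similarities may be sketched in Python as follows. The function name `sim_path` and the input values are assumptions made for this sketch, and the weight is assumed to lie between 0 and 1.

```python
def sim_path(sp_lcs_value, sp_wl_value, gamma):
    """SimPath = gamma * SP_LCS + (1 - gamma) * SP_wl."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")  # assumed range
    return gamma * sp_lcs_value + (1.0 - gamma) * sp_wl_value


# Hypothetical values: SP_LCS = 0.8, SP_wl = 0.5, gamma = 0.6.
combined = sim_path(0.8, 0.5, gamma=0.6)
print(combined)
```

With these hypothetical values the result is 0.6 × 0.8 + 0.4 × 0.5. A larger $\gamma$ favors the order-sensitive LCS measure, and a smaller $\gamma$ favors the level-based measure.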
Step S250: obtain the model similarity between the models to be compared based on the path similarity of each path and the level similarity of each level.
Specifically, this step comprises the following substeps:
Step S2501: obtain the model vertical similarity between the models to be compared based on the path similarity of each path;
Step S2502: obtain the model horizontal similarity between the models to be compared based on the level similarity of each level;
Step S2503: obtain the model similarity between the models to be compared based on the model vertical similarity and/or the model horizontal similarity.
Regarding step S2501, note that in embodiments of the present invention the vertical similarity between models measures their similarity along the vertical direction, that is, the direction running step by step from the root node of a model to its terminal nodes. Each such progression from the root node to a terminal node can be represented by a path. A model element in the present invention can contain multiple child elements, so measuring the vertical similarity of a model involves multiple paths. For example, the model in Fig. 3 yields three paths: article/beginning/summary; article/beginning/title; article/paragraph/main body.
The model vertical similarity concerns the hierarchical relationships among the node elements of a model, and the paths of the model are the most direct embodiment of both the hierarchy and the relationships between nodes. The similarities between the paths of models A and B are therefore combined to obtain the vertical similarity between A and B.
Specifically, given model A and model B, let P_A and P_B denote the sets of paths in A and B, respectively. Each path in model A is compared with every path in model B, and the maximum similarity value is taken as the similarity of that path. The same operation is performed for each path in model B. Finally, the similarities of all paths in A and B are summed and averaged to give the vertical similarity between model A and model B, as shown in the following formula:
$$VerticSim(A, B) = \frac{\sum_{P_1 \in P_A} \left[ \max_{P_2 \in P_B} SimPath(P_1, P_2) \right] + \sum_{P_2 \in P_B} \left[ \max_{P_1 \in P_A} SimPath(P_1, P_2) \right]}{|P_A| + |P_B|}$$
where |P_A| and |P_B| denote the numbers of paths contained in model A and model B, respectively; P_A and P_B denote the sets of paths in models A and B to be compared; P_1 and P_2 denote any path in P_A and P_B, respectively; SimPath(P_1, P_2) denotes the path similarity between path P_1 in model A and path P_2 in model B; and VerticSim(A, B) denotes the model vertical similarity of model A and model B.
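The best-match averaging above can be sketched in Python as follows. The names `vertical_similarity` and `toy_sim_path` are hypothetical, and `toy_sim_path` is only a stand-in for SimPath (it counts positions with equal labels), not the LCS/stratified measure of the description. Returning 0.0 when either model has no paths is also an assumption.

```python
from typing import Callable, Sequence

Path = Sequence[str]


def vertical_similarity(paths_a: Sequence[Path], paths_b: Sequence[Path],
                        sim_path: Callable[[Path, Path], float]) -> float:
    """Average, over all paths of both models, of each path's best match in the other model."""
    if not paths_a or not paths_b:
        return 0.0  # assumption: no paths on one side means nothing to compare
    best_for_a = sum(max(sim_path(p1, p2) for p2 in paths_b) for p1 in paths_a)
    best_for_b = sum(max(sim_path(p1, p2) for p1 in paths_a) for p2 in paths_b)
    return (best_for_a + best_for_b) / (len(paths_a) + len(paths_b))


def toy_sim_path(p1: Path, p2: Path) -> float:
    """Stand-in path similarity: share of positions holding the same label."""
    same = sum(1 for x, y in zip(p1, p2) if x == y)
    return same / max(len(p1), len(p2))


paths_a = [("article", "beginning", "summary"),
           ("article", "beginning", "title"),
           ("article", "paragraph", "main body")]
paths_b = [("article", "beginning", "title"),
           ("article", "paragraph", "body")]
print(vertical_similarity(paths_a, paths_b, toy_sim_path))  # approximately 0.8
```

Because the maximum is taken in both directions, the measure is symmetric in A and B, and a model with many unmatched paths is penalized through the denominator.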
Regarding step S2502, note that in embodiments of the present invention the horizontal similarity between models measures their similarity level by level. An upper-level element may have several lower-level elements, and the relationship between lower-level and upper-level elements affects the similarity comparison. In Fig. 3, for example, the "beginning" and "main body" child elements under the outermost "article" model belong to the same horizontal level.
The path similarity described above decomposes the model tree (XML tree) vertically and compares the models along the vertical direction, whereas the horizontal similarity decomposes the models level by level and compares the degree of correlation between the two model trees. Here, the horizontal similarity is reflected by measuring the level similarity of the models.
Specifically, the level similarity of each level between the models to be compared is obtained from the node similarities of the nodes, and the model horizontal similarity between the models is then obtained from the level similarities.
More specifically, first let the depth of model (tree) A be h, with depth increasing from the root node to the leaf nodes, that is, the height of a leaf node is h and the height of the root node is 1. Decomposing model tree A horizontally gives the node set Al_i for each level, where the depth i is an integer with 1 ≤ i ≤ h. Likewise, for model tree B, the node set of level i is Bl_i. A new operation on two sets is first defined, the similar intersection ∩_sim: for a node a in set Al_i, the similar intersection of a with Bl_i is the set of elements of Bl_i that are approximately equal to a under a given similarity threshold. The similar intersection is defined as follows:
$$(a, b) \in Al_i \cap_{sim} Bl_i \iff a \in Al_i,\ b \in Bl_i \ \text{and}\ NodeSim(a, b) \ge k$$
where a and b denote a node in level i of model A and of model B, respectively, and k is a predefined similarity threshold.
The level similarity of model tree A and model tree B at level i is then defined as:
$$HorizSim_i(A, B) = \frac{2 \times \left| trim(Al_i \cap_{sim} Bl_i) \right|}{|Al_i| + |Bl_i|}$$
where trim(Al_i ∩_sim Bl_i) denotes the set Al_i ∩_sim Bl_i with repeated elements removed, |trim(Al_i ∩_sim Bl_i)| denotes the size of that set, and |Al_i| and |Bl_i| denote the numbers of nodes in level i of model A and model B, respectively.
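The following Python sketch shows one way to compute the level similarity. The text does not spell out how trim removes repeated elements, so the sketch assumes a greedy one-to-one matching in which each node takes part in at most one pair; that interpretation, the function name `level_similarity`, the default k = 0.8, and the 0.0 result for two empty levels are all assumptions.

```python
from typing import Callable, Sequence


def level_similarity(level_a: Sequence[str], level_b: Sequence[str],
                     node_sim: Callable[[str, str], float],
                     k: float = 0.8) -> float:
    """HorizSim_i: twice the number of matched pairs over the total node count."""
    total = len(level_a) + len(level_b)
    if total == 0:
        return 0.0  # assumption: two empty levels contribute nothing
    # Similar intersection: every cross pair whose node similarity reaches k.
    candidates = [(node_sim(a, b), i, j)
                  for i, a in enumerate(level_a)
                  for j, b in enumerate(level_b)]
    candidates = [c for c in candidates if c[0] >= k]
    # Assumed reading of trim: keep the best pairs so that no node is used twice.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_a, used_b = set(), set()
    matched = 0
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched += 1
    return 2 * matched / total


def exact(a: str, b: str) -> float:
    """Stand-in node similarity: 1.0 for equal labels, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(level_similarity(["beginning", "paragraph"],
                       ["beginning", "main body", "appendix"], exact))
```

With one-to-one matching the number of matched pairs never exceeds the smaller level, so the result stays within [0, 1].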
The weighted sum of the level similarities is then taken as the total level similarity of model trees A and B. This embodiment assumes that levels closer to the root node are more important and better reflect the nature of the concept. Therefore, with r as a discount factor, 0 < r ≤ 1, the total level similarity (that is, the model horizontal similarity) is defined as:
$$HierSim(A, B) = \frac{\sum_{i=1}^{h} HorizSim_i(A, B) \times r^i}{1 + r + r^2 + \cdots + r^h}$$
In this way, the model horizontal similarity between concept tree (model) A and concept tree (model) B can be calculated. The model horizontal similarity is an important complement to the model vertical similarity and helps measure model similarity more effectively.
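The discounted sum can be sketched in Python as follows. The function name `hierarchical_similarity`, the default r = 0.5, and the 0.0 result for an empty list are illustrative assumptions; the per-level scores are taken as already computed, with `level_sims[i - 1]` holding HorizSim_i.

```python
from typing import Sequence


def hierarchical_similarity(level_sims: Sequence[float], r: float = 0.5) -> float:
    """HierSim: level similarities discounted by r**i, normalized as in the printed formula."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must satisfy 0 < r <= 1")
    h = len(level_sims)
    if h == 0:
        return 0.0  # assumption: no levels means nothing to compare
    numerator = sum(s * r ** i for i, s in enumerate(level_sims, start=1))
    # Denominator follows the formula as printed: 1 + r + r^2 + ... + r^h.
    denominator = sum(r ** i for i in range(h + 1))
    return numerator / denominator


print(hierarchical_similarity([1.0, 0.5, 0.0], r=0.5))
```

Since the printed denominator starts at 1 while the numerator starts at r, two trees whose levels all score 1 still yield a value below 1; the sketch follows the printed formula rather than renormalizing.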
Regarding step S2503, based on the above results, the model similarity between the models to be compared can be obtained from the model vertical similarity and/or the model horizontal similarity.
In this embodiment, the similarity between two models is preferably obtained from both the vertical similarity and the horizontal similarity between them.
Note that, to make the assessment more accurate, model similarity is assessed from both the horizontal and the vertical aspects, so the two similarities need to be combined. The similarity between models A and B is therefore finally expressed as:
ModelSim(A,B)=VerticSim(A,B)×p+HierSim(A,B)×(1-p)
where ModelSim(A, B) denotes the model similarity between A and B, VerticSim(A, B) and HierSim(A, B) denote the model vertical similarity and the model horizontal similarity between model A and model B, respectively, and p denotes the weight.
Because the model similarity is computed by combining the horizontal similarity with the vertical similarity, the problem of a one-sided reflection of model similarity is overcome, and the similarity measurement becomes more accurate, comprehensive, and objective.
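As a small worked illustration, the ModelSim combination can be written in Python as follows. The function name `model_similarity` and the default p = 0.5 are assumptions; the vertical and horizontal scores are taken as already computed.

```python
def model_similarity(vertic_sim: float, hier_sim: float, p: float = 0.5) -> float:
    """ModelSim: vertical similarity weighted by p, horizontal similarity by 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return vertic_sim * p + hier_sim * (1.0 - p)


# Example: paths agree well (0.8) but levels agree less (0.4), weighted 3:1.
print(model_similarity(0.8, 0.4, p=0.75))
```

Setting p to 1 or 0 reduces the measure to the vertical or the horizontal similarity alone, which corresponds to the "and/or" wording of step S2503.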
Step S260: obtain the relation between the models to be compared based on the similarity between the two models.
Specifically, the model similarity is compared with a set threshold to obtain the relation between the models to be compared.
In the version comparison process, for two models A and B being compared, each element of A and B falls into one of four cases: 1) A contains the element and B does not; 2) B contains the element and A does not; 3) both A and B contain the element, but it differs between A and B; 4) both A and B contain the element, and it is identical.
Case 4) holds only when the element is identical in A and B, so it is relatively easy to judge. Cases 1) and 2) are in fact the same kind of problem: an element exists in one model but not in the other. When files A and B contain elements Ai and Bi, respectively, the question that similarity measurement answers is whether Ai should be assigned to case 1) and Bi to case 2), or whether Ai and Bi should together be assigned to case 3).
In this step, a real-valued similarity threshold d is set, preferably with 0 < d < 1. The similarity between any element of model A and any element of model B can be calculated by the foregoing method and stored in a similarity matrix whose rows and columns correspond to the elements of A and B, respectively. With this similarity matrix, the similarity of any pair (Ai, Bi) can be looked up at any time. When the similarity is 1, the method assigns Ai and Bi to case 4), that is, both A and B contain the element and it is identical. When the similarity is greater than or equal to the threshold d but less than 1, the method assigns Ai and Bi to case 3), that is, both A and B contain the element but it differs between A and B. When the similarity is less than the threshold d, the method assigns Ai to case 1) and Bi to case 2).
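The threshold rule can be sketched in Python as follows. The names `Relation`, `classify_pair`, and `classify_matrix`, the default d = 0.6, and the dictionary-of-dictionaries layout of the similarity matrix are assumptions chosen for illustration.

```python
from enum import Enum
from typing import Dict, Tuple


class Relation(Enum):
    IDENTICAL = "case 4: in both models and identical"
    MODIFIED = "case 3: in both models but different"
    UNMATCHED = "cases 1 and 2: each element exists in only one model"


def classify_pair(similarity: float, d: float = 0.6) -> Relation:
    """Map the similarity of a pair (Ai, Bi) to one of the cases described above."""
    if not 0.0 < d < 1.0:
        raise ValueError("d must satisfy 0 < d < 1")
    if similarity >= 1.0:
        return Relation.IDENTICAL
    if similarity >= d:
        return Relation.MODIFIED
    return Relation.UNMATCHED


def classify_matrix(matrix: Dict[str, Dict[str, float]],
                    d: float = 0.6) -> Dict[Tuple[str, str], Relation]:
    """Classify every (row element of A, column element of B) pair of the matrix."""
    return {(a, b): classify_pair(s, d)
            for a, row in matrix.items()
            for b, s in row.items()}


similarity_matrix = {"title": {"title": 1.0, "heading": 0.7},
                     "summary": {"title": 0.1, "heading": 0.2}}
for pair, relation in classify_matrix(similarity_matrix).items():
    print(pair, relation.name)
```

Raising d makes the comparison stricter, so more element pairs are reported as present in only one model rather than as modified.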
Through the above steps, this embodiment performs model comparison using similarity calculations between model elements. Applied to conventional model comparison, the present invention improves efficiency, reduces unnecessary information, and enhances practicality; it also makes the comparison results more intelligent and user-friendly, better meeting user needs and making the method considerably more convenient for users.
Those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
Although the embodiments disclosed by the present invention are as described above, they are adopted only to facilitate understanding of the present invention and are not intended to limit it. Any person skilled in the technical field of the present invention may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be defined by the appended claims.

Claims (13)

1. A model comparison method based on similarity measurement, characterized by comprising:
Step 10: determining the models to be compared;
Step 20: obtaining, from the models to be compared, the nodes that make up each model;
Step 30: calculating the node similarity of each node between the models to be compared;
Step 40: calculating the model similarity between the models to be compared based on the node similarity of each node between the models to be compared;
Step 50: obtaining the relation between the models to be compared based on the model similarity.
2. The method according to claim 1, characterized in that, in step 30,
the node similarity of each node between the models to be compared is obtained by calculating the node text similarity and the node label similarity of each node between the models to be compared.
3. The method according to claim 2, characterized in that
the node label similarity of each node between the models to be compared is obtained based on the semantic relation between the labels corresponding to the nodes.
4. The method according to claim 2, characterized in that
the node text similarity of each node between the models to be compared is obtained based on the string edit distance between the nodes.
5. The method according to claim 4, characterized in that
the node text similarity of each node between the models to be compared is obtained using the following expression:
$$SmaticSim(X, Y) = 1 - \frac{E(X, Y)}{\max(|X|, |Y|)}$$
where |X| and |Y| denote the lengths of the character strings of node X and node Y, respectively, E(X, Y) denotes the string edit distance between node X and node Y, and SmaticSim(X, Y) denotes the node text similarity of node X and node Y.
6. The method according to any one of claims 2 to 5, characterized in that
the node similarity of each node between the models to be compared is obtained using the following expression:
$$NodeSim(X, Y) = \partial \, LabSim(X, Y) + (1 - \partial) \, SmaticSim(X, Y)$$
where ∂ denotes the combination weight;
Figure FDA00001695350500023
NodeSim(X, Y) denotes the node similarity of node X and node Y, LabSim(X, Y) denotes the node label similarity between node X and node Y, and SmaticSim(X, Y) denotes the node text similarity between node X and node Y.
7. The method according to claim 1, characterized in that step 40 specifically comprises the following steps:
Step 41: calculating, based on the node similarity of each node, the path similarity of each path and the level similarity of each level between the models to be compared;
Step 42: obtaining the model similarity between the models to be compared based on the path similarity of each path and the level similarity of each level,
wherein a path is the string of nodes passed through from the root node to a leaf node in the tree structure of a model to be compared.
8. The method according to claim 7, characterized in that, in step 41,
the path similarity of each path between the models to be compared is obtained from the node similarity of each node based on the longest common subsequence method and/or the stratification method.
9. The method according to claim 7 or 8, characterized in that, in step 41,
the level similarity of each level between the models to be compared is obtained using the following expression:
$$HorizSim_i(A, B) = \frac{2 \times \left| trim(Al_i \cap_{sim} Bl_i) \right|}{|Al_i| + |Bl_i|}$$
where trim(Al_i ∩_sim Bl_i) denotes the node set obtained by removing duplicate nodes from the set Al_i ∩_sim Bl_i, |trim(Al_i ∩_sim Bl_i)| denotes the size of that set, and |Al_i| and |Bl_i| denote the numbers of nodes in level i of model A and model B, respectively, the set Al_i ∩_sim Bl_i being defined by the following expression:
$$(a, b) \in Al_i \cap_{sim} Bl_i \iff a \in Al_i,\ b \in Bl_i \ \text{and}\ NodeSim(a, b) \ge k$$
where a and b denote a node in level i of model A and of model B, respectively, and k is a preset similarity threshold; Al_i is the node set of level i of model A, where i is the depth and is an integer with 1 ≤ i ≤ h; Bl_i is the node set of level i of model B; and NodeSim(a, b) is the node similarity of node a and node b.
10. The method according to claim 7, characterized in that step 42 specifically comprises the following steps:
Step 421: obtaining the model vertical similarity between the models to be compared based on the path similarity of each path;
Step 422: obtaining the model horizontal similarity between the models to be compared based on the level similarity of each level;
Step 423: obtaining the model similarity between the models to be compared based on the model vertical similarity and/or the model horizontal similarity.
11. The method according to claim 10, characterized in that, in step 421,
the model vertical similarity between the models to be compared is obtained using the following expression:
$$VerticSim(A, B) = \frac{\sum_{P_1 \in P_A} \left[ \max_{P_2 \in P_B} SimPath(P_1, P_2) \right] + \sum_{P_2 \in P_B} \left[ \max_{P_1 \in P_A} SimPath(P_1, P_2) \right]}{|P_A| + |P_B|}$$
where |P_A| and |P_B| denote the numbers of paths contained in models A and B to be compared, respectively; P_A and P_B denote the sets of paths in models A and B to be compared; P_1 and P_2 denote any path in P_A and P_B, respectively; SimPath(P_1, P_2) denotes the path similarity between path P_1 in model A and path P_2 in model B; and VerticSim(A, B) denotes the model vertical similarity of model A and model B.
12. The method according to claim 10, characterized in that, in step 422,
the model horizontal similarity between the models to be compared is obtained using the following expression:
$$HierSim(A, B) = \frac{\sum_{i=1}^{h} HorizSim_i(A, B) \times r^i}{1 + r + r^2 + \cdots + r^h}$$
where HierSim(A, B) denotes the model horizontal similarity between model A and model B, HorizSim_i(A, B) denotes the level similarity of each level between model A and model B, and r is a discount factor with 0 < r ≤ 1.
13. The method according to any one of claims 7 to 12, characterized in that, in step 50,
the model similarity is compared with a set threshold to obtain the relation between the models to be compared.
CN201210171251.7A 2012-05-29 2012-05-29 Model comparison method based on similarity measurement Active CN102722556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210171251.7A CN102722556B (en) 2012-05-29 2012-05-29 Model comparison method based on similarity measurement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210171251.7A CN102722556B (en) 2012-05-29 2012-05-29 Model comparison method based on similarity measurement

Publications (2)

Publication Number Publication Date
CN102722556A true CN102722556A (en) 2012-10-10
CN102722556B CN102722556B (en) 2014-10-22

Family

ID=46948317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210171251.7A Active CN102722556B (en) 2012-05-29 2012-05-29 Model comparison method based on similarity measurement

Country Status (1)

Country Link
CN (1) CN102722556B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104305957A (en) * 2014-08-28 2015-01-28 中国科学院自动化研究所 Head-wearing molecule image navigation system
CN104750775A (en) * 2013-12-24 2015-07-01 Tcl集团股份有限公司 Content alignment method and system
CN105184387A (en) * 2015-07-23 2015-12-23 北京理工大学 Path similarity comparison method
CN105488084A (en) * 2014-12-24 2016-04-13 哈尔滨安天科技股份有限公司 Tree isomorphism based software installation package classification method and system
CN109582759A (en) * 2018-11-15 2019-04-05 中电科大数据研究院有限公司 A method of measuring official document similitude
CN109597913A (en) * 2018-11-05 2019-04-09 东软集团股份有限公司 The method for being aligned document picture, device, storage medium and electronic equipment
CN110225007A (en) * 2019-05-27 2019-09-10 国家计算机网络与信息安全管理中心 The clustering method of webshell data on flows and controller and medium
CN111307194A (en) * 2020-01-21 2020-06-19 中南民族大学 Beidou-based environmental equipment detection method, device, equipment and storage medium
TWI707243B (en) * 2015-11-30 2020-10-11 大陸商中國銀聯股份有限公司 Method, apparatus, and system for detecting living body based on eyeball tracking
CN111985519A (en) * 2019-05-21 2020-11-24 创新先进技术有限公司 Text similarity quantification method, equipment and system
CN113168544A (en) * 2018-12-19 2021-07-23 西门子股份公司 Method and system for providing services for complex industrial systems
CN115378824A (en) * 2022-08-24 2022-11-22 中国联合网络通信集团有限公司 Model similarity determination method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101876995A (en) * 2009-12-18 2010-11-03 南开大学 Method for calculating similarity of XML documents
CN101930462A (en) * 2010-08-20 2010-12-29 华中科技大学 Comprehensive body similarity detection method

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750775A (en) * 2013-12-24 2015-07-01 Tcl集团股份有限公司 Content alignment method and system
CN104750775B (en) * 2013-12-24 2019-03-19 Tcl集团股份有限公司 Content comparison method and system
CN104305957A (en) * 2014-08-28 2015-01-28 中国科学院自动化研究所 Head-wearing molecule image navigation system
CN105488084A (en) * 2014-12-24 2016-04-13 哈尔滨安天科技股份有限公司 Tree isomorphism based software installation package classification method and system
CN105184387A (en) * 2015-07-23 2015-12-23 北京理工大学 Path similarity comparison method
TWI707243B (en) * 2015-11-30 2020-10-11 大陸商中國銀聯股份有限公司 Method, apparatus, and system for detecting living body based on eyeball tracking
CN109597913A (en) * 2018-11-05 2019-04-09 东软集团股份有限公司 The method for being aligned document picture, device, storage medium and electronic equipment
CN109582759A (en) * 2018-11-15 2019-04-05 中电科大数据研究院有限公司 A method of measuring official document similitude
CN109582759B (en) * 2018-11-15 2021-10-22 中电科大数据研究院有限公司 Method for measuring similarity of documents
CN113168544A (en) * 2018-12-19 2021-07-23 西门子股份公司 Method and system for providing services for complex industrial systems
CN111985519A (en) * 2019-05-21 2020-11-24 创新先进技术有限公司 Text similarity quantification method, equipment and system
US11210553B2 (en) 2019-05-21 2021-12-28 Advanced New Technologies Co., Ltd. Methods and devices for quantifying text similarity
CN110225007A (en) * 2019-05-27 2019-09-10 国家计算机网络与信息安全管理中心 The clustering method of webshell data on flows and controller and medium
CN111307194B (en) * 2020-01-21 2020-12-25 中南民族大学 Beidou-based environmental equipment detection method, device, equipment and storage medium
CN111307194A (en) * 2020-01-21 2020-06-19 中南民族大学 Beidou-based environmental equipment detection method, device, equipment and storage medium
CN115378824A (en) * 2022-08-24 2022-11-22 中国联合网络通信集团有限公司 Model similarity determination method, device, equipment and storage medium
CN115378824B (en) * 2022-08-24 2023-07-14 中国联合网络通信集团有限公司 Model similarity determination method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN102722556B (en) 2014-10-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant