CN105183832A

CN105183832A - Data similarity analysis method

Info

Publication number: CN105183832A
Application number: CN201510545966.8A
Authority: CN
Inventors: 唐雪飞; 陈科
Original assignee: CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Current assignee: CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Priority date: 2015-08-31
Filing date: 2015-08-31
Publication date: 2015-12-23

Abstract

The invention discloses a data similarity analysis method. The method comprises the following steps: S1, setting a scoring policy; S2, establishing a scoring matrix; S3, filling the scoring matrix; and S4, tracing back through the scoring matrix to obtain a comparison result of two sets of data. The method introduces a scoring policy and calculates a score according to that policy; the score quantitatively determines the similarity between the data, so that similarity comparison can also be performed for inexact matches.

Description

A data similarity analysis method

Technical field

The invention belongs to the field of computer business intelligence, and specifically relates to the design of a data similarity analysis method.

Background technology

With the development of information technology, many IT fields generate large amounts of data every day. Many business environments contain several different databases, each storing a large amount of data, and these data often exhibit a certain degree of similarity. For example, in the business data of a university, both the educational administration system and the student work system store large amounts of student-related data.

In such circumstances, it is often important to know how similar these different data sources are, in order to analyze the redundancy among them. The data are generally stored in different formats, typically numeric, character and date types. As is well known, a computer stores all of these data in binary form, and the different formats can further be converted into character strings. Techniques for comparing data similarity can therefore be reduced to techniques for comparing character strings. The main string comparison techniques are:

1. Simple comparison method

The two strings are compared directly, position by position, and their degree of similarity is determined by whether the characters at corresponding positions are identical. Because the strings must be traversed in full, the comparison is relatively time-consuming. The advantage of the simple comparison method is that it is easy to implement and can be used when the data volume is small; for large data volumes, this primitive comparison method is generally not used.
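By way of illustration only, the position-by-position comparison described above can be sketched in Python as follows. The function name `positional_similarity` and the choice to normalize by the length of the longer string are assumptions made for this sketch; the text above does not prescribe a particular normalization.

```python
def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions at which a and b hold the same character.

    Assumption for this sketch: positions beyond the end of the shorter
    string count as mismatches, so the result is divided by the longer length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    # zip stops at the end of the shorter string
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / longest
```

Note that for "AABBCC" and "AACC" only the first two positions agree, even though the two strings also share the tail "CC"; this is the limitation that motivates the alignment-based method of the invention.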

2. BM method

The BM method is short for the Boyer-Moore method of exact string matching. Its time complexity is relatively low, and it is currently one of the more widely used methods.

The exact string matching problem is to find, in a text T, all substrings that exactly match a query pattern P. The BM algorithm relies on three ingenious and effective techniques: right-to-left scanning, the bad character rule, and the good suffix rule.

Right-to-left scanning means that matching proceeds backwards from the last character of the pattern, rather than forwards from the first character as is customary.

The bad character rule applies when, during the right-to-left scan, Ti and Pj are found to differ. If P contains a character Pk that is identical to Ti, with k<j, then P is shifted to the right so that Pk is aligned with Ti, and matching restarts from right to left. If P contains no character identical to Ti, the first character of P is aligned directly with the character following Ti, and comparison restarts from right to left.

The good suffix rule applies when, during the right-to-left scan, Ti and Pj are found to differ. The method then checks whether the part t that has already been matched occurs at another position t' in P.

(1) If t' occurs and the letters preceding t and t' are different, P is shifted to the right so that t' is aligned with t in T.

(2) If t' does not occur, the longest prefix x of P that is identical to a suffix of t is found, and P is shifted to the right so that x is aligned with that suffix of t in T.
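By way of illustration only, right-to-left scanning combined with the bad character rule can be sketched in Python as follows. This is a deliberately simplified variant: the good suffix rule is omitted, so the sketch is not a full Boyer-Moore implementation. The function name `bm_bad_char_search` and the 0-based indexing are assumptions made for this sketch.

```python
from typing import List


def bm_bad_char_search(text: str, pattern: str) -> List[int]:
    """Return the 0-based start indices of all exact occurrences of pattern.

    Simplification: only the bad character rule is used for shifting.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index of every character that occurs in the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    hits = []
    s = 0  # current alignment: pattern[0] is placed under text[s]
    while s <= n - m:
        j = m - 1
        # Scan from right to left while the characters agree.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Bad character rule: align the rightmost occurrence of the
            # mismatched text character with it; if the character does not
            # occur in the pattern (k == -1), jump past it entirely.
            k = last.get(text[s + j], -1)
            s += max(1, j - k)
    return hits
```

For the strings used later in this background section, `bm_bad_char_search("AABBCC", "BC")` locates the single occurrence at index 3, while `bm_bad_char_search("AABBCC", "AACC")` returns an empty list because no exact occurrence exists.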

3. KMP method

The KMP algorithm is an improved string matching algorithm, discovered jointly by D. E. Knuth, J. H. Morris and V. R. Pratt. The key to the KMP algorithm is to use the information obtained from a failed match to reduce, as far as possible, the number of comparisons between the pattern string and the main string, and thereby achieve fast matching.

When comparing strings T (of length n) and W (of length m), the KMP method checks whether T[i] and W[j] match. If T[i] = W[j], it continues by checking whether T[i+1] and W[j+1] match. If T[i] ≠ W[j], there are two cases: if j = 1, the pattern string is shifted right by one position and T[i+1] is checked against W[1]; if 1 < j ≤ m, the pattern string is shifted right by j - next(j) positions and T[i] is checked against W[next(j)]. This process is repeated until j = m or i = n.
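By way of illustration only, the KMP procedure can be sketched in Python as follows. The sketch uses 0-based indices and a failure table `fail`, which plays the role of next(j) in the 1-based description above; this re-indexing, together with the names `build_failure` and `kmp_search`, is an assumption made for the sketch.

```python
from typing import List


def build_failure(pattern: str) -> List[int]:
    """fail[j] = length of the longest proper prefix of pattern[:j + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text: str, pattern: str) -> List[int]:
    """Return the 0-based start indices of all exact occurrences of pattern."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]  # reuse what is known from the failed match
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = fail[j - 1]  # keep going to find overlapping occurrences
    return hits
```

The text index `i` never moves backwards; only the pattern index falls back through the failure table, which is what reduces the number of comparisons.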

Besides the three methods above, there are other string comparison methods that can be used for matching data, but they share a common shortcoming: they can only perform exact comparison. For example, when AABBCC is compared with BC, the match point is easily found. For an inexact match, however, such as comparing AABBCC with AACC, these exact comparison methods cannot produce a similarity result, even though the two strings are identical at both head and tail and can reasonably be regarded as similar.
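By way of illustration only, the limitation described above can be observed with Python's built-in exact substring search `str.find`. The variable names are assumptions made for this sketch.

```python
text = "AABBCC"
exact_query = "BC"
inexact_query = "AACC"

# str.find returns the index of the first exact occurrence, or -1 if none.
found_exact = text.find(exact_query)
found_inexact = text.find(inexact_query)

# The two strings nevertheless share the same two-character head and tail.
shared_head = text[:2] == inexact_query[:2]
shared_tail = text[-2:] == inexact_query[-2:]
```

Here `found_exact` is 3 and `found_inexact` is -1: exact search reports only that AACC does not occur in AABBCC and gives no measure of how close the two strings are.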

Summary of the invention

The object of the invention is to solve the problem that string comparison methods in the prior art can only perform exact comparison and cannot produce a similarity result for inexact matches. To this end, a data similarity analysis method is proposed.

The technical solution of the invention is a data similarity analysis method comprising the following steps:

S1. setting a scoring policy;

S2. constructing a scoring matrix;

S3. filling the scoring matrix;

S4. tracing back through the scoring matrix to obtain the comparison result of the two groups of data.

Further, step S1 is specifically as follows: suppose the two groups of data to be compared are S = s₁s₂…sₙ and T = t₁t₂…tₘ, of lengths n and m respectively. S′ and T′ are obtained by inserting gaps "-" at appropriate positions in S and T, such that |S′| = |T′| = l;

S′ and T′ are compared position by position: if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a gap, the score is 0; if one of the values at position i is a gap, the score is -1, that is, a penalty, as shown in formula (1):

$$\sigma(S'[i], T'[i]) = \begin{cases} 1, & S'[i] = T'[i] \\ 0, & S'[i] \neq T'[i] \text{ and neither is a gap} \\ -1, & S'[i] \text{ or } T'[i] \text{ is a gap} \end{cases} \tag{1}$$

where σ(S′[i], T′[i]) is the score at position i;

The final score of data S and T is then:

$$\mathrm{Score} = \sum_{i=1}^{l} \sigma(S'[i], T'[i]) \tag{2}$$

Further, step S2 is specifically as follows: a matrix V of order (n+1) × (m+1) is constructed in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;

The initial conditions of matrix V are shown in formula (3):

$$\begin{aligned} V(0,0) &= 0 \\ V(i,0) &= V(i-1,0) + \sigma(S[i], -), \quad 1 \le i \le |S| \\ V(0,j) &= V(0,j-1) + \sigma(-, T[j]), \quad 1 \le j \le |T| \end{aligned} \tag{3}$$

Further, step S3 is specifically as follows: according to the scoring policy, formula (4) is used to compute, for each remaining empty cell of matrix V, the sums formed with the values of its known adjacent cells; these sums are compared and the maximum is written into the cell:

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases} \tag{4}$$

Further, step S4 is specifically as follows: starting from V(|S|, |T|), each cell is traced back to its source cell and the step is marked with an arrow, which yields the comparison result of data S and T.

The beneficial effects of the invention are as follows: the invention introduces a scoring policy and calculates a score according to that policy in order to determine quantitatively the degree of similarity between data, so that similarity comparison can be performed even in the case of inexact matching.

Description of the drawings

Fig. 1 is a flow chart of the data similarity analysis method provided by the invention.

Embodiment

Embodiments of the invention are further described below with reference to the accompanying drawing.

The invention provides a data similarity analysis method which, as shown in Fig. 1, comprises the following steps:

S1. Setting a scoring policy.

Suppose the two groups of data to be compared are S = s₁s₂…sₙ and T = t₁t₂…tₘ, of lengths n and m respectively. In order to express their degree of similarity quantitatively, a scoring policy is introduced, and a score calculated according to that policy is used as the final expression of the similarity of the two.

When S and T are compared, their lengths may differ, so S and T need to be extended: a number of gaps (denoted by the symbol "-") are inserted at appropriate positions in S and T so that the two have the same length and the score under the scoring policy is as high as possible. This score is then the final similarity. How to find these suitable gap positions is therefore the key to the invention. Let S′ and T′ be obtained by inserting gaps at suitable positions in S and T, with |S′| = |T′| = l.

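S′ and T′ are compared position by position: if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a gap, the score is 0; if one of the values at position i is a gap, the score is -1, that is, a penalty, as shown in formula (1):

$$\sigma(S'[i], T'[i]) = \begin{cases} 1, & S'[i] = T'[i] \\ 0, & S'[i] \neq T'[i] \text{ and neither is a gap} \\ -1, & S'[i] \text{ or } T'[i] \text{ is a gap} \end{cases} \tag{1}$$

where σ(S′[i], T′[i]) is the score at position i.

It should be noted that S′ and T′ cannot both contain a gap at the same position i: gaps are introduced artificially, so introducing a gap in both sequences at the same position is meaningless.

The final score of data S and T is then:

$$\mathrm{Score} = \sum_{i=1}^{l} \sigma(S'[i], T'[i]) \tag{2}$$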
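By way of illustration only, the scoring policy and formula (2) can be sketched in Python as follows, using the policy of +1 for equal values, 0 for unequal values and -1 when one side is a gap. The names `GAP`, `sigma` and `alignment_score` are assumptions made for this sketch, and the sketch presupposes that the two inputs have already been extended to equal length.

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Score of one aligned position under the scoring policy."""
    if a == GAP and b == GAP:
        # Two gaps at the same position are not allowed.
        raise ValueError("both positions are gaps")
    if a == GAP or b == GAP:
        return -1  # penalty for a gap
    return 1 if a == b else 0


def alignment_score(s_aligned: str, t_aligned: str) -> int:
    """Sum of the position scores of two already-extended sequences."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("extended sequences must have equal length")
    return sum(sigma(a, b) for a, b in zip(s_aligned, t_aligned))
```

For instance, `alignment_score("MNMXYMX", "MN--YNX")` sums four matches, one mismatch and two gaps, giving 4 + 0 - 2 = 2.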

S2. Constructing a scoring matrix.

A matrix V of order (n+1) × (m+1) is constructed in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;

The initial conditions of matrix V are shown in formula (3):

$$\begin{aligned} V(0,0) &= 0 \\ V(i,0) &= V(i-1,0) + \sigma(S[i], -), \quad 1 \le i \le |S| \\ V(0,j) &= V(0,j-1) + \sigma(-, T[j]), \quad 1 \le j \le |T| \end{aligned} \tag{3}$$

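In this embodiment of the invention, S = MNMXYMX and T = MNYNX, so n = 7, m = 5 and V is an 8 × 6 matrix. Since a gap scores -1, formula (3) initializes the column corresponding to S to V(i,0) = -i for 0 ≤ i ≤ 7 and the row corresponding to T to V(0,j) = -j for 0 ≤ j ≤ 5; the remaining cells are still empty.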
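By way of illustration only, the initialization of formula (3) for this example can be sketched in Python as follows. The names `GAP_SCORE` and `init_matrix`, and the use of `None` to mark cells that have not yet been filled, are assumptions made for this sketch.

```python
GAP_SCORE = -1  # score of aligning any value with a gap "-"


def init_matrix(s: str, t: str):
    """Build the (n+1) x (m+1) matrix with only row 0 and column 0 filled."""
    n, m = len(s), len(t)
    v = [[None] * (m + 1) for _ in range(n + 1)]
    v[0][0] = 0
    for i in range(1, n + 1):  # column 0 corresponds to S
        v[i][0] = v[i - 1][0] + GAP_SCORE
    for j in range(1, m + 1):  # row 0 corresponds to T
        v[0][j] = v[0][j - 1] + GAP_SCORE
    return v


V = init_matrix("MNMXYMX", "MNYNX")
for row in V:
    print(row)
```

Row index i of the list corresponds to S[i] and column index j to T[j], with index 0 reserved for the empty prefix.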

S3. Filling the scoring matrix.

According to the scoring policy, formula (4) is used to compute, for each remaining empty cell of matrix V, the sums formed with the values of its known adjacent cells (namely the cell above, the cell to the left and the cell diagonally above-left); these sums are compared and the maximum is written into the cell:

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases} \tag{4}$$

For example, when i = 1 and j = 1, S[1] = T[1] = M, so σ(S[1], T[1]) = 1. With V(0,0) = 0, V(0,1) = -1 and V(1,0) = -1, the three candidates of formula (4) are 0 + 1 = 1, -1 + (-1) = -2 and -1 + (-1) = -2, so the maximum is V(1,1) = 1. The values of the subsequent cells are obtained recursively from formula (4).
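By way of illustration only, steps S2 and S3 can be sketched together in Python as follows. The names `GAP`, `sigma` and `fill_matrix` are assumptions made for this sketch; because Python strings are 0-based, `s[i - 1]` stands for S[i] and `t[j - 1]` for T[j].

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Scoring policy: +1 equal, 0 unequal, -1 if one side is a gap."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else 0


def fill_matrix(s: str, t: str):
    """Initialize V by formula (3) and fill it by formula (4)."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # diagonal
                v[i - 1][j] + sigma(s[i - 1], GAP),           # from above
                v[i][j - 1] + sigma(GAP, t[j - 1]),           # from the left
            )
    return v


V = fill_matrix("MNMXYMX", "MNYNX")
for row in V:
    print(row)
```

Filling row by row guarantees that the three neighbouring cells needed by formula (4) are already known when a cell is computed. For S = MNMXYMX and T = MNYNX the sketch gives V[1][1] equal to 1 and the bottom-right cell V[7][5] equal to 2.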

Filling every cell in this way completes the scoring matrix of the above embodiment.

S4. Tracing back through the scoring matrix.

In this embodiment of the invention, the value 2 of V(|S|, |T|) is the final comparison score. Backtracking is the process of retracing the order in which the scores were computed: starting from V(|S|, |T|), each cell is traced back to its source cell and the step is marked with an arrow, which yields the comparison result of data S and T.

The backtracking rule is as follows: starting from the last cell V(|S|, |T|), check which term of formula (4) produced the value of each cell. If it is the first term, that is, V(i,j) = V(i-1,j-1) + σ(S[i], T[j]), the source cell is the diagonally adjacent cell above and to the left. If it is the second term, that is, V(i,j) = V(i-1,j) + σ(S[i], -), the source cell is the cell directly above, which means that S[i] is paired with a gap, so a gap is inserted in T. If it is the third term, that is, V(i,j) = V(i,j-1) + σ(-, T[j]), the source cell is the cell to the left, which means that T[j] is paired with a gap, so a gap is inserted in S.

According to the above rule, the comparison result corresponding to this path is:

MN--YNX

MNMXYMX

That is, two gaps are inserted at the 3rd and 4th positions of T (MNYNX becomes MN--YNX), at which point its similarity to S (MNMXYMX) is highest; the specific similarity score is 2.
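By way of illustration only, steps S1 to S4 can be sketched end to end in Python as follows. The names `GAP`, `sigma` and `align` are assumptions made for this sketch. When several terms of formula (4) produce the same value, the sketch prefers the diagonal cell, then the cell above, then the cell to the left; this tie-breaking order is also an assumption, since the description does not state one.

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Scoring policy: +1 equal, 0 unequal, -1 if one side is a gap."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else 0


def align(s: str, t: str):
    """Return (score, extended S, extended T) for sequences s and t."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                v[i - 1][j] + sigma(s[i - 1], GAP),
                v[i][j - 1] + sigma(GAP, t[j - 1]),
            )
    # Backtracking from V(|S|, |T|) to V(0, 0).
    i, j = n, m
    s_out, t_out = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and v[i][j] == v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            s_out.append(s[i - 1])  # diagonal: pair S[i] with T[j]
            t_out.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and v[i][j] == v[i - 1][j] + sigma(s[i - 1], GAP):
            s_out.append(s[i - 1])  # from above: S[i] faces a gap
            t_out.append(GAP)
            i -= 1
        else:
            s_out.append(GAP)       # from the left: T[j] faces a gap
            t_out.append(t[j - 1])
            j -= 1
    return v[n][m], "".join(reversed(s_out)), "".join(reversed(t_out))


score, s_ext, t_ext = align("MNMXYMX", "MNYNX")
print(score)
print(t_ext)
print(s_ext)
```

With the stated tie-breaking order, the sketch reproduces the score 2 and the pair MN--YNX / MNMXYMX given above. A different tie-breaking order may yield a different alignment with the same score.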

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed herein, those of ordinary skill in the art can make various other specific variations and combinations without departing from the essence of the invention, and such variations and combinations remain within the scope of protection of the invention.

Claims

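1. A data similarity analysis method, characterized in that it comprises the following steps:

S1. setting a scoring policy;

S2. constructing a scoring matrix;

S3. filling the scoring matrix;

S4. tracing back through the scoring matrix to obtain the comparison result of the two groups of data.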

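2. The data similarity analysis method according to claim 1, characterized in that step S1 is specifically:

suppose the two groups of data to be compared are S = s₁s₂…sₙ and T = t₁t₂…tₘ, of lengths n and m respectively; S′ and T′ are obtained by inserting gaps "-" at appropriate positions in S and T, such that |S′| = |T′| = l;

S′ and T′ are compared position by position: if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a gap, the score is 0; if one of the values at position i is a gap, the score is -1, as shown in formula (1):

$$\sigma(S'[i], T'[i]) = \begin{cases} 1, & S'[i] = T'[i] \\ 0, & S'[i] \neq T'[i] \text{ and neither is a gap} \\ -1, & S'[i] \text{ or } T'[i] \text{ is a gap} \end{cases} \tag{1}$$

where σ(S′[i], T′[i]) is the score at position i;

the final score of data S and T is then:

$$\mathrm{Score} = \sum_{i=1}^{l} \sigma(S'[i], T'[i]) \tag{2}$$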

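3. The data similarity analysis method according to claim 2, characterized in that step S2 is specifically:

a matrix V of order (n+1) × (m+1) is constructed in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;

the initial conditions of matrix V are shown in formula (3):

$$\begin{aligned} V(0,0) &= 0 \\ V(i,0) &= V(i-1,0) + \sigma(S[i], -), \quad 1 \le i \le |S| \\ V(0,j) &= V(0,j-1) + \sigma(-, T[j]), \quad 1 \le j \le |T| \end{aligned} \tag{3}$$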

4. The data similarity analysis method according to claim 3, characterized in that step S3 is specifically:

according to the scoring policy, formula (4) is used to compute, for each remaining empty cell of matrix V, the sums formed with the values of its known adjacent cells; these sums are compared and the maximum is written into the cell:

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases} \tag{4}$$

5. The data similarity analysis method according to claim 4, characterized in that step S4 is specifically:

starting from V(|S|, |T|), each cell is traced back to its source cell and the step is marked with an arrow, which yields the comparison result of data S and T.