CN105183832A - Data similarity analysis method - Google Patents


Info

Publication number
CN105183832A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510545966.8A
Other languages
Chinese (zh)
Inventor
唐雪飞
陈科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Original Assignee
CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510545966.8A
Publication of CN105183832A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data similarity analysis method comprising the following steps: S1, setting a scoring strategy; S2, constructing a score matrix; S3, filling the score matrix; and S4, backtracking through the score matrix to obtain a comparison result for two sets of data. The method introduces a scoring strategy and computes a score according to it; the score quantifies the similarity between the data, so similarity can be compared even when the data match only inexactly.

Description

A data similarity analysis method
Technical field
The invention belongs to the field of computer business intelligence and specifically relates to the design of a data similarity analysis method.
Background
With the development of information technology, many IT fields produce large amounts of data every day. Many business environments contain several different databases, each storing a large amount of data, and these data often resemble one another to some degree. In the business data of a university, for example, both the academic administration system and the work system hold a large amount of student-related data.
In such cases we often care a great deal about how similar these data sources are, so that the redundancy among them can be analyzed. The data are generally stored in different formats, typically numeric, character, date and so on. To a computer, however, all of them are stored as binary numbers, and each of these formats can be converted to a character string. Techniques for comparing data similarity can therefore be reduced to techniques for comparing strings. The main string comparison techniques are:
1. Simple comparison method
The two strings are compared directly, position by position, and their similarity is determined by whether the characters at corresponding positions are identical. Because the strings must be traversed in full, the comparison takes a comparatively long time. The advantage of the method is that it is simple to implement, so it suits small data volumes; for large data volumes this primitive comparison is generally not used.
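The position-by-position comparison described above can be sketched as follows. This is a minimal illustration, not the patent's own code: the function name `positional_similarity` and the choice to normalize by the longer string's length are assumptions made here.

```python
def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions at which a and b hold the same character.

    Assumption: positions beyond the end of the shorter string count
    as mismatches, so the result is normalized by the longer length.
    """
    if not a and not b:
        return 1.0
    # zip stops at the shorter string; every pair is visited once.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


same = positional_similarity("AABBCC", "AABBCC")
shifted = positional_similarity("AABBCC", "AACC")
print(same, shifted)
```

Because characters are compared only at identical indices, a single missing character shifts everything after it, which is why this approach handles inexact matches poorly.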
2. BM method
The BM method is short for the Boyer-Moore exact string matching algorithm. It has low time complexity and is widely used at present.
The exact string matching problem is to find, in a text T, all substrings that exactly match a query P. The BM algorithm relies mainly on three ingenious and effective techniques: scanning from right to left, the bad character rule, and the good suffix rule.
Scanning from right to left means that matching starts from the last character of P and proceeds toward the front, rather than starting from the first character and proceeding toward the end as is customary.
The bad character rule: if, during the right-to-left scan, Ti and Pj are found to differ and P contains a character Pk equal to Ti with k<j, then P is shifted right so that Pk aligns with Ti, and matching restarts from right to left. If no character in P equals Ti, the first character of P is aligned with the character following Ti, and the comparison again proceeds from right to left.
The good suffix rule: if, during the right-to-left scan, Ti and Pj are found to differ, check whether the part t that has already matched occurs at another position t' in P.
(1) If t' occurs and the letters preceding t and t' differ, shift P right so that t' aligns with t in T.
(2) If no such t' occurs, find the longest prefix x of P that equals a suffix of t, and shift P right so that x aligns with that suffix of t in T.
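The right-to-left scan and the bad character rule can be illustrated with a short sketch. It is a simplified variant that deliberately omits the good suffix rule, so it is not a full Boyer-Moore implementation; the function name `bm_bad_character_search` and the 0-based indexing are assumptions of this example.

```python
def bm_bad_character_search(text: str, pattern: str) -> list:
    """Return the start indices of all exact occurrences of pattern.

    Simplified Boyer-Moore: right-to-left scan plus the bad character
    rule only (the good suffix rule is left out of this sketch).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index of each character inside the pattern.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    hits = []
    s = 0  # current alignment: pattern[0] sits under text[s]
    while s <= n - m:
        j = m - 1
        # Compare from the last pattern character toward the first.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the rightmost occurrence of the mismatched text
            # character with it; always move at least one position.
            k = last.get(text[s + j], -1)
            s += max(1, j - k)
    return hits


print(bm_bad_character_search("AABBCC", "BC"))
```

When the mismatched text character does not occur in the pattern at all, `k` is -1 and the shift is `j + 1`, which places the first pattern character just past the mismatch, as the rule describes.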
3. KMP method
The KMP algorithm is an improved string matching algorithm, discovered simultaneously by D. E. Knuth, J. H. Morris and V. R. Pratt. Its key idea is to use the information gained when a match fails to reduce the number of comparisons between the pattern string and the main string as far as possible, and thereby match quickly.
To compare a main string T of length n with a pattern string W of length m, the KMP method checks T[i] against W[j]. If T[i] = W[j], it goes on to check T[i+1] against W[j+1]. If T[i] ≠ W[j], there are two cases: if j = 1, the pattern string moves right by one position and T[i+1] is checked against W[1]; if 1 < j ≤ m, the pattern string moves right by j - next(j) positions and T[i] is checked against W[next(j)]. This process repeats until j = m or i = n.
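A compact sketch of KMP follows. Note the assumptions: it uses 0-based indexing and a failure table `fail` (longest proper prefix that is also a suffix) instead of the 1-based next(j) notation used in the text, and the names `build_failure` and `kmp_search` are illustrative only.

```python
def build_failure(pattern: str) -> list:
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text: str, pattern: str) -> list:
    """Return the start indices of all exact occurrences of pattern."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # On a mismatch, reuse what is known about the matched prefix
        # instead of moving back in the text.
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = fail[j - 1]
    return hits


print(kmp_search("AABBCC", "BC"))
```

The text index `i` never moves backward, which is the sense in which the information from a failed match is reused.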
Besides the three methods above there are other string comparison methods that can be used to match data, but they share a common shortcoming: they can only compare exactly. Comparing AABBCC with BC, for example, readily yields the match point. For an inexact match, such as AABBCC against AACC, these exact methods cannot produce a similarity result, even though the two strings are identical at the head and the tail and we could reasonably conclude that they are similar.
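The limitation can be seen with a few lines of Python. This is only an illustration of the AABBCC example; the helper names `common_prefix_len` and `common_suffix_len` are introduced here and are not part of the patent.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_suffix_len(a: str, b: str) -> int:
    """Number of trailing characters that a and b share."""
    return common_prefix_len(a[::-1], b[::-1])


s, t = "AABBCC", "AACC"
exact_hit = "BC" in s      # exact substring search succeeds
inexact_hit = t in s       # exact search finds nothing for AACC
head = common_prefix_len(s, t)
tail = common_suffix_len(s, t)
print(exact_hit, inexact_hit, head, tail)
```

Exact search reports only "found" or "not found" for AACC, although the two strings share a head of two characters and a tail of two characters.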
Summary of the invention
The object of the invention is to solve the problem that the string comparison methods of the prior art can only compare exactly and cannot produce a similarity result for an inexact match; to this end a data similarity analysis method is proposed.
The technical solution of the invention is a data similarity analysis method comprising the following steps:
S1, set a scoring strategy;
S2, construct a score matrix;
S3, fill the score matrix;
S4, backtrack through the score matrix to obtain the comparison result for the two groups of data.
Further, step S1 is specifically: suppose the two groups of data to be compared are $S = s_1 s_2 \cdots s_n$ and $T = t_1 t_2 \cdots t_m$, with lengths n and m respectively. Inserting the space symbol "-" at appropriate positions in S and T yields S' and T' such that |S'| = |T'| = l;
Compare S' and T': if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a space, the score is 0; if either value at position i is a space, the score is -1, i.e. a deduction, as shown in formula (1):
$$\sigma(S'[i], T'[i]) = \begin{cases} 1, & S'[i] = T'[i] \\ 0, & S'[i] \neq T'[i],\ S'[i] \neq \text{-},\ T'[i] \neq \text{-} \\ -1, & S'[i] = \text{-}\ \text{or}\ T'[i] = \text{-} \end{cases} \tag{1}$$
where σ(S'[i], T'[i]) is the score at position i;
The final score of data S and T is then:
$$Score = \sum_{i=1}^{l} \sigma(S'[i], T'[i]) \tag{2}$$
Further, step S2 is specifically: construct a matrix V of order (n+1) × (m+1) in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;
The initial conditions of matrix V are given by formula (3):
$$\begin{aligned} V(0,0) &= 0 \\ V(i,0) &= V(i-1,0) + \sigma(S[i], -), \quad 1 \le i \le |S| \\ V(0,j) &= V(0,j-1) + \sigma(-, T[j]), \quad 1 \le j \le |T| \end{aligned} \tag{3}$$
Further, step S3 is specifically: according to the scoring strategy, use formula (4) to fill each remaining empty cell of matrix V: add the corresponding score to the value of each of its known adjacent cells, compare the sums, and write the maximum into the cell:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases} \tag{4}$$
Further, step S4 is specifically: starting from V(|S|, |T|), trace each cell back to its source, marking the path with arrows, to obtain the comparison result for data S and T.
The beneficial effects of the invention are: the invention introduces a scoring strategy and computes a score according to it; the score quantifies the degree of similarity between data, so that similarity can be compared even for inexact matches.
Brief description of the drawings
Fig. 1 is a flow chart of the data similarity analysis method provided by the invention.
Detailed description of the embodiments
Embodiments of the invention are further described below with reference to the accompanying drawing.
The invention provides a data similarity analysis method which, as shown in Fig. 1, comprises the following steps:
S1, set the scoring strategy.
Suppose the two groups of data to be compared are $S = s_1 s_2 \cdots s_n$ and $T = t_1 t_2 \cdots t_m$, with lengths n and m respectively. To express their similarity quantitatively, a scoring strategy is introduced and a score is computed according to it; this score ultimately expresses the similarity of the two.
Because S and T differ in length, they must be extended during comparison: some spaces (denoted by the symbol "-") are inserted at appropriate positions in S and T so that the two have the same length. The highest score obtainable under the scoring strategy is then the final similarity. Finding these suitable space positions is therefore the key to the invention. Let S' and T' be obtained by inserting spaces at suitable positions in S and T, with |S'| = |T'| = l.
Compare S' and T': if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a space, the score is 0; if either value at position i is a space, the score is -1, i.e. a deduction, as shown in formula (1):
$$\sigma(S'[i], T'[i]) = \begin{cases} 1, & S'[i] = T'[i] \\ 0, & S'[i] \neq T'[i],\ S'[i] \neq \text{-},\ T'[i] \neq \text{-} \\ -1, & S'[i] = \text{-}\ \text{or}\ T'[i] = \text{-} \end{cases} \tag{1}$$
where σ(S'[i], T'[i]) is the score at position i.
Note that S' and T' cannot both hold a space at the same position i: spaces are introduced artificially, and introducing one in both strings at the same position is meaningless.
The final score of data S and T is then:
$$Score = \sum_{i=1}^{l} \sigma(S'[i], T'[i]) \tag{2}$$
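The scoring strategy of formulas (1) and (2) can be written out directly. The sketch below assumes the data themselves never contain the character "-", and the names `sigma`, `alignment_score` and `GAP` are chosen here for illustration.

```python
GAP = "-"  # the space symbol inserted into S and T


def sigma(a: str, b: str) -> int:
    """Per-position score of formula (1)."""
    if a == GAP and b == GAP:
        # The text rules out a space at the same position in both strings.
        raise ValueError("both positions cannot be spaces")
    if a == GAP or b == GAP:
        return -1  # a space on either side is a deduction
    return 1 if a == b else 0


def alignment_score(s_padded: str, t_padded: str) -> int:
    """Total score of formula (2) for two equal-length padded strings."""
    if len(s_padded) != len(t_padded):
        raise ValueError("padded strings must have the same length")
    return sum(sigma(a, b) for a, b in zip(s_padded, t_padded))


# Padded forms taken from the embodiment's final comparison result.
print(alignment_score("MNMXYMX", "MN--YNX"))
```

For the embodiment's padded pair the four matches, one mismatch and two spaces sum to 1 + 1 - 1 - 1 + 1 + 0 + 1 = 2, the similarity reported in the text.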
S2, construct the score matrix.
Construct a matrix V of order (n+1) × (m+1) in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;
The initial conditions of matrix V are given by formula (3):
$$\begin{aligned} V(0,0) &= 0 \\ V(i,0) &= V(i-1,0) + \sigma(S[i], -), \quad 1 \le i \le |S| \\ V(0,j) &= V(0,j-1) + \sigma(-, T[j]), \quad 1 \le j \le |T| \end{aligned} \tag{3}$$
In this embodiment of the invention, S = MNMXYMX and T = MNYNX, so V is an 8 × 6 matrix; by formula (3) its first column runs from 0 down to -7 and its first row from 0 down to -5, in steps of -1.
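The initial conditions of formula (3) can be sketched as below for the embodiment's data. The function name `init_matrix`, the use of `None` for cells not yet filled, and the `gap_score` parameter (set to the -1 deduction of the scoring strategy) are assumptions of this example.

```python
def init_matrix(s: str, t: str, gap_score: int = -1) -> list:
    """Build the (n+1) x (m+1) matrix V and apply formula (3).

    Rows follow S and columns follow T; interior cells stay None
    until the filling step computes them.
    """
    n, m = len(s), len(t)
    v = [[None] * (m + 1) for _ in range(n + 1)]
    v[0][0] = 0
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + gap_score  # sigma(S[i], -)
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + gap_score  # sigma(-, T[j])
    return v


V = init_matrix("MNMXYMX", "MNYNX")
for row in V:
    print(row)
```

Each step along the first column or first row pairs one character with a space, so the border values decrease by one per cell.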
S3, fill the score matrix.
According to the scoring strategy, use formula (4) to fill each remaining empty cell of matrix V: add the corresponding score to the value of each known adjacent cell (that is, the cell above, the cell to the left and the diagonally adjacent upper-left cell), compare the sums, and write the maximum into the cell:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases} \tag{4}$$
For example, when i = 1 and j = 1, V(i,j) = V(1,1) and S[1] = T[1] = M, so σ(S[i], T[j]) = 1. With V(i-1,j-1) = V(0,0) = 0, V(i-1,j) = V(0,1) = -1 and V(i,j-1) = V(1,0) = -1, the three candidates of formula (4) are 0 + 1 = 1, -1 + (-1) = -2 and -1 + (-1) = -2, so the maximum is V(1,1) = 1. The values of the subsequent cells follow recursively from formula (4).
Applying formula (4) to every remaining cell gives the filled score matrix for the above embodiment, whose last cell is V(7,5) = 2.
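The filling step can be reproduced with a short sketch that applies formula (4) row by row. It assumes the scoring strategy of formula (1) and that the data contain no "-" character; the names `sigma` and `fill_matrix` are illustrative.

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Formula (1): 1 for equal values, 0 for a mismatch, -1 for a space."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else 0


def fill_matrix(s: str, t: str) -> list:
    """Initialize V by formula (3), then fill it by formula (4)."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Python strings are 0-based, so S[i] is s[i - 1].
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # diagonal
                v[i - 1][j] + sigma(s[i - 1], GAP),           # from above
                v[i][j - 1] + sigma(GAP, t[j - 1]),           # from the left
            )
    return v


V = fill_matrix("MNMXYMX", "MNYNX")
for row in V:
    print(" ".join(f"{x:3d}" for x in row))
```

Every cell depends only on cells above it or to its left, so a single pass in row order is enough; the cost is proportional to (n+1) × (m+1).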
S4, backtrack through the score matrix to obtain the comparison result for the two groups of data.
In this embodiment, the value 2 of V(|S|, |T|) is the final comparison score. Backtracking retraces the order in which the scores were computed: starting from V(|S|, |T|), each cell is traced back to its source and the path is marked with arrows, which yields the comparison result for data S and T.
The backtracking rule is: starting from the last cell V(|S|, |T|), check which term of formula (4) produced the value of each cell. If it is the first term, i.e. V(i,j) = V(i-1,j-1) + σ(S[i], T[j]), the source cell is the diagonally upper-left cell. If it is the second term, i.e. V(i,j) = V(i-1,j) + σ(S[i], -), the source cell is the cell directly above, which means S[i] is paired with a space inserted in T. If it is the third term, i.e. V(i,j) = V(i,j-1) + σ(-, T[j]), the source cell is the cell to the left, which means T[j] is paired with a space inserted in S.
Following this rule, the comparison result corresponding to this path is:
MN--YNX
MNMXYMX
That is, inserting two spaces at positions 3 and 4 of T gives the data most similar to S; the similarity score is 2.
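Steps S1 to S4 can be combined into one end-to-end sketch. This is an illustration under stated assumptions rather than the patent's own implementation: the function name `align` is hypothetical, ties between the terms of formula (4) are broken in the order diagonal, above, left, and the data are assumed not to contain "-".

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Formula (1): 1 for equal values, 0 for a mismatch, -1 for a space."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else 0


def align(s: str, t: str):
    """Return (score, padded S, padded T) for data S and T."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                v[i - 1][j] + sigma(s[i - 1], GAP),
                v[i][j - 1] + sigma(GAP, t[j - 1]),
            )
    # Backtrack from V(|S|, |T|), checking which term produced each cell.
    i, j = n, m
    s_out, t_out = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and v[i][j] == v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            s_out.append(s[i - 1]); t_out.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and v[i][j] == v[i - 1][j] + sigma(s[i - 1], GAP):
            s_out.append(s[i - 1]); t_out.append(GAP)  # space goes into T
            i -= 1
        else:
            s_out.append(GAP); t_out.append(t[j - 1])  # space goes into S
            j -= 1
    return v[n][m], "".join(reversed(s_out)), "".join(reversed(t_out))


score, s_padded, t_padded = align("MNMXYMX", "MNYNX")
print(score)
print(t_padded)
print(s_padded)
```

The same function also handles the inexact pair from the background section: `align("AABBCC", "AACC")` scores the four shared characters against the two inserted spaces.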
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Those of ordinary skill in the art can, on the basis of the technical teachings disclosed herein, make various other specific variations and combinations that do not depart from the essence of the invention; such variations and combinations remain within the scope of protection of the invention.

Claims (5)

1. A data similarity analysis method, characterized in that it comprises the following steps:
S1, set a scoring strategy;
S2, construct a score matrix;
S3, fill the score matrix;
S4, backtrack through the score matrix to obtain the comparison result for the two groups of data.
2. The data similarity analysis method according to claim 1, characterized in that step S1 is specifically:
Suppose the two groups of data to be compared are $S = s_1 s_2 \cdots s_n$ and $T = t_1 t_2 \cdots t_m$, with lengths n and m respectively. Inserting the space symbol "-" at appropriate positions in S and T yields S' and T' such that |S'| = |T'| = l;
Compare S' and T': if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a space, the score is 0; if either value at position i is a space, the score is -1, i.e. a deduction, as shown in formula (1):
$$\sigma(S'[i], T'[i]) = \begin{cases} 1, & S'[i] = T'[i] \\ 0, & S'[i] \neq T'[i],\ S'[i] \neq \text{-},\ T'[i] \neq \text{-} \\ -1, & S'[i] = \text{-}\ \text{or}\ T'[i] = \text{-} \end{cases} \tag{1}$$
where σ(S'[i], T'[i]) is the score at position i;
The final score of data S and T is then:
$$Score = \sum_{i=1}^{l} \sigma(S'[i], T'[i]) \tag{2}$$
3. The data similarity analysis method according to claim 2, characterized in that step S2 is specifically:
Construct a matrix V of order (n+1) × (m+1) in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;
The initial conditions of matrix V are given by formula (3):
$$\begin{aligned} V(0,0) &= 0 \\ V(i,0) &= V(i-1,0) + \sigma(S[i], -), \quad 1 \le i \le |S| \\ V(0,j) &= V(0,j-1) + \sigma(-, T[j]), \quad 1 \le j \le |T| \end{aligned} \tag{3}$$
4. The data similarity analysis method according to claim 3, characterized in that step S3 is specifically:
According to the scoring strategy, use formula (4) to fill each remaining empty cell of matrix V: add the corresponding score to the value of each of its known adjacent cells, compare the sums, and write the maximum into the cell:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S[i], T[j]) \\ V(i-1,j) + \sigma(S[i], -) \\ V(i,j-1) + \sigma(-, T[j]) \end{cases} \tag{4}$$
5. The data similarity analysis method according to claim 4, characterized in that step S4 is specifically:
Starting from V(|S|, |T|), trace each cell back to its source, marking the path with arrows, to obtain the comparison result for data S and T.
CN201510545966.8A 2015-08-31 2015-08-31 Data similarity analysis method Pending CN105183832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510545966.8A CN105183832A (en) 2015-08-31 2015-08-31 Data similarity analysis method

Publications (1)

Publication Number Publication Date
CN105183832A (en) 2015-12-23

Family

ID=54905914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510545966.8A Pending CN105183832A (en) 2015-08-31 2015-08-31 Data similarity analysis method

Country Status (1)

Country Link
CN (1) CN105183832A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701256A (en) * 2016-03-23 2016-06-22 南京南瑞继保电气有限公司 Communication point table file comparison method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080104062A1 (en) * 2004-02-09 2008-05-01 Mailfrontier, Inc. Approximate Matching of Strings for Message Filtering
CN102298582A (en) * 2010-06-23 2011-12-28 商业对象软件有限公司 Data searching and matching method and system
US20150066969A1 (en) * 2013-08-30 2015-03-05 International Business Machines Corporation Combined deterministic and probabilistic matching for data management

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20151223