CN105183832A - Data similarity analysis method - Google Patents
Data similarity analysis method
- Publication number
- CN105183832A
- Authority
- CN
- China
- Prior art keywords
- data
- analysis method
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data similarity analysis method. The method comprises the following steps: S1, setting a scoring strategy; S2, constructing a scoring matrix; S3, filling the scoring matrix; and S4, tracing back through the scoring matrix to obtain a comparison result for two sets of data. The method introduces a scoring strategy and calculates a score according to it; the score quantifies the similarity between the data, so similarity can be compared even when the data do not match exactly.
Description
Technical field
The invention belongs to the field of computer business intelligence, and specifically relates to the design of a data similarity analysis method.
Background art
With the development of information technology, many IT fields generate large amounts of data every day. Many business environments contain several different databases, each storing a large amount of data, and these data often overlap to some degree. For example, in the business data of a university, both the educational administration system and the work system store a large amount of student-related data.
In such cases we often care about how similar these different data sources are, so that the redundancy among them can be analyzed. The data are generally stored in different formats, typically numeric, character and date types. For a computer, however, all of these are stored in binary form, and each of these formats can be converted to a character string. Techniques for comparing data similarity can therefore be reduced to techniques for comparing character strings. The main string comparison techniques are:
1. Simple comparison method
The two strings are compared directly, position by position, and the number of positions holding the same character determines how similar they are. Because both strings must be traversed in full, the comparison is relatively time-consuming. The advantage of the method is that it is simple to implement, so it can be used when the data volume is small; for large data volumes this primitive method is generally not used.
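A minimal sketch of such a position-by-position comparison is given below. The function name `positional_similarity` and the choice to normalize by the longer length are assumptions made for illustration; the text does not specify how the per-position result is turned into a similarity value.

```python
def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions (relative to the longer string) holding equal characters."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    # zip stops at the shorter string; the unmatched tail counts as mismatches
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


print(positional_similarity("AABBCC", "AABBCD"))
print(positional_similarity("AABBCC", "AACC"))
```

Note that "AABBCC" and "AACC" agree only at the first two positions under this scheme, because the shared tail "CC" is shifted; this is the weakness the later sections address.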
2. BM method
BM stands for Boyer-Moore, an exact string matching method. Its time complexity is low, and it is currently one of the more widely used methods.
The exact string matching problem is to find, in a text T, all substrings that exactly match a query P. The BM algorithm relies on three ideas: scanning from right to left, the bad character rule and the good suffix rule.
Scanning from right to left means that each comparison starts at the last character of P and proceeds toward the first, rather than from the first character onward as is customary.
Bad character rule: during the right-to-left scan, suppose Ti and Pj are found to differ. If P contains a character Pk equal to Ti with k<j, P is shifted right so that Pk is aligned with Ti, and matching resumes from right to left. If no character of P equals Ti, the first character of P is aligned with the character following Ti, and the comparison again proceeds from right to left.
Good suffix rule: during the right-to-left scan, suppose Ti and Pj are found to differ after a part t has already been matched. Check whether t also occurs at another position t' in P.
(1) If t' exists and the letter preceding t' differs from the letter preceding t, shift P right so that t' is aligned with the t in T.
(2) If no such t' occurs, find the longest prefix x of P that equals a suffix of t, and shift P right so that x is aligned with that suffix of t in T.
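The right-to-left scan and the bad character rule can be sketched as follows. This is a simplified illustration, not a full Boyer-Moore implementation: the good suffix rule is deliberately omitted, and the function name `bm_bad_char_search` and the 0-based indexing are assumptions of this sketch.

```python
from typing import Dict, List


def bm_bad_char_search(text: str, pattern: str) -> List[int]:
    """Return all start indices where pattern occurs in text (bad character rule only)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Rightmost index of every character in the pattern
    last: Dict[str, int] = {ch: k for k, ch in enumerate(pattern)}
    hits: List[int] = []
    s = 0  # current shift of the pattern relative to the text
    while s <= n - m:
        j = m - 1
        # Scan from right to left
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the rightmost occurrence of the bad character, if any;
            # max(1, ...) guarantees the pattern always moves forward.
            k = last.get(text[s + j], -1)
            s += max(1, j - k)
    return hits


print(bm_bad_char_search("AABBCC", "BC"))
print(bm_bad_char_search("AABBCC", "AACC"))
```

As expected for an exact matcher, the second call finds nothing even though the two strings share their head and tail.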
3. KMP method
The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt. Its key idea is to use the information gained from a failed match to reduce the number of comparisons between the pattern string and the main string as far as possible, thereby speeding up matching.
When comparing a main string T of length n with a pattern string W of length m, the KMP method checks whether T[i] and W[j] match. If T[i]=W[j], it goes on to check whether T[i+1] and W[j+1] match. If T[i]≠W[j], there are two cases: if j=1, the pattern string is shifted right by one position and T[i+1] is checked against W[1]; if 1<j<=m, the pattern string is shifted right by j-next(j) positions and T[i] is checked against W[next(j)]. This process repeats until j=m or i=n.
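The idea can be sketched in Python as below. Note the assumptions: the sketch uses 0-based indices and the common prefix-function formulation (`nxt[j]` is the length of the longest proper prefix of the pattern that is also a suffix of `pattern[:j+1]`), which differs in indexing from the 1-based next(j) used in the text; the names `build_next` and `kmp_search` are illustrative.

```python
from typing import List


def build_next(pattern: str) -> List[int]:
    """Prefix function: nxt[j] = length of the longest proper border of pattern[:j+1]."""
    nxt = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = nxt[k - 1]  # fall back to the next shorter border
        if pattern[j] == pattern[k]:
            k += 1
        nxt[j] = k
    return nxt


def kmp_search(text: str, pattern: str) -> List[int]:
    """Return all start indices where pattern occurs in text."""
    if not pattern:
        return []
    nxt = build_next(pattern)
    hits: List[int] = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = nxt[k - 1]  # reuse information from the failed match
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = nxt[k - 1]
    return hits


print(build_next("ABABC"))
print(kmp_search("ABABABC", "ABABC"))
```

The text index `i` never moves backward; only the pattern position `k` falls back, which is what keeps the number of comparisons low.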
Besides these three methods there are other string comparison methods that can also be used to match data, but they share a common drawback: they can only compare exactly. For example, comparing AABBCC with BC readily yields the match position. For inexact matching, however, such as comparing AABBCC with AACC, these exact methods cannot produce a similarity result, even though the two strings are identical at the head and the tail and can reasonably be regarded as similar.
Summary of the invention
The object of the invention is to solve the problem that string comparison methods in the prior art can only compare exactly and cannot produce a similarity result for inexact matches; to this end, a data similarity analysis method is proposed.
The technical solution of the invention is a data similarity analysis method comprising the following steps:
S1. Set a scoring strategy;
S2. Construct a scoring matrix;
S3. Fill the scoring matrix;
S4. Trace back through the scoring matrix to obtain the comparison result of the two sets of data.
Further, step S1 is specifically: suppose the two sets of data to be compared are S = s_1 s_2 … s_n and T = t_1 t_2 … t_m, of lengths n and m respectively. Inserting gaps "-" at appropriate positions in S and T yields S' and T' such that |S'| = |T'| = l.
Compare S' and T': if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a gap, the score is 0; if one of the values at position i is a gap, the score is -1, i.e. a penalty, as shown in formula (1):
$$\sigma(S'[i],T'[i])=\begin{cases}1, & S'[i]=T'[i]\\ 0, & S'[i]\neq T'[i],\ S'[i]\neq\text{-},\ T'[i]\neq\text{-}\\ -1, & S'[i]=\text{-}\ \text{or}\ T'[i]=\text{-}\end{cases}\tag{1}$$
where σ(S'[i], T'[i]) is the score at position i.
The final score of data S and T is then:
$$\sum_{i=1}^{l}\sigma(S'[i],T'[i])\tag{2}$$
Further, step S2 is specifically: construct a matrix V of order (n+1)×(m+1) in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T.
The initial conditions of matrix V are shown in formula (3):
$$\begin{aligned}V(0,0)&=0\\ V(i,0)&=V(i-1,0)+\sigma(S[i],\text{-}), && 1\le i\le |S|\\ V(0,j)&=V(0,j-1)+\sigma(\text{-},T[j]), && 1\le j\le |T|\end{aligned}\tag{3}$$
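Further, step S3 is specifically: according to the scoring strategy, use formula (4) to add, for each remaining empty cell of matrix V, the values of its known adjacent cells to the corresponding scores, compare the sums, and fill the maximum into the cell:
$$V(i,j)=\max\begin{cases}V(i-1,j-1)+\sigma(S[i],T[j])\\ V(i-1,j)+\sigma(S[i],\text{-})\\ V(i,j-1)+\sigma(\text{-},T[j])\end{cases}\tag{4}$$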
Further, step S4 is specifically: starting from V(|S|,|T|), trace each cell back to its source and mark it with an arrow to obtain the comparison result of data S and T.
The beneficial effect of the invention is that it introduces a scoring strategy and calculates a score according to it, which quantifies the degree of similarity between data and allows similarity to be compared in cases of inexact matching.
Brief description of the drawings
Fig. 1 is a flowchart of the data similarity analysis method provided by the invention.
Detailed description
Embodiments of the invention are further described below with reference to the accompanying drawing.
The invention provides a data similarity analysis method which, as shown in Fig. 1, comprises the following steps:
S1. Set a scoring strategy.
Suppose the two sets of data to be compared are S = s_1 s_2 … s_n and T = t_1 t_2 … t_m, of lengths n and m respectively. To express their degree of similarity quantitatively, a scoring strategy is introduced and a score is calculated according to it; this score ultimately expresses the similarity of the two.
When S and T are compared, their lengths may differ, so S and T need to be extended: gaps (denoted by the symbol "-") are inserted at appropriate positions in S and T so that the two have the same length and the score under the scoring strategy is as high as possible. That score is the final similarity. Finding these suitable gap positions is therefore the key to the invention. Let S' and T' be the strings obtained by inserting gaps at suitable positions in S and T, with |S'| = |T'| = l.
Compare S' and T': if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a gap, the score is 0; if one of the values at position i is a gap, the score is -1, i.e. a penalty, as shown in formula (1):
$$\sigma(S'[i],T'[i])=\begin{cases}1, & S'[i]=T'[i]\\ 0, & S'[i]\neq T'[i],\ S'[i]\neq\text{-},\ T'[i]\neq\text{-}\\ -1, & S'[i]=\text{-}\ \text{or}\ T'[i]=\text{-}\end{cases}\tag{1}$$
where σ(S'[i], T'[i]) is the score at position i.
Note that S' and T' cannot both hold a gap at the same position i: gaps are introduced artificially, so introducing them in both strings at the same position is meaningless.
The final score of data S and T is then:
$$\sum_{i=1}^{l}\sigma(S'[i],T'[i])\tag{2}$$
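The scoring strategy can be sketched in Python as follows, assuming the two strings have already been padded with gaps to equal length. The names `sigma` and `alignment_score` and the decision to raise an error on a gap/gap pair are choices of this sketch, not something the text prescribes.

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Per-position score: 1 for equal values, 0 for a mismatch, -1 if one side is a gap."""
    if a == GAP and b == GAP:
        # The text rules out a gap in both strings at the same position
        raise ValueError("both positions are gaps")
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else 0


def alignment_score(s_gapped: str, t_gapped: str) -> int:
    """Sum of per-position scores over two gapped strings of equal length l."""
    if len(s_gapped) != len(t_gapped):
        raise ValueError("gapped strings must have the same length")
    return sum(sigma(a, b) for a, b in zip(s_gapped, t_gapped))


print(alignment_score("MNMXYMX", "MN--YNX"))
print(alignment_score("AABBCC", "AA--CC"))
```

This only evaluates a given placement of gaps; choosing the placement that maximizes the score is the job of the scoring matrix described in the following steps.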
S2. Construct a scoring matrix.
Construct a matrix V of order (n+1)×(m+1) in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T.
The initial conditions of matrix V are shown in formula (3):
$$\begin{aligned}V(0,0)&=0\\ V(i,0)&=V(i-1,0)+\sigma(S[i],\text{-}), && 1\le i\le |S|\\ V(0,j)&=V(0,j-1)+\sigma(\text{-},T[j]), && 1\le j\le |T|\end{aligned}\tag{3}$$
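In this embodiment, S=MNMXYMX and T=MNYNX, so the initial matrix V given by formula (3) is as follows (cells not yet filled are left empty):
| | - | M | N | Y | N | X |
|---|---|---|---|---|---|---|
| - | 0 | -1 | -2 | -3 | -4 | -5 |
| M | -1 | | | | | |
| N | -2 | | | | | |
| M | -3 | | | | | |
| X | -4 | | | | | |
| Y | -5 | | | | | |
| M | -6 | | | | | |
| X | -7 | | | | | |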
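As an illustration, the initial conditions of formula (3) can be set up as in the sketch below. The function name `init_matrix`, the use of `None` for cells that are not yet filled, and the `gap_score` parameter are assumptions of this sketch; the value -1 is the gap score from formula (1).

```python
from typing import List, Optional


def init_matrix(s: str, t: str, gap_score: int = -1) -> List[List[Optional[int]]]:
    """Build the (n+1) x (m+1) matrix with only row 0 and column 0 filled in."""
    n, m = len(s), len(t)
    V: List[List[Optional[int]]] = [[None] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap_score  # S[i] paired with a gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap_score  # T[j] paired with a gap
    return V


V = init_matrix("MNMXYMX", "MNYNX")
print(V[0])                      # first row, indexed by T
print([row[0] for row in V])     # first column, indexed by S
```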
S3. Fill the scoring matrix.
According to the scoring strategy, use formula (4) to add, for each remaining empty cell of matrix V, the values of its known adjacent cells (i.e. the cells above, to the left and diagonally above-left) to the corresponding scores, compare the sums, and fill the maximum into the cell:
$$V(i,j)=\max\begin{cases}V(i-1,j-1)+\sigma(S[i],T[j])\\ V(i-1,j)+\sigma(S[i],\text{-})\\ V(i,j-1)+\sigma(\text{-},T[j])\end{cases}\tag{4}$$
For example, when i=1 and j=1, V(i,j)=V(1,1) and S[i]=S[1]=T[j]=T[1]=M, so σ(S[i],T[j])=1. Since V(i-1,j-1)=V(0,0)=0, V(i-1,j)=V(0,1)=-1 and V(i,j-1)=V(1,0)=-1, the three candidates in formula (4) are 0+1=1, -1-1=-2 and -1-1=-2, and the maximum is V(1,1)=1. The values of the subsequent cells are obtained recursively from formula (4).
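The filled scoring matrix for the above embodiment, computed from formula (4), is:
| | - | M | N | Y | N | X |
|---|---|---|---|---|---|---|
| - | 0 | -1 | -2 | -3 | -4 | -5 |
| M | -1 | 1 | 0 | -1 | -2 | -3 |
| N | -2 | 0 | 2 | 1 | 0 | -1 |
| M | -3 | -1 | 1 | 2 | 1 | 0 |
| X | -4 | -2 | 0 | 1 | 2 | 2 |
| Y | -5 | -3 | -1 | 1 | 1 | 2 |
| M | -6 | -4 | -2 | 0 | 1 | 1 |
| X | -7 | -5 | -3 | -1 | 0 | 2 |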
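The filling step can be sketched as a straightforward dynamic program over formula (4). The names `fill_matrix`, `sigma` and `GAP_SCORE` are illustrative, and the sketch assumes the scores 1, 0 and -1 from formula (1); strings are 0-indexed in Python, so `s[i - 1]` plays the role of S[i].

```python
from typing import List

GAP_SCORE = -1  # score when one side is a gap


def sigma(a: str, b: str) -> int:
    """Score for two non-gap values: 1 if equal, 0 otherwise."""
    return 1 if a == b else 0


def fill_matrix(s: str, t: str) -> List[List[int]]:
    """Fill the (n+1) x (m+1) scoring matrix using formulas (3) and (4)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + GAP_SCORE
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + GAP_SCORE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # diagonal
                V[i - 1][j] + GAP_SCORE,                      # from above
                V[i][j - 1] + GAP_SCORE,                      # from the left
            )
    return V


for row in fill_matrix("MNMXYMX", "MNYNX"):
    print(row)
```

Filling row by row guarantees that the three neighbours each cell depends on are already known; the bottom-right cell then holds the final score.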
S4. Trace back through the scoring matrix to obtain the comparison result of the two sets of data.
In this embodiment, the value 2 of V(|S|,|T|) is the final comparison score. Tracing back is the process of reversing the scoring order: starting from V(|S|,|T|), trace each cell back to its source and mark it with an arrow to obtain the comparison result of data S and T. Applying formula (4) in reverse, the traceback path is:
V(7,5) → V(6,4) → V(5,3) → V(4,2) → V(3,2) → V(2,2) → V(1,1) → V(0,0)
The traceback rule is: starting from the last cell V(|S|,|T|), check which term of formula (4) produced the value of each cell. If it is the first term, i.e. V(i,j)=V(i-1,j-1)+σ(S[i],T[j]), the source cell is the one diagonally above-left. If it is the second term, i.e. V(i,j)=V(i-1,j)+σ(S[i],-), the source cell is the one directly above; S[i] is paired with a gap, so a gap is inserted in T. If it is the third term, i.e. V(i,j)=V(i,j-1)+σ(-,T[j]), the source cell is the one to the left; T[j] is paired with a gap, so a gap is inserted in S.
Following these rules, the comparison result corresponding to this path is:
MN--YNX
MNMXYMX
That is, inserting two gaps at positions 3 and 4 of T yields the data with the highest similarity to S; the similarity score is 2.
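Putting the four steps together, the whole procedure can be sketched as below. This is an illustrative reading of the method, not a definitive implementation: the function name `align`, the returned tuple, and the tie-breaking order (diagonal first, then above, then left, following the order in which the traceback rule lists the terms) are assumptions of this sketch.

```python
from typing import List, Tuple

GAP = "-"
GAP_SCORE = -1


def sigma(a: str, b: str) -> int:
    """Score for two non-gap values: 1 if equal, 0 otherwise."""
    return 1 if a == b else 0


def align(s: str, t: str) -> Tuple[str, str, int]:
    """Return (gapped S, gapped T, score) using formulas (3), (4) and the traceback rule."""
    n, m = len(s), len(t)
    V: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + GAP_SCORE
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + GAP_SCORE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + GAP_SCORE,
                V[i][j - 1] + GAP_SCORE,
            )

    # Traceback from V(|S|,|T|) to V(0,0)
    i, j = n, m
    s_out: List[str] = []
    t_out: List[str] = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            s_out.append(s[i - 1])
            t_out.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + GAP_SCORE:
            s_out.append(s[i - 1])  # S[i] paired with a gap in T
            t_out.append(GAP)
            i -= 1
        else:
            s_out.append(GAP)       # T[j] paired with a gap in S
            t_out.append(t[j - 1])
            j -= 1
    return "".join(reversed(s_out)), "".join(reversed(t_out)), V[n][m]


s_gapped, t_gapped, score = align("MNMXYMX", "MNYNX")
print(t_gapped)
print(s_gapped)
print(score)
```

When several terms of formula (4) tie, different tie-breaking orders can yield different alignments with the same score; for the embodiment's strings the path from V(7,5) happens to be unique.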
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Those of ordinary skill in the art may, on the basis of the technical teaching disclosed by the invention, make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.
Claims (5)
1. A data similarity analysis method, characterized by comprising the following steps:
S1. Set a scoring strategy;
S2. Construct a scoring matrix;
S3. Fill the scoring matrix;
S4. Trace back through the scoring matrix to obtain the comparison result of the two sets of data.
2. The data similarity analysis method according to claim 1, characterized in that step S1 is specifically:
Suppose the two sets of data to be compared are S = s_1 s_2 … s_n and T = t_1 t_2 … t_m, of lengths n and m respectively. Inserting gaps "-" at appropriate positions in S and T yields S' and T' such that |S'| = |T'| = l;
Compare S' and T': if the values at position i are equal, the score is 1; if the values at position i are unequal and neither is a gap, the score is 0; if one of the values at position i is a gap, the score is -1, i.e. a penalty, as shown in formula (1):
$$\sigma(S'[i],T'[i])=\begin{cases}1, & S'[i]=T'[i]\\ 0, & S'[i]\neq T'[i],\ S'[i]\neq\text{-},\ T'[i]\neq\text{-}\\ -1, & S'[i]=\text{-}\ \text{or}\ T'[i]=\text{-}\end{cases}\tag{1}$$
where σ(S'[i], T'[i]) is the score at position i;
The final score of data S and T is then:
$$\sum_{i=1}^{l}\sigma(S'[i],T'[i])\tag{2}$$
3. The data similarity analysis method according to claim 2, characterized in that step S2 is specifically:
Construct a matrix V of order (n+1)×(m+1) in which, apart from V(0,0), the first column corresponds to data sequence S and the first row corresponds to data sequence T;
The initial conditions of matrix V are shown in formula (3):
$$\begin{aligned}V(0,0)&=0\\ V(i,0)&=V(i-1,0)+\sigma(S[i],\text{-}), && 1\le i\le |S|\\ V(0,j)&=V(0,j-1)+\sigma(\text{-},T[j]), && 1\le j\le |T|\end{aligned}\tag{3}$$
4. The data similarity analysis method according to claim 3, characterized in that step S3 is specifically:
According to the scoring strategy, use formula (4) to add, for each remaining empty cell of matrix V, the values of its known adjacent cells to the corresponding scores, compare the sums, and fill the maximum into the cell:
$$V(i,j)=\max\begin{cases}V(i-1,j-1)+\sigma(S[i],T[j])\\ V(i-1,j)+\sigma(S[i],\text{-})\\ V(i,j-1)+\sigma(\text{-},T[j])\end{cases}\tag{4}$$
5. The data similarity analysis method according to claim 4, characterized in that step S4 is specifically:
Starting from V(|S|,|T|), trace each cell back to its source and mark it with an arrow to obtain the comparison result of data S and T.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510545966.8A CN105183832A (en) | 2015-08-31 | 2015-08-31 | Data similarity analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510545966.8A CN105183832A (en) | 2015-08-31 | 2015-08-31 | Data similarity analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105183832A true CN105183832A (en) | 2015-12-23 |
Family
ID=54905914
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510545966.8A Pending CN105183832A (en) | 2015-08-31 | 2015-08-31 | Data similarity analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105183832A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701256A (en) * | 2016-03-23 | 2016-06-22 | 南京南瑞继保电气有限公司 | Communication point table file comparison method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080104062A1 (en) * | 2004-02-09 | 2008-05-01 | Mailfrontier, Inc. | Approximate Matching of Strings for Message Filtering |
CN102298582A (en) * | 2010-06-23 | 2011-12-28 | 商业对象软件有限公司 | Data searching and matching method and system |
US20150066969A1 (en) * | 2013-08-30 | 2015-03-05 | International Business Machines Corporation | Combined deterministic and probabilistic matching for data management |
-
2015
- 2015-08-31 CN CN201510545966.8A patent/CN105183832A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110941722B (en) | Knowledge graph fusion method based on entity alignment | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
Yu et al. | Overview of SIGHAN 2014 bake-off for Chinese spelling check | |
CN107330100A (en) | Combine the two-way search method of image text of embedded space based on multi views | |
CN110390006B (en) | Question-answer corpus generation method, device and computer readable storage medium | |
CN104991889A (en) | Fuzzy word segmentation based non-multi-character word error automatic proofreading method | |
CN109359172B (en) | Entity alignment optimization method based on graph partitioning | |
CN103617280B (en) | Method and system for mining Chinese event information | |
CN105426711A (en) | Similarity detection method of computer software source code | |
Crochemore et al. | Order-preserving incomplete suffix trees and order-preserving indexes | |
CN106649597A (en) | Method for automatically establishing back-of-book indexes of book based on book contents | |
CN103853710A (en) | Coordinated training-based dual-language named entity identification method | |
US10528664B2 (en) | Preserving and processing ambiguity in natural language | |
CN109117464A (en) | A kind of data similarity detection method based on editing distance | |
CN106021541A (en) | Secondary k-anonymity privacy protection algorithm for differentiating quasi-identifier attributes | |
CN102402561B (en) | Searching method and device | |
CN106169096B (en) | A kind of appraisal procedure of machine learning system learning performance | |
CN104484380A (en) | Personalized search method and personalized search device | |
CN103377237B (en) | The neighbor search method of high dimensional data and fast approximate image searching method | |
CN109325019A (en) | Data correlation relation network establishing method | |
CN107436955A (en) | A kind of English word relatedness computation method and apparatus based on Wikipedia Concept Vectors | |
Faro et al. | A multiple sliding windows approach to speed up string matching algorithms | |
US7548652B1 (en) | Rapid comparison of similar data strings | |
CN105183832A (en) | Data similarity analysis method | |
CN107291730A (en) | Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20151223 |