CN105718430B - A method for calculating similarity using group minimum values as fingerprints
- Publication number
- CN105718430B
- Authority
- CN
- China
- Prior art keywords
- group
- mat
- elements
- fingerprint
- emp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Collating Specific Patterns (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a fingerprint similarity measurement method based on group minimum values. The method removes the dependence on random permutations: fingerprints are generated without any random permutation, yet they can still be used to estimate similarity. This avoids the inefficiency and complexity of random permutations and simplifies the hashing stage of the Minwise algorithm and its variants, serving as an optimization of the hash function in similarity detection algorithms.
Description
Technical Field
The invention relates to a method for calculating similarity based on a grouping minimum value as a fingerprint.
Background
The Web is growing explosively. More and more literature is published online, and document resources on the network are growing geometrically. This brings unprecedented convenience to knowledge sharing and wealth creation and has promoted China's modernization. However, the easy availability of these digital resources has also made illegal copying and plagiarism of documents increasingly rampant, so that serious plagiarism may exist in papers, project applications and similar documents. Meanwhile, with heavy national investment in education and scientific research, many education and science projects are funded, such as National Science Fund projects, doctoral program projects of the Ministry of Education, provincial and municipal fund projects, and various science and technology programs. Because these projects are managed by different departments, the same project may be submitted repeatedly or to several funders at once. Plagiarism, repeated applications and multi-agency applications seriously undermine the objectivity and fairness of project approval and harm the rational allocation of national research funding, so that the funding cannot be used efficiently. Research on document similarity detection is therefore highly significant for preventing plagiarism and upholding academic integrity.
For this reason, search engines, libraries, foundations, research libraries, intellectual property departments and others around the world have invested heavily in manpower, materials and money in researching document similarity detection, aiming to resolve its key scientific problems as soon as possible and to provide good solutions for duplicate checking of papers, project applications, award applications and patents, and for web page deduplication in search engines.
Similarity detection data is massive. Taking National Science Fund applications as an example, more than 170,000 applications were received in 2013, and the number continues to grow quickly each year. As another example, in recent years about 7 million students have graduated from Chinese colleges and universities each year. Most graduation theses must undergo similarity detection, and the detection volume peaks in May. The theses submitted each year number in the tens of thousands or more, and they must be compared not only with the current year's data but also with historical data. Conventional detection methods cannot handle such a volume of documents at all, so a detection mechanism with both high precision and high efficiency is urgently needed to enable similarity comparison over large document collections.
The estimators of Minwise hashing and its variants are built on random permutations. The basic principle is as follows.
Let the universal set be $\Omega = \{0, 1, \ldots, D-1\}$, and let $S_d \subseteq \Omega$ be the set of shingles obtained by shingling document $d$. The similarity of two documents with shingle sets $S_1$ and $S_2$ is defined as:
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}$$
where $f_1 = |S_1|$, $f_2 = |S_2|$ and $a = |S_1 \cap S_2|$. Assume a random permutation over the universal set, $\pi: \Omega \to \Omega$. Using $k$ independent random permutations $\pi_1, \pi_2, \ldots, \pi_k$, the shingle set $S_d$ of any document $d$ is converted into:
$$\left(\min\{\pi_1(S_d)\}, \min\{\pi_2(S_d)\}, \ldots, \min\{\pi_k(S_d)\}\right)$$
The Minwise hash similarity estimator of $R$ rests on the collision probability:
$$\Pr\left(\min\{\pi(S_1)\} = \min\{\pi(S_2)\}\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R \qquad (1)$$
In equation (1), the function $\min\{\pi(X)\}$ is the Minwise hash function.
The unbiased estimator of $R$ is:
$$\hat{R}_M = \frac{1}{k}\sum_{j=1}^{k} 1\left\{\min\{\pi_j(S_1)\} = \min\{\pi_j(S_2)\}\right\} \qquad (2)$$
The variance of the estimator is:
$$\mathrm{Var}(\hat{R}_M) = \frac{1}{k} R (1 - R) \qquad (3)$$
where $k$ is the sample size (or number of experiments).
As formula (2) shows, $k$ is the number of experiments: the $k$ permutations yield $k$ fingerprint values, and the similarity is approximated by the proportion of positions at which two fingerprints agree. The choice of $k$ matters greatly for the estimate. The larger $k$ is, the smaller the variance of the estimate and the higher its precision and recall; the smaller $k$ is, the larger the variance and the lower the precision and recall. Practical systems therefore often require $k > 1000$, which generally reduces the variance to a level acceptable to users.
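The background scheme above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`make_permutations`, `minwise_fingerprint`, `estimate_resemblance`, `estimator_variance`), the tiny universe of 16 elements and the two example sets are all hypothetical, and real systems work over $[0, 2^{32})$ where explicit permutations are impractical.

```python
import random

def make_permutations(universe_size, k, seed=0):
    """Generate k independent random permutations of {0, ..., universe_size - 1}."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(k):
        perm = list(range(universe_size))
        rng.shuffle(perm)  # perm[x] is the image of x under this permutation
        permutations.append(perm)
    return permutations

def minwise_fingerprint(shingles, permutations):
    """For each permutation, keep the minimum permuted value of the set."""
    return [min(perm[x] for x in shingles) for perm in permutations]

def estimate_resemblance(fp1, fp2):
    """Formula (2): fraction of positions where the two fingerprints agree."""
    matches = sum(1 for a, b in zip(fp1, fp2) if a == b)
    return matches / len(fp1)

def estimator_variance(r, k):
    """Formula (3): variance of the estimator for true resemblance r."""
    return r * (1.0 - r) / k

# Hypothetical example sets over a universe of 16 elements.
s1 = {0, 3, 4, 6, 7, 13}
s2 = {0, 1, 3, 6, 7, 13, 14}
perms = make_permutations(universe_size=16, k=1000, seed=42)
r_hat = estimate_resemblance(minwise_fingerprint(s1, perms),
                             minwise_fingerprint(s2, perms))
print("estimated resemblance:", r_hat)
print("variance at R = 5/8, k = 1000:", estimator_variance(5 / 8, 1000))
```

The sketch makes the cost visible: every one of the k permutations must be generated, stored and applied to every element of every set.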
Whether generating or applying random permutations, substantial computation time is required, because each permutation ranges over the full set $[0, 2^{32})$. Permutations that large are generally infeasible in practical systems. An approximate permutation is commonly used instead, reducing $[0, 2^{32})$ modulo a smaller range, but the efficiency gain is limited and precision drops; and since $k > 1000$ random permutations still have to be generated, the computation time remains very long.
The hash fingerprint technique described above is commonly used, but because of the limitations of random permutations, fingerprints are inefficient to generate and the accuracy of the comparison is not high.
Disclosure of Invention
To overcome the problems of the prior art, the invention provides a fingerprint similarity method based on group minimum values. The method removes the dependence on random permutations; that is, it generates fingerprints without any random permutation.
To solve the technical problems, the technical scheme of the invention is as follows:
The fingerprint similarity method based on group minimum values comprises the following steps:
S1, text feature extraction: this step extracts the text feature set $S_d$;
The text is scanned and analyzed, segmented into words, and filtered to remove noise data; the resulting segmentation set is the word set $S_{shgs}$ of the text. Each word in $S_{shgs}$ is mapped to a 32-bit integer with the Rabin function, and the mapped set is named $S_d$;
S2, data grouping: given two documents with feature sets $S_1$ and $S_2$, the elements of $S_1$ and $S_2$ are grouped with a fixed size $m$, the number of groups being $k$, over the universal set $\Omega = \{0, 1, 2, \ldots, 2^{32}-1\}$, $|\Omega| = 2^{32}$. A position whose element is present in the set is marked 1, and a position with no element is marked 0;
S3, selecting the representative of each group to form the fingerprint:
One representative is selected from each group: the maximum, the minimum or the middle element of the group. If a group has no element, it is called an empty group and its representative is set to $e$. The $k$ group representatives of $S_d$ form a set, namely the fingerprint set;
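S4, calculating the similarity $R$ as follows:
$$R = \frac{N_{mat}}{k - N_{emp}} \qquad (4)$$
where $N_{mat}$ is the number of groups whose representatives are equal, and $N_{emp}$ is the number of groups that are empty in both fingerprints; $I_{mat,i}$ is the indicator of a match at the $i$-th group comparison, and $I_{emp,i}$ is the indicator that the $i$-th group is empty on both sides.
Specifically:
$$N_{mat} = \sum_{i=1}^{k} I_{mat,i}, \qquad N_{emp} = \sum_{i=1}^{k} I_{emp,i}$$
$$I_{mat,i} = \begin{cases} 1, & \text{if the } i\text{-th representatives of both fingerprints are not } e \text{ and are equal} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
$$I_{emp,i} = \begin{cases} 1, & \text{if the } i\text{-th representatives of both fingerprints are } e \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$
That is, by equation (7), $I_{mat,i} = 1$ only when neither compared element is empty and the two are equal; by equation (8), $I_{emp,i} = 1$ only when both compared elements are empty.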
Compared with the prior art, the invention has the following advantages:
1) The method omits the generation and storage of the random permutations $\pi_1, \pi_2, \ldots, \pi_k$, which saves storage cost; 2) the method does not need to map fingerprints through random permutations, which reduces the computation time for generating fingerprints.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, but embodiments of the present invention are not limited thereto.
A fingerprint similarity method based on group minimum values specifically comprises the following steps:
Step one, text feature extraction: this step extracts the text feature set $S_d$.
First, the text is scanned and analyzed, segmented with a Chinese word segmentation algorithm, and filtered with a stop word list to remove noise data, giving the segmentation set $S_{shgs}$ of the text. Noise consists of meaningless words in the text, typically high-frequency, low-meaning particles, function words and the like. Each word in $S_{shgs}$ is mapped to a 32-bit integer with the Rabin function, and the mapped set is named $S_d$.
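Step one can be illustrated with a rough Python sketch. Several substitutions are made here and should be read as assumptions, not as the patent's method: whitespace splitting stands in for the Chinese word segmentation algorithm, the tiny `STOP_WORDS` set is a hypothetical stop word list, and `zlib.crc32` stands in for the Rabin function simply because it is a readily available hash that returns an unsigned 32-bit integer.

```python
import zlib

# Hypothetical stop word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def extract_features(text, stop_words=STOP_WORDS):
    """Return the feature set S_d: one 32-bit integer per distinct kept word."""
    # Stand-in for word segmentation: split on whitespace.
    words = text.lower().split()
    # Filter noise words using the stop word list.
    kept = [w for w in words if w not in stop_words]
    # Stand-in for the Rabin function: CRC-32 maps each word into [0, 2**32).
    return {zlib.crc32(w.encode("utf-8")) for w in kept}

features = extract_features("the cat and the dog")
print(sorted(features))
```

Because the result is a set, repeated words contribute a single element, which matches the set-based similarity definition used throughout.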
Step two, data grouping
Given two documents with feature sets $S_1$ and $S_2$, the elements of $S_1$ and $S_2$ are grouped with a fixed size $m$, the number of groups being $k$, over the universal set $\Omega = \{0, 1, 2, \ldots, 2^{32}-1\}$, $|\Omega| = 2^{32}$. A position whose element is present in the set is marked 1, and a position with no element is marked 0.
In this embodiment, $\Omega = \{0, 1, 2, \ldots, 15\}$, $|\Omega| = 16$, $m = 4$, $k = 4$, so that $m = |\Omega| / k$, and
$S_1 = \{0, 3, 4, 6, 7, 13\}$
$S_2 = \{0, 1, 3, 6, 7, 13, 14\}$
After the elements of $S_1$ and $S_2$ are grouped with the fixed size $m$, the two sets are represented as:
$S_1 = \{1\ 0\ 0\ 1 \mid 1\ 0\ 1\ 1 \mid 0\ 0\ 0\ 0 \mid 0\ 1\ 0\ 0\}$
$S_2 = \{1\ 1\ 0\ 1 \mid 0\ 0\ 1\ 1 \mid 0\ 0\ 0\ 0 \mid 0\ 1\ 1\ 0\}$
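The grouped bit representation can be reproduced with a short sketch. The helper name `grouped_bits` is hypothetical, and the sets are taken to be {0, 3, 4, 6, 7, 13} and {0, 1, 3, 6, 7, 13, 14}, which is what the bit vectors above encode over the 16-element universe.

```python
def grouped_bits(elements, universe_size, m):
    """Mark each position of the universe 1 if present, 0 if absent,
    then cut the bit string into consecutive groups of size m."""
    bits = ["1" if x in elements else "0" for x in range(universe_size)]
    return ["".join(bits[i:i + m]) for i in range(0, universe_size, m)]

s1 = {0, 3, 4, 6, 7, 13}
s2 = {0, 1, 3, 6, 7, 13, 14}

print(" | ".join(grouped_bits(s1, 16, 4)))
print(" | ".join(grouped_bits(s2, 16, 4)))
```

For the full 32-bit universe such a bit vector would never be materialized; it is only a way of picturing which group each element falls into.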
Step three, selecting the representative of each group to form the fingerprint.
One representative is selected from each group; the maximum, the minimum or the middle element of the group can serve as the representative. If a group has no element, it is called an empty group and its representative is set to $e$. The $k$ group representatives of $S_d$ form a set, defined here as the fingerprint set.
The following example uses the minimum value as the representative:
Printfinger($S_1$) = [0, 4, e, 13]
Printfinger($S_2$) = [0, 6, e, 13]
Because a group has at most 4 elements, each representative can be stored modulo 4, which saves storage space:
Printfinger($S_1$) = [0, 0, e, 1]
Printfinger($S_2$) = [0, 2, e, 1]
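A minimal sketch of the group-minimum fingerprint follows. The function name `group_min_fingerprint` and the use of Python's `None` for the empty marker e are illustrative assumptions; the group index of an element x is taken to be x // m, matching consecutive groups of size m, and an empty group stays empty after the modulo reduction.

```python
EMPTY = None  # stands for the empty-group marker e

def group_min_fingerprint(elements, universe_size, m, reduce_mod=False):
    """Return one representative per group: the minimum element, or EMPTY."""
    k = universe_size // m
    fingerprint = [EMPTY] * k
    for x in elements:
        g = x // m  # index of the group that x falls into
        if fingerprint[g] is EMPTY or x < fingerprint[g]:
            fingerprint[g] = x
    if reduce_mod:
        # Store only the offset inside the group; empty groups stay EMPTY.
        fingerprint = [v if v is EMPTY else v % m for v in fingerprint]
    return fingerprint

s1 = {0, 3, 4, 6, 7, 13}
s2 = {0, 1, 3, 6, 7, 13, 14}

print(group_min_fingerprint(s1, 16, 4))
print(group_min_fingerprint(s2, 16, 4))
print(group_min_fingerprint(s1, 16, 4, reduce_mod=True))
print(group_min_fingerprint(s2, 16, 4, reduce_mod=True))
```

A single pass over the set produces all k fingerprint values, with no permutation generated or stored.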
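Step four, calculating the similarity $R$.
The calculation formula is as follows:
$$R = \frac{N_{mat}}{k - N_{emp}} \qquad (4)$$
where $N_{mat}$ is the number of groups whose representatives are equal, and $N_{emp}$ is the number of groups that are empty in both fingerprints. The specific calculation formulas are:
$$N_{mat} = \sum_{i=1}^{k} I_{mat,i}, \qquad N_{emp} = \sum_{i=1}^{k} I_{emp,i}$$
$$I_{mat,i} = \begin{cases} 1, & \text{if the } i\text{-th representatives of both fingerprints are not } e \text{ and are equal} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
$$I_{emp,i} = \begin{cases} 1, & \text{if the } i\text{-th representatives of both fingerprints are } e \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$
That is, by equation (7), $I_{mat,i} = 1$ only when neither compared element is empty and the two are equal; by equation (8), $I_{emp,i} = 1$ only when both compared elements are empty.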
In this embodiment, the computation proceeds as follows:
Initially, $N_{mat} = 0$, $N_{emp} = 0$.
Group 1: 0 = 0, so $N_{mat} = N_{mat} + 1$;
now $N_{mat} = 1$, $N_{emp} = 0$.
Group 2: 0 ≠ 2, so $N_{mat}$ is unchanged;
now $N_{mat} = 1$, $N_{emp} = 0$.
Group 3: the empty group $e$ appears on both sides, so $N_{emp} = N_{emp} + 1$;
now $N_{mat} = 1$, $N_{emp} = 1$.
Group 4: 1 = 1, so $N_{mat} = N_{mat} + 1$;
now $N_{mat} = 2$, $N_{emp} = 1$.
The similarity is $R = N_{mat} / (k - N_{emp}) = 2 / (4 - 1) = 2/3 \approx 0.667 = 66.7\%$.
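The group-by-group count above can be written as a short function. This is a sketch under assumptions: `group_similarity` is a hypothetical name, `None` stands for the empty marker e, and the two input lists are the modulo-4 fingerprints of this embodiment with the third group empty on both sides.

```python
def group_similarity(fp1, fp2):
    """Formula (4): R = N_mat / (k - N_emp) for two equal-length fingerprints."""
    k = len(fp1)
    n_mat = 0  # groups where both representatives are present and equal
    n_emp = 0  # groups that are empty on both sides
    for a, b in zip(fp1, fp2):
        if a is None and b is None:
            n_emp += 1
        elif a is not None and b is not None and a == b:
            n_mat += 1
    if k == n_emp:
        raise ValueError("both fingerprints are entirely empty")
    return n_mat / (k - n_emp)

fp1 = [0, 0, None, 1]
fp2 = [0, 2, None, 1]
print(group_similarity(fp1, fp2))
```

A group that is empty on only one side counts as neither a match nor a jointly empty group, so it stays in the denominator and lowers the estimate.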
The correctness of formula (4), $R = N_{mat} / (k - N_{emp})$, is demonstrated below.
It suffices to prove that the mathematical expectation satisfies $E[N_{mat} / (k - N_{emp})] = R$.
(1) $k$ is the number of groups, which is necessarily greater than the number of empty groups $N_{emp}$, so $k - N_{emp} > 0$:
$k - N_{emp} > 0 \Rightarrow \Pr(k - N_{emp} > 0) = 1$
(2) By definition,
$I_{emp,i} = 1 \Rightarrow I_{mat,i} = 0 \qquad (9)$
Compared with the prior art, the invention has the following advantages:
1) The method omits the generation and storage of the random permutations $\pi_1, \pi_2, \ldots, \pi_k$; 2) the method does not need to map fingerprints through random permutations.
Step five, further optimization: the fingerprint can be compressed to 1 bit.
The calculation formula is as follows:
$$R_b = 2E_{mat,b} - 1 \qquad (13)$$
The correctness of equation (13) is demonstrated below.
Let $B$ be a family of independent functions in which each function $b \in B$ maps elements of the domain $[0, |\Omega| - 1]$ to $\{0, 1\}$ with uniform probability, i.e. with probability 50% to 0 and with probability 50% to 1. Hence, for $u, v \in [0, |\Omega| - 1]$, $\Pr_{b \in B}\{b(u) = b(v)\} = 1/2$ when $u \neq v$, and $\Pr_{b \in B}\{b(u) = b(v)\} = 1$ when $u = v$. Combining $B$ with formula (4), let $E_{mat,b}$ denote the probability that the group minima are equal after being mapped to $\{0, 1\}$ by $b$:
$$E_{mat,b} = R + \frac{1}{2}(1 - R) = \frac{1 + R}{2} \qquad (14)$$
Proof: with probability $R$, $\min(i\text{-th group})_1 = \min(i\text{-th group})_2 \neq e$, in which case $b(\min(i\text{-th group})_1) = b(\min(i\text{-th group})_2)$ with probability 100%; with probability $1 - R$, $\min(i\text{-th group})_1 \neq \min(i\text{-th group})_2$ with both not equal to $e$, in which case $b(\min(i\text{-th group})_1) = b(\min(i\text{-th group})_2)$ with probability 1/2.
Thus, solving formula (14) for $R$, the similarity of the compressed fingerprints, denoted $R_b$, is $R_b = 2E_{mat,b} - 1$.
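The 1-bit variant can be sketched as follows. Everything named here is an assumption made for illustration: `make_bit_function` builds a hypothetical b by assigning each value an independent fair bit from a seeded generator (cached so the same value always maps to the same bit), `None` stands for e, and the empirical fraction of matching bits over the non-jointly-empty groups is used as the estimate of $E_{mat,b}$.

```python
import random

def make_bit_function(seed=0):
    """Hypothetical b: maps each value to 0 or 1 with probability 1/2 each."""
    rng = random.Random(seed)
    table = {}
    def b(value):
        if value not in table:
            table[value] = rng.getrandbits(1)
        return table[value]
    return b

def one_bit_similarity(fp1, fp2, b):
    """Estimate E_mat,b from 1-bit representatives, then apply R_b = 2E - 1."""
    k = len(fp1)
    n_emp = 0    # groups empty on both sides
    n_mat_b = 0  # groups where both bits are present and equal
    for u, v in zip(fp1, fp2):
        if u is None and v is None:
            n_emp += 1
        elif u is not None and v is not None and b(u) == b(v):
            n_mat_b += 1
    if k == n_emp:
        raise ValueError("both fingerprints are entirely empty")
    e_mat_b = n_mat_b / (k - n_emp)
    return 2 * e_mat_b - 1

b = make_bit_function(seed=7)
print(one_bit_similarity([0, 4, None, 13], [0, 6, None, 13], b))
```

With only a handful of groups the estimate is very noisy and can even be negative, since unequal minima still collide with probability 1/2; the correction in equation (13) is only meaningful for large k.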
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention shall be included in the protection scope of the claims of the present invention.
Claims (3)
1. A method for calculating similarity of fingerprints without random permutation, characterized by comprising the following steps:
S1, text feature extraction: extracting the text feature set $S_d$;
S2, data grouping: given two documents with feature sets $S_1$ and $S_2$, the elements of $S_1$ and $S_2$ are grouped with a fixed size $m$, the number of groups being $k$, over the universal set $\Omega = \{0, 1, 2, \ldots, 2^{32}-1\}$, $|\Omega| = 2^{32}$; positions whose element is present in the set are marked 1 and positions with no element are marked 0;
S3, selecting the representative of each group to form the fingerprint:
one representative is selected from each group, the representative being the maximum, the minimum or the middle element of the group; if a group has no element, it is called an empty group and its representative is set to $e$; the $k$ group representatives of $S_d$ form a set, namely the fingerprint set;
S4, calculating the similarity $R$ as follows:
$$R = \frac{N_{mat}}{k - N_{emp}} \qquad (4)$$
where $N_{mat}$ denotes the number of groups in which the fingerprint sets of $S_1$ and $S_2$ have equal representatives, and $N_{emp}$ denotes the number of groups that are empty in the fingerprint sets of both $S_1$ and $S_2$; $I_{mat,i}$ is the indicator of a match at the $i$-th group comparison; $I_{emp,i}$ is the indicator that the $i$-th group is empty on both sides;
specifically:
$$N_{mat} = \sum_{i=1}^{k} I_{mat,i}, \qquad N_{emp} = \sum_{i=1}^{k} I_{emp,i}$$
by equation (7), when neither compared element is $e$, i.e. neither is empty, and the two elements are equal, then $I_{mat,i} = 1$; otherwise $I_{mat,i} = 0$;
by equation (8), when both compared elements are $e$, i.e. both are empty, then $I_{emp,i} = 1$; otherwise $I_{emp,i} = 0$.
2. The method for calculating similarity of fingerprints without random permutation according to claim 1, wherein in step S2 the data are divided into $k$ groups, positions whose element is present in the set being marked 1 and positions with no element being marked 0; and the minimum value in each group is selected as the representative to form the fingerprint.
3. The method for calculating similarity of fingerprints without random permutation according to claim 2, further comprising, after calculating the similarity $R$ in step S4, compressing the fingerprint to 1 bit, the similarity estimator after compression being denoted $R_b$, with the formula:
$$R_b = 2E_{mat,b} - 1$$
where $E_{mat,b}$ is the probability that the group minima are equal after being mapped to $\{0, 1\}$ by the function $b$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610019243.9A CN105718430B (en) | 2016-01-13 | 2016-01-13 | A kind of method for calculating similarity as fingerprint based on packet minimum value |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610019243.9A CN105718430B (en) | 2016-01-13 | 2016-01-13 | A kind of method for calculating similarity as fingerprint based on packet minimum value |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105718430A CN105718430A (en) | 2016-06-29 |
CN105718430B true CN105718430B (en) | 2018-05-04 |
Family
ID=56147793
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610019243.9A Active CN105718430B (en) | 2016-01-13 | 2016-01-13 | A kind of method for calculating similarity as fingerprint based on packet minimum value |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105718430B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111104484B (en) * | 2019-12-19 | 2021-09-03 | 南京中孚信息技术有限公司 | Text similarity detection method and device and electronic equipment |
CN111444325B (en) * | 2020-03-30 | 2023-06-20 | 湖南工业大学 | Method for measuring document similarity by position coding single random replacement hash |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222085A (en) * | 2011-05-17 | 2011-10-19 | 华中科技大学 | Data de-duplication method based on combination of similarity and locality |
CN102682104A (en) * | 2012-05-04 | 2012-09-19 | 中南大学 | Method for searching similar texts and link bit similarity measuring algorithm |
CN103020174A (en) * | 2012-11-28 | 2013-04-03 | 华为技术有限公司 | Similarity analysis method, device and system |
CN104636325A (en) * | 2015-02-06 | 2015-05-20 | 中南大学 | Document similarity determining method based on maximum likelihood estimation |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100070509A1 (en) * | 2008-08-15 | 2010-03-18 | Kai Li | System And Method For High-Dimensional Similarity Search |
Non-Patent Citations (2)
Title |
---|
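Near-duplicate document detection with improved similarity measurement; YUAN Xin-pan et al.; Springer; 2012-12-31; pp. 2231-2237 * |
Fine-grained similarity detection system based on grouped fingerprints; SHENG Xinhai et al.; Journal of Hunan University of Technology; 2014-11-30; Vol. 28, No. 6; pp. 81-85 * |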
Also Published As
Publication number | Publication date |
---|---|
CN105718430A (en) | 2016-06-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20200529. Address after: Room g0044, headquarters building, Changsha Zhongdian Software Park Co., Ltd., No. 39, Jianshan Road, Changsha hi tech Development Zone, Changsha City, Hunan Province. Patentee after: HUNAN YUN ZHI IOT NETWORKTECHNOLOGY Co.,Ltd. Address before: 412000 Taishan Road, Tianyuan District, Hunan, No. 88, No. Patentee before: HUNAN University OF TECHNOLOGY |