CN105718430B

CN105718430B - A method for calculating similarity using grouped minimum values as fingerprints

Info

Publication number: CN105718430B
Application number: CN201610019243.9A
Authority: CN
Inventors: 袁鑫攀; 何频捷; 张澎; 汪灿飞; 向平; 向一平; 高灿
Original assignee: Hunan University of Technology
Current assignee: Hunan Yun Zhi Iot Networktechnology Co ltd
Priority date: 2016-01-13
Filing date: 2016-01-13
Publication date: 2018-05-04
Anticipated expiration: 2036-01-13
Also published as: CN105718430A

Abstract

The invention discloses a fingerprint similarity measurement method based on grouped minimum values. The method removes the dependence on random permutations: fingerprints are generated without random permutation, yet they can still be used to estimate similarity. This avoids the inefficiency and complexity of random permutation, simplifies the hashing stage of the Minwise algorithm and its variants, and serves as an optimization of the hash function used in detection algorithms.

Description

Method for calculating similarity by taking minimum grouping value as fingerprint

Technical Field

The invention relates to a method for calculating similarity based on a grouping minimum value as a fingerprint.

Background

The Web is undergoing explosive growth, and more and more literature is published on the internet. As a result, document resources on the network are growing geometrically, which provides unprecedented convenience for sharing knowledge and creating wealth and also promotes the modernization of China. However, while these digital resources help people, their easy availability makes illegal copying and plagiarism of documents increasingly rampant, so serious plagiarism may exist in papers, project applications and similar documents. Meanwhile, as the nation invests heavily in education and scientific research, many kinds of education and science projects are funded, such as national science fund projects, doctoral program projects of the Ministry of Education, provincial and municipal fund projects, and various science and technology plans. Because these projects are managed by different functional departments, project applications are sometimes submitted repeatedly or to several funding bodies at once. Plagiarism, repeated submission and multi-channel submission of applications seriously undermine the objectivity and fairness of project approval and harm the reasonable allocation of national research funding, so that the funding cannot be used efficiently. Research on document similarity detection is therefore highly significant for preventing plagiarism and upholding academic integrity.
For this reason, search engines, libraries, foundations, research libraries, intellectual property departments and similar bodies around the world have invested substantial manpower, material and financial resources in document similarity detection, aiming to solve its key scientific problems as soon as possible and to provide good solutions for checking duplication in papers, project applications, award applications and patents, and for de-duplicating web pages in search engines.

Similarity detection data is massive in scale. Taking national science fund applications as an example, more than 170,000 applications were submitted in 2013, and the number continues to grow quickly every year. As another example, in recent years colleges and universities in China have produced about 7 million graduates annually, and most graduation theses must undergo similarity detection. The detection load peaks in May, the number of theses each year runs to tens of thousands or more, and each thesis must be compared not only with the current year's data but also with historical data. Conventional detection methods cannot handle this volume of documents at all, so a detection mechanism with excellent precision and efficiency is urgently needed to make similarity comparison of massive document collections feasible.

The estimators of Minwise hashing and its variants are built on random permutations. The basic principle is as follows.

Let the universal set be Ω = {0, 1, ..., D−1}, and let S_d be the set of shingles obtained by shingling document d. The similarity R of two documents S₁ and S₂ is defined as:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}$$

where f₁ = |S₁|, f₂ = |S₂|, and a = |S₁ ∩ S₂|. Assume a random permutation over the universal set, π: Ω → Ω, Ω = {0, 1, ..., D−1}. Using k independent random permutations π₁, π₂, ..., π_k, the shingle set S_d of any document d is converted into:

$$\left(\min\{\pi_1(S_d)\},\ \min\{\pi_2(S_d)\},\ \ldots,\ \min\{\pi_k(S_d)\}\right)$$

The Minwise hash similarity estimator of R is based on:

$$\Pr\left(\min\{\pi(S_1)\} = \min\{\pi(S_2)\}\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R \qquad (1)$$

In equation (1), the function min{π(X)} is the Minwise hash function.

The unbiased estimator of R is:

$$\hat{R}_M = \frac{1}{k}\sum_{j=1}^{k} 1\left\{\min\{\pi_j(S_1)\} = \min\{\pi_j(S_2)\}\right\} \qquad (2)$$

The variance of the estimate is:

$$\mathrm{Var}(\hat{R}_M) = \frac{1}{k}R(1-R) \qquad (3)$$

where k is the sample size (the number of experiments).

As shown in formula (2), k is the number of experiments: the k permutations π yield k fingerprints, and the similarity is approximated by the proportion of fingerprints that are equal. The value of k matters greatly for the estimate. The larger k is, the smaller the variance of the estimate and the higher its precision and recall; the smaller k is, the larger the variance and the lower the precision and recall. Practical systems therefore often require k > 1000, which generally reduces the variance to a range acceptable to the user.
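The baseline described above can be sketched as follows. This is a minimal illustration, assuming a small universe so that each permutation can be materialized as a list; the function names, the seed and the example sets are assumptions for illustration and do not come from the source.

```python
import random

def jaccard(s1, s2):
    """Exact similarity R = |S1 ∩ S2| / |S1 ∪ S2|."""
    return len(s1 & s2) / len(s1 | s2)

def minwise_estimate(s1, s2, universe_size, k, seed=0):
    """Estimate R with k explicit random permutations of the universe."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        # One random permutation pi of {0, ..., universe_size - 1}.
        pi = list(range(universe_size))
        rng.shuffle(pi)
        # Minwise hash: the smallest permuted value in each set.
        if min(pi[x] for x in s1) == min(pi[x] for x in s2):
            matches += 1
    # Fraction of the k experiments in which the minwise hashes agree.
    return matches / k

# Illustrative sets over a 16-element universe.
s1 = {0, 3, 4, 6, 7, 13}
s2 = {0, 1, 3, 6, 7, 13, 14}
exact = jaccard(s1, s2)
approx = minwise_estimate(s1, s2, universe_size=16, k=2000)
print(exact, approx)
```

Every experiment shuffles the whole universe, which is affordable for 16 elements but is exactly the cost that becomes prohibitive when the universe is [0, 2³²).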

Both generating and applying random permutations take a significant amount of computation time, because each permutation ranges over the whole universal set [0, 2³²). Permutations this large are generally not available in practical systems. An approximate permutation is usually used instead, reducing [0, 2³²) modulo a smaller value range, but the efficiency gain is limited and precision drops. Even then, k > 1000 random permutations still have to be generated, so the computation time remains very long.
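A rough sense of the scale involved, assuming (purely for illustration; the source does not say how permutations are stored) that each permutation is kept as an explicit lookup table with one 32-bit entry per element of the universal set:

```latex
\begin{aligned}
\text{one permutation:}\quad & 2^{32}\ \text{entries} \times 4\ \text{bytes} = 2^{34}\ \text{bytes} = 16\ \text{GiB} \\
k = 1000\ \text{permutations:}\quad & 1000 \times 16\ \text{GiB} = 16000\ \text{GiB} \approx 15.6\ \text{TiB}
\end{aligned}
```

Under that assumption, even the lower bound k = 1000 is far beyond what a practical system can hold, which is why approximate permutations are used.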

The hash fingerprint technique described above is commonly used, but because of the limitations of random permutation, fingerprint generation is inefficient and the accuracy of similarity detection is not high.

Disclosure of Invention

To overcome the problems of the prior art, the invention provides a fingerprint similarity method based on grouped minimum values. The method removes the dependence on random permutations; that is, it generates fingerprints without random permutation.

In order to solve the technical problems, the technical scheme of the invention is as follows:

The fingerprint similarity method based on grouped minimum values comprises the following steps:

S1, text feature extraction: extract the text feature set S_d;

Scan and analyze the text information, segment it into words, and filter out noise data; the resulting segmentation set is the word set S_shgs of the text. Map each word in S_shgs to a 32-bit integer using the Rabin function, and name the mapped set S_d;

S2, data grouping: given 2 documents S₁ and S₂, group the elements of S₁ and S₂ with a fixed group size m, the number of groups being k; the universal set is Ω = {0, 1, 2, …, 2³²−1}, |Ω| = 2³²; elements of Ω that are present in the set are marked 1, and elements that are absent are marked 0;

S3, selecting a representative from each group to form a fingerprint:

Select one representative from each group, the representative being the maximum value, the minimum value or the middle element of the group; if a group contains no element, it is called an empty group and its representative is set to e; the k group representatives of S_d form a set, namely the fingerprint set;

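S4, calculating the similarity R as follows:

R = N_mat / (k − N_emp) (4)

where N_mat is the number of groups whose representatives are equal, N_emp is the number of groups that are empty in both documents, I_mat,i is the match indicator for the comparison of the i-th group, and I_emp,i is the indicator that the i-th group is empty in both documents.

Specifically:

$$N_{mat} = \sum_{i=1}^{k} I_{mat,i} \qquad (5)$$

$$N_{emp} = \sum_{i=1}^{k} I_{emp,i} \qquad (6)$$

$$I_{mat,i} = \begin{cases} 1, & \text{if both representatives of the } i\text{-th group are not } e \text{ and are equal} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

$$I_{emp,i} = \begin{cases} 1, & \text{if both representatives of the } i\text{-th group are } e \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

Equation (7) states: when neither of the two compared elements is e, i.e. neither group is empty, and the two elements are equal, then I_mat,i = 1; otherwise I_mat,i = 0.

Equation (8) states: when both compared elements are e, i.e. both groups are empty, then I_emp,i = 1; otherwise I_emp,i = 0.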
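A minimal end-to-end sketch of steps S2 to S4 on the full 32-bit universe, assuming the minimum is chosen as the representative and that k divides 2³². The function names, the `EMPTY` sentinel and the synthetic random sets are illustrative assumptions and are not part of the source.

```python
import random

UNIVERSE = 2 ** 32
EMPTY = None  # stands for the empty-group symbol e

def group_min_fingerprint(s, k):
    """Minimum of each of the k fixed-size groups of [0, 2**32); EMPTY if none."""
    m = UNIVERSE // k  # fixed group size; assumes k divides 2**32
    reps = [EMPTY] * k
    for x in s:
        g = x // m  # index of the group containing x
        if reps[g] is EMPTY or x < reps[g]:
            reps[g] = x
    return reps

def estimate_similarity(fp1, fp2):
    """R = N_mat / (k - N_emp), formula (4)."""
    k = len(fp1)
    n_mat = sum(1 for a, b in zip(fp1, fp2) if a is not EMPTY and a == b)
    n_emp = sum(1 for a, b in zip(fp1, fp2) if a is EMPTY and b is EMPTY)
    return n_mat / (k - n_emp)

# Synthetic stand-ins for two hashed documents (not data from the source).
rng = random.Random(1)
shared = {rng.randrange(UNIVERSE) for _ in range(3000)}
doc1 = shared | {rng.randrange(UNIVERSE) for _ in range(1000)}
doc2 = shared | {rng.randrange(UNIVERSE) for _ in range(1000)}

exact = len(doc1 & doc2) / len(doc1 | doc2)
approx = estimate_similarity(group_min_fingerprint(doc1, 1024),
                             group_min_fingerprint(doc2, 1024))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

No permutation is generated or stored here; the sketch relies on the 32-bit hash values already being spread roughly uniformly over the groups, which is an assumption about the hash function rather than something the code enforces.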

Compared with the prior art, the invention has the advantages that:

1) The method omits the generation and storage of the random permutations π₁, π₂, ..., π_k, which saves storage cost; 2) The method does not need to map fingerprints through random permutations, which reduces the computation time for generating fingerprints.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings, but embodiments of the present invention are not limited thereto.

A fingerprint similarity method based on grouped minimum values specifically comprises the following steps:

Step one, text feature extraction: this step extracts the text feature set S_d;

First, scan and analyze the text information, segment it with a Chinese word segmentation algorithm, and filter out noise data with a stop-word list to obtain the segmentation set S_shgs of the text. Noise consists of meaningless words in the text, generally high-frequency, low-meaning words such as auxiliary words and function words. Map each word in S_shgs to a 32-bit integer using the Rabin function, and name the mapped set S_d.
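Step one might look like the sketch below. It is an approximation under stated substitutions: a regular-expression tokenizer stands in for the Chinese word segmentation algorithm, a tiny hypothetical stop-word list stands in for a real one, and `zlib.crc32` stands in for the Rabin function as a readily available 32-bit hash. None of these choices come from the source.

```python
import re
import zlib

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def extract_features(text, stop_words=STOP_WORDS):
    """Return the feature set S_d of 32-bit integers for one document."""
    # Regex tokenization stands in for a Chinese word segmenter.
    words = re.findall(r"\w+", text.lower())
    # Drop noise words to obtain the word set S_shgs.
    s_shgs = {w for w in words if w not in stop_words}
    # zlib.crc32 stands in for the Rabin function; it yields an
    # unsigned integer in [0, 2**32 - 1].
    return {zlib.crc32(w.encode("utf-8")) for w in s_shgs}

s_d = extract_features("The cat and the dog")
print(sorted(s_d))
```

The only property the later steps depend on is that every element of S_d lies in Ω = {0, …, 2³²−1}.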

Step two, data grouping

Given 2 documents S₁ and S₂, group the elements of S₁ and S₂ with a fixed group size m, the number of groups being k. The universal set is Ω = {0, 1, 2, …, 2³²−1}, |Ω| = 2³². Elements of Ω that are present in the set are marked 1, and elements that are absent are marked 0.

In this embodiment, Ω = {0, 1, 2, …, 15}, |Ω| = 16, m = 4, k = 4, so that m × k = |Ω|,

S₁ = {0, 3, 4, 6, 7, 13}

S₂ = {0, 1, 3, 6, 7, 13, 14}

After the elements of S₁ and S₂ are grouped with fixed size m, the sets are represented as the following bit vectors, where | separates the groups:

S₁ = {1 0 0 1|1 0 1 1|0 0 0 0|0 1 0 0}

S₂ = {1 1 0 1|0 0 1 1|0 0 0 0|0 1 1 0}
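The grouping in this embodiment can be reproduced with a short sketch. It uses the elements that appear in the bit vectors above, S₁ = {0, 3, 4, 6, 7, 13} and S₂ = {0, 1, 3, 6, 7, 13, 14}; the function names are illustrative assumptions.

```python
def to_bit_groups(s, universe_size, m):
    """Mark each universe element 1 if it is in s, else 0, then cut into groups of m."""
    bits = [1 if x in s else 0 for x in range(universe_size)]
    return [bits[i:i + m] for i in range(0, universe_size, m)]

def render(groups):
    """Format groups in the same 'b b b b|b b b b|...' layout as the text."""
    return "|".join(" ".join(str(b) for b in g) for g in groups)

S1 = {0, 3, 4, 6, 7, 13}
S2 = {0, 1, 3, 6, 7, 13, 14}
v1 = render(to_bit_groups(S1, universe_size=16, m=4))
v2 = render(to_bit_groups(S2, universe_size=16, m=4))
print(v1)
print(v2)
```

Materializing the bit vector is only practical for a toy universe such as this one. For Ω = [0, 2³²) the group of an element x can be found directly as x // m without building the vector.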

Step three, selecting a representative from each group to form the fingerprint.

One representative is selected from each group; the maximum value, the minimum value or the middle element of the group can serve as its representative. If a group contains no element, it is called an empty group and its representative is set to e. The k group representatives of S_d form a set, defined here as the fingerprint set.

The following example uses the minimum value as the representative:

Printfinger(S₁) = [0, 4, e, 13]

Printfinger(S₂) = [0, 6, e, 13]

Because a group has at most 4 elements, each representative can be reduced modulo 4, which saves storage space:

Printfinger(S₁) = [0, 0, e, 1]

Printfinger(S₂) = [0, 2, e, 1]
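A minimal sketch of this step with the minimum as the representative, assuming every element lies in Ω and that m divides |Ω|. `None` plays the role of e; the function name and the `reduce_mod_m` flag are illustrative assumptions. The input sets are the elements marked 1 in the embodiment's bit vectors.

```python
EMPTY = None  # plays the role of the empty-group symbol e

def fingerprint(s, universe_size, m, reduce_mod_m=False):
    """Return the k group minima of s, with EMPTY for groups that have no element."""
    k = universe_size // m
    reps = [EMPTY] * k
    for x in s:
        g = x // m  # index of the group that contains x
        if reps[g] is EMPTY or x < reps[g]:
            reps[g] = x
    if reduce_mod_m:
        # Store only the offset inside the group to save space.
        reps = [r if r is EMPTY else r % m for r in reps]
    return reps

S1 = {0, 3, 4, 6, 7, 13}
S2 = {0, 1, 3, 6, 7, 13, 14}
print(fingerprint(S1, 16, 4), fingerprint(S2, 16, 4))
print(fingerprint(S1, 16, 4, reduce_mod_m=True),
      fingerprint(S2, 16, 4, reduce_mod_m=True))
```

Reducing modulo m loses no information for matching, because two representatives are only ever compared within the same group.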

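Step four, calculating the similarity R.

The calculation formula is as follows:

R = N_mat / (k − N_emp) (4)

where N_mat is the number of groups whose representatives are equal and N_emp is the number of groups that are empty in both documents. They are calculated as follows:

$$N_{mat} = \sum_{i=1}^{k} I_{mat,i} \qquad (5)$$

$$N_{emp} = \sum_{i=1}^{k} I_{emp,i} \qquad (6)$$

$$I_{mat,i} = \begin{cases} 1, & \text{if both representatives of the } i\text{-th group are not } e \text{ and are equal} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

$$I_{emp,i} = \begin{cases} 1, & \text{if both representatives of the } i\text{-th group are } e \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

Formula (7) states: when neither of the two compared elements is e, i.e. neither group is empty, and the two elements are equal, then I_mat,i = 1; otherwise I_mat,i = 0.

Formula (8) states: when both compared elements are e, i.e. both groups are empty, then I_emp,i = 1; otherwise I_emp,i = 0.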

In this embodiment:

Initially, N_mat = 0, N_emp = 0.

Group 1: 0 = 0, so N_mat = N_mat + 1. Now N_mat = 1, N_emp = 0.

Group 2: 0 ≠ 2, so N_mat is unchanged. Now N_mat = 1, N_emp = 0.

Group 3: the empty group e appears on both sides, so N_emp = N_emp + 1. Now N_mat = 1, N_emp = 1.

Group 4: 1 = 1, so N_mat = N_mat + 1. Now N_mat = 2, N_emp = 1.

The similarity is R = N_mat / (k − N_emp) = 2 / (4 − 1) = 2/3 ≈ 0.667 = 66.7%
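The group-by-group count above can be checked with a short sketch of formula (4). The fingerprints are written out literally, with `None` standing for e; the function name and the error handling are illustrative assumptions.

```python
EMPTY = None  # plays the role of the empty-group symbol e

def similarity(fp1, fp2):
    """R = N_mat / (k - N_emp) for two fingerprints of equal length k."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same number of groups")
    k = len(fp1)
    n_mat = 0
    n_emp = 0
    for a, b in zip(fp1, fp2):
        if a is EMPTY and b is EMPTY:
            n_emp += 1   # group is empty in both documents
        elif a is not EMPTY and a == b:
            n_mat += 1   # both non-empty and representatives equal
    if k == n_emp:
        raise ValueError("every group is empty in both fingerprints")
    return n_mat / (k - n_emp)

r = similarity([0, 0, EMPTY, 1], [0, 2, EMPTY, 1])
print(round(r, 3))
```

A group that is empty in only one document counts toward neither N_mat nor N_emp, so it stays in the denominator as a mismatch.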

The correctness of formula (4), R = N_mat / (k − N_emp), is proved below.

It suffices to prove that the mathematical expectation satisfies E[N_mat / (k − N_emp)] = R.

(1) k is the number of groups, which is necessarily greater than the number of empty groups N_emp, so k − N_emp > 0:

k − N_emp > 0 ⟹ Pr(k − N_emp > 0) = 1

(2) By definition,

I_emp,i = 1 ⟹ I_mat,i = 0 (9)
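One way the argument can be completed from (9) is sketched below. It rests on an assumption that is not stated in the surviving text: conditional on the pattern of jointly empty groups, each group that is not jointly empty has matching representatives with probability R.

```latex
\begin{aligned}
E\left[N_{mat} \mid I_{emp,1},\ldots,I_{emp,k}\right]
  &= \sum_{i:\, I_{emp,i}=0} E\left[I_{mat,i} \mid I_{emp,i}=0\right]
     && \text{(jointly empty groups contribute 0 by (9))} \\
  &= R\,(k - N_{emp})
     && \text{(assumed match probability } R\text{)} \\[4pt]
E\left[\frac{N_{mat}}{k - N_{emp}} \,\middle|\, I_{emp,1},\ldots,I_{emp,k}\right]
  &= \frac{R\,(k - N_{emp})}{k - N_{emp}} = R
     && \text{(valid since } \Pr(k - N_{emp} > 0) = 1\text{)} \\[4pt]
E\left[\frac{N_{mat}}{k - N_{emp}}\right]
  &= E\left[R\right] = R
     && \text{(law of total expectation)}
\end{aligned}
```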

Compared with the prior art, the invention has the following advantages:

1) The method omits the generation and storage of the random permutations π₁, π₂, ..., π_k; 2) The method does not need to map fingerprints through random permutations.

Step five, further optimization: the fingerprint can be compressed to 1 bit.

The calculation formula is:

R_b = 2E_mat,b − 1 (13)

The correctness of equation (13) is proved below.

Let B be a family of independently distributed functions in which each function b ∈ B maps the elements of the domain [0, |Ω|−1] to {0, 1} with uniform probability, i.e. to 0 with probability 50% and to 1 with probability 50%. Therefore, for u, v ∈ [0, |Ω|−1], Pr_{b∈B}{b(u) = b(v)} = 1/2 when u ≠ v, and Pr_{b∈B}{b(u) = b(v)} = 1 when u = v. Combining B with formula (4), let E_mat,b denote the probability that the group minima are equal after being mapped to {0, 1} by the function b. It is given by:

$$E_{mat,b} = R + \frac{1}{2}(1 - R) = \frac{1 + R}{2} \qquad (14)$$

Proof: with probability R, min(i-th group)₁ = min(i-th group)₂ ≠ e, and then b(min(i-th group)₁) = b(min(i-th group)₂) ≠ e with probability 100%; with probability 1 − R, min(i-th group)₁ ≠ min(i-th group)₂ ≠ e, and then b(min(i-th group)₁) = b(min(i-th group)₂) ≠ e with probability 1/2.

Thus, from formula (14), the similarity of the compressed fingerprints, denoted R_b, is: R_b = 2E_mat,b − 1.
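A sketch of the 1-bit variant under stated assumptions: the low bit of a salted `zlib.crc32` is used as a hypothetical stand-in for a function b from the family B, and E_mat,b is estimated as the fraction of groups (excluding jointly empty ones) whose bits are equal. Neither choice is specified by the source.

```python
import zlib

EMPTY = None  # plays the role of the empty-group symbol e

def one_bit(value, salt=b"b-function"):
    """Hypothetical stand-in for b: map a value to 0 or 1 via the low bit of a CRC32."""
    return zlib.crc32(salt + str(value).encode("ascii")) & 1

def compress(fp):
    """Compress every non-empty representative to a single bit."""
    return [r if r is EMPTY else one_bit(r) for r in fp]

def similarity_1bit(bits1, bits2):
    """R_b = 2 * E_mat,b - 1, with E_mat,b estimated from the compressed fingerprints."""
    k = len(bits1)
    pairs = list(zip(bits1, bits2))
    n_emp = sum(1 for a, b in pairs if a is EMPTY and b is EMPTY)
    n_eq = sum(1 for a, b in pairs
               if a is not EMPTY and b is not EMPTY and a == b)
    if k == n_emp:
        raise ValueError("every group is empty in both fingerprints")
    e_mat_b = n_eq / (k - n_emp)
    return 2 * e_mat_b - 1

same = similarity_1bit(compress([0, 4, EMPTY, 13]), compress([0, 4, EMPTY, 13]))
print(same)
```

Because unequal minima still collide with probability 1/2 after compression, the estimate is noisier than the uncompressed one and can be negative for small k.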

The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention shall be included in the protection scope of the claims of the present invention.

Claims

1. A method for calculating similarity using fingerprints generated without random permutation, characterized by comprising the following steps:

S1, text feature extraction: extract the text feature set S_d;

S2, data grouping: given 2 documents S₁ and S₂, group the elements of S₁ and S₂ with a fixed group size m, the number of groups being k; the universal set is Ω = {0, 1, 2, …, 2³²−1}, |Ω| = 2³²; elements of Ω that are present in the set are marked 1, and elements that are absent are marked 0;

S3, selecting a representative from each group to form a fingerprint:

select one representative from each group, the representative being the maximum value, the minimum value or the middle element of the group; if a group contains no element, it is called an empty group and its representative is set to e; the k group representatives of S_d form a set, namely the fingerprint set;

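S4, calculating the similarity R as follows:

R = N_mat / (k − N_emp) (4)

where N_mat is the number of groups in which the fingerprint sets of S₁ and S₂ have equal representatives, N_emp is the number of groups that are empty in the fingerprint sets of both S₁ and S₂, I_mat,i is the match indicator for the comparison of the i-th group, and I_emp,i is the indicator that the i-th group is empty in both;

specifically:

$$N_{mat} = \sum_{i=1}^{k} I_{mat,i} \qquad (5)$$

$$N_{emp} = \sum_{i=1}^{k} I_{emp,i} \qquad (6)$$

$$I_{mat,i} = \begin{cases} 1, & \text{if both representatives of the } i\text{-th group are not } e \text{ and are equal} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

$$I_{emp,i} = \begin{cases} 1, & \text{if both representatives of the } i\text{-th group are } e \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$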

2. The method for calculating similarity using fingerprints generated without random permutation according to claim 1, wherein in step S2 the data is divided into k groups, elements present in the set are marked 1 and absent elements are marked 0; and the minimum value in each group is selected as the representative to form the fingerprint.

3. The method for calculating similarity using fingerprints generated without random permutation according to claim 2, further comprising, after the similarity R is calculated in step S4, compressing the fingerprint to 1 bit, the similarity estimator after compression being denoted R_b and given by:

R_b = 2E_mat,b − 1

where E_mat,b denotes the probability that the group minima are equal after being mapped to {0, 1} by the function b.