CN105718430B - Method for calculating similarity using the group minimum value as a fingerprint - Google Patents

Info

Publication number
CN105718430B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610019243.9A
Other languages
Chinese (zh)
Other versions
CN105718430A (en)
Inventor
袁鑫攀
何频捷
张澎
汪灿飞
向平
向一平
高灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Yun Zhi Iot Networktechnology Co ltd
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a fingerprint similarity measurement method based on group minimum values. The method removes the dependence on random permutations: fingerprints are generated without random permutation, yet they can still be used to estimate similarity. This avoids the inefficiency and complexity of random permutations and simplifies the hashing stage of the Minwise algorithm and its variants. The method is an optimization of the hash function used in similarity detection algorithms.

Description

Method for calculating similarity by taking minimum grouping value as fingerprint
Technical Field
The invention relates to a method for calculating similarity based on a grouping minimum value as a fingerprint.
Background
The Web is growing explosively, and more and more literature is published on the Internet. Document resources on the network are growing geometrically, which provides unprecedented convenience for sharing knowledge and creating wealth and also promotes China's modernization. However, the easy availability of these digital resources has also made illegal copying and plagiarism of documents increasingly rampant, so that serious plagiarism may exist in papers, project applications, and similar documents. Meanwhile, as the state invests heavily in education and scientific research, many kinds of education and science projects are funded, for example National Science Fund projects, doctoral program projects of the Ministry of Education, provincial and municipal fund projects, and various science and technology plans. Because these projects are managed by different functional departments, the same project is sometimes submitted repeatedly or to several funding bodies at once. Plagiarism, repeated submission, and multi-agency submission of applications seriously undermine the objectivity and fairness of project approval and distort the allocation of national research funding, so that the funding cannot be used efficiently. Research on document similarity detection is therefore of great significance for preventing plagiarism and upholding academic integrity.
For this reason, search engines, libraries, foundations, research repositories, intellectual property offices, and similar bodies around the world have invested heavily in manpower, material, and money to study document similarity detection. The aim is to solve its key scientific problems as soon as possible and to provide a good solution for duplicate checking of papers, project applications, award applications, and patents, and for web page deduplication in search engines.
Similarity detection data is massive. Taking National Science Fund applications as an example, more than 170,000 applications were received in 2013, and the number continues to grow quickly every year. As another example, in recent years about 7 million students have graduated from Chinese colleges and universities each year, and most graduation theses must undergo similarity detection. Detection volume peaks in May, the number of papers each year is more than tens of thousands, and each paper must be compared not only with the current year's data but also with historical data. Conventional detection methods cannot handle document collections of this size, so a detection mechanism with both high precision and high efficiency is urgently needed to make similarity comparison over massive document collections practical.
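The estimators of Minwise hashing and its variants are built on random permutations. The basic principle of Minwise hashing and its variants is as follows.
Let the universe be $\Omega = \{0, 1, \ldots, D-1\}$, and let $S_d \subseteq \Omega$ be the set of shingles obtained by shingling document d. The similarity of documents $S_1$ and $S_2$ is defined as:
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}$$
where $f_1 = |S_1|$, $f_2 = |S_2|$, $a = |S_1 \cap S_2|$. Assume a random independent permutation over the universe, $\pi: \Omega \to \Omega$, $\Omega = \{0, 1, \ldots, D-1\}$. With k independent random permutations $\pi_1, \pi_2, \ldots, \pi_k$, the shingle set of any document d is converted into:
$$\left(\min\{\pi_1(S_d)\},\ \min\{\pi_2(S_d)\},\ \ldots,\ \min\{\pi_k(S_d)\}\right)$$
The Minwise hash similarity estimator of R rests on:
$$\Pr\left(\min\{\pi(S_1)\} = \min\{\pi(S_2)\}\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R \tag{1}$$
In equation (1), the function $\min\{\pi(X)\}$ is the Minwise hash function.
The unbiased estimator of R is:
$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\left\{\min\{\pi_j(S_1)\} = \min\{\pi_j(S_2)\}\right\} \tag{2}$$
The variance of the estimator is:
$$\mathrm{Var}(\hat{R}_M) = \frac{1}{k} R (1 - R) \tag{3}$$
where k is the sample size (or number of experiments).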
As formula (2) shows, k is the number of experiments: the k permutations π yield k fingerprints, and the similarity is approximated by the fraction of fingerprint positions at which the two documents agree. The choice of k matters greatly for the estimate. The larger k is, the smaller the variance of the estimate and the higher its precision and recall; the smaller k is, the larger the variance and the lower the precision and recall. Practical systems therefore often require k > 1000 to reduce the variance to a range acceptable to the user.
Generating and applying random permutations both require a significant amount of computation time, because each permutation ranges over the full universe $[0, 2^{32})$. Permutations this large are generally impractical in real systems. An approximate permutation is generally used instead, in which values in $[0, 2^{32})$ are reduced modulo a smaller range; this gives only a limited efficiency gain and reduces precision. Even then, k > 1000 random permutations still have to be generated, so the computation time remains very long.
The hash fingerprint technique described above is commonly used, but because of the limitations of random permutation, fingerprint generation is inefficient and the accuracy of similarity detection is not high.
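The random-permutation scheme described in this background can be sketched as follows. This is a minimal illustration on a toy universe, not the implementation of any particular system: the function names, the universe size D = 64, the value k = 500, and the seed are all assumptions chosen for the example. It shows why k explicit permutations must be generated and stored before any fingerprint can be computed.

```python
import random


def make_permutations(universe_size, k, seed=0):
    """Generate k explicit random permutations of {0, ..., universe_size - 1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        pi = list(range(universe_size))
        rng.shuffle(pi)
        perms.append(pi)
    return perms


def minwise_fingerprint(shingles, permutations):
    """Return one minimum per permutation: min{pi_j(S_d)} for j = 1..k."""
    return [min(pi[x] for x in shingles) for pi in permutations]


def estimate_resemblance(fp1, fp2):
    """Fraction of fingerprint positions at which the two documents agree."""
    matches = sum(1 for a, b in zip(fp1, fp2) if a == b)
    return matches / len(fp1)


# Toy data (hypothetical): two overlapping shingle sets over a universe of 64 values.
D, k = 64, 500
perms = make_permutations(D, k)
s1 = set(range(0, 30))
s2 = set(range(10, 40))
exact = len(s1 & s2) / len(s1 | s2)  # 20 / 40 = 0.5
estimate = estimate_resemblance(
    minwise_fingerprint(s1, perms), minwise_fingerprint(s2, perms)
)
print(exact, estimate)
```

Storing the permutations costs k times the universe size, which is what makes the approach impractical for a universe of $2^{32}$ values.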
Disclosure of Invention
In order to overcome the problems of the prior art, the invention provides a fingerprint similarity method based on the group minimum value. The method removes the limitation of random permutation; that is, it generates fingerprints without random permutation.
In order to solve the technical problems, the technical scheme of the invention is as follows:
The fingerprint similarity method based on the group minimum value comprises the following steps:
S1, text feature extraction: extract the text feature set $S_d$.
Scan and analyze the text, segment it into words, and filter out noise data; the resulting segmentation is the word set $S_{shgs}$ of the text. Each word in $S_{shgs}$ is mapped to a 32-bit integer with a Rabin function, and the mapped set is named $S_d$.
S2, data grouping: given two documents $S_1$ and $S_2$, the elements of $S_1$ and $S_2$ are grouped with a fixed group size m, giving k groups. The universe is $\Omega = \{0, 1, 2, \ldots, 2^{32}-1\}$, $|\Omega| = 2^{32}$. Elements present in the set are marked 1 and absent elements are marked 0.
S3, selecting a representative of each group to form the fingerprint:
One representative is selected from each group; the representative is the maximum value, the minimum value, or the middle element of the group. A group with no elements is called an empty group, and its representative is set to e. The k group representatives of $S_d$ form a set, namely the fingerprint set.
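S4, calculating the similarity R as follows:
$$R = \frac{N_{mat}}{k - N_{emp}} \tag{4}$$
where $N_{mat}$ is the number of groups whose representatives are equal, $N_{emp}$ is the number of groups that are empty in both documents, $I_{mat,i}$ is the match indicator for the comparison of the i-th group, and $I_{emp,i}$ is the indicator that the i-th group is empty in both documents.
Specifically:
$$N_{mat} = \sum_{i=1}^{k} I_{mat,i}, \qquad N_{emp} = \sum_{i=1}^{k} I_{emp,i}$$
Equation (7) is: when neither of the two compared elements is e, i.e. neither is empty, and the two elements are equal, then $I_{mat,i} = 1$; otherwise $I_{mat,i} = 0$:
$$I_{mat,i} = \begin{cases} 1 & \text{if both representatives of group } i \text{ are not } e \text{ and are equal} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
Equation (8) is: when both compared elements are e, i.e. both are empty, then $I_{emp,i} = 1$; otherwise $I_{emp,i} = 0$:
$$I_{emp,i} = \begin{cases} 1 & \text{if both representatives of group } i \text{ are } e \\ 0 & \text{otherwise} \end{cases} \tag{8}$$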
Compared with the prior art, the invention has the following advantages:
1) The method dispenses with generating and storing the random permutations $\pi_1, \pi_2, \ldots, \pi_k$, which saves storage; 2) the method does not need to map elements through random permutations to obtain fingerprints, which reduces the computation time for generating fingerprints.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, but embodiments of the present invention are not limited thereto.
A fingerprint similarity method based on a grouping minimum value specifically comprises the following steps:
Step one, text feature extraction: this step extracts the text feature set $S_d$.
First, the text is scanned and analyzed, segmented with a Chinese word segmentation algorithm, and filtered against a stop-word list to remove noise, giving the word set $S_{shgs}$ of the text. Noise means words that carry no meaning in the text, generally high-frequency, low-content words such as particles and function words. Each word in $S_{shgs}$ is mapped to a 32-bit integer with a Rabin function, and the mapped set is named $S_d$.
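Step one can be sketched as below, under several loudly stated assumptions. Whitespace splitting stands in for the Chinese word segmentation algorithm, `STOP_WORDS` is a hypothetical stop-word list, and `zlib.crc32` is used as a stand-in 32-bit hash in place of the Rabin function, which the Python standard library does not provide. The function name `extract_features` is also an assumption.

```python
import zlib

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def extract_features(text, stop_words=STOP_WORDS):
    """Map a text to a set S_d of 32-bit integers.

    Assumptions: whitespace tokenization replaces word segmentation,
    and CRC-32 replaces the Rabin function named in the description.
    """
    words = [w for w in text.lower().split() if w not in stop_words]
    return {zlib.crc32(w.encode("utf-8")) & 0xFFFFFFFF for w in words}


s_d = extract_features("The similarity of the documents is estimated")
print(sorted(s_d))
```

Any hash that spreads words evenly over $[0, 2^{32})$ fills the same role in the later steps; the choice of CRC-32 here is only for convenience.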
Step two, data grouping
Given two documents $S_1$ and $S_2$, the elements of $S_1$ and $S_2$ are grouped with a fixed group size m, giving k groups. The universe is $\Omega = \{0, 1, 2, \ldots, 2^{32}-1\}$, $|\Omega| = 2^{32}$. Elements present in the set are marked 1 and absent elements are marked 0.
In this embodiment, $\Omega = \{0, 1, 2, \ldots, 15\}$, $|\Omega| = 16$, $m = 4$, $k = 4$, $m = |\Omega|/k$, and
$S_1 = \{0, 3, 4, 6, 7, 13\}$
$S_2 = \{0, 1, 3, 6, 7, 13, 14\}$
After the elements of $S_1$ and $S_2$ are grouped with the fixed size m, the membership bit vectors, with groups separated by |, are:
S_1 = {1 0 0 1 | 1 0 1 1 | 0 0 0 0 | 0 1 0 0}
S_2 = {1 1 0 1 | 0 0 1 1 | 0 0 0 0 | 0 1 1 0}
Step three, selecting a representative of each group to form the fingerprint.
One representative is selected from each group; the maximum value, the minimum value, or the middle element of the group can be chosen. A group with no elements is called an empty group, and its representative is set to e. The k group representatives of $S_d$ form a set, defined here as the fingerprint set.
The following example uses the minimum value as the representative:
Printfinger(S_1) = [0, 4, e, 13]
Printfinger(S_2) = [0, 6, e, 13]
Because a group has at most 4 elements, each representative can be stored modulo 4, i.e. as its offset within the group, which saves storage space:
Printfinger(S_1) = [0, 0, e, 1]
Printfinger(S_2) = [0, 2, e, 1]
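Steps two and three can be sketched together as follows, using the sets encoded by the bit vectors of this embodiment. The function names and the use of `None` as the empty-group marker e are assumptions made for illustration. Each element x falls in group x // m, and the fingerprint keeps the smallest element seen in each group.

```python
EMPTY = None  # stands for the empty-group marker e


def group_min_fingerprint(elements, universe_size, m):
    """Return the minimum of each size-m group of the universe, or EMPTY."""
    k = -(-universe_size // m)  # number of groups, rounded up
    fingerprint = [EMPTY] * k
    for x in elements:
        if not 0 <= x < universe_size:
            raise ValueError(f"element {x} is outside the universe")
        i = x // m  # index of the group that contains x
        if fingerprint[i] is EMPTY or x < fingerprint[i]:
            fingerprint[i] = x
    return fingerprint


def compress_mod_m(fingerprint, m):
    """Store each representative as its offset inside its group."""
    return [EMPTY if v is EMPTY else v % m for v in fingerprint]


s1 = {0, 3, 4, 6, 7, 13}
s2 = {0, 1, 3, 6, 7, 13, 14}
fp1 = group_min_fingerprint(s1, universe_size=16, m=4)
fp2 = group_min_fingerprint(s2, universe_size=16, m=4)
print(fp1, fp2)  # [0, 4, None, 13] [0, 6, None, 13]
print(compress_mod_m(fp1, 4), compress_mod_m(fp2, 4))  # [0, 0, None, 1] [0, 2, None, 1]
```

A single pass over the set is enough, and no permutation table is needed, which is the saving the method claims over the random-permutation scheme.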
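Step four, calculating the similarity R.
The calculation formula is:
$$R = \frac{N_{mat}}{k - N_{emp}} \tag{4}$$
where $N_{mat}$ is the number of groups whose representatives are equal and $N_{emp}$ is the number of groups that are empty in both documents. Specifically:
$$N_{mat} = \sum_{i=1}^{k} I_{mat,i}, \qquad N_{emp} = \sum_{i=1}^{k} I_{emp,i}$$
Equation (7) is: when neither of the two compared elements is e, i.e. neither is empty, and the two elements are equal, then $I_{mat,i} = 1$; otherwise $I_{mat,i} = 0$:
$$I_{mat,i} = \begin{cases} 1 & \text{if both representatives of group } i \text{ are not } e \text{ and are equal} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
Equation (8) is: when both compared elements are e, i.e. both are empty, then $I_{emp,i} = 1$; otherwise $I_{emp,i} = 0$:
$$I_{emp,i} = \begin{cases} 1 & \text{if both representatives of group } i \text{ are } e \\ 0 & \text{otherwise} \end{cases} \tag{8}$$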
In this embodiment, the two fingerprints are compared group by group:
Initially, $N_{mat} = 0$, $N_{emp} = 0$.
Group 1: 0 = 0, so $N_{mat} = N_{mat} + 1$; now $N_{mat} = 1$, $N_{emp} = 0$.
Group 2: 0 ≠ 2, so $N_{mat}$ is unchanged; now $N_{mat} = 1$, $N_{emp} = 0$.
Group 3: the empty group e appears on both sides, so $N_{emp} = N_{emp} + 1$; now $N_{mat} = 1$, $N_{emp} = 1$.
Group 4: 1 = 1, so $N_{mat} = N_{mat} + 1$; now $N_{mat} = 2$, $N_{emp} = 1$.
The similarity is $R = N_{mat}/(k - N_{emp}) = 2/(4 - 1) = 2/3 \approx 0.667 = 66.7\%$.
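The group-by-group comparison above can be sketched as a short function. The name `similarity` and the use of `None` for e are assumptions for illustration. The function counts matches and jointly empty groups as in equations (7) and (8) and then applies formula (4).

```python
EMPTY = None  # stands for the empty-group marker e


def similarity(fp1, fp2):
    """Compute R = N_mat / (k - N_emp) from two fingerprints of equal length."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same number of groups")
    k = len(fp1)
    n_mat = 0  # groups where both representatives are non-empty and equal
    n_emp = 0  # groups that are empty in both fingerprints
    for a, b in zip(fp1, fp2):
        if a is EMPTY and b is EMPTY:
            n_emp += 1
        elif a is not EMPTY and b is not EMPTY and a == b:
            n_mat += 1
    if k == n_emp:
        raise ValueError("every group is empty in both fingerprints")
    return n_mat / (k - n_emp)


r = similarity([0, 0, EMPTY, 1], [0, 2, EMPTY, 1])
print(round(r, 3))  # 0.667
```

A group that is empty in only one document counts toward neither $N_{mat}$ nor $N_{emp}$, so it lowers the estimate, as it should.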
The correctness of formula (4), $R = N_{mat}/(k - N_{emp})$, is demonstrated below.
It suffices to prove that the mathematical expectation satisfies $E[N_{mat}/(k - N_{emp})] = R$.
(1) k is the number of groups, which is necessarily greater than the number of empty groups $N_{emp}$, so $k - N_{emp} > 0$:
$$k - N_{emp} > 0 \Rightarrow \Pr(k - N_{emp} > 0) = 1$$
(2) From the definitions,
$$I_{emp,i} = 1 \Rightarrow I_{mat,i} = 0 \tag{9}$$
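One way the expectation can be evaluated from (1), (2), and (9) is sketched below. It rests on an assumption that is not stated explicitly in the text: for every group that is not jointly empty, and for every pattern of empty groups, the two representatives coincide with probability R. This is the property a random permutation would guarantee; whether it holds without one depends on how the hashed elements are distributed.

```latex
% Assumption (labeled, not from the source):
%   E[ I_{mat,i} | I_{emp,i} = 0, N_{emp} = n ] = R  for every group i.
\begin{align*}
N_{mat} &= \sum_{i=1}^{k} I_{mat,i}
         = \sum_{i:\, I_{emp,i}=0} I_{mat,i}
         && \text{by (9), jointly empty groups contribute 0} \\
E\left[ N_{mat} \mid N_{emp} = n \right] &= (k - n)\, R
         && \text{$k-n$ groups remain, each matching w.p. } R \\
E\left[ \frac{N_{mat}}{k - N_{emp}} \,\middle|\, N_{emp} = n \right]
        &= \frac{(k - n)\, R}{k - n} = R
         && \text{well defined since } k - n > 0 \\
E\left[ \frac{N_{mat}}{k - N_{emp}} \right]
        &= \sum_{n=0}^{k-1} \Pr(N_{emp} = n)\, R = R
         && \text{law of total expectation}
\end{align*}
```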
Compared with the prior art, the invention has the following advantages:
1) The method dispenses with generating and storing the random permutations $\pi_1, \pi_2, \ldots, \pi_k$; 2) the method does not need to map elements through random permutations to obtain fingerprints.
Step five, a further optimization can be applied: each fingerprint element is compressed to 1 bit.
The calculation formula is:
$$R_b = 2E_{mat,b} - 1 \tag{13}$$
The correctness of equation (13) is demonstrated below.
Let B be a family of independently distributed functions. Each function $b \in B$ maps elements of the domain $[0, |\Omega| - 1]$ to $\{0, 1\}$ with uniform probability, i.e. to 0 with probability 50% and to 1 with probability 50%. Therefore, for $u, v \in [0, |\Omega| - 1]$, $\Pr_{b \in B}\{b(u) = b(v)\} = 1/2$ when $u \ne v$, and $\Pr_{b \in B}\{b(u) = b(v)\} = 1$ when $u = v$. Combining B with formula (4), let $E_{mat,b}$ denote the probability that the group minima of the two documents are equal after the function b has mapped them to $\{0, 1\}$.
The argument is as follows. With probability R, $\min(i\text{-th group})_1 = \min(i\text{-th group})_2 \ne e$, and then $b(\min(i\text{-th group})_1) = b(\min(i\text{-th group})_2) \ne e$ with probability 100%. With probability $1 - R$, $\min(i\text{-th group})_1 \ne \min(i\text{-th group})_2$ with neither equal to e, and then $b(\min(i\text{-th group})_1) = b(\min(i\text{-th group})_2) \ne e$ with probability 1/2. Hence:
$$E_{mat,b} = R \cdot 1 + (1 - R) \cdot \frac{1}{2} = \frac{1 + R}{2} \tag{14}$$
Thus, from formula (14), with $R_b$ denoting the similarity computed from the compressed fingerprints: $R_b = 2E_{mat,b} - 1$.
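The 1-bit compression can be sketched as follows. The choice of b is an assumption: the lowest bit of a seeded CRC-32 is used as a stand-in for a function drawn from the family B, and the names `one_bit`, `compress`, and `similarity_one_bit` are hypothetical. Empty groups keep the marker e, since the argument above distinguishes them from the bit values.

```python
import zlib

EMPTY = None  # stands for the empty-group marker e


def one_bit(value, seed=0):
    """Stand-in for b: map a 32-bit representative to {0, 1}; keep EMPTY as is."""
    if value is EMPTY:
        return EMPTY
    data = seed.to_bytes(4, "little") + value.to_bytes(4, "little")
    return zlib.crc32(data) & 1


def compress(fingerprint, seed=0):
    """Compress every group representative to a single bit."""
    return [one_bit(v, seed) for v in fingerprint]


def similarity_one_bit(bits1, bits2):
    """Estimate R_b = 2 * E_mat,b - 1 from two 1-bit fingerprints."""
    k = len(bits1)
    pairs = list(zip(bits1, bits2))
    n_emp = sum(1 for a, b in pairs if a is EMPTY and b is EMPTY)
    n_mat_b = sum(
        1 for a, b in pairs if a is not EMPTY and b is not EMPTY and a == b
    )
    if k == n_emp:
        raise ValueError("every group is empty in both fingerprints")
    e_mat_b = n_mat_b / (k - n_emp)  # empirical fraction of matching bits
    return 2 * e_mat_b - 1


fp = [0, 4, EMPTY, 13]
print(similarity_one_bit(compress(fp), compress(fp)))  # 1.0
```

Because unequal representatives still collide half the time, the estimate is noisier than the uncompressed one and can come out negative for dissimilar documents when k is small.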
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (3)

1. A method for calculating similarity of fingerprints without random permutation, characterized by comprising the following steps:
S1, text feature extraction: extracting the text feature set $S_d$;
S2, data grouping: given two documents $S_1$ and $S_2$, grouping the elements of $S_1$ and $S_2$ with a fixed size m, the number of groups being k and the universe being $\Omega = \{0, 1, 2, \ldots, 2^{32}-1\}$, $|\Omega| = 2^{32}$, and marking elements present in the set as 1 and absent elements as 0;
S3, selecting a representative of each group to form a fingerprint:
selecting one representative from each group, wherein the representative is the maximum value, the minimum value, or the middle element of the group; a group with no elements is called an empty group and its representative is set to e; the k group representatives of $S_d$ form a set, namely the fingerprint set;
S4, calculating the similarity R as follows:
$$R = \frac{N_{mat}}{k - N_{emp}} \tag{4}$$
wherein $N_{mat}$ denotes the number of groups at which the representatives in the fingerprint sets of $S_1$ and $S_2$ are equal, and $N_{emp}$ denotes the number of groups that are empty in the fingerprint sets of both $S_1$ and $S_2$; $I_{mat,i}$ is the match indicator for the comparison of the i-th group; $I_{emp,i}$ is the indicator that the i-th group is empty in both;
specifically:
$$N_{mat} = \sum_{i=1}^{k} I_{mat,i}, \qquad N_{emp} = \sum_{i=1}^{k} I_{emp,i}$$
equation (7) is: when neither of the two compared elements is e, i.e. neither is empty, and the two elements are equal, then $I_{mat,i} = 1$; otherwise $I_{mat,i} = 0$;
equation (8) is: when both compared elements are e, i.e. both are empty, then $I_{emp,i} = 1$; otherwise $I_{emp,i} = 0$.
2. The method for calculating similarity of fingerprints without random permutation according to claim 1, wherein in step S2 the data is divided into k groups, elements present in the set are marked as 1 and absent elements are marked as 0, and the minimum value in each group is selected as the representative to form the fingerprint.
3. The method for calculating similarity of fingerprints without random permutation according to claim 2, further comprising, after calculating the similarity R in step S4, compressing the fingerprint to 1 bit and using $R_b$ to express the similarity estimator after compression, according to the formula:
$$R_b = 2E_{mat,b} - 1$$
wherein $E_{mat,b}$ represents the probability that the group minima are equal after the function b maps them to $\{0, 1\}$.
CN201610019243.9A 2016-01-13 2016-01-13 A kind of method for calculating similarity as fingerprint based on packet minimum value Active CN105718430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610019243.9A CN105718430B (en) 2016-01-13 2016-01-13 A kind of method for calculating similarity as fingerprint based on packet minimum value

Publications (2)

Publication Number Publication Date
CN105718430A (en) 2016-06-29
CN105718430B (en) 2018-05-04

Family

ID=56147793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610019243.9A Active CN105718430B (en) 2016-01-13 2016-01-13 A kind of method for calculating similarity as fingerprint based on packet minimum value

Country Status (1)

Country Link
CN (1) CN105718430B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104484B (en) * 2019-12-19 2021-09-03 南京中孚信息技术有限公司 Text similarity detection method and device and electronic equipment
CN111444325B (en) * 2020-03-30 2023-06-20 湖南工业大学 Method for measuring document similarity by position coding single random replacement hash

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality
CN102682104A (en) * 2012-05-04 2012-09-19 中南大学 Method for searching similar texts and link bit similarity measuring algorithm
CN103020174A (en) * 2012-11-28 2013-04-03 华为技术有限公司 Similarity analysis method, device and system
CN104636325A (en) * 2015-02-06 2015-05-20 中南大学 Document similarity determining method based on maximum likelihood estimation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100070509A1 (en) * 2008-08-15 2010-03-18 Kai Li System And Method For High-Dimensional Similarity Search

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Near-duplicate document detection with improved similarity measurement;YUAN Xin-pan 等;《Springer》;20121231;第2231-2237页 *
基于分组指纹的细粒度相似性检测系统;盛鑫海 等;《湖南工业大学学报》;20141130;第28卷(第6期);第81-85页 *

Also Published As

Publication number Publication date
CN105718430A (en) 2016-06-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200529

Address after: Room g0044, headquarters building, Changsha Zhongdian Software Park Co., Ltd., No. 39, Jianshan Road, Changsha hi tech Development Zone, Changsha City, Hunan Province

Patentee after: HUNAN YUN ZHI IOT NETWORKTECHNOLOGY Co.,Ltd.

Address before: 412000 Taishan Road, Tianyuan District, Hunan, No. 88, No.

Patentee before: HUNAN University OF TECHNOLOGY
