CN111444325B - Method for measuring document similarity by position coding single random replacement hash - Google Patents
- Publication number
- CN111444325B (application number CN202010235463.1A)
- Authority
- CN
- China
- Prior art keywords
- area
- hash
- similarity
- empty
- key value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for measuring document similarity by position-coded single random permutation hash (POPH), which belongs to the field of similar-text search in information retrieval. The method comprises the following steps. S1: preliminarily extract text features to generate a single random permutation hash set O_x. S2: further extract text features to generate a position-coded single random permutation hash set P_x: traverse the non-empty regions of the set O_x from S1, take the sequence number of each non-empty region as the key and its hash value as the value, and combine them into key-value pairs of the form <k, v>, which make up the set P_x. S3: similarity measurement: traverse all key-value pairs in P_a and P_b and compare the two documents a and b according to the similarity $R = \frac{N_{mat}}{k - N_{emp}}$. The calculation precision is high and consistent with that of OPH; as the number of empty regions increases, the POPH method saves calculation time and storage space.
Description
Technical Field
The invention belongs to the field of similar-text search in information retrieval, and in particular relates to a method for measuring document similarity by position-coded single random permutation hash.
Background
The Web is undergoing explosive growth, and more and more literature is being published on the Internet. As a result, the document resources available on the network are growing geometrically, which gives people unprecedented convenience in sharing knowledge and creating wealth and actively promotes China's modernization. However, while these digital resources help people, their easy availability also makes illegal copying, plagiarism and similar misconduct increasingly rampant, so that serious plagiarism may exist in papers, project applications and other documents. Meanwhile, as the state invests heavily in education and scientific research, many kinds of education and science and technology projects are funded, such as National Natural Science Foundation projects, doctoral program projects of the education department, foundation projects of provinces and cities, and various science and technology plans. Because these projects are managed by different functional departments, the same project may be applied for repeatedly or submitted to several funding bodies at once. Plagiarism, repeated applications and multi-agency applications greatly affect the objectivity and fairness of project approval and hinder the reasonable allocation of national research funding, so that the funding may not be used efficiently. Research on document similarity detection technology is therefore of great significance for preventing plagiarism and upholding academic integrity.
For this reason, search engines, libraries, foundations, paper repositories, intellectual property departments and similar bodies around the world have invested enormous manpower, material and financial resources in document similarity detection, in order to solve its key scientific problems as soon as possible and provide a good solution for duplicate checking of papers, project applications, award declarations and patents, and for web page deduplication in search engines.
Similarity detection involves massive amounts of data. Take National Natural Science Foundation applications as an example: in 2019 the number of applications exceeded 200,000, and it is growing quickly every year. Likewise, about 7 million students graduate from Chinese colleges and universities each year, and most of their theses must undergo similarity detection; the volume of thesis detection peaks in May each year, with more than tens of thousands of documents per day. Each document must be compared not only with documents from the current year but also with historical data. Conventional detection methods are not feasible at all for such a massive number of documents, so a detection mechanism with excellent precision and efficiency, built on hash-based similarity estimation, is urgently needed to enable similarity comparison of massive document collections. The concept of text similarity measurement and its related techniques have developed accordingly. A good text similarity measurement method is important in research fields such as similarity detection, automatic question answering, intelligent retrieval, web page deduplication and natural language processing.
Text similarity is a measure of the degree of matching between two or more texts: the higher the similarity, the more alike the texts are, and vice versa. The traditional text similarity measurement method is the vector space model (VSM), which computes the weighted inner product of the term-frequency vectors of the document under test and a document in the data set to obtain their similarity. This algorithm suffers from a large number of feature words, slow comparison and low accuracy, and cannot be applied to similarity measurement over massive data. Minwise-based similarity measurement algorithms are the most mainstream and mature similarity detection methods: they convert the similarity problem into a problem of event probability, map the set of text words to a set of hash values, and convert string comparison into feature-fingerprint comparison, which makes them suitable for similarity measurement over massive data.
Minwise-based similarity measurement algorithms and their variants have high estimation precision, but research institutions are still pursuing higher precision. This is because real detection data is diverse and random, and one case occurs frequently: a large text that contains a small text (f_1 >> f_2 ≈ a), where f_1 and f_2 are the word-set sizes of document 1 and document 2 and a is the size of their intersection. Because f_1 >> f_2, the similarity is small; because f_2 ≈ a, the containment of document 2 in document 1 is close to 1, which indicates that document 2 is copied entirely from document 1. In this case of low similarity and high containment, the Minwise-based similarity measurement algorithm has a large variance and insufficient precision. Although this is a special kind of data, it is common in practice; the similarity deviation can exceed 20% in some cases, and no good processing method exists at present.
In addition, the Minwise-based similarity measurement algorithm has the disadvantage that k independent random permutations are needed to generate k hash values, which are then compared to calculate the similarity of a document pair; the k permutations are time-consuming and account for 80% of the total time. One Permutation Hashing (OPH), the single random permutation hash, achieves the effect of k permutations with only one permutation while still generating k hash values, which improves calculation efficiency.
The invention patent published on 2018.08.17 under publication number CN108415889A, entitled "A text similarity detection method based on weighted one-time permutation hash algorithm", proposes a method of dividing regions unevenly, which reduces the number of hash value comparisons by setting a threshold and thereby effectively improves calculation efficiency.
However, when the number of hash values in the regions is excessive, both the single random permutation hash method and the weighted single permutation hash method described above consume too much computation.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for measuring document similarity by position-coded single random permutation hash (POPH). It addresses the performance cost of hash value comparison when OPH produces too many empty regions, improves calculation performance, and has important scientific significance and practical application value.
The invention adopts the following technical scheme:
A method for measuring document similarity by position-coded single random permutation hash, comprising the following steps:
S1: preliminarily extract text features to generate a single random permutation hash set O_x;
S2: further extract text features to generate a position-coded single random permutation hash set P_x: traverse the non-empty regions of the set O_x from S1, take the sequence number of each non-empty region as the key and its hash value as the value, and combine them into key-value pairs of the form <k, v>, which make up the set P_x;
S3: similarity measurement: traverse all key-value pairs in P_a and P_b and compare the two documents a and b according to the similarity
$$R = \frac{N_{mat}}{k - N_{emp}};$$
where the subscript x denotes any document; P_a and P_b are the sets of key-value pairs <k, v> generated by step S2 for documents a and b respectively; N_emp is the number of regions that are empty in both O_a and O_b; N_mat is the number of regions that are non-empty in both O_a and O_b and have equal hash values; and k is the number of regions, excluding the end bit, of whichever of O_a and O_b has the larger total number of regions.
Further, the hash set O_x in S1 is generated as follows:
S1.1: perform word segmentation and noise filtering on document x to obtain the word set S_x;
S1.2: map S_x with the Rabin function to obtain a new set S_xD, and apply one random permutation to S_xD to generate the set π(S_xD);
S1.3: the set of hash values generated by π(S_xD) on the full set Ω is S_xR;
S1.4: compression-encode S_xR to obtain O_x.
Further, the random permutation applied to S_xD in S1.2 satisfies the following: under the random permutation π, every element y of a data set Y is equally likely to be the minimum of the permuted data set, namely
$$\Pr\bigl(\min\{\pi(Y)\} = \pi(y)\bigr) = \frac{1}{|Y|},$$
where Y ⊆ Ω, y ∈ Y, and π is a random minwise permutation.
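The min-wise property stated above can be checked exhaustively on a toy universe. The following Python sketch is illustrative only: the function name `min_counts` and the tiny universe size are assumptions made for this example, and it enumerates every permutation of a small Ω rather than using the Rabin function or any permutation scheme from the invention.

```python
from collections import Counter
from itertools import permutations


def min_counts(universe_size, subset):
    """Count how often each y in `subset` attains the minimum of pi(subset),
    taken over every permutation pi of {0, ..., universe_size - 1}."""
    counts = Counter()
    total = 0
    for pi in permutations(range(universe_size)):
        # pi[y] is the image of element y under this permutation.
        winner = min(subset, key=lambda y: pi[y])
        counts[winner] += 1
        total += 1
    return counts, total


counts, total = min_counts(5, (0, 2, 3))
for y in sorted(counts):
    # Each element of Y should win in total / |Y| permutations.
    print(y, counts[y], counts[y] / total)
```

With Ω = {0, ..., 4} and Y = {0, 2, 3}, each element of Y is the minimum in 40 of the 120 permutations, that is, with probability 1/|Y| = 1/3.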
Further, the full set Ω in S1.3 is uniformly divided into k regions; all regions have the same size m, and the regions are numbered from 1 to k.
Further, each region of the full set Ω generates one hash value: if a region contains no non-zero element, it is an empty region and its hash value is "*"; if a region contains non-zero elements, it is a non-empty region and its minimum non-zero element is used as its hash value; the hash values of the empty and non-empty regions together form S_xR.
Further, the compression encoding in S1.4 uses the encoding compression function f(hash) = hash mod m, where mod is the modulo function and m is the region size of the full set Ω; applying the compression encoding function to each hash value in S_xR generates the set O_x.
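Steps S1.3 and S1.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the permuted set π(S_xD) is already available as integers in {0, ..., D-1}, D is divisible by k, and an empty region ("*") is represented by `None`. The function name `oph_compress` is hypothetical, and "non-zero element" is read here as "an element of π(S_xD) that falls in the region".

```python
def oph_compress(permuted, D, k):
    """Return O_x as a list of k entries, one per region (regions 1..k map to
    list positions 0..k-1). None marks an empty region."""
    if D % k != 0:
        raise ValueError("D must be divisible by k so all regions have size m")
    m = D // k                      # region size
    region_min = [None] * k         # S_xR: minimum element seen in each region
    for value in permuted:
        if not 0 <= value < D:
            raise ValueError("value outside the full set")
        b = value // m              # 0-based region index of this value
        if region_min[b] is None or value < region_min[b]:
            region_min[b] = value
    # Compression encoding f(hash) = hash mod m, applied to non-empty regions.
    return [None if v is None else v % m for v in region_min]


# Illustrative input: D = 36, k = 9 (so m = 4).
print(oph_compress({6, 19, 25, 32}, D=36, k=9))
```

For the input {6, 19, 25, 32} the regions 2, 5, 7 and 9 are non-empty, and their compressed hash values are 2, 3, 1 and 0.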
Further, the similarity measurement in S3 proceeds as follows:
S3.1: let N_mat = 0, N_emp = 0 and i = 1; read the key-value pairs of the non-empty regions of P_a and P_b from the beginning; minindex is the smaller of the current keys of P_a and P_b;
S3.2: when the non-empty region number read from P_a is not equal to that read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; the key-value pair with the larger region number is then compared with the key-value pair of the next non-empty region in the set whose region number is smaller;
S3.2.1: set i = minindex + 1, and minindex becomes the smaller of the current keys of P_a and P_b; if the non-empty region number read from P_a is not equal to that read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; otherwise go to step S3.3;
S3.3: when the non-empty region number read from P_a is equal to that read from P_b, set i = minindex + 1, and minindex becomes the smaller of the current keys of P_a and P_b; if the values of the two key-value pairs are equal, N_mat = N_mat + 1 and N_emp is unchanged; otherwise go to step S3.3.1;
S3.3.1: if the values of the two key-value pairs are not equal, N_emp = N_emp + minindex - i and N_mat is unchanged; continue by reading the key-value pairs of the next non-empty regions in P_a and P_b, set i = minindex + 1, minindex becomes the smaller of the current keys of P_a and P_b, and go to step S3.4;
S3.4: if the traversal of the key-value pairs of P_a and P_b has reached the end bit, stop the comparison; otherwise go to step S3.2.
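The traversal in S3.1 to S3.4 can be read as a merge over two key-sorted lists. The Python sketch below is one possible reading, not the patented implementation: the function name `poph_similarity`, the list-of-tuples representation and the handling of the end bit (treated as a sentinel key k + 1, so that trailing jointly empty regions are counted) are assumptions, as is returning 0.0 when every region is empty in both sets.

```python
def poph_similarity(p_a, p_b, k):
    """p_a, p_b: lists of (region_number, hash_value) sorted by region number,
    with region numbers in 1..k. Returns (N_emp, N_mat, R)."""
    n_mat = 0
    n_emp = 0
    i = 1                 # next region not yet accounted for
    ia = ib = 0           # read positions in p_a and p_b
    end = k + 1           # assumed position of the end bit
    while ia < len(p_a) or ib < len(p_b):
        key_a = p_a[ia][0] if ia < len(p_a) else end
        key_b = p_b[ib][0] if ib < len(p_b) else end
        minindex = min(key_a, key_b)
        # Regions i .. minindex-1 are non-empty in neither set.
        n_emp += minindex - i
        if key_a == key_b:
            if p_a[ia][1] == p_b[ib][1]:
                n_mat += 1        # same region, same hash value
            ia += 1
            ib += 1
        elif key_a < key_b:
            ia += 1               # advance the set with the smaller key
        else:
            ib += 1
        i = minindex + 1
    n_emp += end - i              # jointly empty regions after the last key
    denom = k - n_emp
    r = n_mat / denom if denom > 0 else 0.0
    return n_emp, n_mat, r


# Input taken from the embodiment illustrated in FIGS. 9 to 13 (k = 9).
p_a = [(2, 2), (5, 3), (7, 1), (9, 0)]
p_b = [(3, 3), (7, 1), (9, 1)]
print(poph_similarity(p_a, p_b, k=9))
```

The loop visits only non-empty regions, so its running time depends on the number of key-value pairs rather than on k, which is the saving the method aims for.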
The beneficial effects of the invention are as follows:
(1) The calculation accuracy is high: because the formula for the similarity R is the same as in OPH, the accuracy is consistent with that of OPH;
(2) As the number of empty regions increases, POPH takes less time than OPH: POPH compares only the hash values of non-empty regions and then calculates N_mat through the position coding, which saves calculation time and storage space.
Drawings
FIG. 1 is a diagram of the process of generating the single random permutation hash set OPH(S_x) for a document in an embodiment of the invention;
FIG. 2 is the region diagram corresponding to FIG. 1 after compression encoding of the single random permutation hash values;
FIG. 3 is a diagram of the process of generating the position-coded single random permutation hash set POPH(S_x) from the compression-encoded set of FIG. 2;
FIG. 4 is a comparison of the time taken by OPH and POPH to compare hash values for different document pairs in an embodiment without empty regions;
FIG. 5 is a comparison of the time taken by OPH and POPH to compare hash values for the 1st document pair of FIG. 4 in the presence of empty regions;
FIG. 6 is a comparison of the time taken by OPH and POPH to compare hash values for the 2nd document pair of FIG. 4 in the presence of empty regions;
FIG. 7 is a comparison of the time taken by OPH and POPH to compare hash values for the 3rd document pair of FIG. 4 in the presence of empty regions;
FIG. 8 is a comparison of the time taken by OPH and POPH to compare hash values for the 4th document pair of FIG. 4 in the presence of empty regions;
FIG. 9 shows the comparison of sets P_a and P_b when the counter i = 1 in an embodiment;
FIG. 10 shows the comparison of sets P_a and P_b when the counter i = 3 in the embodiment of FIG. 9;
FIG. 11 shows the comparison of sets P_a and P_b when the counter i = 4 in the embodiment of FIG. 9;
FIG. 12 shows the comparison of sets P_a and P_b when the counter i = 6 in the embodiment of FIG. 9;
FIG. 13 shows the comparison of sets P_a and P_b when the counter i = 8 in the embodiment of FIG. 9.
Detailed Description
The invention will be further illustrated with reference to specific examples. Unless otherwise indicated, the methods employed in the examples of the present invention are methods conventionally used in the art.
Example 1
A method for measuring document similarity by position-coded single random permutation hash, comprising the following steps:
S1: preliminarily extract text features to generate a single random permutation hash set O_x;
S2: further extract text features to generate a position-coded single random permutation hash set P_x: traverse the non-empty regions of the set O_x from S1, take the sequence number of each non-empty region as the key and its hash value as the value, and combine them into key-value pairs of the form <k, v>, which make up the set P_x;
S3: similarity measurement: traverse all key-value pairs in P_a and P_b and compare the two documents a and b according to the similarity
$$R = \frac{N_{mat}}{k - N_{emp}};$$
where the subscript x denotes any document; P_a and P_b are the sets of key-value pairs <k, v> generated by step S2 for documents a and b respectively; N_emp is the number of regions that are empty in both O_a and O_b; N_mat is the number of regions that are non-empty in both O_a and O_b and have equal hash values; and k is the number of regions, excluding the end bit, of whichever of O_a and O_b has the larger total number of regions.
S1 is specifically as follows. First, the text is scanned and analysed, and the document is segmented with a Chinese word segmentation algorithm; the resulting word segmentation set is filtered with a stop-word list to remove text noise, giving the word set S_x of document x. Noise means words that carry no meaning in the text, typically high-frequency, low-information auxiliary words, function words and the like.
The word set S_x (the subscript x denotes any document) is mapped to 32-bit integers with the Rabin function (which can map a set of strings to a set of 32-bit or 64-bit natural numbers), and the mapped set is named S_xD. Assume the full set Ω = {0, 1, ..., D-1}, and let a_0 a_1 ... a_{D-1} be an arrangement of Ω; the vector (a_0, a_1, ..., a_{D-1}) then represents a permutation π of Ω:
$$\pi(i) = a_i, \quad i = 0, 1, \ldots, D-1.$$
If, for a data set Y ⊆ Ω and every y ∈ Y, the permutation π satisfies
$$\Pr\bigl(\min\{\pi(Y)\} = \pi(y)\bigr) = \frac{1}{|Y|},$$
then π is a random minwise permutation; in other words, under the random permutation π every element y of the data set Y is equally likely to be the minimum of the permuted data set. The set generated by applying one random permutation to S_xD is named π(S_xD).
The full set Ω is then uniformly divided into k regions (called Bins for short); all regions have the same size, denoted m, and each Bin is numbered from 1 to k; this number is called the BinId (Bid for short). Each element of π(S_xD) falls into one Bin of the full set Ω, and a hash value is then generated in each Bin as follows. If a Bin contains no non-zero element, the hash value of the region is taken as "*" and the Bin is called an empty region (a region of all zeros is an empty region). If the i-th Bin contains non-zero elements, the minimum non-zero element of the Bin is selected as the hash value of the region, and the Bin is called a non-empty region. The hash values of empty and non-empty regions are collectively called Binhash, and the set of hash values generated by π(S_xD) on the full set Ω is S_xR. Finally, S_xR is compression-encoded as follows: define the encoding compression function f(hash) = hash mod m, where mod is the modulo function; applying the compression encoding function to each hash value in S_xR generates the set OPH(S_x), abbreviated O_x.
As shown in FIG. 1 and FIG. 2, assume the full set Ω = {0, 1, 2, ..., 35} (D = 36) is uniformly divided into 9 regions, that is, k = 9, so each region has size m = 4. Let the word set of a document x be S_x. S_x is mapped to 32-bit integers with the Rabin function and the mapped set is named S_xD; the set formed by randomly permuting S_xD is π(S_xD) = {6, 19, 25, 32}. Placing each value of π(S_xD) in its region of the full set Ω generates the hash value set S_xR = {*, 6, *, *, 19, *, 25, *, 32}. Finally, compression-encoding the hash values in S_xR gives the set O_x = {*, 2, *, *, 3, *, 1, *, 0}.
S2 is specifically as follows. As shown in FIG. 3, the Bid of a region is used as a position feature and combined with the Binhash of that Bin to generate a key-value pair of the form <k, v>. The non-empty regions of the set O_x are traversed; for each one, the Bid of the Bin is taken as the key and the Binhash of the Bin as the value and stored in the structure <k, v>, forming the set P_x.
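The position encoding of S2 amounts to keeping only the non-empty regions together with their region numbers. A minimal Python sketch follows, assuming that O_x is a list in which `None` stands for an empty region ("*") and that regions are numbered from 1; the function name `position_encode` is hypothetical.

```python
def position_encode(o_x):
    """Build P_x from O_x: one (Bid, Binhash) pair per non-empty region.

    The test is `is not None` rather than truthiness, because a compressed
    hash value of 0 is a valid Binhash and must be kept."""
    return [(bid, binhash)
            for bid, binhash in enumerate(o_x, start=1)
            if binhash is not None]


o_x = [None, 2, None, None, 3, None, 1, None, 0]
print(position_encode(o_x))
```

For O_x = {*, 2, *, *, 3, *, 1, *, 0} this yields the pairs <2,2>, <5,3>, <7,1> and <9,0>, so the stored set shrinks from 9 entries to 4.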
The similarity measurement in S3 proceeds as follows:
S3.1: let N_mat = 0, N_emp = 0 and i = 1; read the key-value pairs of the non-empty regions of P_a and P_b from the beginning; minindex is the smaller of the current keys of P_a and P_b;
S3.2: when the non-empty region number read from P_a is not equal to that read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; the key-value pair with the larger region number is then compared with the key-value pair of the next non-empty region in the set whose region number is smaller;
S3.2.1: set i = minindex + 1, and minindex becomes the smaller of the current keys of P_a and P_b; if the non-empty region number read from P_a is not equal to that read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; otherwise go to step S3.3;
S3.3: when the non-empty region number read from P_a is equal to that read from P_b, set i = minindex + 1, and minindex becomes the smaller of the current keys of P_a and P_b; if the values of the two key-value pairs are equal, N_mat = N_mat + 1 and N_emp is unchanged; otherwise go to step S3.3.1;
S3.3.1: if the values of the two key-value pairs are not equal, N_emp = N_emp + minindex - i and N_mat is unchanged; continue by reading the key-value pairs of the next non-empty regions in P_a and P_b, set i = minindex + 1, minindex becomes the smaller of the current keys of P_a and P_b, and go to step S3.4;
S3.4: if the traversal of the key-value pairs of P_a and P_b has reached the end bit, stop the comparison; otherwise go to step S3.2.
Example 2
Four document pairs are selected from the experimental data to form the data set. They are divided into 4 groups by similarity from high to low, and each document pair is represented by a randomly selected word pair; the experimental data are shown in Table 1 below (f_1 and f_2 are the word-set sizes of document 1 and document 2, and a is the size of their intersection). If the randomly permuted set π(S_xD) has no empty regions, then when computing N_emp and N_mat POPH compares both the Bid and the Binhash while OPH compares only the Binhash, so OPH is faster than POPH in this case, as shown in FIG. 4.
Next, the time taken by OPH and POPH to complete the hash value comparison is measured when empty regions occur at different ratios. The data set is built on the first data set. The basic formula for the similarity between two sets S_1 and S_2 is
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a},$$
where f_1 = |S_1|, f_2 = |S_2| and a = |S_1 ∩ S_2|. When the total number of regions into which the full set Ω is uniformly divided is unchanged, different numbers of empty regions can be produced by decreasing a, so the number of empty regions is increased by decreasing a for each document pair in Table 1.
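The basic similarity between two sets, expressed with f_1, f_2 and a, can be illustrated with a few lines of Python. The function name `resemblance` and the sample sets are made up for this example, and returning 0.0 for two empty sets is a convention chosen here, not something the text specifies.

```python
def resemblance(s1, s2):
    """Exact similarity a / (f1 + f2 - a) of two sets."""
    f1, f2 = len(s1), len(s2)
    a = len(s1 & s2)                 # size of the intersection
    union = f1 + f2 - a              # equals len(s1 | s2)
    return a / union if union else 0.0


s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}
print(resemblance(s1, s2))           # a = 2, f1 = 4, f2 = 3

# A small set fully contained in a large one (f1 >> f2 = a):
big = set(range(100))
small = {0, 1, 2, 3, 4}
print(resemblance(big, small))
```

The second call shows the low-similarity, high-containment situation: every element of the small set is in the large one, yet the similarity is only 5/100.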
TABLE 1
As shown in Table 2 below, each document pair is given 5 different percentages of empty regions. For example, for the document pair "right-RESERVED", a = a×0.8 means that empty regions make up 20% of the total number of regions; similarly, a = a×0.7, a = a×0.5, a = a×0.3 and a = a×0.2 mean that the document pair "right-RESERVED" contains 30%, 50%, 70% and 80% empty regions respectively. The time taken by OPH and POPH to complete the hash value comparison for the different percentages of empty regions is shown in FIGS. 5-8.
TABLE 2
Thus, as the percentage of empty regions increases, the calculation time of both POPH and OPH decreases; and as the number of empty regions increases, POPH takes less and less time, because when calculating N_emp and N_mat POPH does not traverse all regions as OPH does, but compares only the hash values of non-empty regions and then calculates N_mat through the position coding, which saves both calculation time and storage space.
Example 3
As shown in FIGS. 9 to 13, this embodiment takes two key-value pair sets P_a and P_b, each with 10 regions, and the figures show the process of calculating the similarity R of P_a and P_b.
Here i is a counter over the 1st to 10th regions, and minindex is the smaller of the current keys of P_a and P_b. Because the counter i advances with minindex (i = minindex + 1) instead of looping over all k regions as in the OPH calculation, POPH saves time.
Specifically, P_a = {*, 2, *, *, 3, *, 1, *, 0, end} and P_b = {*, *, 3, *, *, *, 1, *, 1, end}; N_emp is the number of regions that are empty in both P_a and P_b; N_mat is the number of regions that are non-empty in both P_a and P_b and have equal hash values; k is the number of regions, excluding the end bit, of the larger of P_a and P_b; and P_a and P_b are the sets of key-value pairs <k, v> generated for documents a and b respectively by the position-coded single random permutation hash method (POPH).
S1: data initialization: N_mat = 0, N_emp = 0;
S2: let i = 1; the first non-empty regions of P_a and P_b are <2,2> and <3,3> respectively, so minindex = 2, N_emp = N_emp + minindex - i = 1, N_mat = 0;
S3: i = minindex + 1 = 3; the non-empty regions taken from P_a and P_b are <5,3> and <3,3> respectively, so minindex = 3, N_emp = N_emp + minindex - i = 1 (no new empty regions, so the count from S2 is kept), N_mat = 0;
S4: i = minindex + 1 = 4; the non-empty regions taken from P_a and P_b are <5,3> and <7,1> respectively, so minindex = 5, N_emp = N_emp + minindex - i = 2, N_mat = 0;
S5: i = minindex + 1 = 6; the non-empty regions taken from P_a and P_b are <7,1> and <7,1> respectively, so minindex = 7, N_emp = N_emp + minindex - i = 3, N_mat = 1;
S6: i = minindex + 1 = 8; the non-empty regions taken from P_a and P_b are <9,0> and <9,1> respectively, so minindex = 9, N_emp = N_emp + minindex - i = 4, N_mat = 1;
S7: i = minindex + 1 = 10; the traversal of the non-empty regions of P_a and P_b is complete and the counter has reached the end bit, so the comparison stops with k = 9, N_emp = 4 and N_mat = 1, and the similarity is
$$R = \frac{N_{mat}}{k - N_{emp}} = \frac{1}{9 - 4} = 0.2.$$
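The totals reached in this walk-through can be cross-checked by scanning all k regions directly, which is the region-by-region style of comparison that the text attributes to OPH. The sketch below is an illustration only: the function name `full_scan_similarity` is hypothetical, `None` stands for "*", and the end bit is left out of the lists.

```python
def full_scan_similarity(o_a, o_b):
    """Compare two compressed hash lists region by region.
    Returns (N_emp, N_mat, R) with R = N_mat / (k - N_emp)."""
    if len(o_a) != len(o_b):
        raise ValueError("both lists must cover the same k regions")
    k = len(o_a)
    n_emp = 0
    n_mat = 0
    for x, y in zip(o_a, o_b):
        if x is None and y is None:
            n_emp += 1               # region empty in both sets
        elif x is not None and x == y:
            n_mat += 1               # region non-empty in both, equal hashes
    denom = k - n_emp
    r = n_mat / denom if denom > 0 else 0.0
    return n_emp, n_mat, r


o_a = [None, 2, None, None, 3, None, 1, None, 0]
o_b = [None, None, 3, None, None, None, 1, None, 1]
print(full_scan_similarity(o_a, o_b))
```

Regions 1, 4, 6 and 8 are empty in both lists and only region 7 matches, so the scan arrives at the same N_emp = 4, N_mat = 1 and R = 0.2 after visiting all 9 regions, whereas the walk-through above needed 5 comparison steps.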
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (4)
1. A method for measuring document similarity by position-coded single random permutation hashing, characterized by comprising the following steps:
S1: preliminarily extract text features to generate the single random permutation hash set O_x:
S1.1: perform word segmentation and noise filtering on document x to obtain the word segmentation set S_x;
S1.2: map S_x with the Rabin function to obtain a new set S_xD, and apply one random permutation to S_xD to generate the set π(S_xD);
S1.3: the set of hash values generated by π(S_xD) over the universal set Ω is S_xR;
S1.4: compression-encode S_xR to obtain O_x;
S2: further extract text features to generate the single random permutation position-coded hash set P_x: traverse the non-empty regions of the set O_x from S1, take the sequence number of each non-empty region as the key and its hash value as the value, and generate key-value pairs of the form <k, v> by mixed encoding to form the set P_x;
S3: similarity measurement: traverse all key-value pairs in P_a and P_b, and compare the similarity of the two documents a and b according to the similarity N_mat / (k - N_emp);
wherein the universal set Ω is evenly divided into k regions, all of the same size m and numbered from 1 to k, and a hash value is generated for each region of Ω: if a region contains no non-zero elements, it is an empty region and its hash value is "/x"; if a region contains non-zero elements, it is a non-empty region and the smallest non-zero element in the region is used as its hash value; the hash values of the empty and non-empty regions together form S_xR; the subscript x denotes any document; P_a and P_b are the sets of key-value pairs <k, v> generated for documents a and b, respectively, by the method of S2; N_emp is the number of regions that are empty in both O_a and O_b; N_mat is the number of regions that are non-empty in both O_a and O_b and whose hash values are equal; and k is the total number of regions up to the end position of O_a and O_b, taking the larger of the two totals.
2. The method of claim 1, wherein the random permutation applied to S_xD in S1.2 satisfies: under the random permutation π, every element y of a data set Y has the same probability of being the minimum after the data set is permuted, namely Pr(min{π(Y)} = π(y)) = 1/|Y|, where the data set Y ⊆ Ω, y ∈ Y, and π is a random minwise permutation.
3. The method of claim 1, wherein the compression encoding in S1.4 uses the encoding compression function f(hash) = hash mod m, where mod is the modulo function and m is the region size of the universal set Ω; applying the compression encoding function to each hash value in S_xR generates the set O_x.
4. The method for measuring document similarity of claim 1, wherein the similarity measurement in S3 comprises the following steps:
S3.1: let N_mat = 0, N_emp = 0 and i = 1; read the key-value pairs of the non-empty regions of P_a and P_b from the beginning, where minindex is the smaller of the current keys of P_a and P_b;
S3.2: when the sequence number of the non-empty region read from P_a is not equal to that read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; the key-value pair with the larger region sequence number is then compared with the key-value pair of the next non-empty region in the set with the smaller region sequence number;
S3.2.1: i = minindex + 1, and minindex becomes the smaller of the current keys of P_a and P_b; when the sequence number of the non-empty region read from P_a is not equal to that read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; otherwise, go to step S3.3;
S3.3: when the sequence number of the non-empty region read from P_a is equal to that read from P_b, i = minindex + 1 and minindex becomes the smaller of the current keys of P_a and P_b; if the values of the two key-value pairs are equal, N_mat = N_mat + 1 and N_emp is unchanged; otherwise, go to step S3.3.1;
S3.3.1: if the values of the two key-value pairs are not equal, N_emp = N_emp + minindex - i and N_mat is unchanged; continue reading the key-value pairs of the next non-empty regions in P_a and P_b, with i = minindex + 1 and minindex becoming the smaller of the current keys of P_a and P_b; go to step S3.4;
S3.4: if the traversal of the key-value pairs of P_a and P_b has reached the end position, stop the comparison; otherwise, go to step S3.2.
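The feature-extraction steps S1 and S2 of claim 1 can be illustrated with a short Python sketch. It rests on several assumptions that are not part of the claims: the function name `position_coded_oph` is hypothetical, a BLAKE2b digest from `hashlib` stands in for the Rabin function, the affine map (a*x + b) mod (k*m) stands in for the random permutation π (it is a bijection when a is coprime with k*m, but it is not a true minwise-independent permutation), and positions in Ω are counted from 0.

```python
import hashlib
from math import gcd


def position_coded_oph(tokens, k, m, a=5, b=3):
    """Return sorted (region_number, compressed_hash) pairs for one document.

    tokens: word-segmented, noise-filtered terms of the document (S1.1).
    k, m:   number of regions and region size, so that |Omega| = k * m.
    a, b:   parameters of the stand-in permutation (assumption).
    """
    d = k * m
    if gcd(a, d) != 1:
        raise ValueError("a must be coprime with k*m so the map is a bijection")
    smallest = {}  # region number -> smallest permuted position seen so far
    for tok in set(tokens):
        # Stand-in for the Rabin function of S1.2: map the token into [0, d).
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        x = int.from_bytes(digest, "big") % d
        y = (a * x + b) % d            # stand-in for the random permutation
        region = y // m + 1            # regions are numbered 1..k
        if region not in smallest or y < smallest[region]:
            smallest[region] = y       # S1.3: keep the minimum per region
    # S1.4 and S2: compress with f(hash) = hash mod m and keep only the
    # non-empty regions as <key, value> pairs ordered by region number.
    return [(region, smallest[region] % m) for region in sorted(smallest)]


pairs = position_coded_oph(["hash", "document", "similarity"], k=4, m=8)
print(pairs)
```

Empty regions never appear in the output, so no placeholder value has to be stored for them; the region number in each pair records where the gaps are.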
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010235463.1A CN111444325B (en) | 2020-03-30 | 2020-03-30 | Method for measuring document similarity by position coding single random replacement hash |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010235463.1A CN111444325B (en) | 2020-03-30 | 2020-03-30 | Method for measuring document similarity by position coding single random replacement hash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111444325A (en) | 2020-07-24 |
CN111444325B (en) | 2023-06-20 |
Family
ID=71654022
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010235463.1A Active CN111444325B (en) | 2020-03-30 | 2020-03-30 | Method for measuring document similarity by position coding single random replacement hash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444325B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9311403B1 (en) * | 2010-06-16 | 2016-04-12 | Google Inc. | Hashing techniques for data set similarity determination |
WO2016180268A1 (en) * | 2015-05-13 | 2016-11-17 | 阿里巴巴集团控股有限公司 | Text aggregate method and device |
CN109766455A (en) * | 2018-11-15 | 2019-05-17 | 南京邮电大学 | A kind of full similitude reservation Hash cross-module state search method having identification |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104636325B (en) * | 2015-02-06 | 2015-09-30 | 中南大学 | A kind of method based on Maximum-likelihood estimation determination Documents Similarity |
CN105373521B (en) * | 2015-12-04 | 2018-06-29 | 湖南工业大学 | It is a kind of that the method for calculating text similarity is filtered based on Minwise Hash dynamics multi-threshold |
CN105718430B (en) * | 2016-01-13 | 2018-05-04 | 湖南工业大学 | A kind of method for calculating similarity as fingerprint based on packet minimum value |
EP3408786B1 (en) * | 2017-01-23 | 2019-11-20 | Istanbul Teknik Universitesi | A method of privacy preserving document similarity detection |
CN108415889B (en) * | 2018-03-19 | 2021-05-14 | 中南大学 | Text similarity detection method based on weighted one-time permutation hash algorithm |
CN108595517B (en) * | 2018-03-26 | 2021-03-09 | 南京邮电大学 | Large-scale document similarity detection method |
Also Published As
Publication number | Publication date |
---|---|
CN111444325A (en) | 2020-07-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rashtchian et al. | Clustering billions of reads for DNA data storage | |
CN109960724A (en) | A kind of text snippet method based on TF-IDF | |
US11050436B2 (en) | Advanced database compression | |
US11962330B2 (en) | Advanced database decompression | |
CN109740660A (en) | Image processing method and device | |
CN108536657A (en) | The address text similarity processing method and system artificially filled in | |
Andrade et al. | Evolutionary algorithms for overlapping correlation clustering | |
CN111143547A (en) | Big data display method based on knowledge graph | |
CN115759082A (en) | Text duplicate checking method and device based on improved Simhash algorithm | |
Cai et al. | MWFP-outlier: Maximal weighted frequent-pattern-based approach for detecting outliers from uncertain weighted data streams | |
CN111444325B (en) | Method for measuring document similarity by position coding single random replacement hash | |
Geravand et al. | A novel adjustable matrix bloom filter-based copy detection system for digital libraries | |
CN105718430B (en) | A kind of method for calculating similarity as fingerprint based on packet minimum value | |
CN116204612A (en) | Text similarity calculation method and system | |
Soliman et al. | FIRLA: a Fast Incremental Record Linkage Algorithm | |
CN110348469A (en) | A kind of user's method for measuring similarity based on DeepWalk internet startup disk model | |
Cohen et al. | Incremental info-fuzzy algorithm for real time data mining of non-stationary data streams | |
CN105373521B (en) | It is a kind of that the method for calculating text similarity is filtered based on Minwise Hash dynamics multi-threshold | |
Maria et al. | Computation of large asymptotics of 3-manifold quantum invariants | |
Wangmo et al. | Efficient Subgraph Indexing for Biochemical Graphs. | |
CN114360729A (en) | Medical text information automatic extraction method based on deep neural network | |
Bille et al. | Hierarchical relative lempel-ziv compression | |
Rebenich et al. | FLOTT—A Fast, Low Memory T-TransformAlgorithm for Measuring String Complexity | |
Iliopoulos et al. | Faster index for property matching | |
Zhou et al. | A dynamic pattern recognition approach based on neural network for stock time-series |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20200724; Assignee: Changsha Feilaishi Information Technology Co.,Ltd.; Assignor: HUNAN University OF TECHNOLOGY; Contract record no.: X2024980008031; Denomination of invention: A Method for Measuring Document Similarity through a Single Random Permutation Hash with Position Encoding; Granted publication date: 20230620; License type: Exclusive License; Record date: 20240626 |