CN111444325B

CN111444325B - Method for measuring document similarity by position-encoded single random permutation hash

Info

Publication number: CN111444325B
Application number: CN202010235463.1A
Authority: CN
Inventors: 袁鑫攀; 王松林; 毛鑫鑫
Original assignee: Hunan University of Technology
Current assignee: Hunan University of Technology
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2023-06-20
Anticipated expiration: 2040-03-30
Also published as: CN111444325A

Abstract

A method for measuring document similarity by position-encoded single random permutation hashing (POPH), belonging to the field of similar-text search in information retrieval, comprises the following steps. S1: preliminarily extract text features to generate a single random permutation hash set O_x. S2: further extract text features to generate a position-encoded single random permutation hash set P_x: traverse the non-empty regions of the set O_x from S1, take the serial number of each non-empty region as the key and its hash value as the value, and encode them together into key-value pairs of the form <k, v>, which form the set P_x. S3: similarity measurement: traverse all key-value pairs in P_a and P_b and compare the similarity of the two documents a and b according to

$$R = \frac{N_{mat}}{k - N_{emp}}$$

The calculation precision is high and matches that of OPH. As the number of empty regions increases, the POPH method saves both computation time and storage space.

Description

A Method for Measuring Document Similarity by Position-encoded Single Random Permutation Hash

Technical Field

The invention belongs to the field of similar-text search in information retrieval, and more specifically relates to a method for measuring document similarity by position-encoded single random permutation hashing.

Background

The Web is growing explosively, and more and more documents are published online. Document resources on the Internet are therefore growing geometrically, which offers unprecedented convenience for sharing knowledge and creating wealth and actively promotes China's modernization. However, the same easy availability makes illegal copying and plagiarism of documents increasingly rampant, so that papers, project applications and similar documents may contain serious plagiarism. At the same time, with the state's heavy investment in education and scientific research, funding is available for many kinds of educational and scientific projects, such as National Natural Science Foundation projects, the Ministry of Education's doctoral program projects, provincial and municipal fund projects, and various science and technology plans. Because these projects are administered by different functional departments, the same project application may be submitted repeatedly or to several funding bodies at once. Plagiarized, repeated and multi-agency applications seriously undermine the objectivity and fairness of project approval and harm the rational allocation of national research funds, which may then not be used efficiently. Research on document similarity detection is therefore valuable for preventing plagiarism and upholding academic integrity. Search engines, libraries, foundations, paper repositories and intellectual property offices around the world have invested heavily in manpower, materials and money to study document similarity detection, hoping to solve its key scientific problems soon and so provide good solutions for duplicate checking of papers, project applications, award applications and patents, and for web-page deduplication in search engines.

Similarity detection involves massive amounts of data. Taking National Natural Science Foundation of China applications as an example, more than 200,000 applications were submitted in 2019 alone, and the number continues to grow quickly each year. As another example, in recent years about 7 million students have graduated from Chinese universities each year, and most of their theses require similarity detection. Detection volume peaks every May, with a daily average of more than several tens of thousands of theses. Each thesis must be checked not only against the current year's data but also against historical data. Conventional detection methods cannot cope with documents at this scale, so there is an urgent need to use hash-based similarity estimation to build a detection mechanism that is both accurate and efficient and can compare massive document collections. The concept of text similarity measurement and its related techniques arose in response. A good text similarity measure is important in research areas such as similarity detection, automatic question answering, intelligent retrieval, web-page deduplication and natural language processing.

Text similarity is a measure of the degree of matching between two or more texts: the higher the similarity, the more alike the texts, and vice versa. The traditional method is the vector space model (VSM), which obtains the similarity of two documents by computing the inner product of the weighted frequency vectors of the query document and a document in the data set. This approach must store a large number of feature words, compares slowly and has low accuracy, so it cannot be applied to similarity measurement over massive data. Minwise-based similarity measurement, the most mainstream and mature similarity detection method, converts the similarity problem into the probability of an event: it maps the set of words in a text to a set of hash values, turning string comparison into feature-fingerprint comparison, and is suitable for similarity measurement over massive data.

Minwise-based similarity measurement algorithms and their variants have high estimation accuracy, but research institutions continue to pursue higher accuracy. Because real detection data are diverse and random, a large text often contains a small text (f₁ ≫ f₂ ≈ a), where f₁ and f₂ are the word-set sizes of document 1 and document 2 and a is the size of their intersection. Because f₁ ≫ f₂, the similarity is very small; because f₂ ≈ a, the containment of document 2 in document 1 is close to 1, and such high containment indicates that document 2 is copied entirely from document 1. For such low-similarity, high-containment cases, Minwise-based similarity measurement algorithms have large variance and insufficient accuracy. Although such data are a special case, they are common in practice; the similarity deviation can sometimes exceed 20%, and no good way of handling them currently exists.

However, Minwise-based similarity measurement has a drawback: it requires k independent random permutations to generate k hash values, which are then compared one by one to compute the similarity of a document pair. The k permutations are time-consuming and account for 80% of the total time required. One Permutation Hashing (OPH) needs only a single permutation to achieve the effect of k permutations and generate k hash values, which improves computational efficiency.

Invention patent publication CN108415889A, published on 2018-08-17 and titled "A text similarity detection method based on a weighted one permutation hashing algorithm", proposes a non-uniform region division method in which setting a threshold reduces the number of hash value comparisons and thus effectively improves computational efficiency.

However, when too many regions have an empty hash value, both the single random permutation hashing method and the weighted single permutation hashing method incur excessive performance overhead.

Summary of the Invention

To solve the above technical problem, the present invention proposes a method for measuring document similarity by position-encoded single random permutation hashing (Position One Permutation Hashing, POPH). It addresses the cost of comparing hash values when OPH produces an excessive number of empty regions, improves computational performance, and has considerable scientific significance and practical value.

The present invention adopts the following technical solution:

A method for measuring document similarity by position-encoded single random permutation hashing, comprising the following steps:

S1: preliminarily extract text features to generate a single random permutation hash set O_x;

S2: further extract text features to generate a position-encoded single random permutation hash set P_x: traverse the non-empty regions of the set O_x from S1, take the serial number of each non-empty region as the key and its hash value as the value, and encode them together into key-value pairs of the form <k, v>, which form the set P_x;

S3: similarity measurement: traverse all key-value pairs in P_a and P_b and compare the similarity of the two documents a and b according to

$$R = \frac{N_{mat}}{k - N_{emp}}$$

where the subscript x denotes an arbitrary document; P_a and P_b are the sets of <k, v> key-value pairs generated for documents a and b by the method of S2; N_emp is the number of regions that are empty in both O_a and O_b; N_mat is the number of regions that are non-empty in both O_a and O_b and whose hash values are equal; and k is the number of regions, excluding the end bit, of whichever of O_a and O_b has the larger total number of regions.

Further, the hash set O_x of S1 is generated as follows:

S1.1: segment document x into words and filter noise to obtain the word set S_x;

S1.2: map S_x with the Rabin function to obtain a new set S_xD, and apply one random permutation to S_xD to generate the set π(S_xD);

S1.3: the hash values that π(S_xD) generates on the universal set Ω form S_xR;

S1.4: compress and encode S_xR to obtain O_x.

Further, the random permutation applied to S_xD in S1.2 satisfies the following: under the random permutation π, every element y of a data set Y has the same probability of being the minimum of the permuted data set, that is,

$$\Pr\big(\min\{\pi(Y)\} = \pi(y)\big) = \frac{1}{|Y|}$$

where Y ⊆ Ω, y ∈ Y, and π is a random min-wise permutation.

Further, in S1.3 the universal set Ω is divided evenly into k regions, all of equal size m, and the regions are numbered from 1 to k.

Further, each region of the universal set Ω generates one hash value: if a region contains no non-zero element, it is an empty region and its hash value is "*"; if a region contains a non-zero element, it is a non-empty region and the smallest non-zero element in the region is taken as its hash value; the hash values of the empty and non-empty regions together form S_xR.

Further, the compression encoding in S1.4 uses the compression function f(hash) = hash mod m, where mod is the modulo operation and m is the region size of the universal set Ω; applying the compression function to every hash value in S_xR generates the set O_x.

Further, the similarity measurement of S3 proceeds as follows:

S3.1: set N_mat = 0, N_emp = 0 and i = 1, and read the key-value pairs of the non-empty regions of P_a and P_b from the beginning; minindex is the smaller of the current keys in P_a and P_b;

S3.2: when the region number read from P_a differs from the region number read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; the key-value pair with the larger region number is then compared with the next non-empty-region key-value pair of the set that held the smaller region number;

S3.2.1: i = minindex + 1 and minindex becomes the smaller of the current keys in P_a and P_b; when the region number read from P_a differs from the region number read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; otherwise go to step S3.3;

S3.3: when the region number read from P_a equals the region number read from P_b, i = minindex + 1 and minindex becomes the smaller of the current keys in P_a and P_b; if the values of the two key-value pairs are equal, N_mat = N_mat + 1 and N_emp is unchanged; otherwise go to step S3.3.1;

S3.3.1: if the values of the two key-value pairs are not equal, N_emp = N_emp + minindex - i and N_mat is unchanged; continue by reading the key-value pairs of the next non-empty regions of P_a and P_b, with i = minindex + 1 and minindex becoming the smaller of the current keys in P_a and P_b, and go to step S3.4;

S3.4: if all key-value pairs in P_a and P_b have been traversed up to the end bit, stop the comparison; otherwise go to step S3.2.
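The traversal in S3.1 to S3.4 can be sketched as a merge over two sorted lists of key-value pairs. This is a minimal illustration rather than the patented implementation: the function name `poph_similarity`, the list-of-tuples representation, treating the end bit as a key one past region k, and returning 0.0 when every region is jointly empty are all assumptions, and the estimator R = N_mat / (k - N_emp) is the standard OPH form that the text says POPH shares. The bookkeeping follows the worked example of Embodiment 3, where the gap minindex - i is added to N_emp at every step.

```python
def poph_similarity(p_a, p_b, k):
    """Estimate R = N_mat / (k - N_emp) from two position-encoded sets.

    p_a, p_b: lists of (bin_id, hash_value) pairs sorted by bin_id,
              holding only the non-empty regions (bin ids run 1..k).
    Returns (R, N_mat, N_emp).
    """
    end = k + 1  # assumption: the end bit acts like a key just past region k
    n_mat = n_emp = 0
    i = 1        # next region number not yet accounted for
    ia = ib = 0  # read positions in p_a and p_b
    while ia < len(p_a) or ib < len(p_b):
        key_a = p_a[ia][0] if ia < len(p_a) else end
        key_b = p_b[ib][0] if ib < len(p_b) else end
        minindex = min(key_a, key_b)
        # Regions i .. minindex-1 are empty in both documents.
        n_emp += minindex - i
        if key_a == key_b:
            if p_a[ia][1] == p_b[ib][1]:
                n_mat += 1
            ia += 1
            ib += 1
        elif key_a < key_b:
            ia += 1  # advance the set holding the smaller region number
        else:
            ib += 1
        i = minindex + 1
    n_emp += end - i  # trailing regions that are empty in both documents
    denom = k - n_emp
    # Assumption: report 0.0 when every region is empty in both documents.
    return (n_mat / denom if denom else 0.0), n_mat, n_emp


# The two sets used in Embodiment 3 (k = 9 regions, end bit excluded).
p_a = [(2, 2), (5, 3), (7, 1), (9, 0)]
p_b = [(3, 3), (7, 1), (9, 1)]
print(poph_similarity(p_a, p_b, k=9))  # (0.2, 1, 4)
```

The loop runs once per non-empty region rather than once per region, which is where the time saving over a full scan of all k regions comes from when many regions are empty.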

The beneficial effects of the present invention are:

(1) High calculation precision: because the formula for the similarity R is the same as in OPH, the precision is also the same.

(2) As the number of empty regions increases, POPH takes increasingly less time than OPH. POPH compares only the hash values of non-empty regions and then calculates N_mat through the position encoding, which saves both computation time and storage space.

Brief Description of the Drawings

Fig. 1 shows the process of generating the single random permutation hash set OPH(S_x) from a document in an embodiment of the present invention;

Fig. 2 shows the regions after compression encoding of the single random permutation hash values of Fig. 1;

Fig. 3 shows the process of generating the position-encoded single random permutation hash set POPH(S_x) from the compression encoding of Fig. 2;

Fig. 4 compares the time taken by OPH and POPH to complete hash value comparison for different document pairs in an embodiment when there are no empty regions;

Fig. 5 compares the time taken by OPH and POPH to complete hash value comparison for the 1st document pair of Fig. 4 when empty regions exist;

Fig. 6 compares the time taken by OPH and POPH to complete hash value comparison for the 2nd document pair of Fig. 4 when empty regions exist;

Fig. 7 compares the time taken by OPH and POPH to complete hash value comparison for the 3rd document pair of Fig. 4 when empty regions exist;

Fig. 8 compares the time taken by OPH and POPH to complete hash value comparison for the 4th document pair of Fig. 4 when empty regions exist;

Fig. 9 shows the comparison of the sets P_a and P_b when the counter i = 1 in an embodiment;

Fig. 10 shows the comparison of the sets P_a and P_b when the counter i = 3 in the embodiment of Fig. 9;

Fig. 11 shows the comparison of the sets P_a and P_b when the counter i = 4 in the embodiment of Fig. 9;

Fig. 12 shows the comparison of the sets P_a and P_b when the counter i = 6 in the embodiment of Fig. 9;

Fig. 13 shows the comparison of the sets P_a and P_b when the counter i = 8 in the embodiment of Fig. 9.

Detailed Description

The present invention is further described below with reference to specific embodiments. Unless otherwise stated, the methods used in the embodiments are conventional in the art.

Embodiment 1

S3: similarity measurement: traverse all key-value pairs in P_a and P_b and compare the similarity of the two documents a and b according to

$$R = \frac{N_{mat}}{k - N_{emp}}$$

Specifically, S1 is as follows. First, the text is scanned and analyzed, the document is segmented with a Chinese word segmentation algorithm, and text noise is filtered out with a stop-word list; the remaining set of words is the word set S_x of document x. Noise consists of meaningless words in the text, generally high-frequency, low-meaning words such as auxiliary and function words.

The word set S_x (the subscript x denotes an arbitrary document) is mapped with the Rabin function, which can map a set of strings to a set of 32-bit or 64-bit natural numbers; here it maps to 32-bit integers, and the mapped set is named S_xD. Assume the universal set Ω = {0, 1, ..., D-1}. Let a₀a₁...a_{D-1} always denote an arrangement of Ω; the vector (a₀, a₁, ..., a_{D-1}) then represents a permutation of Ω:

$$\pi = \begin{pmatrix} 0 & 1 & \cdots & D-1 \\ a_0 & a_1 & \cdots & a_{D-1} \end{pmatrix}$$

If, for a data set Y ⊆ Ω and any y ∈ Y, a permutation π satisfies

$$\Pr\big(\min\{\pi(Y)\} = \pi(y)\big) = \frac{1}{|Y|}$$

then π is a random min-wise permutation. In other words, under the random permutation π every element y of the data set Y has the same probability of being the minimum of the permuted data set. The set generated by applying one random permutation to S_xD is named π(S_xD).
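The min-wise property can be checked exhaustively on a small universe. The sketch below is illustrative only: the universe size D = 5, the subset Y = {0, 2, 3} and the helper name `min_counts` are arbitrary choices, not values from the patent. It enumerates every permutation of Ω, so it models a permutation drawn uniformly at random, and counts how often each element of Y is mapped to the minimum.

```python
from fractions import Fraction
from itertools import permutations


def min_counts(universe_size, subset):
    """Over all permutations pi of {0..D-1}, count how often each y in
    `subset` satisfies pi(y) == min(pi(Y))."""
    counts = {y: 0 for y in subset}
    total = 0
    for perm in permutations(range(universe_size)):
        # perm[i] is pi(i), i.e. the vector (a_0, ..., a_{D-1}).
        winner = min(subset, key=lambda y: perm[y])
        counts[winner] += 1
        total += 1
    return counts, total


counts, total = min_counts(5, [0, 2, 3])
probabilities = {y: Fraction(c, total) for y, c in counts.items()}
print(probabilities)  # every element of Y wins with probability 1/|Y| = 1/3
```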

The universal set Ω is divided evenly into k regions (bins), so all regions have the same size; let the region size be m. The regions are numbered from 1 to k, and this number is called the BinId (Bid). Every value in π(S_xD) falls in some bin of Ω, and each bin then generates one hash value as follows. If a bin contains no non-zero element, "*" is taken as its hash value and the bin is called an empty region (a bin of all zeros is empty). If the i-th bin contains a non-zero element, the smallest non-zero element in the bin is taken as its hash value, and the bin is called a non-empty region. The hash values of empty and non-empty regions are collectively called Binhash, and the set of hash values that π(S_xD) generates on Ω is S_xR. Finally, S_xR is compressed and encoded with the compression function f(hash) = hash mod m, where mod is the modulo operation; applying it to every hash value in S_xR yields the set OPH(S_x), abbreviated O_x.

As shown in Figs. 1 and 2, assume the universal set Ω = {0, 1, 2, ..., 35} (D = 36) is divided evenly into 9 regions, that is, k = 9, so that each region has size m = 4. Let S_x be the word set of some document x. The Rabin function maps S_x to 32-bit integers, giving the set S_xD, and a random permutation of S_xD gives π(S_xD) = {6, 19, 25, 32}. Mapping each value of π(S_xD) into Ω generates the hash value set S_xR = {*, 6, *, *, 19, *, 25, *, 32}. Finally, compressing the hash values in S_xR gives O_x = {*, 2, *, *, 3, *, 1, *, 0}.
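The binning and compression steps of this example can be reproduced in a few lines. This is a sketch under stated assumptions: the function name `oph_sketch` is hypothetical, `None` stands in for "*", the permuted values are treated as the positions of the non-zero entries in Ω, and k is assumed to divide D evenly. Word segmentation, the Rabin mapping and the permutation itself are taken as already done.

```python
def oph_sketch(permuted, universe_size, k):
    """Return O_x: one compressed hash per region, None for an empty region."""
    m = universe_size // k       # region size; assumes k divides D evenly
    sketch = [None] * k          # None plays the role of "*"
    for value in permuted:
        bin_index = value // m   # zero-based; the region number is bin_index + 1
        code = value % m         # compression f(hash) = hash mod m
        # Within one region the smallest value also has the smallest remainder,
        # so keeping the minimum code equals compressing the minimum value.
        if sketch[bin_index] is None or code < sketch[bin_index]:
            sketch[bin_index] = code
    return sketch


print(oph_sketch([6, 19, 25, 32], universe_size=36, k=9))
# [None, 2, None, None, 3, None, 1, None, 0]
```

Region 9 shows why an explicit empty marker is needed: its compressed hash is 0, which is a valid value and must not be confused with an empty region.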

Specifically, S2 is as follows. As shown in Fig. 3, the Bid of a region, used as a position feature, is encoded together with the hash value of that bin to generate a key-value pair of the form <k, v>. All non-empty regions of O_x are traversed; for each, the bin's Bid is the key and the bin's Binhash is the value, and the pair is stored in the <k, v> structure, forming the set P_x.
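A minimal sketch of this position encoding, assuming O_x is held as a Python list with `None` for an empty region (the function name `position_encode` and the list-of-tuples output are illustrative choices, not part of the patent):

```python
def position_encode(sketch):
    """Turn O_x (None = empty region) into P_x, a list of (Bid, Binhash) pairs."""
    # Bids are numbered from 1; `is not None` keeps regions whose hash is 0.
    return [(bid, h) for bid, h in enumerate(sketch, start=1) if h is not None]


o_x = [None, 2, None, None, 3, None, 1, None, 0]
print(position_encode(o_x))  # [(2, 2), (5, 3), (7, 1), (9, 0)]
```

Only the four non-empty regions are stored, so the size of P_x shrinks as the share of empty regions grows.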

The similarity measurement of S3 follows steps S3.1 to S3.4 above, after which the similarity R is calculated with the formula

$$R = \frac{N_{mat}}{k - N_{emp}}$$

Embodiment 2

Four document pairs are selected from the experimental data set and divided into 4 groups in descending order of similarity; each document pair is represented by a randomly selected pair of words. The experimental data are shown in Table 1 (f₁ and f₂ are the word-set sizes of document 1 and document 2, and a is the size of their intersection). If the randomly permuted set π(S_xD) produces no empty regions, then when computing N_emp and N_mat POPH must compare both Bid and Binhash, whereas OPH only needs to compare Binhash, so OPH is faster than POPH in this case, as shown in Fig. 4.

The time taken by OPH and POPH to complete hash value comparison is then measured when empty regions occur in different proportions. This data set is built on the first data set above. The basic formula for the similarity between the sets S₁ and S₂ is

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}$$

where f₁ = |S₁|, f₂ = |S₂| and a = |S₁ ∩ S₂|. When the total number of regions into which Ω is evenly divided stays fixed, reducing a produces different numbers of empty regions, so the number of empty regions is increased by reducing a for each document pair in Table 1.
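As a small numeric illustration, the resemblance can be computed directly from f₁, f₂ and a, assuming the usual set-resemblance (Jaccard) definition R = a / (f₁ + f₂ - a). The set sizes below are made up for the example and are not taken from Table 1; they reproduce the low-similarity, high-containment case described in the background.

```python
def resemblance(s1, s2):
    """R = a / (f1 + f2 - a) for two finite sets."""
    f1, f2 = len(s1), len(s2)
    a = len(s1 & s2)
    union = f1 + f2 - a
    return a / union if union else 0.0  # assumption: two empty sets give 0.0


big = set(range(1000))   # f1 = 1000
small = set(range(10))   # f2 = 10, fully contained in `big`, so a = 10
print(resemblance(big, small))        # 0.01
print(len(big & small) / len(small))  # containment of the small set: 1.0
```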

Table 1

As shown in Table 2, each document pair has 5 different empty-region percentages. For example, for the document pair "RIGHTS-RESERVED", a = a*0.8 means that empty regions make up 20% of the total number of regions; likewise a = a*0.7, a = a*0.5, a = a*0.3 and a = a*0.2 denote versions of the "RIGHTS-RESERVED" pair containing 30%, 50%, 70% and 80% empty regions respectively. Figs. 5 to 8 compare the time taken by OPH and POPH to complete hash value comparison at these percentages.

Table 2

Thus, as the percentage of empty regions increases, the computation time of both POPH and OPH decreases, and POPH takes increasingly less time. This is because, when computing N_emp and N_mat, POPH does not traverse every region as OPH does; it compares only the hash values of non-empty regions and then calculates N_mat through the position encoding, which saves both computation time and storage space.

Embodiment 3

As shown in Figs. 9 to 13, this embodiment takes two key-value pair sets P_a and P_b, each with 10 regions (including the end bit), and illustrates how the similarity R between P_a and P_b is calculated.

Here i is a counter over regions 1 to 10, and minindex is the smaller of the current keys in P_a and P_b. The counter i advances with minindex (i = minindex + 1) instead of looping over all k regions as OPH does, which is why POPH saves time.

Specifically, P_a = {*, 2, *, *, 3, *, 1, *, 0, end} and P_b = {*, *, 3, *, *, *, 1, *, 1, end}. N_emp is the number of regions that are empty in both P_a and P_b, N_mat is the number of regions that are non-empty in both and have equal hash values, and k is the larger of the region counts of P_a and P_b, excluding the end bit. P_a and P_b are the sets of <k, v> key-value pairs generated for documents a and b by position-encoded single random permutation hashing (POPH).

S1: initialize the data: N_mat = 0, N_emp = 0;

S2: let i = 1. The first non-empty regions of P_a and P_b are <2,2> and <3,3>, so minindex = 2, N_emp = minindex - i = 1 and N_mat = 0;

S3: i = minindex + 1 = 3. The current non-empty regions of P_a and P_b are <5,3> and <3,3>, so minindex = 3, N_emp = N_emp + minindex - i = 1 (no new jointly empty region, so the count from S2 is kept) and N_mat = 0;

S4: i = minindex + 1 = 4. The current non-empty regions of P_a and P_b are <5,3> and <7,1>, so minindex = 5, N_emp = N_emp + minindex - i = 2 and N_mat = 0;

S5: i = minindex + 1 = 6. The current non-empty regions of P_a and P_b are <7,1> and <7,1>, so minindex = 7, N_emp = N_emp + minindex - i = 3 and N_mat = 1;

S6: i = minindex + 1 = 8. The current non-empty regions of P_a and P_b are <9,0> and <9,1>, so minindex = 9, N_emp = N_emp + minindex - i = 4 and N_mat = 1;

S7: i = minindex + 1 = 10. The non-empty regions of P_a and P_b have all been traversed and the counter has reached the end bit, so the comparison stops with k = 9, N_emp = 4 and N_mat = 1. The similarity is therefore

$$R = \frac{N_{mat}}{k - N_{emp}} = \frac{1}{9 - 4} = 0.2$$
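For comparison, the same counts can be obtained the OPH way, by visiting every one of the k regions. The sketch below is illustrative: the function name `oph_counts` and the use of `None` for "*" are assumptions, and the final ratio assumes the standard OPH estimator R = N_mat / (k - N_emp).

```python
EMPTY = None  # stands in for "*"


def oph_counts(o_a, o_b):
    """Count N_mat and N_emp by visiting every one of the k regions."""
    n_mat = n_emp = 0
    for h_a, h_b in zip(o_a, o_b):
        if h_a is EMPTY and h_b is EMPTY:
            n_emp += 1            # region empty in both documents
        elif h_a is not EMPTY and h_a == h_b:
            n_mat += 1            # region non-empty in both, hashes equal
    return n_mat, n_emp


# The two sets of this embodiment without the end bit.
o_a = [EMPTY, 2, EMPTY, EMPTY, 3, EMPTY, 1, EMPTY, 0]
o_b = [EMPTY, EMPTY, 3, EMPTY, EMPTY, EMPTY, 1, EMPTY, 1]
n_mat, n_emp = oph_counts(o_a, o_b)
k = len(o_a)
print(n_mat, n_emp, n_mat / (k - n_emp))  # 1 4 0.2
```

The full scan performs 9 comparisons here, while the position-encoded walk-through above needs only 5 steps; both arrive at N_mat = 1 and N_emp = 4.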

The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims

1. A method for measuring document similarity by position-coded single random permutation hash, characterized by comprising the following steps:

S1, preliminarily extracting text features to generate a single random permutation hash set O_x;

S1.1: performing word segmentation and noise filtering on document x to obtain a word segmentation set S_x;

S1.2: mapping S_x with the Rabin function to obtain a new set S_xD, and performing one random permutation on the set S_xD to generate the set π(S_xD);

S1.3: the hash values generated by π(S_xD) on the whole set Ω form S_xR;

S1.4: compression-encoding S_xR to obtain O_x;

S2, further extracting text features to generate a single random permutation position-coded hash set P_x: traversing the non-empty regions of the set O_x from S1, taking the sequence number of each non-empty region as the key and its hash value as the value, and generating key-value pairs of the form <k, v> by mixed coding to form the set P_x;

S3, similarity measurement: traversing all key-value pairs in P_a and P_b, and comparing the similarity of the two documents a and b according to the similarity formula;

wherein the whole set Ω is uniformly divided into k regions, all of the same size m and numbered from 1 to k, and each region of the whole set Ω generates a hash value: if a region contains no non-zero element, it is an empty region and its hash value is "/x"; if a region contains non-zero elements, it is a non-empty region and the smallest non-zero element in the region is taken as its hash value; the hash values of the empty and non-empty regions together form S_xR; the subscript x denotes any document; P_a and P_b are the sets of key-value pairs <k, v> generated by the method of S2 for documents a and b respectively; N_emp is the number of regions that are empty in both O_a and O_b; N_mat is the number of regions that are non-empty in both O_a and O_b and have equal hash values; and k is the total number of regions, taken at the larger of the end positions of the sets O_a and O_b.
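The position coding of step S2 can be illustrated with a short Python sketch. The names `EMPTY` and `position_encode` are hypothetical, `None` stands in for the empty-region marker, and the lists `o_a` and `o_b` are reconstructed from the key-value pairs of the worked example in the description; any bit-level packing implied by "mixed coding" is not modelled.

```python
EMPTY = None  # stand-in for the hash value of an empty region


def position_encode(o_x):
    """Keep only non-empty regions as (region_number, hash_value) pairs.

    Regions are numbered from 1, so the key is the 1-based position in o_x.
    """
    return [(j, v) for j, v in enumerate(o_x, start=1) if v is not EMPTY]


# Nine regions each; most are empty, which is where the encoding saves space.
o_a = [EMPTY, 2, EMPTY, EMPTY, 3, EMPTY, 1, EMPTY, 0]
o_b = [EMPTY, EMPTY, 3, EMPTY, EMPTY, EMPTY, 1, EMPTY, 1]

p_a = position_encode(o_a)
p_b = position_encode(o_b)
print(p_a)
print(p_b)
```

Only four of the nine regions of `o_a` and three of `o_b` survive the encoding, so the similarity traversal in S3 has fewer entries to visit as the number of empty regions grows.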

2. The method of claim 1, wherein the random permutation performed on S_xD in S1.2 satisfies: under the random permutation π, every element y of a data set Y has the same probability of being the minimum of the permuted data set, namely Pr(min{π(Y)} = π(y)) = 1/|Y|,

wherein the data set Y ⊆ Ω, y ∈ Y, and π is a random min-wise permutation.

3. The method of claim 1, wherein the compression encoding in S1.4 uses the compression encoding function f(hash) = hash mod m, where mod is the modulo function and m is the region size of the whole set Ω; applying the compression encoding function to each hash value in S_xR generates the set O_x.
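As a minimal sketch of S1.3 and S1.4, assuming the permuted document is given as a set of 0-based positions in a whole set of size k * m: the helper names `region_hashes` and `compress` are hypothetical, and `None` stands in for the empty-region marker.

```python
def region_hashes(permuted, k, m):
    """S_xR: the smallest permuted position falling in each of k regions."""
    s_xr = [None] * k
    for pos in permuted:
        j = pos // m                      # 0-based region containing pos
        if s_xr[j] is None or pos < s_xr[j]:
            s_xr[j] = pos
    return s_xr


def compress(s_xr, m):
    """O_x: apply f(hash) = hash mod m to every non-empty region."""
    return [None if h is None else h % m for h in s_xr]


# Hypothetical toy input: k = 3 regions of size m = 4, positions 0..11.
permuted = {1, 3, 9, 10}
s_xr = region_hashes(permuted, k=3, m=4)
o_x = compress(s_xr, m=4)
print(s_xr, o_x)
```

After compression every stored value lies in the range 0 to m - 1, so it records only the offset inside its region rather than the position in the whole set.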

4. The method for measuring document similarity according to claim 1, wherein the similarity measurement in S3 proceeds as follows:

S3.1: let N_mat = 0, N_emp = 0, and i = 1; read the key-value pairs of the non-empty regions of P_a and P_b from the beginning, and let minindex be the smaller of the current key values of the sets P_a and P_b;

S3.2: when the sequence number of the non-empty region read from P_a is not equal to the sequence number of the non-empty region read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged, and the key-value pair with the larger region sequence number among the current pairs of P_a and P_b is next compared with the key-value pair of the next non-empty region in the set with the smaller region sequence number;

S3.2.1: i = minindex + 1, and minindex becomes the smaller of the current key values of the sets P_a and P_b; when the sequence number of the non-empty region read from P_a is not equal to the sequence number of the non-empty region read from P_b, N_emp = N_emp + minindex - i and N_mat is unchanged; otherwise, go to step S3.3;

S3.3: when the sequence number of the non-empty region read from P_a is equal to the sequence number of the non-empty region read from P_b, i = minindex + 1 and minindex becomes the smaller of the current key values of the sets P_a and P_b; if the values of the two key-value pairs are equal, N_mat = N_mat + 1 and N_emp is unchanged; otherwise, go to step S3.3.1;

S3.3.1: if the values of the two key-value pairs are not equal, N_emp = N_emp + minindex - i and N_mat is unchanged; continue reading the key-value pairs of the next non-empty regions in P_a and P_b, set i = minindex + 1, let minindex become the smaller of the current key values of the sets P_a and P_b, and go to step S3.4;

S3.4: if the traversal of the key-value pairs of P_a and P_b has reached the end bit, stop the comparison; otherwise, go to step S3.2.