CN113918974B

CN113918974B - Method for quickly matching fingerprints based on documents

Info

Publication number: CN113918974B
Application number: CN202111198737.5A
Authority: CN
Inventors: 崔新安; 苗功勋; 侯洪涛; 李言非; 唐孝军; 赵鑫; 高伟; 袁浩
Original assignee: Nanjing Zhongfu Information Technology Co Ltd
Current assignee: Nanjing Zhongfu Information Technology Co Ltd
Priority date: 2021-10-14
Filing date: 2021-10-14
Publication date: 2024-04-12
Anticipated expiration: 2041-10-14
Also published as: CN113918974A

Abstract

The invention discloses a method for quickly matching documents based on fingerprint similarity, in the technical field of data protection security. The method locates similar document fingerprints and thereby provides technical support for document security; when the fingerprint threshold is large, a BK tree is used to construct the secret point library, which improves the efficiency of secret point fingerprint similarity matching.

Description

Method for quickly matching fingerprints based on documents

Technical Field

The invention relates to the technical field of data protection security, and in particular to a method for quickly matching fingerprints based on documents.

Background

At present, a secret document fingerprint must undergo one Hamming distance calculation against every secret point fingerprint in the secret point fingerprint library; when the distance is smaller than a set threshold, the two fingerprints are considered similar, that is, the fingerprint is a secret point fingerprint. When the capacity of the secret point fingerprint library is large, however, this brute-force approach takes a long time. A method based on the pigeonhole principle (drawer principle) combined with an inverted index can solve the efficiency problem of quick fingerprint similarity matching when the threshold is small, but when the threshold exceeds 10, for example when two simhash values are considered similar within a distance of 15, that method no longer improves fingerprint similarity matching efficiency.

When the capacity of the fingerprint library reaches millions or even tens of millions of entries, brute-force matching of similar fingerprints takes several seconds or longer; and when two secret point fingerprints are considered similar at a larger threshold, the existing methods yield no improvement in matching efficiency.
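The brute-force baseline described above can be sketched as follows. This is only an illustration, assuming fingerprints are held as 64-bit Python integers; the names `hamming` and `brute_force_match` are hypothetical, and the comparison uses "not more than the threshold", which is one reading of the text.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def brute_force_match(query: int, library: List[int], threshold: int) -> List[int]:
    """Compare the query against every stored fingerprint, one by one.

    The cost grows linearly with the size of the library, which is the
    efficiency problem the text describes.
    """
    return [fp for fp in library if hamming(query, fp) <= threshold]


# Tiny illustrative library (8-bit values for readability).
library = [0b10110000, 0b10110011, 0b01001111]
print(brute_force_match(0b10110001, library, threshold=2))
```

Every query touches every entry, so a library of millions of fingerprints means millions of distance calculations per lookup.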

On this basis, the invention provides a method for quickly matching fingerprints based on documents, which aims to solve the above problems.

Disclosure of Invention

The invention aims to provide a method for quickly matching fingerprints based on documents, so as to solve the problems raised in the Background.

To achieve the above purpose, the invention provides the following technical solution: a method for quickly matching fingerprints based on documents, comprising the following specific steps:

S1: constructing a secret-related information base: generating, by means of the simhash algorithm, the corresponding simhash value, namely the document secret point fingerprint string, and storing the generated secret point fingerprint string in a database;

S2: setting a threshold, according to the business scenario, for judging whether two fingerprints are similar; when two fingerprints are considered similar if their Hamming distance is within 4, storing the constructed document fingerprint library in memory using the pigeonhole principle (drawer principle) combined with an inverted index;

S3: if the Hamming distance between two simhash values is within 4, then at least one of the 5 blocks into which they are cut is identical;

S4: when retrieving, for a given simhash value, the other simhash values whose Hamming distance from it is within 4: dividing the simhash value into 5 blocks; converting each block, a 12-bit or 13-bit string, into the corresponding integer; looking up the same block in the corresponding table; taking the simhash values stored under that block, namely secret point fingerprint strings, and storing them in a candidate set R; screening out the simhash values that appear in the set of at least one block; and then calculating the Hamming distances one by one and taking the minimum, which gives the similar secret point fingerprint;

S5: for business scenarios with a threshold within 15, storing each secret point fingerprint string in the secret-related information base as a node in an in-memory BK tree structure;

S6: during tree construction, first randomly selecting a secret point fingerprint string from the secret-related information base as the root node, then inserting each remaining secret point fingerprint string: first calculating the edit distance d between the string and the root (here the edit distance between two secret point fingerprint strings is the number of positions at which their characters differ); if the distance value d occurs for the first time at that BK tree node, creating a new child node, otherwise recursing down the corresponding edge; each node in the BK tree may have any number of child nodes, and the value on each edge represents the edit distance between the two nodes it joins;

S7: setting the search distance threshold to n, calculating the edit distance between the secret-related point string to be judged and the nodes in the BK tree, and adding every node whose edit distance from that string is not more than n to the result candidate set;

S8: when searching the BK tree, first calculating the edit distance d between the secret-related point string to be judged and the root node, then recursively searching along every child edge whose value lies in the interval [d-n, d+n]; if the distance d between the string to be judged and the node being checked is within the threshold n, returning that node and continuing the recursive query until the BK tree is exhausted;

S9: calculating, with the Hamming distance, the distance between the fingerprint of the document to be determined and the secret point fingerprints in the BK tree, and setting the initial input threshold according to the different elements and business scenarios;

S10: according to the distance between the secret-related point fingerprint strings in the BK tree and the fingerprint of the document currently undergoing secrecy determination, retrieving from the constructed BK tree every secret point fingerprint within the search threshold and storing it in the candidate set R;

if the candidate set R returned from the BK tree structure for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;

if the candidate set R returned from the BK tree structure for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
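Step S1 relies on the simhash algorithm without spelling it out. The following is a minimal sketch of the usual simhash construction, not the patent's own implementation: the whitespace tokenization, the equal token weights, and the use of the first 8 bytes of MD5 as a 64-bit token hash are all assumptions made for illustration, as are the names `token_hash64` and `simhash`.

```python
import hashlib
from typing import Iterable


def token_hash64(token: str) -> int:
    """Hypothetical 64-bit token hash: the first 8 bytes of the MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens: Iterable[str]) -> int:
    """Each token votes +1 or -1 on every bit; the sign of each tally sets the bit."""
    tally = [0] * 64
    for token in tokens:
        h = token_hash64(token)
        for bit in range(64):
            tally[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if tally[bit] > 0)


# Two near-duplicate token lists (assumed whitespace tokenization).
doc_a = "the quarterly budget report is confidential".split()
doc_b = "the quarterly budget report is strictly confidential".split()
fp_a, fp_b = simhash(doc_a), simhash(doc_b)
print(f"{fp_a:064b}")               # fingerprint as a 64-character bit string
print(bin(fp_a ^ fp_b).count("1"))  # Hamming distance between the two fingerprints
```

Because similar inputs share most of their tokens, their bit tallies mostly agree, which is why near-duplicate documents tend to produce fingerprints that are close in Hamming distance.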

As a further scheme of the invention, in S3, 5 tables are created for all simhash values, and different tables store the blocks at different positions: for example, the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63; an inverted index is used within each table, with the 13-bit or 12-bit string as the key under which the simhash values are indexed.
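The five-table layout and the block lookup of S4 can be sketched as below. This is an illustration under stated assumptions: bit 0 is taken to be the least significant bit (the text does not say which end is bit 0), fingerprints are 64-bit integers, and the names `BLOCKS`, `split_blocks`, `build_tables`, and `lookup` are hypothetical. The sketch returns every verified candidate rather than only the minimum.

```python
from collections import defaultdict

# (start_bit, width) of the five blocks: 13 + 13 + 13 + 13 + 12 = 64 bits.
BLOCKS = [(0, 13), (13, 13), (26, 13), (39, 13), (52, 12)]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def split_blocks(fp: int) -> list:
    """Cut a 64-bit fingerprint into five integers, one per block."""
    return [(fp >> start) & ((1 << width) - 1) for start, width in BLOCKS]


def build_tables(library):
    """One inverted index per block position: block value -> set of fingerprints."""
    tables = [defaultdict(set) for _ in BLOCKS]
    for fp in library:
        for table, block in zip(tables, split_blocks(fp)):
            table[block].add(fp)
    return tables


def lookup(query, tables, threshold=4):
    """Collect fingerprints sharing at least one block, then verify by Hamming distance."""
    candidates = set()
    for table, block in zip(tables, split_blocks(query)):
        candidates |= table.get(block, set())
    return sorted(fp for fp in candidates if hamming(query, fp) <= threshold)


base = 17661816605251706157
library = [base, base ^ 0b1111, base ^ ((1 << 64) - 1)]  # last entry: all bits flipped
tables = build_tables(library)
print(lookup(base ^ 1, tables))
```

Only fingerprints that share a block with the query are ever compared in full, so the all-bits-flipped entry is never examined.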

As a further scheme of the invention, with the threshold n in S8 set appropriately, the structure of the BK tree ensures that nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot meet the search condition; a query therefore needs to traverse only 10% of all nodes, which is far more efficient than a brute-force search.
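The pruning interval follows from the triangle inequality, which the Hamming distance satisfies. As a brief derivation, let q be the fingerprint being queried, p the node currently visited, and x any fingerprint stored in the subtree under the edge labelled k, so that the distance from p to x is k (the symbols q, p, x, and k are introduced here for illustration only):

```latex
\begin{aligned}
d &= \operatorname{dist}(q, p), \qquad k = \operatorname{dist}(p, x), \\
|d - k| &\le \operatorname{dist}(q, x) \le d + k, \\
\operatorname{dist}(q, x) \le n &\;\Longrightarrow\; |d - k| \le n
  \;\Longleftrightarrow\; d - n \le k \le d + n .
\end{aligned}
```

A subtree whose edge label lies outside [d-n, d+n] therefore cannot contain a match and can be skipped whole.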

Compared with the prior art, the invention has the beneficial effects that:

Before the improvement, secret point fingerprint search takes about 2 seconds when the number of secret point fingerprints reaches 1 million, and tens of seconds when it reaches the tens of millions, which makes for a very unsatisfactory user experience; after the BK tree improvement is adopted, a normal document secrecy-determination search takes only about 70 milliseconds when the capacity of the secret point fingerprint library is in the millions.

Drawings

FIG. 1 is a flow chart illustrating the operation of the present invention.

Detailed Description

Referring to FIG. 1, the invention provides the following technical solution: a method for quickly matching similar document fingerprints, which locates the similar fingerprints of a document and thereby provides technical support for document security;

the method comprises the following specific steps:

S2: setting a threshold, according to the business scenario, for judging whether two fingerprints are similar; when two fingerprints are considered similar if their Hamming distance is within 4, storing the constructed document fingerprint library in memory using the pigeonhole principle (drawer principle) combined with an inverted index;

S3: if the Hamming distance between two simhash values is within 4, then at least one of the 5 blocks into which they are cut is identical; to improve retrieval efficiency while keeping the space overhead in check, 5 tables are created for all simhash values, and different tables store the blocks at different positions: for example, the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63; an inverted index is also used within each table, with the 13-bit or 12-bit string as the key under which the simhash values are indexed;
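The claim in S3 that at least one block must be identical is a pigeonhole argument. Written out, with x and y two 64-bit simhash values and x_i, y_i their i-th blocks (notation introduced here for illustration):

```latex
\begin{aligned}
d_H(x, y) &= \sum_{i=1}^{5} d_H(x_i, y_i), \qquad 13 + 13 + 13 + 13 + 12 = 64, \\
x_i \ne y_i \ \text{for all } i &\;\Longrightarrow\; d_H(x, y) \ge 5, \\
d_H(x, y) \le 4 &\;\Longrightarrow\; x_i = y_i \ \text{for at least one } i .
\end{aligned}
```

The same argument shows why the approach weakens at larger thresholds: guaranteeing one identical block for a threshold of 15 would require at least 16 blocks of about 4 bits each, and such short keys match a large share of the library.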

S6: during tree construction, first randomly selecting a secret point fingerprint string from the secret-related information base as the root node, then inserting each remaining secret point fingerprint string: first calculating the edit distance d between the string and the root, where the edit distance between two secret point fingerprint strings is the number of positions at which their characters differ; if the distance value d occurs for the first time at that BK tree node, creating a new child node; otherwise recursing down the corresponding edge. Each node in the BK tree may have any number of child nodes, and the value on each edge represents the edit distance between the two nodes it joins.

For example, for a scenario with a threshold of 15, a secret-related point library is built on a BK tree as follows:

Assume the secret point fingerprint n1: 17661816605251706157 is taken as the root node of the BK tree. Insert n2: 17661816605251706156; its distance from node n1 is 1, so a child node is created and connected by an edge numbered 1. Next insert n3: 17661816605251706145; its distance from n1 is calculated to be 2, so it is placed under an edge numbered 2. Then insert n4: 17661816605251706173; its distance from n1 is 1, so it is inserted recursively along the edge numbered 1 into the subtree where n2 is located; the distance between n4 and n2 is 2, so n4 is placed under node n2 on an edge numbered 2. The distances here are counted bit by bit on the 64-bit binary form of each value.
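The worked example can be reproduced with a short sketch. It assumes the distance is the bitwise Hamming distance over the 64-bit values, which is the reading that matches the distances quoted in the example; the class name `BKNode` and the choice to ignore exact duplicates are assumptions made for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Bitwise Hamming distance between two 64-bit simhash values."""
    return bin(a ^ b).count("1")


class BKNode:
    def __init__(self, value: int):
        self.value = value
        self.children = {}  # edge number (distance) -> child BKNode

    def insert(self, value: int) -> None:
        node = self
        while True:
            d = hamming(value, node.value)
            if d == 0:
                return  # assumption: an exact duplicate is not stored twice
            child = node.children.get(d)
            if child is None:
                # First time this distance occurs at the node: create a new child.
                node.children[d] = BKNode(value)
                return
            node = child  # otherwise descend along the edge numbered d


n1 = 17661816605251706157
n2 = 17661816605251706156
n3 = 17661816605251706145
n4 = 17661816605251706173

root = BKNode(n1)
for fp in (n2, n3, n4):
    root.insert(fp)

print(hamming(n1, n2), hamming(n1, n3), hamming(n1, n4), hamming(n2, n4))
print(sorted(root.children))              # edge numbers leaving the root
print(sorted(root.children[1].children))  # edge numbers leaving n2
```

After the three insertions, n2 hangs off the root on edge 1, n3 on edge 2, and n4 sits under n2 on edge 2, as in the example.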

S7: setting the search distance threshold to n, calculating the edit distance between the secret-related point string to be judged and the nodes in the BK tree, and adding every node whose edit distance from that string is not more than n to the result candidate set.

S8: when searching the BK tree, first calculating the edit distance d between the secret-related point string to be judged and the root node, then recursively searching along every child edge whose value lies in the interval [d-n, d+n]. If the distance d between the string to be judged and the node being checked is within the threshold n, that node is returned and the recursive query continues until the BK tree is exhausted. With the threshold n set appropriately, the structure of the BK tree ensures that nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot meet the search condition; a query therefore needs to traverse only 10% of all nodes, which is far more efficient than a brute-force search.
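The search of S7 and S8 can be sketched as follows, under the same assumptions as before: fingerprints are 64-bit integers compared by bitwise Hamming distance, and a match is a node at distance not more than n. The dictionary-based node layout and the names `bk_build` and `bk_search` are hypothetical, and the random library is synthetic test data, not a measurement of the patent's performance figures.

```python
import random


def hamming(a: int, b: int) -> int:
    """Bitwise Hamming distance between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")


def bk_build(fingerprints):
    """Build a BK tree; each node is {'value': int, 'children': {distance: node}}."""
    it = iter(fingerprints)
    root = {"value": next(it), "children": {}}
    for fp in it:
        node = root
        while True:
            d = hamming(fp, node["value"])
            if d == 0:
                break  # assumption: duplicates are ignored
            nxt = node["children"].get(d)
            if nxt is None:
                node["children"][d] = {"value": fp, "children": {}}
                break
            node = nxt
    return root


def bk_search(root, query, n):
    """Return (matches, visited): fingerprints within distance n, and nodes examined."""
    matches, visited, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        visited += 1
        d = hamming(query, node["value"])
        if d <= n:
            matches.append(node["value"])
        # Only edges numbered within [d-n, d+n] can lead to a match.
        for edge, child in node["children"].items():
            if d - n <= edge <= d + n:
                stack.append(child)
    return matches, visited


rng = random.Random(0)  # fixed seed so the synthetic library is reproducible
library = [rng.getrandbits(64) for _ in range(2000)]
root = bk_build(library)
query = library[42] ^ 0b111  # a stored fingerprint with three bits flipped
matches, visited = bk_search(root, query, n=15)
print(len(matches), visited, len(library))
```

The fraction of nodes visited depends on how the stored fingerprints are distributed and on n, so the sketch reports the count rather than assuming a particular ratio.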

S9: calculating, with the Hamming distance, the distance between the fingerprint of the document to be determined and the secret point fingerprints in the BK tree, and setting the initial input threshold according to the different elements and business scenarios.

S10: according to the distance between the secret-related point fingerprint strings in the BK tree and the fingerprint of the document currently undergoing secrecy determination, retrieving from the constructed BK tree every secret point fingerprint within the threshold and storing it in the candidate set R.

To improve file encryption efficiency, traditional approaches such as the pigeonhole principle (drawer principle) combined with an inverted index, or a hashmap combined with the pigeonhole principle, improve search efficiency to some extent when the threshold is small, but their search efficiency drops sharply when the threshold is large. When the fingerprint threshold is large, the secret point library is therefore constructed with a BK tree, which improves the efficiency of secret point fingerprint similarity matching.

Claims

1. A method for quickly matching fingerprints based on documents, characterized in that the method comprises the following specific steps:

2. The document fingerprint similarity-based rapid matching method according to claim 1, wherein: in S3, 5 tables are created for all simhash values, and different tables store the blocks at different positions: for example, the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63; an inverted index is used within each table, with the 13-bit or 12-bit string as the key under which the simhash values are indexed.

3. The document fingerprint similarity-based rapid matching method according to claim 2, wherein: with the threshold n in S8 set appropriately, the structure of the BK tree ensures that nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot meet the search condition; a query therefore needs to traverse only 10% of all nodes, which is far more efficient than a brute-force search.