CN113918974B - Method for quickly matching fingerprints based on documents - Google Patents
Method for quickly matching fingerprints based on documents
- Publication number: CN113918974B
- Application number: CN202111198737.5A
- Authority: CN (China)
- Prior art keywords: secret, fingerprint, tree, point, distance
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a quick matching method based on document fingerprint similarity in the technical field of data protection security. The method locates similar document fingerprints and thereby provides technical support for document security; when the fingerprint threshold is large, a BK tree is used to build the secret point library, which improves the efficiency of secret point fingerprint similarity matching.
Description
Technical Field
The invention relates to the technical field of data protection security, and in particular to a method for quickly matching fingerprints based on documents.
Background
At present, the fingerprint of a secret-related document must undergo one Hamming distance calculation against every secret point fingerprint in the secret point fingerprint library. When the distance is smaller than a set threshold, the two fingerprints are considered similar, i.e. the fingerprint is a secret point fingerprint. When the library is large, however, this brute-force method takes a long time. A method based on the pigeonhole principle combined with an inverted index solves the efficiency problem of fast fingerprint similarity matching when the threshold is small, but when the threshold exceeds 10 (for example, when two simhash values within a distance of 15 are considered similar), that method no longer improves matching efficiency.
When the capacity of the fingerprint library reaches millions or even tens of millions of entries, brute-force matching of similar fingerprints takes several seconds or even longer; and when two secret point fingerprints are considered similar at a larger threshold, the existing index-based method brings no improvement in matching efficiency.
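The brute-force baseline described above can be sketched as follows. This is only an illustration: the function names are hypothetical, fingerprints are assumed to be 64-bit integers, and a match is assumed to mean a distance no greater than the threshold.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def brute_force_match(query: int, library: List[int], threshold: int) -> List[int]:
    """Compare the query against every stored fingerprint (one pass over the library)."""
    return [fp for fp in library if hamming(query, fp) <= threshold]


# Toy library of three fingerprints (assumed values, for illustration only).
library = [0b1011, 0b1000, 0b0111_0000]
matches = brute_force_match(0b1010, library, threshold=1)
print(matches)  # [11, 8]
```

Every query costs one distance calculation per stored fingerprint, which is why the running time grows linearly with the size of the library.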
Based on this, the invention provides a method for quickly matching fingerprints based on documents, which aims to solve the above problems.
Disclosure of Invention
The invention aims to provide a method for quickly matching fingerprints based on documents, so as to solve the problems described in the background.
To achieve the above purpose, the invention provides the following technical solution: a method for quickly matching fingerprints based on documents, comprising the following specific steps:
S1: constructing a secret-related information base: a corresponding simhash value, i.e. the document's secret point fingerprint string, is generated with the simhash algorithm, and the generated secret point fingerprint string is stored in a database;
S2: a threshold for judging whether two fingerprints are similar is set according to the business scenario; when two fingerprints within a Hamming distance of 4 are considered similar, the constructed document fingerprint library is stored in memory using the pigeonhole principle combined with an inverted index;
S3: if the Hamming distance between 2 simhash values is within 4, then at least 1 of the 5 blocks into which each value is cut is identical;
S4: when retrieving other simhash values within a Hamming distance of 4 of a given simhash, the simhash is divided into 5 blocks, each block (a 12-bit or 13-bit string) is converted into the corresponding integer, and the same block is looked up in the corresponding table; the simhash values associated with that block, i.e. secret point fingerprint strings, are stored in a candidate set R; the simhash values that appear in the set of at least 1 block are screened out, their Hamming distances are calculated one by one, and the minimum value identifies the similar secret point fingerprint;
S5: for business scenarios with a threshold within 15, each secret point fingerprint string in the secret-related information base is stored as a node in an in-memory BK tree structure;
S6: when building the tree, a secret point fingerprint string is first chosen at random from the secret-related information base as the root node, and the remaining secret point fingerprint strings are then inserted one by one; for each insertion, the edit distance d between the string and the root is calculated first (the edit distance here is the number of positions at which the characters of the two secret point fingerprint strings differ); if the distance value d occurs for the first time at that BK tree node, a new child node is created, otherwise the insertion proceeds recursively along the corresponding edge; each node in the BK tree may have any number of child nodes, and the value on each edge represents the edit distance between the two nodes it connects;
S7: the search distance threshold is set to n, the edit distance between the secret point fingerprint string to be judged and the nodes in the BK tree is calculated, and every node whose edit distance from the string to be judged is not more than n is added to the result candidate set;
S8: when searching the BK tree, the edit distance d between the secret point fingerprint string to be judged and the root node is calculated first, and every child edge whose value lies in the interval [d-n, d+n] is then searched recursively; if the distance d between the string to be judged and the node being checked is within the threshold n, the node is returned, and the recursive query continues until the BK tree has been exhausted;
S9: the distance between the fingerprint of the document to be judged and the secret point fingerprints in the BK tree is calculated using the Hamming distance, and the initial input threshold is set according to the elements involved and the business scenario;
S10: based on the distance between the secret point fingerprint strings in the BK tree and the fingerprint of the document currently being judged, every secret point fingerprint in the constructed BK tree that lies within the search threshold is retrieved and stored in the candidate set R;
if the candidate set R returned from the BK tree for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;
if the candidate set R returned from the BK tree for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
As a further scheme of the invention, in S3 five tables are created for all simhash values, and each table stores the block at a different position: the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63. Each table is an inverted index in which the 13-bit or 12-bit string is the key and the simhash values containing it are the entries.
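The five-table layout and the lookup of S4 can be sketched as below. This is a minimal illustration rather than the patented implementation: the class and function names are hypothetical, bit 0 is assumed to be the least significant bit, and fingerprints are assumed to be 64-bit integers.

```python
from collections import defaultdict

# (start bit, width) of each block, following the bit ranges given in the text:
# four 13-bit blocks and one 12-bit block.
BLOCKS = [(0, 13), (13, 13), (26, 13), (39, 13), (52, 12)]


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def split_blocks(fp):
    """Cut a 64-bit simhash into 5 integers, one per block."""
    return [(fp >> start) & ((1 << width) - 1) for start, width in BLOCKS]


class BlockIndex:
    def __init__(self):
        # One inverted index (block value -> set of fingerprints) per block position.
        self.tables = [defaultdict(set) for _ in BLOCKS]

    def add(self, fp):
        for table, key in zip(self.tables, split_blocks(fp)):
            table[key].add(fp)

    def query(self, fp, threshold=4):
        # Pigeonhole argument: at most 4 differing bits cannot touch all 5 blocks,
        # so every true match shares at least one whole block with the query.
        candidates = set()
        for table, key in zip(self.tables, split_blocks(fp)):
            candidates |= table.get(key, set())
        # Verify the candidates one by one; results are sorted by distance.
        hits = [(hamming(fp, c), c) for c in candidates]
        return sorted((d, c) for d, c in hits if d <= threshold)
```

The guarantee that no match is missed holds only while the threshold is smaller than the number of blocks, which is why this layout is tied to a threshold of 4.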
As a further scheme of the invention, by properly setting the threshold n in S8 and exploiting the structural characteristics of the BK tree, nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot satisfy the search condition; only 10% of all nodes need to be traversed during a query, so the efficiency is much higher than that of brute-force search.
Compared with the prior art, the invention has the beneficial effects that:
Without improvement, the existing secret point fingerprint search takes about 2 seconds when the amount of secret point fingerprint data reaches 1 million, and tens of seconds when it reaches tens of millions, which gives a poor user experience. With the BK tree improvement, when the capacity of the secret point fingerprint library is in the millions, a typical document classification search takes only about 70 milliseconds.
Drawings
FIG. 1 is a flow chart illustrating the operation of the present invention.
Detailed Description
Referring to FIG. 1, the invention provides a technical solution: a method for quickly matching similar document fingerprints, which locates similar document fingerprints and thereby provides technical support for document security.
The method comprises the following specific steps:
S1: constructing a secret-related information base: a corresponding simhash value, i.e. the document's secret point fingerprint string, is generated with the simhash algorithm, and the generated secret point fingerprint string is stored in a database;
S2: a threshold for judging whether two fingerprints are similar is set according to the business scenario; when two fingerprints within a Hamming distance of 4 are considered similar, the constructed document fingerprint library is stored in memory using the pigeonhole principle combined with an inverted index;
S3: if the Hamming distance between 2 simhash values is within 4, then at least 1 of the 5 blocks into which each value is cut is identical; to improve retrieval efficiency while keeping the space overhead in check, five tables are created for all simhash values, and each table stores the block at a different position: the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63; each table is an inverted index in which the 13-bit or 12-bit string is the key and the simhash values containing it are the entries;
S4: when retrieving other simhash values within a Hamming distance of 4 of a given simhash, the simhash is divided into 5 blocks, each block (a 12-bit or 13-bit string) is converted into the corresponding integer, and the same block is looked up in the corresponding table; the simhash values associated with that block, i.e. secret point fingerprint strings, are stored in a candidate set R; the simhash values that appear in the set of at least 1 block are screened out, their Hamming distances are calculated one by one, and the minimum value identifies the similar secret point fingerprint;
S5: for business scenarios with a threshold within 15, each secret point fingerprint string in the secret-related information base is stored as a node in an in-memory BK tree structure;
S6: when building the tree, a secret point fingerprint string is first chosen at random from the secret-related information base as the root node, and the remaining secret point fingerprint strings are then inserted one by one. For each insertion, the edit distance d between the string and the root is calculated first; the edit distance between two secret point fingerprint strings is the number of positions at which their characters differ. If the distance value d occurs for the first time at that BK tree node, a new child node is created; otherwise the insertion proceeds recursively down the corresponding edge. Each node in the BK tree may have any number of child nodes, and the value on each edge represents the edit distance between the two nodes it connects.
For example, for a scenario with a threshold of 15, the secret point library is built on a BK tree as follows.
Assume the secret point fingerprint n1: 17661816605251706157 is taken as the root node of the BK tree. The fingerprint n2: 17661816605251706156 is inserted first; its distance from n1 is 1, so a child node is created and connected by an edge labeled 1. Next, n3: 17661816605251706145 is inserted; its distance from n1 is 2, so it is placed under an edge labeled 2. Then n4: 17661816605251706173 is inserted; its distance from n1 is 1, so it is inserted recursively along the edge labeled 1 into the subtree containing n2. The distance between n4 and n2 is 2, so n4 is placed under node n2 with an edge labeled 2. The fingerprints are shown here in decimal; the stated distances are the bitwise Hamming distances between the corresponding 64-bit values.
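The insertion walk-through above can be reproduced with a short sketch. The names `BKNode` and `bk_insert` are hypothetical, the fingerprints are treated as 64-bit integers compared bit by bit, and skipping exact duplicates (distance 0) is an assumption that the text does not address.

```python
def hamming(a, b):
    """Bitwise Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


class BKNode:
    def __init__(self, fp):
        self.fp = fp
        self.children = {}  # edge label (distance to this node) -> child BKNode


def bk_insert(root, fp):
    """Walk down from the root, following the edge whose label equals the distance."""
    node = root
    while True:
        d = hamming(fp, node.fp)
        if d == 0:
            return  # assumption: an identical fingerprint is not stored twice
        child = node.children.get(d)
        if child is None:
            node.children[d] = BKNode(fp)  # first time this distance occurs here
            return
        node = child  # distance already taken: recurse along that edge


# The four fingerprints from the example in the text.
n1 = 17661816605251706157
n2 = 17661816605251706156
n3 = 17661816605251706145
n4 = 17661816605251706173

root = BKNode(n1)
for fp in (n2, n3, n4):
    bk_insert(root, fp)

print(sorted(root.children))              # [1, 2]
print(root.children[1].children[2].fp == n4)  # True
```

With these values n2 hangs off the root on edge 1, n3 on edge 2, and n4 ends up below n2 on edge 2, matching the description.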
S7: the search distance threshold is set to n, the edit distance between the secret point fingerprint string to be judged and the nodes in the BK tree is calculated, and every node whose edit distance from the string to be judged is not more than n is added to the result candidate set.
S8: when searching the BK tree, the edit distance d between the secret point fingerprint string to be judged and the root node is calculated first, and every child edge whose value lies in the interval [d-n, d+n] is then searched recursively. If the distance d between the string to be judged and the node being checked is within the threshold n, the node is returned, and the recursive query continues until the BK tree has been exhausted. By properly setting the threshold n and exploiting the structural characteristics of the BK tree, nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot satisfy the search condition; only 10% of all nodes need to be traversed during a query, so the efficiency is much higher than that of brute-force search.
S9: the distance between the fingerprint of the document to be judged and the secret point fingerprints in the BK tree is calculated using the Hamming distance, and the initial input threshold is set according to the elements involved and the business scenario.
S10: based on the distance between the secret point fingerprint strings in the BK tree and the fingerprint of the document currently being judged, every secret point fingerprint in the constructed BK tree that lies within the search threshold is retrieved and stored in the candidate set R.
If the candidate set R returned from the BK tree for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;
if the candidate set R returned from the BK tree for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
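The search and decision steps S7 to S10 can be sketched as follows. This is an illustrative sketch under stated assumptions: `BKTree` and its methods are hypothetical names, fingerprints are integers compared with the bitwise Hamming distance, and a match is taken to mean a distance of at most n.

```python
def hamming(a, b):
    """Bitwise Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


class BKTree:
    def __init__(self):
        self.root = None  # each node is a pair [fingerprint, {distance: child}]

    def add(self, fp):
        if self.root is None:
            self.root = [fp, {}]
            return
        node = self.root
        while True:
            d = hamming(fp, node[0])
            if d == 0:
                return  # assumption: duplicates are ignored
            if d not in node[1]:
                node[1][d] = [fp, {}]
                return
            node = node[1][d]

    def search(self, query, n):
        """Return (candidate set R sorted by distance, number of nodes visited)."""
        results, visited = [], 0
        stack = [self.root] if self.root is not None else []
        while stack:
            fp, children = stack.pop()
            visited += 1
            d = hamming(query, fp)
            if d <= n:
                results.append((d, fp))
            # Triangle inequality: a match below this node can only sit on an
            # edge labeled within [d - n, d + n]; other branches are skipped.
            for edge, child in children.items():
                if d - n <= edge <= d + n:
                    stack.append(child)
        return sorted(results), visited


tree = BKTree()
for fp in (0b0000, 0b0001, 0b0011, 0b1111_0000):
    tree.add(fp)

candidates, visited = tree.search(0b0010, n=1)
has_similar = bool(candidates)  # non-empty R means a similar fingerprint exists
print(candidates, visited, has_similar)  # [(1, 0), (1, 3)] 3 True
```

In this toy tree the branch holding `0b1111_0000` hangs off an edge labeled 4, outside [0, 2], so it is never visited. How much of a real tree is pruned depends on the data and on n.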
In order to improve the file encryption efficiency, traditional approaches such as the pigeonhole principle combined with an inverted index, or a hashmap combined with the pigeonhole principle, improve search efficiency to some extent when the threshold is small, but their search efficiency drops sharply when the threshold is large. When the fingerprint threshold is large, the secret point library is therefore built with a BK tree, which improves the efficiency of secret point fingerprint similarity matching.
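A rough calculation suggests why block-based indexing loses its benefit at larger thresholds. The sketch below is not from the patent: `pigeonhole_cost` is a hypothetical helper, it assumes fingerprints are uniformly distributed 64-bit values, and it uses the narrowest block width for every table, so the figure is only a coarse upper estimate of the candidates to verify.

```python
def pigeonhole_cost(threshold, library_size, bits=64):
    """Estimate the work of a pigeonhole index for a given Hamming threshold.

    Returns (number of blocks, narrowest block width in bits,
             estimated candidates to verify across all tables).
    """
    blocks = threshold + 1            # need one more block than differing bits
    min_width = bits // blocks        # narrowest block when 64 bits are split
    per_table = library_size / (1 << min_width)  # expected bucket size
    return blocks, min_width, per_table * blocks


for k in (4, 15):
    print(k, pigeonhole_cost(k, 1_000_000))
# 4 (5, 12, 1220.703125)
# 15 (16, 4, 1000000.0)
```

Under these assumptions, a threshold of 4 leaves roughly a thousand candidates per query in a library of one million, while a threshold of 15 needs 16 blocks of only 4 bits each, and the estimated candidate count is as large as the library itself.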
Claims (3)
1. A method for quickly matching fingerprints based on documents, characterized by comprising the following specific steps:
S1: constructing a secret-related information base: a corresponding simhash value, i.e. the document's secret point fingerprint string, is generated with the simhash algorithm, and the generated secret point fingerprint string is stored in a database;
S2: a threshold for judging whether two fingerprints are similar is set according to the business scenario; when two fingerprints within a Hamming distance of 4 are considered similar, the constructed document fingerprint library is stored in memory using the pigeonhole principle combined with an inverted index;
S3: if the Hamming distance between 2 simhash values is within 4, then at least 1 of the 5 blocks into which each value is cut is identical;
S4: when retrieving other simhash values within a Hamming distance of 4 of a given simhash, the simhash is divided into 5 blocks, each block (a 12-bit or 13-bit string) is converted into the corresponding integer, and the same block is looked up in the corresponding table; the simhash values associated with that block, i.e. secret point fingerprint strings, are stored in a candidate set R; the simhash values that appear in the set of at least 1 block are screened out, their Hamming distances are calculated one by one, and the minimum value identifies the similar secret point fingerprint;
S5: for business scenarios with a threshold within 15, each secret point fingerprint string in the secret-related information base is stored as a node in an in-memory BK tree structure;
S6: when building the tree, a secret point fingerprint string is first chosen at random from the secret-related information base as the root node, and the remaining secret point fingerprint strings are then inserted one by one; for each insertion, the edit distance d between the string and the root is calculated first (the edit distance here is the number of positions at which the characters of the two secret point fingerprint strings differ); if the distance value d occurs for the first time at that BK tree node, a new child node is created, otherwise the insertion proceeds recursively along the corresponding edge; each node in the BK tree may have any number of child nodes, and the value on each edge represents the edit distance between the two nodes it connects;
S7: the search distance threshold is set to n, the edit distance between the secret point fingerprint string to be judged and the nodes in the BK tree is calculated, and every node whose edit distance from the string to be judged is not more than n is added to the result candidate set;
S8: when searching the BK tree, the edit distance d between the secret point fingerprint string to be judged and the root node is calculated first, and every child edge whose value lies in the interval [d-n, d+n] is then searched recursively; if the distance d between the string to be judged and the node being checked is within the threshold n, the node is returned, and the recursive query continues until the BK tree has been exhausted;
S9: the distance between the fingerprint of the document to be judged and the secret point fingerprints in the BK tree is calculated using the Hamming distance, and the initial input threshold is set according to the elements involved and the business scenario;
S10: based on the distance between the secret point fingerprint strings in the BK tree and the fingerprint of the document currently being judged, every secret point fingerprint in the constructed BK tree that lies within the search threshold is retrieved and stored in the candidate set R;
if the candidate set R returned from the BK tree for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;
if the candidate set R returned from the BK tree for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
2. The method for quickly matching fingerprints based on documents according to claim 1, wherein: in S3, five tables are created for all simhash values, and each table stores the block at a different position: the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63; each table is an inverted index in which the 13-bit or 12-bit string is the key and the simhash values containing it are the entries.
3. The method for quickly matching fingerprints based on documents according to claim 2, wherein: by properly setting the threshold n in S8 and exploiting the structural characteristics of the BK tree, nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot satisfy the search condition; only 10% of all nodes need to be traversed during a query, so the efficiency is much higher than that of brute-force search.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111198737.5A CN113918974B (en) | 2021-10-14 | 2021-10-14 | Method for quickly matching fingerprints based on documents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113918974A | 2022-01-11 |
CN113918974B | 2024-04-12 |
Family
ID=79240623
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202111198737.5A | Active | CN113918974B (en) | 2021-10-14 | 2021-10-14 | Method for quickly matching fingerprints based on documents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113918974B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110647505A (en) * | 2019-08-21 | 2020-01-03 | 杭州电子科技大学 | Computer-assisted secret point marking method based on fingerprint characteristics |
CN111581947A (en) * | 2020-04-29 | 2020-08-25 | 华南理工大学 | Similar text calibration method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2939117C (en) * | 2014-03-04 | 2022-01-18 | Interactive Intelligence Group, Inc. | Optimization of audio fingerprint search |
Non-Patent Citations (1)
Title |
---|
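Fast search optimization method for massive similar documents based on Simhash (基于Simhash的海量相似文档快速搜索优化方法); Zhang Guangqing (张广庆), Ge Weiyi (葛唯益), He Chenglong (贺成龙); Command Information System and Technology (指挥信息系统与技术); 2015-04-28 (02); full text * |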
Also Published As
Publication number | Publication date |
---|---|
CN113918974A (en) | 2022-01-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||