CN113918974B - Method for quickly matching fingerprints based on documents - Google Patents

Method for quickly matching fingerprints based on documents

Info

Publication number
CN113918974B
Authority
CN
China
Prior art keywords
secret
fingerprint
tree
point
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111198737.5A
Other languages
Chinese (zh)
Other versions
CN113918974A (en)
Inventor
崔新安
苗功勋
侯洪涛
李言非
唐孝军
赵鑫
高伟
袁浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongfu Information Technology Co Ltd
Original Assignee
Nanjing Zhongfu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongfu Information Technology Co Ltd filed Critical Nanjing Zhongfu Information Technology Co Ltd
Priority to CN202111198737.5A priority Critical patent/CN113918974B/en
Publication of CN113918974A publication Critical patent/CN113918974A/en
Application granted granted Critical
Publication of CN113918974B publication Critical patent/CN113918974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/602: Providing cryptographic facilities or services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Collating Specific Patterns (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for quickly matching similar document fingerprints, in the technical field of data protection and security. The method locates fingerprints similar to that of a given document, thereby providing technical support for document security. When the fingerprint similarity threshold is large, a BK tree is used to build the secret point library, which improves the efficiency of secret point fingerprint similarity matching.

Description

Method for quickly matching fingerprints based on documents
Technical Field
The invention relates to the technical field of data protection and security, and in particular to a method for quickly matching fingerprints based on documents.
Background
At present, the fingerprint of a secret-related document must be compared, by Hamming distance, with every secret point fingerprint in the secret point fingerprint library. When the distance is smaller than a set threshold, the two fingerprints are considered similar, that is, the document fingerprint is treated as a secret point fingerprint. When the library is large, this brute-force approach takes a long time. A method that combines the pigeonhole (drawer) principle with an inverted index solves the matching-efficiency problem when the threshold is small. When the threshold exceeds 10, however, for example when two simhash values are considered similar within a distance of 15, that method no longer improves fingerprint similarity matching efficiency.
When the fingerprint library holds millions or even tens of millions of entries, brute-force matching of similar fingerprints takes several seconds or longer; and when two secret point fingerprints are considered similar at a larger threshold, the indexed method brings no improvement in matching efficiency.
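The brute-force baseline described above can be sketched as follows. This is a minimal illustration, not the patent's code: the function names are hypothetical, fingerprints are assumed to be 64-bit integers, and "within the threshold" is treated as inclusive.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def brute_force_match(query: int, library: list[int], threshold: int) -> list[int]:
    """Compare the query with every stored fingerprint: one distance per entry."""
    return [fp for fp in library if hamming(query, fp) <= threshold]


library = [0b1011, 0b1000, 0b0100, 0b111111]
print(brute_force_match(0b1010, library, threshold=1))  # the first two entries
```

Every query costs one distance computation per library entry, which is why the running time grows linearly with the size of the library.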
On this basis, the invention provides a method for quickly matching document fingerprints that aims to solve the above problems.
Disclosure of Invention
The invention aims to provide a method for quickly matching fingerprints based on documents, so as to solve the problems described in the Background.
To achieve the above purpose, the invention provides the following technical solution: a method for quickly matching fingerprints based on documents, comprising the following steps:
S1: Construct a secret-related information base: generate the simhash value of each document with the simhash algorithm, that is, the document's secret point fingerprint string, and store the generated secret point fingerprint strings in a database;
S2: Set the threshold for judging whether two fingerprints are similar according to the business scenario; when two fingerprints are considered similar if their Hamming distance is within 4, store the constructed document fingerprint library in memory using the pigeonhole (drawer) principle combined with an inverted index;
S3: If the Hamming distance between two simhash values is within 4, then when each value is cut into 5 blocks, at least 1 of the 5 blocks is identical in both;
S4: To retrieve the simhash values whose Hamming distance from a given simhash is within 4, divide the given simhash into 5 blocks, convert each block (a 12-bit or 13-bit string) into the corresponding integer, and look up the same block value in the corresponding table; take the simhash values stored under the matching blocks, that is, secret point fingerprint strings, and store them in a candidate set R, so that R holds the simhash values that appear in the set of at least 1 block; then compute the Hamming distance to each candidate one by one and take the minimum, which identifies the similar secret point fingerprint;
S5: For business scenarios with a threshold within 15, store each secret point fingerprint string in the secret-related information base as a node in an in-memory BK tree structure;
S6: To build the tree, first pick a random secret point fingerprint string from the secret-related information base as the root node, then insert the remaining secret point fingerprint strings one at a time. For each insertion, first compute the edit distance d between the string and the root (the edit distance here is the number of positions at which the two secret point fingerprint strings have different characters); if no edge with value d exists yet at the current BK tree node, create a new child node on a new edge d, otherwise descend recursively along the existing edge d; each node in the BK tree can have any number of child nodes, and the value on each edge is the edit distance between the two nodes it connects;
S7: Set the search distance threshold to n, compute the edit distance between the secret-related point string to be judged and the nodes in the BK tree, and add every node whose edit distance from that string is not more than n to the result candidate set;
S8: To search the BK tree, first compute the edit distance d between the secret-related point string to be judged and the root node, then recursively search every child edge whose value lies in the interval [d-n, d+n]; whenever the distance d between the string to be judged and the node being checked does not exceed the threshold n, return that node and continue the recursive query until the BK tree has been exhausted;
S9: Use the Hamming distance to compute the distance between the fingerprint of the document to be classified and the secret point fingerprints in the BK tree, and set the initial input threshold according to the different elements and business scenarios;
S10: Using the distance between the secret-related point fingerprint strings in the BK tree and the fingerprint of the current document to be classified, search the constructed BK tree for every secret point fingerprint within the threshold, and store these fingerprints in the candidate set R;
if the candidate set R returned from the BK tree structure for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;
if the candidate set R returned from the BK tree structure for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
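Step S1 relies on the simhash algorithm without spelling it out. The following is a minimal sketch of the usual simhash construction, under stated assumptions: the source does not say how a document is tokenized, how tokens are weighted, or which hash function is used, so whitespace-free tokens supplied by the caller, a weight of 1 per token, and a truncated MD5 digest are all illustrative choices.

```python
import hashlib


def simhash(tokens: list[str], bits: int = 64) -> int:
    """Fold per-token hashes into one fingerprint by a per-bit majority vote."""
    votes = [0] * bits
    for token in tokens:
        # Assumption: MD5 truncated to `bits` bits stands in for the token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            votes[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when more tokens voted 1 than 0.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


fingerprint = simhash(["secret", "point", "fingerprint", "library"])
```

Because each bit is decided by a vote over all tokens, documents that share most of their tokens tend to produce fingerprints that differ in few bits, which is what makes a Hamming-distance threshold meaningful.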
As a further scheme of the invention, 5 tables are created for all simhash values in S3, with each table storing the blocks at one position: the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63. An inverted index is used within each table, so that simhash values are indexed by 13-bit or 12-bit strings.
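The five-table layout and the lookup of S4 can be sketched as below. The bit ranges come from the text; everything else is an assumption made for illustration: bit 0 is taken to be the least significant bit, the tables are plain in-memory dictionaries, and the names `block_keys`, `build_tables`, and `query` are hypothetical.

```python
from collections import defaultdict

# (start_bit, width) of the five blocks: 13 + 13 + 13 + 13 + 12 = 64 bits.
BLOCKS = [(0, 13), (13, 13), (26, 13), (39, 13), (52, 12)]


def block_keys(fp: int) -> list[int]:
    """Cut a 64-bit fingerprint into five integers, one per block."""
    # Assumption: bit 0 is the least significant bit.
    return [(fp >> start) & ((1 << width) - 1) for start, width in BLOCKS]


def build_tables(library: list[int]) -> list[dict[int, set[int]]]:
    """One inverted index per block position: block value -> fingerprints."""
    tables = [defaultdict(set) for _ in BLOCKS]
    for fp in library:
        for table, key in zip(tables, block_keys(fp)):
            table[key].add(fp)
    return tables


def query(tables, fp: int, threshold: int = 4) -> list[int]:
    """Collect fingerprints sharing at least one block, then verify each one."""
    candidates = set()
    for table, key in zip(tables, block_keys(fp)):
        candidates |= table.get(key, set())
    hits = [c for c in candidates if bin(c ^ fp).count("1") <= threshold]
    return sorted(hits, key=lambda c: bin(c ^ fp).count("1"))  # closest first


base = 0x1234_5678_9ABC_DEF0
near = base ^ 0b1111           # 4 bits flipped, all inside the first block
far = base ^ ((1 << 64) - 1)   # all 64 bits flipped
tables = build_tables([near, far])
print(query(tables, base) == [near])  # True: only `near` survives
```

Only fingerprints that share a block with the query are ever compared bit by bit, so the cost of a lookup depends on bucket sizes rather than on the size of the whole library.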
As a further scheme of the invention, when the threshold n in S8 is set appropriately, the structure of the BK tree guarantees that child nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot satisfy the search condition; the query therefore needs to traverse only 10% of all nodes, which is far more efficient than a brute-force search.
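The pruning rule follows from the triangle inequality, which the Hamming distance satisfies. As a short derivation (the symbols q, u, v, and k are introduced here for illustration): let q be the query, u the node being checked with d = d(q,u), and v any node in the subtree under the edge of u numbered k, so that d(u,v) = k by construction.

```latex
\begin{align*}
d(q,v) &\ge \lvert d(q,u) - d(u,v) \rvert = \lvert d - k \rvert, \\
d(q,v) \le n &\implies \lvert d - k \rvert \le n \iff d - n \le k \le d + n .
\end{align*}
```

A subtree whose edge number k lies outside [d-n, d+n] therefore cannot contain a match and can be skipped entirely. The derivation explains why pruning is safe; the fraction of nodes actually visited depends on the data and on n.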
Compared with the prior art, the invention has the beneficial effects that:
Without this improvement, a secret point fingerprint search takes about 2 seconds when the data volume reaches 1 million entries and tens of seconds when it reaches the tens of millions, which makes for a poor user experience; with the BK tree improvement, when the secret point fingerprint library holds millions of entries, a classification search for an ordinary file takes only about 70 milliseconds.
Drawings
FIG. 1 is a flow chart illustrating the operation of the present invention.
Detailed Description
Referring to FIG. 1, the invention provides a technical solution: a method for quickly matching similar document fingerprints, which locates fingerprints similar to that of a document and thereby provides technical support for document security;
the method comprises the following specific steps:
S1: Construct a secret-related information base: generate the simhash value of each document with the simhash algorithm, that is, the document's secret point fingerprint string, and store the generated secret point fingerprint strings in a database;
S2: Set the threshold for judging whether two fingerprints are similar according to the business scenario; when two fingerprints are considered similar if their Hamming distance is within 4, store the constructed document fingerprint library in memory using the pigeonhole (drawer) principle combined with an inverted index;
S3: If the Hamming distance between two simhash values is within 4, then when each value is cut into 5 blocks, at least 1 of the 5 blocks is identical in both; to improve retrieval efficiency while limiting the space overhead, 5 tables are created for all simhash values, with each table storing the blocks at one position: the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63. An inverted index is also used within each table, so that simhash values are indexed by 13-bit or 12-bit strings;
S4: To retrieve the simhash values whose Hamming distance from a given simhash is within 4, divide the given simhash into 5 blocks, convert each block (a 12-bit or 13-bit string) into the corresponding integer, and look up the same block value in the corresponding table; take the simhash values stored under the matching blocks, that is, secret point fingerprint strings, and store them in a candidate set R, so that R holds the simhash values that appear in the set of at least 1 block; then compute the Hamming distance to each candidate one by one and take the minimum, which identifies the similar secret point fingerprint;
S5: For business scenarios with a threshold within 15, store each secret point fingerprint string in the secret-related information base as a node in an in-memory BK tree structure;
S6: To build the tree, first pick a random secret point fingerprint string from the secret-related information base as the root node, then insert the remaining secret point fingerprint strings one at a time. For each insertion, first compute the edit distance d between the string and the root; the edit distance between two secret point fingerprint strings is the number of positions at which their characters differ. If no edge with value d exists yet at the current BK tree node, create a new child node on a new edge d; otherwise descend recursively along the existing edge d. Each node in the BK tree can have any number of child nodes, and the value on each edge is the edit distance between the two nodes it connects.
For example, for a scenario with a threshold of 15, the secret-related point library is built on a BK tree as follows:
Suppose the secret point fingerprint n1: 17661816605251706157 is taken as the root node of the BK tree. Key information n2: 17661816605251706156 is inserted first; its distance from node n1: 17661816605251706157 is 1, so a child node is created and connected by an edge numbered 1. Next, n3: 17661816605251706145 is inserted; its distance from n1: 17661816605251706157 is 2, so it is placed under an edge numbered 2. Then n4: 17661816605251706173 is inserted; its distance from n1: 17661816605251706157 is 1, so it is passed recursively along the edge numbered 1 into the subtree rooted at n2: 17661816605251706156; the distance between n4: 17661816605251706173 and n2: 17661816605251706156 is 2, so n4 is placed under node n2: 17661816605251706156 on an edge numbered 2.
S7: Set the search distance threshold to n, compute the edit distance between the secret-related point string to be judged and the nodes in the BK tree, and add every node whose edit distance from that string is not more than n to the result candidate set.
S8: To search the BK tree, first compute the edit distance d between the secret-related point string to be judged and the root node, then recursively search every child edge whose value lies in the interval [d-n, d+n]. Whenever the distance d between the string to be judged and the node being checked does not exceed the threshold n, return that node and continue the recursive query until the BK tree has been exhausted. When the threshold n is set appropriately, the structure of the BK tree guarantees that child nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot satisfy the search condition; the query therefore needs to traverse only 10% of all nodes, which is far more efficient than a brute-force search.
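The search of S7 and S8 can be sketched as follows, with a counter that shows how many nodes the pruning rule skips. This is an illustration under assumptions: nodes are two-element lists of the form [fingerprint, {edge: child}], the function names are hypothetical, a match is a distance of at most n (the "not more than n" wording of S7), and the toy data is far too small to say anything about the 10% figure.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def bk_insert(root: list, value: int) -> None:
    """Node layout (illustrative): [fingerprint, {edge_distance: child}]."""
    node = root
    while True:
        d = hamming(value, node[0])
        if d == 0:
            return  # assumption: duplicates are ignored
        if d not in node[1]:
            node[1][d] = [value, {}]
            return
        node = node[1][d]


def bk_search(root: list, query: int, n: int) -> tuple[list[int], int]:
    """Return (fingerprints within distance n, number of nodes visited)."""
    results, visited, stack = [], 0, [root]
    while stack:
        value, children = stack.pop()
        visited += 1
        d = hamming(query, value)
        if d <= n:
            results.append(value)
        # Only edges numbered in [d - n, d + n] can lead to a match.
        for k, child in children.items():
            if d - n <= k <= d + n:
                stack.append(child)
    return results, visited


root = [0b0000_0000, {}]
for fp in (0b0000_0001, 0b0000_0011, 0b1111_0000, 0b1111_1111, 0b1110_0000):
    bk_insert(root, fp)

matches, visited = bk_search(root, 0b0000_0010, n=1)
print(sorted(matches), visited)  # two matches found after visiting 3 of 6 nodes
```

An explicit stack is used instead of recursion so that deep trees built from large libraries do not hit Python's recursion limit; the traversal order does not affect which nodes are returned.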
S9: Use the Hamming distance to compute the distance between the fingerprint of the document to be classified and the secret point fingerprints in the BK tree, and set the initial input threshold according to the different elements and business scenarios.
S10: Using the distance between the secret-related point fingerprint strings in the BK tree and the fingerprint of the current document to be classified, search the constructed BK tree for every secret point fingerprint within the threshold, and store these fingerprints in the candidate set R.
If the candidate set R returned from the BK tree structure for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;
if the candidate set R returned from the BK tree structure for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
To improve the efficiency of document secrecy classification, traditional approaches such as the pigeonhole (drawer) principle combined with an inverted index, or a hashmap combined with the pigeonhole principle, improve search efficiency to some extent when the threshold is small, but their search efficiency drops sharply when the threshold is large. When the fingerprint threshold is large, the secret point library is therefore built on a BK tree, which improves the efficiency of secret point fingerprint similarity matching.
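A rough calculation suggests why the block-based index weakens as the threshold grows. The sketch below rests on two assumptions not stated in the source: the five-block split for a threshold of 4 is generalized to threshold + 1 blocks, and block values are spread uniformly over the library. The function name `block_stats` is hypothetical.

```python
def block_stats(threshold: int, bits: int = 64) -> tuple[int, int, float]:
    """Blocks needed, narrowest block width in bits, library share per bucket."""
    blocks = threshold + 1       # pigeonhole: at least one block must match exactly
    width = bits // blocks       # the narrowest block
    share = 1 / (1 << width)     # fraction of the library per bucket, if keys are uniform
    return blocks, width, share


for t in (4, 15):
    print(t, block_stats(t))
```

Under these assumptions a threshold of 4 gives blocks of at least 12 bits, so a bucket holds about 1/4096 of the library. A threshold of 15 needs 16 blocks of 4 bits, so each of the 16 lookups returns about 1/16 of the library and the candidate set approaches the whole library, leaving little advantage over a brute-force scan.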

Claims (3)

1. A method for quickly matching fingerprints based on documents, characterized by comprising the following steps:
S1: Construct a secret-related information base: generate the simhash value of each document with the simhash algorithm, that is, the document's secret point fingerprint string, and store the generated secret point fingerprint strings in a database;
S2: Set the threshold for judging whether two fingerprints are similar according to the business scenario; when two fingerprints are considered similar if their Hamming distance is within 4, store the constructed document fingerprint library in memory using the pigeonhole (drawer) principle combined with an inverted index;
S3: If the Hamming distance between two simhash values is within 4, then when each value is cut into 5 blocks, at least 1 of the 5 blocks is identical in both;
S4: To retrieve the simhash values whose Hamming distance from a given simhash is within 4, divide the given simhash into 5 blocks, convert each block (a 12-bit or 13-bit string) into the corresponding integer, and look up the same block value in the corresponding table; take the simhash values stored under the matching blocks, that is, secret point fingerprint strings, and store them in a candidate set R, so that R holds the simhash values that appear in the set of at least 1 block; then compute the Hamming distance to each candidate one by one and take the minimum, which identifies the similar secret point fingerprint;
S5: For business scenarios with a threshold within 15, store each secret point fingerprint string in the secret-related information base as a node in an in-memory BK tree structure;
S6: To build the tree, first pick a random secret point fingerprint string from the secret-related information base as the root node, then insert the remaining secret point fingerprint strings one at a time. For each insertion, first compute the edit distance d between the string and the root (the edit distance here is the number of positions at which the two secret point fingerprint strings have different characters); if no edge with value d exists yet at the current BK tree node, create a new child node on a new edge d, otherwise descend recursively along the existing edge d; each node in the BK tree can have any number of child nodes, and the value on each edge is the edit distance between the two nodes it connects;
S7: Set the search distance threshold to n, compute the edit distance between the secret-related point string to be judged and the nodes in the BK tree, and add every node whose edit distance from that string is not more than n to the result candidate set;
S8: To search the BK tree, first compute the edit distance d between the secret-related point string to be judged and the root node, then recursively search every child edge whose value lies in the interval [d-n, d+n]; whenever the distance d between the string to be judged and the node being checked does not exceed the threshold n, return that node and continue the recursive query until the BK tree has been exhausted;
S9: Use the Hamming distance to compute the distance between the fingerprint of the document to be classified and the secret point fingerprints in the BK tree, and set the initial input threshold according to the different elements and business scenarios;
S10: Using the distance between the secret-related point fingerprint strings in the BK tree and the fingerprint of the current document to be classified, search the constructed BK tree for every secret point fingerprint within the threshold, and store these fingerprints in the candidate set R;
if the candidate set R returned from the BK tree structure for the current input contains secret point fingerprint data, similar secret point fingerprint information is considered to exist;
if the candidate set R returned from the BK tree structure for the current input contains no secret point fingerprint data, the Hamming distance is larger than the preset value and no similar secret point fingerprint information is considered to exist.
2. The method for quickly matching fingerprints based on documents according to claim 1, wherein: 5 tables are created for all simhash values in S3, with each table storing the blocks at one position: the first table stores bits 0-12, the second bits 13-25, the third bits 26-38, the fourth bits 39-51, and the fifth bits 52-63; an inverted index is used within each table, so that simhash values are indexed by 13-bit or 12-bit strings.
3. The method for quickly matching fingerprints based on documents according to claim 2, wherein: when the threshold n in S8 is set appropriately, the structure of the BK tree guarantees that child nodes whose edge values fall outside the range [d-n, d+n], together with their entire branches, cannot satisfy the search condition; the query therefore needs to traverse only 10% of all nodes, which is far more efficient than a brute-force search.
CN202111198737.5A 2021-10-14 2021-10-14 Method for quickly matching fingerprints based on documents Active CN113918974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111198737.5A CN113918974B (en) 2021-10-14 2021-10-14 Method for quickly matching fingerprints based on documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111198737.5A CN113918974B (en) 2021-10-14 2021-10-14 Method for quickly matching fingerprints based on documents

Publications (2)

Publication Number Publication Date
CN113918974A CN113918974A (en) 2022-01-11
CN113918974B CN113918974B (en) 2024-04-12

Family

ID=79240623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111198737.5A Active CN113918974B (en) 2021-10-14 2021-10-14 Method for quickly matching fingerprints based on documents

Country Status (1)

Country Link
CN (1) CN113918974B (en)

Also Published As

Publication number Publication date
CN113918974A (en) 2022-01-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant