CN113590606B - Bloom filter-based large data volume key deduplication method and system

Bloom filter-based large data volume key deduplication method and system

Info

Publication number: CN113590606B (application CN202111133541.8A; also published as CN113590606A)
Authority: CN (China)
Prior art keywords: key, deduplication, value, keys, bloom filter
Legal status: Active (granted)
Inventors: 丁胜建, 封连重
Assignee (current and original): Zhejiang Quantum Technologies Co., Ltd.
Priority / filing date: 2021-09-27 (CN202111133541.8A)
Publication of CN113590606A: 2021-11-02
Publication of CN113590606B (grant): 2021-12-31
Other languages: Chinese (zh)

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/08 - Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L 9/0894 - Escrow, recovery or storing of secret information, e.g. secret key escrow or cryptographic key storage
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/215 - Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2237 - Vectors, bitmaps or matrices

Abstract

A bloom filter-based large data volume key deduplication method comprises the following steps: acquiring the data to be deduplicated; initializing the deduplication system; dividing and storing the data across storage units; bloom deduplication of the data; traversal statistics of the positive data; and accurate data deduplication. The invention also provides a bloom filter-based large data volume key deduplication system. Compared with the prior art, the invention provides a divide-and-conquer storage method and a positive-data-based accurate deduplication method for large data volume key deduplication. The large volume of keys is uniformly distributed to different storage units according to the hash remainder, which ensures that duplicate keys land in the same data set and reduces the BitSet space occupation and deduplication overhead required by a single bloom filter, i.e. it improves the space and time efficiency of the bloom filter during deduplication. In addition, accurate deduplication of the key data is achieved through traversal statistics over a HashSet of the positive data, improving deduplication accuracy and key quality.

Description

Bloom filter-based large data volume key deduplication method and system
Technical Field
The invention relates to the technical field of electric digital data processing, and in particular to a bloom filter-based large data volume key deduplication method and system.
Background
With the continuous development of quantum key distribution and quantum key relay technology, servers in practical applications now store keys of very large data volume. As the amount of key data keeps growing, removing duplicate keys from such large volumes has become an urgent need: deduplication both strengthens key security and improves key quality. For large-scale deduplication, a bloom filter algorithm is commonly adopted; it achieves data deduplication based on several hash functions and a Bitmap binary vector and is highly efficient in both time and space, but a pure bloom filter scheme has a false positive rate and therefore cannot deduplicate accurately. Among existing data deduplication techniques, invention patent CN108804242A, "A data counting deduplication method, system, server and storage medium", discloses a method that performs counting deduplication at a level selected by a preset deduplication grade using the Bloom Filter algorithm; based on multi-level deduplication it can use a redis cache to improve efficiency, but it still does not meet the requirement of accurate deduplication. Invention patent CN110704407A, "A data deduplication method and system", adds a data acceleration layer at the first operation level of a database, maps the data to be deduplicated into a deduplication dictionary table array, imports that array into the data management system of the acceleration layer so that the data is converted into bit format and stored in a Bitmap set, and finally achieves exact deduplication within the Bitmap set. However, because keys are random, the possible values of a group of 64-bit keys are uniformly distributed over 0 to 2^64; mapping the key data onto a Bitmap would therefore require 2,147,483,648 GB of binary vector storage, which is impossible to realize.
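As a quick sanity check of that figure: 2^64 bits = 2^61 bytes = 2^31 GB = 2,147,483,648 GB, i.e. roughly 2 EiB of binary vector storage would be needed to cover the full 64-bit key space with a Bitmap.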
Disclosure of Invention
The invention aims to provide a bloom filter-based large data volume key deduplication method and system, so as to overcome the technical defect that prior-art deduplication methods for large data volumes have difficulty achieving accurate deduplication.
The technical scheme of the invention is implemented as follows:
a bloom filter-based large data volume key deduplication method is characterized by comprising the following steps:
acquiring the key data to be deduplicated, namely the key data to be stored and to be accurately deduplicated;
initializing the deduplication system, and creating a plurality of persistent storage units and corresponding bloom filter instances according to preset parameters;
divide-and-conquer storage of the keys: each time a group of keys is input, calculating its hash value through a hash function, performing a mapping operation between the hash value and the number of storage units, and storing the key to the storage unit identified by the mapping result;
bloom deduplication of the keys: during storage, each key is checked by the bloom filter; if the key does not exist, the deduplication detection identification field of the key is set to non-duplicate, and if the key may already exist, the key is added to the positive data set;
traversal statistics of the positive data: traversing the storage unit and judging whether each key exists in the positive data set; if so, the key may be duplicated, and the key together with feature information such as its storage location is recorded into a traversal statistics result set in key-value pair form;
accurate deduplication of the keys: in the traversal statistics result set, if the key of an element occurs multiple times, the duplicate keys are removed according to the multiple storage locations recorded in the element and at most one group of the key is retained; otherwise the key is unique, and the deduplication detection identification field of the unique key is updated to non-duplicate;
thereby completing accurate deduplication of the large data volume keys.
Preferably, in the step of initializing the deduplication system, the preset parameters at least include the target data total amount, the storage capacity of a single storage unit, and the expected false positive rate; the target data total amount and the single-unit storage capacity together determine the number of storage units, the single-unit storage capacity and the expected false positive rate together determine the size of the Bitmap array of each bloom filter and the number of hash functions, and the storage units correspond to the bloom filters one to one.
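As an illustration of how these parameters interact, the following Java sketch applies the standard Bloom filter sizing formulas m = -n·ln(fpp)/(ln 2)^2 and k = (m/n)·ln 2, with the number of storage units N chosen so that N × n >= S; the class and field names are illustrative and not taken from the patent.

    // Illustrative sketch of the initialization parameters described above (names not from the patent).
    public final class DedupInitParams {
        public final int numUnits;         // N: number of persistent storage units
        public final long bitmapBits;      // m: Bitmap (BitSet) size per bloom filter, in bits
        public final int numHashFunctions; // k: hash functions per bloom filter

        public DedupInitParams(long targetTotal /* S */, long unitCapacity /* n */, double fpp) {
            // N = ceil(S / n), so that N * n >= S
            this.numUnits = (int) Math.ceil((double) targetTotal / unitCapacity);
            // m = -n * ln(fpp) / (ln 2)^2
            this.bitmapBits = (long) Math.ceil(-unitCapacity * Math.log(fpp) / (Math.log(2) * Math.log(2)));
            // k = (m / n) * ln 2, rounded and at least 1
            this.numHashFunctions = Math.max(1, (int) Math.round((double) bitmapBits / unitCapacity * Math.log(2)));
        }
    }

For example, n = 10^9 and fpp = 0.0005 give m of roughly 1.58 × 10^10 bits (about 1.8 GB of BitSet per filter) and k of about 11, which is why splitting the keys across N smaller units keeps each filter's memory footprint manageable.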
Preferably, the key is stored in a divide-and-conquer manner: a key hash value is calculated through a hash function and the hash value is mapped to a designated storage unit, where the hash function used comprises any one of the java hashCode method, the SM3 hash algorithm, the SHA algorithm and the MD5 algorithm.
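A minimal routing sketch under the java hashCode option mentioned above (the helper name is illustrative, not defined by the patent):

    // Route a key to one of N storage units by hashing it and taking the remainder.
    // Duplicate keys hash identically, so they always land in the same unit (and the same bloom filter).
    static int storageUnitIndex(byte[] key, int numUnits) {
        int hash = java.util.Arrays.hashCode(key);   // could equally be SM3/SHA/MD5 truncated to an int
        return Math.floorMod(hash, numUnits);        // floorMod keeps the index in 0..numUnits-1 even for negative hashes
    }

Using floorMod rather than the % operator avoids negative indices when the hash value is negative.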
Preferably, the key is added to the positive data Set and saved; the positive data Set uses a Set framework with unique-value elements in the form of any one of Set, HashSet and LinkedHashSet, and the positive data is added to the Set by the add method.
Preferably, the traversal statistics result set uses a collection framework with key-value pair elements in the form of any one of HashMap, LinkedHashMap and HashTable; the repeated key and its storage location feature information are organized into key-value pair elements, and the statistics are added or updated to the traversal statistics result set.
Preferably, the storage units correspond to the bloom filters one to one; bloom deduplication of the keys is performed synchronously with key storage, different keys are mapped and stored to their corresponding storage units and associated with the corresponding bloom filters, and the plurality of bloom filters are processed in parallel.
Preferably, the hash value is mapped to a designated storage unit; the mapping method at least comprises taking the remainder of the hash value modulo the number of storage units and using the remainder as the storage unit serial number, so that repeated keys, whose hash values are identical, are stored to the same storage unit.
Preferably, when a statistic is added to the traversal statistics result set, if the key-value pair element already exists, the element is taken out and the storage location feature information is appended to the ordered collection corresponding to the element value, where the ordered collection uses one of an array and a List.
Preferably, bloom deduplication of the key is performed as follows: when the key is stored, the existence of the key is evaluated through the Bloom Filter algorithm; if it is determined not to exist, the group of keys is put into the bloom filter instance and the deduplication detection identification field of the key in the storage unit is set to non-duplicate, indicating that the key is unique; if it is determined that the group of keys may exist, it is Positive data, that is, the same key may already exist in the bloom filter instance, so the deduplication detection identification field of the key in the storage unit is set to Positive and the group of keys is also stored into the <Value> set that holds the positive data.
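A minimal sketch of this check, using Guava's BloomFilter as one possible implementation of the Bloom Filter algorithm (the patent does not prescribe a particular library; the class and field names are illustrative):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.util.HashSet;
    import java.util.Set;

    // Per-storage-unit deduplication state: one bloom filter plus the <Value> set of positive data.
    final class UnitBloomDedup {
        final BloomFilter<byte[]> filter;
        final Set<String> positiveKeys = new HashSet<>();  // positive data, keyed by the hex form of the key

        UnitBloomDedup(long expectedInsertions, double fpp) {
            this.filter = BloomFilter.create(Funnels.byteArrayFunnel(), expectedInsertions, fpp);
        }

        // Returns true if the key is definitely new; false if it is positive data (possible duplicate).
        boolean checkAndRecord(byte[] key) {
            if (!filter.mightContain(key)) {   // definitely not seen before
                filter.put(key);
                return true;                   // caller sets the deduplication detection field to non-duplicate
            }
            positiveKeys.add(toHex(key));      // possible duplicate: remember it for the traversal pass
            return false;                      // caller sets the deduplication detection field to Positive
        }

        private static String toHex(byte[] key) {
            StringBuilder sb = new StringBuilder(key.length * 2);
            for (byte b : key) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }

Keys are recorded as hex strings because byte[] uses identity equality in a HashSet; any canonical, value-based representation would serve the same purpose.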
Preferably, traversal statistics of the positive data is performed as follows: the designated storage unit is traversed, and each time a group of keys is taken out, whether the group of keys exists in the <Value> set of positive data is judged. If it does not exist in the <Value> set, the group of keys is unique and is skipped without processing. If it does exist in the <Value> set, the group of keys may be duplicated, so it is judged whether the <Key, Value> set that stores the positive data duplicate statistics already contains a Key corresponding to this group of keys. If not, a <Key, Value> key-value pair is created with the Key being this group of keys, the Value is initialized to a List, and the storage location information corresponding to the group of keys is added to the List. If so, the duplicate key corresponding to this Key and its storage location information have already been recorded, so the designated key-value pair element is obtained using the group of keys as the Key, the storage location information of the group of keys is added to its List, and the updated <Key, Value> pair is written back to the <Key, Value> set that stores the positive data duplicate statistics, completing the update of the designated key-value pair element.
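A minimal sketch of this traversal pass, assuming keys are compared in the same hex form used above and a storage location is represented by a long value such as a database primary key or file offset (the record and method names are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // One stored key: its canonical hex form and its storage location (primary key, file offset, ...).
    record KeyRecord(String keyHex, long storageLocation) {}

    final class PositiveDataStats {
        // Traverse one storage unit and collect, for every key present in the positive set,
        // the ordered list of storage locations where it occurs.
        static Map<String, List<Long>> traverse(Iterable<KeyRecord> unit, Set<String> positiveKeys) {
            Map<String, List<Long>> stats = new HashMap<>();                 // the <Key, Value> result set
            for (KeyRecord rec : unit) {
                if (!positiveKeys.contains(rec.keyHex())) continue;          // unique key: skip, no processing
                stats.computeIfAbsent(rec.keyHex(), k -> new ArrayList<>())  // create the key-value pair if absent
                     .add(rec.storageLocation());                            // append this occurrence's location
            }
            return stats;
        }
    }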
Preferably, accurate deduplication of the key is performed as follows: the <Key, Value> set that stores the positive data duplicate statistics is traversed; if the number of elements in the Value list of an element is greater than 1, duplicate keys of the key indicated by the Key have been found, so the duplicates are removed according to the storage location information held in the Value list, only the 1st group among the duplicate keys is retained, and the deduplication detection identification field of that group of keys is updated to non-duplicate; if the number of elements in the Value list of an element is 1, the key indicated by the Key is unique, and the deduplication detection identification field of that group of keys is updated to non-duplicate.
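A minimal sketch of this final pass over the result set; the KeyStore interface is a placeholder for whatever persistent backend (database table or file) holds the keys, not an interface defined by the patent:

    import java.util.List;
    import java.util.Map;

    // Placeholder abstraction over the persistent storage unit.
    interface KeyStore {
        void remove(long storageLocation);            // delete the key stored at this location
        void markNonDuplicate(long storageLocation);  // clear the Positive flag on the surviving copy
    }

    final class AccurateDedup {
        // Keep the first occurrence of each duplicated key, delete the rest,
        // and mark every surviving key as non-duplicate.
        static void run(Map<String, List<Long>> stats, KeyStore store) {
            for (Map.Entry<String, List<Long>> e : stats.entrySet()) {
                List<Long> locations = e.getValue();
                for (int i = 1; i < locations.size(); i++) {   // retain the 1st group, remove the others
                    store.remove(locations.get(i));
                }
                store.markNonDuplicate(locations.get(0));      // the remaining copy is unique
            }
        }
    }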
The invention also provides a bloom filter-based large data volume key deduplication system, which comprises:
a key acquisition module: used for acquiring the large data volume keys to be stored and to undergo deduplication detection;
a deduplication system initialization module: used for creating a plurality of storage units and corresponding bloom filters according to preset parameters;
a divide-and-conquer storage module: used for calculating a hash value from the input key and storing the key to the designated storage unit according to the mapping result of the hash value and the number of storage units;
a bloom deduplication module: used for performing an initial deduplication judgment through the bloom filter when the key is stored; if the key does not exist it is unique, and if the key may exist it is added to the positive data set;
a positive data traversal statistics module: used for traversing the storage unit and judging in turn whether each key exists in the positive data set; if it exists, indicating the key may be duplicated, the key together with feature information such as its storage location is recorded into the traversal statistics result set in key-value pair form;
an accurate deduplication module: used for traversing the statistics result set, eliminating redundant duplicate keys according to the storage location list of each element, retaining at most one group, and updating the deduplication detection identification field of each unique key to non-duplicate.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a big data volume secret key duplication eliminating method and a big data volume secret key duplication eliminating system based on a bloom filter, aiming at the big data volume secret key duplication eliminating, a divide-and-conquer storage method and an accurate duplication eliminating method based on positive data are provided, the big data volume secret key is uniformly guided and stored to different storage units according to hash surplus, the repeated secret key is ensured to be in the same data set, the BitSet space occupation and duplication eliminating operation consumption required by a single bloom filter are reduced, and the space and time efficiency of the bloom filter during duplication eliminating operation is improved.
The plurality of storage units generated by the invention correspond to the plurality of bloom filters, so that parallel real-time processing of a plurality of key duplication removal process instances can be realized, and the efficiency of an integral system for duplication removal of large-data-volume key data is effectively improved;
the invention provides an accurate duplication removal method and an accurate duplication removal system aiming at duplication removal of a large-data-volume secret key in the field of quantum information security.
Drawings
Fig. 1 is a schematic flowchart of a key deduplication method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a key deduplication system according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of the sub-modules of the deduplication system initialization module 202 according to the second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a sub-module of the positive data traversal statistics module 205 according to the second embodiment of the present invention;
fig. 5 is a schematic flowchart of the traversal statistics of the positive data in step S5 according to the third embodiment of the present invention;
fig. 6 is a schematic diagram of a parallel processing framework of a key deduplication system according to a fourth embodiment of the present invention.
Detailed Description
The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown.
As shown in fig. 1, a bloom filter-based large data volume key deduplication method includes the following steps:
Step 1: acquire the key data to be deduplicated, obtaining keys from a key distribution device or system for subsequent deduplication detection;
Step 2: initialize the deduplication system. Determine the number N of storage units from the target key total S preset by the system and the expected storage capacity n of a single persistent storage unit, and immediately create N database tables or N files; meanwhile, calculate the size m of the Bitmap array of a bloom filter and the number k of hash functions from the expected single-unit storage capacity n and the preset expected false positive rate fpp, create the corresponding bloom filter, and create as many bloom filters as there are storage units;
Step 3: divide-and-conquer storage of the key. Each time a group of keys is input, the hash value used to locate its storage unit is calculated once: the hash value hash of the key is computed by a hash function such as the hashCode method of the java Object class, and the remainder of hash divided by the number N of storage units gives the storage unit serial number index, meaning that the group of keys should be stored to the storage unit with that serial number. If duplicate keys exist, their computed hash values are necessarily the same, so all duplicates are routed to the same storage unit;
Step 4: bloom deduplication of the key. When the key is stored, its existence is evaluated through the Bloom Filter algorithm. If it is determined not to exist, the group of keys is put into the bloom filter instance and the deduplication detection identification field of the key in the storage unit is set to non-duplicate, indicating that the key is unique. If the group of keys is determined to possibly exist, it is Positive data, that is, the bloom filter instance may already contain the same key; the deduplication detection identification field of the key in the storage unit is set to Positive, and the group of keys is also stored into the <Value> set that holds the positive data;
Step 5: traversal statistics of the positive data. The designated storage unit is traversed, and each time a group of keys is taken out, whether it exists in the <Value> set of positive data is judged. If it does not exist in the <Value> set, the group of keys is unique and is skipped without processing. If it does exist in the <Value> set, the group of keys may be duplicated, so it is judged whether the <Key, Value> set that stores the positive data duplicate statistics already contains a Key corresponding to this group of keys. If not, a <Key, Value> key-value pair is created with the Key being this group of keys, the Value is initialized to a List, and the storage location information corresponding to the group of keys is added to the List. If so, the duplicate key corresponding to this Key and its storage location information have already been recorded, so the designated key-value pair element is obtained using the group of keys as the Key, the storage location information of the group of keys is added to its List, and the updated <Key, Value> pair is written back to the <Key, Value> set that stores the positive data duplicate statistics, completing the update of the designated key-value pair element.
Step 6: accurate deduplication of the key. The <Key, Value> set that stores the positive data duplicate statistics is traversed; if the number of elements in the Value list of an element is greater than 1, duplicate keys of the key indicated by the Key have been found, so the duplicates are removed according to the storage location information held in the Value list, only the 1st group among the duplicate keys is retained, and the deduplication detection identification field of that group of keys is updated to non-duplicate; if the number of elements in the Value list of an element is 1, the key indicated by the Key is unique, and its deduplication detection identification field is updated to non-duplicate.
Thus, accurate deduplication of the keys with large data size is completed.
In step 2, the storage capacity n of a single storage unit and the number N of storage units are adjusted according to the memory, CPU and other hardware performance of the key storage server and the database performance, ensuring that N × n >= S.
In step 2, for the expected false positive rate fpp of the bloom filter, an fpp parameter modification interface is provided at the bloom filter creation stage, and fpp is adjusted according to actual requirements.
In step 2, a corresponding number of bloom filters is created according to the number of storage units, the storage units correspond to the bloom filters one to one, bloom deduplication of the key is performed when the key is stored, and the bloom filters are processed in parallel in the system, realizing real-time, parallel and efficient deduplication queries.
In step 3, the hash value calculation related to the key storage location uses a hash function with uniformly distributed mapping and fast calculation performance, and uses one of java hashCode method, SM3 hash algorithm, SHA algorithm and MD5 algorithm.
In step 3, the key is stored in a divide-and-conquer manner using schemes such as redis, mysql or file storage, with the preferred one selected according to the actual application scenario of the system; divide-and-conquer storage means that the keys are uniformly distributed and stored to N database tables when a database storage scheme is used, and to N files when a file storage scheme is used.
In step 3, the storage unit sequence number index is represented as a base table sequence number when a database storage scheme such as redis and mysql is used, and is represented as a file sequence number when a file storage scheme is used.
In step 4, the deduplication detection identification field of the key is set to non-duplicate or Positive and is used by the system, when outputting keys, to decide whether a key may be output preferentially: a non-duplicate field indicates that the key is unique and can be output preferentially, while a key whose field is Positive is either removed or updated to non-duplicate after the accurate deduplication of steps 5 and 6.
In step 4, the < Value > Set for storing positive data uses a Set framework including any one of the unique element forms of Set, HashSet, LinkedHashSet.
In step 5, the <Key, Value> set storing the traversal result of the positive data uses a collection framework with key-value pair elements in the form of any one of HashMap, LinkedHashMap and HashTable.
In step 5, whether the key already exists in the <Value> set of the positive data is judged by the contains method of the HashSet class.
In step 5, whether the <Key, Value> set storing the positive data duplicate statistics already contains a Key corresponding to the group of keys is judged by the containsKey method of the HashMap class.
In step 5, the Value is initialized to the List using any one of ArrayList and LinkedList.
The invention also provides a bloom filter-based large data volume key deduplication system, which comprises:
A key acquisition module: used for acquiring the large data volume keys to be stored and to undergo deduplication detection from key distribution systems such as a quantum key distribution system or a quantum key relay network system.
A deduplication system initialization module: used for determining the number N of storage units from the expected key total S input to the system and the expected storage capacity n of a single persistent storage unit, and immediately creating N database tables or N files; meanwhile, according to parameters input to the system such as the expected single-unit storage capacity n and the preset expected false positive rate fpp, calculating the size m of the Bitmap array of a single bloom filter, selecting the number k of hash functions, creating the corresponding bloom filter, and creating as many bloom filters as there are storage units.
A divide-and-conquer storage module: used for, each time a group of keys is input, taking the key as the input of the hash function that computes the storage unit serial number, taking the remainder of the hash result modulo N as the storage unit serial number index corresponding to the key, and finally routing and storing the key to the database table or file whose serial number is index.
A bloom deduplication module: used for evaluating, when the key is stored, the duplication status of the key through the Bloom Filter algorithm. If it is judged not to be a duplicate, the group of keys is put into the bloom filter instance and the deduplication detection flag field of the key in the storage unit is set to non-duplicate, indicating that the key is unique. If the group of keys is judged to possibly be a duplicate, it is Positive data, that is, the same key may already exist in the bloom filter instance; the deduplication detection flag field of the key is set to Positive in the storage unit, and the group of keys is also stored into the <Value> set that holds the positive data.
A positive data traversal statistics module: used for traversing the corresponding storage unit against the <Value> set of positive data; if a group of keys does not exist in the <Value> set, the group of keys is unique and is skipped without processing; if the group of keys exists in the <Value> set, it may be duplicated, so it is judged whether the <Key, Value> set that stores the positive data duplicate statistics already contains a Key corresponding to this group of keys. If not, a <Key, Value> key-value pair is created with the Key being this group of keys, the Value is initialized to a List, and the storage location information corresponding to the group of keys is added to the List. If so, the duplicate key corresponding to this Key and its storage location information have already been recorded, so the designated key-value pair element is obtained using the group of keys as the Key, the storage location information of the group of keys is added to its List, and the updated <Key, Value> pair is written back to the <Key, Value> set that stores the positive data duplicate statistics, completing the update of the designated key-value pair element.
An accurate deduplication module: used for traversing the <Key, Value> set that stores the positive data duplicate statistics. If the number of elements in the Value list of an element is greater than 1, duplicate keys of the key indicated by the Key have been found; the duplicates are removed according to the storage location information held in the Value list, only the 1st group among the duplicate keys is retained, and the deduplication detection identification field corresponding to that group of keys is updated to non-duplicate. If the number of elements in the Value list of an element is 1, the key indicated by the Key is unique, and the deduplication detection identification field corresponding to that group of keys is updated to non-duplicate.
The first embodiment is as follows: referring to fig. 1, the present invention provides a bloom filter-based big data volume key deduplication method, including the following steps:
s1: and acquiring data of the key to be deduplicated, acquiring a key K from a quantum key distribution system, a quantum key relay network system and other security key distribution systems, and waiting for deduplication detection.
S2: initializing a deduplication system, determining the number N of storage units according to the total S of a target secret key designed by the system and the expected storage capacity N of a single persistent storage unit, and immediately creating N database tables or N files; meanwhile, the size m of a single bloom filter Bitmap array and the number k of hash functions are calculated according to the expected storage capacity n of a single storage unit and the preset expected error judgment rate fpp, a bloom filter BF is created, and a corresponding number of bloom filters is created according to the number of the storage units.
S3: the key division storage is carried out, each time a group of keys X is input, the hash value calculation of the key storage position is carried out, the hash value hash _ X corresponding to the key X is calculated through a hash function such as a java Object type hash code method, the hash value hash _ X and the number N of storage units are subjected to remainder operation to obtain the serial number index of the storage unit, when the database table is used for storage, the serial number (the range is from 0 to N-1, and the total number is N) of the database table represents that the group of keys are stored to the storage unit with the serial number index, if repeated keys exist, the calculated hash _ X is certain same, and the repeated keys are all stored to the same storage unit in a guiding mode.
S4: the key Bloom deduplication method includes the steps that a key Bloom deduplication algorithm is used for conducting Bloom deduplication calculation on the key when the key is stored, if the key Bloom deduplication calculation is judged to be not available, the group of keys is placed into a Bloom Filter BF through a Bloom Filter. If the group key is determined to be the Positive data, that is, the bloom filter BF may already have the same key, the deduplication detection flag field of the key is set to Positive in the storage unit, and the group key is added as a Value element to the hashSet set in the form of a < Value > element that stores the Positive data by the hashSet.
S5: and traversing and counting the positive data, namely traversing the corresponding storage units for a certain key duplication removal process in the parallel processing process of the system, and judging whether the group of keys exist in a hashSet set of the positive data or not when each group of keys is taken out. If the group key does not exist in the hashSet set, the group key is unique, and skipping and no processing are performed; if the group Key exists in the hashSet set, which indicates that the group Key is possibly duplicated, judging whether the hashMap set in the form of the < Key, Value > element for storing the positive data duplication statistical result already has the Key corresponding to the group Key, if not, then a Key, Value > Key Value pair is created, the Key being the set of keys, the Value is initialized to the ArrayList list, for storing Key storage location information, such as database primary Key, file displacement, actual storage address, etc., adding the group of Key storage location information to the ArrayList list using the ArrayList. And acquiring a corresponding Key-Value pair element for the Key through the group of keys, adding the storage position information of the group of keys into the Value ArrayList list, and writing the updated Key-Value pair and the updated Value Key-Value pair back to the hashMap set through a hashMap.
S6: accurate deduplication of Key data, traversing a hashMap set storing a positive data repeated statistical result, if the number of elements in a Value ArrayList list of a certain element is greater than 1, indicating that a repeated Key of a Key indication Key is found, removing the repeated Key according to storage position information of the repeated Key stored in the Value ArrayList list, only reserving a group 1 Key in the repeated Key, and updating deduplication detection identification fields of the group of keys in a corresponding storage unit to be non-repeated; if the Value list element of a certain element is 1, it indicates that the Key indicates that the Key is unique, and updates the deduplication detection identification field of the group of keys in the corresponding storage unit to be non-repetitive.
Thus, accurate deduplication of the keys with large data size is completed.
In step S2, the expected storage capacity n of a single persistent storage unit is generally on the order of 800 million to 1 billion groups of key data; with an expected false positive rate fpp of 0.0005, a single key bloom deduplication pass takes 20 to 50 minutes depending on CPU performance.
In step S2, the expected false positive rate fpp of the bloom filter is generally between 0.0005 and 0.005; under the above parameters, the amount of key data falsely judged positive by the bloom filter is between several thousand and several tens of thousands of groups.
In the embodiment of the present invention, the target key total S and the expected false positive rate fpp mentioned in step S2 are set after comprehensively considering factors such as system storage capacity, hardware parameters such as memory and CPU, database performance, file IO performance and application requirements. The specific implementation of this consideration process is not part of the present invention; it serves only as the technical basis for describing the proposed method.
Example two: referring to fig. 2, fig. 3 and fig. 4, the present invention further provides a bloom filter-based large data volume key deduplication system, which includes the following components:
the key acquisition module 201: the method is used for acquiring the large-data-volume key to be stored and to be subjected to deduplication detection from key distribution systems such as a quantum key distribution system and a quantum key relay network system.
Deduplication system initialization module 202: for creating a storage unit and a bloom filter according to the entries, as shown in fig. 3, the deduplication system initialization module 202 includes the following sub-modules:
(1) create memory cell submodule 2021: the device is used for determining the number N of the storage units according to the expected key total S input by the system and the expected storage capacity N of a single persistent storage unit, and immediately creating N database tables or N files;
(2) create bloom filter sub-module 2022: calculating the size m of a single bloom filter Bitmap array and the number k of hash functions according to parameters such as the expected storage capacity n of a single storage unit, the preset expected error judgment rate fpp and the like input by a system, creating a corresponding bloom filter BF, and creating a corresponding number of bloom filters according to the number of the storage units.
The divide-and-conquer storage module 203: used for taking the input key as the input of the hash function that computes the storage unit serial number, taking the remainder of the hash result modulo the number N of storage units to obtain the serial number index, calling the corresponding storage interface, and routing and storing the group of input keys to the storage unit indicated by index, namely a certain database table or file, thereby ensuring that duplicate keys are stored in the same storage unit and deduplicated by the same bloom filter.
The bloom deduplication module 204: used for performing, when the key is stored, the bloom deduplication check on the key through the Bloom Filter algorithm. If the key is judged not to exist, the group of keys is put into the bloom filter BF (through the filter's put operation) and the deduplication detection flag field of the key in the storage unit is set to non-duplicate, indicating that the key is unique. If the group of keys is judged to possibly exist, it is Positive data, that is, the bloom filter BF may already contain the same key; the deduplication detection flag field of the key is set to Positive in the storage unit, and the group of keys is added as an element (through the set's add operation) to the hashSet set, in <Value> element form, that holds the positive data.
The positive data traversal statistics module 205: for a given key deduplication process in the system's parallel processing, the module 205 is used for traversing the corresponding storage unit, judging and counting the duplication status of each group of keys taken out, and saving the duplicate-screening result, including the storage location information of each group of duplicate keys held in an ArrayList list, to a hashMap set. As shown in fig. 4, the positive data traversal statistics module 205 includes the following sub-modules:
(1) a positive data traversal sub-module 2051: each time a group of keys is taken out, it judges whether the keys exist in the hashSet set of positive data. If the group of keys does not exist in the hashSet set, it is unique and is skipped without processing; if the group of keys exists in the hashSet set, it may be duplicated and is handed over to the duplicate result statistics sub-module 2052 for processing.
(2) a duplicate result statistics sub-module 2052: after sub-module 2051 determines that the key may be duplicated, sub-module 2052 judges whether the hashMap set in <Key, Value> element form that stores the positive data duplicate statistics already contains a Key corresponding to this group of keys. If not, a <Key, Value> key-value pair is created with the Key being this group of keys and the Value initialized to an ArrayList list for storing key storage location information such as the database primary key, file offset or actual storage address, and the storage location information of this group of keys is added to the ArrayList list (through the list's add operation). If so, the corresponding key-value pair element is obtained using this group of keys as the Key, the storage location information of the group of keys is added to the Value ArrayList list, and the updated <Key, Value> pair is written back to the hashMap set (through the map's put operation), completing the update of the designated key-value pair element.
The accurate deduplication module 206: used for traversing the hashMap set that stores the positive data duplicate statistics to achieve accurate removal of duplicate keys. If the number of elements in the Value list of an element of the set is greater than 1, duplicate keys of the key indicated by the Key have been found; the duplicates are removed according to the storage location information held in the Value list, only the 1st group among the duplicate keys is retained, and the deduplication detection flag field corresponding to that group of keys is updated to non-duplicate. If the number of elements in the Value list of an element is 1, the key indicated by the Key is unique, and the deduplication detection flag field corresponding to that group of keys is updated to non-duplicate.
Example three: referring to fig. 5, on the basis of the first embodiment, the flow of the positive data traversal statistics of step S5 is detailed, comprising sub-steps S501 to S506, as follows:
S501: traverse the designated storage unit and take out a group of keys K;
S502: judge whether key K exists in the HashSet set of positive data output by step S4; if key K does not exist, it is unique and needs no processing, and the flow jumps to step S501 to start the next round of traversal statistics; if key K exists, it is processed by step S503;
S503: key K exists in the HashSet set of positive data, indicating that key K may be duplicated; obtain the actual storage location information of key K, i.e. information that can represent the key's actual storage location, such as the file offset of key K within the storage unit or its database primary key;
S504: judge whether the HashMap set that stores the positive data duplicate statistics contains an element whose Key is key K; if not, it is processed by S505, and if so, by S506;
S505: create a key-value pair element whose Key is key K, initialize the Value to an ArrayList list, add the actual storage location information of key K, and add the key-value pair to the HashMap set that stores the positive data duplicate statistics;
S506: take out the element whose Key is key K from the HashMap set that stores the positive data duplicate statistics, and add the actual storage location information of key K to the ArrayList list in that element's Value.
Example four: referring to fig. 6, on the basis of the structure of the key deduplication system provided by the present invention and described in fig. 2, the present invention also provides a parallel processing framework of the key deduplication system, as follows:
example deduplication system Inst: the deduplication system instance Inst includes N deduplication process instances, that is, N process instances such as the following deduplication process instances Inst1, InstX, InstN, and the like, where N is the number of storage units, that is, the number of bloom filters required by the deduplication system.
Deduplication process instance Inst 1: the key deduplication system process instance Inst1 is the 1 st process instance in the key deduplication system parallel process, and includes a storage unit 601, a bloom filter 602, a HashSet set 603, and a HashMap set 604.
Deduplication process instance InstX: the key deduplication system process instance InstX is the xth process instance in the parallel process of the key deduplication system, and includes a storage unit 601, a bloom filter 602, a HashSet set 603, and a HashMap set 604.
Deduplication process instance InstN: the key deduplication system process instance InstN is the Nth process instance in the parallel processing of the key deduplication system, and includes a storage unit 601, a bloom filter 602, a HashSet set 603, and a HashMap set 604.
The storage unit 601: the storage unit 601 is one of the N storage units created by the deduplication system initialization module 202 in the second embodiment, and is used to store the input keys whose hash-and-remainder result designates this storage unit's serial number.
Bloom filter 602: the bloom filter 602 is one of the N bloom filters created by the deduplication system initialization module 202 in the second embodiment, and is used for calculating deduplication detection of a key based on a bloom filter algorithm when the storage unit 601 stores the key.
HashSet set 603: the HashSet set 603 is one of the N HashSet sets created by the bloom deduplication computing module 204 in the second embodiment, and is used for storing the positive data output by the bloom filter 602 in deduplication detection.
HashMap set 604: the HashMap set 604 is one of the N HashMap sets created by the positive data traversal statistics module 205 in the second embodiment, and is used to store the duplicate statistics produced by traversing all keys of the storage unit against the HashSet set 603.
The deduplication system instance Inst comprises N deduplication process instances. When the number of CPU cores of the server hosting the system is greater than or equal to N and the memory occupied by the N process instances is less than the server's available memory, the deduplication system instance Inst runs the N deduplication process instances in parallel; when these conditions are not met, the deduplication system instance Inst runs the deduplication process instances sequentially in batches according to the server's CPU core count and available memory, until all N deduplication process instances have been executed.
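One straightforward way to realize this batching behaviour is a fixed-size thread pool bounded by the number of CPU cores; the sketch below is illustrative only (it omits the memory check, and each Runnable stands for one deduplication process instance covering a storage unit and its bloom filter):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    final class DedupInstanceRunner {
        // Run N deduplication process instances, at most `cores` of them concurrently;
        // the remaining instances queue up and execute in subsequent batches.
        static void runAll(List<Runnable> instances) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            int poolSize = Math.min(cores, Math.max(1, instances.size()));
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            instances.forEach(pool::submit);
            pool.shutdown();                                        // no new tasks; queued batches keep running
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }
    }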
It can be seen from the structure, method and embodiments above that the bloom filter-based large data volume key deduplication method and system provided by the invention introduce a divide-and-conquer storage method for large data volume key deduplication: the keys are uniformly distributed to different storage units according to the hash-remainder mapping, which not only ensures that duplicate keys are stored in the same data set, but also reduces the expected data volume of each bloom filter, the BitSet space a single bloom filter must allocate, and the deduplication overhead of a single bloom filter, i.e. it improves the space and time efficiency of the bloom filter during deduplication. The plurality of storage units corresponds to the plurality of bloom filters, so that multiple key deduplication process instances can be processed in parallel in real time, effectively improving the overall system efficiency of deduplicating large data volume key data. The invention thus provides an accurate deduplication method and system for large data volume key deduplication in the field of quantum information security.

Claims (12)

1. A bloom filter-based large data volume key deduplication method is characterized by comprising the following steps:
acquiring a key to be stored and accurately deduplicated;
initializing a deduplication system, and creating a plurality of persistent storage units and corresponding bloom filter instances according to preset parameters;
divide-and-conquer storage of the keys: each time a group of keys is input, calculating its hash value through a hash function, performing a mapping operation between the hash value and the number of storage units, and storing the key to the storage unit identified by the mapping result;
bloom deduplication of the keys: during storage, each key is checked by the bloom filter; if the key does not exist, the deduplication detection identification field of the key is set to non-duplicate, and if the key may already exist, the key is added to the positive data set;
traversal statistics of the positive data: traversing the storage unit and judging whether each key exists in the positive data set; if so, the key may be duplicated, and the duplicate key together with its storage location feature information is recorded into a traversal statistics result set in key-value pair form;
accurate deduplication of the keys: in the traversal statistics result set, if the key of an element occurs multiple times, the duplicate keys are removed according to the multiple storage locations recorded in the element and one group of the key is retained; otherwise the key is unique, and the deduplication detection identification field of the unique key is updated to non-duplicate;
thereby completing accurate deduplication of the large data volume keys.
2. The bloom filter-based large data volume key deduplication method according to claim 1, wherein in the deduplication system initialization step, the preset parameters include a target data total volume, a single storage unit storage capacity and an expected false positive rate, the target data total volume and the single storage unit storage capacity are used for determining the number of storage units, the single storage unit storage capacity and the expected false positive rate are used for determining a bloom filter Bitmap array size and the number of hash functions together, and the storage units are in one-to-one correspondence with the bloom filters.
3. The bloom filter-based big data size key deduplication method as claimed in claim 2, wherein the key is stored in a divide-and-conquer manner: a key hash value is calculated by a hash function, the hash value is mapped to a designated storage unit, and the hash function used comprises any one of a java hashCode method, an SM3 hash algorithm, an SHA algorithm, and an MD5 algorithm.
4. The bloom filter-based big data size key deduplication method as claimed in claim 2, wherein the key is added to the positive data Set and saved; the positive data Set uses a Set framework with unique-value elements in the form of any one of Set, HashSet and LinkedHashSet, and the positive data is added to the Set by the add method.
5. The bloom filter-based big data size key deduplication method as claimed in claim 2, wherein the traversal statistics result set is saved by using a set frame in the form of any one of a HashMap, a LinkedHashMap, and a HashTable, organizing repeated keys and storage location feature information into key-value pair form elements, and adding or updating the statistics results to the traversal statistics result set.
6. The bloom filter-based large data size key deduplication method as claimed in claim 2, wherein the storage units correspond to bloom filters one to one, and further comprises performing bloom deduplication of keys synchronously when the keys are stored, mapping different keys to be stored in the corresponding storage units and associated with the corresponding bloom filters, and processing a plurality of bloom filters in parallel.
7. The bloom filter-based large data size key deduplication method as claimed in claim 3, wherein the hash value is mapped to a designated storage unit, the mapping method at least comprises taking the remainder of the hash value modulo the number of storage units, with the remainder designating the storage unit serial number; the hash values of repeated keys are the same, so repeated keys are stored in the same storage unit.
8. The bloom filter based big data volume key deduplication method as claimed in claim 5, wherein the statistical result is added to a traversal statistical result set, if the key value pair element already exists, the element is fetched, and the storage location characteristic information is added to an ordered set corresponding to the element value, wherein the ordered set uses one of an array and a List.
9. The Bloom Filter-based large data volume key deduplication method as claimed in claim 2, wherein the Bloom deduplication of the key is performed, during key storage, the existence condition of the key is calculated through a Bloom Filter algorithm, if the existence condition is determined not to exist, the group of keys is placed in the Bloom Filter instance, and a deduplication detection identification field of the key in the storage unit is set to be not repeated, which indicates that the key is unique; if the set of keys is determined to exist, the set of keys is Positive data, that is, the same key already exists in the bloom filter instance, the deduplication detection identifier field of the same key in the storage unit is set to Positive, and the set of keys is also stored in the < Value > set for storing the Positive data.
10. The bloom filter-based massive key deduplication method as claimed in claim 1, wherein the positive data traversal statistics is performed, a designated storage unit is traversed, each time a group of keys is taken out, whether the group of keys already exists in a < Value > set of the positive data is determined, and if the group of keys does not exist in the < Value > set, the group of keys is unique, and no processing is skipped; if the group Key exists in the < Value > set, the group Key is indicated to be duplicated, whether a Key corresponding to the group Key exists in the < Key, Value > set for storing a positive data duplication statistical result is judged, if the Key does not exist, a Key, Value > Key Value pair is created, the Key is the group Key, the Value is initialized to a List, storage location information corresponding to the group Key is added to the List, if the Key exists, the duplication Key corresponding to the Key of the group Key and the storage location information thereof are found, an appointed Key Value pair element is obtained for the Key through the group Key, the storage location information corresponding to the group Key is added to the List, and the updated Key, Value Key Value pair is written back to the < Key, Value > set for storing the positive data duplication statistical result, and updating of the appointed Key Value pair element is completed.
11. The bloom filter-based large data volume key deduplication method as claimed in claim 10, wherein the accurate deduplication of keys is performed as follows: the <Key, Value> set storing the positive data duplication statistics is traversed; if the number of elements in the Value list of an element is greater than 1, duplicate keys of the key indicated by that Key have been found, the redundant duplicates are culled according to the storage location information held in the Value list, only one group of keys among the duplicates is retained, and the deduplication detection identification field of that group of keys is updated to non-duplicate; if the Value list of an element contains exactly one entry, the key indicated by that Key is unique, and the deduplication detection identification field of the group of keys is updated to non-duplicate.
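The accurate deduplication of claim 11 then reduces to inspecting the size of each Value list; a minimal sketch, with an illustrative Unit interface standing in for the storage unit operations:

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of the accurate deduplication of claim 11.
    final class AccurateDedup {
        interface Unit {                          // illustrative view of a storage unit
            void markNonDuplicate(int position);  // update the dedup detection field
            void removeKey(int position);         // cull a redundant duplicate
        }

        static void deduplicate(Map<String, List<Integer>> duplicateStats, Unit unit) {
            for (Map.Entry<String, List<Integer>> entry : duplicateStats.entrySet()) {
                List<Integer> positions = entry.getValue();
                if (positions.size() > 1) {
                    unit.markNonDuplicate(positions.get(0));  // keep only the first group
                    for (int i = 1; i < positions.size(); i++) {
                        unit.removeKey(positions.get(i));     // cull the redundant copies
                    }
                } else {
                    unit.markNonDuplicate(positions.get(0));  // bloom false positive: unique key
                }
            }
        }
    }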
12. A bloom filter-based large data volume key deduplication system, configured to implement the large data volume key deduplication method of any one of claims 1 to 11, comprising:
a key acquisition module, configured to obtain the large data volume of keys to be stored and subjected to deduplication detection;
a deduplication system initialization module, configured to create a plurality of storage units and their corresponding bloom filters according to preset parameters;
a partitioned storage module, configured to calculate a hash value from an input key and store the key to the designated storage unit according to the mapping of the hash value onto the number of storage units;
a bloom deduplication module, configured to perform a preliminary deduplication judgment through the bloom filter while the key is stored: if the filter reports the key as absent, the key is unique; if it reports the key as present, the key is added to the positive data set;
a positive data traversal statistics module, configured to traverse the storage units and judge, key by key, whether each key exists in the positive data set; if a key exists in the set, indicating a duplicate, the duplicate key and its storage location feature information are recorded in the traversal statistics result set in key-value pair form;
an accurate deduplication module, configured to traverse the statistics result set, cull redundant duplicate keys according to each element's storage location list while retaining one group, and update the deduplication detection identification field of the unique key to non-duplicate.
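Viewed as software components, the module division of claim 12 corresponds roughly to the following interface outline; every name and signature here is an illustrative assumption, not something defined by the patent:

    import java.util.List;
    import java.util.Map;

    // Hypothetical outline of the system modules of claim 12.
    interface KeyAcquisitionModule     { List<byte[]> acquireKeys(); }                 // obtain keys to deduplicate
    interface InitializationModule     { void init(int unitCount, long expectedKeys, double fpp); } // units + filters
    interface PartitionedStorageModule { int store(byte[] key); }                      // returns the unit index
    interface BloomDedupModule         { boolean isPositive(byte[] key); }             // preliminary judgment
    interface PositiveTraversalModule  { Map<String, List<Integer>> collect(int unitIndex); } // duplicate statistics
    interface AccurateDedupModule      { void deduplicate(Map<String, List<Integer>> stats); } // exact culling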
CN202111133541.8A 2021-09-27 2021-09-27 Bloom filter-based large data volume secret key duplication eliminating method and system Active CN113590606B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111133541.8A CN113590606B (en) 2021-09-27 2021-09-27 Bloom filter-based large data volume secret key duplication eliminating method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111133541.8A CN113590606B (en) 2021-09-27 2021-09-27 Bloom filter-based large data volume secret key duplication eliminating method and system

Publications (2)

Publication Number Publication Date
CN113590606A CN113590606A (en) 2021-11-02
CN113590606B (en) 2021-12-31

Family

ID=78242238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111133541.8A Active CN113590606B (en) 2021-09-27 2021-09-27 Bloom filter-based large data volume secret key duplication eliminating method and system

Country Status (1)

Country Link
CN (1) CN113590606B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114844638B (en) * 2022-07-03 2022-09-20 浙江九州量子信息技术股份有限公司 Big data volume secret key duplication removing method and system based on cuckoo filter
CN115454983B (en) * 2022-09-13 2023-07-14 浪潮卓数大数据产业发展有限公司 Massive Hbase data deduplication method based on bloom filter
CN117290674B (en) * 2023-11-23 2024-04-05 浙江九州量子信息技术股份有限公司 Method and system for counting and positioning repeated codes of large-data-volume random bit sequence
CN117390610A (en) * 2023-12-13 2024-01-12 中国人民解放军国防科技大学 Identity identification generation method, system and device

Citations (6)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970744A (en) * 2013-01-25 2014-08-06 华中科技大学 Extendible repeated data detection method
CN105429968A (en) * 2015-11-06 2016-03-23 北京数智源科技股份有限公司 Load ownership network evidence-obtaining method and system based on Bloom filters
KR101648317B1 (en) * 2015-12-09 2016-08-16 성균관대학교산학협력단 Method for searching data using partitioned bloom filter for supporting item elimination, cache memory apparatus and storage apparatus using the same
US10102233B1 (en) * 2018-04-30 2018-10-16 Merck Sharp & Dohme Corp. Indexing for database privacy and anonymization
CN111667264A (en) * 2020-05-08 2020-09-15 深圳市路云区链网络科技有限公司 Block data transmission method and device, electronic equipment and nonvolatile computer storage medium
CN112162975A (en) * 2020-09-25 2021-01-01 华南理工大学 Method for realizing repeated data deletion technology based on single-hash equal-distribution bloom filter

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Fast Redundant Data Block Discovery Algorithm Based on Bloom Filtering; Zhou Bin; Journal of South-Central University for Nationalities (Natural Science Edition); 2016-03-31; Vol. 35, No. 3; pp. 130-134 *
Research on Bloom Filter-Based RFID Data Redundancy Processing Algorithms; Huang Weiqing et al.; Journal of Cyber Security; 2019-03-31; Vol. 4, No. 3; pp. 93-105 *

Also Published As

Publication number Publication date
CN113590606A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN113590606B (en) Bloom filter-based large data volume secret key duplication eliminating method and system
CN103488684B (en) Electric reliability index quick calculation method based on data cached multiple threads
CN113342750A (en) File data comparison method, device, equipment and storage medium
CN110399333B (en) Method, apparatus and computer program product for deleting snapshots
CN116450656B (en) Data processing method, device, equipment and storage medium
KR20160100216A (en) Method and device for constructing on-line real-time updating of massive audio fingerprint database
CN111930924A (en) Data duplicate checking system and method based on bloom filter
CN114844638B (en) Big data volume secret key duplication removing method and system based on cuckoo filter
US10303655B1 (en) Storage array compression based on the structure of the data being compressed
CN117235069A (en) Index creation method, data query method, device, equipment and storage medium
CN114943021B (en) TB-level incremental data screening method and device
CN109299106B (en) Data query method and device
CN113360551B (en) Method and system for storing and rapidly counting time sequence data in shooting range
CN114443629A (en) Cluster bloom filter data duplication removing method, terminal equipment and storage medium
CN110032445B (en) Big data aggregation calculation method and device
CN106776704A (en) Statistical information collection method and device
CN111581448B (en) Method and device for warehousing card bin information
CN117729176B (en) Method and device for aggregating application program interfaces based on network address and response body
CN114238258B (en) Database data processing method, device, computer equipment and storage medium
CN115357384B (en) Space reclamation method and device for repeated data deleting storage system
CN117435135B (en) Method, device and system for recovering storage space related to repeated data deletion
CN109933590B (en) Data updating method, device, server and storage medium
CN116186043A (en) Database defragmentation method, device, equipment and medium based on big data
CN117312283A (en) Database and table data verification method and device, computer equipment and storage medium
US20190114323A1 (en) System And Method For Storing Data Records In Key-Value Database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant