CN115617809A - Database uniqueness constraint processing method, device, equipment and medium - Google Patents

Database uniqueness constraint processing method, device, equipment and medium

Info

Publication number
CN115617809A
Authority
CN
China
Prior art keywords
data
preset
data table
stored
hash
Prior art date
Legal status
Granted
Application number
CN202211388507.XA
Other languages
Chinese (zh)
Other versions
CN115617809B (en)
Inventor
李求实
林泽昕
Current Assignee
Guangzhou Ruifan Technology Co ltd
Original Assignee
Guangzhou Ruifan Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Ruifan Technology Co ltd filed Critical Guangzhou Ruifan Technology Co ltd
Priority to CN202211388507.XA priority Critical patent/CN115617809B/en
Publication of CN115617809A publication Critical patent/CN115617809A/en
Application granted granted Critical
Publication of CN115617809B publication Critical patent/CN115617809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof

Abstract

The application provides a database uniqueness constraint processing method, apparatus, device, and medium. In the method, in-table deduplication is first performed on the data table to be inserted according to the unique key of that table. For the deduplicated table, target data to be inserted is then determined according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and the target data to be inserted is inserted into the stored data table. Because the in-table deduplication and the subsequent deduplication against the preset bloom filter array and sparse index generate little temporary data, the memory occupancy rate is effectively reduced and the deduplication efficiency is improved.

Description

Database uniqueness constraint processing method, device, equipment and medium
Technical Field
The present application relates to the field of databases, and in particular, to a method, an apparatus, a device, and a medium for processing uniqueness constraints of a database.
Background
With the rapid development of science and technology, the amount of data generated in various fields has grown rapidly, and how to process big data more effectively has become a widespread concern.
Among big-data processing problems, data deduplication at insertion time is particularly important. In the prior art, much database software implements a uniqueness constraint through hash indexes, B+ trees, or the like to achieve deduplication. For example, with a hash index, the corresponding unique key is extracted for each piece of data to be inserted, the corresponding unique key value is determined, the unique key value is transformed by a common hash function into the hash value corresponding to that piece of data, that is, a memory location, and it is then determined whether that memory location already stores data. If no data is stored there, the data is not a duplicate and is inserted into the data table. If data is already stored there, the data to be inserted is a duplicate and is deleted.
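As a rough illustration of the hash-index approach described above (a minimal sketch only, not code from the patent; the dict-based "memory location" and all names are assumptions, and collision handling is omitted):

    # Sketch of hash-index deduplication at insert time (illustrative only).
    # The "memory location" is modeled as a dict keyed by the hash of the unique key value.
    def insert_with_hash_index(table: list, index: dict, row: dict, unique_key: str) -> bool:
        key_value = row[unique_key]            # extract the unique key value
        location = hash(key_value)             # common hash function -> memory location
        if location in index:                  # location already holds data -> duplicate
            return False                       # duplicate data is discarded
        index[location] = row                  # remember the occupied location
        table.append(row)                      # non-duplicate data is inserted
        return True

    table, index = [], {}
    insert_with_hash_index(table, index, {"id": "20220001", "name": "A1"}, "id")   # inserted
    insert_with_hash_index(table, index, {"id": "20220001", "name": "A1"}, "id")   # rejected as duplicate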
In summary, the existing database uniqueness constraint processing method needs to calculate hash values for all stored data to generate a hash table, which is kept in memory as temporary data. When the amount of stored data and the amount of data to be inserted are large, the amount of data in the in-memory hash table grows, resulting in high memory occupancy; meanwhile, as the fill rate of the hash table increases, the collision probability for new data inserted into the hash table increases, which further increases hash table read/write time and leads to low deduplication efficiency.
Disclosure of Invention
The embodiments of the application provide a database uniqueness constraint processing method, apparatus, device, and medium, which are used to solve the problems of the existing database uniqueness constraint processing method: hash values must be calculated for all stored data to generate a hash table that is kept in memory as temporary data, and when the amount of stored data and the amount of data to be inserted are large, the in-memory hash table grows, resulting in high memory occupancy; meanwhile, as the fill rate of the hash table increases, the collision probability for new data inserted into the hash table increases, which further increases hash table read/write time and leads to low deduplication efficiency.
In a first aspect, an embodiment of the present application provides a method for processing uniqueness constraints of a database, including:
performing in-table deduplication on the data table to be inserted according to the unique key in the data table to be inserted;
determining target data to be inserted in the data table after duplication removal according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of a stored data table;
inserting the target data to be inserted into the stored data table.
In a specific embodiment, the performing intra-table deduplication on the to-be-inserted data table according to a unique key in the to-be-inserted data table includes:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
In a specific embodiment, the determining, according to a preset number of different preset first hash functions, a preset bloom filter array, and a sparse index of a stored data table, target data to be inserted in a deduplicated data table, includes:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
In a specific embodiment, the determining, according to the preset number of different preset first hash functions and the preset bloom filter array, target data to be inserted in the deduplicated data table includes:
determining a unique key value corresponding to each piece of data to be inserted in the deduplicated data table according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key values and the preset number of different preset first hash functions;
and if any of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
In one embodiment, the method further comprises:
and updating the preset bloom filter array and the sparse index according to the stored data table.
In a second aspect, an embodiment of the present application provides a database uniqueness constraint processing apparatus, including:
a processing module to:
performing in-table duplicate removal on the data table to be inserted according to the unique key in the data table to be inserted;
determining target data to be inserted in the data table to be inserted after the duplication is removed according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of the stored data table;
and the storage module is used for inserting the target data to be inserted into the stored data table.
In a specific embodiment, the processing module is specifically configured to:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
In an embodiment, the processing module is further specifically configured to:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table to be inserted, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
In a specific embodiment, the processing module is further specifically configured to:
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key values and the preset number of different preset first hash functions;
and if any of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
In a specific embodiment, the processing module is further configured to:
and updating the preset bloom filter array and the sparse index according to the stored data table.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor, a memory, a communication interface;
the memory is used for storing executable instructions of the processor;
wherein the processor is configured to perform the database uniqueness constraint processing method of any one of the first aspects via execution of the executable instructions.
In a fourth aspect, an embodiment of the present application provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the database uniqueness constraint processing method according to any one of the first aspects.
In a fifth aspect, the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program is used to implement the database uniqueness constraint processing method described in any one of the first aspects.
According to the database uniqueness constraint processing method, apparatus, device, and medium provided by the application, in-table deduplication is first performed on the data table to be inserted according to the unique key of that table. For the deduplicated table, target data to be inserted is then determined according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and the target data to be inserted is inserted into the stored data table. Because the in-table deduplication and the subsequent deduplication against the preset bloom filter array and sparse index generate little temporary data, the memory occupancy rate is effectively reduced and the deduplication efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1a is a first schematic flowchart of a first embodiment of a database uniqueness constraint processing method provided by the present application;
fig. 1b is a schematic flowchart of a first embodiment of a database uniqueness constraint processing method provided by the present application;
fig. 2 is a flowchart illustrating a second embodiment of a database uniqueness constraint processing method provided by the present application;
fig. 3 is a schematic flowchart of a third embodiment of a database uniqueness constraint processing method provided in the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a database uniqueness constraint processing apparatus provided in the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by persons skilled in the art based on the embodiments in the present application in light of the present disclosure, are within the scope of protection of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the rapid development of software and hardware technologies, big data has become an important topic in the internet industry. The explosive growth of data has brought many problems into view, including how to deduplicate when a large amount of data is inserted into an already large data set.
Many databases implement a uniqueness constraint through hash indexes, B+ trees, or the like to achieve deduplication. For example, with a hash index, the unique key value of each piece of data to be inserted is determined, the hash value corresponding to that piece of data, that is, a memory location, is determined from the unique key value, and it is then checked whether that memory location already stores data. If no data is stored there, the data is not a duplicate and is inserted into the data table. If data is already stored there, the data to be inserted is a duplicate and is deleted.
In the prior art, hash values need to be calculated for all stored data to generate a hash table, which is kept in memory as temporary data. When the amount of stored data and the amount of data to be inserted are large, the in-memory hash table grows, resulting in high memory occupancy; and as the hash table fills with data, the collision probability for new data inserted into the hash table increases, which increases hash table query and write time and leads to low deduplication efficiency.
In view of the problems in the prior art, the inventor found, in the course of researching database uniqueness constraint processing methods, that the memory occupancy rate can be reduced as follows. A hash value is calculated according to the unique key in the data table to be inserted, and in-table deduplication is performed on the data table to be inserted; because this deduplication involves only the data to be inserted, the hash table occupies little memory. Target data to be inserted in the deduplicated data table is then determined according to a preset number of different preset first hash functions, a preset bloom filter array and the sparse index of the stored data table, and the target data to be inserted is inserted into the stored data table. Because the target data to be inserted is determined using the preset bloom filter array and the sparse index of the stored data table, little memory is occupied and the memory occupancy rate is effectively reduced. The database uniqueness constraint processing scheme of the present application is designed based on this inventive concept.
The execution subject of the database uniqueness constraint processing method in the present application may be a server, or a device capable of operating the database, such as a computer or a terminal device.
An application scenario of the database uniqueness constraint processing method provided by the present application is described below.
For example, in this application scenario, the server receives a batch of data to be inserted and stores it in a data table to be inserted of the database, and the user sets a unique key for the data table to be inserted.
The server then needs to deduplicate the batch of data to be inserted and insert it into the stored data table. First, duplicated data to be inserted is removed within the data table to be inserted according to the unique key in that table.
Target data to be inserted in the deduplicated data table is then determined according to a preset number of different preset first hash functions, a preset bloom filter array and the sparse index of the stored data table, and the target data to be inserted is inserted into the stored data table, so that the batch of data to be inserted is deduplicated and then inserted into the stored data table. The preset bloom filter array and the sparse index of the stored data table are then updated according to the stored data table.
It should be noted that the above scenario is only an illustration of an application scenario provided in the embodiment of the present application, and the embodiment of the present application does not limit actual forms of various devices included in the scenario, nor limits an interaction manner between the devices, and in a specific application of a scheme, the setting may be performed according to actual requirements.
Hereinafter, the technical means of the present application will be described in detail by specific examples. It should be noted that the following several specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1a is a first schematic flowchart of a first embodiment of the database uniqueness constraint processing method provided by the present application. This embodiment describes performing in-table deduplication on the data table to be inserted, determining the target data to be inserted according to a preset number of different preset first hash functions, a preset bloom filter array and the sparse index of the stored data table, and then inserting the target data to be inserted into the stored data table. The method in this embodiment may be implemented by software, hardware, or a combination of software and hardware. As shown in fig. 1a, the database uniqueness constraint processing method specifically includes the following steps:
S101: Perform in-table deduplication on the data table to be inserted according to the unique key in the data table to be inserted. After receiving the data to be inserted, the server stores it in a data table to be inserted of the database, and the user sets a unique key for the data table to be inserted.
In this step, after the user sets the Unique Key, the Unique Key is directly stored in the metadata of the data table to be inserted in the form of Unique Key, and the server can perform in-table deduplication on the data table to be inserted according to the Unique Key in the data table to be inserted.
The server determines the hash value of each piece of data to be inserted according to the unique key in the data table to be inserted; if the same hash value appears more than once, duplicate data exists, so the data to be inserted corresponding to the same hash value is deduplicated and only one piece is retained.
It should be noted that the server may also first partition the data to be inserted according to the partition key of the data table to be inserted, and then perform intra-partition deduplication on the data to be inserted in each partition according to the unique key of the data table to be inserted.
Illustratively, table 1 is a table of data to be inserted provided in the embodiments of the present application.
TABLE 1
Serial No.  Student No.  Name  Grade  Height  Weight  Score
1 20220001 A1 1 120 30 130
2 20220002 A2 1 110 35 140
3 20220003 A3 1 120 35 160
4 20220004 A4 2 125 40 155
5 20220004 A4 2 125 40 155
6 20220005 A5 2 130 45 170
7 20220006 A6 3 140 45 260
8 20220007 A7 3 150 50 265
9 20220008 A8 3 155 55 245
10 20220008 A8 3 155 55 245
11 20220009 A9 4 160 55 270
12 20220010 A10 4 170 60 280
In table 1, the 4th and 5th rows are duplicates, as are the 9th and 10th rows; the partition key is the grade and the unique key is the student number. After partitioning according to the partition key, the data is divided into 4 partitions: the first partition contains rows 1 to 3, the second partition rows 4 to 6, the third partition rows 7 to 10, and the fourth partition rows 11 to 12.
It should be noted that the above example is only an example of a data table to be inserted, and the embodiment of the present application does not limit the data table to be inserted, data to be inserted in the data table, a partition key, a unique key, and the like, and may be determined according to an actual situation.
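To make the partition-then-deduplicate flow concrete, the following is a minimal sketch (column names follow Table 1; the dict-based rows and the function structure are assumptions, not part of the patent) that partitions rows by the grade partition key and deduplicates each partition by the student number unique key:

    from collections import defaultdict

    # Rows shaped like Table 1; "grade" is the partition key, "student_no" the unique key.
    rows = [
        {"student_no": "20220004", "grade": 2},
        {"student_no": "20220004", "grade": 2},   # duplicate within grade 2
        {"student_no": "20220005", "grade": 2},
        {"student_no": "20220008", "grade": 3},
    ]

    # Partition the data to be inserted by the partition key.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["grade"]].append(row)

    # Deduplicate within each partition by the unique key, keeping the first occurrence.
    for grade, part in partitions.items():
        seen, kept = set(), []
        for row in part:
            if row["student_no"] not in seen:
                seen.add(row["student_no"])
                kept.append(row)
        partitions[grade] = kept

    print({grade: len(part) for grade, part in partitions.items()})   # {2: 2, 3: 1}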
S102: and determining target data to be inserted in the data table after duplication removal according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of the stored data table.
In this step, after the server performs in-table deduplication on the data table to be inserted, in order to ensure that the data to be inserted does not duplicate data in the stored data table, a part of the target data to be inserted is first determined according to a preset number of different preset first hash functions and the preset bloom filter array; then, for the data to be inserted that could not be confirmed as target data, the sparse index of the stored data table is used to judge whether it duplicates data in the stored data table, and the non-duplicated data to be inserted is also taken as target data to be inserted.
It should be noted that, after the data is partitioned and intra-partition deduplication is performed, each partition needs to be deduplicated and inserted in sequence, and this sequential processing can be realized by adding a write lock. For each partition that has undergone intra-partition deduplication, a part of the target data to be inserted is determined according to a preset number of different preset first hash functions and the preset bloom filter array; whether the remaining data to be inserted duplicates data in the stored data table is then determined according to the sparse index of the stored data table, and the non-duplicated data to be inserted is taken as target data to be inserted.
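The sequential, write-locked processing of partitions mentioned above can be sketched as follows (an assumed structure for illustration; the filtering and index-refresh steps are passed in as callables and are not the patent's implementation):

    import threading
    from typing import Callable, Iterable

    write_lock = threading.Lock()

    def process_partitions(
        partitions: Iterable[list],
        stored_table: list,
        pick_targets: Callable[[list], list],      # bloom-filter + sparse-index filtering step
        refresh_indexes: Callable[[list], None],   # rebuild bloom filter array and sparse index
    ) -> None:
        # The write lock keeps insertions strictly sequential, one partition at a time.
        for part in partitions:
            with write_lock:
                stored_table.extend(pick_targets(part))   # insert target data of this partition
                refresh_indexes(stored_table)             # update before the next partition

    # Example with trivial stand-ins for the filtering and refresh steps:
    stored = []
    process_partitions([[1, 2], [3]], stored, pick_targets=lambda p: p, refresh_indexes=lambda t: None)
    print(stored)   # [1, 2, 3]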
S103: and inserting the target data to be inserted into the stored data table.
In this step, after determining the target data to be inserted in the data table after the duplication is removed, the server inserts the target data to be inserted into the stored data table.
It should be noted that, after the server inserts the target data to be inserted into the stored data table, the preset bloom filter array and the sparse index also need to be updated, so that deduplication is performed when data is inserted next time.
It should be noted that the bloom filter array is updated by computing, for each piece of data in the stored data table, the preset number of different preset first hash functions and setting the value at each bloom filter array position corresponding to an obtained hash value to 1. The preset number may be 5, 7, or 15. The preset first hash function may be a Message Digest 4 (MD4) function, a Message Digest 5 (MD5) function, a Secure Hash Algorithm 1 (SHA-1) function, a CityHash64 function, or the like. The embodiment of the application does not limit the preset number or the preset first hash function, which may be set according to actual conditions.
It should be noted that the sparse index is updated by sorting the data in the stored data table and then reestablishing the sparse index.
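A minimal sketch of these two update steps (the array size, the salted SHA-1 hash family and the block size are assumptions made for illustration; the patent does not prescribe them):

    import hashlib

    M = 1 << 20          # assumed size of the preset bloom filter array
    K = 5                # preset number of different first hash functions
    BLOCK = 1024         # assumed sparse-index granularity (rows per index entry)

    def bloom_positions(key_value: str) -> list:
        # K different hash functions, modeled here as salted SHA-1.
        return [int(hashlib.sha1(f"{i}:{key_value}".encode()).hexdigest(), 16) % M
                for i in range(K)]

    def rebuild_bloom(stored_key_values: list) -> bytearray:
        # Set to 1 every array position produced by each stored unique key value.
        bloom = bytearray(M)
        for key_value in stored_key_values:
            for pos in bloom_positions(key_value):
                bloom[pos] = 1
        return bloom

    def rebuild_sparse_index(stored_key_values: list) -> list:
        # Sort the stored data, then keep one (key value, offset) entry per block.
        stored_key_values.sort()
        return [(stored_key_values[i], i) for i in range(0, len(stored_key_values), BLOCK)]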
It should be noted that, if the data to be inserted is partitioned, and then the data to be inserted in the partition is deduplicated in the partition, and then the target data to be inserted in the partition is determined, the target data to be inserted needs to be inserted into the stored data table, the preset bloom filter array and the sparse index are updated, and then the data to be inserted in the next partition is processed until all partitions are processed.
It should be noted that the present solution can use Single Instruction Multiple Data (SIMD for short) to improve efficiency.
For example, with a stored data table of 400 GB having 100 columns, 100 partitions, and 100 million rows, the scheme inserts 1 million rows of data to be inserted spanning 100 partitions, with a data duplication rate of 0.01% and a size of 4 GB, in 100.01 seconds with a correct result. By contrast, MySQL takes more than 2 hours to complete the same test.
It should be noted that, in the present solution, if a single point of failure occurs in the server when data is inserted, the data can be reinserted after the server is restarted, and the correctness of the data can still be ensured.
Exemplarily, fig. 1b is a schematic flowchart of the first embodiment of the database uniqueness constraint processing method provided by the present application. As shown in fig. 1b, the server partitions the data to be inserted in the data table into M partitions and performs intra-partition deduplication for each partition; it then processes each partition in turn, determining the target data to be inserted in the partition according to a preset number of different preset first hash functions, the preset bloom filter array and the sparse index of the stored data table, that is, removing data duplicated in the stored data table, and inserting the rest into the stored data table. This improves deduplication efficiency and reduces memory occupancy.
According to the database uniqueness constraint processing method provided by this embodiment, in-table deduplication is performed on the data table to be inserted, target data to be inserted in each partition is determined according to a preset number of different preset first hash functions, a preset bloom filter array and the sparse index of the stored data table, and the target data to be inserted is inserted into the stored data table. This realizes database uniqueness constraint processing: the inserted data contain no duplicates among themselves and do not duplicate the stored data, so the uniqueness of the data is guaranteed. Compared with the prior art, which calculates hash values for all stored data and keeps the resulting hash table in memory, the in-table deduplication involves no stored data, and the subsequent deduplication uses the preset bloom filter array and the sparse index of the stored data table, which occupy little memory, so the memory occupancy rate is effectively reduced. Meanwhile, exact deduplication is achieved and deduplication efficiency is also improved.
Fig. 2 is a flowchart of a second embodiment of the database uniqueness constraint processing method provided by the present application. On the basis of the above embodiment, this embodiment describes how the server determines the hash value of each piece of data to be inserted according to the unique key in the data table to be inserted and then performs deduplication within the table (or partition) according to the hash value. As shown in fig. 2, the database uniqueness constraint processing method specifically includes the following steps:
S201: For each piece of data to be inserted in the data table to be inserted, determine in turn the unique key value corresponding to the data to be inserted according to the unique key in the data table to be inserted.
In this step, after the user sets the unique key of the data table to be inserted, the server may obtain the unique key, and then determine, for each data to be inserted in the data table to be inserted, the unique key value corresponding to the data to be inserted in sequence according to the unique key in the data table to be inserted.
Illustratively, table 2 is a table of data to be inserted provided in the embodiments of the present application.
TABLE 2
Serial No.  Student No.  Name  Grade  Height  Weight  Score
1 20220006 A6 3 140 45 260
2 20220007 A7 3 150 50 265
3 20220008 A8 3 155 55 245
4 20220008 A8 3 155 55 245
Because the unique key is the student number, the unique key value corresponding to the data to be inserted with serial number 1 is 20220006, the unique key value for serial number 2 is 20220007, the unique key value for serial number 3 is 20220008, and the unique key value for serial number 4 is 20220008.
It should be noted that the above example is only an example of data to be inserted, and the embodiment of the present application does not limit the data to be inserted, the unique key value, and the like, and may be determined according to an actual situation.
S202: and determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function.
In this step, after determining the unique key value corresponding to the data to be inserted, the server needs to determine the hash value corresponding to the data to be inserted according to the unique key value in order to perform deduplication processing.
It should be noted that the preset second hash function may be a Message Digest 4 (MD4) function, a Message Digest 5 (MD5) function, a Secure Hash Algorithm 1 (SHA-1) function, a CityHash64 function, or the like. The preset second hash function is not limited here and can be set according to actual conditions.
It should be noted that, if multiple unique keys exist in the data table to be inserted, key values corresponding to the multiple unique keys may be merged, and then the hash value is determined.
S203: judging whether the hash value is stored in a preset hash value table or not; if the hash value is stored in the preset hash value table, executing step S204; if the hash value is not stored in the preset hash value table, step S205 is executed.
S204: and deleting the data to be inserted.
In the above steps, after obtaining the hash value corresponding to the data to be inserted, the server compares the hash value with the hash value in the preset hash value table, and judges whether the hash value is stored in the preset hash value table; if the hash value is stored in the preset hash value table, it indicates that the data to be inserted is the repeated data, and the repeated data needs to be deleted.
It should be noted that, if the data to be inserted is the first data processed in the partition, when determining whether the hash value is already stored in the preset hash value table, the preset hash value table does not store any data.
S205: and reserving the data to be inserted, and storing the hash value into a preset hash value table.
In this step, if the hash value is not stored in the preset hash value table, it indicates that the data to be inserted is not duplicated data, the data to be inserted needs to be retained, and the hash value is stored in the preset hash value table, so as to process the next data to be inserted.
It should be noted that, after all the data to be inserted in the partition is processed, the preset hash value table is cleared.
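Putting S201 to S205 together, the in-table (or intra-partition) deduplication loop can be sketched as follows (SHA-1 stands in for the preset second hash function and the dict rows are assumptions; per the note above, multiple unique keys are merged by concatenation before hashing):

    import hashlib

    def dedup_within_partition(rows: list, unique_keys: list) -> list:
        hash_table = set()                      # the preset hash value table
        kept = []
        for row in rows:
            # Merge the values of all unique keys, then apply the preset second hash function.
            key_value = "|".join(str(row[k]) for k in unique_keys)
            h = hashlib.sha1(key_value.encode()).hexdigest()
            if h in hash_table:                 # hash already stored -> duplicate, delete row
                continue
            hash_table.add(h)                   # store the hash and retain the row
            kept.append(row)
        return kept                             # the hash value table is discarded afterwards

    # Example with the rows of Table 2 (student number as the unique key):
    rows = [{"student_no": "20220006"}, {"student_no": "20220007"},
            {"student_no": "20220008"}, {"student_no": "20220008"}]
    print(len(dedup_within_partition(rows, ["student_no"])))   # 3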
In the uniqueness constraint processing method for the database provided by this embodiment, the hash value of each piece of data to be inserted is calculated, and whether the hash value is the same is determined, that is, whether the hash value is stored in the preset hash value table is determined, so that deduplication in the table is realized, deduplication efficiency is effectively improved, and the memory occupancy rate is also reduced.
Fig. 3 is a flowchart of a third embodiment of the database uniqueness constraint processing method provided by the present application. On the basis of the above embodiments, this embodiment describes how, after in-table deduplication, the server determines the target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions, a preset bloom filter array and the sparse index of the stored data table. As shown in fig. 3, the database uniqueness constraint processing method specifically includes the following steps:
S301: Determine target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions and a preset bloom filter array.
In this step, after performing in-table deduplication, the server determines target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions and a preset bloom filter array.
Specifically, for each piece of data to be inserted in the deduplicated data table, the unique key value corresponding to the data to be inserted is determined according to the unique key in the data table to be inserted; a preset number of hash values is then determined according to the unique key value and the preset number of different preset first hash functions; and it is judged whether any of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array. If so, the data to be inserted is taken as target data to be inserted. If the corresponding values are all 1, it cannot be determined whether the data to be inserted is duplicated, and further judgment is needed.
Specifically, the target data to be inserted in the deduplicated data table may also be determined as follows: for each piece of data to be inserted, the corresponding unique key value is determined according to the unique key in the data table to be inserted; a hash value is determined according to the unique key value and one preset first hash function; and it is judged whether the value at the corresponding position of the preset bloom filter array is 0. If it is 0, the data to be inserted is not duplicated and is taken as target data to be inserted. If it is 1, it cannot yet be determined whether the data is duplicated, so the hash value given by the next preset first hash function is checked in the same way, until a corresponding value of 0 is found or all of the preset number of preset first hash functions have been used. If the values corresponding to the hash values obtained with all of the preset first hash functions are 1, it cannot be determined whether the data to be inserted is duplicated, and further judgment is needed.
The preset number may be 5, 7, or 15. The preset first hash function may be a Message Digest 4 (MD4) function, a Message Digest 5 (MD5) function, a Secure Hash Algorithm 1 (SHA-1) function, a CityHash64 function, or the like. The embodiment of the application does not limit the preset number or the preset first hash function, which may be set according to actual conditions.
It should be noted that, if there is no data in the stored data table, the value of each position in the corresponding preset bloom filter array is 0.
It should be noted that, if there are multiple unique keys to be inserted into the data table, key values corresponding to the multiple unique keys may be merged, and then the hash value is determined.
It should be noted that, when the target data to be inserted is determined using the preset first hash functions and the preset bloom filter array, the space complexity is O(K × N), where K is the preset number and N is the size of the preset bloom filter array, and the time complexity is O(K × N'), where N' is the amount of data to be inserted.
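The bloom filter check of S301 can be sketched as follows (the array size and the salted SHA-1 hash family are assumptions; the lazy evaluation reproduces the short-circuit variant described above, which stops at the first position holding 0):

    import hashlib

    M, K = 1 << 20, 5                     # assumed bloom array size and preset number

    def definitely_new(bloom: bytearray, key_value: str) -> bool:
        # True as soon as one of the K positions holds 0: the key is certainly not stored,
        # so the row is target data to be inserted. If all K positions hold 1, the filter
        # cannot decide and the sparse-index check that follows is needed.
        return any(
            bloom[int(hashlib.sha1(f"{i}:{key_value}".encode()).hexdigest(), 16) % M] == 0
            for i in range(K)             # evaluated lazily, stops at the first 0
        )

    bloom = bytearray(M)                  # empty stored table -> every position is 0
    print(definitely_new(bloom, "20220006"))   # True: may be inserted without further checks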
S302: for each data to be inserted except the target data to be inserted in the data table to be inserted after the duplication removal, judging whether the data to be inserted is stored in the stored data table or not according to the sparse index; if the data to be inserted is not stored in the stored data table, step S303 is executed; if the data to be inserted is already stored in the stored data table, step S304 is executed.
S303: and taking the data to be inserted as target data to be inserted.
In this step, after the server has determined target data to be inserted in the deduplicated data table according to the preset first hash functions and the preset bloom filter array, it must further check the remaining data to be inserted for duplication. For each piece of data to be inserted in the deduplicated data table other than the target data to be inserted, the server judges according to the sparse index whether the data is already stored in the stored data table.
The server calculates a position index for the data to be inserted, determines the target sparse index entry corresponding to that position index from the sparse index of the stored data table, reads the data corresponding to the target sparse index entry in the stored data table, and then judges whether the data to be inserted is the same as that data, that is, whether the data to be inserted is already stored in the stored data table.
And if the data to be inserted is not stored in the stored data table, indicating that the data to be inserted is not repeated data, and taking the data to be inserted as target data to be inserted.
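The sparse-index check for rows the bloom filter could not decide can be sketched as follows (the block layout and the binary search over sorted block boundaries are assumptions; the patent only requires locating the candidate position and comparing):

    import bisect

    def stored_in_table(sorted_keys: list, sparse_index: list, key_value: str) -> bool:
        # sparse_index holds (first key of block, offset) pairs over the sorted stored keys.
        block_keys = [entry[0] for entry in sparse_index]
        i = bisect.bisect_right(block_keys, key_value) - 1   # target sparse index entry
        if i < 0:
            return False                                      # smaller than every stored key
        start = sparse_index[i][1]
        end = sparse_index[i + 1][1] if i + 1 < len(sparse_index) else len(sorted_keys)
        return key_value in sorted_keys[start:end]            # duplicate iff found in the block

    stored = ["20220001", "20220002", "20220003", "20220004"]   # sorted stored unique keys
    index = [("20220001", 0), ("20220003", 2)]                  # one entry per 2 rows
    print(stored_in_table(stored, index, "20220004"))   # True  -> delete the row
    print(stored_in_table(stored, index, "20220009"))   # False -> target data to be inserted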
S304: and deleting the data to be inserted.
In this step, if the data to be inserted is already stored in the stored data table, it is indicated that the data to be inserted is duplicated data, and the data to be inserted is deleted.
It should be noted that, after determining target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions and a preset bloom filter array, if all the data to be inserted in the deduplicated data table are the target data to be inserted, steps S302 to S304 do not need to be executed.
According to the database uniqueness constraint processing method provided by this embodiment, part of the target data to be inserted in the deduplicated data table is determined according to the preset number of different preset first hash functions and the preset bloom filter array, and for the data to be inserted whose duplication the bloom filter cannot decide, the sparse index is then used to determine whether it is duplicated. This effectively improves deduplication accuracy, improves efficiency, and reduces the memory occupancy rate.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
FIG. 4 is a schematic structural diagram of an embodiment of a database uniqueness constraint processing apparatus provided in the present application; as shown in fig. 4, the database uniqueness constraint processing device 40 includes:
a processing module 41 configured to:
performing in-table duplicate removal on the data table to be inserted according to the unique key in the data table to be inserted;
determining target data to be inserted in the data table after duplication removal according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of a stored data table;
a storage module 42, configured to insert the target data to be inserted into the stored data table.
Further, the processing module 41 is specifically configured to:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted are reserved, and the hash value is stored in the preset hash value table.
Further, the processing module 41 is specifically configured to:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
Further, the processing module 41 is specifically configured to:
determining a unique key value corresponding to each piece of data to be inserted in the deduplicated data table according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key values and the preset number of different preset first hash functions;
and if any of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
Further, the processing module 41 is further configured to update the preset bloom filter array and the sparse index according to the stored data table.
The database uniqueness constraint processing device provided in this embodiment is used for executing the technical solution in any of the foregoing method embodiments, and the implementation principle and technical effect thereof are similar and will not be described herein again.
Fig. 5 is a schematic structural diagram of an electronic device provided in the present application. As shown in fig. 5, the electronic device 50 includes:
a processor 51, a memory 52, and a communication interface 53;
the memory 52 is used for storing executable instructions of the processor 51;
wherein the processor 51 is configured to execute the technical solution in any of the foregoing method embodiments via executing the executable instructions.
Alternatively, the memory 52 may be separate or integrated with the processor 51.
Optionally, when the memory 52 is a device independent from the processor 51, the electronic device 50 may further include:
the bus 54, the memory 52 and the communication interface 53 are connected to the processor 51 through the bus 54 and perform communication with each other, and the communication interface 53 is used for communication with other devices.
Alternatively, the communication interface 53 may be implemented by a transceiver. The communication interface is used for realizing communication between the database access device and other devices (such as a client, a read-write library, and a read-only library). The memory may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk storage.
The bus 54 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.
The processor may be a general-purpose processor, including a central processing unit CPU, a Network Processor (NP), and the like; but also a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components.
The electronic device is configured to execute the technical solution in any one of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the present application further provides a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the technical solutions provided by any of the foregoing method embodiments.
The embodiment of the present application further provides a computer program product, which includes a computer program, and the computer program is used for implementing the technical solution provided by any of the foregoing method embodiments when being executed by a processor.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A database uniqueness constraint processing method is characterized by comprising the following steps:
performing in-table deduplication on the data table to be inserted according to the unique key in the data table to be inserted;
determining target data to be inserted in the data table after duplication removal according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of a stored data table;
inserting the target data to be inserted into the stored data table.
2. The method according to claim 1, wherein the performing intra-table deduplication on the to-be-inserted data table according to a unique key in the to-be-inserted data table comprises:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted are reserved, and the hash value is stored in the preset hash value table.
3. The method according to claim 1, wherein the determining the target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions, a preset bloom filter array and the sparse index of the stored data table comprises:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different preset first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table to be inserted, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted are judged to be stored in the stored data table, deleting the data to be inserted.
4. The method according to claim 3, wherein the determining the target data to be inserted in the deduplicated data table according to the preset number of different preset first hash functions and the preset bloom filter array comprises:
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key value and the preset number of different first hash functions;
and if any of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
5. The method of claim 1, further comprising:
and updating the preset bloom filter array and the sparse index according to the stored data table.
6. A database uniqueness constraint processing apparatus, comprising:
a processing module to:
performing in-table deduplication on the data table to be inserted according to the unique key in the data table to be inserted;
determining target data to be inserted in the data table after duplication removal according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of a stored data table;
and the storage module is used for inserting the target data to be inserted into the stored data table.
7. The apparatus of claim 6, wherein the processing module is specifically configured to:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted are reserved, and the hash value is stored in the preset hash value table.
8. The apparatus of claim 6, wherein the processing module is further specifically configured to:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
9. The apparatus of claim 8, wherein the processing module is further specifically configured to:
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key values and the preset number of different preset first hash functions;
and if any of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
10. The apparatus of claim 6, wherein the processing module is further configured to:
and updating the preset bloom filter array and the sparse index according to the stored data table.
11. An electronic device, comprising:
a processor, a memory, a communication interface;
the memory is used for storing executable instructions of the processor;
wherein the processor is configured to perform the database uniqueness constraint processing method of any one of claims 1 to 5 via execution of the executable instructions.
12. A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the database uniqueness constraint processing method of any one of claims 1 to 5.
13. A computer program product comprising a computer program which, when executed by a processor, is adapted to implement the database uniqueness constraint processing method of any one of claims 1 to 5.
CN202211388507.XA 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium Active CN115617809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211388507.XA CN115617809B (en) 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211388507.XA CN115617809B (en) 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115617809A true CN115617809A (en) 2023-01-17
CN115617809B CN115617809B (en) 2023-03-21

Family

ID=84877545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211388507.XA Active CN115617809B (en) 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115617809B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN108241615A (en) * 2016-12-23 2018-07-03 中国电信股份有限公司 Data duplicate removal method and device
CN111274212A (en) * 2020-01-20 2020-06-12 暨南大学 Cold and hot index identification and classification management method in data deduplication system
US20200319810A1 (en) * 2019-04-04 2020-10-08 Netapp Inc. Deduplication of encrypted data within a remote data store
CN114138786A (en) * 2021-12-01 2022-03-04 中国建设银行股份有限公司 Method, device, medium, product and equipment for duplicate removal of online transaction message


Also Published As

Publication number Publication date
CN115617809B (en) 2023-03-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant