CN115617809B - Database uniqueness constraint processing method, device, equipment and medium


Info

Publication number
CN115617809B
CN115617809B (application number CN202211388507.XA)
Authority
CN
China
Prior art keywords
data
preset
data table
stored
hash
Prior art date
Legal status
Active
Application number
CN202211388507.XA
Other languages
Chinese (zh)
Other versions
CN115617809A (en)
Inventor
李求实
林泽昕
Current Assignee
Guangzhou Ruifan Technology Co ltd
Original Assignee
Guangzhou Ruifan Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Ruifan Technology Co ltd filed Critical Guangzhou Ruifan Technology Co ltd
Priority to CN202211388507.XA
Publication of CN115617809A
Application granted
Publication of CN115617809B
Legal status: Active (granted)


Classifications

    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2255 Hash tables
    • G06F16/2282 Tablespace storage structures; Management thereof


Abstract

The application provides a database uniqueness constraint processing method, apparatus, device, and medium. According to the method, in-table deduplication is first performed on the data table to be inserted, according to the unique key of that table. For the deduplicated table, the target data to be inserted is then determined according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and the target data is inserted into the stored data table. Because the scheme deduplicates the data table to be inserted within itself and then performs a second deduplication pass using only the preset bloom filter array and the sparse index before inserting into the stored data table, little temporary data is kept, memory usage is effectively reduced, and deduplication efficiency is improved.

Description

Database uniqueness constraint processing method, device, equipment and medium
Technical Field
The present application relates to the field of databases, and in particular, to a method, an apparatus, a device, and a medium for processing uniqueness constraints of a database.
Background
With the rapid development of science and technology, the amount of data generated in various fields has grown rapidly, and how to process big data more effectively has become a widespread concern.
Among the problems of big data processing, deduplication at data-insertion time is particularly important. In the prior art, much database software implements uniqueness constraints, and thus deduplication, through hash indexes, B+ trees, or similar structures. With a hash index, for example, the corresponding unique key is extracted for each piece of data to be inserted and its unique key value is determined; the unique key value is transformed by a public hash function into the hash value of that piece of data, that is, a storage location, and whether that location already stores data is then checked. If no data is stored there, the piece of data is not a duplicate and is inserted into the data table. If data is already stored there, the piece of data is duplicated data and is deleted.
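For illustration only, the prior-art hash-index check described above can be sketched as follows (a simplified Python sketch, not the implementation of any particular database; the function and variable names are made up, and Python's built-in hash() stands in for the public hash function):

```python
# Minimal sketch of the prior-art hash-index uniqueness check described above.
def insert_with_hash_index(stored_rows, hash_index, row, unique_key):
    key_value = row[unique_key]            # extract the unique key value of this row
    location = hash(key_value)             # hash the key value into a storage location
    if location in hash_index and hash_index[location] == key_value:
        return False                       # the location already holds this key: duplicate, drop it
    hash_index[location] = key_value       # record which key occupies the location
    stored_rows.append(row)                # not a duplicate: insert the row
    return True
```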
In summary, the existing database uniqueness constraint processing methods must compute the hash values of all stored data and build a hash table that is kept in memory as temporary data. When the stored data and the data to be inserted are both large, the hash table in memory grows, resulting in high memory usage; moreover, as the fill rate of the hash table increases, the collision probability for newly inserted entries rises, reads and writes to the hash table take longer, and deduplication efficiency is low.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a device, and a medium for processing database uniqueness constraints, which address the above problems of the existing approach: the hash values of all stored data must be computed and kept in memory as a temporary hash table, so memory usage is high when the stored data and the data to be inserted are large; and as the fill rate of the hash table increases, collisions on insertion and hash table read/write time both increase, so deduplication efficiency is low.
In a first aspect, an embodiment of the present application provides a database uniqueness constraint processing method, including:
performing in-table deduplication on the data table to be inserted according to the unique key of the data table to be inserted;
determining target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions, a preset bloom filter array, and a sparse index of a stored data table;
inserting the target data to be inserted into the stored data table.
In a specific embodiment, the performing in-table deduplication on the to-be-inserted data table according to a unique key in the to-be-inserted data table includes:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
In a specific embodiment, the determining, according to a preset number of different preset first hash functions, a preset bloom filter array, and a sparse index of a stored data table, target data to be inserted in a deduplicated data table to be inserted includes:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
In a specific embodiment, the determining, according to the preset number of different preset first hash functions and the preset bloom filter array, target data to be inserted in the deduplicated data table includes:
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key value and the preset number of different first hash functions;
and if at least one of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
In one embodiment, the method further comprises:
and updating the preset bloom filter array and the sparse index according to the stored data table.
In a second aspect, an embodiment of the present application provides a database uniqueness constraint processing apparatus, including:
a processing module to:
performing in-table deduplication on the data table to be inserted according to the unique key of the data table to be inserted;
determining target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions, a preset bloom filter array, and a sparse index of a stored data table;
and the storage module is used for inserting the target data to be inserted into the stored data table.
In a specific embodiment, the processing module is specifically configured to:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
In an embodiment, the processing module is further specifically configured to:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
In an embodiment, the processing module is further specifically configured to:
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key value and the preset number of different first hash functions;
and if at least one of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
In a specific embodiment, the processing module is further configured to:
and updating the preset bloom filter array and the sparse index according to the stored data table.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor, a memory, a communication interface;
the memory is used for storing executable instructions of the processor;
wherein the processor is configured to perform the database uniqueness constraint processing method of any one of the first aspect via execution of the executable instructions.
In a fourth aspect, an embodiment of the present application provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the database uniqueness constraint processing method described in any one of the first aspects.
In a fifth aspect, the present application provides a computer program product, including a computer program, when executed by a processor, the computer program is configured to implement the database uniqueness constraint processing method according to any one of the first aspects.
According to the database uniqueness constraint processing method, apparatus, device, and medium provided by the present application, in-table deduplication is first performed on the data table to be inserted according to its unique key. For the deduplicated table, the target data to be inserted is then determined according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and the target data is inserted into the stored data table. Because the in-table deduplication is performed first and the second deduplication pass relies only on the preset bloom filter array and the sparse index, little temporary data is kept, memory usage is effectively reduced, and deduplication efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1a is a first schematic flowchart of the first embodiment of the database uniqueness constraint processing method provided by the present application;
Fig. 1b is a second schematic flowchart of the first embodiment of the database uniqueness constraint processing method provided by the present application;
Fig. 2 is a schematic flowchart of the second embodiment of the database uniqueness constraint processing method provided by the present application;
Fig. 3 is a schematic flowchart of the third embodiment of the database uniqueness constraint processing method provided by the present application;
Fig. 4 is a schematic structural diagram of an embodiment of the database uniqueness constraint processing apparatus provided by the present application;
Fig. 5 is a schematic structural diagram of an electronic device provided by the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments that can be made by one skilled in the art based on the embodiments in the present application in light of the present disclosure are within the scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the preceding drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the rapid development of software and hardware technologies, big data has become an important topic in the internet industry. The explosive growth of data has brought many problems into view, including how to deduplicate when inserting a large amount of data into an already large data set.
Many databases have the function of removing duplicates by implementing uniqueness constraints through hash indexes or B + trees and the like. For example, for each piece of data to be inserted, a unique key value of the data is determined, a hash value corresponding to the piece of data, that is, a memory location, is determined according to the unique key value, and then whether the memory location stores the data is determined. If the data is not stored, the data is not repeated and is inserted into the data table. If the data is stored, the data is indicated to be repeated data, and the repeated data is deleted.
In the prior art, the hash values of all stored data need to be calculated to generate a hash table, the hash table is stored in a memory as temporary data, and when the data volume of the stored data and the data volume to be inserted are large, the data volume in the hash table in the memory is increased, which causes a problem of large memory occupancy rate; and along with the data filling of the hash table, the collision probability of new data inserted into the hash table is increased, so that the query and write time of the hash table is increased, and the deduplication efficiency is low.
To address the problems in the prior art, the inventors found, in the course of researching database uniqueness constraint processing methods, that memory usage can be reduced as follows. A hash value is computed from the unique key of the data table to be inserted and the table is deduplicated within itself; because this deduplication involves only the data to be inserted, the hash table it uses occupies little memory. The target data to be inserted in the deduplicated table is then determined according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and the target data is inserted into the stored data table. Since the target data is determined using only the preset bloom filter array and the sparse index of the stored data table, the memory footprint is small and memory usage is effectively reduced. The database uniqueness constraint processing scheme of the present application is designed on the basis of this inventive concept.
The execution subject of the database uniqueness constraint processing method in the present application may be a server, or another device capable of operating a database, such as a computer or a terminal device.
An application scenario of the database uniqueness constraint processing method provided by the present application is described below.
For example, in the application scenario, the server receives a batch of data to be inserted, stores the batch of data to be inserted in a data table to be inserted of the database, and sets a unique key for the data table to be inserted by the user.
The server needs to deduplicate this batch of data and then insert it into the stored data table. First, duplicated data within the data table to be inserted is removed according to the unique key of that table.
Then the target data to be inserted in the deduplicated data table is determined according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and the target data is inserted into the stored data table, completing the deduplication and insertion of the batch. Finally, the preset bloom filter array and the sparse index of the stored data table are updated according to the stored data table.
It should be noted that the foregoing scenario is only an illustration of an application scenario provided in the embodiment of the present application, and the embodiment of the present application does not limit actual forms of various devices included in the scenario, nor limits an interaction manner between the devices, and in a specific application of the solution, the scenario may be set according to actual requirements.
Hereinafter, the technical means of the present application will be described in detail by specific examples. It should be noted that the following several specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1a is a first schematic flowchart of the first embodiment of the database uniqueness constraint processing method provided by the present application. This embodiment describes performing in-table deduplication on the data table to be inserted, determining the target data to be inserted according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table, and then inserting the target data into the stored data table. The method in this embodiment may be implemented by software, hardware, or a combination of software and hardware. As shown in Fig. 1a, the database uniqueness constraint processing method specifically includes the following steps:
S101: perform in-table deduplication on the data table to be inserted according to the unique key of the data table to be inserted. After receiving the data to be inserted, the server stores it in a data table to be inserted in the database, and the user sets a unique key for that data table.
In this step, after the user sets the unique key, the unique key is stored directly in the metadata of the data table to be inserted, and the server can perform in-table deduplication on that table according to its unique key.
The server determines the hash value of each piece of data to be inserted according to the unique key of the data table to be inserted. If the same hash value appears more than once, duplicated data exists; the pieces of data sharing that hash value are deduplicated and only one of them is kept.
It should be noted that the server may also partition the data to be inserted according to the partition key of the data table to be inserted, and then perform intra-partition deduplication on the data in each partition according to the unique key of the data table to be inserted.
Illustratively, table 1 is a table of data to be inserted provided in the embodiments of the present application.
TABLE 1
No.  Student ID  Name  Grade  Height  Weight  Score
1 20220001 A1 1 120 30 130
2 20220002 A2 1 110 35 140
3 20220003 A3 1 120 35 160
4 20220004 A4 2 125 40 155
5 20220004 A4 2 125 40 155
6 20220005 A5 2 130 45 170
7 20220006 A6 3 140 45 260
8 20220007 A7 3 150 50 265
9 20220008 A8 3 155 55 245
10 20220008 A8 3 155 55 245
11 20220009 A9 4 160 55 270
12 20220010 A10 4 170 60 280
In Table 1, the 4th and 5th rows are duplicated data and the 9th and 10th rows are duplicated data; the partition key is Grade and the unique key is Student ID. After partitioning by the partition key there are 4 partitions: the first contains rows 1 to 3, the second rows 4 to 6, the third rows 7 to 10, and the fourth rows 11 and 12.
It should be noted that the above is only an example of a data table to be inserted; the embodiment of the present application does not limit the data table to be inserted, the data in it, the partition key, the unique key, and so on, which may be determined according to the actual situation.
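As a purely illustrative sketch of the partitioning and in-table deduplication described above (a Python sketch with field layout and names assumed from Table 1; this is not the patented implementation itself):

```python
from collections import defaultdict

# A few rows of Table 1 as (student_id, name, grade, height, weight, score).
rows = [
    ("20220004", "A4", 2, 125, 40, 155),
    ("20220004", "A4", 2, 125, 40, 155),   # duplicate of the row above
    ("20220005", "A5", 2, 130, 45, 170),
    ("20220008", "A8", 3, 155, 55, 245),
    ("20220008", "A8", 3, 155, 55, 245),   # duplicate
]

# Step 1: partition the data to be inserted by the partition key (grade).
partitions = defaultdict(list)
for row in rows:
    partitions[row[2]].append(row)

# Step 2: deduplicate within each partition by the unique key (student id),
# keeping only the first occurrence of each key value.
for grade, part in partitions.items():
    seen, deduped = set(), []
    for row in part:
        if row[0] not in seen:
            seen.add(row[0])
            deduped.append(row)
    partitions[grade] = deduped

print({grade: [r[0] for r in part] for grade, part in partitions.items()})
# {2: ['20220004', '20220005'], 3: ['20220008']}
```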
S102: and determining target data to be inserted in the data table after duplication removal according to a preset number of different preset first hash functions, a preset bloom filter array and sparse indexes of the stored data table.
In this step, after performing in-table deduplication on the data table to be inserted, the server must also ensure that the data to be inserted does not duplicate the data in the stored data table. It first determines part of the target data to be inserted according to the preset number of different preset first hash functions and the preset bloom filter array; for the remaining data, which cannot yet be confirmed as target data, it judges according to the sparse index of the stored data table whether each piece duplicates data already stored, and takes the non-duplicated pieces as target data to be inserted.
It should be noted that, after partitioning and intra-partition deduplication, each partition must be deduplicated against the stored table and inserted in turn; sequential processing may be enforced by adding a write lock. For each deduplicated partition, part of the target data to be inserted is determined according to the preset number of different preset first hash functions and the preset bloom filter array; for the rest, the sparse index of the stored data table is used to judge whether a piece duplicates stored data, and the non-duplicated pieces are taken as target data to be inserted.
S103: and inserting the target data to be inserted into the stored data table.
In this step, after determining the target data to be inserted in the data table after the duplication is removed, the server inserts the target data to be inserted into the stored data table.
It should be noted that, after the server inserts the target data to be inserted into the stored data table, the preset bloom filter array and the sparse index also need to be updated, so that deduplication is performed when data is inserted next time.
It should be noted that the bloom filter array is updated by computing, for each piece of data in the stored data table, the preset number of different preset first hash functions and setting the bloom filter array position corresponding to each resulting hash value to 1. The preset number may be, for example, 5, 7, or 15. The preset first hash functions may be MD4 (Message Digest 4), MD5 (Message Digest 5), SHA-1 (Secure Hash Algorithm 1), CityHash64, and the like. The embodiment of the present application does not limit the preset number or the preset hash functions, which may be set according to the actual situation.
It should be noted that the sparse index is updated by sorting the data in the stored data table and then reestablishing the sparse index.
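The two update steps just described can be illustrated with the following sketch (the array size, the number of hash functions, the block size of the sparse index, and the way the k different hash functions are derived from MD5 are all assumptions made for the example, not values fixed by the present application):

```python
import hashlib

BLOOM_SIZE = 1 << 20     # size N of the preset bloom filter array (placeholder value)
SPARSE_STEP = 1000       # one sparse-index entry per 1000 stored rows (placeholder value)

def bloom_positions(key_value, k=5):
    """Derive k different array positions from one unique key value
    (k plays the role of the preset number of first hash functions)."""
    return [int(hashlib.md5(f"{i}:{key_value}".encode()).hexdigest(), 16) % BLOOM_SIZE
            for i in range(k)]

def update_bloom(bloom, stored_rows, unique_key):
    """Set to 1 every bloom filter position hit by a stored row's unique key value."""
    for row in stored_rows:
        for pos in bloom_positions(row[unique_key]):
            bloom[pos] = 1

def rebuild_sparse_index(stored_rows, unique_key):
    """Sort the stored rows by the unique key and keep one (key value, row offset)
    entry per block, which is the re-established sparse index."""
    stored_rows.sort(key=lambda r: r[unique_key])
    return [(stored_rows[i][unique_key], i)
            for i in range(0, len(stored_rows), SPARSE_STEP)]
```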
It should be noted that, if the data to be inserted is partitioned and deduplicated within each partition before the target data of a partition is determined, then the target data of that partition must be inserted into the stored data table and the preset bloom filter array and sparse index updated before the data of the next partition is processed, and so on until all partitions have been processed.
It should be noted that the present solution can use Single Instruction Multiple Data (SIMD for short) to improve efficiency.
For example, with a stored data table of 400 GB containing 100 columns, 100 partitions, and 100 million rows, the scheme inserts 1 million rows of data to be inserted (100 partitions, a data duplication rate of 0.01%, and a size of 4 GB) in 100.01 seconds with a correct result. By contrast, MySQL takes more than 2 hours to complete the same test.
It should be noted that, in the present solution, if the server suffers a single point of failure while data is being inserted, the data can be re-inserted after the server restarts and correctness of the data is still guaranteed.
Exemplarily, Fig. 1b is a second schematic flowchart of the first embodiment of the database uniqueness constraint processing method provided by the present application. As shown in Fig. 1b, the server partitions the data to be inserted in the data table into M partitions and performs intra-partition deduplication for each partition. It then processes the partitions in turn: according to the preset number of different preset first hash functions, the preset bloom filter array, and the sparse index of the stored data table, it determines the target data to be inserted in the partition, that is, it removes the data that duplicates the stored data table, and inserts the rest into the stored data table. This improves deduplication efficiency and reduces memory usage.
In the database uniqueness constraint processing method provided by this embodiment, the data to be inserted in each partition is deduplicated within the table, the target data to be inserted in the partition is determined according to the preset number of different preset first hash functions, the preset bloom filter array, and the sparse index of the stored data table, and the target data is inserted into the stored data table. This realizes the uniqueness constraint of the database: the inserted data contain no duplicates among themselves and do not duplicate the stored data, so the uniqueness of the data is guaranteed. Compared with the prior art, which computes the hash values of all stored data and keeps the resulting hash table in memory, the in-table deduplication here involves no stored data, and the subsequent deduplication uses only the preset bloom filter array and the sparse index of the stored data table, so the memory footprint is small and memory usage is effectively reduced. At the same time, exact deduplication is achieved and deduplication efficiency is improved.
Fig. 2 is a schematic flowchart of the second embodiment of the database uniqueness constraint processing method provided by the present application. On the basis of the above embodiment, this embodiment describes how the server determines the hash value of each piece of data to be inserted according to the unique key of the data table to be inserted and then performs intra-partition deduplication according to those hash values. As shown in Fig. 2, the database uniqueness constraint processing method specifically includes the following steps:
s201: and for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to the unique key in the data table to be inserted in sequence.
In this step, after the user sets the unique key of the data table to be inserted, the server may obtain the unique key, and then determine, for each data to be inserted in the data table to be inserted, the unique key value corresponding to the data to be inserted in sequence according to the unique key in the data table to be inserted.
Illustratively, table 2 is a table of data to be inserted provided in the embodiments of the present application.
TABLE 2
No.  Student ID  Name  Grade  Height  Weight  Score
1 20220006 A6 3 140 45 260
2 20220007 A7 3 150 50 265
3 20220008 A8 3 155 55 245
4 20220008 A8 3 155 55 245
For the data to be inserted numbered 1, since the unique key is the student ID, the corresponding unique key value is 20220006; similarly, the unique key value for number 2 is 20220007, for number 3 is 20220008, and for number 4 is 20220008.
It should be noted that the above example is only an example of data to be inserted, and the embodiment of the present application does not limit the data to be inserted, the unique key value, and the like, and may be determined according to an actual situation.
S202: and determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function.
In this step, after determining the unique key value corresponding to the data to be inserted, the server needs to determine the hash value corresponding to the data to be inserted according to the unique key value in order to perform deduplication processing.
It should be noted that the preset second hash function may be MD4 (Message Digest 4), MD5 (Message Digest 5), SHA-1 (Secure Hash Algorithm 1), CityHash64, or the like. The preset second hash function is not limited and may be set according to the actual situation.
It should be noted that, if the data table to be inserted has multiple unique keys, the key values corresponding to the multiple unique keys may be merged before the hash value is determined.
S203: judging whether the hash value is stored in a preset hash value table or not; if the hash value is stored in the preset hash value table, executing step S204; if the hash value is not stored in the preset hash value table, step S205 is executed.
S204: and deleting the data to be inserted.
In the above steps, after obtaining the hash value corresponding to the data to be inserted, the server compares the hash value with the hash value in the preset hash value table, and judges whether the hash value is stored in the preset hash value table; if the hash value is stored in the preset hash value table, it indicates that the data to be inserted is the repeated data, and the repeated data needs to be deleted.
It should be noted that, if the data to be inserted is the first data processed in the partition, when determining whether the hash value is already stored in the preset hash value table, the preset hash value table does not store any data.
S205: and reserving the data to be inserted, and storing the hash value into a preset hash value table.
In this step, if the hash value is not stored in the preset hash value table, it indicates that the data to be inserted is not duplicated data, the data to be inserted needs to be retained, and the hash value is stored in the preset hash value table, so as to process the next data to be inserted.
It should be noted that, after all the data to be inserted in the partition are processed, the preset hash value table is cleared.
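For illustration, the in-table deduplication of steps S201 to S205 can be sketched as follows (a Python sketch under assumed names; MD5 stands in for the preset second hash function, and a set stands in for the preset hash value table):

```python
import hashlib

def intra_table_dedup(rows_to_insert, unique_keys):
    """Keep only the first row for each unique-key value within the batch to be inserted."""
    hash_value_table = set()     # the preset hash value table; holds hashes of this batch only
    kept = []
    for row in rows_to_insert:
        # merge the values of multiple unique keys, as noted above, then hash once
        key_value = "|".join(str(row[k]) for k in unique_keys)
        h = hashlib.md5(key_value.encode()).hexdigest()   # preset second hash function
        if h in hash_value_table:
            continue                     # hash already stored: duplicate, delete the row
        hash_value_table.add(h)          # hash not stored: remember it
        kept.append(row)                 # and keep the row
    hash_value_table.clear()             # cleared once the partition has been processed
    return kept
```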
In the database uniqueness constraint processing method provided by this embodiment, the hash value of each piece of data to be inserted is computed and compared against the preset hash value table, that is, whether the same hash value has already been stored is judged. This realizes in-table deduplication, effectively improves deduplication efficiency, and reduces memory usage.
Fig. 3 is a schematic flowchart of the third embodiment of the database uniqueness constraint processing method provided by the present application. On the basis of the above embodiments, this embodiment describes how, after in-table deduplication, the server determines the target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions, a preset bloom filter array, and the sparse index of the stored data table. As shown in Fig. 3, the database uniqueness constraint processing method specifically includes the following steps:
s301: and determining target data to be inserted in the data table after the duplication removal according to a preset number of different preset first hash functions and a preset bloom filter array.
In this step, after the in-table deduplication, the server determines the target data to be inserted in the deduplicated data table according to the preset number of different preset first hash functions and the preset bloom filter array.
Specifically, for each piece of data to be inserted in the deduplicated data table, the unique key value corresponding to that piece of data is determined according to the unique key of the data table to be inserted; a preset number of hash values is determined from the unique key value and the preset number of different preset first hash functions; and if at least one of those hash values corresponds to a value of 0 in the preset bloom filter array, the piece of data is taken as target data to be inserted. If all the corresponding values are 1, it cannot yet be determined whether the piece of data is a duplicate, and further judgment is needed.
Alternatively, the target data to be inserted in the deduplicated data table may be determined as follows: for each piece of data, the unique key value is determined according to the unique key of the data table to be inserted; a hash value is computed with one preset first hash function; if the corresponding value in the preset bloom filter array is 0, the piece of data is not a duplicate and is taken as target data to be inserted; if the corresponding value is 1, it cannot yet be determined whether the piece of data is a duplicate, so the next preset first hash function is used to compute another hash value and the check is repeated, until a corresponding value of 0 is found or all of the preset number of preset first hash functions have been used. If the values corresponding to all of the resulting hash values are 1, it still cannot be determined whether the piece of data is a duplicate, and further judgment is needed.
The preset number may be, for example, 5, 7, or 15. The preset first hash functions may be MD4 (Message Digest 4), MD5 (Message Digest 5), SHA-1 (Secure Hash Algorithm 1), CityHash64, and the like. The embodiment of the present application does not limit the preset number or the preset first hash functions, which may be set according to the actual situation.
It should be noted that, if there is no data in the stored data table, the numerical value of each position in the corresponding preset bloom filter array is 0.
It should be noted that, if the data table to be inserted has multiple unique keys, the key values corresponding to the multiple unique keys may be merged before the hash value is determined.
It should be noted that, when the preset number K of preset first hash functions and the preset bloom filter array are used to determine the target data to be inserted, the space complexity is O(K × N), where K is the preset number and N is the size of the preset bloom filter array, and the time complexity is O(K × N'), where N' is the amount of data to be inserted.
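A sketch of this screening step is given below (reusing the bloom_positions helper sketched earlier; the names and the value of k are assumptions made for the example, not part of the present application):

```python
def bloom_screen(rows, unique_keys, bloom, k=5):
    """Rows with at least one 0 among their k bloom positions are certainly not stored
    (target data to be inserted); rows whose positions are all 1 remain undecided and
    must still be checked against the sparse index."""
    target, undecided = [], []
    for row in rows:
        key_value = "|".join(str(row[key]) for key in unique_keys)
        positions = bloom_positions(key_value, k)      # k preset first hash functions
        if any(bloom[p] == 0 for p in positions):
            target.append(row)
        else:
            undecided.append(row)
    return target, undecided
```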
S302: for each data to be inserted except the target data to be inserted in the data table to be inserted after the duplication removal, judging whether the data to be inserted is stored in the stored data table or not according to the sparse index; if the data to be inserted is not stored in the stored data table, executing step S303; if the data to be inserted is already stored in the stored data table, step S304 is executed.
S303: and taking the data to be inserted as target data to be inserted.
In this step, after the server has determined target data to be inserted in the deduplicated data table according to the preset first hash functions and the preset bloom filter array, it must further determine whether each remaining piece of data to be inserted is duplicated data. For each piece of data to be inserted in the deduplicated data table other than the target data already identified, the server judges according to the sparse index whether that piece of data is stored in the stored data table.
The server calculates a position index of the data to be inserted, and then determines a target sparse index corresponding to the position index according to the position index and a sparse index of a stored data table; and then determining the data corresponding to the target sparse index in the stored data table, and then judging whether the data to be inserted is the same as the data corresponding to the target sparse index by the server, namely judging whether the data to be inserted is stored in the stored data table.
And if the data to be inserted is not stored in the stored data table, the data to be inserted is not repeated data, and the data to be inserted is used as target data to be inserted.
S304: and deleting the data to be inserted.
In this step, if the data to be inserted is already stored in the stored data table, it is indicated that the data to be inserted is duplicated data, and the data to be inserted is deleted.
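Steps S302 to S304 can be sketched as follows (assuming the stored data table is kept sorted by the merged unique key and that the sparse index holds one (key value, row offset) entry per block, as in the update sketch above; the names and block size are illustrative only):

```python
import bisect

def is_stored(row, unique_keys, sparse_index, stored_rows, block=1000):
    """Locate, via the sparse index, the block of the sorted stored table that could
    contain the row's unique key, then compare only within that block."""
    key_value = "|".join(str(row[k]) for k in unique_keys)
    index_keys = [entry[0] for entry in sparse_index]
    i = bisect.bisect_right(index_keys, key_value) - 1    # target sparse index entry
    if i < 0:
        return False                # the key precedes every indexed key: not stored
    start = sparse_index[i][1]
    for stored in stored_rows[start:start + block]:
        if "|".join(str(stored[k]) for k in unique_keys) == key_value:
            return True             # same data already stored: duplicate, delete it
    return False                    # not stored: take it as second target data to be inserted
```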
It should be noted that, after determining target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions and a preset bloom filter array, if all the data to be inserted in the deduplicated data table are the target data to be inserted, steps S302 to S304 do not need to be executed.
In the database uniqueness constraint processing method provided by this embodiment, part of the target data to be inserted in the deduplicated data table is first determined according to the preset number of different preset first hash functions and the preset bloom filter array, and the remaining data, which the bloom filter could not confirm, is then checked against the sparse index. This effectively improves the accuracy and efficiency of deduplication while reducing memory usage.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
FIG. 4 is a schematic structural diagram of an embodiment of a database uniqueness constraint processing apparatus provided in the present application; as shown in fig. 4, the database uniqueness constraint processing device 40 includes:
a processing module 41 configured to:
performing in-table deduplication on the data table to be inserted according to the unique key of the data table to be inserted;
determining target data to be inserted in the deduplicated data table according to a preset number of different preset first hash functions, a preset bloom filter array, and a sparse index of a stored data table;
a storage module 42, configured to insert the target data to be inserted into the stored data table.
Further, the processing module 41 is specifically configured to:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
Further, the processing module 41 is specifically configured to:
determining target data to be inserted in the data table to be inserted after the duplication removal according to the preset number of different first hash functions and the preset bloom filter array;
for each data to be inserted except the target data to be inserted in the deduplicated data table, if the data to be inserted is judged not to be stored in the stored data table according to the sparse index, taking the data to be inserted as the target data to be inserted;
and if the data to be inserted is judged to be stored in the stored data table, deleting the data to be inserted.
Further, the processing module 41 is specifically configured to:
determining a unique key value corresponding to each piece of data to be inserted in the deduplicated data table according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key values and the preset number of different preset first hash functions;
and if at least one of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as the target data to be inserted.
Further, the processing module 41 is further configured to update the preset bloom filter array and the sparse index according to the stored data table.
The database uniqueness constraint processing device provided in this embodiment is used for executing the technical solution in any of the foregoing method embodiments, and the implementation principle and technical effect thereof are similar and will not be described herein again.
Fig. 5 is a schematic structural diagram of an electronic device provided in the present application. As shown in fig. 5, the electronic device 50 includes:
a processor 51, a memory 52, and a communication interface 53;
the memory 52 is used for storing executable instructions of the processor 51;
wherein the processor 51 is configured to execute the technical solution in any of the foregoing method embodiments via executing the executable instructions.
Alternatively, the memory 52 may be separate or integrated with the processor 51.
Optionally, when the memory 52 is a device independent from the processor 51, the electronic device 50 may further include:
the bus 54, the memory 52 and the communication interface 53 are connected with the processor 51 through the bus 54 and perform communication with each other, and the communication interface 53 is used for communicating with other devices.
Alternatively, the communication interface 53 may be implemented by a transceiver. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The memory may comprise Random Access Memory (RAM) and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The bus 54 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The processor may be a general-purpose processor, including a central processing unit CPU, a Network Processor (NP), and the like; but also a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components.
The electronic device is configured to execute the technical solution in any one of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The embodiment of the present application further provides a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the technical solutions provided by any of the foregoing method embodiments.
The embodiment of the present application further provides a computer program product, which includes a computer program, and the computer program is used for implementing the technical solution provided by any of the foregoing method embodiments when being executed by a processor.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (6)

1. A database uniqueness constraint processing method is characterized by comprising the following steps:
performing in-table deduplication on the data table to be inserted according to the unique key of the data table to be inserted;
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key value and a preset number of different preset first hash functions;
if at least one of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as first target data to be inserted;
determining a position index of each data to be inserted except the first target data to be inserted in the data table to be inserted after the duplication removal;
determining a target sparse index corresponding to the position index according to the position index and the sparse index of the stored data table;
judging whether the data to be inserted is the same as the data corresponding to the target sparse index in the stored data table or not;
if the data to be inserted is different from the data corresponding to the target sparse index in the stored data table, taking the data to be inserted as second target data to be inserted;
if the data to be inserted is the same as the corresponding data of the target sparse index in the stored data table, deleting the data to be inserted;
inserting the first target data to be inserted and the second target data to be inserted into the stored data table;
the method further comprises the following steps:
and updating the preset bloom filter array and the sparse index according to the stored data table.
2. The method according to claim 1, wherein the performing intra-table deduplication on the to-be-inserted data table according to a unique key in the to-be-inserted data table comprises:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
3. A database uniqueness constraint processing apparatus, comprising:
a processing module to:
performing in-table duplicate removal on the data table to be inserted according to the unique key in the data table to be inserted;
determining a unique key value corresponding to each piece of data to be inserted in the data table after duplication removal according to a unique key in the data table to be inserted;
determining a preset number of hash values according to the unique key value and a preset number of different preset first hash functions;
if at least one of the preset number of hash values corresponds to a value of 0 in the preset bloom filter array, taking the data to be inserted as first target data to be inserted;
determining a position index of each data to be inserted except the first target data to be inserted in the data table to be inserted after the duplication removal;
determining a target sparse index corresponding to the position index according to the position index and the sparse index of the stored data table;
judging whether the data to be inserted and the corresponding data of the target sparse index in the stored data table are the same or not;
if the data to be inserted is different from the data corresponding to the target sparse index in the stored data table, taking the data to be inserted as second target data to be inserted;
if the data to be inserted is the same as the corresponding data of the target sparse index in the stored data table, deleting the data to be inserted;
the storage module is used for inserting the first target data to be inserted and the second target data to be inserted into the stored data table;
the processing module is further configured to update the preset bloom filter array and the sparse index according to the stored data table.
4. The apparatus of claim 3, wherein the processing module is specifically configured to:
for each data to be inserted in the data table to be inserted, determining a unique key value corresponding to the data to be inserted according to a unique key in the data table to be inserted in sequence;
determining a hash value corresponding to the data to be inserted according to the unique key value and a preset second hash function;
if the hash value is stored in a preset hash value table, deleting the data to be inserted;
if the hash value is not stored in a preset hash value table, the data to be inserted is reserved, and the hash value is stored in the preset hash value table.
5. An electronic device, comprising:
a processor, a memory, a communication interface;
the memory is used for storing executable instructions of the processor;
wherein the processor is configured to perform the database uniqueness constraint processing method of any one of claims 1 or 2 via execution of the executable instructions.
6. A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the database uniqueness constraint processing method of any one of claims 1 or 2.
CN202211388507.XA 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium Active CN115617809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211388507.XA CN115617809B (en) 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN115617809A CN115617809A (en) 2023-01-17
CN115617809B true CN115617809B (en) 2023-03-21

Family

ID=84877545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211388507.XA Active CN115617809B (en) 2022-11-08 2022-11-08 Database uniqueness constraint processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115617809B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN108241615A (en) * 2016-12-23 2018-07-03 中国电信股份有限公司 Data duplicate removal method and device
CN111274212A (en) * 2020-01-20 2020-06-12 暨南大学 Cold and hot index identification and classification management method in data deduplication system
CN114138786A (en) * 2021-12-01 2022-03-04 中国建设银行股份有限公司 Method, device, medium, product and equipment for duplicate removal of online transaction message

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11106375B2 (en) * 2019-04-04 2021-08-31 Netapp, Inc. Deduplication of encrypted data within a remote data store


Also Published As

Publication number Publication date
CN115617809A (en) 2023-01-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant