CN110362560B - Method for removing duplicate of non-service master key data during database storage - Google Patents

Info

Publication number
CN110362560B
Authority
CN
China
Prior art keywords
data
database
module
query
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910619770.7A
Other languages
Chinese (zh)
Other versions
CN110362560A (en)
Inventor
杨建华
陈洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengcaiyun Co ltd
Original Assignee
Zhengcaiyun Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengcaiyun Co ltd
Priority to CN201910619770.7A
Publication of CN110362560A
Application granted
Publication of CN110362560B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables

Abstract

The invention discloses a method for deduplicating data that has no business primary key when it is stored in a database, comprising the following steps: a data conversion module splices the external business data imported into the database into a character string; a data hash operation module applies the SHA-256 algorithm to the spliced string to obtain a byte array; a message digest conversion module converts the message digest in byte-array format into a character string H1; a message digest hash module hashes the string H1 again to obtain an integer value H2; and a deduplication processing module queries the database using the values H1 and H2 obtained from the two hash operations as the conditions. Because the results of a message digest algorithm have an extremely low collision rate, the method can judge whether two records are equal by comparing only two fields, and it makes effective use of database indexes to improve efficiency.

Description

Method for removing duplicate of non-service master key data during database storage
Technical Field
The invention relates to the technical field of database query deduplication, and in particular to a method for deduplicating data that has no business primary key when it is stored in a database.
Background
When a database table structure is designed, a business primary key field is usually included, and the uniqueness of the data is judged by that field. Sometimes, however, externally supplied data has no business primary key. Before such data is stored, it must be determined whether identical data already exists in order to decide how to proceed. Without a business primary key, this is done by querying with every field of the data as a condition, which is very inefficient when the table holds a large amount of data, especially when the stored fields are not suitable for database indexes.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a method for deduplicating data without a business primary key when it is stored in a database.
To achieve this purpose, the invention adopts the following technical solution: a method for deduplicating data without a business primary key when storing it in a database, comprising the following steps:
S1: Receive external data: business data from outside the original database system is imported into the original database through the database receiving module, so that the business data can be imported quickly;
S2: Convert data fields: the data conversion module splices the field names and field values of the imported external business data into a character string according to a rule, ensuring that every business record is spliced according to the same rule;
S3: Hash the character string: the data hash operation module applies the SHA-256 algorithm to the spliced character string to obtain a message digest in the form of a byte array, so that a digest can be computed exactly for every string;
S4: Convert the digest: the message digest conversion module converts the message digest in byte-array format into a character string H1, which serves as a convenient reference value for subsequent query comparison;
S5: Hash the string a second time: the message digest hash module applies the FNV1_32_HASH algorithm to the character string H1 to obtain an integer value H2, which keeps deduplication queries efficient when the database index is used;
S6: Deduplication query: the deduplication processing module queries the database using the values H1 and H2 obtained from the two hash operations above as the conditions; if the query returns data, the corresponding deduplication processing is performed, and if no matching data exists, the business data is stored in the database together with H1 and H2, achieving a fast and efficient deduplication query;
S7: Follow-up processing: the subsequent processing module of the database system performs targeted overwriting and deduplication on the results of the deduplication query, removing duplicated business data and ensuring the consistency and uniqueness of the data in the database system.
As a further description of the above technical solution:
The data conversion rule is as follows: a field named F1 with value V1 is combined as F1=V1, and a field named Fn with value Vn is combined as Fn=Vn; after the pairs are sorted by their English field names, they are joined with "&" to form the final string: F1=V1&Fn=Vn.
As a further description of the above technical solution:
The character string obtained by the message digest conversion module is in hexadecimal form.
As a further description of the above technical solution:
A hash operation is a method of creating a small digital fingerprint from any kind of data. A hash function compresses a message or data into a digest, reducing the amount of data and fixing its format; the digest is usually represented as a short string of seemingly random letters and numbers.
As a further description of the above technical solution:
SHA, the Secure Hash Algorithm, is a family of cryptographic hash functions that compute a fixed-length character string corresponding to a digital message; if two input messages differ, they correspond to different strings with high probability. SHA-256 is one of the standards in this family.
As a further description of the above technical solution:
When the deduplication processing module checks whether identical data exists, it queries the database using SQL. SQL is a database query language used to query database data; it can be used on its own at a terminal, or embedded as a sub-language to assist other programs.
As a further description of the above technical solution:
The FNV1_32_HASH algorithm is a hash algorithm that operates on the input business data to obtain an integer.
As a further description of the above technical solution:
The outputs and inputs of the database receiving module, the data conversion module, the data hash operation module, the message digest conversion module, the message digest hash module, the deduplication processing module and the subsequent processing module of the database system are electrically connected in sequence.
As a further description of the above technical solution:
When the deduplication processing module determines whether identical data exists in the database, the field values H1 and H2 must be used together as the query conditions, which improves the efficiency and accuracy of the query.
As a further description of the above technical solution:
The subsequent processing module of the database system consists of an overwriting module and a deduplication module. The overwriting module overwrites, one by one, the existing data found by the query, ensuring the uniqueness of the business data; the deduplication module screens and deduplicates the existing data found by the query, ensuring the consistency of the business data.
Advantageous effects
The invention provides a method for deduplicating data without a business primary key when it is stored in a database, with the following beneficial effects:
(1): The method avoids comparing every field of a row to decide whether two records are equal. Because the results of a message digest algorithm have an extremely low collision rate, equality can be judged by comparing only two fields, and database indexes are used effectively to improve efficiency, so that query-based deduplication is fast, comprehensive, efficient and accurate.
(2): The method trades space for time. It departs from the traditional approach of querying and deduplicating the data one by one, achieves fast query-based deduplication of business data in a massive database, and its efficiency advantage becomes more pronounced as the data volume grows.
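The "extremely low collision rate" claimed above can be given a rough scale with the standard birthday-bound approximation. The sketch below is illustrative only: it assumes hash outputs behave like uniformly random values, and the function name `collision_probability` is hypothetical, not part of the patent.

```python
import math


def collision_probability(rows: int, bits: int) -> float:
    """Birthday-bound estimate: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = rows * (rows - 1) / 2 / 2 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


# 256-bit digest (H1) over one billion rows: vanishingly small.
print(collision_probability(10**9, 256))
# 32-bit integer (H2) alone over 100,000 rows: collisions are likely.
print(collision_probability(100_000, 32))
```

Under these assumptions a 32-bit value on its own collides often even in modest tables, which is consistent with the method querying on H1 and H2 together rather than on H2 alone.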
Drawings
FIG. 1 is a schematic diagram of the data processing flow of the method for deduplicating data without a business primary key when storing it in a database according to the present invention;
FIG. 2 is a diagram of a database business table according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention.
As shown in FIGS. 1-2, a method for deduplicating data without a business primary key when storing it in a database comprises the following steps:
S1: Receive external data: business data from outside the original database system is imported into the original database through the database receiving module, so that the business data can be imported quickly;
S2: Convert data fields: the data conversion module splices the field names and field values of the imported external business data into a character string according to a rule, ensuring that every business record is spliced according to the same rule;
S3: Hash the character string: the data hash operation module applies the SHA-256 algorithm to the spliced character string to obtain a message digest in the form of a byte array, so that a digest can be computed exactly for every string;
S4: Convert the digest: the message digest conversion module converts the message digest in byte-array format into a character string H1, which serves as a convenient reference value for subsequent query comparison;
S5: Hash the string a second time: the message digest hash module applies the FNV1_32_HASH algorithm to the character string H1 to obtain an integer value H2, which keeps deduplication queries efficient when the database index is used;
S6: Deduplication query: the deduplication processing module queries the database using the values H1 and H2 obtained from the two hash operations above as the conditions; if the query returns data, the corresponding deduplication processing is performed, and if no matching data exists, the business data is stored in the database together with H1 and H2, achieving a fast and efficient deduplication query;
S7: Follow-up processing: the subsequent processing module of the database system performs targeted overwriting and deduplication on the results of the deduplication query, removing duplicated business data and ensuring the consistency and uniqueness of the data in the database system.
The data conversion rule is as follows: a field named F1 with value V1 is combined as F1=V1, and a field named Fn with value Vn is combined as Fn=Vn; after the pairs are sorted by their English field names, they are joined with "&" to form the final string: F1=V1&Fn=Vn.
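The splicing rule can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `canonical_string` and the sample field names are hypothetical, and the sketch assumes field values contain no "&" or "=" characters (otherwise two different records could splice to the same string and an escaping step would be needed).

```python
def canonical_string(record: dict) -> str:
    """Splice a record into 'F1=V1&...&Fn=Vn', sorted by field name."""
    # Sorting makes the result independent of the order in which fields arrive.
    return "&".join(f"{name}={record[name]}" for name in sorted(record))


row_a = {"title": "Office chairs", "amount": 1200, "buyer": "Bureau A"}
row_b = {"buyer": "Bureau A", "title": "Office chairs", "amount": 1200}

print(canonical_string(row_a))  # amount=1200&buyer=Bureau A&title=Office chairs
print(canonical_string(row_a) == canonical_string(row_b))  # True
```

Because both rows splice to the same string, they will also produce the same hash values in the later steps.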
The way the data conversion module splices data into a character string can vary; for example, the data object can instead be converted into a JSON string.
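For the JSON variant, the serialization has to be deterministic, or equal records could yield different strings. The sketch below shows one way to do that with Python's standard `json` module; the function name `canonical_json` is hypothetical, and sorting keys with compact separators is an assumption about what "the same rule" would require, not something the patent specifies.

```python
import json


def canonical_json(record: dict) -> str:
    """Serialize a record to a deterministic JSON string."""
    return json.dumps(
        record,
        sort_keys=True,          # key order must not depend on insertion order
        separators=(",", ":"),   # no optional whitespace
        ensure_ascii=False,      # keep non-ASCII text as-is
    )


print(canonical_json({"b": 2, "a": "x"}))  # {"a":"x","b":2}
```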
The character string obtained by the message digest conversion module is in hexadecimal form; alternatively, the module can encode the message digest byte array with Base64.
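Steps S3 and S4 with both encodings can be sketched using Python's standard `hashlib` and `base64` modules. The helper names are hypothetical, and encoding the spliced string as UTF-8 before hashing is an assumption, since the patent does not state a character encoding.

```python
import base64
import hashlib


def digest_bytes(text: str) -> bytes:
    """S3: SHA-256 message digest of the spliced string, as a byte array."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def h1_hex(text: str) -> str:
    """S4: digest as a 64-character hexadecimal string."""
    return digest_bytes(text).hex()


def h1_base64(text: str) -> str:
    """S4 variant: digest as a 44-character Base64 string."""
    return base64.b64encode(digest_bytes(text)).decode("ascii")


print(h1_hex("abc"))
print(h1_base64("abc"))
```

Whichever encoding is chosen has to be used consistently, since H1 values in different encodings never compare equal.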
A hash operation is a method of creating a small digital fingerprint from any kind of data. A hash function compresses a message or data into a digest, reducing the amount of data and fixing its format; the digest is usually represented as a short string of seemingly random letters and numbers.
SHA, the Secure Hash Algorithm, is a family of cryptographic hash functions that compute a fixed-length character string corresponding to a digital message; if two input messages differ, they correspond to different strings with high probability. SHA-256 is one of the standards in this family.
When the deduplication processing module checks whether identical data exists, it queries the database using SQL. SQL is a database query language used to query database data; it can be used on its own at a terminal, or embedded as a sub-language to assist other programs.
The FNV1_32_HASH algorithm is a hash algorithm that operates on the input business data to obtain an integer.
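The patent names FNV1_32_HASH without giving its exact definition. The sketch below assumes the standard 32-bit FNV-1 function (multiply by the FNV prime, then XOR each byte); implementations that go by the same name in some codebases use the FNV-1a ordering or add extra mixing steps, so this should be read as one plausible variant. The function name `fnv1_32` is hypothetical.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # 2166136261
FNV_PRIME_32 = 0x01000193         # 16777619


def fnv1_32(text: str) -> int:
    """Standard 32-bit FNV-1 hash of the UTF-8 bytes of `text`."""
    h = FNV_OFFSET_BASIS_32
    for byte in text.encode("utf-8"):
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # multiply, keep the low 32 bits
        h ^= byte                            # then XOR in the next byte
    return h


print(hex(fnv1_32("a")))  # 0x50c5d7e
```

The result is an unsigned 32-bit integer, so a signed 32-bit database column would not hold every value; a wider integer column or a signed conversion would be needed in that case.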
The hash algorithm used in the data hash operation module can vary; MD5 or SHA-1 may be used instead. The MD5 message digest algorithm is a widely used cryptographic hash function that produces a 128-bit hash value and is used to check the integrity and consistency of transmitted information. SHA-1 is mainly used with the digital signature algorithm defined in the Digital Signature Standard; for a message shorter than 2^64 bits, SHA-1 produces a 160-bit message digest, which can be used to verify the integrity of the data when the message is received.
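Swapping the digest algorithm can be sketched with `hashlib.new`, which selects an algorithm by name. The helper name `h1_with` is hypothetical and UTF-8 encoding is assumed. Note that MD5 and SHA-1 have known practical collision attacks, so they are better suited to this use when the imported data is not adversarial.

```python
import hashlib


def h1_with(algorithm: str, text: str) -> str:
    """Hexadecimal digest of `text` under the named hashlib algorithm."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


for name in ("md5", "sha1", "sha256"):
    digest = h1_with(name, "abc")
    # Each hexadecimal character encodes 4 bits of the digest.
    print(name, len(digest) * 4, "bits")
```

A shorter digest shortens the stored H1 column, at the price of a higher collision probability.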
The message digest hash module may instead use Java's built-in string hash or the CRC32 algorithm directly.
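Both alternatives can be sketched in Python. CRC32 is available as `zlib.crc32`; the second function reproduces the documented behaviour of Java's `String.hashCode()` (a base-31 polynomial over UTF-16 code units with signed 32-bit wraparound). The function names are hypothetical, and using UTF-8 bytes for CRC32 is an assumption.

```python
import zlib


def crc32_of(text: str) -> int:
    """Unsigned 32-bit CRC of the UTF-8 bytes of `text`."""
    return zlib.crc32(text.encode("utf-8"))


def java_string_hash(text: str) -> int:
    """Emulate Java's String.hashCode(): h = 31*h + code_unit, as signed int32."""
    data = text.encode("utf-16-be")  # two bytes per UTF-16 code unit, no BOM
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed integer, as Java does.
    return h - 0x100000000 if h >= 0x80000000 else h


print(hex(crc32_of("123456789")))  # 0xcbf43926
print(java_string_hash("abc"))     # 96354
```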
The outputs and inputs of the database receiving module, the data conversion module, the data hash operation module, the message digest conversion module, the message digest hash module, the deduplication processing module and the subsequent processing module of the database system are electrically connected in sequence.
When the deduplication processing module determines whether identical data exists in the database, the field values H1 and H2 must be used together as the query conditions, which improves the efficiency and accuracy of the query.
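The query on H1 and H2 can be sketched with an in-memory SQLite database. Everything named here is hypothetical (the table `business_data`, its columns, the index `idx_business_data_h2`, and the helper functions), and the sketch assumes a hexadecimal SHA-256 string for H1 and standard 32-bit FNV-1 for H2. Indexing only the narrow integer column `h2` and letting `h1` filter the few candidate rows is one reading of how the two fields work together, not a detail the patent spells out.

```python
import hashlib
import sqlite3


def fingerprints(text: str) -> tuple:
    """Return (H1, H2): hex SHA-256 of the text, then 32-bit FNV-1 of H1."""
    h1 = hashlib.sha256(text.encode("utf-8")).hexdigest()
    h2 = 0x811C9DC5
    for byte in h1.encode("ascii"):
        h2 = ((h2 * 0x01000193) & 0xFFFFFFFF) ^ byte
    return h1, h2


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE business_data ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "h1 TEXT NOT NULL, h2 INTEGER NOT NULL)"
    )
    # Small integer index; h1 is compared only on rows that match h2.
    conn.execute("CREATE INDEX idx_business_data_h2 ON business_data (h2)")
    return conn


def store_if_new(conn: sqlite3.Connection, payload: str) -> bool:
    """Insert the payload unless a row with the same H1 and H2 exists."""
    h1, h2 = fingerprints(payload)
    found = conn.execute(
        "SELECT id FROM business_data WHERE h2 = ? AND h1 = ? LIMIT 1",
        (h2, h1),
    ).fetchone()
    if found is not None:
        return False  # duplicate: leave it to the follow-up processing
    conn.execute(
        "INSERT INTO business_data (payload, h1, h2) VALUES (?, ?, ?)",
        (payload, h1, h2),
    )
    conn.commit()
    return True


conn = open_store()
print(store_if_new(conn, "amount=1200&buyer=Bureau A"))  # True
print(store_if_new(conn, "amount=1200&buyer=Bureau A"))  # False
```

The check-then-insert sequence here is not atomic; with concurrent writers, a unique constraint or a transaction would be needed to rule out races.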
The subsequent processing module of the database system consists of an overwriting module and a deduplication module. The overwriting module overwrites, one by one, the existing data found by the query, ensuring the uniqueness of the business data; the deduplication module screens and deduplicates the existing data found by the query, ensuring the consistency of the business data.
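The two follow-up behaviours, overwriting an existing row or keeping it and discarding the newcomer, can be approximated by letting the database enforce uniqueness on the pair (H2, H1). This is an alternative sketch rather than the patent's module design: the table, the `imported_at` column, the `save` function and the sample hash values are all hypothetical, and it relies on SQLite's `INSERT OR REPLACE` and `INSERT OR IGNORE` conflict handling.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE business_data ("
        "payload TEXT NOT NULL, imported_at TEXT NOT NULL, "
        "h1 TEXT NOT NULL, h2 INTEGER NOT NULL, UNIQUE (h2, h1))"
    )
    return conn


def save(conn, payload, imported_at, h1, h2, overwrite):
    """overwrite=True replaces the existing row; False keeps it untouched."""
    verb = "INSERT OR REPLACE" if overwrite else "INSERT OR IGNORE"
    conn.execute(
        f"{verb} INTO business_data (payload, imported_at, h1, h2) "
        "VALUES (?, ?, ?, ?)",
        (payload, imported_at, h1, h2),
    )
    conn.commit()


conn = open_store()
# Placeholder hash values; real ones would come from the two hash steps.
save(conn, "a=1", "2019-07-10", "ab12", 7, overwrite=False)
save(conn, "a=1", "2019-07-11", "ab12", 7, overwrite=False)  # ignored
save(conn, "a=1", "2019-07-12", "ab12", 7, overwrite=True)   # replaces the row
print(conn.execute("SELECT imported_at FROM business_data").fetchall())
```

Since equal H1 and H2 imply equal hashed fields, overwriting only changes columns that were left out of the spliced string, such as the import timestamp in this sketch.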
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description covers only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (8)

1. A method for deduplicating data without a business primary key when storing it in a database, comprising the following steps:
S1: receiving external data, wherein business data from outside the original database system is imported into the original database through the database receiving module, so that the business data can be imported quickly;
S2: converting data fields, wherein the data conversion module splices the field names and field values of the imported external business data into a character string according to a rule, ensuring that every business record is spliced according to the same rule;
S3: hashing the character string, wherein the data hash operation module applies the SHA-256 algorithm to the spliced character string to obtain a message digest in the form of a byte array, so that a digest can be computed exactly for every string;
S4: converting the digest, wherein the message digest conversion module converts the message digest in byte-array format into a character string H1, which serves as a convenient reference value for subsequent query comparison;
S5: hashing the string a second time, wherein the message digest hash module applies the FNV1_32_HASH algorithm to the character string H1 to obtain an integer value H2, which keeps deduplication queries efficient when the database index is used;
S6: performing a deduplication query, wherein the deduplication processing module queries the database using the values H1 and H2 obtained from the two hash operations above as the conditions; if the query returns data, the corresponding deduplication processing is performed, and if no matching data exists, the business data is stored in the database together with H1 and H2, achieving a fast and efficient deduplication query;
S7: performing follow-up processing, wherein the subsequent processing module of the database system performs targeted overwriting and deduplication on the results of the deduplication query, removing duplicated business data and ensuring the consistency and uniqueness of the data in the database system;
wherein, when the deduplication processing module determines whether identical data exists in the database, the field values H1 and H2 must be used together as the query conditions, which improves the efficiency and accuracy of the query;
and wherein the subsequent processing module of the database system consists of an overwriting module and a deduplication module, the overwriting module overwriting, one by one, the existing data found by the query to ensure the uniqueness of the business data, and the deduplication module screening and deduplicating the existing data found by the query to ensure the consistency of the business data.
2. The method of claim 1, wherein the data conversion rule is as follows: a field named F1 with value V1 is combined as F1=V1, and a field named Fn with value Vn is combined as Fn=Vn; after the pairs are sorted by their English field names, they are joined with "&" to form the final string: F1=V1&Fn=Vn.
3. The method of claim 1, wherein the character string obtained by the message digest conversion module is in hexadecimal form.
4. The method of claim 1, wherein the hash operation is a method of creating a small digital fingerprint from any kind of data, in which a hash function compresses a message or data into a digest, reducing the amount of data and fixing its format, the digest usually being represented as a short string of seemingly random letters and numbers.
5. The method of claim 1, wherein SHA is the Secure Hash Algorithm, a family of cryptographic hash functions that compute a fixed-length character string corresponding to a digital message; if two input messages differ, they correspond to different strings with high probability, and SHA-256 is one of the standards in this family.
6. The method of claim 1, wherein the deduplication processing module checks whether identical data exists by querying the database using SQL, SQL being a database query language used to query database data that can be used on its own at a terminal or embedded as a sub-language to assist other programs.
7. The method of claim 1, wherein the FNV1_32_HASH algorithm is a hash algorithm that operates on the input business data to obtain an integer.
8. The method of claim 1, wherein the outputs and inputs of the database receiving module, the data conversion module, the data hash operation module, the message digest conversion module, the message digest hash module, the deduplication processing module and the subsequent processing module of the database system are electrically connected in sequence.
CN201910619770.7A 2019-07-10 2019-07-10 Method for removing duplicate of non-service master key data during database storage Active CN110362560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910619770.7A CN110362560B (en) 2019-07-10 2019-07-10 Method for removing duplicate of non-service master key data during database storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910619770.7A CN110362560B (en) 2019-07-10 2019-07-10 Method for removing duplicate of non-service master key data during database storage

Publications (2)

Publication Number Publication Date
CN110362560A (en) 2019-10-22
CN110362560B (en) 2021-12-31

Family

ID=68218608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910619770.7A Active CN110362560B (en) 2019-07-10 2019-07-10 Method for removing duplicate of non-service master key data during database storage

Country Status (1)

Country Link
CN (1) CN110362560B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259013A (en) * 2020-02-03 2020-06-09 京东数字科技控股有限公司 Method and device for storing data
CN112559506A (en) * 2020-12-22 2021-03-26 卫宁健康科技集团股份有限公司 Health data processing method and device, processing equipment and storage medium
CN113609123B (en) * 2021-08-26 2023-06-02 四川效率源信息安全技术股份有限公司 HBase-based mass user data deduplication storage method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8943024B1 (en) * 2003-01-17 2015-01-27 Daniel John Gardner System and method for data de-duplication
US10055422B1 (en) * 2013-12-17 2018-08-21 Emc Corporation De-duplicating results of queries of multiple data repositories

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8122019B2 (en) * 2006-02-17 2012-02-21 Google Inc. Sharing user distributed search results
CN101916262B (en) * 2010-07-29 2012-07-04 北京用友政务软件有限公司 Acceleration method of financial element matching
CN105989532A (en) * 2015-02-28 2016-10-05 阿里巴巴集团控股有限公司 Data processing method and device
CN106708927B (en) * 2016-11-18 2021-01-05 北京二六三企业通信有限公司 File deduplication processing method and device

Also Published As

Publication number Publication date
CN110362560A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110362560B (en) Method for removing duplicate of non-service master key data during database storage
US20200012631A1 (en) Comparing data stores using hash sums on disparate parallel systems
US9104676B2 (en) Hash algorithm-based data storage method and system
US20200050782A1 (en) Method and apparatus for operating database
CN106407201B (en) Data processing method and device and computer readable storage medium
CN103020024B (en) A kind of file layout change-over method
US10223550B2 (en) Generating canonical representations of JSON documents
CN108268529B (en) Data summarization method and system based on business abstraction and multi-engine scheduling
CN108733317B (en) Data storage method and device
EP2186275A1 (en) Generating a fingerprint of a bit sequence
CN112651046A (en) Data synchronization method, device and system for cross-chain transaction and terminal equipment
WO2021051532A1 (en) Data compression method, apparatus and device, and computer-readable storage medium
US8868584B2 (en) Compression pattern matching
CN110990897A (en) File fingerprint generation method and device
CN107979595B (en) Private data protection method and gateway system
EP3926453A1 (en) Partitioning method and apparatus therefor
WO2017157038A1 (en) Data processing method, apparatus and equipment
CN116069725A (en) File migration method, device, apparatus, medium and program product
CN113239039B (en) Dynamic data storage method, query method, management method and management system
CN104573518A (en) Method, device, server and system for scanning files
CN110941831B (en) Vulnerability matching method based on slicing technology
CN116263770A (en) Method, device, terminal equipment and medium for storing business data based on database
CN108614842B (en) Method and device for querying data
CN107315806B (en) Embedded storage method and device based on file system
Goyal et al. A Key based Distributed Approach for Data Integrity and Consistency in JSON and XML (Hierarchical Data Exchange Formats)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant