CN111625186A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN111625186A
Authority
CN
China
Prior art keywords
data block
data
stored
database
set value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010402288.0A
Other languages
Chinese (zh)
Other versions
CN111625186B (en)
Inventor
葛绪意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN202010402288.0A
Publication of CN111625186A
Application granted
Publication of CN111625186B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention are applicable to the technical field of computers and provide a data processing method, a data processing device, electronic equipment and a storage medium. The data processing method comprises the following steps: when data to be stored is stored in a first database, determining, for each first data block among all the first data blocks forming the data to be stored, whether a second data block identical to the first data block is stored in the first database; in the case that a second data block identical to the first data block is stored in the first database, deduplicating the corresponding first data block and increasing the deduplication number of the corresponding second data block by a first set value; and in the case that the deduplication number of the second data block is greater than a second set value, adding a backup for the corresponding second data block.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a data processing method and device, electronic equipment and a storage medium.
Background
Data deduplication is a technique applied in storage systems for eliminating redundant data, and the data deduplication method divides a data stream or file into fixed-size data blocks, and eliminates duplicate data blocks by comparing fingerprints of the data blocks. At present, in the related art, only one copy of data blocks with the same fingerprint is stored when data deduplication is performed, so that storage resources are saved. However, when a data block is lost or corrupted, the file is unusable because the data block cannot be retrieved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a storage medium, so as to at least solve the problem that a data block cannot be retrieved when the data block is lost or damaged after data deduplication is performed.
The technical scheme of the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
when data to be stored is stored in a first database, determining whether a second data block which is the same as the first data block is stored in the first database or not for each first data block in all first data blocks forming the data to be stored;
in the case that a second data block identical to the first data block is stored in the first database, deduplicating the corresponding first data block, and increasing the deduplication number of the corresponding second data block by a first set value;
and in the case that the deduplication number of the second data block is greater than a second set value, adding a backup for the corresponding second data block.
In the foregoing solution, the adding a backup for the corresponding second data block includes:
a backup is added to the second database for the corresponding second data block.
In the foregoing solution, in the case that the deduplication number of the second data block is greater than the second set value, the method further includes:
resetting the deduplication number of the corresponding second data block to zero.
In the above scheme, the method further comprises:
in the event that a second data block is deleted from the first database, the backup for the corresponding second data block is deleted in the second database.
In the above scheme, the method further comprises:
determining the importance level of the data to be stored;
based on the importance level of the data to be stored, determining at least one of:
the first set value;
the second set value.
In the foregoing solution, when the deduplication is performed on the corresponding first data block, the method further includes:
determining metadata of the first data block; the metadata at least comprises a storage address of the corresponding second data block in the first database;
and storing the metadata of the first data block into the first database.
In the above scheme, the method further comprises:
and storing the corresponding first data block into the first database under the condition that the same second data block is not stored in the first database.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, including:
a determining module, configured to determine, when data to be stored is stored in a first database, for each first data block among all the first data blocks forming the data to be stored, whether a second data block identical to the first data block is stored in the first database;
a deduplication module, configured to, in the case that a second data block identical to the first data block is stored in the first database, deduplicate the corresponding first data block and increase the deduplication number of the corresponding second data block by a first set value;
and an adding module, configured to add a backup for the corresponding second data block in the case that the deduplication number of the second data block is greater than a second set value.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the steps of the data processing method provided in the first aspect of the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the data processing method provided by the first aspect of the embodiment of the present invention.
In the embodiments of the invention, when data to be stored is stored in a first database, it is determined, for each first data block among all the first data blocks forming the data to be stored, whether a second data block identical to the first data block is stored in the first database; in the case that such a second data block is stored in the first database, the corresponding first data block is deduplicated and the deduplication number of the corresponding second data block is increased by a first set value; and in the case that the deduplication number of the second data block is greater than a second set value, a backup is added for the corresponding second data block. By adding backups for second data blocks with a high deduplication number, the embodiments of the invention avoid the situation in which a large number of data blocks of the data to be stored cannot be recovered when a second data block is lost; when a data block is lost, it can be promptly restored from its backup, so that the data to be stored can be reconstructed quickly.
Drawings
Fig. 1 is a schematic diagram of data deduplication according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of an implementation of a data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of another implementation of a data processing method according to an embodiment of the present invention;
Fig. 4 is a schematic flow chart of another implementation of a data processing method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a data processing flow according to an embodiment of the present invention;
Fig. 6 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The technical means described in the embodiments of the present invention may be arbitrarily combined without conflict.
In addition, in the embodiments of the present invention, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.
Referring to Fig. 1, Fig. 1 is a schematic diagram of data deduplication according to an embodiment of the present invention. The data to be processed consists of data blocks A, B, C and D, each of which occurs more than once. After deduplication, only one copy of each of data blocks A, B, C and D is retained, and no block is stored repeatedly. Data deduplication saves disk storage space, improves disk write performance, and reduces the amount of data transmitted over the network, thereby saving network bandwidth.
The data to be processed is divided into a plurality of data blocks of the same length; duplicate data blocks are found and deleted through fingerprint comparison and/or byte comparison, so that only one copy of each distinct data block is finally stored. Fingerprint comparison means comparing the fingerprints of data blocks, where the fingerprint of a data block may be its Secure Hash Algorithm 1 (SHA-1) value or its Message-Digest Algorithm 5 (MD5) value. Byte comparison means comparing two data blocks byte by byte; if any compared byte differs, the two data blocks are considered different.
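The two comparison methods described above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function names `fingerprint` and `bytes_equal` are assumptions introduced here, and Python's standard `hashlib` module supplies the SHA-1 and MD5 digests.

```python
import hashlib


def fingerprint(block: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of a data block (SHA-1 by default, or MD5)."""
    return hashlib.new(algorithm, block).hexdigest()


def bytes_equal(a: bytes, b: bytes) -> bool:
    """Byte comparison: the blocks differ as soon as one byte differs."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


block_a = b"A" * 4096
block_a_copy = b"A" * 4096
block_b = b"B" * 4096

# Identical content yields identical fingerprints; different content does not.
same_fingerprint = fingerprint(block_a) == fingerprint(block_a_copy)
different_fingerprint = fingerprint(block_a) != fingerprint(block_b)
```

Fingerprint comparison only needs the short digests, whereas byte comparison needs both blocks in full.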
Because only one copy of identical data blocks is stored after data deduplication, a lost or damaged data block cannot be recovered, and the data that depends on it becomes unusable.
In view of the above-mentioned disadvantage that the related art cannot recover when a data block is lost or damaged after data deduplication is performed, an embodiment of the present invention provides a data processing method, which can recover a data block when the data block is lost or damaged after data deduplication. In order to better illustrate the technical solution of the present invention, the following description is given by way of specific examples.
Fig. 2 is a schematic diagram of an implementation flow of a data processing method according to an embodiment of the present invention, where an execution subject of the method may be an electronic device such as a mobile phone, a tablet, a server, and the like. Referring to fig. 2, the data processing method includes:
S201, when data to be stored is stored in a first database, determining, for each first data block among all the first data blocks forming the data to be stored, whether an identical second data block is stored in the first database.
When the data to be stored is stored in the first database, the data to be stored is divided into a plurality of first data blocks with the same length. For example, the block length of the first data block is set to 4 KB. The smaller the block length of the first data block, the better the data deduplication effect.
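As a rough sketch of this splitting step, the snippet below cuts a byte string into 4 KB blocks. The helper name `split_into_blocks` and the zero-padding of the last block are assumptions made for illustration; the source does not say how a short final block is handled.

```python
def split_into_blocks(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Split data into consecutive blocks of block_size bytes.

    Assumption of this sketch: the final block is zero-padded so that
    all blocks have the same length.
    """
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append(block.ljust(block_size, b"\x00"))
    return blocks


data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"C" * 100
blocks = split_into_blocks(data)
# blocks[0] and blocks[2] have identical content, so one of them is a
# deduplication candidate.
```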
For each first data block, while writing the first data block to the first database, it is determined whether a second data block identical to the first data block is stored in the first database.
In practical applications, whether a second data block identical to the first data block is stored in the first database may be determined by fingerprint comparison or byte comparison, where "identical" means that the data content of the two data blocks is exactly the same.
For example, with fingerprint comparison, the fingerprint of the first data block is first calculated and then looked up in a set fingerprint database, and it is determined whether a match is obtained. The fingerprints in the set fingerprint database are the fingerprints of the second data blocks. If the fingerprint of the first data block matches an entry in the set fingerprint database, the first data block is determined to be a duplicate data block, that is, a second data block identical to the first data block is stored in the first database; two data blocks with the same fingerprint are considered identical. If the fingerprint of the first data block matches no entry in the set fingerprint database, the first data block is determined not to be a duplicate data block, that is, no second data block identical to the first data block is stored in the first database.
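The fingerprint lookup can be sketched with an in-memory dictionary standing in for the set fingerprint database. The class name `FingerprintIndex` and its methods are hypothetical, and SHA-1 is used only because the description names it as one possible fingerprint.

```python
import hashlib


class FingerprintIndex:
    """Hypothetical in-memory stand-in for the set fingerprint database."""

    def __init__(self):
        # fingerprint -> storage address of the second data block
        self._addresses = {}

    def register(self, block: bytes, address: int) -> None:
        """Record the fingerprint of a stored (second) data block."""
        self._addresses[hashlib.sha1(block).hexdigest()] = address

    def lookup(self, block: bytes):
        """Return the address of an identical stored block, or None."""
        return self._addresses.get(hashlib.sha1(block).hexdigest())


index = FingerprintIndex()
index.register(b"A" * 4096, address=0)

hit = index.lookup(b"A" * 4096)   # duplicate: a matching fingerprint exists
miss = index.lookup(b"B" * 4096)  # not a duplicate: no matching fingerprint
```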
S202, in the case that a second data block identical to the first data block is stored in the first database, deduplicating the corresponding first data block, and increasing the deduplication number of the corresponding second data block by a first set value.
If the first database stores a second data block identical to the first data block, the content of the first data block already exists in the first database, so the corresponding first data block is deduplicated in order to save disk storage space.
Deduplicating the corresponding first data block means deleting the first data block from the data to be stored, that is, not writing the first data block into the first database. Although the data block is deleted from the data to be stored, the data to be stored must still be reconstructible, so the embodiment of the present invention stores metadata of the first data block. The metadata includes the storage address of the corresponding second data block in the first database. When the data to be stored is reconstructed, the corresponding second data block is found according to the storage address in the metadata of the first data block.
Referring to fig. 3, in an embodiment, when performing deduplication on the corresponding first data block, the method further includes:
S301, determining metadata of the first data block; the metadata includes at least a storage address of the corresponding second data block in the first database.
The metadata is data describing data attributes, and in the embodiment of the present invention, the metadata is used to record the storage address of the second data block which is the same as the first data block.
S302, storing the metadata of the first data block into the first database.
In the embodiment of the present invention, in the case that a second data block identical to the first data block is stored in the first database, the first data block itself is not stored in the first database; only the metadata of the first data block is stored. When the data to be stored is reconstructed, the metadata is read, the second data block is read from the storage address recorded in the metadata, and the data to be stored is reconstructed from the second data block. Because the metadata is simple and occupies little storage space, less data is written, disk storage space is saved, and disk utilization is improved.
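Reconstruction from metadata can be sketched as follows. The `BlockMetadata` type, the dictionary used as the first database, and the integer addresses are assumptions for illustration; the source only requires that the metadata contain the storage address of the second data block.

```python
from dataclasses import dataclass


@dataclass
class BlockMetadata:
    # Storage address of the identical second data block in the first database.
    address: int


# Hypothetical first database: address -> stored second data block.
first_db = {0: b"AAAA", 1: b"BBBB"}

# Data made of blocks A, A, B, A is represented by metadata only.
file_metadata = [BlockMetadata(0), BlockMetadata(0),
                 BlockMetadata(1), BlockMetadata(0)]


def reconstruct(metadata, database) -> bytes:
    """Rebuild the data by reading each block at the address in its metadata."""
    return b"".join(database[m.address] for m in metadata)


restored = reconstruct(file_metadata, first_db)
```

Four logical blocks are rebuilt here from only two stored blocks, which is where the space saving comes from.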
In the embodiment of the invention, when detecting that a second data block identical to the first data block is stored in the first database, the corresponding first data block is deduplicated, and the deduplication number of the corresponding second data block is increased by a first set value.
For each second data block in the first database, each time a first data block is detected to be identical to the second data block, the deduplication number of the second data block is increased by the first set value. Here, the first set value is related to the importance level of the data to be stored. For example, the higher the importance level of the data to be stored, the larger the first set value; how the first set value is determined is explained in subsequent embodiments.
Further, in an embodiment, in a case that a second data block identical to the first data block is not stored in the first database, the corresponding first data block is stored in the first database.
If no identical second data block is stored in the first database, the first data block is stored in the first database without being deduplicated.
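The write path of S202, together with the non-duplicate case, can be sketched as below. The class `FirstDatabase`, its fields, and the use of the SHA-1 fingerprint as the storage key are assumptions of this sketch, not details given in the source.

```python
import hashlib


class FirstDatabase:
    """Hypothetical in-memory first database keyed by block fingerprint."""

    def __init__(self, first_set_value: int = 1):
        self.blocks = {}        # fingerprint -> stored second data block
        self.dedup_number = {}  # fingerprint -> deduplication number
        self.first_set_value = first_set_value

    def write(self, block: bytes) -> str:
        fp = hashlib.sha1(block).hexdigest()
        if fp in self.blocks:
            # Duplicate: do not store the block again, only count it.
            self.dedup_number[fp] += self.first_set_value
        else:
            # Not a duplicate: store the block itself.
            self.blocks[fp] = block
            self.dedup_number[fp] = 0
        return fp


db = FirstDatabase(first_set_value=1)
fp_a = db.write(b"A" * 4096)
db.write(b"A" * 4096)
db.write(b"A" * 4096)
fp_b = db.write(b"B" * 4096)
```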
S203, adding a backup for the corresponding second data block in case the deduplication number of the second data block is greater than the second setting value.
If the deduplication number of a second data block is greater than the second set value, the data to be stored contains many first data blocks identical to that second data block, so the second data block is very important for reconstructing the data to be stored: if it is lost, a large part of the data blocks in the data to be stored cannot be recovered. Therefore, when the deduplication number of the second data block is greater than the second set value, a backup of the corresponding second data block is added. This prevents the data block from becoming irretrievable after loss or damage and avoids the situation in which the data to be stored cannot be reconstructed.
Further, in an embodiment, the adding the backup for the corresponding second data block includes:
a backup is added to the second database for the corresponding second data block.
The second data block is stored in the first database and its backup is stored in a different database. This prevents the backup from being removed when the data in the first database is deduplicated, and it also prevents the second data block from becoming unrecoverable when it is lost because of an error in the first database.
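A minimal sketch of backing up to a second database and recovering from it is shown below. The dictionaries, the key `"fp_a"`, and the helpers `add_backup_if_needed` and `read_block` are hypothetical; the restore-on-read behaviour is an assumption, since the source only states that a lost block can be recovered from its backup.

```python
SECOND_SET_VALUE = 10

first_db = {"fp_a": b"A" * 8}   # fingerprint -> second data block
second_db = {}                  # fingerprint -> list of backups
dedup_number = {"fp_a": 11}     # already above the second set value


def add_backup_if_needed(fp: str) -> bool:
    """Add a backup to the second database once the threshold is exceeded."""
    if dedup_number[fp] > SECOND_SET_VALUE:
        second_db.setdefault(fp, []).append(first_db[fp])
        return True
    return False


def read_block(fp: str) -> bytes:
    """Read from the first database, falling back to a backup if it is lost."""
    if fp in first_db:
        return first_db[fp]
    backups = second_db.get(fp)
    if backups:
        first_db[fp] = backups[0]  # restore the lost block from its backup
        return backups[0]
    raise KeyError(fp)


backed_up = add_backup_if_needed("fp_a")
del first_db["fp_a"]               # simulate loss in the first database
recovered = read_block("fp_a")
```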
Further, in an embodiment, in a case that the deduplication number of the second data block is greater than a second set value, the method further includes:
resetting the deduplication number of the corresponding second data block to zero.
In the embodiment of the invention, when the deduplication number of the second data block is greater than the second set value, a backup of the corresponding second data block is added and, at the same time, the deduplication number of the corresponding second data block is reset to zero. In other words, backups of the second data block may be added repeatedly: the deduplication number is cleared after each backup is added, and another backup is added when the deduplication number again exceeds the second set value, so that a second data block may have multiple backups.
Clearing the deduplication number each time it exceeds the second set value therefore allows backups to be added repeatedly, increases the number of backups of the second data block, and strengthens the ability to recover the second data block after it is lost.
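The clear-and-repeat behaviour can be sketched with a small counter. The `state` dictionary and the function `on_duplicate` are hypothetical names, and a second set value of 3 is chosen only to keep the trace short.

```python
def on_duplicate(state: dict, first_set_value: int = 1,
                 second_set_value: int = 3) -> None:
    """Count one deduplication; back up and clear the count past the threshold."""
    state["dedup_number"] += first_set_value
    if state["dedup_number"] > second_set_value:
        state["backups"] += 1       # add one more backup
        state["dedup_number"] = 0   # clear so that the cycle can repeat


state = {"dedup_number": 0, "backups": 0}
for _ in range(8):
    on_duplicate(state)
# The count goes 1, 2, 3, 4 (backup, cleared), then 1, 2, 3, 4 (backup, cleared).
```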
In practical applications, a set number of backups of the second data block may be added each time. For example, one backup of the second data block is added each time. Alternatively, the set number may be determined according to the importance level of the data to be stored: the higher the importance level, the larger the set number. Here, the multiple backups may also be stored in different databases to reduce the probability that the second data block is lost.
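One way to place several backups in different databases is sketched below. The round-robin placement, the `COPIES_PER_LEVEL` table, and the helper `place_backups` are all assumptions for illustration; the source does not specify how backups are distributed.

```python
# Hypothetical mapping: importance level -> backups added per trigger.
COPIES_PER_LEVEL = {1: 1, 2: 2, 3: 3}


def place_backups(block: bytes, fp: str, backup_dbs: list, copies: int) -> None:
    """Write `copies` backups, spread round-robin over the backup databases."""
    for i in range(copies):
        db = backup_dbs[i % len(backup_dbs)]
        db.setdefault(fp, []).append(block)


db_x, db_y = {}, {}
place_backups(b"A" * 8, "fp_a", [db_x, db_y], COPIES_PER_LEVEL[3])
# Three backups over two databases: two land in db_x, one in db_y.
```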
Referring to fig. 4, in an embodiment, the data processing method further includes:
S401, determining the importance level of the data to be stored.
In practical applications, the importance level of the data to be stored may be related to its data type, its file size, or its source; for example, data to be stored from different devices or from different operating systems may have different importance levels. For example, the larger the file size of the data to be stored, the higher its importance level: importance levels are divided by file size, with each file-size interval corresponding to one level, so that data to be stored of 0-10 MB has importance level 1 and data to be stored of 10-100 MB has importance level 2. For another example, within an enterprise system, data to be stored from the system responsible for daily employee attendance data has importance level 1, and data to be stored from the system responsible for operation data has importance level 2. For another example, the more often a device communicates with the user, the higher the importance level of the data to be stored from that device.
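The file-size rule can be sketched as a simple lookup. The function name `importance_by_size`, the inclusive upper bounds, and level 3 for files above 100 MB are assumptions; the source only gives the two intervals 0-10 MB and 10-100 MB.

```python
MB = 1024 * 1024


def importance_by_size(size_bytes: int) -> int:
    """Map a file size to an importance level (larger file, higher level)."""
    if size_bytes <= 10 * MB:
        return 1
    if size_bytes <= 100 * MB:
        return 2
    return 3  # assumption: the source does not define a level above 100 MB


small = importance_by_size(5 * MB)
medium = importance_by_size(50 * MB)
large = importance_by_size(500 * MB)
```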
S402, based on the importance level of the data to be stored, determining at least one of the following items:
the first set value;
the second set value.
In the embodiment of the present invention, one of the first setting value and the second setting value or both of the first setting value and the second setting value may be determined based on the importance level of the data to be stored.
In practical applications, the higher the importance level of the data to be stored, the larger the first setting value. Each importance level may correspond to a first setting value, the correspondence between the importance level and the first setting value is written in the data table in advance, and when the first setting value is determined, the first setting value corresponding to the importance level of the data to be stored may be acquired by reading the data table. For example, the first setting value corresponding to importance level 1 is set to 1, the first setting value corresponding to importance level 2 is set to 5, and the first setting value corresponding to importance level 3 is set to 10.
The higher the importance level of the data to be stored, the larger the first set value and the more the deduplication number of the corresponding second data block grows on each increase, so that the deduplication number exceeds the second set value sooner and a backup of the second data block is added sooner. For data to be stored with a higher importance level, the embodiment of the invention can thus add backups of the second data block more frequently, reducing the risk that the data cannot be retrieved after being lost.
In practical applications, the higher the importance level of the data to be stored, the smaller the second setting value. The correspondence between the importance level of the data to be stored and the second setting value may be set in advance, for example, the second setting value corresponding to importance level 1 is set to 10, the second setting value corresponding to importance level 2 is set to 8, and the second setting value corresponding to importance level 3 is set to 5.
The higher the importance level of the data to be stored, the smaller the second set value, so that the deduplication number of the corresponding second data block exceeds the second set value sooner and a backup of the second data block is added sooner. For data to be stored with a higher importance level, the embodiment of the invention can thus add backups of the second data block more frequently, reducing the risk that the data cannot be retrieved after being lost.
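The two correspondence tables can be sketched as dictionaries using the example values given above. Applying both tables at once is an assumption of this sketch (the text allows either or both to depend on the importance level), and `duplicates_until_backup` is a hypothetical helper.

```python
# Example values from the description: importance level -> set value.
FIRST_SET_VALUE = {1: 1, 2: 5, 3: 10}
SECOND_SET_VALUE = {1: 10, 2: 8, 3: 5}


def duplicates_until_backup(level: int) -> int:
    """Smallest number of duplicates whose total strictly exceeds the threshold."""
    first = FIRST_SET_VALUE[level]
    second = SECOND_SET_VALUE[level]
    return second // first + 1


needed = {level: duplicates_until_backup(level) for level in (1, 2, 3)}
```

Under these values, a higher importance level needs fewer duplicates before the first backup is added.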
For example, assume that the first set value is 1, the second set value is 10, and one backup of the second data block is added each time a backup is triggered. If the data to be stored consists of 100 first data blocks that all duplicate the same second data block, and the deduplication count is cleared after each backup, roughly 10 backups of that second data block are added in total. The more backups of a second data block there are, the lower the probability that the second data block is lost.
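The arithmetic of this example can be checked with a short simulation. The sketch assumes that all 100 first data blocks duplicate one already stored second data block and that the deduplication count is cleared after every backup; `count_backups` and its `strict` flag are illustrative names, not part of the described method. With the strict "greater than" comparison the count must reach 11 before a backup is triggered, which yields 9 backups; the figure of 10 corresponds to triggering as soon as the count reaches the second set value.

```python
def count_backups(num_duplicates: int, first_set_value: int,
                  second_set_value: int, strict: bool = True) -> int:
    """Count how many backups are added while deduplicating identical blocks."""
    count = 0
    backups = 0
    for _ in range(num_duplicates):
        count += first_set_value  # each deduplication raises the count
        if strict:
            triggered = count > second_set_value
        else:
            triggered = count >= second_set_value
        if triggered:
            backups += 1
            count = 0  # assumption: the count is cleared after each backup
    return backups
```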
Further, in an embodiment, the data processing method further includes:
in the event that a second data block is deleted from the first database, deleting the backup of the corresponding second data block from the second database.
Deleting stored data from the storage space means deleting the second data blocks of that stored data from the first database. When a second data block is deleted from the first database, the backup of that second data block in the second database is deleted at the same time. Once the stored data has been deleted, it never needs to be reconstructed, so the backup of the second data block no longer serves any purpose. Deleting the backup saves disk storage space and improves disk utilization.
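A minimal sketch of this cascading deletion follows. It assumes that both databases are plain dictionaries keyed by a block identifier and that the second database holds a list of backups per block; the function name `delete_block` and this layout are hypothetical.

```python
def delete_block(block_id: str, first_db: dict, second_db: dict) -> None:
    """Delete a second data block and, with it, any backups of that block."""
    first_db.pop(block_id, None)   # remove the block from the first database
    second_db.pop(block_id, None)  # its backups are now useless, so drop them too
```

Using `pop` with a default means a block that never reached the backup threshold can be deleted without a special case.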
When data to be stored is stored in the first database, it is determined, for each first data block of the data to be stored, whether a second data block identical to that first data block is already stored in the first database. If so, the first data block is deduplicated and the deduplication count of the corresponding second data block is increased by the first set value. When the deduplication count of a second data block is greater than the second set value, a backup of that second data block is added. By adding backups of second data blocks that have been deduplicated many times, the embodiment of the invention avoids the situation in which a large number of data blocks of the data to be stored cannot be recovered because one second data block is lost. When such a block is lost, it can be restored promptly from its backup, so that the data to be stored can be reconstructed quickly.
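The method summarized above can be sketched as a small in-memory class. This is a sketch under stated assumptions rather than the patented implementation: the two databases are modelled as dictionaries keyed by the block content itself, the count is cleared after each backup, and the names `DedupStore` and `store_block` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DedupStore:
    first_set_value: int = 1     # increment applied on each deduplication
    second_set_value: int = 10   # threshold that triggers a backup
    first_db: dict = field(default_factory=dict)   # block -> deduplication count
    second_db: dict = field(default_factory=dict)  # block -> list of backups

    def store_block(self, block: bytes) -> bool:
        """Store one first data block; return True if it was deduplicated."""
        if block not in self.first_db:
            # No identical second data block exists: store the block itself.
            self.first_db[block] = 0
            return False
        # An identical second data block exists: deduplicate and count it.
        self.first_db[block] += self.first_set_value
        if self.first_db[block] > self.second_set_value:
            self.second_db.setdefault(block, []).append(block)
            self.first_db[block] = 0  # assumption: clear the count after a backup
        return True
```

For instance, with a first set value of 5 and a second set value of 8, storing the same block five times keeps one copy in `first_db` and adds two backups to `second_db`.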
Referring to fig. 5, fig. 5 is a schematic diagram of another data processing flow provided by an application embodiment of the present invention, where the data processing flow includes:
S501, reading the data to be stored.
S502, dividing the data to be stored into a plurality of data blocks.
S503, calculating the SHA1 value of each data block.
S504, searching the fingerprint database for the SHA1 value of each data block.
The fingerprint database stores the SHA1 value of every data block in the storage space.
S505, determining whether the SHA1 value exists in the fingerprint database.
If so, S506 is performed; if not, S509 is performed.
S506, writing the metadata of the data block.
The data block corresponding to the SHA1 value is not written again; only its metadata is written. The metadata includes the storage address of the already stored data block whose SHA1 value was found in the fingerprint database.
S507, adding 1 to the deduplication count of the data block corresponding to the SHA1 value.
S508, when the deduplication count is greater than the set value, backing up the data block corresponding to the SHA1 value.
That is, when an identical data block is already stored in the storage space, the data block corresponding to the SHA1 value is deduplicated and its deduplication count is increased by 1; when that count becomes greater than the set value, a backup of the data block is added.
S509, storing the SHA1 value in the fingerprint database.
S510, writing the data block corresponding to the SHA1 value.
If no identical data block is stored in the storage space, the data block corresponding to the SHA1 value is written into the storage space, and its fingerprint is stored in the fingerprint database.
By deduplicating the data to be stored, the application embodiment of the invention saves disk storage space and improves disk utilization. Because data blocks are backed up during deduplication, a lost data block does not become unrecoverable: it can be restored promptly from its backup, so that the data to be stored can be reconstructed.
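Steps S501 to S510 can be sketched with Python's standard `hashlib` module. Several details here are assumptions made for illustration: fixed-size chunking with a tiny `BLOCK_SIZE`, dictionaries standing in for the fingerprint database and the storage space, the SHA1 value doubling as the storage address recorded in the metadata, and a reference being recorded for newly written blocks as well so that the data can be reassembled. The names `store` and `reconstruct` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for demonstration


def store(data: bytes, fingerprints: dict, blocks: dict, backups: dict,
          threshold: int = 10) -> list:
    """Run S501-S510 on `data` and return its metadata (ordered SHA1 values)."""
    metadata = []
    for i in range(0, len(data), BLOCK_SIZE):        # S502: split into blocks
        block = data[i:i + BLOCK_SIZE]
        sha1 = hashlib.sha1(block).hexdigest()       # S503: fingerprint
        if sha1 in fingerprints:                     # S504-S505: lookup
            fingerprints[sha1] += 1                  # S507: count the deduplication
            if fingerprints[sha1] > threshold:       # S508: back up hot blocks
                backups[sha1] = block
        else:
            fingerprints[sha1] = 0                   # S509: record the fingerprint
            blocks[sha1] = block                     # S510: write the block
        metadata.append(sha1)                        # S506: reference, not a copy
    return metadata


def reconstruct(metadata: list, blocks: dict, backups: dict) -> bytes:
    """Rebuild the data, falling back to a backup when a block is lost."""
    out = []
    for sha1 in metadata:
        out.append(blocks[sha1] if sha1 in blocks else backups[sha1])
    return b"".join(out)
```

If a heavily deduplicated block is later lost from `blocks`, `reconstruct` can still rebuild the data from `backups`, which is the recovery benefit described above.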
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Referring to fig. 6, fig. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes a determining module, a deduplication module and an adding module.
The determining module is configured to determine, when data to be stored is stored in a first database, for each first data block of all first data blocks forming the data to be stored, whether a second data block identical to the first data block is stored in the first database;
the deduplication module is configured to deduplicate the corresponding first data block and increase the deduplication count of the corresponding second data block by a first set value, in the case that a second data block identical to the first data block is stored in the first database;
and the adding module is configured to add a backup of the corresponding second data block in the case that the deduplication count of the second data block is greater than a second set value.
The adding module is specifically configured to:
add a backup of the corresponding second data block to the second database.
The apparatus further includes:
a clearing module, configured to clear the deduplication count of the corresponding second data block.
The apparatus further includes:
a deletion module, configured to delete the backup of the corresponding second data block from the second database in the case that a second data block is deleted from the first database.
The apparatus further includes:
a set value determining module, configured to determine the importance level of the data to be stored;
and to determine, based on the importance level of the data to be stored, at least one of:
the first set value;
the second set value.
The apparatus further includes:
a metadata processing module, configured to determine metadata of the first data block, the metadata at least including the storage address of the corresponding second data block in the first database;
and to store the metadata of the first data block into the first database.
The apparatus further includes:
a storage module, configured to store the corresponding first data block into the first database in the case that no identical second data block is stored in the first database.
It should be noted that: in the data processing apparatus provided in the above embodiment, when performing data processing, only the division of the above modules is exemplified, and in practical applications, the processing may be distributed to different modules as needed, that is, the internal structure of the apparatus may be divided into different modules to complete all or part of the processing described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention. The electronic device may be a mobile phone, a tablet, a server, or the like. As shown in fig. 7, the electronic device of this embodiment includes a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps in the various method embodiments described above, such as steps 201 to 203 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the modules in the above-described apparatus embodiments, such as the functions of the determining module, the deduplication module, and the adding module shown in fig. 6.
Illustratively, the computer program may be partitioned into one or more modules that are stored in the memory and executed by the processor to implement the invention. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the electronic device.
The electronic device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that fig. 7 is merely an example of an electronic device and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components, e.g., the electronic device may also include input-output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (FlashCard), and the like, provided on the electronic device. Further, the memory may also include both an internal storage unit and an external storage device of the electronic device. The memory is used for storing the computer program and other programs and data required by the electronic device. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. For example, the above-described apparatus/electronic device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, under legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A data processing method, comprising:
when data to be stored is stored in a first database, determining whether a second data block which is the same as the first data block is stored in the first database or not for each first data block in all first data blocks forming the data to be stored;
under the condition that a second data block identical to the first data block is stored in the first database, carrying out duplication elimination on the corresponding first data block, and increasing the duplication elimination number of the corresponding second data block by a first set value;
and in the case that the deduplication number of the second data block is greater than a second set value, adding a backup for the corresponding second data block.
2. The method of claim 1, wherein adding the backup for the corresponding second data block comprises:
a backup is added to the second database for the corresponding second data block.
3. The method of claim 2, wherein in the case that the number of deduplication times of the second data block is greater than a second set value, the method further comprises:
and clearing the deduplication times of the corresponding second data blocks.
4. The method of claim 2, further comprising:
in the event that a second data block is deleted from the first database, the backup for the corresponding second data block is deleted in the second database.
5. The method of claim 1, further comprising:
determining the importance level of the data to be stored;
based on the importance level of the data to be stored, determining at least one of:
the first set value;
the second set value.
6. The method of claim 1, wherein the de-duplicating the corresponding first data block further comprises:
determining metadata of the first data block; the metadata at least comprises a storage address of the corresponding second data block in the first database;
and storing the metadata of the first data block into the first database.
7. The method of claim 1, further comprising:
and storing the corresponding first data block into the first database under the condition that a second data block identical to the first data block is not stored in the first database.
8. A data processing apparatus, comprising:
a determining module, configured to determine, when data to be stored is stored in a first database, for each first data block of all first data blocks forming the data to be stored, whether a second data block identical to the first data block is stored in the first database;
a deduplication module, configured to deduplicate the corresponding first data block and increase the deduplication number of the corresponding second data block by a first set value, in the case that a second data block identical to the first data block is stored in the first database;
and an adding module, configured to add a backup for the corresponding second data block in the case that the deduplication number of the second data block is greater than a second set value.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the data processing method according to any one of claims 1 to 7.
CN202010402288.0A 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium Active CN111625186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402288.0A CN111625186B (en) 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111625186A true CN111625186A (en) 2020-09-04
CN111625186B CN111625186B (en) 2023-11-07

Family

ID=72271947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010402288.0A Active CN111625186B (en) 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111625186B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100223495A1 (en) * 2009-02-27 2010-09-02 Leppard Andrew Minimize damage caused by corruption of de-duplicated data
CN102308288A (en) * 2009-02-06 2012-01-04 国际商业机器公司 Backup of deduplicated data
CN102436478A (en) * 2011-10-12 2012-05-02 浪潮(北京)电子信息产业有限公司 System and method for accessing massive data
CN110941514A (en) * 2019-11-25 2020-03-31 湖北工业大学 Data backup method, data recovery method, computer equipment and storage medium
CN111124259A (en) * 2018-10-31 2020-05-08 深信服科技股份有限公司 Data compression method and system based on full flash memory array
CN111124939A (en) * 2018-10-31 2020-05-08 深信服科技股份有限公司 Data compression method and system based on full flash memory array

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
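朱江; 冀鸣; 杨志成; 张嘉贤; 曹雄: "Analysis of storage systems based on data deduplication technology", 信息系统工程 (Information Systems Engineering), no. 04, pages 70 *
韩莹; 王茂发; 张艳霞: "A data integrity verification algorithm for deduplication backup systems", 计算机应用研究 (Application Research of Computers), no. 06, pages 1819 *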

Also Published As

Publication number Publication date
CN111625186B (en) 2023-11-07

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant