CN111625186B - Data processing method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN111625186B
Authority
CN
China
Prior art keywords
data
data block
stored
database
set value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010402288.0A
Other languages
Chinese (zh)
Other versions
CN111625186A (en)
Inventor
葛绪意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202010402288.0A priority Critical patent/CN111625186B/en
Publication of CN111625186A publication Critical patent/CN111625186A/en
Application granted granted Critical
Publication of CN111625186B publication Critical patent/CN111625186B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application are applicable to the technical field of computers and provide a data processing method, a device, electronic equipment and a storage medium. The data processing method comprises the following steps: when data to be stored is stored in a first database, determining, for each of the first data blocks constituting the data to be stored, whether a second data block identical to that first data block is stored in the first database; in the case that a second data block identical to the first data block is stored in the first database, deduplicating the corresponding first data block and increasing the number of deduplication times of the corresponding second data block by a first set value; and in the case that the number of deduplication times of the second data block is greater than a second set value, adding a backup of the corresponding second data block.

Description

Data processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a storage medium.
Background
Data deduplication is a technique applied in storage systems for eliminating redundant data, where a data deduplication method divides a data stream or file into data blocks of a fixed size, and eliminates duplicate data blocks by comparing fingerprints of the data blocks. At present, when the related technology performs data deduplication, the data blocks of the same fingerprint only store one copy, so that storage resources are saved. However, when a data block is lost or damaged, the file is not available because the data block cannot be retrieved.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a data processing method, apparatus, electronic device, and storage medium, so as to at least solve the problem that the data block cannot be retrieved when lost or damaged after the data is de-duplicated.
The technical scheme of the embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a data processing method, including:
when data to be stored is stored in a first database, determining whether second data blocks identical to the first data blocks are stored in the first database for each of all first data blocks constituting the data to be stored;
under the condition that a second data block which is the same as the first data block is stored in the first database, performing de-duplication on the corresponding first data block, and increasing the de-duplication times of the corresponding second data block by a first set value;
and adding the backup related to the corresponding second data block in the case that the number of times of de-duplication of the second data block is larger than a second set value.
In the above solution, adding the backup of the corresponding second data block includes:
adding the backup of the corresponding second data block in a second database.
In the above solution, in the case where the number of deduplication times of the second data block is greater than the second set value, the method further includes:
clearing the number of deduplication times of the corresponding second data block.
In the above scheme, the method further comprises:
in the event that a second data block is deleted from the first database, a backup for the corresponding second data block is deleted in the second database.
In the above scheme, the method further comprises:
determining the importance level of the data to be stored;
based on the importance level of the data to be stored, at least one of the following is determined:
the first set value;
the second set value.
In the above solution, when deduplicating the corresponding first data block, the method further includes:
determining metadata of the first data block, where the metadata at least comprises the storage address of the corresponding second data block in the first database;
and storing the metadata of the first data block into the first database.
In the above scheme, the method further comprises:
and storing the corresponding first data block into the first database under the condition that the same second data block is not stored in the first database.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
the determining module is used for determining whether second data blocks which are identical to the first data blocks are stored in the first database or not for each first data block in all first data blocks forming the data to be stored when the data to be stored is stored in the first database;
the de-duplication module is used for de-duplication of the corresponding first data block and increasing the de-duplication times of the corresponding second data block by a first set value under the condition that the second data block which is the same as the first data block is stored in the first database;
and the adding module is used for adding the backup related to the corresponding second data block under the condition that the number of times of de-duplication of the second data block is larger than a second set value.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other, where the memory is configured to store a computer program, the computer program including program instructions, and the processor is configured to invoke the program instructions to perform the steps of the data processing method provided in the first aspect of the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the data processing method provided in the first aspect of the embodiment of the present application.
In the embodiments of the application, when data to be stored is stored in a first database, it is determined, for each of the first data blocks constituting the data to be stored, whether a second data block identical to that first data block is stored in the first database; if so, the corresponding first data block is deduplicated and the number of deduplication times of the corresponding second data block is increased by a first set value; and when the number of deduplication times of the second data block is greater than a second set value, a backup of the corresponding second data block is added. By adding backups of second data blocks that have been deduplicated many times, the embodiments of the application prevent a situation in which a large number of data blocks of the data to be stored cannot be recovered when a second data block is lost: the lost second data block can be restored promptly from its backup, so the data to be stored can be reconstructed quickly.
Drawings
FIG. 1 is a schematic diagram of an implementation flow of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation flow of another data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an implementation flow of another data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation flow of another data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data processing flow provided by an embodiment of the present application;
FIG. 6 is a block diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The technical solutions described in the embodiments of the present application may be combined arbitrarily, provided that they do not conflict with one another.
In addition, in the embodiments of the present application, "first", "second", etc. are used to distinguish similar objects and are not necessarily used to describe a particular order or precedence.
Referring to fig. 1, fig. 1 is an exemplary diagram of data deduplication according to an embodiment of the present application. The data to be processed comprises data blocks A, B, C and D, each of which occurs more than once in the data to be processed. After data deduplication, only one copy of each of the data blocks A, B, C and D is retained, and no data block is stored repeatedly. Data deduplication saves disk storage space, improves disk write performance, and reduces the amount of data transmitted over the network, thereby saving network bandwidth.
As one implementation of data deduplication, the data to be processed is divided into a plurality of data blocks of the same length, duplicate data blocks are found and deleted through fingerprint comparison and/or byte comparison, and finally only one copy of each identical data block is stored. Fingerprint comparison refers to comparing the fingerprints of the data blocks, where the fingerprint of a data block may be the Secure Hash Algorithm 1 (SHA-1) value or the Message-Digest Algorithm 5 (MD5) value of the data block. Byte comparison refers to comparing two data blocks byte by byte; if any compared bytes differ, the two data blocks are considered different.
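The blocking and comparison described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes Python, a hypothetical 4-byte block length chosen only to keep the example readable, and illustrative names (`split_into_blocks`, `fingerprint`, `same_block`) that do not appear in the application.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # hypothetical tiny block length, for readability only


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Divide data into blocks of the same length (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block: bytes) -> str:
    """Use the SHA-1 value of a block as its fingerprint."""
    return hashlib.sha1(block).hexdigest()


def same_block(a: bytes, b: bytes) -> bool:
    """Fingerprint comparison, followed by a byte comparison as confirmation."""
    return fingerprint(a) == fingerprint(b) and a == b


blocks = split_into_blocks(b"AAAABBBBAAAACCCC")

# Keep only one copy of each distinct block, keyed by fingerprint.
unique: Dict[str, bytes] = {}
for block in blocks:
    unique.setdefault(fingerprint(block), block)
```

Here the first and third blocks are identical, so only three of the four blocks are retained.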
Since only one copy of each identical data block is stored after deduplication, a data block that is lost or damaged cannot be recovered, and the data that depends on it cannot be used.
To address the defect in the related art that data cannot be recovered when a data block is lost or damaged after deduplication, the embodiment of the application provides a data processing method that can recover a data block that is lost or damaged after the data has been deduplicated. To better illustrate the technical solution of the present application, specific examples are described below.
Fig. 2 is a schematic implementation flow chart of a data processing method according to an embodiment of the present application, where an execution body of the method may be an electronic device such as a mobile phone, a tablet, a server, etc. Referring to fig. 2, the data processing method includes:
S201, when data to be stored is stored in the first database, determining, for each of the first data blocks constituting the data to be stored, whether an identical second data block is stored in the first database.
When the data to be stored is stored in the first database, the data to be stored is divided into a plurality of first data blocks with the same length. For example, the block length of the first data block is set to 4KB. The smaller the block length of the first data block, the better the data deduplication effect.
For each first data block, when the first data block is written into the first database, it is determined whether a second data block identical to the first data block is stored in the first database.
In practical applications, it may be determined by fingerprint comparison or byte comparison whether a second data block identical to the first data block is stored in the first database, where identical refers to that the data contents between two data blocks are completely identical.
For example, with fingerprint comparison, the fingerprint of the first data block is first calculated and then looked up in a preset fingerprint library to determine whether a match is found. The fingerprints in the preset fingerprint library are the fingerprints of the second data blocks. If the fingerprint of the first data block finds a match in the preset fingerprint library, that is, the fingerprint of the first data block is identical to the fingerprint of a second data block, the first data block is determined to be a duplicate data block, meaning that the first database already stores a second data block identical to the first data block. If no match is found in the preset fingerprint library, the first data block is determined not to be a duplicate data block, meaning that the first database does not store a second data block identical to the first data block.
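The fingerprint lookup can be sketched as a dictionary query. The sketch below is an assumption-laden illustration: the fingerprint library is modelled as an in-memory Python dictionary mapping SHA-1 fingerprints to storage addresses, and the names `fingerprint_library`, `register` and `find_duplicate` are hypothetical.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical fingerprint library: fingerprint of a second data block -> its storage address.
fingerprint_library: Dict[str, int] = {}


def register(second_block: bytes, address: int) -> None:
    """Record the fingerprint of a second data block already stored in the first database."""
    fingerprint_library[hashlib.sha1(second_block).hexdigest()] = address


def find_duplicate(first_block: bytes) -> Optional[int]:
    """Return the address of an identical second data block, or None if there is no match."""
    return fingerprint_library.get(hashlib.sha1(first_block).hexdigest())


register(b"AAAA", 0)
hit = find_duplicate(b"AAAA")   # a match: the first data block is a duplicate
miss = find_duplicate(b"BBBB")  # no match: the first data block is not a duplicate
```

Because address 0 is a valid result, callers of this sketch should test the return value with `is not None` rather than by truthiness.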
S202, when a second data block identical to the first data block is stored in the first database, performing de-duplication on the corresponding first data block, and increasing the de-duplication times of the corresponding second data block by a first set value.
The first database stores the second data block identical to the first data block, that is, the first data block already exists in the first database, and in order to save the storage space of the disk, the corresponding first data block is deduplicated.
Deduplicating the corresponding first data block means deleting the first data block from the data to be stored, that is, not writing the first data block into the first database. Although the data block is deleted from the data to be stored, the embodiment of the present application stores metadata of the first data block so that the data to be stored can be reconstructed. The metadata includes the storage address of the corresponding second data block in the first database. When the data to be stored is reconstructed, the corresponding second data block can be found according to the storage address in the metadata of the first data block.
Referring to fig. 3, in an embodiment, when deduplicating the corresponding first data block, the method further includes:
S301, determining metadata of the first data block; the metadata includes at least a storage address of the corresponding second data block in the first database.
Metadata is data describing the attributes of the data, and in embodiments of the present application, the metadata is used to record the storage address of a second data block that is identical to the first data block.
S302, metadata of the first data block are stored in the first database.
In the embodiment of the application, when a second data block identical to the first data block is stored in the first database, the first data block itself is not stored in the first database; only the metadata of the first data block is stored. When the data to be stored is reconstructed, the metadata is read, the second data block is read according to its storage address in the metadata, and the data to be stored is reconstructed using the second data block. Because the metadata is simple and occupies little storage space, the amount of data written is reduced, disk storage space is saved, and disk utilization is improved.
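The role of the metadata in reconstruction can be sketched as follows. This is a simplified illustration under several assumptions: the first database is modelled as a Python list (`block_store`) whose indices serve as storage addresses, a storage address is recorded for every block rather than only for deduplicated ones, and all names are hypothetical.

```python
import hashlib
from typing import Dict, List

block_store: List[bytes] = []      # stand-in for block storage in the first database
fingerprints: Dict[str, int] = {}  # fingerprint -> storage address (index into block_store)


def write(data: bytes, block_size: int = 4) -> List[int]:
    """Store data and return its metadata: one storage address per first data block."""
    metadata = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha1(block).hexdigest()
        if fp not in fingerprints:
            # No identical second data block exists yet: store the block itself.
            fingerprints[fp] = len(block_store)
            block_store.append(block)
        # For a duplicate, only the address of the existing block is recorded.
        metadata.append(fingerprints[fp])
    return metadata


def reconstruct(metadata: List[int]) -> bytes:
    """Rebuild the data by reading each block from its recorded storage address."""
    return b"".join(block_store[address] for address in metadata)


meta = write(b"AAAABBBBAAAAAAAA")
restored = reconstruct(meta)
```

In this run, four blocks are written but only two are physically stored; the metadata is enough to rebuild the original byte string.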
In the embodiment of the application, when the second data block which is the same as the first data block is stored in the first database, the corresponding first data block is subjected to de-duplication, and the de-duplication frequency of the corresponding second data block is increased by a first set value.
For each second data block in the first database, the number of times of deduplication of the second data block is increased by a first set value every time the first data block is detected to be identical to the second data block. Here, the first setting value is related to an importance level of data to be stored. For example, the higher the importance level of the data to be stored, the larger the first setting value, and the specific setting of the first setting value will be described in the subsequent embodiments.
Further, in an embodiment, in a case where the second data block identical to the first data block is not stored in the first database, the corresponding first data block is stored in the first database.
If the same second data block is not stored in the first database, the first data block is stored in the first database without performing duplicate removal processing on the first data block.
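The two branches above (deduplicate and count, or store) can be sketched together. This is a minimal sketch under assumptions: the first database is an in-memory dictionary keyed directly by block content for brevity (a real system would more likely key by fingerprint), and `SecondBlock` and `store_block` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SecondBlock:
    content: bytes
    dedup_count: int = 0  # number of deduplication times


first_database: Dict[bytes, SecondBlock] = {}


def store_block(first_block: bytes, first_set_value: int = 1) -> bool:
    """Return True if the block was deduplicated, False if it was newly stored."""
    existing = first_database.get(first_block)
    if existing is None:
        # No identical second data block: store the first data block as-is.
        first_database[first_block] = SecondBlock(first_block)
        return False
    # An identical second data block exists: skip the write and raise its count.
    existing.dedup_count += first_set_value
    return True


results = [store_block(b) for b in (b"A", b"B", b"A", b"A")]
```

After this run, block `A` has been deduplicated twice and block `B` not at all.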
And S203, adding a backup related to the corresponding second data block in the case that the number of times of de-duplication of the second data block is larger than a second set value.
When the number of deduplication times of a second data block is greater than the second set value, the data to be stored contains many first data blocks identical to that second data block, so the second data block is very important for reconstructing the data to be stored: if it is lost, a large part of the data blocks in the data to be stored cannot be recovered. Therefore, when the number of deduplication times of the second data block is greater than the second set value, a backup of the corresponding second data block is added, so that the data block can still be retrieved after it is lost or damaged and the data to be stored can still be reconstructed.
Further, in an embodiment, adding the backup of the corresponding second data block includes:
adding the backup of the corresponding second data block in a second database.
The second data block is stored in the first database, and the backup of the second data block is stored in a different database, so that the backup of the second data block can be prevented from being deleted when the data of the first database is deduplicated. It is also possible to avoid that the second data block is lost due to a first database error, and thus cannot be restored.
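A minimal sketch of step S203 with a separate second database follows. It assumes in-memory dictionaries for both databases, a hypothetical block identifier `blk-A`, and an illustrative second set value of 3. The count is deliberately not cleared here, since clearing is described as a further embodiment; without clearing, every deduplication after the threshold is exceeded would add another backup.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the two databases.
first_database: Dict[str, dict] = {"blk-A": {"content": b"AAAA", "dedup_count": 0}}
second_database: Dict[str, List[bytes]] = {}  # block id -> backup copies


def on_deduplicated(block_id: str, first_set_value: int = 1, second_set_value: int = 3) -> bool:
    """Raise the count for a second data block; add a backup once the count exceeds the threshold."""
    entry = first_database[block_id]
    entry["dedup_count"] += first_set_value
    if entry["dedup_count"] > second_set_value:
        # The backup goes to a different database than the original block.
        second_database.setdefault(block_id, []).append(entry["content"])
        return True
    return False


added = [on_deduplicated("blk-A") for _ in range(4)]
```

With a second set value of 3, the first three deduplications add nothing and the fourth adds one backup to the second database.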
Further, in an embodiment, in a case that the number of deduplication times of the second data block is greater than the second set value, the method further includes:
clearing the number of deduplication times of the corresponding second data block.
In the embodiment of the application, when the number of times of de-duplication of the second data block is larger than the second set value, adding a backup related to the corresponding second data block, and simultaneously clearing the number of times of de-duplication of the corresponding second data block. That is, in the embodiment of the present application, the backup of the second data block may be repeatedly added, after each addition of the backup of the second data block, the number of deduplication times of the second data block is cleared, and when the number of deduplication times of the second data block is again greater than the second set value, the backup of the second data block is newly added, that is, the backup of the second data block may be multiple.
And under the condition that the number of times of de-duplication of the second data block is larger than a second set value, resetting the corresponding number of times of de-duplication of the second data block, so that the backup of the second data block can be repeatedly added, the number of the second data block backups is increased, and the recovery capacity after the second data block is lost is enhanced.
In practical applications, a set number of backups of the second data block may be added each time. For example, one backup of the second data block is added each time. Alternatively, the set number may be determined according to the importance level of the data to be stored: the higher the importance level, the greater the set number. The multiple backups may also be stored in different databases, which reduces the probability of the second data block being lost.
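Clearing the count and spreading repeated backups over several databases can be sketched as below. The round-robin placement, the two backup databases, the second set value of 2 and the names `TrackedBlock` and `record_dedup` are all assumptions made for illustration; the application does not prescribe a placement policy.

```python
from typing import Dict, List, Tuple

# Two hypothetical backup databases; backups are spread across them in turn.
backup_databases: List[Dict[Tuple[str, int], bytes]] = [{}, {}]


class TrackedBlock:
    def __init__(self, block_id: str, content: bytes) -> None:
        self.block_id = block_id
        self.content = content
        self.dedup_count = 0
        self.backups_added = 0

    def record_dedup(self, first_set_value: int = 1, second_set_value: int = 2,
                     set_number: int = 1) -> None:
        """Raise the count; when it exceeds the threshold, add backups and clear the count."""
        self.dedup_count += first_set_value
        if self.dedup_count > second_set_value:
            for _ in range(set_number):
                target = backup_databases[self.backups_added % len(backup_databases)]
                target[(self.block_id, self.backups_added)] = self.content
                self.backups_added += 1
            self.dedup_count = 0  # clearing allows a later backup to be triggered again


block = TrackedBlock("blk-A", b"AAAA")
for _ in range(6):
    block.record_dedup()
```

Six deduplications with a second set value of 2 trigger two backups, one in each backup database, and leave the count at zero.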
Referring to fig. 4, in an embodiment, the data processing method further includes:
S401, determining the importance level of the data to be stored.
In practical applications, the importance level of the data to be stored may be related to the data type of the data to be stored, to its file size, or to its source; for example, data to be stored from different devices, or from different operating systems, may have different importance levels. For example, the larger the file size of the data to be stored, the higher its importance level. The importance level may be divided according to file size, with each file size interval corresponding to one level. For example, data to be stored whose file size falls in the 0-10M interval has importance level 1, and data to be stored in a larger size interval has importance level 2. For another example, in an enterprise system, the data to be stored of the system responsible for daily staff attendance data has importance level 1, and the data to be stored of the system responsible for operation data has importance level 2. For another example, the more often a device communicates with the user, the higher the importance level of the data to be stored from that device.
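Mapping a file size to an importance level by interval can be sketched as follows. Only the 0-10M interval comes from the example above, and this sketch assumes it corresponds to level 1; the 100M bound, the reading of "M" as MiB, the inclusive upper bounds and the names `SIZE_BOUNDS` and `importance_level` are all hypothetical.

```python
import bisect

MIB = 1024 * 1024

# Hypothetical inclusive upper bounds of the file size intervals, in bytes.
# Sizes up to 10M map to level 1, up to 100M to level 2, anything larger to level 3.
SIZE_BOUNDS = [10 * MIB, 100 * MIB]


def importance_level(file_size: int) -> int:
    """Return the importance level of the interval that contains file_size."""
    return bisect.bisect_left(SIZE_BOUNDS, file_size) + 1


levels = [importance_level(size) for size in (0, 10 * MIB, 10 * MIB + 1, 500 * MIB)]
```

A lookup by data type or by data source could be sketched the same way with a dictionary instead of sorted bounds.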
S402, determining at least one of the following based on the importance level of the data to be stored:
the first set value;
the second set value.
In the embodiment of the application, one of the first setting value and the second setting value can be determined based on the importance level of the data to be stored, or the first setting value and the second setting value can be determined simultaneously.
In practical application, the higher the importance level of the data to be stored is, the larger the first set value is. Each importance level can correspond to a first set value, the corresponding relation between the importance level and the first set value is written into the data table in advance, and when the first set value is determined, the first set value corresponding to the importance level of the data to be stored can be obtained by reading the data table. For example, the first setting value corresponding to the importance level 1 is set to 1, the first setting value corresponding to the importance level 2 is set to 5, and the first setting value corresponding to the importance level 3 is set to 10.
The higher the importance level of the data to be stored, the larger the first set value, so the number of deduplication times of the corresponding second data block grows by a larger amount each time and exceeds the second set value sooner, and a backup of the second data block is added sooner. In this way, the embodiment of the application adds backups of the second data block more frequently for data to be stored with a higher importance level, reducing the risk that the data cannot be retrieved after being lost.
In practical application, the higher the importance level of the data to be stored is, the smaller the second set value is. The correspondence between the importance level and the second set value of the data to be stored may be preset, for example, the second set value corresponding to the importance level 1 is set to 10, the second set value corresponding to the importance level 2 is set to 8, and the second set value corresponding to the importance level 3 is set to 5.
The higher the importance level of the data to be stored, the smaller the second set value is, so that the number of times of de-duplication of the corresponding second data block can reach the second set value faster, and the faster the backup of the second data block is added. The embodiment of the application can more frequently add the backup of the second data block aiming at the data to be stored with higher importance level, thereby reducing the risk of being unable to retrieve after the data is lost.
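The example correspondences given above can be written as two lookup tables. The table values are the ones from the examples; applying both tables at the same time, and the helper names `set_values` and `dedups_until_backup`, are assumptions of this sketch.

```python
from typing import Tuple

# Example correspondences from the text: importance level -> set value.
FIRST_SET_VALUE = {1: 1, 2: 5, 3: 10}
SECOND_SET_VALUE = {1: 10, 2: 8, 3: 5}


def set_values(importance_level: int) -> Tuple[int, int]:
    """Look up the first and second set values for an importance level."""
    return FIRST_SET_VALUE[importance_level], SECOND_SET_VALUE[importance_level]


def dedups_until_backup(importance_level: int) -> int:
    """Smallest number n of deduplications such that n * first set value > second set value."""
    first, second = set_values(importance_level)
    return second // first + 1


needed = [dedups_until_backup(level) for level in (1, 2, 3)]
```

With both tables applied together, a level 1 block needs 11 deduplications before its first backup, a level 2 block needs 2, and a level 3 block needs only 1.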
For example, assume that the first set value is 1, the second set value is 10, and one backup of the second data block is added each time. If the data to be stored consists of 100 first data blocks that all duplicate the same second data block, about 10 backups of that second data block are eventually added. The more backups of the second data block there are, the lower the probability of losing it.
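The arithmetic of this example can be checked with a small simulation. The sketch below is an assumption-laden illustration, not the patented implementation: the function name and the `strict` flag are hypothetical, and the counter is reset after each backup as in the resetting step of the method. The exact total depends on whether a backup is triggered when the count exceeds the second set value or when it merely reaches it.

```python
def count_backups(dedup_events: int,
                  first_set_value: int = 1,
                  second_set_value: int = 10,
                  strict: bool = True) -> int:
    """Simulate repeated de-duplication of one block and count the backups added.

    strict=True  -> backup when the count is greater than the second set value
    strict=False -> backup when the count reaches the second set value
    """
    count = 0
    backups = 0
    for _ in range(dedup_events):
        count += first_set_value
        if strict:
            triggered = count > second_set_value
        else:
            triggered = count >= second_set_value
        if triggered:
            backups += 1   # one backup is added each time
            count = 0      # the deduplication count is reset after a backup
    return backups


if __name__ == "__main__":
    print(count_backups(100, strict=True))
    print(count_backups(100, strict=False))
```

Under the strict comparison a backup is added every 11th de-duplication, and under the non-strict comparison every 10th, so the total for 100 de-duplications is 9 or 10 respectively.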
Further, in an embodiment, the data processing method further includes:
in the case that a second data block is deleted from the first database, deleting the backup of the corresponding second data block in the second database.
Deleting stored data from the storage space means deleting the second data blocks of that stored data from the first database. When a second data block is deleted from the first database, the embodiment of the present application also deletes the backup of the corresponding second data block from the second database. Because the stored data has been deleted, it no longer needs to be reconstructed and the backup of the second data block no longer serves any purpose. Deleting the backup therefore saves disk storage space and improves disk utilization.
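The paired deletion can be sketched as follows. This is a minimal in-memory illustration: the class `TwoDatabaseStore`, its method names, and the use of dictionaries for the first and second databases are all assumptions made for the example.

```python
class TwoDatabaseStore:
    """Toy model: the first database holds blocks, the second holds backups."""

    def __init__(self):
        self.first_db = {}    # block id -> second data block (bytes)
        self.second_db = {}   # block id -> list of backup copies

    def put_block(self, block_id, block):
        self.first_db[block_id] = block

    def add_backup(self, block_id):
        # Copy the block currently held in the first database.
        self.second_db.setdefault(block_id, []).append(self.first_db[block_id])

    def delete_block(self, block_id):
        """Delete a block and, with it, every backup of that block."""
        if self.first_db.pop(block_id, None) is None:
            return False                      # nothing to delete
        self.second_db.pop(block_id, None)    # backups are no longer needed
        return True
```

Removing the backups in the same operation keeps the second database from accumulating copies of blocks that can never be requested again.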
When the data to be stored is stored in the first database, it is determined, for each first data block among all the first data blocks constituting the data to be stored, whether a second data block identical to the first data block is stored in the first database. In the case that such a second data block is stored in the first database, the corresponding first data block is de-duplicated and the deduplication count of the corresponding second data block is increased by a first set value. In the case that the deduplication count of the second data block is greater than a second set value, a backup of the corresponding second data block is added. By adding backups of second data blocks that have been de-duplicated many times, the embodiment of the present application prevents the loss of one second data block from making a large number of data blocks of the data to be stored unrecoverable. When a second data block is lost, it can be restored promptly from its backup, so that the data to be stored is quickly reconstructed.
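How a backup allows lost data to be reconstructed can be sketched as below. The text does not specify a recovery procedure, so everything here is an assumption for illustration: the function `reconstruct`, the idea of a "recipe" (an ordered list of block identifiers standing in for the stored metadata), and the dictionary-based databases.

```python
def reconstruct(recipe, first_db, second_db):
    """Rebuild stored data from its ordered block identifiers.

    If a block is missing from the first database, it is restored from a
    backup in the second database before being used.
    """
    output = bytearray()
    for block_id in recipe:
        block = first_db.get(block_id)
        if block is None:
            copies = second_db.get(block_id, [])
            if not copies:
                raise KeyError(f"block {block_id!r} is lost and has no backup")
            block = copies[0]
            first_db[block_id] = block   # repair the first database
        output += block
    return bytes(output)
```

A single lost block that many pieces of data refer to would otherwise break every one of them, which is why heavily de-duplicated blocks are the ones worth backing up.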
Referring to fig. 5, fig. 5 is a schematic diagram of another data processing flow provided by an embodiment of the present application, where the data processing flow includes:
S501, reading data to be stored.
S502, dividing the data to be stored into a plurality of data blocks.
S503, calculating the SHA1 value of each data block.
S504, searching for the SHA1 value of each data block in the fingerprint library.
The fingerprint library stores the SHA1 value of every data block in the storage space.
S505, determining whether the SHA1 value exists in the fingerprint library.
If so, S506 is performed; if not, S509 is performed.
S506, writing the metadata of the data block.
Only the metadata of the data block is written; the data block corresponding to the SHA1 value is not written again. The metadata includes the storage address of the data block that corresponds to the SHA1 value in the fingerprint library.
S507, adding 1 to the deduplication count of the data block corresponding to the SHA1 value.
S508, backing up the data block corresponding to the SHA1 value when the deduplication count is greater than the set value.
In the case that an identical data block is already stored in the storage space, the data block corresponding to the SHA1 value is de-duplicated and its deduplication count is incremented by 1. When that deduplication count is greater than the set value, a backup of the data block corresponding to the SHA1 value is added.
S509, storing the SHA1 value in the fingerprint library.
S510, writing the data block corresponding to the SHA1 value.
In the case that no identical data block is stored in the storage space, the data block corresponding to the SHA1 value is written into the storage space and its fingerprint is stored in the fingerprint library.
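Steps S501 to S510 can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the patented implementation: the class `Fig5Store`, fixed-size chunking, the default block size and set value, and the dictionary layout are all hypothetical. Resetting the count after a backup follows the resetting step of the method, which the Fig. 5 flow itself does not list, and recording an address for newly written blocks is an added convenience.

```python
import hashlib


class Fig5Store:
    """Toy in-memory model of the Fig. 5 flow (all names are hypothetical)."""

    def __init__(self, block_size=4, set_value=2):
        self.block_size = block_size   # assumed fixed-size chunking
        self.set_value = set_value     # backup threshold (the "set value")
        self.fingerprints = {}         # SHA1 hex digest -> storage address
        self.blocks = {}               # storage address -> block bytes
        self.dedup_counts = {}         # SHA1 hex digest -> deduplication count
        self.backups = {}              # SHA1 hex digest -> list of backup copies

    def write(self, data):
        """Store data (S501) and return its metadata as a list of addresses."""
        metadata = []
        for start in range(0, len(data), self.block_size):       # S502
            block = data[start:start + self.block_size]
            digest = hashlib.sha1(block).hexdigest()              # S503
            if digest in self.fingerprints:                       # S504, S505
                address = self.fingerprints[digest]
                metadata.append(address)                          # S506: metadata only
                count = self.dedup_counts.get(digest, 0) + 1      # S507
                if count > self.set_value:                        # S508
                    self.backups.setdefault(digest, []).append(self.blocks[address])
                    count = 0   # reset, borrowed from the method's resetting step
                self.dedup_counts[digest] = count
            else:
                address = len(self.blocks)
                self.fingerprints[digest] = address               # S509
                self.blocks[address] = block                      # S510
                metadata.append(address)   # assumption: new blocks also get metadata
        return metadata
```

Writing `b"AAAA" * 5 + b"BBBB"` with the defaults stores two physical blocks, de-duplicates the repeated block four times, and adds one backup of it on the third de-duplication.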
By de-duplicating the data to be stored, the embodiment of the present application saves disk storage space and improves disk utilization. Backing up data blocks during de-duplication avoids the situation in which a lost data block cannot be retrieved: when a loss occurs, the backup can be used promptly to restore the data block and thereby reconstruct the data to be stored.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application.
Referring to fig. 6, fig. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus includes a determining module, a de-duplication module, and an adding module.
The determining module is configured to determine, when data to be stored is stored in the first database and for each first data block among all the first data blocks constituting the data to be stored, whether a second data block identical to the first data block is stored in the first database;
the de-duplication module is configured to, in the case that a second data block identical to the first data block is stored in the first database, de-duplicate the corresponding first data block and increase the deduplication count of the corresponding second data block by a first set value;
the adding module is configured to add a backup of the corresponding second data block in the case that the deduplication count of the second data block is greater than a second set value.
The adding module is specifically configured to:
add a backup of the corresponding second data block in the second database.
The apparatus further comprises:
a resetting module, configured to reset the deduplication count of the corresponding second data block to zero.
The apparatus further comprises:
a deleting module, configured to delete the backup of the corresponding second data block in the second database in the case that the second data block is deleted from the first database.
The apparatus further comprises:
a set value determining module, configured to determine the importance level of the data to be stored;
and to determine, based on the importance level of the data to be stored, at least one of the following:
the first set value;
the second set value.
The apparatus further comprises:
a metadata processing module, configured to determine metadata of the first data block, the metadata including at least the storage address of the corresponding second data block in the first database;
and to store the metadata of the first data block in the first database.
The apparatus further comprises:
a storage module, configured to store the corresponding first data block in the first database in the case that no second data block identical to the first data block is stored in the first database.
It should be noted that: in the data processing apparatus provided in the foregoing embodiments, only the division of the modules is used as an example for data processing, and in practical application, the processing allocation may be performed by different modules according to needs, that is, the internal structure of the apparatus is divided into different modules, so as to complete all or part of the processing described above. In addition, the data processing apparatus and the data processing method embodiment provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the data processing apparatus and the data processing method embodiment are detailed in the method embodiment, which is not described herein again.
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present application. The electronic device may be, for example, a mobile phone, a tablet, or a server. As shown in fig. 7, the electronic device of this embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps of the various method embodiments described above, such as steps 201 to 203 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the modules in the above apparatus embodiments, for example, the functions of the determining module, the de-duplication module, and the adding module shown in fig. 6.
For example, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to implement the present application. The one or more modules may be a series of computer program instruction segments capable of performing specified functions, and the instruction segments are used to describe the execution of the computer program in the electronic device.
The electronic device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that fig. 7 is merely an example of an electronic device and is not limiting: the electronic device may include more or fewer components than shown, combine certain components, or include different components. For example, the electronic device may also include input/output devices, network access devices, a bus, etc.
The processor may be a central processing unit (Central Processing Unit, CPU), another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the electronic device. Further, the memory may include both an internal storage unit and an external storage device of the electronic device. The memory is used for storing the computer program and other programs and data required by the electronic device. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other manners. For example, the apparatus/electronic device embodiments described above are merely illustrative: the division of the modules or units is merely a logical function division, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (8)

1. A method of data processing, comprising:
when data to be stored is stored in a first database, determining whether second data blocks identical to the first data blocks are stored in the first database for each of all first data blocks constituting the data to be stored;
under the condition that a second data block which is the same as the first data block is stored in the first database, performing de-duplication on the corresponding first data block, and increasing the de-duplication times of the corresponding second data block by a first set value;
adding a backup for the corresponding second data block if the number of deduplication times of the second data block is greater than a second set value;
in the case that the number of times of deduplication of the second data block is greater than a second set value, the method further includes:
resetting the number of de-duplication times of the corresponding second data block;
the method further comprises the steps of:
determining the importance level of the data to be stored;
based on the importance level of the data to be stored, at least one of the following is determined:
the first set value; the first set value is proportional to the importance level;
the second set value; the second set value is inversely proportional to the importance level.
2. The method of claim 1, wherein the adding a backup for the corresponding second data block comprises:
a backup for the corresponding second data block is added in the second database.
3. The method according to claim 2, wherein the method further comprises:
in the event that a second data block is deleted from the first database, a backup for the corresponding second data block is deleted in the second database.
4. The method of claim 1, wherein when the deduplicating is performed on the corresponding first data block, the method further comprises:
determining metadata of the first data block; the metadata at least comprises storage addresses of corresponding second data blocks in the first database;
and storing the metadata of the first data block into the first database.
5. The method according to claim 1, wherein the method further comprises:
and storing the corresponding first data block into the first database under the condition that the second data block which is the same as the first data block is not stored in the first database.
6. A data processing apparatus, comprising:
the determining module is used for determining whether second data blocks which are identical to the first data blocks are stored in the first database or not for each first data block in all first data blocks forming the data to be stored when the data to be stored is stored in the first database;
the de-duplication module is used for de-duplication of the corresponding first data block and increasing the de-duplication times of the corresponding second data block by a first set value under the condition that the second data block which is the same as the first data block is stored in the first database;
an adding module, configured to add a backup related to the corresponding second data block if the number of deduplication times of the second data block is greater than a second set value;
the zero clearing module is used for resetting the number of deduplication times of the corresponding second data block;
the set value determining module is used for:
determining the importance level of the data to be stored;
based on the importance level of the data to be stored, at least one of the following is determined:
the first set value; the first set value is proportional to the importance level;
the second set value; the second set value is inversely proportional to the importance level.
7. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data processing method according to any of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the data processing method according to any of claims 1 to 5.
CN202010402288.0A 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium Active CN111625186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402288.0A CN111625186B (en) 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010402288.0A CN111625186B (en) 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111625186A CN111625186A (en) 2020-09-04
CN111625186B true CN111625186B (en) 2023-11-07

Family

ID=72271947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010402288.0A Active CN111625186B (en) 2020-05-13 2020-05-13 Data processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111625186B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118227842B (en) * 2024-04-15 2024-09-06 北京瑞太智联技术有限公司 Multi-source heterogeneous data storage method, system, equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102308288A (en) * 2009-02-06 2012-01-04 国际商业机器公司 Backup of deduplicated data
CN102436478A (en) * 2011-10-12 2012-05-02 浪潮(北京)电子信息产业有限公司 System and method for accessing massive data
CN110941514A (en) * 2019-11-25 2020-03-31 湖北工业大学 Data backup method, data recovery method, computer equipment and storage medium
CN111124939A (en) * 2018-10-31 2020-05-08 深信服科技股份有限公司 Data compression method and system based on full flash memory array
CN111124259A (en) * 2018-10-31 2020-05-08 深信服科技股份有限公司 Data compression method and system based on full flash memory array

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8645334B2 (en) * 2009-02-27 2014-02-04 Andrew LEPPARD Minimize damage caused by corruption of de-duplicated data

Non-Patent Citations (2)

Title
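A data integrity verification algorithm in a deduplication backup system; Han Ying; Wang Maofa; Zhang Yanxia; Application Research of Computers (No. 06); p. 1819 abstract to p. 1821 section 4 *
Analysis of storage systems based on data deduplication technology; Zhu Jiang; Ji Ming; Yang Zhicheng; Zhang Jiaxian; Cao Xiong; Information Systems Engineering (No. 04); p. 70 abstract to p. 73 section 4 *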

Also Published As

Publication number Publication date
CN111625186A (en) 2020-09-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant