CN110618974A - Data storage method, device, equipment and storage medium

Info

Publication number: CN110618974A
Application number: CN201910843695.2A
Authority: CN
Inventors: 徐晓阳; 赵万里
Original assignee: Suzhou Wave Intelligent Technology Co Ltd
Current assignee: Suzhou Wave Intelligent Technology Co Ltd
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2019-12-27

Abstract

The application discloses a data storage method comprising: dividing source data into a preset number of data blocks when the source data is obtained; performing a hash operation on a current data block to obtain a hash value corresponding to the current data block; acquiring snapshot data corresponding to the source data, acquiring from the snapshot data a snapshot data block corresponding to the current data block, and calculating a hash value of the snapshot data block; and matching the hash value of the current data block against the hash value of the snapshot data block, and storing the current data block in a preset storage space when the match fails. The data storage method effectively improves the utilization of storage space while greatly reducing the load on storage resources. The application also discloses a data storage apparatus, a device, and a computer-readable storage medium, which have the same beneficial effects.

Description

Data storage method, device, equipment and storage medium

Technical Field

The present application relates to the field of storage technologies, and in particular, to a data storage method, a data storage apparatus, a device, and a computer-readable storage medium.

Background

In the big data era, data is increasingly important. Common data storage features include high-availability schemes such as asynchronous disaster recovery, remote copy, and data compression; indispensable basic functions such as snapshot, backup, and clone; and capacity policies such as data capacity reduction and data deduplication.

Data deduplication clears the repeated parts of stored data, that is, it deletes duplicate data. It can effectively reduce the storage capacity consumed, and at best the capacity released can exceed 90%. Existing data deduplication is mainly file-oriented: it scans the data storage space, finds identical files by comparing file check values, and then deletes the duplicates. Although this reclaims a certain amount of space, the amount reclaimed is limited, and running the deduplication operation itself adds load to the whole storage resource.

Therefore, how to effectively improve the utilization rate of the storage space and reduce the load of the storage resource is an urgent problem to be solved by those skilled in the art.

Disclosure of Invention

An object of the present application is to provide a data storage method that effectively improves the utilization of storage space and greatly reduces the load on storage resources; another object of the present application is to provide a data storage apparatus, a device, and a computer-readable storage medium, which have the same beneficial effects.

In order to solve the above technical problem, the present application provides a data storage method, where the data storage method includes:

when source data are acquired, dividing the source data into a preset number of data blocks;

performing hash operation on a current data block to obtain a hash value corresponding to the current data block;

acquiring snapshot data corresponding to the source data, acquiring a snapshot data block corresponding to the current data block from the snapshot data, and calculating to acquire a hash value of the snapshot data block;

and matching the hash value of the current data block with the hash value of the snapshot data block, and storing the current data block to a preset storage space when the matching fails.

Preferably, the performing the hash operation on the current data block to obtain the hash value corresponding to the current data block includes:

performing the hash operation on the current data block by using a hash algorithm and the MD5 algorithm to obtain the hash value corresponding to the current data block.

Preferably, the data storage method further comprises:

when the hash value of the current data block and the hash value of the snapshot data block pass matching, increasing the size of the next data block to a preset multiple of the size of the current data block, and performing hash value matching on the next data block;

and when the hash value of the current data block is not matched with the hash value of the snapshot data block, dividing the current data block into data blocks with the initially set size, and performing hash value matching on the data blocks with the initially set size.

Preferably, the data storage method further comprises:

when the hash value of the current data block matches the hash value of the snapshot data block, marking the current data block and the offset address of the current data block.

In order to solve the above technical problem, the present application also provides a data storage device, including:

the data block dividing module is used for dividing the source data into a preset number of data blocks when the source data are obtained;

the first hash operation module is used for carrying out hash operation on the current data block to obtain a hash value corresponding to the current data block;

the second hash operation module is used for acquiring snapshot data corresponding to the source data, acquiring a snapshot data block corresponding to the current data block from the snapshot data, and calculating to acquire a hash value of the snapshot data block;

and the hash value matching module is used for matching the hash value of the current data block with the hash value of the snapshot data block, and storing the current data block to a preset storage space when the matching is failed.

Preferably, the first hash operation module is specifically configured to perform a hash operation on the current data block by using a hash algorithm and the MD5 algorithm to obtain the hash value corresponding to the current data block.

Preferably, the data storage device further comprises:

a data block dynamic adjustment module, configured to, when the hash value of the current data block matches the hash value of the snapshot data block, increase the size of a next data block to a preset multiple of the size of the current data block, and perform hash value matching on the next data block; and when the hash value of the current data block is not matched with the hash value of the snapshot data block, dividing the current data block into data blocks with the initially set size, and performing hash value matching on the data blocks with the initially set size.

Preferably, the data storage device further comprises:

a data block marking module, configured to mark the current data block and the offset address of the current data block when the hash value of the current data block matches the hash value of the snapshot data block.

In order to solve the above technical problem, the present application further provides a data storage device, where the data storage device includes:

a memory for storing a computer program;

a processor for implementing the steps of any of the above data storage methods when executing the computer program.

In order to solve the above technical problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the above data storage methods.

The data storage method comprises the steps that when source data are obtained, the source data are divided into a preset number of data blocks; performing hash operation on a current data block to obtain a hash value corresponding to the current data block; acquiring snapshot data corresponding to the source data, acquiring a snapshot data block corresponding to the current data block from the snapshot data, and calculating to acquire a hash value of the snapshot data block; and matching the hash value of the current data block with the hash value of the snapshot data block, and storing the current data block to a preset storage space when the matching fails.

Therefore, the data storage method provided by the present application divides the source data to be stored into data blocks and matches the hash value of each data block against the corresponding snapshot data block in the snapshot data, that is, it compares the data block by block to determine whether the source data has changed relative to the original snapshot data, and then stores only the data blocks that have changed. This effectively prevents data from being stored repeatedly during the data storage process and fundamentally prevents identical data from appearing in the storage space. The data storage method therefore does not need to perform a data deduplication operation, which effectively reduces the load on storage resources, avoids the space occupied by repeatedly stored data, and effectively improves the utilization of the storage space.

The data storage device, the device and the computer readable storage medium provided by the present application all have the above beneficial effects, and are not described herein again.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow chart of a data storage method provided in the present application;

FIG. 2 is a schematic flow chart of another data storage method provided in the present application;

FIG. 3 is a schematic structural diagram of a data storage device provided in the present application;

FIG. 4 is a schematic structural diagram of a data storage device provided in the present application.

Detailed Description

The core of the application is to provide a data storage method, which effectively improves the utilization rate of storage space and greatly reduces the load of storage resources; another core of the present application is to provide a data storage device, a server, and a computer-readable storage medium, which also have the above-mentioned advantages.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, the utilization of storage space can be improved during data storage by performing a data deduplication operation, which avoids wasting resources. In practice, file-oriented deduplication scans the data storage space, finds identical files by comparing file check values, and then deletes the duplicates. Although this reclaims a certain amount of space, the amount reclaimed is limited, and the operation also adds load to the entire storage resource while it runs.

Therefore, in order to solve the above problems, the present application provides a data storage method. When data is stored, the source data to be stored is divided into data blocks, and the hash value of each data block is matched against the corresponding snapshot data block in the snapshot data, that is, the data is compared block by block to determine whether the source data has changed relative to the original snapshot data. Only the data blocks that have changed are then stored, which effectively prevents data from being stored repeatedly during the data storage process and fundamentally prevents identical data from appearing in the storage space. The data storage method therefore does not need to perform a data deduplication operation, which effectively reduces the load on storage resources, avoids the space occupied by repeatedly stored data, and effectively improves the utilization of the storage space.

Referring to FIG. 1, FIG. 1 is a schematic flow chart of a data storage method provided in the present application. The data storage method may include:

S101: when source data is acquired, dividing the source data into a preset number of data blocks;

This step divides the source data, that is, the data to be stored. Specifically, when the source data is acquired, it is divided into a preset number of data blocks (blocks).

The source data may be divided based on a preset rule; for example, it may be divided according to a preset data block size, and the resulting data blocks may be equal or unequal in size. The preset number is set according to actual requirements; its specific value does not affect the implementation of the technical solution and is not limited by the present application.
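
As an illustration of step S101, the sketch below divides a byte string into fixed-size blocks. It is a minimal sketch under stated assumptions, not the implementation of the present application: the function name `split_into_blocks`, the `block_size` parameter, and the use of in-memory `bytes` are all hypothetical choices made for the example. With a fixed block size, the last block may be shorter than the others.

```python
def split_into_blocks(source: bytes, block_size: int) -> list:
    """Divide source data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Step through the data in block_size strides; the final slice may be short.
    return [source[i:i + block_size] for i in range(0, len(source), block_size)]


blocks = split_into_blocks(b"abcdefghij", 4)
```

Concatenating the blocks in order reproduces the original data, so the division itself loses nothing.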

S102: performing a hash operation on the current data block to obtain a hash value corresponding to the current data block;

This step performs the hash operation on a data block. Specifically, for the current data block, that is, the data block currently being checked (any data block in the source data), a hash operation is performed to obtain the hash value corresponding to the current data block. The hash value is used to verify the data block, so as to determine whether the current data block has changed compared with the original data block.

Preferably, performing the hash operation on the current data block to obtain the hash value corresponding to the current data block includes: performing the hash operation on the current data block by using a hash algorithm and the MD5 algorithm (Message-Digest Algorithm 5) to obtain the hash value corresponding to the current data block.

This provides a more specific implementation of the hash value calculation for a data block, based on a hash algorithm and the MD5 algorithm. A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length; the shorter binary value is the hash value. The hash value is an extremely compact numerical representation of a piece of data and can be used to check the integrity of the data. The MD5 algorithm is a hash function widely used in the field of computer security to provide integrity protection for messages.
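
The hash computation for a single block can be sketched with Python's standard `hashlib` module. The helper name `block_hash` is hypothetical. The application only states that a hash algorithm and the MD5 algorithm are used, so this sketch assumes plain MD5 over the raw bytes of the block.

```python
import hashlib


def block_hash(block: bytes) -> str:
    """Return the MD5 digest of one data block as a 32-character hex string."""
    return hashlib.md5(block).hexdigest()


digest = block_hash(b"abc")
```

Whatever the block length, the digest has the same fixed length, which is what makes it cheap to compare.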

S103: acquiring snapshot data corresponding to source data, acquiring a snapshot data block corresponding to a current data block from the snapshot data, and calculating to obtain a hash value of the snapshot data block;

This step computes the hash value for the snapshot data. The snapshot data is the previously stored original data corresponding to the source data, so it can be retrieved from its storage space and divided according to the same block division rule as the source data to obtain a preset number of snapshot data blocks. The snapshot data block corresponding to the current data block is then taken from these blocks and hashed to obtain the hash value of the snapshot data block. The hash operation on the snapshot data block follows the same flow as the hash operation on the current data block and is not described again here.
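
Step S103 can be sketched as follows. The sketch assumes that "the snapshot data block corresponding to the current data block" means the block at the same index under the same fixed block size; the function name `snapshot_block_hash` and its parameters are hypothetical.

```python
import hashlib


def snapshot_block_hash(snapshot: bytes, index: int, block_size: int) -> str:
    """Hash the snapshot block that sits at the same index as the current block."""
    start = index * block_size
    # Same division rule as the source data: a slice of block_size bytes at start.
    return hashlib.md5(snapshot[start:start + block_size]).hexdigest()


snapshot = b"aaaabbbbcccc"
second_block_digest = snapshot_block_hash(snapshot, 1, 4)
```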

S104: and matching the hash value of the current data block with the hash value of the snapshot data block, and storing the current data block to a preset storage space when the matching fails.

This step matches the hash values of the data blocks: the hash value of the current data block is compared with the hash value of the corresponding snapshot data block to determine whether they are the same. If they are the same, that is, the match passes, the current data block is identical to the snapshot data block, no data has changed, and no new data information has been generated. If they are different, that is, the match fails, the current data block has changed relative to the corresponding snapshot data block, so the current data block is stored in the pre-established storage space, which completes the data storage. Therefore, for the source data, data blocks that are identical to their snapshot data blocks are not stored again, and only data blocks that differ from their snapshot data blocks are stored, which effectively avoids repeated storage of data information and fundamentally prevents identical data information from appearing in the storage space.
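
Steps S101 to S104 can be combined into one short sketch. Everything named here is an assumption for illustration: `store_changed_blocks` is a hypothetical function, the "preset storage space" is modelled as a plain dictionary keyed by byte offset, and fixed-size blocks with MD5 digests are assumed.

```python
import hashlib


def store_changed_blocks(source: bytes, snapshot: bytes,
                         block_size: int, storage: dict) -> dict:
    """Store only the blocks whose hash differs from the snapshot block at the same offset."""
    for offset in range(0, len(source), block_size):
        current = source[offset:offset + block_size]
        snap = snapshot[offset:offset + block_size]
        if hashlib.md5(current).digest() != hashlib.md5(snap).digest():
            # Match failed: the block changed, so it is written to storage.
            storage[offset] = current
        # Match passed: nothing is stored for this block.
    return storage


storage = store_changed_blocks(b"aaaaXXXXcccc", b"aaaabbbbcccc", 4, {})
```

In this example only the middle block differs from the snapshot, so it is the only block written.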

As a preferred embodiment, the data storage method may further include: and when the hash value of the current data block is matched with the hash value of the snapshot data block, marking the current data block and the offset address of the current data block.

The present implementation is intended to implement the marking of the data block, and of course, the marked data block is a data block that is determined to be the same as the snapshot data block in the source data. Specifically, when the hash value of the current data block matches the hash value of the corresponding snapshot data block, it indicates that the current data block is the same as the snapshot data block, and at this time, the current data block and the offset address thereof are marked to indicate that the data block has completed matching verification, so that repeated verification is avoided, and the data storage efficiency is further improved.
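
One possible reading of the marking step is sketched below. It assumes the marks are kept in a dictionary from offset address to block, and that a marked block is skipped if the same source data is verified again; the name `verify_and_mark` and this data structure are hypothetical and are not specified by the application.

```python
import hashlib


def verify_and_mark(source: bytes, snapshot: bytes,
                    block_size: int, marked: dict) -> list:
    """Check unmarked blocks, mark the matching ones, return offsets of changed blocks."""
    changed = []
    for offset in range(0, len(source), block_size):
        if offset in marked:
            continue  # already verified as identical to the snapshot, skip re-checking
        block = source[offset:offset + block_size]
        snap = snapshot[offset:offset + block_size]
        if hashlib.md5(block).digest() == hashlib.md5(snap).digest():
            marked[offset] = block  # mark the block together with its offset address
        else:
            changed.append(offset)
    return changed


marked = {}
changed = verify_and_mark(b"aaaaXXXXcccc", b"aaaabbbbcccc", 4, marked)
```

A second call with the same `marked` dictionary hashes only the block at offset 4, since the other two blocks are already marked.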

According to the data storage method provided by the embodiment of the application, when data storage is carried out, data block division is carried out on source data to be stored, hash value matching is carried out on each data block and a corresponding snapshot data block in snapshot data, namely, data block-by-data block comparison is carried out, whether the source data are changed compared with original snapshot data or not is determined, and further, storage operation is carried out only on the changed data blocks, so that repeated storage of the data is effectively avoided in the data storage process, the same data in a storage space is fundamentally avoided, therefore, the data storage method does not need to carry out data deduplication operation, storage resource load is effectively reduced, space occupation caused by repeated storage of the data is avoided, and the utilization rate of the storage space is effectively improved.

On the basis of the foregoing embodiments, as a preferred embodiment, the data storage method may further include:

when the hash value of the current data block is matched with the hash value of the snapshot data block, increasing the size of the next data block to a preset multiple of the size of the current data block, and performing hash value matching on the next data block; when the hash value of the current data block is not matched with the hash value of the snapshot data block, the current data block is divided into data blocks with the initial set size, and the hash value of the data blocks with the initial set size is matched.

This embodiment provides another specific data storage method, in which the size of each verified data block can be adjusted dynamically while the data blocks of the source data are being matched, which effectively improves data storage efficiency. Specifically, when the hash value of the data block currently being verified matches that of the corresponding snapshot data block, the size of the next data block to be verified can be increased to a preset multiple of the current size, for example 2 times, so that a block twice the size of the current data block is verified directly. If that block also matches, a block 4 times the size of the current data block can be verified next, and so on. As long as the blocks keep matching, continuously increasing the size of the verified block effectively reduces the number of verifications and improves data storage efficiency. Conversely, when a verified data block fails to match, it is divided back into data blocks of the initially set size, and each of these is then verified in the same way, until all data blocks in the source data have been verified.

The technical solution of this embodiment is described below by way of example. Suppose the acquired source data is divided into 20 data blocks of the same size, labeled 1 to 20, and this size is the initially set size. During verification, hash value matching is first performed on data block 1. If data block 1 passes, data blocks 2 and 3 are treated as one complete data block for hash value matching. If that verification also passes, hash value matching is performed on data blocks 4 to 7. If this verification fails, the combined large block is divided again into 4 data blocks of the same size, that is, the original block size is restored, and hash value matching is performed on the restored data block 4. If data block 4 passes, the restored data blocks 5 and 6 are combined into one complete data block for hash value matching; if data block 4 fails, data block 4 is stored in the storage space.
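
The dynamic adjustment described above can be sketched as a loop over block indices. This is a minimal sketch under assumptions: the name `dynamic_verify`, the `factor` parameter for the preset multiple (2 here), and the policy of continuing at the initial size after a single changed block is stored are illustrative choices, and the "storage space" is again a dictionary keyed by byte offset.

```python
import hashlib


def dynamic_verify(source: bytes, snapshot: bytes,
                   block_size: int, factor: int = 2):
    """Verify with growing block spans; return (stored blocks, number of hash comparisons)."""
    n_blocks = (len(source) + block_size - 1) // block_size
    stored = {}
    comparisons = 0
    pos, span = 0, 1  # pos: index of next initial-size block, span: blocks per check
    while pos < n_blocks:
        width = min(span, n_blocks - pos)
        start, end = pos * block_size, (pos + width) * block_size
        comparisons += 1
        same = (hashlib.md5(source[start:end]).digest()
                == hashlib.md5(snapshot[start:end]).digest())
        if same:
            pos += width    # whole span unchanged, move past it
            span *= factor  # and try a larger span next time
        elif width > 1:
            span = 1        # fall back to the initially set size and retry from pos
        else:
            stored[start] = source[start:end]  # a single changed block is stored
            pos += 1
    return stored, comparisons


snapshot = b"a" * 40                 # 20 blocks of 2 bytes each
changed = bytearray(snapshot)
changed[10:12] = b"ZZ"               # change only the sixth block
stored, comparisons = dynamic_verify(bytes(changed), snapshot, block_size=2)
```

In this run the spans grow while the data matches, fall back to single blocks around the changed block, and then grow again, so fewer hash comparisons are needed than the 20 a fixed-size scan would make.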

The data storage method provided by the embodiment realizes dynamic adjustment of the size of the checked data block, effectively reduces the checking times, and further improves the data storage efficiency.

On the basis of the foregoing embodiments, the embodiments of the present application provide a more specific data storage method. Referring to FIG. 2, FIG. 2 is a schematic flow chart of another data storage method provided in the present application.

First, source data of size M is divided into blocks of a fixed size j, and a hash value is computed for each block using a hash algorithm and the MD5 algorithm. The hash value M_i of the i-th block (1 ≤ i ≤ M/j) is then matched against the corresponding snapshot data block. If the match passes, the block data has not changed and no copy or store is needed; if the match fails, the block data has changed and must be copied and stored, and at this point the offset address of the block and the block data are recorded as i × j and M_i, respectively.

Further, to improve the data storage rate, if the current block is found to be unchanged during matching, the size of the block to be checked can be doubled for the next check; if that block also matches, the size is doubled again, until a block that fails to match is found. When a match fails, the block data has changed, and the check switches back to the initial block size. This repeats until all data blocks in the source data have been verified.

In addition, when the data block in the source data is matched with the snapshot data block through the hash function, verification can be performed at the corresponding offset position of the data block, if the matching is passed, the offset address and the block data are marked, and if the matching is not passed, the block data at the offset position is stored in another storage space.

Statistics show that block-based checking can usually detect more duplicate blocks in a storage space than file-based checking: when a file is designated for duplicate checking, checking it block by block can find more duplicate blocks within it. In particular, when the source data is very large and the divided blocks are small, the blocks can be processed in cache, which gives a high checking speed.

It can be seen that, in the data storage method provided in the embodiment of the present application, when data is stored, data block division is performed on source data to be stored, hash value matching is performed on each data block and a corresponding snapshot data block in snapshot data, that is, data block-by-data block comparison is performed, so as to determine whether the source data is changed compared with original snapshot data, and further, a storage operation is performed only on the changed data block, thereby effectively avoiding data from being repeatedly stored in the data storage process, and fundamentally avoiding the same data from appearing in a storage space.

To solve the above problem, referring to FIG. 3, FIG. 3 is a schematic structural diagram of a data storage device provided in the present application. The data storage device may include:

a data block dividing module 100, configured to divide source data into a preset number of data blocks when the source data is acquired;

the first hash operation module 200 is configured to perform hash operation on a current data block to obtain a hash value corresponding to the current data block;

the second hash operation module 300 is configured to obtain snapshot data corresponding to the source data, obtain a snapshot data block corresponding to the current data block from the snapshot data, and calculate a hash value of the snapshot data block;

and the hash value matching module 400 is configured to match the hash value of the current data block with the hash value of the snapshot data block, and store the current data block in a preset storage space when the matching fails.
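
The four modules can be pictured as four methods of one object. The sketch below is only an illustration of that decomposition: the class name `DataStorageApparatus`, the method names, the dictionary used as the preset storage space, and the fixed block size are all hypothetical, and the reference numerals in the comments simply point back to modules 100 to 400.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DataStorageApparatus:
    block_size: int
    storage: dict = field(default_factory=dict)  # stands in for the preset storage space

    def divide(self, source: bytes) -> list:
        """Data block dividing module (100): split into (offset, block) pairs."""
        return [(off, source[off:off + self.block_size])
                for off in range(0, len(source), self.block_size)]

    def hash_current(self, block: bytes) -> bytes:
        """First hash operation module (200): hash the current data block."""
        return hashlib.md5(block).digest()

    def hash_snapshot(self, snapshot: bytes, offset: int) -> bytes:
        """Second hash operation module (300): hash the snapshot block at the same offset."""
        return hashlib.md5(snapshot[offset:offset + self.block_size]).digest()

    def match_and_store(self, source: bytes, snapshot: bytes) -> dict:
        """Hash value matching module (400): store blocks whose hashes do not match."""
        for offset, block in self.divide(source):
            if self.hash_current(block) != self.hash_snapshot(snapshot, offset):
                self.storage[offset] = block
        return self.storage


apparatus = DataStorageApparatus(block_size=4)
result = apparatus.match_and_store(b"aaaaXXXXcccc", b"aaaabbbbcccc")
```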

It can be seen that, in the data storage device provided in the embodiment of the present application, when data is stored, data block division is performed on source data to be stored, hash value matching is performed on each data block and a corresponding snapshot data block in snapshot data, that is, data block-by-data block comparison is performed, so as to determine whether the source data is changed compared with original snapshot data, and further, a storage operation is performed only on the changed data block, thereby effectively avoiding data from being repeatedly stored in the data storage process, and fundamentally avoiding the same data from appearing in a storage space.

As a preferred embodiment, the first hash operation module 200 may be specifically configured to perform a hash operation on the current data block by using a hash algorithm and the MD5 algorithm, so as to obtain the hash value corresponding to the current data block.

As a preferred embodiment, the data storage device may further include:

the data block dynamic adjustment module is used for increasing the size of the next data block to a preset multiple of the size of the current data block and performing hash value matching on the next data block when the hash value of the current data block is matched with the hash value of the snapshot data block; when the hash value of the current data block is not matched with the hash value of the snapshot data block, the current data block is divided into data blocks with the initial set size, and the hash value of the data blocks with the initial set size is matched.

As a preferred embodiment, the data storage device may further include:

a data block marking module, configured to mark the current data block and the offset address of the current data block when the hash value of the current data block matches the hash value of the snapshot data block.

For the introduction of the apparatus provided in the present application, please refer to the above method embodiments, which are not described herein again.

To solve the above problem, referring to FIG. 4, FIG. 4 is a schematic structural diagram of a data storage device provided in the present application. The data storage device may include:

a memory 10 for storing a computer program;

a processor 20, configured to implement the following steps when executing the computer program:

when source data are acquired, dividing the source data into a preset number of data blocks; performing hash operation on the current data block to obtain a hash value corresponding to the current data block; acquiring snapshot data corresponding to source data, acquiring a snapshot data block corresponding to a current data block from the snapshot data, and calculating to obtain a hash value of the snapshot data block; and matching the hash value of the current data block with the hash value of the snapshot data block, and storing the current data block to a preset storage space when the matching fails.

For the introduction of the device provided in the present application, please refer to the above method embodiment, which is not described herein again.

To solve the above problem, the present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program when executed by a processor can implement the following steps:

when source data is acquired, dividing the source data into a preset number of data blocks; performing a hash operation on the current data block to obtain a hash value corresponding to the current data block; acquiring snapshot data corresponding to the source data, acquiring a snapshot data block corresponding to the current data block from the snapshot data, and calculating a hash value of the snapshot data block; and matching the hash value of the current data block with the hash value of the snapshot data block, and storing the current data block to a preset storage space when the matching fails.

The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

For the introduction of the computer-readable storage medium provided in the present application, please refer to the above method embodiments, which are not described herein again.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The data storage method, apparatus, device, and computer-readable storage medium provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims

1. A method of storing data, comprising:

2. The data storage method of claim 1, wherein the performing the hash operation on the current data block to obtain the hash value corresponding to the current data block comprises:

3. The data storage method of claim 1, further comprising:

4. A data storage method according to any one of claims 1 to 3, further comprising:

5. A data storage device, comprising:

6. The data storage device according to claim 5, wherein the first hash operation module is specifically configured to perform a hash operation on the current data by using a hash algorithm and an MD5 algorithm to obtain a hash value corresponding to the current data block.

7. The data storage device of claim 5, further comprising:

8. The data storage device of any of claims 5 to 7, further comprising:

9. A data storage device, further comprising:

a memory for storing a computer program;

a processor for implementing the steps of the data storage method of any one of claims 1 to 4 when executing said computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the data storage method according to any one of claims 1 to 4.