CN113821476A

CN113821476A - Data processing method and device

Info

Publication number: CN113821476A
Application number: CN202111409107.8A
Authority: CN
Inventors: 康玉竹; 黄岩
Original assignee: Yunhe Enmo Beijing Information Technology Co ltd
Current assignee: Yunhe Enmo Beijing Information Technology Co ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2021-12-21
Anticipated expiration: 2041-11-25
Also published as: CN113821476B

Abstract

The invention discloses a data processing method and device. The method comprises: storing acquired snapshot data using a single-level directory structure, wherein the snapshot data comprises a plurality of data blocks and data block pointers indicating the data blocks; generating corresponding metadata according to the snapshot data, wherein the metadata comprises an index of the data block pointers and has a key-value pair structure; and updating the metadata according to updates to the snapshot data. The invention solves the technical problems of complex structure, large data volume and proneness to error in the related art, in which the snapshot data of a system relies on pointers to link data blocks into a directed acyclic graph data structure.

Description

Data processing method and device

Technical Field

The present invention relates to the field of data processing, and in particular, to a data processing method and apparatus.

Background

There are two main techniques currently used to implement snapshots, one is copy-on-write (COW) and the other is redirect-on-write (ROW).

COW: when a snapshot is created, a volume is assigned as the snapshot volume relative to the source volume. When a data block is written for the first time after the snapshot is created, the original data of the block is copied from the source volume to the snapshot volume, and only then is the block in the source volume written. The data image of the snapshot is thus preserved, and the combination of the source volume and the snapshot volume presents a point-in-time image of the data. After the snapshot is created, all subsequent read I/O is performed on the source volume. Writes to a block after its first change are also performed on the source volume; that is, only the first write to a block copies the original data to the snapshot volume.

ROW: copy-on-write requires three I/O operations when a block is written for the first time: 1) reading the original block from the source volume; 2) writing the original block into the snapshot volume; 3) writing the new data to the source volume. These I/O operations are completed on the production path, which may negatively impact application performance. To overcome this, redirect-on-write may be used, as shown in fig. 1-2, leaving the original blocks in the source volume unchanged and performing new write operations on the snapshot volume. This eliminates the extra I/O operations of the copy-on-write method. After the snapshot is created, all subsequent write I/O is performed on the snapshot volume, while read I/O may come from the source volume or the snapshot volume, depending on whether the block has changed since the snapshot was created. The point-in-time image of the snapshot data is the source volume itself, because the source volume is read-only at all times after the snapshot is created.
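The difference in I/O cost between the two techniques can be illustrated with a minimal sketch. The `Volume` class, the function names and the I/O counter are hypothetical and exist only for illustration; they are not taken from the related art described here.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Toy volume (hypothetical): block index -> data, plus an I/O counter."""
    blocks: dict = field(default_factory=dict)
    io_ops: int = 0

    def read(self, idx):
        self.io_ops += 1
        return self.blocks.get(idx)

    def write(self, idx, data):
        self.io_ops += 1
        self.blocks[idx] = data


def cow_first_write(source, snapshot, idx, new_data):
    """Copy-on-write, first write to a block after the snapshot."""
    original = source.read(idx)      # 1) read the original block
    snapshot.write(idx, original)    # 2) copy it into the snapshot volume
    source.write(idx, new_data)      # 3) write the new data to the source


def row_write(source, snapshot, idx, new_data):
    """Redirect-on-write: the source stays untouched, one write only."""
    snapshot.write(idx, new_data)


def row_read(source, snapshot, idx):
    """Read from the snapshot volume if the block changed, else the source."""
    if idx in snapshot.blocks:
        return snapshot.read(idx)
    return source.read(idx)
```

In this sketch a first COW write costs three counted operations, while a ROW write costs one, at the price of a read path that must check two volumes.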

At this stage, ROW snapshots combined with distributed storage are the main direction of development in the industry. The original data of a ROW snapshot remains in the source data volume, and in order to ensure the integrity of the snapshot data, the state of the source data volume is changed from read-write to read-only when the snapshot is created. If a storage system takes multiple snapshots, a snapshot chain is generated, and the disk volume is always mounted at the tail end of the snapshot chain; that is, all write operations fall into the tail-end snapshot volume. This causes a problem: if 10 snapshots have been taken in total, restoring the latest snapshot point requires merging 10 snapshot volumes to obtain the complete data of that point; restoring the 8th snapshot point requires merging the first 8 snapshots to form the complete snapshot data. Therefore, the greatest problem of ROW in the conventional storage scenario is that read performance is severely affected.

After a storage device is formatted to create a file system, the contents of the file system fall roughly into two parts: index nodes (inodes) and data blocks. An inode stores file attribute information, i.e. metadata, including file size, owning group, permissions, type, modification time, and a pointer to the file's data blocks; the super-block records overall information about the whole file system. The data blocks store the actual data of files, such as photos, video and audio.

Traditional ROW snapshots express snapshot information by "linking data blocks with pointers into a directed acyclic graph data structure". When data is modified, the original data block is not overwritten; the updated data is written into a new data block, a new inode pointing to the updated data is created, and new inodes pointing to the updated lower-level inodes are created layer by layer until the root node is reached. The current data can then be read through the current super-block, and the snapshot data can be read through the old super-block. This approach has several disadvantages: (1) the data structure is complex, difficult to implement and error-prone; (2) newly written data requires a large amount of metadata to be updated, since writing one new data block requires rewriting every intermediate node from the snapshot root node to that block, so the amount of physical data written is several times the amount of logical data written, which makes the approach unsuitable for solid state disks (SSDs); (3) a "directed acyclic graph" is difficult to store as KV data.
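As a rough illustration of disadvantage (2), the number of index nodes rewritten for a single block update in a pointer tree grows with the height of the tree. The sketch below assumes a balanced tree with a uniform fanout, which is a simplification not stated in the text; the function name is hypothetical.

```python
def path_copy_node_writes(num_blocks: int, fanout: int) -> int:
    """Index nodes rewritten (root included) when one data block changes.

    Assumes a balanced pointer tree in which every index node holds
    `fanout` pointers; the height is the smallest h with fanout**h >= num_blocks.
    """
    height = 1
    capacity = fanout
    while capacity < num_blocks:
        capacity *= fanout
        height += 1
    return height
```

For example, under these assumptions a volume of one million blocks indexed with a fanout of 100 needs three index-node writes per updated data block, whereas a flat key-value record needs one.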

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a data processing method and a data processing device, which at least solve the technical problems of complex structure, large data volume and easy error in the mode that snapshot data of a system in the related technology needs to depend on pointers to connect data blocks into a directed acyclic graph data structure.

According to an aspect of an embodiment of the present invention, a data processing method includes: storing the acquired snapshot data by adopting a single-level directory structure, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks; generating corresponding metadata according to the snapshot data, wherein the metadata comprises an index of the data block pointer, and the metadata is a key-value pair structure; and updating the metadata according to the updating of the snapshot data.

Optionally, storing the obtained snapshot data by using a single-level directory structure includes: determining one or more data blocks of different snapshot volumes corresponding to target positions of the data blocks in the snapshot data, wherein the different snapshot volumes include a plurality of snapshot volumes obtained by performing snapshots successively and repeatedly, the snapshot data is a current working volume, and the snapshot data includes a plurality of current working data blocks of the plurality of different snapshot volumes; and storing one or more data blocks corresponding to the target position at positions of which the storage positions do not exceed a preset distance from each other.

Optionally, the metadata is a key-value pair, the key includes a volume identifier, a block identifier and a snapshot identifier, and the value includes an address of the corresponding data block, and the method further includes: receiving a request for searching for a first target data block, wherein the request comprises a two-tuple of a volume identifier and a block identifier; calculating a hash value from the two-tuple; using the hash value as an index, finding the interval in which the two-tuple is located, wherein the interval stores ordered key-value pair records; and searching the key-value pair records of the interval for the target key-value pair using the volume identifier, the block identifier and the snapshot identifier, and determining the corresponding first target data block from the target key-value pair, wherein the snapshot identifier is used for identifying different snapshot volumes.

Optionally, a write request for a second target data block of the snapshot data is received, where the second target data block is a data block at any first data block position in the snapshot data; when no data block exists at the first data block position of the current volume, the second target data block is written to the first data block position of the current volume of the snapshot data; and when a data block exists at the first data block position of the current volume, the data block at the first data block position is overwritten with the second target data block.

Optionally, a read request for reading a third target data block is received, where the third target data block is a data block at any second data block position in the snapshot data; reading a data block at the second data block position of the current volume under the condition that the data block exists at the second data block position of the current volume; reading a data block at the second data block position of a snapshot volume before the current volume under the condition that no data block exists at the second data block position of the current volume; and in the case that all snapshot volumes in the second data block position have no data block, returning all zero data blocks in response to the read request.

Optionally, a deletion operation on the snapshot volume is received, where the deletion operation includes a first snapshot identifier of the snapshot volume to be deleted and a second snapshot identifier of the next snapshot volume adjacent to the snapshot volume to be deleted; according to the first snapshot identifier and the second snapshot identifier, data blocks that exist in the snapshot volume identified by the first snapshot identifier and do not exist in the snapshot volume identified by the second snapshot identifier are determined to be referenced data blocks; in response to the deletion operation, the data blocks in the snapshot volume identified by the first snapshot identifier other than the referenced data blocks are deleted; and the space of the deleted data blocks and the space of the data blocks referenced by the deleted data blocks are reclaimed.

Optionally, the snapshot identifier of the deleted snapshot volume is recorded; and the next time a snapshot volume is deleted, all recorded snapshot identifiers are searched, and the space of the unreclaimed data blocks in the snapshot volumes corresponding to the recorded snapshot identifiers is reclaimed.

Optionally, when the deletion operation is responded, the snapshot identifier of the referenced data block is modified to the snapshot identifier of the snapshot volume referencing the referenced data block.

According to another aspect of the embodiments of the present invention, there is also provided a data processing apparatus, including: the storage module is used for storing the acquired snapshot data by adopting a single-level directory structure, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks; the generating module is used for generating corresponding metadata according to the snapshot data, wherein the metadata comprises an index of the data block pointer, and the metadata is a key-value pair structure; and the updating module is used for updating the metadata according to the updating of the snapshot data.

According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes the data processing method described in any one of the above.

According to another aspect of the embodiments of the present invention, there is also provided a computer storage medium, where the computer storage medium includes a stored program, and when the program runs, the apparatus where the computer storage medium is located is controlled to execute any one of the above data processing methods.

In the embodiment of the invention, a single-level directory structure is adopted to store the acquired snapshot data, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks; corresponding metadata is generated according to the snapshot data, wherein the metadata comprises an index of the data block pointers and is a key-value pair structure; and the metadata is updated according to updates to the snapshot data. By storing the snapshot data in a single-level directory, the data structure is simplified and the amount of physical data written is reduced, thereby achieving the technical effects of improving the utilization of storage space and the accuracy of the snapshot data, and solving the technical problems of complex structure, large data volume and proneness to error in the related art, in which the snapshot data of a system relies on pointers to link data blocks into a directed acyclic graph data structure.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method of data processing according to an embodiment of the invention;

FIG. 2 is a schematic diagram of a data value based storage system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of prioritizing data storage according to block addresses according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a primary volume and snapshot rollback according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of data of a snapshot volume according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of snapshot deletion according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In accordance with an embodiment of the present invention, there is provided a method embodiment of a data processing method, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than that presented herein.

Fig. 1 is a flow chart of a data processing method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:

step S102, a single-level directory structure is adopted to store the acquired snapshot data, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks;

step S104, generating corresponding metadata according to the snapshot data, wherein the metadata comprises indexes of data block pointers, and the metadata is a key value pair structure;

and step S106, updating the metadata according to the updating of the snapshot data.

Through the above steps, a single-level directory structure is adopted to store the acquired snapshot data, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks; corresponding metadata is generated according to the snapshot data, wherein the metadata comprises an index of the data block pointers and is a key-value pair structure; and the metadata is updated according to updates to the snapshot data. By storing the snapshot data in a single-level directory, the data structure is simplified and the amount of physical data written by a snapshot is reduced, thereby achieving the technical effects of improving the accuracy of the snapshot data and the utilization of storage space, and solving the technical problems of complex structure, large data volume and proneness to error in the related art, in which the snapshot data of a system relies on pointers to link data blocks into a directed acyclic graph data structure.

In the single-level directory structure, the system establishes only one directory table, and each piece of snapshot data occupies only one directory entry. The directory information may be the snapshot volume information or the current volume information of the snapshot data, or the storage address information of a data block. The single-level directory can determine the target position of a data block. When an operation request for a data block is received, the request carries a two-tuple; a hash value is calculated from the two-tuple and used as an index to find the interval in which the two-tuple is located, and the interval contains KV records stored in order. The logical address of the corresponding data block is then found by binary search using a triple, and the acquired snapshot data can be stored at that logical address.

Compared with the prior art in which snapshot data is stored on distributed storage equipment, each data item of the single-level directory corresponds to independent metadata, so the structure of the metadata can be simplified and the metadata can be updated faster and more efficiently. Snapshot data can be regarded as a storage backup obtained by copying all or part of the data in a storage system at a certain moment. A snapshot is comparable to taking a photograph: the data in the storage system at different historical times is stored and backed up, and when the storage system is interrupted unexpectedly, or data is lost or corrupted by a malicious attack, the data in the storage system can be recovered from the snapshot data and the state of the storage system at a certain historical moment can be regenerated. A snapshot may be considered a fully usable copy of a given data set that includes an image of the corresponding data at some point in time (the point in time at which the copy began). The snapshot may be a copy of the data it represents or a replica of that data. The main use of snapshots is data protection. Typically, the contents of a storage snapshot are read-only; storage administrators and various third-party backup applications can use them to read or restore data, while write operations continue to be performed on the live current volume.

The storage system may include a specific data block and attribute information identifying the data block, and the metadata may be used to indicate different data in the storage system and may be used to identify the attribute information of the data block. The metadata is a key-value pair structure (i.e., key-value structure), the key may refer to a number, and the value may refer to stored data and a logical address. In the above key-value pair structure, a key may include a volume identifier, a block identifier, and a snapshot identifier, and a value may include an address of a data block.
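The key-value structure described above can be sketched as follows. This is a minimal illustration only: the names `ChunkKey` and `SnapshotMetadata` are hypothetical, and an in-memory dictionary stands in for whatever KV database an actual embodiment uses.

```python
from typing import Dict, NamedTuple, Optional


class ChunkKey(NamedTuple):
    """Key of one metadata record (field names are illustrative)."""
    volume_id: int
    chunk_id: int
    snap_id: int


class SnapshotMetadata:
    """Flat key-value metadata: ChunkKey -> address of the data block."""

    def __init__(self) -> None:
        self._records: Dict[ChunkKey, int] = {}

    def update(self, key: ChunkKey, address: int) -> None:
        # Updating the snapshot data touches exactly one record.
        self._records[key] = address

    def address_of(self, key: ChunkKey) -> Optional[int]:
        return self._records.get(key)
```

Because every record is independent, changing one data block updates a single key-value pair rather than a chain of parent nodes.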

The above metadata update may be understood as that, with the use of the storage system, the stored data may also change, and the metadata page for identifying the data naturally needs to be updated along with the change of the data, so that data recovery is performed when the data is lost, the data is inaccurate, or there is an update requirement.

Optionally, storing the obtained snapshot data by using a single-level directory structure includes: determining one or more data blocks of different snapshot volumes corresponding to target positions of the data blocks in the snapshot data, wherein the different snapshot volumes comprise a plurality of snapshot volumes obtained by performing snapshots successively and repeatedly, the snapshot data is a current working volume, and the snapshot data comprises a plurality of current working data blocks of the plurality of different snapshot volumes; and storing one or more data blocks corresponding to the target position at positions of which the storage positions do not exceed a preset distance from each other.

In the target position of the data block, different snapshot volumes may be regarded as snapshots of the data block sequentially established by the current volume according to the time sequence, and the snapshot data may be regarded as a backup of data. The storage structure may include a current volume and different snapshot volumes, and the snapshot volumes may be arranged according to a time sequence of snapshots.

A data block may be regarded as the smallest unit making up the current volume and the different snapshot volumes. For any data block position, the data block of the current volume and the data blocks of its snapshot volumes correspond to each other, and the data blocks of all snapshot volumes for the same position should be stored logically close to one another. Logically close storage may mean storage in a storage area on the same storage device.

Optionally, the metadata is a key-value pair, the key includes a volume identifier, a block identifier and a snapshot identifier, and the value includes an address of the corresponding data block, and the method further includes: receiving a request for searching for a first target data block, wherein the request comprises a two-tuple of a volume identifier and a block identifier; calculating a hash value from the two-tuple; using the hash value as an index, finding the interval in which the two-tuple is located, wherein the interval stores ordered key-value pair records; and searching the key-value pair records of the interval for the target key-value pair using the volume identifier, the block identifier and the snapshot identifier, and determining the corresponding first target data block from the target key-value pair, wherein the snapshot identifier is used for identifying different snapshot volumes.

The search for the first target data block is based on the volume identifier and the block identifier carried by the search request. A hash value is calculated from the two-tuple (VOLUME_ID, CHUNK_ID) and used as an index to determine the interval containing the information of the target data block. Within that interval, the logical address of the target data block is found by binary search using the triple (VOLUME_ID, CHUNK_ID, SNAP_ID), i.e. the volume identifier, the block identifier and the snapshot identifier. Searching in this way can improve the efficiency of locating the target data block.
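The two-stage lookup described above (hash on the two-tuple, then binary search on the triple) might be sketched as follows. The class name, the number of intervals and the use of Python's built-in `hash` are assumptions made for illustration; the text does not specify a hash function or an interval count.

```python
import bisect


class SnapshotIndex:
    """Hash-partitioned index; each interval keeps its keys in sorted order."""

    def __init__(self, num_intervals=16):  # interval count is hypothetical
        self.num_intervals = num_intervals
        self.keys = [[] for _ in range(num_intervals)]       # sorted triples
        self.addresses = [[] for _ in range(num_intervals)]  # parallel values

    def _interval(self, volume_id, chunk_id):
        # Stage 1: the hash of the two-tuple selects the interval, so all
        # snapshot versions of one chunk land in the same interval.
        return hash((volume_id, chunk_id)) % self.num_intervals

    def put(self, volume_id, chunk_id, snap_id, address):
        n = self._interval(volume_id, chunk_id)
        key = (volume_id, chunk_id, snap_id)
        i = bisect.bisect_left(self.keys[n], key)
        if i < len(self.keys[n]) and self.keys[n][i] == key:
            self.addresses[n][i] = address        # update existing record
        else:
            self.keys[n].insert(i, key)           # keep the interval ordered
            self.addresses[n].insert(i, address)

    def get(self, volume_id, chunk_id, snap_id):
        n = self._interval(volume_id, chunk_id)
        key = (volume_id, chunk_id, snap_id)
        # Stage 2: binary search on the full triple inside the interval.
        i = bisect.bisect_left(self.keys[n], key)
        if i < len(self.keys[n]) and self.keys[n][i] == key:
            return self.addresses[n][i]
        return None
```

A lookup such as `index.get(1, 7, 3)` returns the stored logical address, or `None` when no record exists for that triple.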

Optionally, a write request for a second target data block of the snapshot data is received, where the second target data block is a data block at any first data block position in the snapshot data; when no data block exists at the first data block position of the current volume, the second target data block is written to the first data block position of the current volume of the snapshot data; and when a data block exists at the first data block position of the current volume, the data block at the first data block position is overwritten with the second target data block.

The second target data block may be considered as a target data block that needs to perform a write operation on snapshot data, where the second target data block is a data block at any first data block position in the snapshot data, and when a write request for the second target data block of the snapshot data is received, if no data block exists at the first data block position of the current volume, the second target data block may be directly written into the first data block position of the current volume of the snapshot data, and if a data block exists at the first data block position of the current volume, the data block at the first data block position may be directly overwritten.
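The write path described above can be sketched as follows. The `Store` class, the address allocator and the parameter `current_snap_id` (the snapshot identifier under which the current volume's blocks are keyed) are hypothetical and serve only to make the two cases concrete.

```python
class Store:
    """Toy store (hypothetical): flat metadata plus a block address space."""

    def __init__(self):
        self.meta = {}       # (volume_id, chunk_id, snap_id) -> address
        self.data = {}       # address -> payload
        self.next_addr = 0   # trivial allocator, for illustration only


def write_chunk(store, volume_id, chunk_id, current_snap_id, payload):
    """Write to the current volume: allocate on first write, else overwrite."""
    key = (volume_id, chunk_id, current_snap_id)
    addr = store.meta.get(key)
    if addr is None:
        # No block at this position in the current volume: write a new one.
        addr = store.next_addr
        store.next_addr += 1
        store.meta[key] = addr
    # A block already exists in the current volume: overwrite it in place.
    store.data[addr] = payload
    return addr
```

In this sketch, blocks belonging to earlier snapshot volumes are never touched by a write, because their keys carry a different snapshot identifier.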

Optionally, a read request for reading a third target data block is received, where the third target data block is a data block at any second data block position in the snapshot data; under the condition that a data block exists at the position of a second data block of the current volume, reading the data block at the position of the second data block of the current volume; under the condition that no data block exists in the second data block position of the current volume, reading the data block in the second data block position of the snapshot volume before the current volume; in the event that all snapshot volumes at the second data block location have no data blocks, all zero data blocks are returned in response to the read request.

The third target data block may be regarded as a data block at any second data block position in the snapshot data. When a read request for the third target data block is received and a data block exists at the second data block position of the current volume, that data block is read directly. If no data block exists at the second data block position of the current volume, the data block of the snapshot volume preceding the current volume is read; if that snapshot volume has no data block at the position either, the search continues through the earlier snapshot volumes in turn. If none of the snapshot volumes has a data block at the second data block position, an all-zero data block is returned in response to the read request.
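The read fallback described above can be sketched as follows. The explicit `snap_chain` list (current volume first, then snapshot volumes from newest to oldest) and the 4096-byte block size are assumptions made for illustration.

```python
ZERO_BLOCK = bytes(4096)  # block size is a hypothetical choice


def read_chunk(meta, data, volume_id, chunk_id, snap_chain):
    """Return the newest available version of a block, or an all-zero block.

    meta:       (volume_id, chunk_id, snap_id) -> address
    data:       address -> payload
    snap_chain: snapshot identifiers, current volume first, then older
                snapshot volumes in order (an assumed representation).
    """
    for snap_id in snap_chain:
        addr = meta.get((volume_id, chunk_id, snap_id))
        if addr is not None:
            return data[addr]
    # No volume in the chain holds a block at this position.
    return ZERO_BLOCK
```

Each step of the loop is a single key-value lookup, so a read never has to merge whole snapshot volumes.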

Optionally, a deletion operation on the snapshot volume is received, where the deletion operation includes a first snapshot identifier of the snapshot volume to be deleted and a second snapshot identifier of the next snapshot volume adjacent to the snapshot volume to be deleted; according to the first snapshot identifier and the second snapshot identifier, data blocks that exist in the snapshot volume identified by the first snapshot identifier and do not exist in the snapshot volume identified by the second snapshot identifier are determined to be referenced data blocks; in response to the deletion operation, the data blocks in the snapshot volume identified by the first snapshot identifier other than the referenced data blocks are deleted; and the space of the deleted data blocks and the space of the data blocks referenced by the deleted data blocks are reclaimed.

The deletion operation on the snapshot volume may be performed periodically by the system, or when the user needs to free storage space. The deletion operation includes a first snapshot identifier of the snapshot volume to be deleted and a second snapshot identifier of the next snapshot volume adjacent to the snapshot volume to be deleted. Using the two identifiers, a data block that exists in the snapshot volume of the first snapshot identifier but not in the snapshot volume of the second snapshot identifier is determined to be a referenced data block, which in turn determines whether each data block of the snapshot volume can be deleted.

For the deletion operation on the snapshot volume, if a data block of the snapshot volume is referenced by the adjacent snapshot volume identified in the deletion operation, the referenced data block cannot be deleted. The referenced data block is not affected by the deletion operation, and the space of the deleted data blocks is reclaimed, which improves the utilization of the storage space.
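The deletion rule described above can be sketched as follows, using flat dictionaries for the metadata and the block space. The function name and return values are hypothetical; the rule itself (a block present under the first identifier but absent under the second is treated as referenced and kept) follows the text.

```python
def delete_snapshot(meta, data, volume_id, first_snap_id, second_snap_id):
    """Delete the snapshot volume `first_snap_id`, keeping referenced blocks.

    meta: (volume_id, chunk_id, snap_id) -> address
    data: address -> payload
    Returns (reclaimed_addresses, referenced_chunk_ids).
    """
    reclaimed, referenced = [], []
    for (vol, chunk, snap) in list(meta):  # copy keys: meta is mutated below
        if vol != volume_id or snap != first_snap_id:
            continue
        if (vol, chunk, second_snap_id) in meta:
            # The adjacent snapshot has its own block at this position,
            # so this block is unreferenced: delete it and reclaim space.
            addr = meta.pop((vol, chunk, snap))
            data.pop(addr, None)
            reclaimed.append(addr)
        else:
            # The adjacent snapshot still reads through to this block.
            referenced.append(chunk)
    return reclaimed, referenced
```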

The snapshot identifier of the deleted snapshot volume is recorded. Because some data blocks of a deleted snapshot volume may not yet have been reclaimed, the next time a snapshot volume is deleted, all recorded snapshot identifiers are searched and the space of the unreclaimed data blocks in the snapshot volumes corresponding to the recorded snapshot identifiers is reclaimed.

Optionally, when responding to the delete operation, the snapshot identifier of the referenced data block is modified to the snapshot identifier of the snapshot volume referencing the referenced data block.

Because the snapshot volume is deleted, a referenced data block should no longer carry the snapshot identifier of the deleted volume; its snapshot identifier is changed to that of the snapshot volume that references it. In effect, the data block no longer exists in the deleted snapshot data: it is moved from the deleted snapshot data to the undeleted snapshot data, after which the snapshot data to be deleted is removed and the space of its remaining data blocks is reclaimed. The referenced data block then does not need to be reclaimed again, which reduces the workload of data deletion and space reclamation and improves the efficiency of space reclamation.
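One possible reading of this identifier change is a re-keying of the metadata record, so that no data is physically copied. The sketch below rests on that assumption, and the function name is hypothetical.

```python
def retag_referenced_blocks(meta, volume_id, deleted_snap_id, next_snap_id):
    """Move referenced blocks from the deleted snapshot to the referencing one.

    meta: (volume_id, chunk_id, snap_id) -> address
    A block is treated as referenced when the referencing snapshot has no
    block of its own at the same position. Returns the number of re-keyed
    records; the stored addresses are left unchanged.
    """
    moved = 0
    for (vol, chunk, snap) in list(meta):  # copy keys: meta is mutated below
        if vol != volume_id or snap != deleted_snap_id:
            continue
        new_key = (vol, chunk, next_snap_id)
        if new_key not in meta:
            meta[new_key] = meta.pop((vol, chunk, snap))
            moved += 1
    return moved
```

After this step, under the stated assumption, no record carries the deleted snapshot identifier for a referenced block, so later reclamation passes have nothing further to examine for it.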

It should be noted that the present application also provides an alternative implementation, and the details of the implementation are described below.

By adopting the single-level directory structure, the amount of metadata required to be updated is small, and the written physical data amount is reduced relative to the tree directory structure.

The snapshot metadata consists of key-value pairs stored in a KV database through a KV interface. The key is the triple (VOLUME_ID, CHUNK_ID, SNAP_ID), that is, (volume ID, block ID, snapshot ID), and the value is the logical address of the corresponding Chunk.

A hash value is calculated from the two-tuple (VOLUME_ID, CHUNK_ID) and used as an index to find the interval in which the two-tuple is located. Each interval stores ordered KV records, and the logical address of the corresponding Chunk is found within the interval by binary search on the triple (VOLUME_ID, CHUNK_ID, SNAP_ID).
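The two-step lookup described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the class name `SnapshotIndex`, the bucket count `NUM_BUCKETS`, the use of CRC32 as the hash function, and the in-memory sorted lists standing in for the KV database are all assumptions made for illustration.

```python
import bisect
import zlib

NUM_BUCKETS = 16  # hypothetical number of intervals; the source does not specify one


def bucket_of(volume_id: int, chunk_id: int) -> int:
    """Hash the (VOLUME_ID, CHUNK_ID) two-tuple to select an interval.

    CRC32 is an assumed stand-in for the unspecified hash function.
    """
    return zlib.crc32(f"{volume_id}:{chunk_id}".encode()) % NUM_BUCKETS


class SnapshotIndex:
    """Each interval keeps KV records ordered by (VOLUME_ID, CHUNK_ID, SNAP_ID)."""

    def __init__(self) -> None:
        # Parallel sorted lists per interval: keys[i][j] maps to values[i][j].
        self.keys = [[] for _ in range(NUM_BUCKETS)]
        self.values = [[] for _ in range(NUM_BUCKETS)]

    def put(self, volume_id: int, chunk_id: int, snap_id: int, address: int) -> None:
        b = bucket_of(volume_id, chunk_id)
        key = (volume_id, chunk_id, snap_id)
        i = bisect.bisect_left(self.keys[b], key)
        if i < len(self.keys[b]) and self.keys[b][i] == key:
            self.values[b][i] = address      # key already present: update the value
        else:
            self.keys[b].insert(i, key)      # keep the interval ordered
            self.values[b].insert(i, address)

    def get(self, volume_id: int, chunk_id: int, snap_id: int):
        """Return the logical address for the triple, or None if it is absent."""
        b = bucket_of(volume_id, chunk_id)
        key = (volume_id, chunk_id, snap_id)
        i = bisect.bisect_left(self.keys[b], key)  # binary search inside the interval
        if i < len(self.keys[b]) and self.keys[b][i] == key:
            return self.values[b][i]
        return None
```

Because the hash covers only (VOLUME_ID, CHUNK_ID), every snapshot of one data block falls into the same interval, and the ordering inside the interval keeps those records adjacent.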

The data of all snapshots of the same data block are stored logically close together, that is, the data are laid out by block address first.

FIG. 2 is a schematic diagram of a data value based storage system according to an embodiment of the present invention. As shown in FIG. 2, a volume (Volume) is a collection that contains all snapshots as well as the current working instance; a snapshot (Snapshot) is a copy of all data on a volume at a certain time, and the data on the copy may not be deleted or rewritten; one Volume is divided into several Chunks, which are distributed across the Chunk Servers. The snapshot metadata consists of key-value pairs stored in a KV database through a KV interface: the key is (VOLUME_ID, CHUNK_ID, SNAP_ID), and the value is the logical address of the corresponding Chunk. Snapshot IDs are ordered from large to small by snapshot creation time, so an earlier snapshot has a larger ID.

Searching for the logical address at which a Chunk is stored:

The data of all snapshots of the same data block are stored logically close together. Fig. 3 is a schematic diagram of storing data by block address first according to an embodiment of the present invention. As shown in fig. 3, the advantage is that, for a read operation, if the data block does not exist under the snapshot ID of the current primary volume, it can be found in a nearby historical snapshot, so read performance is high. The disadvantages are: (1) when historical snapshots exist, each write operation requires that the data block written on the current primary volume be inserted between two old data blocks; (2) when there is a large amount of snapshot data, a sequential read of a large block of data becomes a non-sequential read on the disk medium, because a large number of historical snapshot data blocks lie in between. Because an SSD, used as the medium carrying the online data of the database, has high random read-write performance, the discontinuous layout of the data blocks of one snapshot is not a major problem for an SSD. "Storing nearby" here means "logically nearby": as long as the pointers that index two data blocks are relatively close to each other, the number of lookups needed to find the physical positions of the data blocks can be kept as small as possible.
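The block-address-first layout, and the second disadvantage noted above, can be made concrete with a short Python sketch. The helper names and the sample IDs are hypothetical, and plain ascending tuple ordering is assumed as a stand-in for the key order of the KV database.

```python
def layout(keys):
    """Order keys block-address first: by VOLUME_ID, then CHUNK_ID, then SNAP_ID."""
    return sorted(keys)


def positions_of_snapshot(ordered, snap_id):
    """Positions in the layout occupied by the blocks of one snapshot."""
    return [i for i, (_vol, _chunk, snap) in enumerate(ordered) if snap == snap_id]


def positions_of_chunk(ordered, volume_id, chunk_id):
    """Positions in the layout occupied by all snapshots of one data block."""
    return [
        i
        for i, (vol, chunk, _snap) in enumerate(ordered)
        if (vol, chunk) == (volume_id, chunk_id)
    ]


# Three chunks, each present in three snapshots (sample data, not from the figures).
keys = [(1, chunk, snap) for snap in (999999, 999996, 999995) for chunk in (0, 1, 2)]
ordered = layout(keys)

# All snapshots of chunk 1 sit next to each other, which favors reads that
# fall back to a nearby historical snapshot.
chunk_1_positions = positions_of_chunk(ordered, 1, 1)

# The blocks of one snapshot are spread out, so a sequential read of that
# snapshot is non-sequential in this layout.
primary_positions = positions_of_snapshot(ordered, 999995)
```

In this sample, the three versions of chunk 1 occupy consecutive positions, while the blocks of snapshot 999995 are separated by the historical versions of each chunk.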

Fig. 4 is a schematic diagram of a primary volume and a snapshot rollback according to an embodiment of the present invention. Taking fig. 4 as an example, snap999999 is the earliest snapshot and snap999995 is the current primary volume; the storage structure of the metadata in the KV database is shown in table 1.

After receiving a write request for a data block of the current primary volume, if the data block does not exist on the current primary volume, the data block is directly inserted. If so, the current data is overwritten.

When a read request is received, it is checked whether a corresponding data block exists under the snapshot ID of the current primary volume (999995 in fig. 4). If so, that data block is returned. If not, the snapshots earlier than the primary volume's snapshot ID (those with larger numbers) are searched for the data block, and if it is found it is returned. If no snapshot holds the data block, an all-zero data block is returned.

It should be noted that the read operation should ignore those data blocks that have been rolled back. But the deleted data blocks should not be ignored.
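The write and read rules for the primary volume can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment's code: the `Volume` class, the tiny `CHUNK_SIZE`, and the dictionary standing in for the KV database are hypothetical, rolled-back blocks are not modeled, and the sketch assumes that the nearest earlier snapshot (the smallest larger ID) is the one consulted when the primary volume has no block.

```python
CHUNK_SIZE = 4  # hypothetical block size, kept tiny for the example


class Volume:
    def __init__(self, volume_id: int, primary_snap_id: int) -> None:
        self.volume_id = volume_id
        self.primary_snap_id = primary_snap_id
        # (VOLUME_ID, CHUNK_ID, SNAP_ID) -> block data; stands in for the KV database.
        self.blocks = {}

    def write(self, chunk_id: int, data: bytes) -> None:
        """Insert the block if absent on the primary volume, otherwise overwrite it."""
        self.blocks[(self.volume_id, chunk_id, self.primary_snap_id)] = data

    def read(self, chunk_id: int) -> bytes:
        """Return the primary volume's block, else the nearest earlier snapshot's
        block (earlier snapshots have larger IDs), else an all-zero block."""
        candidates = sorted(
            snap
            for (vol, chunk, snap) in self.blocks
            if vol == self.volume_id
            and chunk == chunk_id
            and snap >= self.primary_snap_id
        )
        if candidates:
            # candidates[0] is the primary ID itself when the block exists there.
            return self.blocks[(self.volume_id, chunk_id, candidates[0])]
        return bytes(CHUNK_SIZE)
```

A write only ever touches the key that carries the primary volume's snapshot ID, so blocks recorded under earlier snapshot IDs stay unchanged.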

FIG. 5 is a schematic diagram of the data of a snapshot volume according to an embodiment of the present invention. As shown in FIG. 5, a snapshot may be mapped to a host as a read-only LUN. Data operations on the snapshot read-only LUN are similar to those on the primary volume, except that data blocks are looked up with the SNAP_ID of the snapshot volume instead of the SNAP_ID of the primary volume.

Fig. 6 is a schematic diagram of snapshot deletion according to an embodiment of the present invention, illustrating the operation of deleting one snapshot. When a snapshot is deleted, the space of some of its data blocks can be reclaimed, such as the crossed-out data blocks on snapshot 999996 in fig. 6. The space of other data blocks of the snapshot cannot be reclaimed, such as the 2nd data block in fig. 6 (whose data is 987), because that data is still referenced by the current primary volume snap999995.

Therefore, the snapshot deletion message sent to the Chunk Server should carry at least two pieces of information: (1) the ID of the deleted snapshot; (2) the ID of the snapshot that is closest to the deleted snapshot and newer than it. If a block is not present in that newer snapshot, the block cannot be deleted.
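The reclaim decision driven by these two IDs can be sketched in Python. The function name `delete_snapshot` and the dictionary of key-to-address records are hypothetical, and the sketch covers only the blocks of the snapshot being deleted, not blocks left behind by earlier deletions.

```python
def delete_snapshot(blocks, volume_id, deleted_snap, newer_snap):
    """Apply a deletion message carrying the deleted and the nearest newer snapshot ID.

    blocks maps (VOLUME_ID, CHUNK_ID, SNAP_ID) to a logical address.
    Returns (reclaimed_keys, retained_keys).
    """
    reclaimed, retained = [], []
    targets = sorted(k for k in blocks if k[0] == volume_id and k[2] == deleted_snap)
    for key in targets:
        chunk_id = key[1]
        if (volume_id, chunk_id, newer_snap) in blocks:
            # The newer snapshot has its own block here, so it does not
            # reference this one and the space can be reclaimed.
            del blocks[key]
            reclaimed.append(key)
        else:
            # The newer snapshot still reads this block, so it must stay.
            retained.append(key)
    return reclaimed, retained
```

For example, with blocks for chunk 1 under both 999996 and 999995 and a block for chunk 2 under 999996 only, deleting 999996 reclaims the chunk 1 block and retains the chunk 2 block.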

As shown in fig. 6, when the snapshot snap999996 is deleted, not only must the space of the data blocks whose snapshot ID is snap999996 be reclaimed, but the space still occupied by previously deleted snapshots may also need to be reclaimed, such as data block No. 4 of snap999997 and data block No. 6 of snap999998 in fig. 6. These data blocks were left behind because snapshot snap999996 needed to reference them. Now that snap999996 is also deleted, no other snapshot references them and the space they occupy can be reclaimed. However, the space of data block No. 3 of snap999997 in fig. 6 cannot be released, because it is still referenced by snap999995.

When snapshot deletion is implemented, there are two ways to handle data blocks that are still referenced: 1. Record the IDs of all snapshots that have been deleted; the next time a newer snapshot is deleted, also search all historically deleted snapshot IDs and reclaim the space of any data blocks under those IDs that needs to be reclaimed. 2. When a snapshot is deleted, change the snapshot ID of each data block that cannot be reclaimed to the ID of the snapshot that references it, so that the step of recording deleted snapshot IDs can be omitted.
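The second approach can be sketched in Python as follows. The function name `delete_snapshot_retag` and the dictionary of key-to-address records are hypothetical, and the sketch assumes that the referencing snapshot is the nearest newer snapshot named in the deletion message.

```python
def delete_snapshot_retag(blocks, volume_id, deleted_snap, newer_snap):
    """Delete a snapshot, re-tagging blocks that are still referenced.

    blocks maps (VOLUME_ID, CHUNK_ID, SNAP_ID) to a logical address.
    Returns the list of addresses whose space can be reclaimed.
    """
    reclaimed = []
    targets = sorted(k for k in blocks if k[0] == volume_id and k[2] == deleted_snap)
    for key in targets:
        chunk_id = key[1]
        address = blocks.pop(key)
        newer_key = (volume_id, chunk_id, newer_snap)
        if newer_key in blocks:
            # The newer snapshot has its own block: reclaim the old one.
            reclaimed.append(address)
        else:
            # Still referenced: move the block under the referencing snapshot's ID,
            # so no record under the deleted snapshot ID remains.
            blocks[newer_key] = address
    return reclaimed


# Sample data loosely following the fig. 6 discussion (addresses are made up).
blocks = {(1, 4, 999997): 0xA0, (1, 3, 999997): 0xB0, (1, 4, 999995): 0xC0}
first = delete_snapshot_retag(blocks, 1, 999997, 999996)   # nothing reclaimable yet
second = delete_snapshot_retag(blocks, 1, 999996, 999995)  # block No. 4 is now freed
```

Because each retained block is moved under a live snapshot ID, a later deletion handles it through the ordinary path, and no list of deleted snapshot IDs has to be kept.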

This embodiment adopts a single-level directory structure, so the amount of metadata that must be updated is small and the amount of physical data written is reduced compared with a tree directory structure. The snapshot metadata consists of key-value pairs stored in a KV database through a KV interface: the key is (VOLUME_ID, CHUNK_ID, SNAP_ID), and the value is the logical address of the corresponding Chunk. A hash value is calculated from the two-tuple (VOLUME_ID, CHUNK_ID) and used as an index to find the interval in which the two-tuple is located; the interval stores ordered KV records, and the logical address of the corresponding Chunk is found by binary search on the triple (VOLUME_ID, CHUNK_ID, SNAP_ID). The data of all snapshots of the same data block are stored logically close together, that is, the data are laid out by block address first.

In this embodiment, a different method is used to store the snapshot metadata: all snapshot metadata of the same data block are stored logically close together and the data are laid out by block address first, so the data structure is simple, the implementation difficulty is low, and errors are unlikely. Newly written data requires only a small amount of metadata to be updated and incurs little write amplification, which makes the scheme suitable for solid-state drives (SSD).

Fig. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 7, according to another aspect of the embodiment of the present invention, there is also provided a data processing apparatus including: a storage module 72, a generation module 74 and an update module 76, which are described in detail below.

The storage module 72 is configured to store the acquired snapshot data by using a single-level directory structure, where the snapshot data includes a plurality of data blocks and data block pointers for indicating the data blocks; the generation module 74 is connected to the storage module 72 and is configured to generate corresponding metadata according to the snapshot data, where the metadata includes an index of the data block pointers and is a key-value pair structure; the update module 76 is connected to the generation module 74 and is configured to update the metadata according to updates to the snapshot data.

With this apparatus, the acquired snapshot data is stored by using a single-level directory structure, where the snapshot data includes a plurality of data blocks and data block pointers for indicating the data blocks; corresponding metadata is generated according to the snapshot data, where the metadata includes an index of the data block pointers and is a key-value pair structure; and the metadata is updated according to updates to the snapshot data. Storing the snapshot data in a single-level directory simplifies the data structure and reduces the amount of physical data written, thereby improving the utilization of storage space and the accuracy of the snapshot data, and solving the technical problems of complex structure, large data volume, and proneness to error that arise in the related art because snapshot data relies on pointers to connect data blocks into a directed acyclic graph data structure.

According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes a data processing method of any one of the above.

According to another aspect of the embodiments of the present invention, there is also provided a computer storage medium, which includes a stored program, wherein when the program runs, an apparatus in which the computer storage medium is located is controlled to execute the data processing method of any one of the above.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A data processing method, comprising:

storing the acquired snapshot data by adopting a single-level directory structure, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks;

generating corresponding metadata according to the snapshot data, wherein the metadata comprises an index of the data block pointer, and the metadata is a key-value pair structure;

and updating the metadata according to the updating of the snapshot data.

2. The method of claim 1, wherein storing the obtained snapshot data using a single-level directory structure comprises:

determining one or more data blocks of different snapshot volumes corresponding to target positions of the data blocks in the snapshot data, wherein the different snapshot volumes include a plurality of snapshot volumes obtained by performing snapshots successively and repeatedly, the snapshot data is a current working volume, and the snapshot data includes a plurality of current working data blocks of the plurality of different snapshot volumes;

and storing one or more data blocks corresponding to the target position at positions of which the storage positions do not exceed a preset distance from each other.

3. The method of claim 2, wherein the metadata is a key-value pair, wherein the key comprises a volume identifier, a block identifier, and a snapshot identifier, and wherein the value comprises an address of a corresponding data block, the method further comprising:

receiving a request for searching a first target data block, wherein the request comprises a two-tuple of a volume identifier and a block identifier;

calculating a hash value according to the two-tuple;

taking the hash value as an index, and searching for an interval where the two-tuple is located, wherein the interval stores ordered key-value pair records;

and searching for a corresponding target key-value pair in the key-value pair records of the interval by using the volume identifier, the block identifier and the snapshot identifier, and determining a corresponding first target data block by using the target key-value pair, wherein the snapshot identifier is used for identifying different snapshot volumes.

4. The method of claim 3, further comprising:

receiving a write request for a second target data block of the snapshot data, wherein the second target data block is a data block at any first data block position in the snapshot data;

writing the second target data block into the first data block position of the current volume of the snapshot data when no data block exists in the first data block position of the current volume;

and overwriting the data block at the first data block position with the second target data block when a data block exists at the first data block position of the current volume.

5. The method of claim 3, further comprising:

receiving a read request for reading a third target data block, wherein the third target data block is a data block at any second data block position in the snapshot data;

reading a data block at the second data block position of the current volume under the condition that the data block exists at the second data block position of the current volume;

reading a data block at the second data block position of a snapshot volume before the current volume under the condition that no data block exists at the second data block position of the current volume;

and returning an all-zero data block in response to the read request when no snapshot volume has a data block at the second data block position.

6. The method of claim 3, further comprising:

receiving a deletion operation on the snapshot volume, wherein the deletion operation comprises a first snapshot identification of the snapshot volume to be deleted and a second snapshot identification of a next snapshot volume adjacent to the snapshot volume to be deleted;

determining that data blocks which do not exist in the snapshot volume identified by the second snapshot are referenced data blocks according to the first snapshot identification and the second snapshot identification;

deleting data blocks, except the referenced data block, in the snapshot volume identified by the first snapshot in response to the deleting operation;

and reclaiming the space of the deleted data blocks and the space of the data blocks referenced by the deleted data blocks.

7. The method of claim 6, further comprising:

recording the snapshot identification of the deleted snapshot volume;

and when a snapshot volume is deleted next time, searching all recorded snapshot identifiers, and reclaiming the space of the unreclaimed data blocks in the snapshot volumes corresponding to the recorded snapshot identifiers.

8. The method of claim 6, further comprising:

and when the deletion operation is responded, modifying the snapshot identification of the referenced data block into the snapshot identification of the snapshot volume which refers to the referenced data block.

9. A data processing apparatus, comprising:

the storage module is used for storing the acquired snapshot data by adopting a single-level directory structure, wherein the snapshot data comprises a plurality of data blocks and data block pointers for indicating the data blocks;

the generating module is used for generating corresponding metadata according to the snapshot data, wherein the metadata comprises an index of the data block pointer, and the metadata is a key-value pair structure;

and the updating module is used for updating the metadata according to the updating of the snapshot data.

10. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the data processing method according to any one of claims 1 to 8 when running.

11. A computer storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the computer storage medium is located to perform the data processing method of any one of claims 1 to 8.