CN111858158A - Data processing method and device and electronic equipment

Info

Publication number: CN111858158A
Application number: CN202010565082.XA
Authority: CN
Inventors: 梁建群
Original assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Current assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date: 2020-06-19
Filing date: 2020-06-19
Publication date: 2020-10-30
Anticipated expiration: 2040-06-19
Also published as: CN111858158B

Abstract

The embodiment of the disclosure discloses a data processing method, a data processing device, an electronic device and a computer readable storage medium. The method comprises the following steps: when any data set is subjected to snapshot to generate a data snapshot, storing the data snapshot in a first preset storage space, wherein the data snapshot comprises first source data subjected to snapshot in the data set and metadata corresponding to the first source data; after the first source data is snapshot, if written second source data is received, storing the second source data in a second preset storage space; generating metadata corresponding to the second source data; storing metadata corresponding to the second source data in a third preset storage space; and when a processing operation instruction for the data set is received, executing the processing operation instruction for the data set according to the source data of the data set, the metadata corresponding to the source data of the data set and/or the data snapshot. The data processing method provided by the application improves the efficiency of data processing.

Description

Data processing method and device and electronic equipment

Technical Field

The disclosed embodiments relate to the field of data processing technologies, and in particular, to a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium.

Background

A snapshot is a backup mechanism that captures the data in a storage area at a specific point in time, in real time and without affecting the efficiency of the data service; the snapshot itself is a storage area that exists independently on a storage medium. When a snapshot is created, the user needs to specify information such as the disk space to be used and the corresponding storage area, and the snapshot backs up the data as of the creation time point. Snapshot data is static and cannot be modified; after the snapshot is created, subsequent modifications to the original data set do not affect the snapshot data.

Currently, the most widely applied snapshot technology is the copy-on-write (COW) mechanism. The main feature of the COW mechanism is that when data is subsequently processed, for example when a piece of data is modified, all data in the data set to which that data belongs must first be completely copied and stored, and only then is the data to be modified read and modified. Copying before modifying in this way is very time consuming.

Therefore, a new data processing scheme is required to solve the above problems.

Disclosure of Invention

An object of the embodiments of the present disclosure is to provide a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium, which can improve the efficiency of data processing.

According to a first aspect of the embodiments of the present disclosure, a data processing method is provided, including:

when any data set is subjected to snapshot to generate a data snapshot, storing the data snapshot in a first preset storage space, wherein the data snapshot comprises first source data subjected to snapshot in the data set and metadata corresponding to the first source data, and the metadata corresponding to the first source data comprises storage parameters of the first source data, key value information of a data set to which the first source data belongs, a pre-allocated snapshot version number of the first source data and a data sequence number of the first source data;

after the first source data is snapshot, if written second source data is received, storing the second source data in a second preset storage space;

generating metadata corresponding to the second source data, wherein the metadata corresponding to the second source data comprise storage parameters of the second source data, key value information of a data set to which the second source data belong, a pre-allocated snapshot version number of the second source data, and a data sequence number of the second source data, the pre-allocated snapshot version number of the second source data sequentially increases on the basis of the pre-allocated snapshot version number of the first source data, and the data sequence number of the second source data sequentially increases according to the number of the source data in the data set;

storing metadata corresponding to the second source data in a third preset storage space;

when a processing operation instruction for the data set is received, executing the processing operation instruction for the data set according to the source data of the data set, the metadata corresponding to the source data of the data set and/or the data snapshot;

the data set comprises at least one piece of source data with the same key value information, and the second source data and the first source data belong to the same data set.

According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including:

the snapshot module is used for performing snapshot on any data set to generate a data snapshot and storing the data snapshot in a first preset storage space, wherein the data snapshot comprises first source data subjected to snapshot in the data set and metadata corresponding to the first source data, and the metadata corresponding to the first source data comprises storage parameters of the first source data, key value information of the data set to which the first source data belongs, a pre-allocated snapshot version number of the first source data and a data sequence number of the first source data;

the writing module is used for storing second source data in a second preset storage space after the first source data is subjected to snapshot and when the written second source data is received; generating metadata corresponding to the second source data, wherein the metadata corresponding to the second source data comprise storage parameters of the second source data, key value information of a data set to which the second source data belong, a pre-allocated snapshot version number of the second source data, and a data sequence number of the second source data, the pre-allocated snapshot version number of the second source data sequentially increases on the basis of the pre-allocated snapshot version number of the first source data, and the data sequence number of the second source data sequentially increases according to the number of the source data in the data set; storing metadata corresponding to the second source data in a third preset storage space;

the processing module is used for executing the processing operation instruction on the data set according to the source data of the data set, the metadata corresponding to the source data of the data set and/or the data snapshot when the processing operation instruction on the data set is received.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device including: the data processing apparatus according to the second aspect of the embodiments of the present disclosure; or, alternatively, a processor and a memory, the memory being configured to store executable instructions for controlling the processor to perform the data processing method according to the first aspect of the embodiments of the present disclosure.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer instructions, which, when executed by a processor, implement the data processing method of the first aspect of the embodiments of the present disclosure.

According to the embodiment of the disclosure, when source data is written into a data set, the source data is stored in a new storage space and corresponding metadata is generated for it; the metadata corresponding to the source data comprises a storage parameter of the source data, a pre-allocated snapshot version number and a data sequence number. That is, each piece of written source data is stored and has corresponding metadata generated for it. With this method, modifying source data in the set does not require copying all the data in the set; instead, the source data to be modified is written directly, which saves data processing time and improves data processing efficiency.

Other features of embodiments of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which is to be read in connection with the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the embodiments of the disclosure.

Fig. 1 is a block diagram of a hardware configuration of an electronic device that can be used to implement an embodiment of the present disclosure.

Fig. 2 is a flow chart of steps of a data processing method according to an embodiment of the present disclosure.

Fig. 3 is a block diagram of a data processing apparatus according to an embodiment of the present disclosure.

Fig. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Fig. 5 is a schematic diagram of metadata according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of parts and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the embodiments of the present disclosure unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the embodiments of the disclosure, their application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

Fig. 1 is a block diagram showing a hardware configuration of an electronic apparatus 1000 that can implement an embodiment of the present disclosure.

The electronic device 1000 may be a laptop, desktop, tablet, server, workstation, etc.

The server may be a unitary server or a distributed server spanning multiple computers or computer data centers. The server may be of various types, such as, but not limited to, a node device of a content distribution network, a storage server of a distributed storage system, a cloud database server, a cloud computing server, a cloud management server, a web server, a news server, a mail server, a message server, an advertisement server, a file server, an application server, an interaction server, a storage server, a database server, a proxy server, or the like. In some embodiments, each server may include hardware, software, or embedded logic components or a combination of two or more such components for performing the appropriate functions supported or implemented by the server. For example, the server may be a blade server, a cloud server, or the like, or may be a server group consisting of a plurality of servers, which may include one or more of the above types of servers.

As shown in fig. 1, the electronic device 1000 may include a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and the like. The processor 1100 may be a central processing unit (CPU), a microprocessor (MCU), or the like. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1400 is capable of wired or wireless communication, for example, and may specifically include WiFi communication, Bluetooth communication, 2G/3G/4G/5G communication, and the like. The display device 1500 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1600 may include, for example, a touch screen, a keyboard, a somatosensory input, and the like. A user can output and input voice information through the speaker 1700 and the microphone 1800, respectively.

The electronic device shown in fig. 1 is merely illustrative and is in no way intended to limit the embodiments of the disclosure, their application, or uses. In an embodiment of the present disclosure, the memory 1200 of the electronic device 1000 is used for storing instructions for controlling the processor 1100 to operate so as to execute any data processing method provided by the embodiment of the present disclosure. It should be understood by those skilled in the art that although a plurality of devices are shown for the electronic device 1000 in fig. 1, embodiments of the present disclosure may involve only some of them; for example, the electronic device 1000 may involve only the processor 1100 and the memory 1200. The skilled person can design the instructions according to the disclosed embodiments of the present disclosure. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.

< method of processing data >

Data snapshotting is a technique for recording how data changes. With snapshot technology, modifications of data can be traced, data damage can be repaired, data loss can be reduced, and data reading can also rely on the snapshot. The embodiment of the disclosure provides a data processing method that is implemented based on snapshot technology.

First, concepts related to an embodiment of the present disclosure are described.

A data set is a collection of source data, and each data set has unique key value information (Key). That is, the key value information is the unique set identifier of the data set, and one data set can be uniquely determined according to the key value information. The source data may be the smallest unit of data used for storing, modifying, reading, and deleting. Each source data in the data set can be stored in a respective storage space.

Metadata (data about data) is data describing source data, and can be used to describe attributes of the source data. Each source data has a unique piece of metadata corresponding to the source data, and each piece of metadata can be stored in a corresponding storage space.

In the embodiment of the present disclosure, when new source data needs to be added to a data set, the new source data is directly written into a new storage space. When existing source data in the data set needs to be modified, the existing source data is not overwritten, that is, it is not deleted; instead, the modified source data is directly written into a new storage space. In other words, in the embodiment of the present disclosure, every time source data is written, the written source data is stored in a new storage space. Likewise, in embodiments of the present disclosure, whenever source data is written to a data set, corresponding metadata is generated for the written source data.

In the embodiment of the present disclosure, when a data set is snapshotted, the generated data snapshot includes both source data and the corresponding metadata; that is, the snapshot does not capture the source data alone, but captures the source data together with the metadata corresponding to it.

In an embodiment of the present disclosure, metadata corresponding to source data includes: the storage parameter of the source data, the key value information of the data set to which the source data belongs, the pre-allocated snapshot version number (Epoch) of the source data, and the data Sequence number (Sequence) of the source data.

In the embodiment of the present disclosure, the snapshot version number is pre-allocated, and each time a snapshot is taken, the pre-allocated snapshot version number is increased by 1. A specific example is given below. The snapshot version number pre-allocated to the initial data set is 1; at this point, no snapshot has been taken yet. Each time a snapshot of the data set is taken, the pre-allocated snapshot version number is increased by 1. After the first snapshot is taken (named, for example, Snapshot1, with actual snapshot version number 1), the pre-allocated snapshot version number becomes 2. After the second snapshot is taken (named, for example, Snapshot2, with actual snapshot version number 2), the pre-allocated snapshot version number becomes 3. After the third snapshot is taken (named, for example, Snapshot3, with actual snapshot version number 3), the pre-allocated snapshot version number becomes 4. By analogy, after the Nth snapshot is completed, snapshot version number N+1 is pre-allocated to the (N+1)th snapshot, where N is an integer greater than or equal to 1.

In the embodiment of the present disclosure, the pre-allocated snapshot version number of the source data is the pre-allocated snapshot version number in effect at the time point when the source data is written. For example, after the Nth snapshot is completed, the pre-allocated snapshot version number changes to N+1, and all source data written in the period from the completion of the Nth snapshot to the completion of the (N+1)th snapshot have the pre-allocated snapshot version number N+1.

In the embodiment of the present disclosure, the data sequence numbers increase sequentially with the number of source data in the data set to which the source data belongs. That is, for any data set, the data sequence number corresponding to the 1st source data written into the set is 1, the data sequence number corresponding to the 2nd source data written into the set is 2, …, and the data sequence number corresponding to the Mth source data written into the set is M, where M is an integer greater than or equal to 1. The data sequence number therefore identifies which source data in the data set was written most recently: the one with the largest data sequence number.
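The numbering rules described above can be illustrated with a minimal Python sketch. It is only one possible reading of the text; the class name DataSetCounters and its method names are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DataSetCounters:
    """Hypothetical per-data-set counters (names are illustrative only)."""
    epoch: int = 1     # pre-allocated snapshot version number, initially 1
    sequence: int = 0  # data sequence number of the most recent write

    def next_write(self) -> Tuple[int, int]:
        """Return the (snapshot version number, data sequence number) for a new write."""
        self.sequence += 1
        return self.epoch, self.sequence

    def take_snapshot(self) -> int:
        """Return the version number used by this snapshot, then pre-allocate the next one."""
        taken = self.epoch
        self.epoch += 1
        return taken


counters = DataSetCounters()
first = counters.next_write()     # written before any snapshot
snap1 = counters.take_snapshot()  # Snapshot1 uses version number 1
second = counters.next_write()    # written after Snapshot1
third = counters.next_write()
```

In this sketch, source data written between the first and second snapshots carries snapshot version number 2, while the data sequence number keeps increasing across snapshots.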

In the embodiment of the present disclosure, when performing a data snapshot on a set, the source data with the highest data sequence number in the data set and the corresponding metadata may be snapshot. In this case, the data sequence number of the data snapshot is the data sequence number corresponding to the source data in the data snapshot.

Referring to fig. 5, in the embodiment of the present disclosure, the storage parameters of the source data include: the file name (File) of the data file storing the source data, the offset (offset) of the source data in the data file, and the size (size) of the source data. Specifically, the file name is the name of the data file storing the source data, and the data file can be located by the file name. The offset is the position offset of the source data relative to a preset position in the data file; it indicates the starting position of the source data in the data file and is used for locating the source data within the file. The size of the source data is the size of the storage space occupied by the source data; combined with the starting position, it allows the end position of the source data in the data file to be located. With storage parameters of this structure, the source data can be read quickly and accurately.
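As an illustrative sketch only, reading source data from its storage parameters might look as follows in Python. The names StorageParams and read_source_data are hypothetical, the offset is assumed to be measured in bytes from the beginning of the file, and the demo file contents are invented for the example.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageParams:
    file: str    # name (here: path) of the data file storing the source data
    offset: int  # start position of the source data, assumed to be in bytes from file start
    size: int    # number of bytes occupied by the source data


def read_source_data(params: StorageParams) -> bytes:
    """Locate the data file, seek to the offset, and read exactly `size` bytes."""
    with open(params.file, "rb") as f:
        f.seek(params.offset)
        return f.read(params.size)


# Demo with a throwaway data file holding two 8-byte pieces of source data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "10")
    with open(path, "wb") as f:
        f.write(b"AAAAAAAA" + b"BBBBBBBB")
    params = StorageParams(file=path, offset=8, size=8)
    payload = read_source_data(params)
    end_position = params.offset + params.size  # start position plus size
```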

Referring to fig. 5, based on the metadata "<key value information: 3; snapshot version number: 2; data sequence number: 2; file name: 10; offset: 300; data size: 8>" and the metadata "<key value information: 3; snapshot version number: 2; data sequence number: 3; file name: 10; offset: 200; data size: 8>", it can be seen that, in the data set with key value information 3, the data sequence number of the most recently written source data is 3, and that source data is stored in the data file with file name 10 at offset 200.
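The two metadata records above can be modelled as follows. This is a sketch under the assumption that a metadata record is a flat tuple of the six listed fields; the type name Metadata and its field names are hypothetical.

```python
from typing import NamedTuple


class Metadata(NamedTuple):
    key: int       # key value information of the data set
    epoch: int     # pre-allocated snapshot version number
    sequence: int  # data sequence number
    file: str      # file name of the data file
    offset: int    # offset of the source data in the data file
    size: int      # size of the source data


records = [
    Metadata(key=3, epoch=2, sequence=2, file="10", offset=300, size=8),
    Metadata(key=3, epoch=2, sequence=3, file="10", offset=200, size=8),
]

# The most recently written source data of data set 3 has the largest sequence number.
latest = max((m for m in records if m.key == 3), key=lambda m: m.sequence)
```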

In the embodiment of the disclosure, the metadata can be used to achieve the purpose of data retrieval, so that the source data can be quickly and accurately searched and read in the storage space, and the operations of reading, modifying and deleting the source data are facilitated.

Referring to fig. 2, a data processing method provided by the embodiment of the disclosure is described, which includes steps S102-S108. The data processing method may be implemented by an electronic device, which may be, for example, the electronic device 1000 as shown in fig. 1.

S102, when any data set is subjected to snapshot to generate a data snapshot, the data snapshot is stored in a first preset storage space. The data snapshot comprises first source data for snapshot in the data set and metadata corresponding to the first source data. The metadata corresponding to the first source data comprises storage parameters of the first source data, key value information of a data set to which the first source data belongs, a pre-allocated snapshot version number of the first source data and a data sequence number of the first source data.

In the embodiment of the present disclosure, at least one piece of source data having the same key value information is included in the data set.

In a specific example, when the data set is snapshot, first source data to be snapshot in the data set is determined, and a data snapshot of the data set is determined according to the first source data to be snapshot and metadata corresponding to the first source data. In a specific example, the first source data is the source data with the highest data sequence number in the data set. That is, when a set is snapshot, the source data with the highest data sequence number in the data set and the corresponding metadata may be snapshot.

In a specific example, the data set may be snapshotted to generate a data snapshot when a snapshot instruction is received. In another specific example, the data set may be snapshotted according to a preset snapshot period to generate a data snapshot. The method of the disclosed embodiments does not limit the triggering mechanism of the snapshot.

In a specific example, after the snapshot is performed, snapshot link information corresponding to the data snapshot is generated and stored, where the snapshot link information includes key value information of a data set to which the data snapshot belongs and a pre-assigned snapshot version number of the data snapshot. According to the snapshot link information, the data set and the snapshot version number corresponding to the data snapshot can be quickly determined.
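A minimal sketch of step S102 with snapshot link information is given below. It assumes the example in which the first source data is the source data with the highest data sequence number; the in-memory dictionaries stand in for the preset storage spaces, and all names (Metadata, DataSnapshot, take_snapshot) are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Metadata(NamedTuple):
    key: int
    epoch: int
    sequence: int
    file: str
    offset: int
    size: int


class DataSnapshot(NamedTuple):
    source_data: bytes  # first source data captured by the snapshot
    metadata: Metadata  # metadata corresponding to the first source data


def take_snapshot(
    metadata: List[Metadata],
    payloads: Dict[int, bytes],                     # sequence number -> source data (stand-in)
    links: Dict[Tuple[int, int], DataSnapshot],     # (key, snapshot version) -> snapshot
    key: int,
    epoch: int,
) -> DataSnapshot:
    """Snapshot the highest-sequence source data of a data set and record its link."""
    first_meta = max((m for m in metadata if m.key == key), key=lambda m: m.sequence)
    snapshot = DataSnapshot(payloads[first_meta.sequence], first_meta)
    links[(key, epoch)] = snapshot  # snapshot link information: key + version number
    return snapshot


metadata = [
    Metadata(1, 1, 1, "10", 0, 8),
    Metadata(1, 1, 2, "10", 8, 8),
]
payloads = {1: b"old-data", 2: b"new-data"}
links: Dict[Tuple[int, int], DataSnapshot] = {}
snap = take_snapshot(metadata, payloads, links, key=1, epoch=1)
```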

S104, after the first source data is subjected to snapshot, if written second source data is received, storing the second source data in a second preset storage space, wherein the second source data and the first source data belong to the same data set.

In the embodiment of the present disclosure, the second preset storage space and the first preset storage space are different storage spaces, that is, the writing of the second source data does not cause the existing source data in the data set to be overwritten.

S106, generating metadata corresponding to the second source data, and storing the metadata corresponding to the second source data in a third preset storage space. The metadata corresponding to the second source data comprises storage parameters of the second source data, key value information of a data set to which the second source data belongs, a pre-allocated snapshot version number of the second source data, and a data sequence number of the second source data. The pre-allocated snapshot version number of the second source data is sequentially increased on the basis of the pre-allocated snapshot version number of the first source data, and the data sequence number of the second source data is sequentially increased according to the number of the source data in the data set.

As can be seen from the foregoing, in the embodiment of the present disclosure, every time one source data is written, corresponding metadata is generated for the source data.
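Steps S104 and S106 can be sketched as an append-only write path. The sketch below is an assumption-laden illustration: a bytearray stands in for the data file, a list stands in for the metadata storage space, the storing of the data snapshot itself is omitted, and the class AppendOnlyStore and its members are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, NamedTuple


class Metadata(NamedTuple):
    key: int
    epoch: int
    sequence: int
    file: str
    offset: int
    size: int


@dataclass
class AppendOnlyStore:
    file: str = "10"                                        # hypothetical data file name
    data: bytearray = field(default_factory=bytearray)      # stand-in for the data file
    metadata: List[Metadata] = field(default_factory=list)  # stand-in for metadata storage
    epochs: Dict[int, int] = field(default_factory=dict)    # key -> pre-allocated version

    def snapshot(self, key: int) -> int:
        """Use the pre-allocated version number, then pre-allocate the next one."""
        taken = self.epochs.get(key, 1)
        self.epochs[key] = taken + 1
        return taken

    def write(self, key: int, payload: bytes) -> Metadata:
        """Store the payload in new space and generate its metadata; nothing is overwritten."""
        offset = len(self.data)
        self.data += payload
        sequence = 1 + sum(1 for m in self.metadata if m.key == key)
        meta = Metadata(key, self.epochs.get(key, 1), sequence,
                        self.file, offset, len(payload))
        self.metadata.append(meta)
        return meta


store = AppendOnlyStore()
m1 = store.write(3, b"version1")  # first source data
store.snapshot(3)                 # snapshot taken after the first write
m2 = store.write(3, b"version2")  # second source data, written after the snapshot
```

Here the modified data is appended rather than written over the earlier data, so the earlier source data remains readable after the second write.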

S108, when the processing operation instruction for the data set is received, executing the processing operation instruction for the data set according to the source data of the data set, the metadata corresponding to the source data of the data set and/or the data snapshot.

According to the data processing method provided by the embodiment of the disclosure, when modifying the source data in the set, copying of all the data in the set is not needed, but the source data to be modified is directly written, so that the data processing time is saved, and the data processing efficiency is improved. By generating the metadata for the source data written in each time and making the snapshot by using the source data and the corresponding metadata, the subsequent operations of reading, modifying, deleting and the like of the source data are facilitated, and the data processing efficiency is improved.

In the embodiment of the present disclosure, the processing operation instruction may be an instruction for reading, modifying, or deleting. Such an instruction may operate on a single piece of data or on a batch of data. The following examples are given.

< example 1 >:

The processing operation instruction is a data reading instruction used for reading data to be read, and the data reading instruction comprises target key value information and a target snapshot version number.

When a processing operation instruction for the data set is received, executing the processing operation instruction for the data set according to the source data of the data set, the metadata corresponding to the source data of the data set and/or the data snapshot includes:

S202, target key value information and a target snapshot version number in the data reading instruction are obtained.

S204, searching first target metadata comprising target key value information and a target snapshot version number from all metadata of the data set.

That is, metadata consistent with the target key value information and the target snapshot version number in the data reading instruction is searched as the first target metadata.

S206, determining the metadata with the largest data sequence number in the first target metadata as second target metadata.

There may be a plurality of first target metadata searched, and the metadata with the largest data sequence number selected from the first target metadata is determined as the second target metadata.

S208, determining the storage parameters of the data to be read from the second target metadata.

S210, reading data to be read according to the storage parameters in the second target metadata.

In a specific example, referring to fig. 5, a client sends a data reading instruction to a server, where target key value information in the data reading instruction is 1, and a target snapshot version number is 2, that is, the client wants to read source data with a snapshot version number of 2 in a data set with key value information of 1. The server finds out two first target metadata according to the target key value information 1 and the target snapshot version number 2, determines the metadata with the data sequence number of 5 as second target metadata, and reads the source data according to the storage parameters in the second target metadata and returns the source data to the client.
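Steps S202 to S208 can be sketched as a filter followed by a maximum over the data sequence number. The function name find_read_target is hypothetical, and the file names, offsets, and sizes in the sample records are invented for illustration; only the key value 1, snapshot version number 2, and data sequence number 5 come from the example above.

```python
from typing import List, NamedTuple, Optional


class Metadata(NamedTuple):
    key: int
    epoch: int
    sequence: int
    file: str
    offset: int
    size: int


def find_read_target(metadata: List[Metadata], key: int, epoch: int) -> Optional[Metadata]:
    # S204: first target metadata match both the key and the snapshot version number.
    first_targets = [m for m in metadata if m.key == key and m.epoch == epoch]
    if not first_targets:
        return None
    # S206: second target metadata is the one with the largest data sequence number.
    return max(first_targets, key=lambda m: m.sequence)


metadata = [
    Metadata(1, 1, 3, "10", 0, 8),
    Metadata(1, 2, 4, "10", 8, 8),
    Metadata(1, 2, 5, "10", 16, 8),
    Metadata(2, 2, 1, "11", 0, 8),
]
target = find_read_target(metadata, key=1, epoch=2)
# S208: the storage parameters used for the read in S210.
storage_params = (target.file, target.offset, target.size)
```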

< example 2 >:

S302, target key value information and a target snapshot version number in the data reading instruction are obtained.

S304, determining whether the metadata consistent with the target key value information and the target snapshot version number exists in all the metadata of the data set.

When it is determined that metadata consistent with both the target key value information and the target snapshot version number exists in all metadata of the data set, that metadata is taken as the first target metadata, and steps S306-S310 are executed.

S306, determining the metadata with the largest data sequence number in the first target metadata as second target metadata.

S308, determining the storage parameters of the data to be read from the second target metadata.

S310, reading data to be read according to the storage parameters in the second target metadata.

When it is determined that no metadata consistent with both the target key value information and the target snapshot version number exists in all metadata of the data set, steps S312-S318 are executed.

S312, sequentially decrementing the target snapshot version number until third target metadata is determined from all metadata of the data set; the third target metadata includes the target key value information and the decremented target snapshot version number.

Specifically, when it is determined that no metadata of the data set is consistent with both the target key value information and the target snapshot version number, the target snapshot version number may be reselected by decrementing it step by step, so that the source data can still be read and nothing is missed during data reading.

For example, the target key value information in the data read instruction is 2, and the target snapshot version number is 6. All metadata of the data set with the target key value information of 2 are regarded as candidate metadata sets. When the metadata with the snapshot version number of 6 does not exist in the candidate metadata set, the metadata with the snapshot version number of 5 is tried to be obtained in the candidate metadata set to serve as third target metadata. And if the candidate metadata set does not have the metadata with the snapshot version number of 5, continuing to decrement the target snapshot version number, and trying to acquire the metadata with the snapshot version number of 4 in the candidate metadata set as third target metadata. And so on until a third target metadata is obtained in the candidate metadata set.

S314, determining the metadata with the largest data sequence number in the third target metadata as fourth target metadata.

S316, determining the storage parameters of the data to be read from the fourth target metadata.

S318, reading the data to be read according to the storage parameters in the fourth target metadata.
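The exact-match lookup and the decrementing fallback described above can be sketched together as follows. This is a minimal illustration, not the patented implementation: the `Metadata` class, the `location` field standing in for the storage parameters, the function name `find_read_metadata`, and the assumption that snapshot version numbers start at 1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Metadata:
    key: int               # key value information of the data set
    snapshot_version: int  # pre-allocated snapshot version number
    seq: int               # data sequence number
    location: str          # hypothetical stand-in for the storage parameters


def find_read_metadata(all_metadata: List[Metadata],
                       target_key: int,
                       target_version: int) -> Optional[Metadata]:
    """Pick the metadata whose storage parameters should drive the read."""
    # Candidate set: all metadata of the data set with the target key.
    candidates = [m for m in all_metadata if m.key == target_key]
    version = target_version
    while version >= 1:  # assumption: version numbers start at 1
        matches = [m for m in candidates if m.snapshot_version == version]
        if matches:
            # The metadata with the largest data sequence number is selected.
            return max(matches, key=lambda m: m.seq)
        version -= 1  # no match: decrement the target version and retry
    return None


metadata = [
    Metadata(1, 3, 23, "loc-23"),
    Metadata(1, 3, 25, "loc-25"),
    Metadata(1, 5, 30, "loc-30"),
    Metadata(1, 5, 31, "loc-31"),
]
exact = find_read_metadata(metadata, 1, 5)     # version 5 exists
fallback = find_read_metadata(metadata, 1, 4)  # version 4 missing, falls back to 3
```

In this sketch a read for version 4 resolves to the version-3 record with the largest sequence number, which mirrors the fallback rule of S312-S314.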

< example 3 >:

The processing operation instruction is a data snapshot reading instruction used for reading a data snapshot, and the data reading instruction includes target key value information and a target snapshot version number.

S402, acquiring the target key value information and the target snapshot version number in the data reading instruction to construct target snapshot link information.

S404, searching for the target snapshot link information in the stored snapshot link information to read the data snapshot corresponding to the target snapshot link information.

In a specific example, the target snapshot link is determined by using the key value information and the snapshot version number of the data set recorded by the snapshot link information, and the data snapshot indicated by the target snapshot link can be quickly read by clicking the target snapshot link.
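As a rough sketch of S402-S404, snapshot link information can be modelled as a (key value information, snapshot version number) pair used as a lookup key. The dictionary layout, the function names, and the path-like strings below are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation of snapshot link information:
# (key value information, pre-allocated snapshot version number).
SnapshotLink = Tuple[int, int]


def build_snapshot_link(target_key: int, target_version: int) -> SnapshotLink:
    """S402: construct the target snapshot link information."""
    return (target_key, target_version)


def read_snapshot(stored_links: Dict[SnapshotLink, str],
                  target_key: int,
                  target_version: int) -> Optional[str]:
    """S404: look the target link up among the stored links."""
    return stored_links.get(build_snapshot_link(target_key, target_version))


# Hypothetical stored links mapping to where each data snapshot lives.
stored_links = {
    (1, 1): "snapshot-store/key-1/version-1",
    (1, 2): "snapshot-store/key-1/version-2",
}
found = read_snapshot(stored_links, 1, 2)
missing = read_snapshot(stored_links, 1, 9)
```

Using the pair directly as a dictionary key keeps the lookup a single hash probe, which is one plausible reading of why the snapshot can be "quickly read".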

< example 4 >:

The processing operation instruction is a data cleaning instruction.

When a processing operation instruction for the data set is received, according to the source data of the data set, the metadata and/or the data snapshot corresponding to the source data of the data set, the step of executing the processing operation instruction for the data set includes:

S502, when a data cleaning instruction is received, determining, for the data set, the data to be cleaned from the data set based on the metadata and the data snapshot of the data set.

S506, executing a data cleaning instruction on the data to be cleaned.

In one specific example, executing the data cleaning instruction on the data to be cleaned includes: determining the storage parameters of the data to be cleaned according to the metadata corresponding to the data to be cleaned, and deleting the data to be cleaned according to the storage parameters.

In another specific example, executing the data cleaning instruction on the data to be cleaned further includes: clearing the metadata corresponding to the data to be cleaned. In this example, executing the data cleaning instruction cleans not only the source data to be cleaned but also its metadata, so that the storage space of both the source data and the metadata is released and the storage space is fully utilized.
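The two cleaning variants above can be sketched as one routine that deletes the source data located through the storage parameters and then drops the corresponding metadata. The in-memory dictionary and list standing in for the storage spaces, and the `location` field, are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Metadata:
    key: int               # key value information
    snapshot_version: int  # pre-allocated snapshot version number
    seq: int               # data sequence number
    location: str          # hypothetical stand-in for the storage parameters


def clean(to_clean: List[Metadata],
          source_store: Dict[str, bytes],
          metadata_store: List[Metadata]) -> None:
    """Delete the source data and the metadata of each record to be cleaned."""
    for m in to_clean:
        # Locate the source data through its storage parameters and delete it.
        source_store.pop(m.location, None)
        # Also release the storage space held by the metadata itself.
        if m in metadata_store:
            metadata_store.remove(m)


old = Metadata(1, 2, 20, "loc-20")
new = Metadata(1, 2, 22, "loc-22")
source_store = {"loc-20": b"old", "loc-22": b"new"}
metadata_store = [old, new]
clean([old], source_store, metadata_store)
```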

The following describes how to determine the data to be cleaned from the data set based on the metadata and the data snapshot of the data set.

Assuming that a data set with key value information of 1 is subjected to a series of processing, the corresponding metadata includes:

< Key information: 1, Snapshot version number: 1, data sequence number: 5, … >

< Key information: 1, Snapshot version number: 2, data sequence number: 20, … >

< Key information: 1, Snapshot version number: 2, data sequence number: 22, … >

< Key information: 1, Snapshot version number: 3, data sequence number: 23, … >

< Key information: 1, Snapshot version number: 3, data sequence number: 24, … >

< Key information: 1, Snapshot version number: 3, data sequence number: 25, … >

< Key information: 1, Snapshot version number: 5, data sequence number: 30, … >

< Key information: 1, Snapshot version number: 5, data sequence number: 31, … >

< Key information: 1, Snapshot version number: 6, data sequence number: 33, … >

< Key information: 1, Snapshot version number: 6, data sequence number: 38, … >

< Key information: 1, Snapshot version number: 7, data sequence number: 40, … >

< Key information: 1, Snapshot version number: 7, data sequence number: 45, … >

For the data set with key value information 1, Snapshot1 (corresponding to Snapshot version number 1), Snapshot2 (corresponding to Snapshot version number 2), Snapshot3 (corresponding to Snapshot version number 3), Snapshot4 (corresponding to Snapshot version number 4), Snapshot5 (corresponding to Snapshot version number 5), and Snapshot6 (corresponding to Snapshot version number 6) have been made in succession.

Snapshot version number 7 corresponds to Snapshot7. Note that Snapshot7 may not have been made yet; Snapshot version number 7 may only have been pre-allocated to the data set. If Snapshot7 (corresponding to Snapshot version number 7) has already been made, the latest pre-allocated Snapshot version number is 8, and the absence of metadata with Snapshot version number 8 indicates that no source data has been written to the data set since Snapshot7 was completed.

Based on the above assumptions, no source data was written to the data set between Snapshot3 (corresponding to Snapshot version number 3) and Snapshot4 (corresponding to Snapshot version number 4), so no metadata with Snapshot version number 4 exists. When the source data of Snapshot4 (corresponding to Snapshot version number 4) is read, the metadata of the closest lower Snapshot version number must be used (see the process described above of sequentially decrementing the target Snapshot version number when executing a data reading instruction); that is, the metadata with Snapshot version number 3 is referenced.

In a specific example, the data to be cleaned is determined from the data set based on the metadata and the data snapshot of the data set, including steps S602-S604:

S602, determining target metadata to be cleaned from the metadata of the data set, where the data sequence numbers contained in the target metadata to be cleaned are all smaller than the data sequence number of the target data snapshot to be cleaned, and the target metadata to be cleaned and the target data snapshot to be cleaned have the same pre-allocated snapshot version number.

S604, determining the source data corresponding to the target metadata to be cleaned as the data to be cleaned.
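Steps S602-S604 can be sketched as a filter over the metadata list given earlier for the data set with key value information 1. The sketch assumes that the data sequence number of a snapshot is the largest data sequence number recorded under that snapshot version number; this interpretation, the tuple layout, and the function name are assumptions, not statements from the source.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (key value information, snapshot version, data sequence number).
Record = Tuple[int, int, int]


def superseded_records(records: List[Record]) -> List[Record]:
    """Return records whose sequence number is below the newest one of the same version."""
    newest: Dict[Tuple[int, int], int] = {}
    for key, version, seq in records:
        group = (key, version)
        newest[group] = max(newest.get(group, seq), seq)
    # Assumption: the snapshot's data sequence number is the largest in its version.
    return [r for r in records if r[2] < newest[(r[0], r[1])]]


records = [
    (1, 1, 5),
    (1, 2, 20), (1, 2, 22),
    (1, 3, 23), (1, 3, 24), (1, 3, 25),
    (1, 5, 30), (1, 5, 31),
    (1, 6, 33), (1, 6, 38),
    (1, 7, 40), (1, 7, 45),
]
to_clean = superseded_records(records)
```

Under that assumption the filter selects the records with data sequence numbers 20, 23, 24, 30, 33, and 40, and the source data they point to would be the data to be cleaned in S604.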

In this example, for the data set whose key value information is 1, the metadata to be cleaned that is determined includes:

In a specific example, determining data to be cleaned from the data set based on the metadata of the data set and the data snapshot may include steps S702 to S706:

S702, when any data snapshot of the data set is an invalid snapshot, taking the snapshot version number of the invalid snapshot as a first target snapshot version number, and taking the snapshot version number of the data snapshot generated by the next snapshot after the invalid snapshot as a second target snapshot version number.

S704, determining whether metadata with the second target snapshot version number exists among all metadata of the data set.

S706, when it is determined that metadata with the second target snapshot version number exists among all metadata of the data set, determining the source data corresponding to the metadata with the first target snapshot version number as the data to be cleaned.
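A minimal sketch of S702-S706 follows. It assumes that the snapshot made next after the invalid snapshot carries the next consecutive version number; the tuple layout and function name are likewise hypothetical. The source does not state a cleaning action under these steps when the second target version has no metadata, so the sketch selects nothing in that case.

```python
from typing import List, Tuple

# Hypothetical record layout: (key value information, snapshot version, data sequence number).
Record = Tuple[int, int, int]


def records_for_invalid_snapshot(records: List[Record],
                                 key: int,
                                 invalid_version: int) -> List[Record]:
    """Select the metadata whose source data may be cleaned for an invalid snapshot."""
    first_target = invalid_version       # S702: version of the invalid snapshot
    second_target = invalid_version + 1  # S702: assumed version of the next snapshot
    # S704: does any metadata carry the second target version?
    has_next = any(k == key and v == second_target for k, v, _ in records)
    if not has_next:
        # A later snapshot may still reference this version, so keep everything.
        return []
    # S706: all metadata with the first target version point at data to be cleaned.
    return [r for r in records if r[0] == key and r[1] == first_target]


records = [
    (1, 1, 5),
    (1, 2, 20), (1, 2, 22),
    (1, 3, 23), (1, 3, 24), (1, 3, 25),
    (1, 5, 30), (1, 5, 31),
    (1, 6, 33), (1, 6, 38),
    (1, 7, 40), (1, 7, 45),
]
deleted_5 = records_for_invalid_snapshot(records, 1, 5)
deleted_3 = records_for_invalid_snapshot(records, 1, 3)
```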

The following describes how to determine the to-be-cleaned data corresponding to an invalid snapshot when any data snapshot in the data set is an invalid snapshot.

(1) Assuming that Snapshot5 (corresponding to Snapshot version number 5) is deleted as an invalid Snapshot, the remaining valid snapshots include: Snapshot1 (corresponding to Snapshot version number 1), Snapshot2 (corresponding to Snapshot version number 2), Snapshot3 (corresponding to Snapshot version number 3), Snapshot4 (corresponding to Snapshot version number 4), and Snapshot6 (corresponding to Snapshot version number 6). If Snapshot7 (corresponding to Snapshot version number 7) has already been made, the remaining valid snapshots also include Snapshot7 (corresponding to Snapshot version number 7).

For each valid snapshot, the source data with the highest data sequence number corresponding to the snapshot version number of the valid snapshot should be valid source data. If the Snapshot7 is not made, the source data with the highest data sequence number corresponding to the Snapshot version number 7 should also be valid source data.

In the case where Snapshot5 (corresponding to Snapshot version number 5) is deleted, the metadata corresponding to the valid source data includes:

< Key information: 1, Snapshot version number: 1, data sequence number: 5, … >

< Key information: 1, Snapshot version number: 2, data sequence number: 22, … >

< Key information: 1, Snapshot version number: 3, data sequence number: 25, … >

< Key information: 1, Snapshot version number: 6, data sequence number: 38, … >

< Key information: 1, Snapshot version number: 7, data sequence number: 45, … >

That is, when Snapshot5 (corresponding to Snapshot version number 5) is deleted, the source data corresponding to the metadata < key value information: 1, Snapshot version number: 1, data sequence number: 5, … >, < key value information: 1, Snapshot version number: 2, data sequence number: 22, … >, < key value information: 1, Snapshot version number: 3, data sequence number: 25, … >, < key value information: 1, Snapshot version number: 6, data sequence number: 38, … >, and < key value information: 1, Snapshot version number: 7, data sequence number: 45, … > is valid source data, and the source data corresponding to all other metadata is data to be cleaned.

(2) Assuming that Snapshot3 (corresponding to Snapshot version number 3) is deleted as an invalid Snapshot, the remaining valid snapshots include: Snapshot1 (corresponding to Snapshot version number 1), Snapshot2 (corresponding to Snapshot version number 2), Snapshot4 (corresponding to Snapshot version number 4), Snapshot5 (corresponding to Snapshot version number 5), and Snapshot6 (corresponding to Snapshot version number 6). If Snapshot7 (corresponding to Snapshot version number 7) has already been made, the remaining valid snapshots also include Snapshot7 (corresponding to Snapshot version number 7).

For each valid snapshot, the source data with the highest data sequence number corresponding to the snapshot version number of the valid snapshot should be valid source data. If the Snapshot7 is not made, the source data with the highest data sequence number corresponding to the Snapshot version number 7 should also be valid source data. In addition, since the valid Snapshot4 (corresponding to the Snapshot version number 4) needs to reference the invalid Snapshot3 (corresponding to the Snapshot version number 3), the source data with the highest data sequence number corresponding to the Snapshot version number 3 should also be valid source data.

In the case where Snapshot3 (corresponding to Snapshot version number 3) is deleted, the metadata corresponding to the valid source data includes:

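< Key information: 1, Snapshot version number: 1, data sequence number: 5, … >

< Key information: 1, Snapshot version number: 2, data sequence number: 22, … >

< Key information: 1, Snapshot version number: 3, data sequence number: 25, … >

< Key information: 1, Snapshot version number: 5, data sequence number: 31, … >

< Key information: 1, Snapshot version number: 6, data sequence number: 38, … >

< Key information: 1, Snapshot version number: 7, data sequence number: 45, … >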

That is, when Snapshot3 (corresponding to Snapshot version number 3) is deleted, the source data corresponding to the metadata < key value information: 1, Snapshot version number: 1, data sequence number: 5, … >, < key value information: 1, Snapshot version number: 2, data sequence number: 22, … >, < key value information: 1, Snapshot version number: 3, data sequence number: 25, … >, < key value information: 1, Snapshot version number: 5, data sequence number: 31, … >, < key value information: 1, Snapshot version number: 6, data sequence number: 38, … >, and < key value information: 1, Snapshot version number: 7, data sequence number: 45, … > is valid source data, and the source data corresponding to all other metadata is data to be cleaned.
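The two worked cases above can be reproduced with a short sketch that first computes the valid metadata and then treats everything else as data to be cleaned. For each valid snapshot version (plus the latest pre-allocated version 7), it keeps the record with the highest data sequence number, falling back to the closest lower version when a version has no metadata. The tuple layout, function names, and the assumption that version numbers start at 1 are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record layout: (key value information, snapshot version, data sequence number).
Record = Tuple[int, int, int]


def valid_records(records: List[Record], key: int,
                  valid_versions: Iterable[int]) -> Set[Record]:
    """Collect the newest record that each valid snapshot version resolves to."""
    by_key = [r for r in records if r[0] == key]
    valid: Set[Record] = set()
    for version in valid_versions:
        v = version
        while v >= 1:  # assumption: version numbers start at 1
            matches = [r for r in by_key if r[1] == v]
            if matches:
                valid.add(max(matches, key=lambda r: r[2]))
                break
            v -= 1  # e.g. Snapshot4 has no metadata and references version 3
    return valid


def to_be_cleaned(records: List[Record], key: int,
                  valid_versions: Iterable[int]) -> List[Record]:
    """Everything for this key that no valid snapshot resolves to."""
    keep = valid_records(records, key, valid_versions)
    return [r for r in records if r[0] == key and r not in keep]


records = [
    (1, 1, 5),
    (1, 2, 20), (1, 2, 22),
    (1, 3, 23), (1, 3, 24), (1, 3, 25),
    (1, 5, 30), (1, 5, 31),
    (1, 6, 33), (1, 6, 38),
    (1, 7, 40), (1, 7, 45),
]
case_1 = valid_records(records, 1, [1, 2, 3, 4, 6, 7])  # Snapshot5 deleted
case_2 = valid_records(records, 1, [1, 2, 4, 5, 6, 7])  # Snapshot3 deleted
cleaned_1 = to_be_cleaned(records, 1, [1, 2, 3, 4, 6, 7])
```

In case (2) the version-3 record with sequence number 25 stays valid even though Snapshot3 is deleted, because the lookup for Snapshot4 falls back to it.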

In a specific example, the determination of the data to be cleaned from the data set based on the metadata of the data set and the data snapshot may include steps S602 to S604 and steps S702 to S706 described above.

< data processing apparatus >

In yet another embodiment of the present disclosure, as shown in fig. 3, a data processing apparatus 200 is provided. The data processing apparatus 200 includes:

The data snapshot storage module 21 is configured to store a data snapshot in a first preset storage space when a snapshot is performed on any data set to generate a data snapshot, where the data snapshot includes first source data for performing the snapshot in the data set and metadata corresponding to the first source data, and the metadata corresponding to the first source data includes a storage parameter of the first source data, key value information of the data set to which the first source data belongs, a pre-allocated snapshot version number of the first source data, and a data sequence number of the first source data; the data set comprises at least one piece of source data with the same key value information.

The source data writing module 22 is configured to, after the first source data is snapshot, store second source data in a second preset storage space if the written second source data is received; wherein the second source data and the first source data belong to the same data set.

The metadata generating module 23 is configured to generate metadata corresponding to second source data, where the metadata corresponding to the second source data includes storage parameters of the second source data, key value information of a data set to which the second source data belongs, a pre-allocated snapshot version number of the second source data, and a data sequence number of the second source data, the pre-allocated snapshot version number of the second source data sequentially increases on the basis of the pre-allocated snapshot version number of the first source data, and the data sequence number of the second source data sequentially increases according to the number of the source data in the data set; and storing the metadata corresponding to the second source data in a third preset storage space.

The processing module 24 is configured to, when receiving a processing operation instruction for the data set, execute the processing operation instruction for the data set according to the source data of the data set, the metadata corresponding to the source data of the data set, and/or the data snapshot.
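The interplay of the source data writing module and the metadata generating module described above can be illustrated with a small in-memory sketch. Every name here (`DataSetStore`, `write`, `snapshot`, the `location` string) is hypothetical, the dictionary and list merely stand in for the second and third preset storage spaces, and the per-data-set sequence counter is an assumption; the source only states that the sequence number increases with the number of source data in the data set.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Metadata:
    key: int               # key value information of the data set
    snapshot_version: int  # pre-allocated snapshot version number
    seq: int               # data sequence number
    location: str          # hypothetical stand-in for the storage parameters


@dataclass
class DataSetStore:
    key: int
    current_version: int = 1  # version pre-allocated for the next snapshot
    next_seq: int = 1         # assumption: counter kept per data set
    source: Dict[str, bytes] = field(default_factory=dict)  # stands in for the second storage space
    metadata: List[Metadata] = field(default_factory=list)  # stands in for the third storage space

    def write(self, data: bytes) -> Metadata:
        """Store source data and generate its metadata."""
        location = f"{self.key}/{self.next_seq}"
        self.source[location] = data
        record = Metadata(self.key, self.current_version, self.next_seq, location)
        self.metadata.append(record)
        self.next_seq += 1
        return record

    def snapshot(self) -> int:
        """Take a snapshot; later writes get the next pre-allocated version."""
        taken = self.current_version
        self.current_version += 1
        return taken


store = DataSetStore(key=1)
first = store.write(b"v1")   # written before the snapshot
taken = store.snapshot()     # snapshot with version 1
second = store.write(b"v2")  # written after the snapshot
```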

In a specific example, the data processing apparatus 200 may further include a snapshot source data determination module. The snapshot source data determination module is configured to determine, when a snapshot is taken of the data set, the first source data of the data set that needs to be snapshotted, and to determine the data snapshot of the data set according to the first source data that needs to be snapshotted and the metadata corresponding to the first source data.

In a specific example, the data processing apparatus 200 may further include a snapshot module. The snapshot module is used for carrying out snapshot on any data set to generate data snapshot, and comprises: when a snapshot instruction is received, carrying out snapshot on any data set to generate a data snapshot; or, according to a preset snapshot instruction cycle, performing snapshot on any data set to generate a data snapshot.

In a specific example, the data processing apparatus 200 may further include a snapshot link information storage module. The snapshot link information storage module is configured to store snapshot link information corresponding to the data snapshot, where the snapshot link information includes key value information of the data set to which the data snapshot belongs and the pre-allocated snapshot version number of the data snapshot.

In a specific example, the processing operation instruction is a data reading instruction, which is used to read data to be read, and the data reading instruction includes target key value information and a target snapshot version number.

In one particular example, the processing module 24 includes a first read instruction processing module. The first read instruction processing module is configured to: acquire the target key value information and the target snapshot version number in the data reading instruction; search all metadata of the data set for first target metadata that includes the target key value information and the target snapshot version number, where the data set includes at least the first source data and the second source data; determine the metadata with the largest data sequence number in the first target metadata as second target metadata; determine the storage parameters of the data to be read from the second target metadata; and read the data to be read according to the storage parameters in the second target metadata.

In one particular example, the processing module 24 includes a second read instruction processing module. The second read instruction processing module is configured to: acquire the target key value information and the target snapshot version number in the data reading instruction; determine whether metadata matching both the target key value information and the target snapshot version number exists among all metadata of the data set; when no such metadata exists, sequentially decrement the target snapshot version number until third target metadata is determined from all metadata of the data set, where the third target metadata includes the target key value information and the decremented target snapshot version number; determine the metadata with the largest data sequence number in the third target metadata as fourth target metadata; determine the storage parameters of the data to be read from the fourth target metadata; and read the data to be read according to the storage parameters in the fourth target metadata.

In one particular example, the processing module 24 includes a third read instruction processing module. The third read instruction processing module is configured to: acquire the target key value information and the target snapshot version number in the data reading instruction to construct target snapshot link information; and search for the target snapshot link information in the stored snapshot link information to read the data snapshot corresponding to the target snapshot link information.

In one specific example, the processing operation instruction is a data scrubbing instruction.

In one particular example, the processing module 24 includes a cleaning instruction processing module. The cleaning instruction processing module is configured to: when a data cleaning instruction is received, determine, for the data set, the data to be cleaned from the data set based on the metadata and the data snapshot of the data set; and execute the data cleaning instruction on the data to be cleaned.

In a specific example, the cleaning instruction processing module includes a first to-be-cleaned data determination module, which is configured to determine the data to be cleaned from the data set based on the metadata and the data snapshot of the data set. Specifically, the first to-be-cleaned data determination module is configured to: determine target metadata to be cleaned from the metadata of the data set, where the data sequence numbers contained in the target metadata to be cleaned are all smaller than the data sequence number of the target data snapshot to be cleaned, and the target metadata to be cleaned and the target data snapshot to be cleaned have the same pre-allocated snapshot version number; and determine the source data corresponding to the target metadata to be cleaned as the data to be cleaned.

In a specific example, the cleaning instruction processing module includes a second to-be-cleaned data determination module, which is configured to determine the data to be cleaned from the data set based on the metadata and the data snapshot of the data set. Specifically, the second to-be-cleaned data determination module is configured to: when any data snapshot of the data set is an invalid snapshot, take the snapshot version number of the invalid snapshot as a first target snapshot version number, and take the snapshot version number of the data snapshot generated by the next snapshot after the invalid snapshot as a second target snapshot version number; determine whether metadata with the second target snapshot version number exists among all metadata of the data set; and when such metadata exists, determine the source data corresponding to the metadata with the first target snapshot version number as the data to be cleaned.

In one particular example, the cleaning instruction processing module includes a cleaning module, which is configured to execute the data cleaning instruction on the data to be cleaned. Specifically, the cleaning module is configured to determine the storage parameters of the data to be cleaned according to the metadata corresponding to the data to be cleaned, and to delete the data to be cleaned at the specific position determined by the storage parameters. Alternatively, the cleaning module is configured to determine the storage parameters of the data to be cleaned according to the metadata corresponding to the data to be cleaned, to delete the data to be cleaned at the specific position determined by the storage parameters, and to clear the metadata corresponding to the data to be cleaned.

< electronic apparatus >

In a further embodiment of the present disclosure, an electronic device 300 is provided, and on one hand, the electronic device 300 may include the foregoing data processing apparatus, configured to implement the data processing method according to any embodiment of the present disclosure.

On the other hand, as shown in fig. 4, the electronic device 300 may include a memory 32 and a processor 31, the memory 32 being configured to store executable instructions; the instructions are for controlling the processor 31 to perform the aforementioned data processing method.

In the embodiment, the electronic device 300 may be any electronic product having a memory 32 and a processor 31, such as a desktop computer, a notebook computer, a server, a workstation, and the like.

< computer-readable storage Medium >

Finally, according to yet another embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method according to any of the embodiments of the present disclosure.

The disclosed embodiments may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement aspects of embodiments of the disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations for embodiments of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized with state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement aspects of the disclosed embodiments.

Various aspects of embodiments of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the embodiments of the present disclosure is defined by the appended claims.

Claims

1. A data processing method, comprising:

2. The method of claim 1, wherein the snapshot of any data set generates a data snapshot, comprising:

when the data set is subjected to snapshot, determining first source data of the data set, which need to be subjected to snapshot;

and determining the data snapshot of the data set according to the first source data needing to be snapshot and the metadata corresponding to the first source data.

3. The method of claim 1, wherein taking a snapshot of any data set to generate a data snapshot comprises:

when a snapshot instruction is received, taking a snapshot of the data set to generate the data snapshot;

or,

taking a snapshot of the data set according to a preset snapshot instruction cycle to generate the data snapshot.

4. The method of claim 2, further comprising:

storing the data snapshot and snapshot link information corresponding to the data snapshot, wherein the snapshot link information comprises key value information of the data set to which the data snapshot belongs and a pre-allocated snapshot version number of the data snapshot.

5. The method according to claim 1, wherein the processing operation instruction is a data reading instruction for reading data to be read, the data reading instruction includes target key value information and a target snapshot version number,

and when a processing operation instruction for the data set is received, the step of executing the processing operation instruction for the data set according to the source data of the data set, the metadata corresponding to the source data of the data set, and/or the data snapshot comprises:

acquiring target key value information and a target snapshot version number in the data reading instruction;

searching all metadata of the data set for first target metadata comprising the target key value information and the target snapshot version number, wherein the data set comprises at least the first source data and the second source data;

determining the metadata with the largest data sequence number in the first target metadata as second target metadata;

determining a storage parameter of the data to be read from the second target metadata;

and reading the data to be read according to the storage parameters in the second target metadata.
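
The lookup recited in claim 5 can be illustrated with a minimal sketch. It is not taken from the disclosure: the `Metadata` class, the `offset` and `length` fields standing in for the "storage parameter", and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Metadata:
    key: str               # key value information of the data set
    snapshot_version: int  # pre-allocated snapshot version number
    sequence: int          # data sequence number
    offset: int            # storage parameter (assumed): start offset
    length: int            # storage parameter (assumed): byte length


def find_read_target(all_metadata: List[Metadata], key: str,
                     snapshot_version: int) -> Optional[Metadata]:
    # First target metadata: every entry matching both key and version.
    first_target = [m for m in all_metadata
                    if m.key == key and m.snapshot_version == snapshot_version]
    if not first_target:
        return None
    # Second target metadata: the match with the largest sequence number.
    return max(first_target, key=lambda m: m.sequence)


def read_data(storage: bytes, all_metadata: List[Metadata], key: str,
              snapshot_version: int) -> Optional[bytes]:
    target = find_read_target(all_metadata, key, snapshot_version)
    if target is None:
        return None
    # Read the data using the storage parameters in the selected metadata.
    return storage[target.offset:target.offset + target.length]
```

Choosing the largest data sequence number means that, when the same key was written several times under one snapshot version, the most recent write is the one returned.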

6. The method according to claim 1 or 5, wherein the processing operation instruction is a data reading instruction for reading data to be read, the data reading instruction includes target key value information and a target snapshot version number,

determining whether metadata consistent with the target key value information and the target snapshot version number exist in all metadata of the data set;

when it is determined that no metadata consistent with the target key value information and the target snapshot version number exists in all metadata of the data set, successively decrementing the target snapshot version number until third target metadata is found in all metadata of the data set, wherein the third target metadata comprises the target key value information and the decremented target snapshot version number;

determining the metadata with the largest data sequence number in the third target metadata as fourth target metadata;

determining a storage parameter of the data to be read from the fourth target metadata;

and reading the data to be read according to the storage parameters in the fourth target metadata.
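
The fallback recited in claim 6 can be sketched as follows. This is an illustrative sketch only: the `Metadata` tuple, the `location` field standing in for the "storage parameter", and the `lowest_version` stop condition are assumptions, since the claim does not state where the decrement ends.

```python
from typing import List, NamedTuple, Optional


class Metadata(NamedTuple):
    key: str               # key value information of the data set
    snapshot_version: int  # pre-allocated snapshot version number
    sequence: int          # data sequence number
    location: str          # storage parameter (assumed placeholder)


def find_with_fallback(all_metadata: List[Metadata], key: str,
                       snapshot_version: int,
                       lowest_version: int = 0) -> Optional[Metadata]:
    version = snapshot_version
    # Decrement the target version until some metadata matches.
    # Stopping at lowest_version is an assumption of this sketch.
    while version >= lowest_version:
        matches = [m for m in all_metadata
                   if m.key == key and m.snapshot_version == version]
        if matches:
            # Fourth target metadata: largest sequence number among matches.
            return max(matches, key=lambda m: m.sequence)
        version -= 1
    return None
```

Under these assumptions, a read for a snapshot version in which the key was never rewritten resolves to the newest write from the closest earlier version.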

7. The method of claim 4, wherein the processing operation instruction is a data snapshot reading instruction for reading a data snapshot, the data snapshot reading instruction includes target key value information and a target snapshot version number,

acquiring the target key value information and the target snapshot version number in the data snapshot reading instruction to construct target snapshot link information;

and searching the stored snapshot link information for the target snapshot link information so as to read the data snapshot corresponding to the target snapshot link information.
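
One way to picture the snapshot link lookup of claims 4 and 7 is a map keyed by the link information. The sketch below is an assumption-laden illustration: the `SnapshotStore` class, its method names, and the use of a `(key, version)` tuple as the link are hypothetical and not specified by the disclosure.

```python
from typing import Dict, Optional, Tuple

# Assumed representation of snapshot link information:
# (key value information, pre-allocated snapshot version number).
SnapshotLink = Tuple[str, int]


def make_link(key: str, snapshot_version: int) -> SnapshotLink:
    return (key, snapshot_version)


class SnapshotStore:
    """Hypothetical in-memory stand-in for the first preset storage space."""

    def __init__(self) -> None:
        self._by_link: Dict[SnapshotLink, bytes] = {}

    def save(self, key: str, snapshot_version: int, snapshot: bytes) -> None:
        # Store the snapshot together with its link information.
        self._by_link[make_link(key, snapshot_version)] = snapshot

    def read(self, key: str, snapshot_version: int) -> Optional[bytes]:
        # Construct the target link and search the stored links for it.
        return self._by_link.get(make_link(key, snapshot_version))
```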

8. The method of claim 1, wherein the processing operation instruction is a data cleaning instruction,

when the data cleaning instruction is received, determining data to be cleaned from the data set according to the metadata and the data snapshot of the data set;

and executing the data cleaning instruction on the data to be cleaned.

9. The method of claim 8, wherein the step of determining data to be cleaned from the data set according to the metadata and the data snapshot of the data set comprises:

determining target metadata to be cleaned from the metadata of the data set, wherein the data sequence numbers contained in the target metadata to be cleaned are all smaller than the data sequence number of the target data snapshot to be cleaned, and the target metadata to be cleaned and the target data snapshot to be cleaned have the same pre-allocated snapshot version number;

and determining the source data corresponding to the target metadata to be cleaned as the data to be cleaned.
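
The selection rule of claim 9 can be sketched as a simple filter. The `Metadata` tuple and the function name are assumptions for illustration; the target snapshot is represented here only by its version number and data sequence number.

```python
from typing import List, NamedTuple


class Metadata(NamedTuple):
    key: str               # key value information of the data set
    snapshot_version: int  # pre-allocated snapshot version number
    sequence: int          # data sequence number


def select_metadata_to_clean(all_metadata: List[Metadata],
                             snapshot_version: int,
                             snapshot_sequence: int) -> List[Metadata]:
    # Keep metadata that shares the target snapshot's version number and
    # whose sequence number is smaller than the target snapshot's.
    return [m for m in all_metadata
            if m.snapshot_version == snapshot_version
            and m.sequence < snapshot_sequence]
```

The source data referenced by the returned metadata would then be the data to be cleaned.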

10. The method of claim 8 or 9, wherein the step of determining data to be cleaned from the data set based on the metadata and the data snapshot of the data set comprises:

when any data snapshot in the data set is an invalid snapshot, taking the snapshot version number of the invalid snapshot as a first target snapshot version number, and taking the snapshot version number of the data snapshot generated by the next snapshot processing of the invalid snapshot as a second target snapshot version number;

determining whether metadata having the second target snapshot version number exists in all metadata of the data set;

and when it is determined that metadata having the second target snapshot version number exists in all metadata of the data set, determining the source data corresponding to the metadata having the first target snapshot version number as the data to be cleaned.
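
The check in claim 10 can be sketched as below. The sketch assumes that the snapshot following the invalid one carries the next consecutive version number (`invalid_version + 1`); the claim only says "the next snapshot processing", so this is an assumption, as are the `Metadata` tuple and the function name.

```python
from typing import List, NamedTuple


class Metadata(NamedTuple):
    key: str               # key value information of the data set
    snapshot_version: int  # pre-allocated snapshot version number
    sequence: int          # data sequence number


def metadata_to_clean_for_invalid(all_metadata: List[Metadata],
                                  invalid_version: int) -> List[Metadata]:
    first_target = invalid_version
    second_target = invalid_version + 1  # assumed: versions are consecutive
    # Only clean when some metadata already carries the second target version.
    has_next = any(m.snapshot_version == second_target for m in all_metadata)
    if not has_next:
        return []
    # Metadata with the first target version points at the data to be cleaned.
    return [m for m in all_metadata if m.snapshot_version == first_target]
```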

11. A data processing apparatus, comprising:

the data snapshot storage module is used for taking a snapshot of any data set to generate a data snapshot and storing the data snapshot in a first preset storage space, wherein the data snapshot comprises first source data in the data set that is subjected to the snapshot and metadata corresponding to the first source data, and the metadata corresponding to the first source data comprises storage parameters of the first source data, key value information of the data set to which the first source data belongs, a pre-allocated snapshot version number of the first source data, and a data sequence number of the first source data;

the source data writing module is used for storing second source data in a second preset storage space after the first source data is subjected to snapshot and when the written second source data is received;

the metadata generation module is configured to generate metadata corresponding to the second source data, where the metadata corresponding to the second source data includes a storage parameter of the second source data, key value information of the data set to which the second source data belongs, a pre-allocated snapshot version number of the second source data, and a data sequence number of the second source data, the pre-allocated snapshot version number of the second source data sequentially increases on the basis of the pre-allocated snapshot version number of the first source data, and the data sequence number of the second source data sequentially increases according to the number of the source data in the data set; and the metadata corresponding to the second source data is stored in a third preset storage space;

12. An electronic device, comprising:

the data processing apparatus of claim 11; or,

a processor and a memory for storing executable instructions for controlling the processor to perform the data processing method of any of claims 1-10.

13. A computer-readable storage medium, having stored thereon computer instructions, which, when executed by a processor, implement the data processing method of any one of claims 1 to 10.