CN117093139A - Data processing method, device and system for data storage

Info

Publication number: CN117093139A
Application number: CN202310814103.0A
Authority: CN
Inventors: 彭翔宇
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2023-07-04
Filing date: 2023-07-04
Publication date: 2023-11-21

Abstract

The present disclosure provides a data processing method for data storage. The data processing method applied to a management node comprises: selecting one data set copy from a plurality of data set copies corresponding to a target data set as a master data set copy, and taking the remaining data set copies as slave data set copies; and acquiring primary copy identification information of the master data set copy, and transmitting the primary copy identification information to a slave copy node of each slave data set copy. The data processing method applied to a master copy node comprises: generating target remote metadata corresponding to a cold data set; transmitting the cold data set, the target remote metadata and the primary copy identification information to a remote storage system; and replacing the locally located cold data set with the target remote metadata. The data processing method applied to a slave copy node comprises: determining a cold data set from the target data set; and obtaining the target remote metadata from the remote storage system based on the primary copy identification information, and using the target remote metadata to replace the locally located cold data set.

Description

Data processing method, device and system for data storage

Technical Field

The disclosure relates to the technical field of data processing, in particular to the technical field of artificial intelligence such as big data, cloud service and the like. A data processing method, apparatus, system, electronic device, and readable storage medium for data storage are provided.

Background

In databases or data warehouses, the storage of data is generally divided into remote storage and local storage. Wherein, local storage refers to storing data locally at a node, and remote storage refers to storing data in a remotely located storage system. In some scenarios, cold and hot data transfer storage issues may be involved, i.e., locally located cold data transfer to a remote storage system. However, in the prior art, when cold data is transferred and stored, cold data in different data copies of the same data are respectively stored in a remote storage system, so that storage resources of the remote storage system are wasted.

Disclosure of Invention

According to a first aspect of the present disclosure, there is provided a data processing method for data storage, applied to a management node, comprising: selecting one data set copy from a plurality of data set copies corresponding to a target data set as a master data set copy, and taking the remaining data set copies as slave data set copies; and acquiring primary copy identification information of the master data set copy, and transmitting the primary copy identification information to a slave copy node of the slave data set copy, wherein the primary copy identification information is used by the slave copy node to acquire target remote metadata from a remote storage system and to replace a locally located cold data set with the target remote metadata, and the target remote metadata is transmitted to the remote storage system by the primary copy node of the master data set copy.

According to a second aspect of the present disclosure, there is provided a data processing method for data storage, applied to a primary replica node, comprising: determining a cold dataset from a target dataset, generating target remote metadata corresponding to the cold dataset, the target dataset being local to the primary replica node; transmitting the cold data set, the target remote metadata and the main copy identification information to a remote storage system for the remote storage system to store the target remote metadata and the cold data set according to the main copy identification information, wherein the target remote metadata are used for searching the cold data set in the remote storage system; replacing the cold data set locally with the target remote metadata.

According to a third aspect of the present disclosure, there is provided a data processing method for data storage, applied to a slave replica node, comprising: determining a cold dataset from a target dataset, the target dataset being local to the slave replica node; and acquiring target remote metadata from a remote storage system according to the main copy identification information, and replacing the locally-located cold data set by using the target remote metadata, wherein the main copy identification information is sent by a management node.

According to a fourth aspect of the present disclosure there is provided a data processing apparatus for data storage, for application to a management node, comprising: a first determining unit, configured to select one data set copy from a plurality of data set copies corresponding to the target data set as a master data set copy, and use the remaining data set copies as slave data set copies; the first sending unit is used for obtaining the main copy identification information of the main data set copy, sending the main copy identification information to the slave copy node of the slave data set copy, wherein the main copy identification information is used for obtaining target remote metadata from a remote storage system by the slave copy node, replacing a cold data set located locally by using the target remote metadata, and the target remote metadata is sent to the remote storage system by the main copy node of the main data set copy.

According to a fifth aspect of the present disclosure there is provided a data processing apparatus for data storage, for application to a primary replica node, comprising: a second determining unit, configured to determine a cold data set from a target data set, and generate target remote metadata corresponding to the cold data set, where the target data set is located locally to the primary replica node; the second sending unit is used for sending the cold data set, the target remote metadata and the main copy identification information to a remote storage system, so that the remote storage system can store the target remote metadata and the cold data set according to the main copy identification information, and the target remote metadata can be used for searching the cold data set in the remote storage system; a first replacement unit for replacing the cold data set locally using the target remote metadata.

According to a sixth aspect of the present disclosure there is provided a data processing apparatus for data storage, for application to a slave replica node, comprising: a third determining unit configured to determine a cold data set from a target data set, the target data set being located locally to the slave replica node; and the second replacing unit is used for acquiring target remote metadata from the remote storage system according to the main copy identification information, replacing the locally-located cold data set by using the target remote metadata, and the main copy identification information is sent by the management node.

According to a seventh aspect of the present disclosure, there is provided a data processing system for data storage, comprising a management node, a master replica node, and at least one slave replica node; wherein the management node is configured to perform the method of the first aspect, the master replica node is configured to perform the method of the second aspect, and the at least one slave replica node is configured to perform the method of the third aspect.

According to an eighth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to a ninth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.

According to a tenth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.

According to the technical scheme, the management node determines one main data set copy from the plurality of data set copies, and then the main copy node corresponding to the main data set copy uploads the contents such as the cold data set and the target remote metadata to the remote storage system, and the slave copy node corresponding to the slave data set copy acquires the target remote metadata from the remote storage system according to the main copy identification information to perform data synchronization, so that the data synchronization step can be simplified, the data synchronization efficiency and accuracy are improved, and the resource cost required by the remote storage system is greatly reduced.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;

FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;

FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;

FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device for implementing a data processing method for data storage according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the data processing method for data storage of the present embodiment is applied to a management node, and specifically includes the following steps:

S101, selecting one data set copy from a plurality of data set copies corresponding to a target data set as a master data set copy, and taking the remaining data set copies as slave data set copies;

S102, acquiring primary copy identification information of the master data set copy, and transmitting the primary copy identification information to a slave copy node of the slave data set copy, wherein the primary copy identification information is used by the slave copy node to acquire target remote metadata from a remote storage system and to replace a locally located cold data set with the target remote metadata, and the target remote metadata is transmitted to the remote storage system by the primary copy node of the master data set copy.

The executing entity of the data processing method for data storage in this embodiment is a management node. For a target data set stored locally in a database or a data warehouse, the management node selects one data set copy from the multiple data set copies corresponding to the target data set as the master data set copy, and then sends the master copy identification information of the master data set copy to the slave copy nodes corresponding to the slave data set copies. Each slave copy node can thus obtain target remote metadata from a remote storage system according to the master copy identification information and use the target remote metadata to replace the cold data located locally at that node, thereby synchronizing the target data set among the nodes corresponding to the data set copies.

In this embodiment, the management node is located in a database or data warehouse that stores data; the database or the data warehouse comprises a plurality of data sets, each data set can be used as a target data set, and each target data set corresponds to a plurality of data set copies; the remote storage system in this embodiment may be a distributed storage system, and the remote storage is to transfer and store data locally located in the node to the remotely located distributed storage system.

Different data set copies corresponding to the same target data set each contain all data in the target data set; they differ in that each data set copy divides the data into different data intervals.

For example, if the target data set 1 is a data set within a range corresponding to [0, 500], each number within the range represents one data; if the target data set 1 has the data set copy 1, the data set copy 2 and the data set copy 3, the data interval corresponding to the data set copy 1 may be (0-100, 101-200, 201-300, 301-400, 401-500), the data interval corresponding to the data set copy 2 may be (0-112, 113-211, 212-400, 401-403, 404-500), and the data interval corresponding to the data set copy 3 may be (0-76, 77-400, 401-500).
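The example above can be checked with a minimal sketch. It assumes each data set copy is represented as a list of inclusive `(start, end)` intervals; this representation and the helper name `covered_items` are illustrative and do not come from the disclosure.

```python
def covered_items(intervals):
    """Return the set of data items covered by a list of inclusive intervals."""
    items = set()
    for start, end in intervals:
        items.update(range(start, end + 1))
    return items


# Data intervals of the three data set copies from the example above.
copy_1 = [(0, 100), (101, 200), (201, 300), (301, 400), (401, 500)]
copy_2 = [(0, 112), (113, 211), (212, 400), (401, 403), (404, 500)]
copy_3 = [(0, 76), (77, 400), (401, 500)]

# Every copy holds all data in [0, 500]; only the interval boundaries differ.
full_range = set(range(0, 501))
all_same = all(covered_items(c) == full_range for c in (copy_1, copy_2, copy_3))
interval_counts = [len(c) for c in (copy_1, copy_2, copy_3)]
```

Here `all_same` is true for the three copies, while `interval_counts` differs between them, which is the property the master selection by interval count relies on.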

In this embodiment, different data set copies corresponding to the same target data set may be located locally on different nodes, i.e. the data set copies are stored locally on the nodes, e.g. the data set copy 1 is stored locally on the node 1, the data set copy 2 is stored locally on the node 2, etc.

When executing S101 to select one dataset copy from the multiple dataset copies corresponding to the target dataset as the master dataset copy, the management node of this embodiment may first determine the multiple dataset copies corresponding to the target dataset, then select one from the multiple dataset copies in a random selection manner, and use the remaining dataset copies as the slave dataset copies.

It may be appreciated that, after executing S101 to determine the master data set replica and the slave data set replica, the management node of the present embodiment may further use the node storing the master data set replica as a master replica node and the node storing the slave data set replica as a slave replica node.

When executing S101 to select one data set copy from the multiple data set copies corresponding to the target data set as the master data set copy, the management node of this embodiment may also first determine multiple data set copies corresponding to the target data set, then acquire the number of data intervals of each data set copy, and finally select one data set copy whose number of data intervals meets the preset requirement as the master data set copy, and use the remaining data set copies as the slave data set copies.

When executing S101 to select one data set copy with the number of data intervals meeting the preset requirement as the main data set copy, the management node of this embodiment may select one data set copy with the number of data intervals greater than or equal to the preset number threshold as the main data set copy.

The greater the number of data intervals, the more finely the data set copy divides the target data set; using a data set copy with a finer data division as the master data set copy therefore facilitates subsequent data synchronization among the nodes of the data set copies.
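The two selection strategies for S101 described above (random selection, and selection by number of data intervals) might be sketched as follows. The function names, the dictionary layout, and the tie-breaking rule are assumptions made for illustration; the disclosure only requires that the chosen copy's interval count is greater than or equal to a preset threshold.

```python
import random


def select_master_random(copy_ids, rng=random):
    """Pick one data set copy at random as master; the rest become slaves."""
    ordered = sorted(copy_ids)
    master = rng.choice(ordered)
    slaves = [cid for cid in ordered if cid != master]
    return master, slaves


def select_master_by_intervals(copies, min_intervals):
    """copies: dict mapping copy id -> list of data intervals.

    Select a copy whose interval count is >= min_intervals.
    Assumption: among qualifying copies, prefer the most intervals,
    then the smallest id; the source does not specify a tie-break.
    """
    candidates = [cid for cid, ivs in copies.items() if len(ivs) >= min_intervals]
    if not candidates:
        raise ValueError("no data set copy meets the interval-count threshold")
    master = min(candidates, key=lambda cid: (-len(copies[cid]), cid))
    slaves = sorted(cid for cid in copies if cid != master)
    return master, slaves


copies = {
    "copy_1": [(0, 100), (101, 200), (201, 300), (301, 400), (401, 500)],
    "copy_2": [(0, 112), (113, 211), (212, 400), (401, 403), (404, 500)],
    "copy_3": [(0, 76), (77, 400), (401, 500)],
}
master, slaves = select_master_by_intervals(copies, min_intervals=5)
```

With a threshold of 5, `copy_3` (three intervals) is excluded, and the assumed tie-break picks `copy_1` over `copy_2`.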

After determining the master data set replica and the slave data set replicas in S101, the management node in this embodiment executes S102 to acquire master replica identification information of the master data set replica and send the master replica identification information to the slave replica nodes of the slave data set replicas.

The primary copy identification information in this embodiment is used by the slave copy node to acquire, from the remote storage system, the target remote metadata uploaded by the primary copy node, and then to replace local cold data with the target remote metadata, so that data synchronization among the nodes of the data set copies is realized.

The primary copy identification information acquired by the management node in S102 uniquely identifies the primary data set copy of each target data set; for example, the primary copy identification information may be "data set 1 copy 1", indicating that it corresponds to data set copy 1 of target data set 1.

In addition, when nodes send local cold data to the remote storage system for storage, the remote storage system may come to contain some invalid data, for example, data left over when the remote storage system merges different cold data, or cold data sent to the remote storage system by a slave copy node; such invalid data in the remote storage system therefore needs to be deleted.

In order to delete invalid data in the remote storage system, the management node of the present embodiment may further perform the following after executing S102: acquiring the latest progress identification information sent by each slave replica node; determining target progress identification information according to the latest progress identification information sent by the different slave replica nodes; and sending the determined target progress identification information to the master copy node, so that the master copy node deletes the invalid data corresponding to the target progress identification information in the remote storage system.

In this embodiment, the progress identification information is generated by the master copy node together with the target remote metadata. Each piece of progress identification information corresponds to the ordinal number of a generation of target remote metadata by the master copy node, and successive pieces of progress identification information are consecutive: for example, progress identification information 1 corresponds to the first time the master copy node generates target remote metadata, progress identification information 2 corresponds to the second time, and so on. The progress identification information reflects the data update progress of a slave copy node: for example, the first update is completed according to the target remote metadata generated the first time, the second update is completed according to the target remote metadata generated the second time, and so on.

In this embodiment, when a slave copy node obtains the target remote metadata from the remote storage system according to the master copy identification information, it can also obtain the progress identification information corresponding to that target remote metadata. After replacing the local cold data with the target remote metadata, the slave copy node sends the obtained progress identification information to the management node as its latest progress identification information. The management node can then determine the update progress of the slave copy nodes from the smallest progress identification information sent by the different slave copy nodes, and send the determined target progress identification information to the master copy node. The master copy node deletes the invalid data corresponding to the target progress identification information, so that slave copy nodes which have not yet performed data synchronization are not affected.

That is, the management node of the embodiment controls the master replica node to delete the corresponding invalid data in the remote storage system according to the latest progress identification information sent by the slave replica node, so that the problem of storage resource waste caused by storing the invalid data in the remote storage system can be avoided.

When executing S102 to determine the target progress identification information according to different latest progress identification information sent by different slave replica nodes, the management node of this embodiment may use the consistent latest progress identification information as the target progress identification information when determining that the latest progress identification information sent by different slave replica nodes is consistent.

For example, if the slave replica node includes slave replica node 1, slave replica node 2, and slave replica node 3, and if the latest progress identification information sent from slave replica node 1, slave replica node 2, and slave replica node 3 to the management node is progress identification information 1, the management node of the present embodiment takes the progress identification information 1 as target progress identification information, and further causes the master replica node to delete invalid data corresponding to the progress identification information 1 in the remote storage system.

When executing S102 to determine the target progress identification information according to different latest progress identification information sent by different slave replica nodes, the management node of this embodiment may also use the smallest latest progress identification information as the target progress identification information if it is determined that the latest progress identification information sent by different slave replica nodes is inconsistent.

For example, if the slave copy node includes slave copy node 1, slave copy node 2 and slave copy node 3, if the latest progress identification information sent from the slave copy node 1 to the management node is progress identification information 2, the latest progress identification information sent from the slave copy node 2 to the management node is progress identification information 2, and the latest progress identification information sent from the slave copy node 3 to the management node is progress identification information 3, the management node of the present embodiment takes the progress identification information 2 as target progress identification information, and further causes the master copy node to delete invalid data corresponding to the progress identification information 2 in the remote storage system.
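The two cases above (consistent and inconsistent latest progress identification information) both reduce to taking the minimum of the reported values. A minimal sketch, assuming progress identification information is an integer and that reports are kept in a dictionary keyed by a hypothetical slave node name:

```python
def determine_target_progress(latest_by_slave):
    """latest_by_slave: dict mapping slave node id -> latest progress id (int).

    If all slaves report the same value, that value is the target;
    otherwise the smallest reported value is the target, so that no
    slave that is still behind loses data it has yet to synchronize.
    """
    if not latest_by_slave:
        raise ValueError("no progress reports received from slave replica nodes")
    return min(latest_by_slave.values())


# Consistent reports: every slave replica node is at progress 1.
consistent = determine_target_progress({"slave_1": 1, "slave_2": 1, "slave_3": 1})

# Inconsistent reports: slave 3 is ahead, so the target is the minimum, 2.
inconsistent = determine_target_progress({"slave_1": 2, "slave_2": 2, "slave_3": 3})
```

The results match the two worked examples in the text: 1 in the consistent case and 2 in the inconsistent case.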

Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in fig. 2, the data processing method for data storage of the present embodiment is applied to a primary replica node, where the primary replica node is a node where a primary data set replica of a target data set is located, and specifically includes the following steps:

S201, determining a cold data set from a target data set, and generating target remote metadata corresponding to the cold data set, wherein the target data set is located locally at the primary copy node;

S202, sending the cold data set, the target remote metadata and the primary copy identification information to a remote storage system, for the remote storage system to store the target remote metadata and the cold data set according to the primary copy identification information, wherein the target remote metadata is used for finding the cold data set in the remote storage system;

S203, replacing the locally located cold data set with the target remote metadata.

The executing entity of the data processing method for data storage in this embodiment is a master copy node. After determining a cold data set in a target data set, the master copy node first generates target remote metadata corresponding to the cold data set, then sends the target remote metadata, the cold data set and the master copy identification information together to a remote storage system, and finally replaces the locally located cold data with the currently generated target remote metadata.

When executing S201 to determine a cold data set from a target data set, the primary replica node of this embodiment may adopt the following alternative implementation: determining the storage duration of each data item according to the current time and the storage time of each data item in the target data set; and determining the cold data set from the data items whose storage duration is greater than or equal to a preset duration threshold, wherein the cold data set determined in this embodiment contains at least one data item.

That is, in this embodiment, the cold data set is determined according to the storage duration of each data in the target data set, and the data with longer storage duration is transferred and stored as the cold data to the remote storage system, so that the accuracy of the determined cold data set can be improved.
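The duration-threshold rule described above might be sketched as follows. The function name, the dictionary of storage times, and the 30-day threshold are illustrative assumptions; the disclosure only specifies a comparison of storage duration against a preset threshold.

```python
from datetime import datetime, timedelta


def determine_cold_set(storage_times, now, threshold):
    """Return the keys whose storage duration (now - storage time) >= threshold.

    storage_times: dict mapping data key -> datetime at which it was stored.
    """
    return {key for key, stored_at in storage_times.items()
            if now - stored_at >= threshold}


now = datetime(2023, 7, 4, 12, 0, 0)
storage_times = {
    "item_a": datetime(2023, 1, 1),          # stored about six months ago
    "item_b": datetime(2023, 6, 4, 12, 0),   # stored exactly 30 days ago
    "item_c": datetime(2023, 7, 1),          # stored a few days ago
}
cold = determine_cold_set(storage_times, now, threshold=timedelta(days=30))
```

Because the comparison is "greater than or equal to", `item_b`, which is exactly at the threshold, is classified as cold along with `item_a`.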

The target remote metadata generated by the primary replica node in S201 corresponds to the cold data set determined this time, which is to be stored in the remote storage system. The target remote metadata may be regarded as attribute data of the cold data set: it records the storage path of the cold data set in the remote storage system, so that the cold data set determined this time can be found in the remote storage system through the target remote metadata.

After executing S201 to generate the target remote metadata corresponding to the cold data set, the primary replica node of the present embodiment may further generate progress identification information corresponding to the target remote metadata, where the progress identification information is used to indicate what number of times the primary replica node is currently generating the target remote metadata; for example, if the target remote metadata is generated for the first time, the progress identification information generated at this time is 1, if the target remote metadata is generated for the second time, the progress identification information generated at this time is 2, and so on.

Further, when executing S201 to generate the progress identification information corresponding to the target remote metadata, the master copy node of this embodiment may also determine an invalid data list corresponding to the progress identification information, and record the determined invalid data list.

When executing S201, the master copy node of this embodiment may acquire the invalid data in the remote storage system after generating the target remote metadata this time, and then build and record an invalid data list from the acquired invalid data. The invalid data acquired in this embodiment may be data currently stored in the remote storage system that was left over when different cold data sets were merged, or data currently stored in the remote storage system that was uploaded by a slave replica node.
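A minimal sketch of how a master copy node might keep consecutive progress identification information together with an invalid data list per generation. The class name `ProgressTracker` and the use of remote paths to denote invalid data are assumptions for illustration; the disclosure does not prescribe a data structure.

```python
class ProgressTracker:
    """Assigns consecutive progress ids (1, 2, 3, ...) on the master copy node
    and records the invalid data list observed at each generation."""

    def __init__(self):
        self._next_id = 1
        self.invalid_lists = {}  # progress id -> list of invalid remote paths

    def record_generation(self, invalid_paths):
        """Call once each time target remote metadata is generated.

        invalid_paths: remote paths considered invalid at this point,
        e.g. leftovers from merged cold data sets (hypothetical format).
        Returns the progress id assigned to this generation.
        """
        progress_id = self._next_id
        self._next_id += 1
        self.invalid_lists[progress_id] = list(invalid_paths)
        return progress_id


tracker = ProgressTracker()
first = tracker.record_generation(["remote/tmp/merge_0001"])
second = tracker.record_generation(["remote/tmp/merge_0002", "remote/slave_2/part_7"])
```

Keeping one list per progress id lets the master copy node later look up exactly which remote data to delete once the management node announces a target progress id.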

After performing S201 to generate the target remote metadata corresponding to the cold data set, the primary replica node of the present embodiment performs S202 to send the cold data set, the target remote metadata, and the primary replica identification information to the remote storage system.

The above content sent by the master copy node to the remote storage system in this embodiment is used by the remote storage system to store the target remote metadata and the cold data set according to the master copy identification information, so that a slave copy node can obtain the target remote metadata from the remote storage system according to the master copy identification information, and then find the corresponding cold data set in the remote storage system according to the target remote metadata.

If the master copy node in this embodiment generates the progress identification information corresponding to the target remote metadata when executing S201, the master copy node may also send the progress identification information to the remote storage system when executing S202, so that a slave copy node can acquire the corresponding progress identification information when acquiring the target remote metadata from the remote storage system.

After performing S202 to send the cold data set, the target remote metadata, and the primary copy identification information to the remote storage system, the primary copy node of the present embodiment performs S203 to replace the locally located cold data set with the target remote metadata.

That is, after the primary-replica node in this embodiment sends the cold data set determined at the current time and the target remote metadata corresponding to the cold data set to the remote storage system, the target remote metadata can be used to replace the cold data set located locally, so that the node does not store the cold data set determined at this time locally, thereby reducing the waste of local storage resources of the node.
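The S201 to S203 flow on the primary copy node might be sketched as below. The remote storage system is stood in for by a plain dictionary, and the path layout `<primary copy id>/<progress id>/<key>`, the entry format, and the function name are all hypothetical; a real distributed storage system would be accessed through its own client.

```python
def offload_cold_set(local, cold_keys, remote, primary_copy_id, progress_id):
    """Move a cold data set from local storage to a (simulated) remote store.

    local:  dict key -> {"kind": "local", "value": bytes}
    remote: {"objects": {path: bytes}, "metadata": {copy id: {progress id: metadata}}}
    """
    # S201: generate target remote metadata recording each item's remote path.
    metadata = {key: f"{primary_copy_id}/{progress_id}/{key}"
                for key in sorted(cold_keys)}

    # S202: send cold data, metadata and the primary copy id to remote storage.
    for key, path in metadata.items():
        remote["objects"][path] = local[key]["value"]
    remote["metadata"].setdefault(primary_copy_id, {})[progress_id] = metadata

    # S203: replace the local cold data with the target remote metadata.
    for key, path in metadata.items():
        local[key] = {"kind": "remote", "path": path}
    return metadata


local = {
    "a": {"kind": "local", "value": b"old-row"},
    "b": {"kind": "local", "value": b"new-row"},
}
remote = {"objects": {}, "metadata": {}}
md = offload_cold_set(local, {"a"}, remote, "dataset1-copy1", progress_id=1)
```

After the call, the local entry for `a` holds only a remote path, the bytes for `a` live in `remote["objects"]`, and the hot item `b` is untouched.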

It will be appreciated that the primary replica node of the present embodiment may continue to make multiple determinations of the cold data set, thereby generating target remote metadata corresponding to each determined cold data set; corresponding to the target remote metadata is local metadata, wherein the local metadata is attribute data of data stored locally and is used for recording a storage path of the data locally at the node, and the stored data can be searched locally at the node through the local metadata.

After executing S203, the master copy node of this embodiment may further perform the following: acquiring target progress identification information sent by the management node; determining a target invalid data list corresponding to the acquired target progress identification information; and deleting the data corresponding to the target invalid data list in the remote storage system.

That is, the master copy node of this embodiment deletes the corresponding invalid data in the remote storage system according to the target progress identification information sent by the management node, so that the data synchronization of the slave copy nodes is not affected and the waste of storage resources in the remote storage system is reduced.
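A minimal sketch of the deletion step, assuming the master copy node has recorded one invalid data list per progress id and that the remote storage system is simulated by a dictionary of paths. Whether lists for earlier progress ids are cleaned up at the same time is not stated in the source, so this sketch handles only the list for the target progress id.

```python
def delete_invalid_data(remote_objects, invalid_lists, target_progress_id):
    """Delete remote data named in the invalid data list for the target progress id.

    remote_objects: dict path -> bytes (stand-in for the remote storage system)
    invalid_lists:  dict progress id -> list of invalid remote paths
    Returns the list of paths actually deleted.
    """
    target_list = invalid_lists.pop(target_progress_id, [])
    deleted = []
    for path in target_list:
        if path in remote_objects:
            del remote_objects[path]
            deleted.append(path)
    return deleted


remote_objects = {
    "cold/part_1": b"valid cold data",
    "tmp/merge_leftover": b"stale",
    "slave_2/upload": b"duplicate",
}
invalid_lists = {
    2: ["tmp/merge_leftover", "slave_2/upload"],
    3: ["cold/part_1"],
}
deleted = delete_invalid_data(remote_objects, invalid_lists, target_progress_id=2)
```

Only the data listed for progress id 2 is removed; the list recorded for progress id 3 stays in place until the management node reports that every slave copy node has reached that progress.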

Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in fig. 3, the data processing method for data storage of the present embodiment is applied to a slave replica node of a target data set, where the slave replica node is a node where a slave data set replica is located, and specifically includes the following steps:

S301, determining a cold data set from a target data set, wherein the target data set is located locally at the slave copy node;

S302, obtaining target remote metadata from a remote storage system according to primary copy identification information, and replacing the locally located cold data set with the target remote metadata, wherein the primary copy identification information is sent by a management node.

The execution main body of the data processing method for data storage in this embodiment is located in a slave copy node, and after the slave copy node determines a cold data set in a target data set, according to main copy identification information sent by a management node, corresponding target remote metadata is obtained from a remote storage system, and then the obtained target remote metadata is used to replace the determined cold data set located locally.

The slave replica node of the present embodiment may adopt an alternative implementation manner when executing S301 to determine a cold data set from the target data set: determining the storage time of each data according to the current time and the storage time of each data in the target data set; and determining a cold data set according to the data with the storage time length being greater than or equal to the preset time length threshold, wherein the cold data set determined by the embodiment contains at least one data.
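The duration-based determination described above can be illustrated with a short sketch. The `Record` type, its field names, and the use of seconds as the time unit are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    key: str
    stored_at: float  # time at which the data was stored, in seconds


def determine_cold_data_set(
    records: List[Record], now: float, threshold_seconds: float
) -> List[Record]:
    """Return the records whose storage duration (now - stored_at)
    is greater than or equal to the preset duration threshold."""
    return [r for r in records if now - r.stored_at >= threshold_seconds]


# Hypothetical target data set: three records stored at different times.
records = [Record("a", 0.0), Record("b", 50.0), Record("c", 100.0)]
cold = determine_cold_data_set(records, now=100.0, threshold_seconds=50.0)
cold_keys = [r.key for r in cold]
```

Because the comparison is "greater than or equal to", record "b", whose storage duration equals the threshold exactly, is included in the cold data set.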

It will be appreciated that, for the same target data set, the storage time of the same data is identical in the data set copies at different nodes, so that at each determination time the cold data set determined by the slave replica node contains the same data as the cold data set determined by the primary replica node; thus, the embodiment may use the cold data set determined by the primary replica node as the first cold data set and the cold data set determined by the slave replica node as the second cold data set.

After the slave replica node of the present embodiment determines the cold data set from the target data set in S301, the slave replica node obtains the target remote metadata from the remote storage system according to the master replica identification information in S302, and replaces the cold data set locally with the target remote metadata.

In the embodiment, when executing S302, the slave replica node may further acquire progress identification information corresponding to the target remote metadata acquired this time, where different target remote metadata correspond to different progress identification information.

In the embodiment, when the slave copy node performs S302 to acquire the target remote metadata from the remote storage system according to the master copy identification information, the slave copy node may further acquire the target remote metadata from the remote storage system according to the master copy identification information and the progress identification information acquired in the previous time, so as to ensure that repeated target remote metadata cannot be acquired.

For example, suppose the progress identification information obtained by the slave replica node the previous time is progress identification information 1, and the remote storage system contains target remote metadata 1 corresponding to progress identification information 1, target remote metadata 2 corresponding to progress identification information 2, and target remote metadata 3 corresponding to progress identification information 3; then, when the slave replica node executes S302, it acquires target remote metadata 2 corresponding to progress identification information 2 from the remote storage system, and further transmits progress identification information 2 to the management node as the minimum progress identification information.
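The example above can be sketched as a lookup that skips metadata already fetched. The nested dictionary standing in for the remote storage system and the function name are hypothetical; the sketch also assumes that progress identification information is an increasing integer and that the next entry to fetch is the smallest one greater than the previously acquired value, which matches the example but is not stated as a general rule.

```python
from typing import Dict, Optional, Tuple


def fetch_next_remote_metadata(
    remote_index: Dict[str, Dict[int, str]],
    primary_copy_id: str,
    last_progress_id: int,
) -> Optional[Tuple[int, str]]:
    """Return (progress_id, target remote metadata) for the first entry
    newer than last_progress_id, or None if nothing new is available."""
    entries = remote_index.get(primary_copy_id, {})
    newer = [p for p in entries if p > last_progress_id]
    if not newer:
        return None
    next_id = min(newer)
    return next_id, entries[next_id]


# Hypothetical remote contents, keyed by primary copy identification.
remote_index = {
    "data set 1 replica 1": {
        1: "target remote metadata 1",
        2: "target remote metadata 2",
        3: "target remote metadata 3",
    }
}
result = fetch_next_remote_metadata(remote_index, "data set 1 replica 1", 1)
```

With a previously acquired progress identifier of 1, the call returns the entry for progress identifier 2, so target remote metadata 1 is not fetched a second time.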

The slave replica node of the present embodiment may further include the following after performing S302 the replacement of the locally located cold data set with the target remote metadata: taking the progress identification information acquired this time as the latest progress identification information; and sending the latest progress identification information to the management node for the management node to determine target progress identification information according to the latest progress identification information.

In addition, in some special cases, the slave replica node of the present embodiment may send the cold data set and the slave replica identification information of the slave replica node to a remote storage system, where the cold data set and the slave replica identification information stored in the remote storage system and uploaded by the slave replica node are used as invalid data; different slave replica nodes correspond to different slave replica identification information, such as "data set 1 replica 2", "data set 1 replica 3", and so forth.

Fig. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. Fig. 4 shows a schematic diagram of the data processing method for data storage of the present embodiment, in which the numbers in the copies represent different data stored locally at the nodes (a data set copy is simply called a copy). The primary replica node corresponding to the primary copy takes the data in the range 0-100 as a cold data set, generates the target remote metadata corresponding to the cold data set, namely "remoteMetaID:100", and sends the generated target remote metadata and the cold data set in the range 0-100 to the remote storage system; the remote storage system stores the target remote metadata and its corresponding cold data set, and the primary replica node uses "remoteMetaID:100" to replace the local cold data set (shown in the figure simply as "0-100" being replaced). When slave copy 2 triggers the conversion of hot data to cold data, it acquires the target remote metadata "remoteMetaID:100" from the remote storage system according to the primary copy identification information corresponding to the primary copy, and replaces the local cold data set "data in the range 0-100" with the target remote metadata (shown in the figure simply as "0-100" being replaced), thereby completing the synchronization of slave copy 2. When slave copy 1 triggers the conversion of hot data to cold data, it acquires the target remote metadata "remoteMetaID:100" from the remote storage system according to the primary copy identification information corresponding to the primary copy, and then uses the target remote metadata to replace the local data in the ranges "0-50" and "51-100" (shown in the figure simply as "0-50" being replaced and "51-100" being replaced), thereby completing the synchronization of slave copy 1.
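The local replacement in the Fig. 4 scenario can be sketched as follows. The representation of a copy as a list of integer intervals, the function name, and the extra interval (101, 150) are assumptions introduced for illustration; only the ranges 0-50, 51-100, 0-100 and the identifier "remoteMetaID:100" come from the description.

```python
from typing import List, Tuple, Union

Interval = Tuple[int, int]
Entry = Union[Interval, str]  # a local data interval or a remote metadata ID


def replace_with_remote_metadata(
    local: List[Entry], covered: Interval, remote_meta_id: str
) -> List[Entry]:
    """Replace every local interval lying inside `covered` with a single
    reference to the target remote metadata."""
    lo, hi = covered
    result: List[Entry] = []
    inserted = False
    for entry in local:
        if isinstance(entry, tuple) and lo <= entry[0] and entry[1] <= hi:
            if not inserted:
                result.append(remote_meta_id)
                inserted = True
            continue  # drop the cold interval itself
        result.append(entry)
    return result


# Slave copy 1 divides the cold range into two intervals, slave copy 2 into one.
slave_copy_1 = [(0, 50), (51, 100), (101, 150)]
slave_copy_2 = [(0, 100), (101, 150)]
synced_1 = replace_with_remote_metadata(slave_copy_1, (0, 100), "remoteMetaID:100")
synced_2 = replace_with_remote_metadata(slave_copy_2, (0, 100), "remoteMetaID:100")
```

Both slave copies end up with the same layout after replacement, even though they divided the cold range differently beforehand.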

Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. As shown in fig. 5, the data processing apparatus 500 for data storage of the present embodiment is applied to a management node, and includes:

a first determining unit 501, configured to select one data set copy from multiple data set copies corresponding to a target data set as a master data set copy, and use the remaining data set copies as slave data set copies;

a first sending unit 502, configured to obtain primary copy identification information of the master data set copy and send the primary copy identification information to a slave replica node of the slave data set copy, where the primary copy identification information is used by the slave replica node to obtain target remote metadata from a remote storage system and to replace a locally located cold data set with the target remote metadata, the target remote metadata being sent to the remote storage system by the primary replica node of the master data set copy.

The first determining unit 501 may first determine, when one data set copy is selected from a plurality of data set copies corresponding to a target data set as a master data set copy, the plurality of data set copies corresponding to the target data set, then select one of the plurality of data set copies as the master data set copy in a random selection manner, and use the remaining data set copies as slave data set copies.

It may be appreciated that after the first determining unit 501 determines the master data set replica and the slave data set replica, a node storing the master data set replica may be further regarded as a master replica node, and a node storing the slave data set replica may be regarded as a slave replica node.

When selecting one data set copy from the multiple data set copies corresponding to the target data set as the master data set copy, the first determining unit 501 may also determine multiple data set copies corresponding to the target data set first, then acquire the number of data intervals of each data set copy, and finally select one data set copy whose number of data intervals meets the preset requirement as the master data set copy, and use the remaining data set copies as the slave data set copies.

When the first determining unit 501 selects one data set copy whose number of data intervals meets the preset requirement as the main data set copy, one data set copy whose number of data intervals is greater than or equal to the preset number threshold may be selected as the main data set copy.

Since a larger number of data intervals means that the data set copy divides the target data set more finely, the first determining unit 501 takes the data set copy with the finer data division as the master data set copy, which facilitates the subsequent data synchronization of the nodes of the other data set copies.
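The interval-based selection described above can be sketched as follows. The disclosure only requires that the chosen copy have a number of data intervals greater than or equal to the preset number threshold; picking the copy with the most intervals among the eligible ones is an additional assumption of this sketch, as are the function and parameter names.

```python
from typing import Dict, List, Tuple


def select_primary_replica(
    interval_counts: Dict[str, int], min_intervals: int
) -> Tuple[str, List[str]]:
    """Pick one data set copy whose number of data intervals meets the
    threshold as the master copy; all other copies become slave copies."""
    eligible = [name for name, n in interval_counts.items() if n >= min_intervals]
    if not eligible:
        raise ValueError("no data set copy meets the preset number threshold")
    # Assumption: prefer the most finely divided copy among the eligible ones.
    primary = max(eligible, key=lambda name: interval_counts[name])
    slaves = [name for name in interval_counts if name != primary]
    return primary, slaves


# Hypothetical interval counts for three copies of data set 1.
counts = {
    "data set 1 replica 1": 3,
    "data set 1 replica 2": 1,
    "data set 1 replica 3": 2,
}
primary, slaves = select_primary_replica(counts, min_intervals=2)
```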

The management node of the present embodiment, after determining the master data set replica and the slave data set replica by the first determination unit 501, acquires master replica identification information of the master data set replica by the first transmission unit 502, and transmits the master replica identification information to the slave replica node of the slave data set replica.

The primary copy identification information acquired by the first sending unit 502 in this embodiment can uniquely identify different primary data set copies corresponding to different target data sets; for example, the primary copy identification information obtained may be "data set 1 copy 1", i.e., indicating that the primary copy identification information corresponds to data set copy 1 of target data set 1.

In order to implement deletion of invalid data in the remote storage system, the data processing apparatus 500 for data storage of the present embodiment may further include an updating unit 503 for performing the following: acquiring the latest progress identification information sent by the slave replica nodes; determining target progress identification information according to the different latest progress identification information sent by different slave replica nodes; and sending the determined target progress identification information to the primary replica node, so that the primary replica node can delete the invalid data corresponding to the target progress identification information in the remote storage system.

When determining the target progress identification information according to the different latest progress identification information sent by different slave replica nodes, the updating unit 503 of the present embodiment may, in the case where the latest progress identification information sent by the different slave replica nodes is determined to be identical, take that identical latest progress identification information as the target progress identification information.

When determining the target progress identification information according to the different latest progress identification information sent by different slave replica nodes, the updating unit 503 of the present embodiment may also, in the case where the latest progress identification information sent by the different slave replica nodes is determined to be inconsistent, take the minimum latest progress identification information as the target progress identification information.
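The two cases above can be sketched in a few lines. The function name and the mapping from slave replica identification to an integer progress value are assumptions for illustration.

```python
from typing import Dict


def determine_target_progress(latest_by_slave: Dict[str, int]) -> int:
    """Determine the target progress identification information from the
    latest progress identification reported by each slave replica node."""
    if not latest_by_slave:
        raise ValueError("no progress identification information received")
    values = set(latest_by_slave.values())
    if len(values) == 1:
        # All slave replica nodes report the same progress.
        return next(iter(values))
    # Reports differ: use the minimum, i.e. the slowest slave replica node.
    return min(values)


same = determine_target_progress(
    {"data set 1 replica 2": 2, "data set 1 replica 3": 2}
)
differ = determine_target_progress(
    {"data set 1 replica 2": 3, "data set 1 replica 3": 2}
)
```

Taking the minimum in the inconsistent case means that invalid data is deleted only up to the point that every slave replica node has already synchronized.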

Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in fig. 6, a data processing apparatus 600 for data storage of this embodiment is applied to a primary replica node, where the primary replica node is a node where a primary data set replica of a target data set is located, and includes:

a second determining unit 601, configured to determine a cold data set from a target data set, and generate target remote metadata corresponding to the cold data set, where the target data set is located in a local area of a primary replica node;

a second sending unit 602, configured to send the cold data set, the target remote metadata, and primary copy identification information to a remote storage system, where the remote storage system is configured to store the target remote metadata and the cold data set according to the primary copy identification information, where the target remote metadata is configured to search the cold data set in the remote storage system;

a first replacing unit 603 for replacing the cold data set locally located using the target remote metadata.
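The cooperation of the three units listed above can be sketched as follows. The classes `RemoteStorage` and `LocalReplica`, the function `offload_cold_data`, and the key names are hypothetical stand-ins; an in-memory dictionary takes the place of a real remote storage system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RemoteStorage:
    # primary copy identification -> remote metadata ID -> cold data
    objects: Dict[str, Dict[str, Dict[str, str]]] = field(default_factory=dict)

    def store(self, primary_id: str, meta_id: str, cold: Dict[str, str]) -> None:
        self.objects.setdefault(primary_id, {})[meta_id] = dict(cold)

    def lookup(self, primary_id: str, meta_id: str) -> Dict[str, str]:
        return self.objects[primary_id][meta_id]


@dataclass
class LocalReplica:
    data: Dict[str, str]
    remote_metadata: List[str] = field(default_factory=list)


def offload_cold_data(
    local: LocalReplica,
    cold_keys: List[str],
    primary_id: str,
    meta_id: str,
    remote: RemoteStorage,
) -> None:
    """Send the cold data set and its remote metadata to remote storage,
    then replace the local cold data set with the remote metadata."""
    cold = {key: local.data[key] for key in cold_keys}
    remote.store(primary_id, meta_id, cold)  # upload before deleting locally
    for key in cold_keys:
        del local.data[key]
    local.remote_metadata.append(meta_id)


remote = RemoteStorage()
replica = LocalReplica(data={"k1": "a", "k2": "b", "k3": "c"})
offload_cold_data(replica, ["k1", "k2"], "data set 1 replica 1",
                  "remoteMetaID:100", remote)
```

The local copy is removed only after the upload call returns, which mirrors the order of sending first and replacing afterwards.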

The second determining unit 601 of the present embodiment may adopt an alternative implementation manner when determining a cold data set from a target data set: determining the storage time of each data according to the current time and the storage time of each data in the target data set; and determining a cold data set according to the data with the storage time length being greater than or equal to the preset time length threshold, wherein the cold data set determined by the embodiment contains at least one data.

That is, the second determining unit 601 performs the determination of the cold data set according to the storage duration of each data in the target data set, and transfers and stores the data having a longer storage duration as the cold data to the remote storage system, so that the accuracy of the determined cold data set can be improved.

The target remote metadata corresponding to the cold data set generated by the second determining unit 601 in this embodiment is used to store the cold data set determined at this time in the remote storage system, where the target remote metadata may be considered as attribute data of the cold data set and is used to record a storage path of the cold data set in the remote storage system, and the cold data set determined at this time can be searched in the remote storage system through the target remote metadata.

After generating the target remote metadata corresponding to the cold data set, the second determining unit 601 of the present embodiment may further generate progress identification information corresponding to the target remote metadata, the progress identification information being used to indicate how many times the primary replica node has generated target remote metadata so far; for example, if the target remote metadata is generated for the first time, the progress identification information generated this time is 1; if the target remote metadata is generated for the second time, the progress identification information generated this time is 2; and so on.

Further, when generating the progress identification information corresponding to the target remote metadata, the second determining unit 601 of the present embodiment may further determine an invalid data list corresponding to the progress identification information, and further record the determined invalid data list.

After generating the target remote metadata this time, the second determining unit 601 of this embodiment may obtain the invalid data in the remote storage system, and then obtain and record an invalid data list according to the obtained invalid data; the invalid data obtained in this embodiment may be data left over in the remote storage system at the current time because different cold data sets were merged, or data uploaded by a slave replica node and stored in the remote storage system at the current time.
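The bookkeeping described in the preceding paragraphs, a counter for progress identification information together with the invalid data list recorded for each value, can be sketched as follows. The class name `ProgressLog` and the key names are hypothetical.

```python
from typing import Dict, Iterable, List


class ProgressLog:
    """Tracks how many times target remote metadata has been generated and
    records the invalid data list observed at each generation."""

    def __init__(self) -> None:
        self._count = 0
        self.invalid_lists: Dict[int, List[str]] = {}

    def record_generation(self, invalid_data: Iterable[str]) -> int:
        """Register one metadata generation and return its progress ID."""
        self._count += 1  # first generation -> 1, second -> 2, and so on
        self.invalid_lists[self._count] = list(invalid_data)
        return self._count


log = ProgressLog()
first = log.record_generation([])
second = log.record_generation(["cold/0-50", "cold/51-100"])
```

The recorded lists can later be looked up by the target progress identification information that the management node sends.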

The primary replica node of the present embodiment, after generating target remote metadata corresponding to the cold data set by the second determination unit 601, performs transmission of the cold data set, the target remote metadata, and primary replica identification information to the remote storage system by the second transmission unit 602.

If the second determining unit 601 of the present embodiment generates the progress identification information corresponding to the target remote metadata, the second sending unit 602 may also send the progress identification information to the remote storage system together with it, so that the slave replica node can obtain the corresponding progress identification information at the same time as it obtains the target remote metadata from the remote storage system.

The present embodiment replaces the locally located cold data set with the target remote metadata by the first replacement unit 603 after the cold data set, the target remote metadata, and the primary copy identification information are transmitted to the remote storage system by the second transmission unit 602.

That is, after the cold data set determined at the current time and the target remote metadata corresponding to the cold data set are sent to the remote storage system, the first replacing unit 603 may replace the cold data set located locally by using the target remote metadata, so that the node local does not store the cold data set determined at the current time any more, thereby reducing the waste of the storage resources of the node local.

The first replacement unit 603 of the present embodiment may also perform the following: acquiring target progress identification information sent by a management node; determining a target invalid data list corresponding to the acquired target progress identification information; and deleting the data corresponding to the acquired target invalid data list in the remote storage system.

That is, the first replacing unit 603 may implement the purpose of deleting the corresponding invalid data in the remote storage system according to the target progress identification information sent by the management node, so that the data synchronization of the slave copy node is not affected, and further, the waste of storage resources in the remote storage system is reduced.

Fig. 7 is a schematic diagram according to a seventh embodiment of the present disclosure. As shown in fig. 7, a data processing apparatus 700 for data storage of this embodiment is applied to a slave replica node of a target data set, where the slave replica node is a node where a slave data set replica is located, and includes:

a third determining unit 701 for determining a cold data set from a target data set, the target data set being local to the slave replica node;

a second replacing unit 702 is configured to obtain target remote metadata from a remote storage system according to primary copy identification information, where the primary copy identification information is sent by a management node, and replace the locally located cold dataset with the target remote metadata.

The third determining unit 701 of the present embodiment may adopt an alternative implementation manner when determining a cold data set from a target data set: determining the storage time of each data according to the current time and the storage time of each data in the target data set; and determining a cold data set according to the data with the storage time length being greater than or equal to the preset time length threshold, wherein the cold data set determined by the embodiment contains at least one data.

That is, the third determining unit 701 performs the determination of the cold data set according to the storage time length of each data in the target data set, and transfers and stores the data with a longer storage time length as the cold data to the remote storage system, so that the accuracy of the determined cold data set can be improved.

It will be appreciated that, for the same target data set, the storage time of the same data is identical in the data set replicas at different nodes, and therefore at each determination time the cold data set determined by the slave replica node contains the same data as the cold data set determined by the primary replica node.

The slave replica node of the present embodiment acquires target remote metadata from the remote storage system according to the master replica identification information by the second replacement unit 702 after the cold data set is determined from the target data set by the third determination unit 701, and replaces the locally located cold data set with the target remote metadata.

The second replacing unit 702 may further acquire progress identification information corresponding to the target remote metadata acquired this time, where different target remote metadata corresponds to different progress identification information.

The second replacing unit 702 of this embodiment may further obtain the target remote metadata from the remote storage system according to the primary copy identification information and the progress identification information obtained in the previous time when obtaining the target remote metadata from the remote storage system according to the primary copy identification information, so as to ensure that repeated target remote metadata is not obtained.

The second replacement unit 702 of the present embodiment may further include the following after replacing the locally located cold data set with the target remote metadata: taking the progress identification information acquired this time as the latest progress identification information; and sending the latest progress identification information to the management node for the management node to determine target progress identification information according to the latest progress identification information.

In the technical solution of the present disclosure, the acquisition, storage, and application of any user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 8 is a block diagram of an electronic device for the data processing method for data storage according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as a data processing method for data storage. For example, in some embodiments, the data processing method for data storage may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808.

In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the data processing method for data storage described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data processing method for data storage by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other data processing apparatus that is programmable for data storage, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram block or blocks to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a presentation device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for presenting information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A data processing method for data storage, applied to a management node, comprising:

selecting one data set copy from a plurality of data set copies corresponding to the target data set as a master data set copy, and taking the rest data set copies as slave data set copies;

and acquiring primary copy identification information of the master data set copy and transmitting the primary copy identification information to a slave replica node of the slave data set copy, wherein the primary copy identification information is used by the slave replica node to acquire target remote metadata from a remote storage system and to replace a locally located cold data set with the target remote metadata, the target remote metadata being transmitted to the remote storage system by the primary replica node of the master data set copy.

2. The method of claim 1, wherein the selecting one of the plurality of data set copies corresponding to the target data set as the master data set copy and the remaining data set copies as the slave data set copies comprises:

determining a plurality of data set copies corresponding to the target data set;

acquiring the number of data intervals of each data set copy;

selecting, as the master data set copy, one data set copy whose number of data intervals meets a preset requirement, and taking the remaining data set copies as the slave data set copies.
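By way of illustration only, the selection recited in claim 2 might be sketched as follows. The names `DatasetReplica` and `select_master` are hypothetical, and the choice of "fewest data intervals" as the default preset requirement is an assumption of this sketch; the claim does not state what the requirement is, so it is left as a replaceable parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetReplica:
    """Hypothetical record describing one data set copy."""
    replica_id: str
    interval_count: int  # number of data intervals held by this copy


def select_master(replicas, prefer=min):
    """Pick one copy as master and return (master, slaves).

    `prefer` stands in for the preset requirement. Defaulting to `min`
    (fewest data intervals) is an assumption, not taken from the claim.
    """
    if not replicas:
        raise ValueError("at least one data set copy is required")
    # Tie-break on replica_id so the choice is deterministic.
    master = prefer(replicas, key=lambda r: (r.interval_count, r.replica_id))
    slaves = [r for r in replicas if r is not master]
    return master, slaves
```

Passing `prefer=max` would instead select the copy with the most data intervals, should that be the requirement in a given embodiment.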

3. The method of claim 1, further comprising:

acquiring latest progress identification information sent by the slave replica node;

determining target progress identification information according to the latest progress identification information sent by different slave replica nodes;

sending the target progress identification information to the primary replica node, so that the primary replica node deletes invalid data corresponding to the target progress identification information from the remote storage system.

4. The method of claim 3, wherein the determining target progress identification information according to the latest progress identification information sent by different slave replica nodes comprises:

in a case where the latest progress identification information sent by the different slave replica nodes is identical, taking the identical latest progress identification information as the target progress identification information;

otherwise, taking the minimum of the latest progress identification information as the target progress identification information.
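The rule of claim 4 can be illustrated with the short sketch below. It assumes, purely for illustration, that progress identification information is a comparable integer and that the management node keeps the latest value per slave replica node in a dictionary; the function name `determine_target_progress` is hypothetical.

```python
def determine_target_progress(latest_by_slave):
    """Return the target progress given {slave_node_id: latest_progress}.

    Assumes progress identifiers are comparable integers (an assumption of
    this sketch). If all slaves report the same value, that value is used;
    otherwise the smallest reported value is used.
    """
    if not latest_by_slave:
        raise ValueError("no progress has been reported by any slave node")
    distinct = set(latest_by_slave.values())
    if len(distinct) == 1:
        # All slave replica nodes agree.
        return next(iter(distinct))
    # Slaves disagree: fall back to the slowest one.
    return min(distinct)
```

Taking the minimum reflects the idea that data in the remote storage system should only be treated as invalid once every slave replica node has progressed past it.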

5. A data processing method for data storage, applied to a primary replica node, comprising:

determining a cold dataset from a target dataset, generating target remote metadata corresponding to the cold dataset, the target dataset being local to the primary replica node;

sending the cold data set, the target remote metadata, and primary copy identification information to a remote storage system, so that the remote storage system stores the target remote metadata and the cold data set according to the primary copy identification information, wherein the target remote metadata is used for locating the cold data set in the remote storage system;

replacing the cold data set locally with the target remote metadata.
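The three steps of claim 5 might be sketched as below. Everything named here is hypothetical: `InMemoryRemoteStore` is a dictionary-backed stand-in for the remote storage system, the local data set is modelled as a plain dictionary, and the metadata layout (`object_key`) is an assumption chosen only to show how metadata can locate the offloaded data.

```python
class InMemoryRemoteStore:
    """Hypothetical stand-in for the remote storage system."""

    def __init__(self):
        self.objects = {}   # object_key -> cold data
        self.metadata = {}  # primary_id -> {local_key: remote metadata}

    def put(self, primary_id, metadata, cold_data):
        # Store the cold data under the keys named in the metadata, and
        # file the metadata under the primary copy identifier.
        for local_key, value in cold_data.items():
            self.objects[metadata[local_key]["object_key"]] = value
        self.metadata.setdefault(primary_id, {}).update(metadata)


def offload_cold_data(local, cold_keys, primary_id, remote):
    """Send cold entries to `remote`, then replace them locally."""
    cold_data = {k: local[k] for k in cold_keys}
    # Generate remote metadata that can later be used to find the data.
    metadata = {
        k: {"remote": True, "object_key": f"{primary_id}/{k}"}
        for k in cold_keys
    }
    remote.put(primary_id, metadata, cold_data)
    # Only after the upload, swap the local cold data for its metadata.
    for k in cold_keys:
        local[k] = metadata[k]
    return metadata
```

In this sketch the local replacement happens only after the upload call returns, so a failed upload leaves the local cold data set untouched.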

6. The method of claim 5, wherein the determining a cold data set from the target data set comprises:

determining a stored duration of each piece of data in the target data set according to a current time and a storage time of that piece of data;

determining the cold data set from the data whose stored duration is greater than or equal to a preset duration threshold.
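The age-based test of claim 6 can be illustrated as follows. The `Record` type, the use of epoch seconds, and the function name `find_cold_records` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical data item with the time at which it was stored."""
    key: str
    stored_at: float  # epoch seconds (assumed unit)


def find_cold_records(records, now, threshold_seconds):
    """Return records whose stored duration is >= the preset threshold."""
    return [r for r in records if now - r.stored_at >= threshold_seconds]
```

For example, with `now=1000.0` and a threshold of 600 seconds, a record stored at time 400.0 has a stored duration of exactly 600 seconds and is therefore treated as cold.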

7. The method of claim 5, further comprising:

generating progress identification information corresponding to the target remote metadata.

8. The method of claim 7, wherein the sending the cold data set, the target remote metadata, and primary copy identification information to a remote storage system comprises:

sending the cold data set, the target remote metadata, the primary copy identification information, and the progress identification information to the remote storage system.

9. The method of claim 7, further comprising:

acquiring target progress identification information sent by a management node;

determining a target invalid data list corresponding to the target progress identification information;

and deleting the data corresponding to the target invalid data list in the remote storage system.
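One possible reading of the cleanup in claim 9 is sketched below. It assumes that the primary replica node records, for each progress identifier, a list of remote object keys that became invalid at that point, and that every list at or below the target progress may be deleted. Both the bookkeeping structure and the "at or below" interpretation are assumptions of this sketch; the claim only recites a target invalid data list corresponding to the target progress identification information.

```python
def delete_invalid_data(remote_objects, invalid_by_progress, target_progress):
    """Delete remote objects listed as invalid up to `target_progress`.

    remote_objects:      dict of object_key -> data (stand-in for the
                         remote storage system; hypothetical)
    invalid_by_progress: dict of progress_id -> list of object keys
    Returns the keys that were actually deleted, in progress order.
    """
    deleted = []
    # Materialise the eligible progress ids first so the dictionary can be
    # modified safely inside the loop.
    eligible = sorted(p for p in invalid_by_progress if p <= target_progress)
    for progress in eligible:
        for object_key in invalid_by_progress.pop(progress):
            if object_key in remote_objects:
                del remote_objects[object_key]
                deleted.append(object_key)
    return deleted
```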

10. A data processing method for data storage, applied to a slave replica node, comprising:

determining a cold dataset from a target dataset, the target dataset being local to the slave replica node;

acquiring target remote metadata from a remote storage system according to primary copy identification information, and replacing the locally located cold data set with the target remote metadata, wherein the primary copy identification information is sent by a management node.
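The slave-side replacement of claim 10 might look like the sketch below. The remote storage system is modelled as a dictionary keyed by primary copy identifier, and the function name `replace_with_remote_metadata` is hypothetical. The sketch also assumes that a slave replica node replaces only those cold entries for which the primary replica node has already published metadata; the claim itself does not address entries without published metadata.

```python
def replace_with_remote_metadata(local, cold_keys, primary_id, remote_metadata):
    """Replace local cold entries with metadata published under `primary_id`.

    local:           dict of key -> data held by the slave replica node
    cold_keys:       keys the slave has determined to be cold
    remote_metadata: dict of primary_id -> {key: metadata}, a hypothetical
                     view of what the remote storage system holds
    Returns the keys that were replaced.
    """
    published = remote_metadata.get(primary_id, {})
    replaced = []
    for key in cold_keys:
        # Assumption: skip entries the primary has not uploaded yet, so no
        # local data is dropped without a remote counterpart.
        if key in published:
            local[key] = published[key]
            replaced.append(key)
    return replaced
```

Note that the slave replica node uploads nothing in this sketch; it reuses the metadata written by the primary replica node, so the cold data set is stored remotely only once.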

11. The method of claim 10, wherein the determining a cold data set from the target data set comprises:

12. The method of claim 10, wherein the acquiring target remote metadata from a remote storage system according to primary copy identification information comprises:

acquiring the target remote metadata and progress identification information corresponding to the target remote metadata from the remote storage system according to the primary copy identification information.

13. The method of claim 12, further comprising:

taking the progress identification information acquired this time as latest progress identification information;

sending the latest progress identification information to the management node, so that the management node determines target progress identification information according to the latest progress identification information.

14. A data processing apparatus for data storage, for application to a management node, comprising:

a first determining unit, configured to select one data set copy from a plurality of data set copies corresponding to the target data set as a master data set copy, and use the remaining data set copies as slave data set copies;

a first sending unit, configured to acquire primary copy identification information of the master data set copy and send the primary copy identification information to a slave replica node of a slave data set copy, wherein the primary copy identification information is used by the slave replica node to acquire target remote metadata from a remote storage system and to replace a locally located cold data set with the target remote metadata, the target remote metadata having been sent to the remote storage system by a primary replica node of the master data set copy.

15. A data processing apparatus for data storage, for application to a primary replica node, comprising:

a second determining unit, configured to determine a cold data set from a target data set, and generate target remote metadata corresponding to the cold data set, where the target data set is located locally to the primary replica node;

a second sending unit, configured to send the cold data set, the target remote metadata, and primary copy identification information to a remote storage system, so that the remote storage system stores the target remote metadata and the cold data set according to the primary copy identification information, wherein the target remote metadata is used for locating the cold data set in the remote storage system;

a first replacement unit for replacing the cold data set locally using the target remote metadata.

16. A data processing apparatus for data storage, for application to a slave replica node, comprising:

a third determining unit configured to determine a cold data set from a target data set, the target data set being located locally to the slave replica node;

a second replacing unit, configured to acquire target remote metadata from a remote storage system according to primary copy identification information and to replace the locally located cold data set with the target remote metadata, wherein the primary copy identification information is sent by a management node.

17. A data processing system for data storage, comprising a management node, a primary replica node, and at least one slave replica node;

wherein the management node is configured to perform the method of any one of claims 1-4, the primary replica node is configured to perform the method of any one of claims 5-9, and the at least one slave replica node is configured to perform the method of any one of claims 10-13.

18. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.

19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-13.

20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-13.