CN114036104A

CN114036104A - Cloud archiving method, device and system for deduplicated data based on distributed storage

Info

Publication number: CN114036104A
Application number: CN202111309242.5A
Authority: CN
Inventors: 陈元强; 蔡涛; 李文祥
Original assignee: Shenzhen Mulangyun Technology Co ltd
Current assignee: Shenzhen Mulangyun Technology Co ltd
Priority date: 2021-11-06
Filing date: 2021-11-06
Publication date: 2022-02-11

Abstract

The invention discloses a cloud archiving method, device and system for deduplicated data based on distributed storage. The method comprises: parsing local metadata, traversing all data fragments referenced by the local metadata, and checking whether each data fragment already exists at the cloud storage end; if the data fragment does not exist at the cloud storage end, uploading it to the cloud storage end as a data object, adding an object tag, and setting the reference count in the object tag to 1; and if the data fragment already exists at the cloud storage end, adding 1 to the reference count in the object tag of that data fragment. The method and the device solve the technical problem that deduplicated data cannot be archived to object storage because object storage is unstructured.

Description

Cloud archiving method, device and system for deduplicated data based on distributed storage

Technical Field

The invention relates to the field of cloud storage, and in particular to a cloud archiving method, device and system for deduplicated data based on distributed storage.

Background

Object storage is an ideal data archiving solution, and has the characteristics of stability, reliability, large capacity (no limit in theory) and extremely low cost. The object storage is very simple to use, data is taken as objects, each object has a unique ID, and the data is uploaded into the object storage by using a related interface.

Using object storage for data archiving is not complex in itself. However, when the local storage is a deduplication storage, uploading the data directly loses the deduplication effect and increases the space overhead. Deduplicated data is complex to manage: the same data block can be shared by different files, so the association relationships between the data need to be maintained. Object storage is unstructured and does not support in-place modification; after a local modification the object can only be uploaded again in full, which makes object storage poorly suited to handling association relationships among data.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiments of the invention provide a cloud archiving method, device and system for deduplicated data based on distributed storage, so as to at least solve the technical problem that deduplicated data cannot be archived to object storage because object storage is unstructured.

According to an aspect of an embodiment of the present invention, there is provided a cloud archive method for deduplication data based on distributed storage, including: analyzing local metadata, traversing all referenced data fragments of the local metadata, and checking whether each data fragment in the data fragments already exists at a cloud storage end; under the condition that no corresponding data fragment exists in a cloud storage end, uploading the corresponding data fragment serving as a data object to the cloud storage end, adding an object tag, and setting the number of times of reference in the object tag as 1; and under the condition that the corresponding data fragment exists at the cloud storage end, adding 1 to the reference times in the object tag corresponding to the corresponding data fragment.

According to another aspect of the embodiments of the present invention, there is also provided a cloud archive apparatus for deduplicated data based on distributed storage, including: a parsing module configured to parse local metadata, traverse all data fragments referenced by the local metadata, and check whether each data fragment already exists at the cloud storage end; and an uploading module configured to, if the data fragment does not exist at the cloud storage end, upload it to the cloud storage end as a data object, add an object tag, and set the reference count in the object tag to 1, and, if the data fragment already exists at the cloud storage end, add 1 to the reference count in the object tag of that data fragment.

According to another aspect of the embodiment of the present invention, there is further provided a cloud archive system for deduplication data based on distributed storage, including a local end and a cloud storage end, where the local end includes the cloud archive apparatus as described above, and the cloud storage end is configured to store data fragments in an object storage manner.

In the embodiments of the invention, local metadata is parsed, all data fragments referenced by the local metadata are traversed, and each data fragment is checked for existence at the cloud storage end. If the data fragment does not exist at the cloud storage end, it is uploaded to the cloud storage end as a data object, an object tag is added, and the reference count in the object tag is set to 1. If the data fragment already exists at the cloud storage end, the reference count in its object tag is increased by 1. Setting the object tag makes it possible to store deduplicated data in object storage, which achieves the technical effect of exploiting the advantages of object storage while further saving space, and solves the technical problem that deduplicated data cannot be archived to object storage because object storage is unstructured.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of a cloud archive storage structure based on distributed storage of deduplicated data, according to an embodiment of the present invention;

FIG. 2 is a flow chart of a cloud archiving method based on distributed storage of deduplication data in accordance with an embodiment of the present invention;

FIG. 3 is a flowchart of an uploading method of deduplication data based on distributed storage according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for downloading deduplication data based on distributed storage according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for deleting deduplicated data based on distributed storage according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an archiving system based on distributed storage for deduplication data according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an archive device for deduplication data based on distributed storage according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Definition of terms

Deduplication: data deduplication is a technique for saving storage space in which only one copy of any duplicated data is stored.

Data archiving: refers to the process of transferring infrequently used data to another storage device for long term storage.

Object storage: a form of cloud storage in which all objects are stored in a single flat address space with no hierarchy. It is suitable for storing massive amounts of data.

Object tag: a function provided by the object storage system that attaches additional information to an object in the form of key-value pairs; it can be used for classifying and managing objects.

Overview

The storage structure comprises local data and cloud storage end data. Data is stored in local storage and can be archived to cloud storage; data can also be retrieved from cloud storage. The overall storage structure of the present embodiment is shown in FIG. 1.

The data of a file is divided into two parts in the local storage: metadata (Metadata) and Data fragments (Data Chunk).

The metadata contains basic information of the file, has a unique ID (0001 in FIG. 2), and the main content is the ID of the data block contained in the file.

Data fragmentation: data is managed in smaller pieces, each piece having a unique ID. Data deduplication is performed at the fragment level, and only one copy of the same fragment is stored. Since the same shard may be shared by multiple files, an additional database is maintained to record the number of times each shard is referenced.

When the file is archived in cloud storage, the two pieces of Data become a metadata Object (Meta Object) and a Data Object (Data Object), respectively.

Metadata object: the local metadata is uploaded to the object storage with its unique ID used as the object ID.

Data object: the local data fragment is uploaded to the object storage with its unique ID used as the object ID. Because a database cannot be kept in the object storage, the present application records the reference count of the data fragment as a tag on the object.

Example 1

According to an embodiment of the present invention, a cloud archiving method for deduplicated data based on distributed storage is provided. As shown in FIG. 2, the method includes:

Step S202, parsing local metadata, traversing all data fragments referenced by the local metadata, and checking whether each data fragment already exists at the cloud storage end;

For each data fragment, step S204 or step S206 is executed depending on the result of the check.

Step S204, under the condition that the corresponding data fragments do not exist in the cloud storage end, uploading the corresponding data fragments serving as data objects to the cloud storage end, adding object tags, and setting the number of times of reference in the object tags to be 1;

Step S206, if the corresponding data fragment exists at the cloud storage end, adding 1 to the reference count in the object tag of that data fragment.

In an exemplary embodiment, after traversing all data fragments referenced by the local metadata, the method further comprises: distributing the task of uploading the corresponding data fragment to the corresponding node based on a preset assignment rule.

For example, the number of tasks by which each node holding the data falls short of its completion target is calculated. If these shortfalls differ, the node with the largest shortfall is selected and the task of uploading the data fragment is distributed to it. If the shortfalls are the same, the number of remaining assignable tasks of each node holding the data is calculated; if these numbers differ, the node with the fewest remaining assignable tasks is selected, otherwise the first node is selected, and the task of uploading the data fragment is then distributed to the selected node.

For example, before distributing the task of uploading the respective data fragment to the respective node, the method further comprises: executing a data distribution algorithm, and calculating the distribution condition of all data fragments on each node to obtain a data distribution matrix; and traversing the distribution matrix according to columns, and calculating the number of data fragments on each node.

In an exemplary embodiment, calculating the number of tasks by which each node holding the data falls short of its completion target comprises: calculating that number from the number of tasks the node is expected to be assigned and the number of tasks it has already been assigned, wherein the expected number is determined by the total number of nodes and the number of data fragments on each node; and/or calculating the number of remaining assignable tasks of each node holding the data comprises: calculating that number from the number of times the node has taken part in a selection and the number of data fragments it contains.

In an exemplary embodiment, a task assignment method similar to that at the time of uploading may also be employed to download data from the cloud storage. For example, the downloading method may include: downloading a metadata object from a cloud storage end; analyzing the metadata in the metadata object, and traversing all the data fragments referenced by the metadata; for each of all referenced data slices, performing the following: distributing tasks for downloading corresponding data fragments to corresponding nodes based on preset allocation rules; checking whether the corresponding data slice already exists locally; under the condition that corresponding data fragments do not exist locally, downloading the corresponding data fragments from a cloud storage end by the distributed nodes, copying the data fragments, and updating information in a local reference counting database; in the case of a local presence of a corresponding data fragment, the information in the local reference count database is updated by the assigned node.

In an exemplary embodiment, a method for deleting the deduplication data of the cloud storage side is further provided. The deleting method comprises the following steps: downloading a metadata object from a cloud storage end, and removing the metadata object from the cloud storage end; analyzing the metadata in the metadata object, and traversing all the data fragments referenced by the metadata; and for each data fragment in all the referenced data fragments, subtracting 1 from the reference count in the object tag corresponding to the corresponding data fragment at the cloud storage end, and removing the data object corresponding to the corresponding data fragment from the cloud storage end under the condition that the current count is 0.

By the method, a simple and feasible cloud filing technical scheme is realized. The embodiment is used for archiving the data in the local deduplication storage to the object storage and managing the data, so that the advantages of the object storage are utilized, and the space of the cloud storage end is further saved.

Example 2

An embodiment of the present invention provides a cloud archiving method for deduplication data stored in a distributed manner, and as shown in fig. 3, the method includes:

Step S301, parsing the local metadata and traversing all referenced fragments.

Step S302, a dispatching method is executed, and the upload task is distributed to each node.

The task assignment method will be described in detail below.

In this embodiment, it is assumed that the current distributed storage system has 3 nodes, the system is configured with two copies, and one data fragment needs to store two copies and is placed on two nodes (nodes). Assuming 6 DATA slices (DATA Chunk), the distribution is shown in table 1 below (with √ to indicate which nodes the DATA slice is on):

TABLE 1

There are many strategies for distributing data. Data distribution is an important part of distributed storage, but it is not related to the content of the present application and is therefore not described in detail here.

The above distribution is idealized: the number of data fragments stored on each node is uniform. In practical applications, however, such an idealized situation is uncommon.

In this embodiment, the following function represents the distribution of data:

[Node Number] = Distribute(chunk-id)

Each data fragment has a unique identifier (chunk-id). Substituting the chunk-id into the distribution function returns the list of nodes on which the fragment is placed. In the example above, where the data needs two copies, substituting Chunk-1 into the distribution function yields [Node-1, Node-2].

Returning to the problem of cloud archiving, the data now needs to be uploaded to the cloud storage end. Chunk-1 has one copy on each of Node-1 and Node-2, but it only needs to be uploaded once, so which node should perform the upload? In this embodiment, the upload task is assigned to one specific node for execution, and the process of selecting that node is defined as the task assignment method:

Task Node Number = Select(Distribute(chunk-id))

How the assignment method is designed is explained below. The present embodiment first takes the simplest approach, which is to always select the first node in the list:

Task Node Number = Select-First(Distribute(chunk-id))

With this algorithm, the resulting task assignments are shown in Table 2 below:

TABLE 2

As shown in Table 2 above, this assignment method has the advantage of being simple, but it has a disadvantage: no task is assigned to Node-3, so Node-3 is completely idle during the whole upload and the performance of the system cannot be fully exploited.
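The imbalance of Select-First can be illustrated with a minimal Python sketch. The placement table below is hypothetical (the contents of Table 1 are not reproduced in this text); it only assumes, as in the example, 6 fragments, 3 nodes and two copies per fragment, with Chunk-1 on Node-1 and Node-2. The names `PLACEMENT`, `distribute`, `select_first` and `assign_select_first` are illustrative and not part of the application.

```python
from collections import Counter

# Hypothetical placement: chunk-id -> ordered list of nodes holding a copy.
# Each fragment has two copies and each node holds four fragments.
PLACEMENT = {
    "Chunk-1": ["Node-1", "Node-2"],
    "Chunk-2": ["Node-2", "Node-3"],
    "Chunk-3": ["Node-1", "Node-3"],
    "Chunk-4": ["Node-1", "Node-2"],
    "Chunk-5": ["Node-2", "Node-3"],
    "Chunk-6": ["Node-1", "Node-3"],
}


def distribute(chunk_id):
    """Stand-in for Distribute(chunk-id): return the node list of a fragment."""
    return PLACEMENT[chunk_id]


def select_first(nodes):
    """Select-First: always pick the first node in the list."""
    return nodes[0]


def assign_select_first(chunk_ids):
    """Map every fragment to the node that would upload it."""
    return {c: select_first(distribute(c)) for c in chunk_ids}


assignment = assign_select_first(PLACEMENT)
load = Counter(assignment.values())  # number of upload tasks per node
```

With this placement, `load` gives Node-1 four tasks, Node-2 two tasks and Node-3 none, which is the kind of idle node the text describes.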

Therefore, the embodiment provides another task assignment method.

To solve the above problem of uneven assignment, the goal is to distribute the upload tasks as evenly as possible over the different nodes. This embodiment does so as follows. First, a target is set for each node: in the above example, a total of 6 data fragments need to be uploaded and there are 3 nodes, so the ideal target is that each node uploads 6/3 = 2 data fragments. Once the target is set, all the fragments are traversed against the actual data distribution, and a node is selected for each fragment according to the following principles:

1. Calculate, for each node holding the data fragment, its distance from the completion target, and select the node with the largest distance.

2. If the nodes are tied on condition 1, select the node with the fewest remaining assignable tasks.

3. If the nodes are still tied on condition 2, select the first node.

In this embodiment, condition 1 is the core and ensures that the overall assignment is even: once a node is assigned a task it moves closer to its target, so its priority drops in the next selection. Condition 2 gives priority to the node holding less data, because a node holding more data is likely to have further chances to take part in later assignments. Condition 3 ensures that the algorithm terminates.

The assignment procedure is not worked through step by step here; the assignment result for the above example is given directly in Table 3 below:

TABLE 3

          Node-1  Node-2  Node-3
Chunk-1   √
Chunk-2           √
Chunk-3           √
Chunk-4           √
Chunk-5   √
Chunk-6                   √

Each node is allocated exactly 2 tasks. This process can be generalized; a formal definition of the algorithm is given below in pseudo-code.

First, the following definitions are made:

1) The number of system nodes is m (i.e., Node-1, Node-2, Node-3, ...)

2) There are n data fragments (i.e., Chunk-1, Chunk-2, Chunk-3, ...)

3) The number of copies in the system is r

4) The number of tasks each node is expected to be assigned is: e = n/m

For each node, the following values are required:

1) The number of tasks already assigned: t

2) The distance to the target: d = e - t

3) The number of data fragments contained: a

4) The number of times the node has taken part in a selection (whether selected or not): s

5) The remaining number of assignable tasks: f = a - s

With the above definitions, first traverse all the data fragments that need to be uploaded and execute the data distribution algorithm to calculate how the fragments are distributed over the nodes.

This yields a data distribution matrix, like the distribution table in the example above.

Traverse the matrix by columns and calculate the number of data fragments on each node:

for i in 0 → m:
    a[i] = sum(matrix[][i])

Then task assignment starts. Dispatch(x) is executed to assign the task to node x. The above flow is the task assignment in the data uploading process.
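The three selection principles and the per-node values e, t, d, a, s and f can be combined into a short Python sketch. This is one possible reading under stated assumptions, not the application's own pseudo-code: the function name `dispatch_tasks` is hypothetical, every fragment is assumed to have at least one copy, f is evaluated before s is incremented for the current fragment, and the example matrix is a hypothetical placement rather than the contents of Table 1.

```python
def dispatch_tasks(matrix):
    """Return, for each fragment (row), the index of the node chosen to upload it.

    matrix[k][i] == 1 means fragment k has a copy on node i.
    """
    if not matrix:
        return []
    n = len(matrix)                 # number of data fragments
    m = len(matrix[0])              # number of nodes
    e = n / m                       # tasks each node is expected to be assigned
    a = [sum(row[i] for row in matrix) for i in range(m)]  # fragments per node
    t = [0] * m                     # tasks already assigned to each node
    s = [0] * m                     # times each node took part in a selection

    chosen = []
    for row in matrix:
        candidates = [i for i in range(m) if row[i]]
        # Principle 1: largest distance d = e - t.
        # Principle 2: smallest remaining assignable tasks f = a - s.
        # Principle 3: first node (max() keeps the first of equal keys).
        x = max(candidates, key=lambda i: (e - t[i], -(a[i] - s[i])))
        t[x] += 1
        for i in candidates:        # every candidate took part, selected or not
            s[i] += 1
        chosen.append(x)
    return chosen


# Hypothetical placement: 6 fragments, 3 nodes, two copies per fragment.
matrix = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]
assignment = dispatch_tasks(matrix)
tasks_per_node = [assignment.count(i) for i in range(3)]
```

For this placement each of the three nodes ends up with two upload tasks, which matches the target e = 6/3 = 2.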

In one exemplary embodiment, tasks may also be assigned to nodes based on a neural network model and using the following formula:

The above formula is a trained neural network model. Its output indicates that the assigned node is at the i-th row and j-th column; η is a correction parameter, H_j is the quadratic-term matrix coefficient of the neural network model, n is the number of data fragments, e is the number of tasks each node is expected to be assigned, r is the number of copies in the system, t_ij is the number of tasks already assigned, and d_ij is the distance from the target.

Step S303, checking whether the fragment already exists at the cloud storage end.

If so, step S306 is performed, otherwise, step S304 is performed.

Step S304, uploading the fragment to the cloud object storage.

Step S305, adding an object tag and setting the reference count to 1.

Then step S307 is executed.

Step S306, adding 1 to the reference count in the object tag.

Step S307, determining whether all the fragments have been processed.

If not, the next data fragment is processed and the flow returns to step S303; otherwise, step S308 is executed.

In step S308, the metadata object is uploaded.
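Steps S303 to S308 can be sketched in Python against an in-memory stand-in for the object storage. Everything below is illustrative: the class `InMemoryObjectStore`, the function `archive_file`, the tag key `"refcount"` and the comma-separated metadata encoding are assumptions made for the sketch, not the interface of any real object storage service, and the task assignment of step S302 is left out.

```python
class InMemoryObjectStore:
    """Hypothetical stand-in for an object storage service with object tags."""

    def __init__(self):
        self.objects = {}   # object ID -> bytes
        self.tags = {}      # object ID -> dict of tag key/value pairs

    def exists(self, object_id):
        return object_id in self.objects

    def put(self, object_id, data, tags=None):
        self.objects[object_id] = data
        self.tags[object_id] = dict(tags or {})


def archive_file(store, meta_id, chunk_ids, local_chunks):
    """Archive one file: upload missing fragments, bump counts, upload metadata."""
    for chunk_id in chunk_ids:                           # loop closed by S307
        if store.exists(chunk_id):                       # S303
            count = int(store.tags[chunk_id]["refcount"])
            store.tags[chunk_id]["refcount"] = str(count + 1)   # S306
        else:
            # S304 and S305: upload the fragment and tag it with a count of 1.
            store.put(chunk_id, local_chunks[chunk_id], {"refcount": "1"})
    # S308: the metadata object records which fragments the file references.
    store.put(meta_id, ",".join(chunk_ids).encode())


store = InMemoryObjectStore()
chunks = {"c1": b"aaa", "c2": b"bbb", "c3": b"ccc"}
archive_file(store, "0001", ["c1", "c2"], chunks)
archive_file(store, "0002", ["c2", "c3"], chunks)
```

After the two calls, the shared fragment `c2` is stored once with a reference count of 2, while `c1` and `c3` each have a count of 1.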

Example 3

According to an embodiment of the present invention, there is provided a method for downloading data from a cloud storage, as shown in fig. 4, the method including:

Step S401, downloading the metadata object.

Step S402, parsing the metadata and traversing all the referenced fragments.

Step S403, executing the assignment method and distributing the download tasks to the nodes.

Data downloading faces the same problem as data uploading: the relationship between cloud storage data and local data is 1:N. The download tasks are distributed to different nodes using the same assignment algorithm as for uploading (which keeps the assignment even); each node then downloads its data and, once the download is finished, copies the data to the other nodes according to the distribution algorithm.

Step S404 checks whether the fragment already exists locally.

If the data fragment already exists locally, step S407 is executed, otherwise, step S405 is executed.

Step S405, downloading the data fragment from the cloud object storage at the cloud storage end.

In step S406, the data slice is copied.

In step S407, the information in the local reference count database is updated.

Step S408, determining whether all the fragments have been processed.

If all the data fragments have been processed, the flow ends; otherwise, the next data fragment is processed and the flow returns to step S404.
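The download flow of steps S401 to S408 can be sketched in the same style. The sketch assumes plain dictionaries in place of the cloud object storage, the local fragment store and the local reference count database, assumes that "updating" the local database means adding 1 to the fragment's count, and reuses a hypothetical comma-separated metadata encoding. Task assignment (S403) and replication to other nodes (S406) are omitted.

```python
def restore_file(cloud_objects, meta_id, local_chunks, local_refcount):
    """Restore one archived file into local deduplicated storage."""
    # S401 and S402: download the metadata object and list its fragments.
    chunk_ids = cloud_objects[meta_id].decode().split(",")
    for chunk_id in chunk_ids:                       # loop closed by S408
        if chunk_id not in local_chunks:             # S404
            # S405: download the fragment (S406 replication is omitted here).
            local_chunks[chunk_id] = cloud_objects[chunk_id]
        # S407: update the local reference count database.
        local_refcount[chunk_id] = local_refcount.get(chunk_id, 0) + 1
    return chunk_ids


cloud = {"0001": b"c1,c2", "c1": b"aaa", "c2": b"bbb"}
local_chunks = {"c2": b"bbb"}      # c2 is already held locally by another file
local_refcount = {"c2": 1}
restored = restore_file(cloud, "0001", local_chunks, local_refcount)
```

Only `c1` is actually downloaded in this example; `c2` is already present locally, so just its local reference count rises from 1 to 2.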

Example 4

In deduplication storage, data cannot be deleted directly because one piece of data may be shared by multiple files. In this embodiment, reference counting is adopted for management, and when the reference counting is 0, the data fragment may be deleted.

According to an embodiment of the present invention, there is provided a method for deleting data from cloud storage, as shown in fig. 5, the method including:

Step S501, downloading the metadata object and removing it from the cloud storage.

Step S502, parsing the metadata and traversing all the referenced fragments.

In step S503, the reference count in the object tag is decremented by 1.

In step S504, it is determined whether the current count is 0.

If the current count is 0, step S505 is performed, otherwise step S506 is performed.

Step S505, removing the data object from the cloud storage end.

Step S506, determining whether all the slices have been processed.

If all the data fragments have been processed, the flow ends; otherwise, the next data fragment is processed and the flow returns to step S503.
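A minimal Python sketch of steps S501 to S506 follows. The dictionaries `objects` and `tags` stand in for the cloud object storage and its object tags; the function name `delete_archived_file`, the tag key `"refcount"` and the comma-separated metadata encoding are assumptions made for illustration.

```python
def delete_archived_file(objects, tags, meta_id):
    """Delete one archived file and return the IDs of fragments removed with it."""
    metadata = objects.pop(meta_id)                  # S501: download and remove
    removed = []
    for chunk_id in metadata.decode().split(","):    # S502, loop closed by S506
        count = int(tags[chunk_id]["refcount"]) - 1  # S503
        if count == 0:                               # S504
            del objects[chunk_id]                    # S505
            del tags[chunk_id]
            removed.append(chunk_id)
        else:
            tags[chunk_id]["refcount"] = str(count)  # still referenced elsewhere
    return removed


objects = {
    "0001": b"c1,c2", "0002": b"c2,c3",
    "c1": b"aaa", "c2": b"bbb", "c3": b"ccc",
}
tags = {
    "c1": {"refcount": "1"},
    "c2": {"refcount": "2"},
    "c3": {"refcount": "1"},
}
removed = delete_archived_file(objects, tags, "0001")
```

Deleting file `0001` removes `c1`, whose count drops to 0, and keeps `c2`, which is still referenced by file `0002`.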

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 5

According to an embodiment of the present invention, a cloud archiving system for deduplication data stored in a distributed manner is provided, as shown in fig. 6, the system includes: a local side 62 and a cloud storage side 64.

The local end 62 includes a storage access interface 621, a deduplication management module 622, a metadata management module 624, a data block storage module 626, an archive management module 628, and a cloud data management module 620.

The storage access interface 621 implements the interface for accessing local storage and gives applications the ability to write and read data. Data is first stored in the local storage device and then archived to the cloud storage.

The deduplication management module 622 performs data deduplication on locally stored data, for example, implementing functions of data fragmentation, fingerprint calculation, fingerprint storage and query, and the like. Since data deduplication is not the content of the present application, it is not described herein again.

The metadata management module 624 is used for managing the metadata of the stored data, which is divided into two parts: data attributes and data block references. The data attributes comprise information such as the size, time and permissions of the data; the data block references record which data blocks a piece of data contains.

The data block storage module 626 is used for the storage and management of data blocks. After deduplication, the data is divided into independent data blocks, which the data block storage module 626 stores and manages. Its functions include storing and indexing data blocks (to accelerate queries) and garbage collection of invalid data (data with no references).

The archive management module 628 is used for managing data archive method and records, and a user may configure a certain policy, such as regular archive, archive after a certain time, archive a certain type of data, and the like, or manually select to archive a certain part of locally stored data to the cloud storage. After the archiving, a record is formed for distinguishing which data exists at the cloud storage end.

The cloud data management module 620 interfaces with the cloud storage end 64 and is the module that actually performs data archiving. It comprises several sub-modules, such as cloud connection management, task assignment, data uploading and data downloading, which interact with the cloud storage end 64. Because the reference relationships of the data at the cloud storage end differ from the local ones, the module further includes a cloud data reference calculation sub-module: the references are recalculated on every upload and download, and when a reference count drops to 0 the data fragment at the cloud storage end needs to be deleted.

The cloud storage 64 includes an object storage module, which belongs to an external resource, for storage of archived data. Cloud storage has many different forms, and the application relates to object storage.

In an exemplary embodiment, the cloud data management module 620 of the local end 62 may be the cloud archive device for deduplicated data based on distributed storage. That device is described in detail below and is not described further here.

The cloud archiving system of this embodiment can implement the cloud archiving method for deduplicated data based on distributed storage of the foregoing embodiments; details are not repeated here.

Example 6

According to an embodiment of the present invention, there is also provided a cloud archive apparatus for implementing the above cloud archiving method for deduplicated data based on distributed storage. As shown in FIG. 7, the apparatus includes:

the parsing module 72 is configured to parse the local metadata, traverse all referenced data slices of the local metadata, and check whether each of the data slices already exists at a cloud storage end;

the uploading module 74 is configured to: when the corresponding data fragment does not exist at the cloud storage end, upload the corresponding data fragment as a data object to the cloud storage end, add an object tag, and set the reference count in the object tag to 1; and when the corresponding data fragment already exists at the cloud storage end, add 1 to the reference count in the object tag of the corresponding data fragment.
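The behaviour of the parsing and uploading modules can be illustrated with a minimal sketch. It assumes the cloud storage end is modelled as an in-memory dictionary rather than a real object storage service; the names `archive_fragments`, `refcount`, and the layout of `metadata` are hypothetical and do not come from the application.

```python
def archive_fragments(metadata, local_fragments, cloud):
    """Archive every fragment referenced by `metadata` (hypothetical layout).

    `cloud` maps an object key to {"data": bytes, "tag": {"refcount": int}}
    and stands in for the cloud storage end.
    """
    for fragment_id in metadata["fragments"]:
        if fragment_id not in cloud:
            # Fragment not yet at the cloud end: upload it as a data object
            # and attach an object tag whose reference count starts at 1.
            cloud[fragment_id] = {
                "data": local_fragments[fragment_id],
                "tag": {"refcount": 1},
            }
        else:
            # Fragment already archived: no upload, only increment the count.
            cloud[fragment_id]["tag"]["refcount"] += 1


# Two archives that share fragment "b".
local_fragments = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
cloud = {}
archive_fragments({"fragments": ["a", "b"]}, local_fragments, cloud)
archive_fragments({"fragments": ["b", "c"]}, local_fragments, cloud)
refcounts = {key: obj["tag"]["refcount"] for key, obj in cloud.items()}
```

In this sketch the shared fragment "b" is stored once and ends with a reference count of 2, while "a" and "c" each have a count of 1.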

In one exemplary embodiment, the apparatus may further include a task dispatch module configured to distribute the task of uploading the corresponding data fragment to a corresponding node based on a preset dispatch rule.

Specifically, the task dispatch module is configured to: calculate, for each node, the number of tasks by which the node falls short of its completion target; if these shortfalls differ among the nodes, select the node with the largest shortfall and distribute the task of uploading the corresponding data fragment to the selected node; if the shortfalls are the same for all nodes, calculate the number of remaining dispatchable tasks of each node; and, if the numbers of remaining dispatchable tasks differ among the nodes, select the node with the fewest remaining dispatchable tasks, otherwise select the first node, and then distribute the task of uploading the corresponding data fragment to the selected node.

In an exemplary embodiment, the task dispatch module may be further configured to: execute a data distribution algorithm to calculate the distribution of all data fragments across the nodes and obtain a data distribution matrix; and traverse the distribution matrix column by column to calculate the number of data fragments on each node.
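As an illustration of the column-by-column traversal, the following sketch counts the data fragments on each node. The layout of the matrix is an assumption: each row is taken to be one data fragment, each column one node, and an entry of 1 to mean that the fragment is stored on that node. The data distribution algorithm that produces the matrix is not reproduced here, and the name `fragments_per_node` is hypothetical.

```python
def fragments_per_node(distribution):
    """Sum each column of a 0/1 distribution matrix (assumed layout)."""
    if not distribution:
        return []
    node_count = len(distribution[0])
    counts = []
    for col in range(node_count):
        # Walk down one column: every 1 is a fragment held by this node.
        counts.append(sum(row[col] for row in distribution))
    return counts


# Four fragments distributed over three nodes.
distribution = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
counts = fragments_per_node(distribution)
```

Here `counts` holds one entry per node, in column order.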

For example, the number of tasks by which each node falls short of its completion target is calculated from the number of tasks expected to be dispatched to the node and the number of tasks already dispatched to it, where the expected number is determined by the total number of nodes and the number of data fragments on each node.

For example, the number of remaining dispatchable tasks of each node is calculated from the number of times the node has participated in selection and the number of data fragments it contains.
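The selection rule described above can be sketched as follows. This is one reading of the rule, not a definitive implementation: the per-node shortfall and the number of remaining dispatchable tasks are passed in as precomputed lists because their exact formulas are not given in this passage, and ties on the largest or smallest value are resolved in favour of the lowest node index, which is an assumption. The name `select_node` is hypothetical.

```python
def select_node(shortfalls, remaining):
    """Return the index of the node that should receive the next task.

    shortfalls[i]: tasks by which node i falls short of its completion target.
    remaining[i]:  remaining dispatchable tasks of node i.
    """
    if len(set(shortfalls)) > 1:
        # Shortfalls differ: pick the node furthest from its target.
        return max(range(len(shortfalls)), key=lambda i: shortfalls[i])
    if len(set(remaining)) > 1:
        # Shortfalls equal, remaining counts differ: pick the node with the
        # fewest remaining dispatchable tasks.
        return min(range(len(remaining)), key=lambda i: remaining[i])
    # Everything equal: fall back to the first node.
    return 0


by_shortfall = select_node([2, 5, 3], [4, 4, 4])
by_remaining = select_node([2, 2, 2], [4, 1, 3])
by_default = select_node([2, 2, 2], [4, 4, 4])
```

The three calls exercise the three branches of the rule in order: largest shortfall, fewest remaining dispatchable tasks, and the first-node fallback.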

In an exemplary embodiment, the apparatus may further include a download module configured to: download a metadata object from the cloud storage end; parse the metadata in the metadata object and traverse all data fragments referenced by the metadata; and, for each of the referenced data fragments: distribute the task of downloading the corresponding data fragment to a corresponding node based on a preset dispatch rule; check whether the corresponding data fragment already exists locally; if it does not exist locally, have the assigned node download the data fragment from the cloud storage end, copy it, and update the information in a local reference count database; and if it already exists locally, have the assigned node update the information in the local reference count database.

For example, the number of tasks by which each node falls short of its completion target is calculated; if these shortfalls differ among the nodes, the node with the largest shortfall is selected and the task of downloading the corresponding data fragment is distributed to the selected node; if the shortfalls are the same for all nodes, the number of remaining dispatchable tasks of each node is calculated; and, if the numbers of remaining dispatchable tasks differ among the nodes, the node with the fewest remaining dispatchable tasks is selected, otherwise the first node is selected, and the task of downloading the corresponding data fragment is then distributed to the selected node.
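The download flow can be sketched in the same simplified model. The cloud storage end and the local store are plain dictionaries, the dispatch rule is supplied by the caller as `assign_node`, and "updating the local reference count database" is assumed to mean setting the count to 1 for a newly downloaded fragment and adding 1 for a fragment that is already present; the passage itself does not spell out that update. All names are hypothetical.

```python
def restore(metadata_key, cloud, local_fragments, local_refcounts, assign_node):
    """Restore one archive; return a log of (node, fragment_id, action)."""
    metadata = cloud[metadata_key]["data"]  # download the metadata object
    log = []
    for fragment_id in metadata["fragments"]:
        node = assign_node(fragment_id)  # dispatch rule supplied by caller
        if fragment_id not in local_fragments:
            # Not present locally: fetch the fragment and start counting it.
            local_fragments[fragment_id] = cloud[fragment_id]["data"]
            local_refcounts[fragment_id] = 1
            log.append((node, fragment_id, "downloaded"))
        else:
            # Already present locally: only the reference count changes.
            local_refcounts[fragment_id] = local_refcounts.get(fragment_id, 0) + 1
            log.append((node, fragment_id, "counted"))
    return log


cloud = {
    "meta-1": {"data": {"fragments": ["a", "b"]}},
    "a": {"data": b"alpha"},
    "b": {"data": b"beta"},
}
local_fragments = {"a": b"alpha"}
local_refcounts = {"a": 1}
log = restore("meta-1", cloud, local_fragments, local_refcounts,
              assign_node=lambda fragment_id: 0)
```

In this run fragment "a" is already local, so only its count changes, while "b" is fetched from the cloud storage end.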

In an exemplary embodiment, the apparatus further comprises a deletion module configured to: download a metadata object from the cloud storage end and remove the metadata object from the cloud storage end; parse the metadata in the metadata object and traverse all data fragments referenced by the metadata; and, for each of the referenced data fragments, subtract 1 from the reference count in the object tag of the corresponding data fragment at the cloud storage end, and remove the corresponding data object from the cloud storage end when the resulting count is 0.
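The deletion flow can be sketched as follows, again with the cloud storage end modelled as a dictionary. The names `delete_archive` and `refcount` are hypothetical, and a real object storage service would need separate calls to read and write object tags.

```python
def delete_archive(metadata_key, cloud):
    """Delete one archive and release the fragments it references."""
    # Download the metadata object and remove it from the cloud end.
    metadata = cloud.pop(metadata_key)["data"]
    for fragment_id in metadata["fragments"]:
        tag = cloud[fragment_id]["tag"]
        tag["refcount"] -= 1
        if tag["refcount"] == 0:
            # No archive references this fragment any more: remove it.
            del cloud[fragment_id]


cloud = {
    "meta-1": {"data": {"fragments": ["a", "b"]}},
    "a": {"data": b"alpha", "tag": {"refcount": 1}},
    "b": {"data": b"beta", "tag": {"refcount": 2}},
}
delete_archive("meta-1", cloud)
remaining_keys = sorted(cloud)
```

After the call, fragment "a" is gone because its count reached 0, while "b" stays with a count of 1 because another archive still references it.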

This embodiment archives the data in the local deduplication storage to the object storage and manages it there. It exploits the advantages of object storage while saving further space.

Example 7

The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may store program code for executing the cloud archive method for deduplicated data based on distributed storage described in the above embodiments.

Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A cloud archive method for deduplicated data based on distributed storage, characterized by comprising the following steps:

analyzing local metadata, traversing all referenced data fragments of the local metadata, and checking whether each data fragment in the data fragments already exists at a cloud storage end;

under the condition that no corresponding data fragment exists in the cloud storage end, uploading the corresponding data fragment serving as a data object to the cloud storage end, adding an object tag, and setting the number of times of reference in the object tag as 1;

and adding 1 to the reference times in the object tag corresponding to the corresponding data fragment under the condition that the corresponding data fragment exists at the cloud storage end.

2. The method of claim 1, wherein after traversing all referenced data fragments of the local metadata, the method further comprises: distributing the task of uploading the corresponding data fragment to the corresponding node based on a preset dispatching rule.

3. The method of claim 2, wherein distributing the task of uploading the corresponding data fragment to the corresponding node based on the preset dispatching rule comprises:

calculating the number of tasks by which each node falls short of the completion target;

if the numbers of tasks by which the nodes fall short of the completion target differ, selecting the node with the largest such number, and distributing the task of uploading the corresponding data fragment to the selected node;

if the numbers of tasks by which the nodes fall short of the completion target are the same, calculating the number of remaining dispatchable tasks of each node; and if the numbers of remaining dispatchable tasks of the nodes differ, selecting the node with the fewest remaining dispatchable tasks, otherwise selecting the first node, and then distributing the task of uploading the corresponding data fragment to the selected node.

4. The method of claim 3, wherein prior to distributing the task of uploading the corresponding data fragment to the corresponding node, the method further comprises:

executing a data distribution algorithm, and calculating the distribution condition of all data fragments on each node to obtain a data distribution matrix;

and traversing the distribution matrix according to columns, and calculating the number of data fragments on each node.

5. The method of claim 4,

calculating the number of tasks by which each node falls short of the completion target comprises: calculating that number based on the number of tasks expected to be dispatched to each node and the number of tasks already dispatched to each node, wherein the number of tasks expected to be dispatched is determined by the total number of nodes and the number of data fragments on each node; and/or

calculating the number of remaining dispatchable tasks of each node comprises: calculating that number based on the number of times each node has participated in selection and the number of data fragments the node contains.

6. The method of any one of claims 1 to 5, further comprising:

downloading a metadata object from a cloud storage end;

analyzing the metadata in the metadata object, and traversing all the data fragments referenced by the metadata;

for each of the referenced data fragments, performing the following:

distributing the task of downloading the corresponding data fragment to the corresponding node based on a preset dispatching rule;

checking whether the corresponding data fragment already exists locally;

when the corresponding data fragment does not exist locally, downloading, by the assigned node, the corresponding data fragment from the cloud storage end, copying the data fragment, and updating information in a local reference count database;

when the corresponding data fragment already exists locally, updating, by the assigned node, the information in the local reference count database.

7. The method of claim 6, wherein distributing the task of downloading the corresponding data fragment to the corresponding node based on the preset dispatching rule comprises:

if the numbers of tasks by which the nodes fall short of the completion target differ, selecting the node with the largest such number, and distributing the task of downloading the corresponding data fragment to the selected node;

if the numbers of tasks by which the nodes fall short of the completion target are the same, calculating the number of remaining dispatchable tasks of each node; and if the numbers of remaining dispatchable tasks of the nodes differ, selecting the node with the fewest remaining dispatchable tasks, otherwise selecting the first node, and then distributing the task of downloading the corresponding data fragment to the selected node.

8. The method of any one of claims 1 to 5, further comprising:

downloading a metadata object from a cloud storage end, and removing the metadata object from the cloud storage end;

and for each data fragment in all the referenced data fragments, subtracting 1 from the reference count in the object tag corresponding to the corresponding data fragment at the cloud storage end, and removing the data object corresponding to the corresponding data fragment from the cloud storage end under the condition that the current count is 0.

9. A cloud archive device for deduplication data based on distributed storage is characterized by comprising:

the analysis module is configured to analyze local metadata, traverse all referenced data fragments of the local metadata, and check whether each data fragment in the data fragments already exists at a cloud storage end;

the uploading module is configured to upload the corresponding data fragments as data objects to the cloud storage end under the condition that the corresponding data fragments do not exist in the cloud storage end, add object tags and set the number of times of reference in the object tags to be 1; and under the condition that the corresponding data fragment exists at the cloud storage end, adding 1 to the reference times in the object tag corresponding to the corresponding data fragment.

10. A cloud archive system for deduplication data based on distributed storage, comprising:

a local end comprising the cloud archive apparatus of claim 9;

and the cloud storage end is used for storing the data fragments in an object storage mode.