CN116578746A

CN116578746A - Object de-duplication method and device

Info

Publication number: CN116578746A
Application number: CN202310573503.7A
Authority: CN
Inventors: 刘易
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2023-05-19
Filing date: 2023-05-19
Publication date: 2023-08-11

Abstract

The embodiment of the application provides an object deduplication method, which comprises the following steps: acquiring a first hash value and a first object identifier of a first object; determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier; and performing a deduplication operation on the first object or a second object in the case that the first object is determined to belong to a duplicate object, wherein the second object comprises an object in the storage system that duplicates the first object. The technical scheme of the embodiment of the application determines whether a duplicate object exists through the hash value and the object identifier. When a duplicate object exists, it is cleaned up, so that the utilization of the storage resources of the storage system is optimized as much as possible and the waste of storage resources in the storage system is avoided as much as possible. At the same time, because fewer duplicate objects are stored, the time and cost of data recovery after data loss are also reduced.

Description

Object de-duplication method and device

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to an object deduplication method, an object deduplication device, computer equipment and a computer readable storage medium.

Background

With the development of computer technology, mass storage, reading and writing of various data become a current hot spot problem. In data storage, object storage is an important data storage mode, and is receiving more and more attention and application due to the advantages of high reliability, strong expandability, high access speed and the like. However, with the increasing amount of data, a large amount of duplicate data may appear in the object store, wasting space and increasing management and maintenance costs.

It should be noted that the foregoing is not necessarily prior art, and is not intended to limit the scope of the present application.

Disclosure of Invention

Embodiments of the present application provide an object deduplication method, apparatus, computer device, and computer-readable storage medium, to solve or alleviate one or more of the technical problems set forth above.

An aspect of an embodiment of the present application provides an object deduplication method, including:

acquiring a first hash value and a first object identifier of a first object;

determining whether the first object belongs to a repeated object according to the first hash value and the first object identifier;

performing a deduplication operation on the first object or the second object under the condition that the first object is determined to belong to a duplicate object; wherein the second object comprises an object that is duplicated in the storage system with the first object.

Optionally, the first object is an object to be uploaded.

Optionally, determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier includes:

retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification;

determining that the first object does not belong to a duplicate object if the second hash value is not retrieved;

and under the condition that the first object is determined not to belong to the repeated object, the first object is physically stored in the storage system.

determining that the first object belongs to a repeated object under the condition that the second hash value is retrieved and the first object identifier is the same as the second object identifier;

and returning a message indicating that the response is successful under the condition that the first object is determined to belong to the repeated object.

physically storing the first object if the second hash value is retrieved and the first object identification and the second object identification are different;

by comparing the first object and the second object, it is determined whether the first object and the second object are repeated.

Optionally, the method further comprises:

asynchronously establishing a first mapping relation, wherein the first mapping relation represents the mapping relation between the first object identifier and first metadata; wherein the first metadata comprises a physical storage address of the first object in the storage system;

asynchronously establishing a second mapping relation, wherein the second mapping relation represents the mapping relation between the first hash value and the first object identifier;

correspondingly, determining whether the first object and the second object are repeated by comparing the first object and the second object comprises:

under the condition that the first hash value and the second hash value generate hash collision and the first object identifier and the second object identifier are different, acquiring a first object associated with the first object identifier and a second object associated with the second object identifier; wherein the second hash value and the second object identifier are in a mapping relationship;

In the case that the first object and the second object are byte-by-byte identical, determining that the first object and the second object are repeated.

Optionally, in the case that the first object is determined to belong to a duplicate object, performing a deduplication operation on the first object or the second object includes:

selecting the first object or the second object as a recyclable object according to the life cycle information of the first object and the life cycle information of the second object in the case that the first object and the second object are determined to be repeated;

physically deleting the recyclable object in the storage system; and

updating information of the first metadata of the first object and/or the second metadata of the second object.

Optionally, determining whether the first object and the second object are repeated by comparing the first object and the second object further comprises:

in the event that the first object and the second object are not byte-wise identical, it is determined that the first object and the second object are not duplicated.

Optionally, acquiring the first object associated with the first object identifier and the second object associated with the second object identifier includes:

Acquiring first metadata of the first object through the first object identifier;

acquiring second metadata of the second object through the second object identifier;

wherein first metadata is used for locating the first object in the storage system, and second metadata is used for locating the second object, and the second metadata comprises a physical storage address of the second object in the storage system.

Optionally, the first object is a stored object.

and determining that the first object does not belong to a repeated object under the condition that the second hash value is not retrieved.

acquiring a first object associated with the first object identifier and a second object associated with the second object identifier under the condition that the second hash value is retrieved and the first object identifier and the second object identifier are different;

Another aspect of an embodiment of the present application provides an object deduplication apparatus, the apparatus including:

the acquisition module is used for acquiring a first hash value of the first object and a first object identifier;

The determining module is used for determining whether the first object belongs to a repeated object according to the first hash value and the first object identifier;

the de-duplication module is used for executing de-duplication operation on the first object or the second object under the condition that the first object is determined to belong to a repeated object; wherein the second object comprises an object that is duplicated in the storage system with the first object.

Another aspect of an embodiment of the present application provides a computer apparatus, including:

at least one processor; and

a memory communicatively coupled to the at least one processor;

wherein: the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

Another aspect of embodiments of the present application provides a computer-readable storage medium having stored therein computer instructions which, when executed by a processor, implement a method as described above.

The embodiment of the application adopts the above technical scheme and can have the following advantages: whether a duplicate object exists may be determined by the hash value and the object identifier. When a duplicate object exists, it is cleaned up, so that the utilization of the storage resources of the storage system is optimized as much as possible and the waste of storage resources in the storage system is avoided as much as possible. At the same time, because fewer duplicate objects are stored, the time and cost of data recovery after data loss are also reduced.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 schematically illustrates a diagram of an operating environment for an object deduplication method according to a first embodiment of the present application;

FIG. 2 schematically illustrates a flow chart of an object deduplication method according to a first embodiment of the present application;

FIG. 3 schematically illustrates a processing architecture in an exemplary application;

FIG. 4 schematically illustrates a process flow in an exemplary application;

FIG. 5 schematically shows a block diagram of an object deduplication apparatus according to a second embodiment of the present application; and

FIG. 6 schematically shows a hardware architecture diagram of a computer device according to a third embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It should be noted that the descriptions of "first," "second," etc. in the embodiments of the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and not to be within the scope of protection claimed in the present application.

In the description of the present application, it should be understood that the numerical references before the steps do not identify the order in which the steps are performed, but are merely used to facilitate description of the present application and to distinguish between each step, and thus should not be construed as limiting the present application.

First, a term explanation is provided in relation to the present application:

Object storage: a computer data storage architecture that manages data as objects, each of which typically includes the data itself, a variable amount of metadata, and a globally unique identifier (object ID).

Storage object: the most basic concept in cloud storage, referring to a file or data uploaded by a user to cloud storage. Each object has a unique identifier (Object ID) through which the object can be accessed and manipulated.

Hash collision: two or more key values are calculated by the hash function to the same index position, thereby creating a collision situation.

Physical deletion: is to delete data from the storage medium.

MD5: in full, Message-Digest Algorithm 5, a commonly used hash function that compresses messages of arbitrary length into a 128-bit message digest. MD5 may be used in the fields of data integrity verification, cryptographic encryption, etc.

Bucket: a container for storing objects. Each bucket has a unique name, and storage objects in the bucket can be created, deleted, and managed. A bucket may be viewed as a top-level directory or container that contains a plurality of objects.

ObjectName: the key name of an object, used to reference a particular object in a Bucket. The ObjectName is unique within a Bucket.

ObjectKey: the unique identifier of an object in the object storage system, equal to Bucket/ObjectName.

Index: a data structure capable of accelerating data search. An index data node holds the storage address of the actual file. Because the index is constructed according to specific rules and algorithms, the node corresponding to the data can be found quickly by following the rules of the index, thereby achieving fast data lookup.

Hash Index: an index implemented on the basis of a hash table, valid only for queries that exactly match all columns of the index. For each row of data, the storage engine computes a hash code over all index columns; the hash code is a small value, and the hash codes computed for rows with different key values are different. The hash index stores all hash codes in the index while maintaining, in the hash table, a pointer to each data row.

Next, in order to facilitate understanding of the technical solutions provided by the embodiments of the present application by those skilled in the art, the following description is made on related technologies:

Object storage technology is a distributed storage technology for storing and managing large-scale data objects; it meets the needs of applications that handle large-scale data by providing a highly scalable, highly reliable data storage solution. The presence of duplicate data in an object storage system wastes storage space and can also increase the time and cost of data recovery. Specifically, the inventors found that: (1) for repeatedly uploaded data, the object storage system has no explicit deduplication method, so duplicate data is allocated new storage space, which wastes resources; (2) there is no explicit method for eliminating duplicate data that has already been written to the storage system.

Therefore, the embodiment of the application provides a technical scheme for eliminating the repeated data in the object storage system. See in particular below.

Finally, for ease of understanding, an exemplary operating environment is provided below.

As shown in fig. 1, the environment schematic includes a storage system 2, a network 4, and a client 6, wherein:

storage system 2, as a storage platform, may be comprised of a plurality of computing devices. The plurality of computing devices may include virtualized computing instances. Virtualized computing instances may include virtual machines such as emulation of computer systems, operating systems, servers, and the like. The computing device may load the virtual machine based on a virtual image and/or other data defining particular software (e.g., operating system, dedicated application, server) for emulation. As the demand for different types of processing services changes, different virtual machines may be loaded and/or terminated on one or more computing devices. A hypervisor may be implemented to manage the use of different virtual machines on the same computing device. The storage system 2 may include different storage functions such as storing objects, storing metadata, etc.

In some embodiments, the storage system 2 is comprised of a plurality of storage nodes forming a distributed architecture. Distributed storage systems may employ various networking approaches, with each storage node being used to provide computing and storage services. The number of storage nodes can be configured according to actual requirements. The storage nodes may be magnetic disks or other non-volatile storage media. For example, the storage unit may be a single disk or a disk array formed by a plurality of disks.

In some embodiments, the storage system 2 may have multiple storage forms, where the object store is one of the storage forms. Object storage consists in storing data as objects, each of which can be bound to a unique identifier. This form of storage can be made with high scalability and availability and support large data sets. The distributed storage system may be managed using various architectures such as Amazon S3 (object storage service provided by Amazon corporation).

The storage system 2 may provide storage, reading, writing, querying, deleting, etc. services.

The storage system 2 may be configured to communicate with clients 6 and the like through a network 4.

The client 6 may be provided with a user access page for enabling manipulation of the storage system 2 or uploading objects etc.

The following describes the technical solution of the present application through a plurality of embodiments by taking the storage system 2 as an execution body. It should be understood that these embodiments may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.

Example 1

Fig. 2 schematically shows a flow chart of an object deduplication method according to a first embodiment of the present application.

As shown in fig. 2, the object deduplication method may include steps S200 to S204, in which:

Step S200: A first hash value and a first object identifier of a first object are obtained.

First hash value: the file or data of the first object may be mapped by a hash function to an output of a preset length (the first hash value). The input and output may be any binary data.

First object identifier: may be the unique identifier (ObjectKey) of the first object; the ObjectName part of the identifier is unique within a Bucket.

In this embodiment, the first object may be a stored object in the storage system 2, or may be an object to be uploaded.

In this embodiment, the first object may be a file, a picture, audio, or other data of a different type.

Taking the case where the first object is a file as an example, the first hash value and the first identifier may be obtained in the following manner: the file is accessed and its file name is obtained as the first identifier; the file is hashed using the MD5 algorithm to obtain the first hash value; and the first hash value is mapped to the first identifier. In some embodiments, other hash algorithms may also be used to generate the first hash value, such as SHA-1, SHA-256, and the like.
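As a minimal sketch of step S200, the following Python example computes a hash value for an object's data and builds an object identifier. The helper names `compute_hash` and `object_key`, the chunked reading, and the sample bucket and file names are assumptions made for illustration; only the use of MD5 (or SHA-1, SHA-256) and the Bucket/ObjectName form of the ObjectKey come from the description.

```python
import hashlib
import io


def compute_hash(stream, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a binary stream, reading it in chunks.

    `algorithm` may be any name accepted by hashlib.new, e.g. "md5",
    "sha1" or "sha256".
    """
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def object_key(bucket, object_name):
    """Build an ObjectKey of the form Bucket/ObjectName (hypothetical helper)."""
    return f"{bucket}/{object_name}"


# An in-memory stream stands in for the file of the first object.
first_hash = compute_hash(io.BytesIO(b"hello world"))
first_id = object_key("videos", "intro.mp4")
```

Reading in chunks keeps memory use bounded for large objects; the same function can be pointed at a file opened in binary mode.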

Step S202: Whether the first object belongs to a duplicate object is determined according to the first hash value and the first object identifier.

The storage system 2 stores a large number of objects, and a large number of repeated objects may exist in the large number of objects, so that storage resource waste is caused.

The first object, whether it is a stored object or an object to be uploaded, should avoid the above-mentioned repetitive storage behavior as much as possible.

For this purpose, it is possible to compare the first hash value of the first object with hash values of other objects in the storage system 2, and to preliminarily determine whether the first object is duplicated with other objects of the storage system 2 by the comparison result. It may be further determined by the first object identification whether the first object belongs to a duplicate object.

Specifically, the storage system 2 stores a plurality of objects, each having a unique object identifier and a hash value. To determine whether the first object is a duplicate object, the first hash value of the first object is obtained first and then compared, one by one, with the hash values of the other objects stored in the storage system 2. If the first hash value differs from the hash values of all other objects, it is determined that the first object does not duplicate any other object, i.e. the first object is not a duplicate object. If there is an object whose hash value is the same as the first hash value, their object identifiers are then compared. If the object identifiers are also the same, the first object is determined to be a duplicate object. If there is an object whose hash value is the same as the first hash value but whose object identifier differs from the first object identifier, a further determination is needed.

The result of the comprehensive comparison is as follows:

First kind: no other object has a hash value identical to the first hash value, and the first object is judged not to belong to a duplicate object;

Second kind: another object has a hash value identical to the first hash value and the object identifiers are also the same, and the first object is judged to belong to a duplicate object;

Third kind: another object has a hash value identical to the first hash value, but the two object identifiers are different, and a further determination is needed.
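The three outcomes above can be sketched as a small classification function. This is an illustration under stated assumptions rather than the claimed implementation: the `Verdict` names, the function `classify`, and the representation of the hash index as a dictionary from hash value to a set of object identifiers are all hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    NOT_DUPLICATE = "not_duplicate"        # first kind
    DUPLICATE = "duplicate"                # second kind
    NEEDS_COMPARISON = "needs_comparison"  # third kind


def classify(first_hash, first_id, hash_index):
    """Classify the first object against the hash values of other objects.

    `hash_index` maps a hash value to the set of object identifiers
    already associated with that hash value (an assumed layout).
    """
    ids = hash_index.get(first_hash)
    if not ids:
        # No identical hash value was retrieved.
        return Verdict.NOT_DUPLICATE
    if first_id in ids:
        # Same hash value and same object identifier.
        return Verdict.DUPLICATE
    # Same hash value, different identifier: compare the objects themselves.
    return Verdict.NEEDS_COMPARISON


index = {"h1": {"bucket/a.txt"}}
verdicts = [
    classify("h2", "bucket/b.txt", index),
    classify("h1", "bucket/a.txt", index),
    classify("h1", "bucket/c.txt", index),
]
```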

Step S204: A deduplication operation is performed on the first object or the second object in the case that the first object is determined to belong to a duplicate object; wherein the second object comprises an object in the storage system that duplicates the first object.

If the first object is an object to be uploaded into the storage system 2, storing the first object is refused.

If the first object is a stored object in the storage system 2, the first object or the second object is physically deleted, leaving behind one.

According to the object deduplication method provided by this embodiment, whether a duplicate object exists can be determined through the hash value and the object identifier. When a duplicate object exists, it is cleaned up, so that the utilization of the storage resources of the storage system 2 is optimized as much as possible and the waste of storage resources in the storage system 2 is avoided as much as possible. At the same time, because fewer duplicate objects are stored, the time and cost of data recovery after data loss are also reduced.

As described above, the first object may be a stored object in the storage system 2 or an object to be uploaded. Different object types may correspond to different coping strategies. The operation of these two object types is further described below.

In an alternative embodiment, the first object is an object to be uploaded. When the first object is a repeated object, the first object can be refused to be stored, so that a new storage space does not need to be applied, and the waste of bandwidth and storage resources is reduced.

As an object to be uploaded, the first object may encounter a number of situations:

Case (one): there is no hash value in the storage system 2 that is the same as the first hash value of the first object;

Case (two): the storage system 2 has a hash value (second hash value) that is the same as the first hash value.

Case (two) may include case (1) and case (2):

Case (1): the first object identifier corresponding to the first hash value is the same as the second object identifier corresponding to the second hash value;

Case (2): the first object identifier corresponding to the first hash value is different from the second object identifier corresponding to the second hash value.

Various cases will be described below.

Case (one)

In an alternative embodiment, the step S202 "determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier" may include: retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification; determining that the first object does not belong to a duplicate object if the second hash value is not retrieved; and under the condition that the first object is determined not to belong to the repeated object, the first object is physically stored in the storage system.

The hash function has the properties of collision resistance (collision resistance) and irreversibility (one-way).

Different objects, even if slightly different, produce different hash values. If the second hash value is not retrieved, it is indicated that there is no object in the storage system 2 that is duplicated with the first object, and therefore the uploading of the first object is accepted and stored. In this optional embodiment, it may be effectively ensured that the uploaded object is new data.

Case (1) of case (two)

In an alternative embodiment, the step S202 "determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier" may include: retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification; determining that the first object belongs to a repeated object under the condition that the second hash value is retrieved and the first object identifier is the same as the second object identifier; wherein storing the first object is refused and a message indicating that the response is successful is returned in case it is determined that the first object belongs to a duplicate object.

Taking MD5 as the hash value as an example, the probability of an MD5 hash collision is 1/(2^128). Identical object identifiers indicate the same uploading party, and the possibility of an MD5 hash collision among the objects to be uploaded by the same uploading party can be ignored.

In the above embodiment, the hash values and the object identifications of the first object and the second object are the same, which indicates that the first object and the second object are duplicate objects. The storage system 2 may refuse to store the first object reducing the waste of bandwidth and storage resources. Since there is a second object in the storage system 2 that is identical to the first object, although the first object is refused to be stored, the upload request of the first object is successful in response, and thus a message indicating that the response is successful is returned.

Case (2) of case (two)

In an alternative embodiment, the step S202 "determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier" may include: retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification; physically storing the first object if the second hash value is retrieved and the first object identification and the second object identification are different; by comparing the first object and the second object, it is determined whether the first object and the second object are repeated.

In some cases, a hash collision does not represent a complete agreement of data. Therefore, when the hash values of the first object and the second object are the same but the object identifications are different, the uploading of the first object is accepted and stored first, and the uploading efficiency is ensured. Then, after the first object is stored in the storage system 2, whether the first object and the second object are identical is further compared, and whether to clean the object is determined according to the comparison result.

In order not to affect the uploading of the first object, the comparison of the first object and the second object may be performed asynchronously.
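The upload-time handling of case (one) and of cases (1) and (2) of case (two) can be sketched as follows. This is a simplified illustration: the class `UploadHandler`, its in-memory dictionaries, the returned status dictionary, and the use of a queue to hand pairs over to a later asynchronous comparison are assumptions, and the hash index is updated inline here for brevity, whereas the description establishes the mappings asynchronously.

```python
import hashlib
import queue


class UploadHandler:
    def __init__(self):
        self.stored = {}                 # object id -> bytes (stand-in for physical storage)
        self.hash_index = {}             # hash value -> object id
        self.to_compare = queue.Queue()  # (first_id, second_id) pairs awaiting comparison

    def upload(self, first_id, data):
        first_hash = hashlib.md5(data).hexdigest()
        second_id = self.hash_index.get(first_hash)
        if second_id == first_id:
            # Case (1) of case (two): same hash value and same identifier.
            # Storage is refused, but the request is answered as successful.
            return {"status": "success", "stored": False}
        # Case (one) and case (2) of case (two): accept and store the upload.
        self.stored[first_id] = data
        if second_id is None:
            # Case (one): no identical hash value exists yet.
            self.hash_index[first_hash] = first_id
        else:
            # Case (2) of case (two): defer the comparison so that it does
            # not slow down the upload.
            self.to_compare.put((first_id, second_id))
        return {"status": "success", "stored": True}


handler = UploadHandler()
results = [
    handler.upload("bucket/a.txt", b"same bytes"),
    handler.upload("bucket/a.txt", b"same bytes"),
    handler.upload("bucket/b.txt", b"same bytes"),
]
```

A background worker could then drain `to_compare` and decide, pair by pair, whether one of the two objects can be cleaned up.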

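In an alternative embodiment, after physically storing the first object, the method may further include:

asynchronously establishing a first mapping relation, wherein the first mapping relation represents the mapping relation between the first object identifier and first metadata, and the first metadata comprises a physical storage address of the first object in the storage system;

asynchronously establishing a second mapping relation, wherein the second mapping relation represents the mapping relation between the first hash value and the first object identifier. It should be noted that the second mapping relation may also be referred to as a hash index.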

For example, a large number of objects are stored in the storage system 2. Metadata is associated with each object. The metadata of an object may include various information of the object such as name, size, owner, creation time, expiration time, physical storage address. In order to facilitate management and improve management efficiency, two layers of mapping relationships are stored. The two-layer mapping relationship is as follows:

(1) A first layer mapping relationship between an object identification of the object and metadata of the object;

(2) A second layer mapping relationship (hash index) between the object identification of the object and the hash value of the object.

The relationship data of the two-layer mapping relationship may be stored in the metadata server or may be stored in another place.
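The two-layer mapping relationship can be pictured with the following sketch. The class names `ObjectMetadata` and `MappingStore`, the field names, and the choice of in-memory dictionaries are assumptions for illustration; the metadata fields mirror the examples listed above (name, size, owner, creation time, expiration time, physical storage address).

```python
from dataclasses import dataclass


@dataclass
class ObjectMetadata:
    name: str
    size: int
    owner: str
    created_at: float
    expires_at: float
    physical_address: str


class MappingStore:
    def __init__(self):
        # First layer: object identifier -> metadata of the object.
        self.metadata_by_id = {}
        # Second layer (hash index): hash value -> object identifiers.
        self.ids_by_hash = {}

    def register(self, object_id, hash_value, metadata):
        """Record both mapping relationships for one object."""
        self.metadata_by_id[object_id] = metadata
        self.ids_by_hash.setdefault(hash_value, set()).add(object_id)

    def candidates(self, object_id, hash_value):
        """Return other object identifiers that share the same hash value."""
        return self.ids_by_hash.get(hash_value, set()) - {object_id}


store = MappingStore()
store.register("bucket/a.txt", "h1", ObjectMetadata("a.txt", 10, "u1", 1.0, 9.0, "disk0/0001"))
store.register("bucket/b.txt", "h1", ObjectMetadata("b.txt", 10, "u2", 2.0, 9.0, "disk0/0002"))
same_hash = store.candidates("bucket/b.txt", "h1")
```

Keeping the hash index separate from the metadata lets a lookup by hash value find candidate duplicates without touching the objects themselves.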

Therefore, after the first object is physically stored, the two-layer mapping relationship is also established and stored for the first object. Then, it is determined whether the first object belongs to a duplicate object based on contents in the two-layer mapping relationships (first mapping relationship and second mapping relationship). Specifically, if the first hash value of the first object is the same as the hash value of a certain object in the storage system 2 (the second hash value of the second object), but the first object identifier of the first object is different from the second object identifier of the second object, the first object and the second object are further compared to determine whether to clean the object according to the comparison result. And if the first object and the second object are the same, physically deleting the first object or the second object. It should be noted that, the establishment of the mapping relationship of the first object, and the determination and deletion of whether the first object belongs to the duplicate object are all performed asynchronously. Thus, the repetitive object can be effectively removed while the uploading efficiency is ensured.

In an alternative embodiment, step S204 "in the case that it is determined that the first object belongs to a duplicate object, performing a deduplication operation on the first object or the second object" includes:

The whole process from creation to deletion of an object is called the lifecycle of the object. To save space, objects whose lifecycle has ended (expired) need to be deleted. When two objects are duplicates, the object whose lifecycle ends earlier may be deleted as the recyclable object. When two objects are duplicates and their lifecycle end times are the same, the object with the later upload time is selected as the recyclable object. In addition, the metadata of the recyclable object is updated. For example, where object A and object B are duplicates and object A is physically deleted, the metadata of object A may be updated, for example by changing the physical storage address in the metadata of object A to the physical storage address of object B, so that object B can be accessed through the metadata of object A. In this embodiment, which duplicate object to delete is determined based on lifecycle information, further optimizing the deletion of duplicate data.
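The selection rule and the metadata update described above can be sketched as follows. This is an illustration under assumptions: the names `Meta`, `pick_recyclable` and `redirect` are hypothetical, and the behavior when both the lifecycle end time and the upload time are equal is not specified in the text, so the sketch simply picks the second argument in that case.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Meta:
    object_id: str
    expiration_time: float   # lifecycle end time
    upload_time: float
    physical_address: str


def pick_recyclable(a: Meta, b: Meta) -> Tuple[Meta, Meta]:
    """Return (recyclable, kept) for two objects already known to be duplicates."""
    if a.expiration_time != b.expiration_time:
        # The object whose lifecycle ends earlier is recycled.
        return (a, b) if a.expiration_time < b.expiration_time else (b, a)
    # Same end time: the object uploaded later is recycled.
    # Assumption: on a full tie, b is recycled (not specified in the text).
    return (a, b) if a.upload_time > b.upload_time else (b, a)


def redirect(recyclable: Meta, kept: Meta) -> str:
    """Point the recyclable object's metadata at the kept copy.

    Returns the physical address that can now be released.
    """
    freed = recyclable.physical_address
    recyclable.physical_address = kept.physical_address
    return freed
```

After `redirect`, a read through the recyclable object's identifier resolves to the kept copy, which is why the physical data at the returned address can be released.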

In an alternative embodiment, determining whether the first object and the second object are repeated by comparing the first object and the second object further comprises:

In the above alternative embodiment, the first object and the second object are not identical byte by byte, indicating that the first object and the second object are not duplicate objects. Thus, the first object may be stored in the storage system 2 to enable object storage.

In the storage system 2, different objects correspond to different physical storage addresses.

Wherein "obtaining the first object associated with the first object identification and the second object associated with the second object identification" may include:

When the hash values are the same but the object identifiers are different, the respective physical storage addresses are looked up through the respective object identifiers, and the first object and the second object in the storage system 2 are then read through those physical storage addresses. By first detecting a hash collision through the hash index and then looking up and reading the objects based on their object identifiers, duplicate objects can be determined efficiently and then cleaned up, saving the storage resources that duplicate data would otherwise consume.
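The lookup path just described (identifier, then physical storage address, then object data) might look like the sketch below. The function name `read_pair` and the two dictionaries standing in for the metadata lookup and the storage nodes are assumptions for illustration only.

```python
from typing import Dict, Tuple


def read_pair(
    first_id: str,
    second_id: str,
    address_by_id: Dict[str, str],
    blob_by_address: Dict[str, bytes],
) -> Tuple[bytes, bytes]:
    """Resolve each identifier to its physical address, then read the object data."""
    first_addr = address_by_id[first_id]     # address from the first object's metadata
    second_addr = address_by_id[second_id]   # address from the second object's metadata
    return blob_by_address[first_addr], blob_by_address[second_addr]
```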

In an alternative embodiment, the first object is a stored object. When the first object is a duplicate object, a deduplication operation is performed to reduce the storage resource cost caused by duplicate data in the storage system 2 and to reduce the time and cost of data recovery.

As a stored object, the first object may encounter a number of situations:

Case (two) may include case (1) and case (2):

Various cases will be described below.

Case (one)

In an alternative embodiment, the step S202 "determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier" may include: retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification; and determining that the first object does not belong to a duplicate object in the case that the second hash value is not retrieved. Identical objects always produce identical hash values, while different objects, even if only slightly different, generally produce different hash values. Therefore, if the second hash value is not retrieved, there is no object in the storage system 2 that duplicates the first object. In the present embodiment, a lookup in the hash index efficiently determines that the first object does not belong to a duplicate object.

Case (1) of case (two)

In an alternative embodiment, the step S202 "determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier" may include: retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification; and under the condition that the second hash value is retrieved and the first object identifier is the same as the second object identifier, determining that the first object belongs to a repeated object.

In the above embodiment, the hash values and the object identifications of the first object and the second object are the same, which indicates that the first object and the second object are duplicate objects. Therefore, the storage system 2 can execute the deduplication operation, and the waste of storage resources is reduced.

Case (2) of case (two)

In an alternative embodiment, the step S202 "determining whether the first object belongs to a duplicate object according to the first hash value and the first object identifier" may include: retrieving a second hash value identical to the first hash value, the second hash value mapping a second object identification; acquiring a first object associated with the first object identifier and a second object associated with the second object identifier under the condition that the second hash value is retrieved and the first object identifier and the second object identifier are different; by comparing the first object and the second object, it is determined whether the first object and the second object are repeated. In this embodiment, in the case where the hash values of the first object and the second object are the same but the object identifications are different, whether the first object and the second object are the same is further compared, and whether to clean the objects is determined according to the comparison result.

In the above alternative embodiment, the first object and the second object are not identical byte by byte, indicating that the first object and the second object are not duplicate objects. Thus, the first object may be stored further in the storage system 2 to enable object storage.
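The cases described above can be summarized in one decision function. This is a sketch under assumptions: `Verdict`, `classify`, and the `read_object` callback are hypothetical names, the hash index is modeled as a plain dictionary from hash value to a single object identifier, and the index is assumed to be consulted before the first object's own entry is written to it.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Verdict(Enum):
    NOT_DUPLICATE = "not_duplicate"
    DUPLICATE = "duplicate"


def classify(
    first_hash: str,
    first_id: str,
    id_by_hash: Dict[str, str],
    read_object: Callable[[str], bytes],
) -> Verdict:
    """Decide whether the first object is a duplicate, following the cases above."""
    second_id: Optional[str] = id_by_hash.get(first_hash)
    if second_id is None:
        # No identical hash value retrieved: not a duplicate.
        return Verdict.NOT_DUPLICATE
    if second_id == first_id:
        # Same hash value and same object identifier: duplicate.
        return Verdict.DUPLICATE
    # Same hash value, different identifiers: compare the object data.
    # Equality on bytes compares the two objects byte by byte.
    if read_object(first_id) == read_object(second_id):
        return Verdict.DUPLICATE
    return Verdict.NOT_DUPLICATE
```

The byte comparison is only reached in the third branch, so the comparatively expensive read of both objects is paid only when the hash index already reports a collision.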

To make the application easier to understand, an exemplary application is provided below in connection with fig. 3 and 4.

In this exemplary application, for data storage, two layers of mappings may be set, one layer of mappings being (bucket/objectName, meta) and the other layer of mappings being (MD5, bucket/objectName).

When the object object_1 and its metadata meta_1 are uploaded through the S3 gateway, if the metadata of object_1 already exists (judged according to the bucket/objectName) and the MD5 value is the same, the other information in the metadata is updated and the result is returned directly.

If the MD5 values are the same but the object identifiers (bucket/objectName) are different, the object_1 is uploaded (S400) and the following operations are performed:

S402: the computing node reads the MD5 value of object_1.

S404: the computing node asynchronously creates a hash index (i.e., the second layer of mappings) with the MD5 value of object_1 as the key and the object key of object_1 as the value.

S406: the computing node determines whether a hash collision has occurred.

If no hash collision occurs, the asynchronous flow ends.

If the hash index already exists, it indicates that there is an object_2 in the storage system whose MD5 value is the same as that of object_1, and the process proceeds to S408.

S408: the computing node reads, according to the established hash index, the object_2_name to which the MD5 value of the colliding object object_2 points.

S410: the compute node obtains object_2 metadata meta_2 from object_2_name.

S412: the computing node, in the producer role, sends the metadata of the two objects [meta_1, meta_2] to the message queue.

S414: the offline task reads the messages in the message queue in the consumer role.

S416: the offline task reads the data of both objects from the storage node.

S418: the offline task performs byte-by-byte comparison on the data of the two objects.

S420: the offline task determines whether the data of the two objects are identical.

If not, the asynchronous flow ends; if so, S422 is entered.

S422: the offline task determines (screens out) the recyclable object based on the expiration time and creation time of both objects.

The screening logic is as follows: the object with the earlier expiration time is selected; if the expiration times are the same, the object with the later creation time is selected.

S424: the offline task updates the physical storage location information in the metadata of the recyclable object so that it points to the physical storage location of the object it collided with.

S426: the storage node releases the physical storage resources of the recyclable object.
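The flow S400 to S426 can be walked through end to end with a small single-process sketch. Everything here is an assumption made for illustration: dictionaries stand in for the storage node and the metadata server, `queue.Queue` stands in for the message queue, the function names are hypothetical, and the indexing step runs synchronously rather than asynchronously. The tie-break on equal expiration times (later creation time is released) follows the lifecycle rule given earlier in the description.

```python
import hashlib
import queue
from typing import Dict

storage: Dict[str, bytes] = {}      # physical location -> data (storage node stand-in)
metadata: Dict[str, dict] = {}      # bucket/objectName -> metadata (first layer)
hash_index: Dict[str, str] = {}     # MD5 -> bucket/objectName (second layer)
messages: queue.Queue = queue.Queue()


def upload(key: str, data: bytes, created: float, expires: float) -> None:
    """S400: store the object data and its metadata."""
    storage[key] = data
    metadata[key] = {"key": key, "location": key, "created": created, "expires": expires}


def index_object(key: str) -> None:
    """S402-S412: build the hash index and enqueue a colliding pair, if any."""
    md5 = hashlib.md5(storage[metadata[key]["location"]]).hexdigest()  # S402
    other = hash_index.setdefault(md5, key)                             # S404, S406
    if other != key:
        # S408-S412: look up the colliding object's metadata and publish both.
        messages.put((metadata[key], metadata[other]))


def _rank(meta: dict) -> tuple:
    # Earlier expiration sorts first; on a tie, later creation sorts first.
    return (meta["expires"], -meta["created"])


def consume_one() -> bool:
    """S414-S426: process one queued pair; return True if a copy was released."""
    meta_1, meta_2 = messages.get()                                     # S414
    loc_1, loc_2 = meta_1["location"], meta_2["location"]
    if loc_1 == loc_2 or storage[loc_1] != storage[loc_2]:              # S416-S420
        return False
    recyclable, kept = sorted((meta_1, meta_2), key=_rank)              # S422
    old_location = recyclable["location"]
    recyclable["location"] = kept["location"]                           # S424
    del storage[old_location]                                           # S426
    return True
```

For example, uploading the same bytes under `bkt/a` and `bkt/b`, calling `index_object` on each, and then calling `consume_one()` leaves a single physical copy, with the metadata of the released object pointing at the location of the kept one.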

Example two

Fig. 5 schematically shows a block diagram of an object deduplication apparatus according to a second embodiment of the present application. The apparatus may be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the embodiment of the present application. A program module in the embodiments of the present application refers to a series of computer program instruction segments capable of performing a specified function, and the following description describes the function of each program module in detail. As shown in fig. 5, the object deduplication apparatus 500 may include: an acquisition module 510, a determination module 520, a deduplication module 530, wherein:

an acquisition module 510, configured to acquire a first hash value of a first object and a first object identifier;

a determination module 520, configured to determine whether the first object belongs to a duplicate object according to the first hash value and the first object identifier;

a deduplication module 530, configured to perform a deduplication operation on the first object or the second object if it is determined that the first object belongs to a duplicate object; wherein the second object comprises an object that is duplicated in the storage system with the first object.

In an alternative embodiment, the first object is an object to be uploaded.

In an alternative embodiment, the determining module 520 is further configured to:

In an alternative embodiment, the apparatus further comprises an asynchronous module for:

correspondingly, the determining module 520 is further configured to:

In an alternative embodiment, the deduplication module 530 is further configured to:

In an alternative embodiment, the first object is a stored object.

Example III

Fig. 6 schematically shows a hardware architecture diagram of a computer device 10000 adapted to implement an object deduplication method according to a third embodiment of the present application. In some embodiments, the computer device 10000 can be a rack server, a blade server, a tower server, or a cabinet server (including a stand-alone server, or a server cluster composed of multiple servers), or the like. As shown in fig. 6, the computer device 10000 includes, but is not limited to, a memory 10010, a processor 10020, and a network interface 10030, which may be communicatively linked to each other via a system bus. Wherein:

The memory 10010 includes at least one type of computer-readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 10010 may be an internal storage module of the computer device 10000, such as a hard disk or memory of the computer device 10000. In other embodiments, the memory 10010 may also be an external storage device of the computer device 10000, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), or the like, provided on the computer device 10000. Of course, the memory 10010 may also include both an internal storage module of the computer device 10000 and an external storage device thereof. In this embodiment, the memory 10010 is typically used for storing an operating system installed on the computer device 10000 and various application software, such as program code of the object deduplication method. In addition, the memory 10010 may be used to temporarily store various types of data that have been output or are to be output.

The processor 10020 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other chip in some embodiments. The processor 10020 is typically configured to control overall operation of the computer device 10000, such as performing control and processing related to data interaction or communication with the computer device 10000. In this embodiment, the processor 10020 is configured to execute program codes or process data stored in the memory 10010.

The network interface 10030 may comprise a wireless network interface or a wired network interface, and is typically used to establish a communication link between the computer device 10000 and other computer devices. For example, the network interface 10030 is used to connect the computer device 10000 to an external terminal through a network, and to establish a data transmission channel and a communication link between the computer device 10000 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It should be noted that fig. 6 only shows a computer device having components 10010-10030, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead.

In this embodiment, the object deduplication method stored in the memory 10010 may be further divided into one or more program modules and executed by one or more processors (such as the processor 10020) to complete the embodiment of the present application.

Example IV

The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the object deduplication method in the embodiment.

In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer-readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device. Of course, the computer-readable storage medium may also include both an internal storage unit of a computer device and an external storage device thereof. In this embodiment, the computer-readable storage medium is typically used to store an operating system and various types of application software installed on a computer device, such as program code of the object deduplication method in the embodiment. Furthermore, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.

It will be apparent to those skilled in the art that the modules or steps of the embodiments of the application described above may be implemented by a general-purpose computer device. They may be concentrated on a single computer device or distributed over a network of multiple computer devices. Alternatively, they may be implemented in program code executable by a computer device, so that they may be stored in a storage device and executed by the computer device. In some cases, the steps shown or described may be performed in an order different from that shown or described herein; alternatively, the modules or steps may be made into individual integrated circuit modules, or a plurality of the modules or steps may be made into a single integrated circuit module. Thus, embodiments of the application are not limited to any specific combination of hardware and software.

It should be noted that the foregoing is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application, and all equivalent structures or equivalent processes using the descriptions of the present application and the accompanying drawings, or direct or indirect application in other related technical fields, are included in the scope of the present application.

Claims

1. A method of object deduplication, the method comprising:

acquiring a first hash value and a first object identifier of a first object;

2. The method of claim 1, wherein the first object is an object to be uploaded.

3. The method of claim 2, wherein determining whether the first object belongs to a duplicate object based on the first hash value and the first object identification comprises:

4. The method of claim 2, wherein determining whether the first object belongs to a duplicate object based on the first hash value and the first object identification comprises:

5. The method of claim 2, wherein determining whether the first object belongs to a duplicate object based on the first hash value and the first object identification comprises:

6. The method of claim 5, wherein the method further comprises:

7. The method of claim 6, wherein performing a deduplication operation on the first object or the second object if it is determined that the first object belongs to a duplicate object, comprises:

8. The method of claim 6, wherein determining whether the first object and the second object are repeated by comparing the first object and the second object, further comprises:

9. The method of claim 6, 7 or 8, wherein obtaining the first object associated with the first object identification and the second object associated with the second object identification comprises:

10. The method of claim 1, wherein the first object is a stored object.

11. The method of claim 10, wherein determining whether the first object belongs to a duplicate object based on the first hash value and the first object identification comprises:

12. The method of claim 10, wherein determining whether the first object belongs to a duplicate object based on the first hash value and the first object identification comprises:

13. The method of claim 12, wherein performing a deduplication operation on the first object or the second object if it is determined that the first object belongs to a duplicate object, comprises:

14. The method of claim 12, wherein determining whether the first object and the second object are repeated by comparing the first object and the second object, further comprises:

15. An object deduplication apparatus, the apparatus comprising:

16. A computer device, comprising:

A memory communicatively coupled to the at least one processor; wherein:

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 14.

17. A computer readable storage medium having stored therein computer instructions which when executed by a processor implement the method of any one of claims 1 to 14.