WO2022148538A1 - Method and system for managing data deduplication

Info

Publication number: WO2022148538A1
Application number: PCT/EP2021/050156
Authority: WO
Inventors: Assaf Natanzon; Aviv Kuvent; Yaron MOR; Asaf Yeger
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2021-01-07
Filing date: 2021-01-07
Publication date: 2022-07-14

Abstract

A method for managing data deduplication in a data deduplication system comprises: receiving, by the data deduplication system, a copy of first data stored in a computer storage system in communication with the data deduplication system, determining whether the first data matches further data stored in the computer storage system, using an index stored in the data deduplication system in which data stored in the computer storage system is associated with a location of the data in the computer storage system, and in response to a determination that the first data matches further data stored in the computer storage system, identifying, by the data deduplication system using the index, a location of the further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the further data in the computer storage system.

Description

METHOD AND SYSTEM FOR MANAGING DATA DEDUPLICATION

Field of the Disclosure

The present disclosure relates to a method and system for data deduplication.

Background of the Disclosure

Data deduplication is a computer storage saving technique intended to eliminate duplicate data in a computer storage system. Deduplication ensures that only one instance of data is stored in the computer storage system. Subsequent copies of the data may be identified as duplicate, and in place of the duplicate data a machine-readable reference pointing to the original data is stored. The reference may typically be expected to be significantly smaller in memory size than the data it replaces, thereby the overall memory footprint of the data may be correspondingly reduced. Data deduplication may thereby advantageously reduce the overall memory footprint of data in a data storage system by eliminating storage of duplicate data. However, data deduplication may disadvantageously consume computational resource in identifying duplicate data.

Summary of the Disclosure

An objective of the present disclosure is to provide a method for data deduplication in a computer storage system, in which a demand on computational resource of the computer storage system in performing deduplication operations, e.g. in identifying duplicate data, is reduced.

The foregoing and other objectives are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the Figures.

A first aspect of the present disclosure provides a method for managing data deduplication in a data deduplication system, the method comprising: receiving, by the data deduplication system, a copy of first data stored in a computer storage system in communication with the data deduplication system, determining, by the data deduplication system, whether the first data matches further data stored in the computer storage system, using an index stored in the data deduplication system in which data stored in the computer storage system is associated with a location of the data in the computer storage system, and in response to a determination that the first data matches further data stored in the computer storage system, identifying, by the data deduplication system using the index, a location of the further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the further data in the computer storage system.

In the disclosed method, the data deduplication system receives a copy of data stored by the computer storage system. The data received could, for example, be a copy of all of the data stored on the computer storage system. In other examples, the data received could be a copy of only a part of the data stored on the computer storage system, for example, a single file or a small number of files, or even one or more blocks of data. For example, the computer storage system may itself have received, from a coupled processor, a batch of data for storage by the computer storage system, and the computer storage system may wish to determine whether particular data, e.g. the newly received batch of data, is a duplicate of further data already stored in the computer storage system.

The data deduplication system comprises, for example, stored in a memory of the data deduplication system, a pre-defined index in which one or more items, e.g. files or blocks, of data already stored in the computer storage system, i.e. further data, is associated with a location of that further data in the computer storage system. The index could, for example, be generated by the data deduplication system by reading the contents of the computer storage system, and could be periodically updated by the data deduplication system. In an example to be described in detail herein, the computer storage system periodically sends a copy of data stored thereon to the data deduplication system, whereby the data deduplication system may update the index. The index could, for example, record a digest or hash representation of data stored in the computer storage system, and associate each digest/hash with the location, in the computer storage system, of the corresponding data.

The data deduplication system may compare the received data, either as a whole or following splitting of the received data into smaller blocks, to the entries of the index table, to determine, by reference to the index, whether the received data, or a constituent block of the received data, is a duplicate of further data already stored in the computer storage system. If the data deduplication system identifies a match between the received data, or a file or block thereof, and further data recorded in the index table, indicating that the received data is a duplicate of further data already stored in the computer storage system, the data deduplication system may determine, by reference to the index, the location of the further data of which the received data is a duplicate. The data deduplication system may then send a notification, e.g. via a network, to the computer storage system, notifying the computer storage system of the location in the computer storage system of the further data corresponding to the duplicate data. For example, the notification could identify the duplicate data and identify the location in the computer storage system of the corresponding further data. In this regard, the notification may be considered as a ‘hint’ provided by the data deduplication system to the computer storage system, indicating the likely existence of duplicate data in the computer storage system.
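By way of illustration only, the index lookup and the resulting 'hint' might be sketched as follows. The names `Hint`, `digest` and `find_duplicate`, the use of SHA-256 as the digest, the string form of the locations, and the use of a location label to identify the duplicate data are all assumptions made for this sketch and are not taken from the disclosure.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Hint:
    """Hypothetical shape of the notification sent to the storage system."""
    duplicate_location: str  # where the received (duplicate) data lives
    original_location: str   # where the matching further data lives


def digest(data: bytes) -> str:
    # SHA-256 is an assumed choice; the text only refers to a digest or hash.
    return hashlib.sha256(data).hexdigest()


def find_duplicate(index: Dict[str, str], data: bytes, location: str) -> Optional[Hint]:
    """Return a hint if `data` matches an index entry stored elsewhere, else None."""
    original = index.get(digest(data))
    if original is None or original == location:
        return None
    return Hint(duplicate_location=location, original_location=original)


# The index maps a digest of stored data to that data's location.
index = {digest(b"quarterly report"): "lun0/block/17"}
hint = find_duplicate(index, b"quarterly report", "lun0/block/42")
miss = find_duplicate(index, b"something new", "lun0/block/43")
```

Here `hint` identifies both the duplicate and the location of the further data, while `miss` is `None` because no index entry matches.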

The computer storage system may then optionally take an action based on the notification received from the data deduplication system. For example, the computer storage system could decide to deduplicate the identified duplicate data, by storing in the computer storage system, in place of the duplicate data, a reference pointing to the location, as identified by the notification, of the corresponding further data already stored in the computer storage system. In examples, the computer storage system may then subsequently erase the duplicate data from memory. The storage resource of the computer storage system saved by such a deduplication operation is thus the memory size of the duplicate data minus the memory size of the reference, which may be significant, in particular where the amount of duplicate data is great. In other examples, the computer storage system could decide not to take action in response to the notification, for example, where rules governing operation of the computer storage system dictate that data deduplication should only be performed when the computer storage system is close to capacity.
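As a minimal sketch of this optional action, the following shows a storage system that acts on a hint only when it is close to capacity, and reports the saving as the size of the duplicate data minus the size of the reference. The `Reference` type, the 16-byte reference size, the 80% threshold and the dictionary standing in for storage are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Reference:
    """Hypothetical pointer stored in place of duplicate data."""
    target: str


Store = Dict[str, Union[bytes, Reference]]

REFERENCE_SIZE = 16  # assumed size, in bytes, of a stored reference


def apply_hint(store: Store, duplicate: str, original: str,
               used: int, capacity: int, threshold: float = 0.8) -> int:
    """Replace the duplicate with a reference if policy allows; return bytes saved."""
    if used / capacity < threshold:
        return 0  # assumed rule: ignore hints until storage is close to capacity
    data = store[duplicate]
    if isinstance(data, Reference):
        return 0  # already deduplicated
    store[duplicate] = Reference(original)
    return len(data) - REFERENCE_SIZE


store: Store = {"a": b"x" * 4096, "b": b"x" * 4096}
saved_when_idle = apply_hint(dict(store), "b", "a", used=10, capacity=100)
saved_when_full = apply_hint(store, "b", "a", used=90, capacity=100)
```

With a 4096-byte duplicate and the assumed 16-byte reference, the saving when the hint is acted upon is 4080 bytes; when the storage system is far from capacity the hint is ignored and nothing is saved.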

In other words, in the disclosed method, one or more processes involved in data deduplication are offloaded by the computer storage system for performance by the data deduplication system. In particular, in the method, the data deduplication system performs the process of identifying duplicate data in the computer storage system. The data deduplication system does this by reference to the index, which contains a record of data stored in the computer storage system associated with a location in the computer storage system of that data. Storage of the index itself consumes memory resource. In particular, where the amount of data stored in the computer storage system, and so recorded in the index, is relatively great, storage of the index may correspondingly require relatively great memory resource. Further, computational processes involved in identifying, using the index, duplicate data and a location of the corresponding data in the computer storage system, may consume computational resource, e.g. processor time. However because, in the method, the process of identifying duplicate data, and a location of the corresponding original data in the computer storage system, is performed by the data deduplication system using an index stored by the data deduplication system, rather than by the computer storage system, demand on computational resource of the computer storage system, such as memory and/or processor time, resulting from the deduplication operation may be reduced. The disclosed method may thus find particular utility where the computer storage system has relatively low computational resource, e.g. low memory and/or processor capacity.

In an implementation, the receiving, by the data deduplication system, a copy of first data stored in a computer storage system, comprises receiving a copy of the first data labelled with a location of the first data in the computer storage system, and the method further comprises, in response to a determination that the first data does not match further data stored in the computer storage system, modifying, by the data deduplication system, the index, to include a representation of the first data associated with the location of the first data in the computer storage system.

In other words, the computer storage system may send the copy of the first data along with one or more labels indicating the location of the data in the computer storage system. In the event that, on receipt of the data from the computer storage system, the data deduplication system determines, using the index, that the first data does not match further data stored in the computer storage system, e.g. in the absence of a positive determination that the first data matches further data stored in the computer storage system, thereby indicating that the first data is not a duplicate of further data already stored in the computer storage system, the data deduplication system may modify, i.e. update, the index to include representation(s), e.g. a digest or hash representing the first data (or a constituent block thereof), associated with the location in the computer storage system of the first data, as identified by the received label. The index maintained by the data deduplication system is thereby dynamically updated to account for changes to the data stored in the computer storage system, and in particular to account for new data stored in the storage system. Thereby, if the data deduplication system is presented with the new data again, it may, using the updated index, identify the location of the new data in the computer storage system, and notify the storage system accordingly.
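The update-on-miss behaviour described above can be sketched as a single lookup-or-insert step. The function name `process`, the SHA-256 digest and the location strings are illustrative assumptions only.

```python
import hashlib
from typing import Dict, Optional


def process(index: Dict[str, str], data: bytes, location: str) -> Optional[str]:
    """Return the location of matching further data, or record `data` and return None."""
    key = hashlib.sha256(data).hexdigest()  # assumed digest function
    known = index.get(key)
    if known is not None:
        return known          # match: this location goes into the notification
    index[key] = location     # no match: add the labelled location to the index
    return None


index: Dict[str, str] = {}
first = process(index, b"new data", "vol1/0x10")   # not yet indexed
second = process(index, b"new data", "vol1/0x80")  # same data presented again
```

The first call finds no match and records the data under its label; the second call, presented with the same data from a different location, returns the location recorded by the first.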

In an implementation, the determining, by the data deduplication system, whether the first data matches further data stored in the computer storage system, comprises chunking, by the data deduplication system, the first data into a sequence of chunks of data using a chunking algorithm, and, for each chunk of data, determining, by the data deduplication system, whether the respective chunk of data matches further data stored in the computer storage system, using the index stored in the data deduplication system, and the identifying, by the data deduplication system using the index, a location of the further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the further data in the computer storage system, comprises for one or more of the chunks of data, identifying, by the data deduplication system using the index, a location of the respective further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the respective further data in the computer storage system.

In other words, in an implementation, the data deduplication system may perform variable-length block deduplication, whereby a chunk size for segmenting the first data is determined by the chunking algorithm. The data deduplication system may subsequently perform the processes for identifying duplicate data, i.e. comparing the first data to the index, on a chunk level. Block, or in other words chunk, level deduplication, whereby the first data is segmented into smaller blocks/chunks and each chunk is inspected for commonality with the further data already stored in the computer storage system, provides ‘higher-resolution’ deduplication, which may advantageously result in relatively higher deduplication rates, and thus advantageously lower memory footprint. Moreover, the chunking algorithm may permit variable-length deduplication, which may advantageously allow for relatively greater identification of common segments of data occurring across a data set.
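The disclosure does not name a particular chunking algorithm. Purely as an illustration of variable-length chunking, the sketch below places a chunk boundary wherever the sum of the most recent bytes satisfies a bit-mask condition, subject to minimum and maximum chunk sizes. The windowed byte sum is a deliberately simple stand-in for a stronger rolling hash, and every parameter value is an assumption.

```python
from typing import List


def chunk(data: bytes, window: int = 16, mask: int = 0x3F,
          min_size: int = 32, max_size: int = 1024) -> List[bytes]:
    """Split `data` into variable-length chunks at content-defined boundaries."""
    chunks: List[bytes] = []
    start = 0    # index of the first byte of the current chunk
    rolling = 0  # sum of (at most) the last `window` bytes of the current chunk
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        size = i - start + 1
        at_boundary = size >= min_size and (rolling & mask) == mask
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


sample = bytes((i * 31 + (i >> 3)) % 251 for i in range(5000))
chunks = chunk(sample)
```

Each resulting chunk could then be hashed and looked up in the index individually. Because boundaries depend on the content near the boundary rather than on fixed offsets, data that recurs at a different offset can still tend to produce matching chunks.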

In an implementation, the method further comprises storing a copy of the first data in the data deduplication system. In other words, the data deduplication system may itself support data storage functionality, and the data deduplication system may function as backup storage for backing-up data stored in the computer storage system. Data stored in the computer storage system may thereby be better protected from accidental loss through erasure or corruption of the data stored on the computer storage system. Moreover, consolidating the functions of data-deduplication management and backing-up of data on the data deduplication system has the particular advantage that the computer storage system is only required to send a single copy of the first data to a single recipient, i.e. the data deduplication system, to achieve both data deduplication and backing-up. This may advantageously reduce the amount of communication bandwidth required for performing the data deduplication and backing-up functions. In contrast, if the computer storage system were required to send a copy of the first data to the data deduplication system for data deduplication, and then a further copy of the first data to a separate data backup system, a relatively higher communication bandwidth would be required, i.e. sufficient communication bandwidth to complete the two separate communications.

In an implementation, the storing the copy of the first data in the data deduplication system comprises, determining, by the data deduplication system, whether the first data matches stored data stored in the data deduplication system, and in response to a determination that the first data does not match stored data stored in the data deduplication system, storing the copy of the first data in the data deduplication system.

In other words, where the data deduplication system is serving additionally as a data backup, the data deduplication system may, in response to receiving the copy of the first data, perform an internal check to determine whether or not the first data is a duplicate of data already stored in the data deduplication system. Where the data deduplication system determines that the first data is not a duplicate of data already stored on the data deduplication system, the data deduplication system may write the first data to memory of the data deduplication system.

This determination may advantageously allow the data deduplication system to additionally perform internal data deduplication, and thereby avoid storing of duplicate data on the data deduplication system. Like the computer storage system, where, alternatively, the data deduplication system determines that the first data is a match for data already stored, the data deduplication system may determine a location of the matching data already stored, and store in place of the first data a reference to the location of the matching data. For the purpose of determining whether the first data matches stored data stored in the data deduplication system, the data deduplication system may maintain a further index, in which data stored in the data deduplication system is associated with a location of the data in the data deduplication system.
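A minimal sketch of this internal deduplication, assuming a hypothetical `BackupStore` class, might keep the further index as a mapping from digest to storage slot, with a catalogue of references per backed-up item. The class, its attribute names and the SHA-256 digest are assumptions for illustration.

```python
import hashlib
from typing import Dict, List


class BackupStore:
    """Hypothetical backup store that avoids writing duplicate data twice."""

    def __init__(self) -> None:
        self.blobs: List[bytes] = []              # one slot per unique item of data
        self.further_index: Dict[str, int] = {}   # digest -> slot of the stored data
        self.catalog: Dict[str, int] = {}         # backup name -> slot (a reference)

    def put(self, name: str, data: bytes) -> bool:
        """Back up `data` under `name`; return True only if new bytes were written."""
        key = hashlib.sha256(data).hexdigest()
        slot = self.further_index.get(key)
        written = slot is None
        if slot is None:
            slot = len(self.blobs)
            self.blobs.append(data)
            self.further_index[key] = slot
        self.catalog[name] = slot  # on a match, only the reference is stored
        return written

    def get(self, name: str) -> bytes:
        return self.blobs[self.catalog[name]]


backup = BackupStore()
wrote_first = backup.put("file-a", b"payload")
wrote_second = backup.put("file-b", b"payload")
```

The second `put` finds a match in the further index, so only a reference is recorded and the payload is held once.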

In an implementation, the receiving, by the data deduplication system, a copy of first data stored in the computer storage system, comprises receiving a copy of the first data labelled with a location of the first data in the computer storage system, and the method further comprises storing in the data deduplication system a label associated with the copy of the first data identifying the location of the first data in the computer storage system.

In other words, where the data deduplication system functions additionally as a data backup, the data deduplication system may receive from the computer storage system the first data labelled with a location of that data in the computer storage system, and, where the data deduplication system stores a backup copy of the first data, the data deduplication system may further store a label, associated with the backup copy of data, identifying a location of the original data in the computer storage system. This labelling may advantageously allow for convenient retrieval of data from the backup system by reference to a location in the computer storage system. This may be useful, for example, where data stored in a particular location in the computer storage system has been erroneously erased or corrupted, as the computer storage system may subsequently send a retrieval request to the data deduplication system identifying the erased/corrupted location, whereupon the data deduplication system may identify and return the corresponding backed up data.
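Retrieval by location label might, under the assumptions stated here, look like the following. `LabelledBackup` and its methods are hypothetical names, and a plain dictionary keyed by the location label stands in for the backup storage.

```python
from typing import Dict, Optional


class LabelledBackup:
    """Hypothetical backup keyed by the data's location in the storage system."""

    def __init__(self) -> None:
        self._by_location: Dict[str, bytes] = {}

    def store(self, location: str, data: bytes) -> None:
        # The label received with the data is stored alongside the backup copy.
        self._by_location[location] = data

    def retrieve(self, location: str) -> Optional[bytes]:
        # A retrieval request names the erased or corrupted location.
        return self._by_location.get(location)


backup = LabelledBackup()
backup.store("lun2/block/9", b"ledger page")

primary = {"lun2/block/9": b"\x00\x00corrupt"}  # stand-in for the storage system
restored = backup.retrieve("lun2/block/9")
if restored is not None:
    primary["lun2/block/9"] = restored
```

After the retrieval, the corrupted location in the stand-in primary storage again holds the backed-up data.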

In an implementation, the method further comprises, sending, by the computer storage system, the copy of first data stored in the computer storage system. In other words, the method may further involve a prior step of sending the copy of the first data from the computer storage system to the data deduplication system. For example, the data deduplication system may send a request for the copy data, or another prompt or trigger, to the computer storage system.

In an implementation, the method further comprises, in response to receiving, using the computer storage system, the notification, storing in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

In other words, the computer storage system may, on notification by the data deduplication system of a location of data already stored in the computer storage system for which the first data is a duplicate, store in the computer storage system, in place of the first data, the reference pointing to the matching data. For example, where the first data is a component, i.e. a chunk or block, of a data file, and the first data is already stored in the computer storage system, the computer storage system may store, in the data file structure, the reference.

In an implementation, the method further comprises, in response to receiving, using the computer storage system, the notification, checking, using the computer storage system, whether the first data matches the further data stored in the identified location of the computer storage system, and in response to a determination that the first data matches the further data, storing in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

In other words, the computer storage system, on receiving from the data deduplication system the notification identifying the location of the duplicated data, may perform its own internal check to confirm that the first data does indeed match the further data already stored in the location identified in the notification. In other words, the computer storage system may double check the determination made by the data deduplication system. This double-check may ensure that the duplicate data determination made by the data deduplication system is correct, or conversely identify errors in the determination made by the data deduplication system, thereby avoiding corruption of data structures, such as files, in the computer storage system with incorrect references. Such an incorrect determination by the data deduplication system may occur, for example, where the data stored in the primary storage has changed since the index stored in the data deduplication system was last updated, in which circumstance the index stored in the data deduplication system, and so the notification generated by the data deduplication system based on the index, may be inaccurate.

However, because the data deduplication system has already identified the suspected location of duplicated data, the computer storage system may narrow the scope of its onboard check for duplicated data based on the location identified in the notification as a starting point. For example, this process may involve the computer storage system checking only the location identified in the notification. This may advantageously be relatively less consumptive of computational resource of the computer storage system than if the computer storage system were required to perform the entirety of the data deduplication investigation itself, e.g. by checking every location of the computer storage system for duplicated data.

In an implementation, the first data comprises data stored as a plurality of blocks in the computer storage system, and the method further comprises, in response to receiving, using the computer storage system, the notification, for each chunk of data, determining, using the computer storage system, one or more blocks of data comprising the entirety of the respective chunk of data.

In other words, the computer storage system may perform an additional operation of determining how the chunks of data identified by the data deduplication system map to blocks of data stored in the computer storage system.
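One possible reading of this mapping, assuming fixed-size blocks and chunks identified by a byte offset and a length, is to compute the range of block numbers that together cover a chunk. The function name, the 4096-byte block size and the offset/length representation are assumptions made for this sketch.

```python
from typing import List


def blocks_covering(offset: int, length: int, block_size: int = 4096) -> List[int]:
    """Return the block numbers that together contain the whole chunk."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size  # block holding the chunk's final byte
    return list(range(first, last + 1))


spanning = blocks_covering(offset=6000, length=5000)  # bytes 6000..10999
aligned = blocks_covering(offset=0, length=4096)      # exactly one block
```

A chunk occupying bytes 6000 to 10999 falls across blocks 1 and 2, whereas a chunk aligned to a single block maps to that block alone.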

A second aspect of the present disclosure provides a computer program comprising instructions, which, when executed by a computing system, cause the computing system to carry out the method of the first aspect of the disclosure or any implementation of the first aspect.

A third aspect of the present disclosure provides a computer-readable data carrier having the computer program of the second aspect of the disclosure stored thereon.

A fourth aspect of the present disclosure provides a computing system comprising a data deduplication system, wherein the data deduplication system comprises: a receiving module suitable for communication with a computer storage system to receive a copy of first data stored in the computer storage system, computer memory comprising a machine-readable index stored in the computer memory in which data stored in the computer storage system is associated with a location of the data in the computer storage system, a determination module configured to determine, using the index, whether the copy of first data received by the receiving module from the computer storage system matches further data stored in the computer storage system, and in response to a determination that the first data matches further data stored in the computer storage system, identify, using the index, a location of the further data in the computer storage system, and generate a notification identifying the location of the further data in the computer storage system, and a sending module suitable for sending the notification to the computer storage system.

In an implementation, the receiving module is further suitable for receiving from the computer storage system location data identifying a location of the first data in the computer storage system, and the determination module is further configured to, in response to a determination that the first data does not match further data stored in the computer storage system, modify the index to include a representation of the first data associated with the location of the first data in the computer storage system.

In an implementation, the determination module is configured to chunk a copy of first data received from the computer storage system into a sequence of chunks of data using a chunking algorithm, and, for each chunk of data, determine, using the index, whether the respective chunk of data matches further data stored in the computer storage system, and for one or more of the chunks of data, identify, using the index, a location of the respective further data in the computer storage system, and generate a notification identifying the location associated with the respective further data in the computer storage system.

In an implementation, the determination module is further configured to store a copy of the first data in the computer memory.

In an implementation, the determination module is configured to determine whether the first data matches stored data stored in the computer memory, and in response to a determination that the first data does not match stored data stored in the computer memory, store the copy of the first data in the computer memory.

In an implementation, the determination module is further configured to store in the computer memory of the data deduplication system a label associated with the copy of the first data identifying the location of the first data in the computer storage system.

In an implementation, the computing system further comprises the computer storage system in communication with the receiving module of the data deduplication system.

In an example, the computer storage system could comprise a plurality of discrete storage devices, e.g. a cluster of disk drives. As an example alternative, the computer storage system could comprise a single storage device.

In an implementation, the computer storage system is configured to send the copy of first data stored in the computer storage system.

In an implementation, the computer storage system is configured to, in response to receiving the notification, store in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

In an implementation, the computer storage system is configured to, in response to receiving the notification, determine whether the first data matches the further data stored in the identified location of the computer storage system, and in response to a determination that the first data matches the further data, store in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.
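The check-then-replace behaviour might be sketched as below, where only the single location named in the notification is compared against the first data. `Reference`, `verify_and_replace` and the dictionary standing in for storage are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Reference:
    """Hypothetical pointer stored in place of duplicate data."""
    target: str


Store = Dict[str, Union[bytes, Reference]]


def verify_and_replace(store: Store, duplicate: str, original: str) -> bool:
    """Replace `duplicate` with a reference only if it really matches `original`."""
    if duplicate == original:
        return False  # never point a location at itself
    first = store.get(duplicate)
    further = store.get(original)
    if not isinstance(first, bytes) or not isinstance(further, bytes):
        return False  # a location is missing or already holds a reference
    if first != further:
        return False  # hint is stale or wrong; leave the data untouched
    store[duplicate] = Reference(original)
    return True


store: Store = {"p": b"abc", "q": b"abc", "r": b"abd"}
confirmed = verify_and_replace(store, "q", "p")  # hint is correct
rejected = verify_and_replace(store, "r", "p")   # hint is inaccurate
```

Only the location named in the hint is read, so the check is cheap relative to a full search, and an inaccurate hint leaves the stored data unchanged.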

In an implementation, the first data comprises data stored as a plurality of blocks in the computer storage system, and the computer storage system is configured to, in response to receiving the notification, for each chunk of data, determine one or more blocks of data comprising the entirety of the respective chunk of data.

These and other aspects of the disclosure will be apparent from the embodiment(s) described below.

Brief Description of the Drawings

In order that the present disclosure may be more readily understood, embodiments of the disclosure will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 shows schematically an example of a computing system embodying an aspect of the present disclosure, comprising a computer storage system and a backup storage and data deduplication system;

Figure 2 shows schematically an example of a data deduplication module of the computing system identified with reference to Figure 1;

Figure 3 shows schematically an example of a memory of the data deduplication module identified with reference to Figure 2;

Figure 4 shows processes involved in a data back-up and deduplication operation performed by the computer storage system identified with reference to Figure 1;

Figure 5 shows processes involved in a data back-up and deduplication operation performed by the backup storage and data deduplication system identified with reference to Figure 1, including a process of identifying duplicate data in received data and storing a back-up copy of the data;

Figure 6 shows processes involved in identifying duplicate data performed by the backup storage and data deduplication system, which includes a process of determining whether received data matches further data stored in the computer storage system;

Figure 7 shows processes involved in determining whether received data matches further data stored in the computer storage system;

Figure 8 shows processes involved in storing a back-up copy of the data, performed by the computer storage system; and

Figure 9 shows processes involved in determining whether received data matches further data stored in the computer storage system, performed by the backup storage and data deduplication system.

Detailed Description of the Disclosure

Referring firstly to Figure 1, a computing system 101 embodying an example of an aspect of the present disclosure comprises a plurality of client devices 102, 103, computer storage system 104, and backup storage and data deduplication system 105. Client devices 102, 103 are communicatively coupled to computer storage system 104 by network 106. Computer storage system 104 is communicatively coupled to backup storage and data deduplication system 105 by network 107.

Client devices 102, 103 utilise computer storage system 104 for storage of data. For example, client devices 102, 103 may send requests to store or retrieve data to computer storage system 104 via network 106, and may similarly exchange data for storage or retrieval with computer storage system 104 via network 106. Client devices 102, 103 may, for example, be portable computers, or smart-phones. Client devices 102, 103 may include computing functionality, e.g. a computer processor. In the example, two client devices 102, 103, are depicted as being coupled to computer storage system 104. In other examples, the number of client devices coupled to computer storage system 104 may be greater or lesser than two. Client devices 102, 103 may be located remotely from the computer storage system 104, and indeed remotely relative to one another. For example, computer storage system 104 could be located in a central data centre. The client devices 102, 103 may thereby utilise storage resource of the computer storage system 104 for storing data.

Computer storage system 104 is for providing storage resource for a plurality of client devices, such as client devices 102, 103. As will be described, in the example, computer storage system 104 is configured to communicate with client devices 102, 103 for storage and retrieval of data, and further to communicate with backup storage and data deduplication system 105 to back-up data, and perform data deduplication, of data stored on computer storage system 104.

Computer storage system 104 comprises processor 108, memory 109, storage 110, input/output interface 111, and system bus 112. Processor 108 is configured for controlling the operation of the computer storage system, for example, processing storage and retrieval requests by client devices 102, 103. As will be described herein, in examples, processor 108 is configured for controlling processes of a data backup and data deduplication operation, in accordance with a data backup and deduplication computer program stored on memory 109. Memory 109 is configured as non-volatile read/write memory for storage of computer programs for execution by processor 108 and operational data associated with operations executed by the processor 108. In examples, memory 109 is flash memory, although in other examples alternative forms of memory could be substituted for flash memory. Storage 110 is configured for storing data, e.g. for storing data for client devices 102, 103. In examples, storage 110 comprises one or more disk drives. In examples, storage 110 may be configured as a plurality of logical units which are individually addressable by the processor 108. Input/output interface 111 is provided for connection of client devices 102, 103 to computer storage system 104, and for connection of computer storage system 104 to backup storage and data deduplication system 105. The components 108 to 111 of the computer storage system 104 are in communication via system bus 112.

Backup storage and data deduplication system 105 is to provide backup storage resource to computer storage system 104, and further to perform processes of a deduplication operation for deduplication of data stored in computer storage system 104. Back-up storage of data provides useful redundancy to allow recovery of data in the event of loss of the original data, e.g. in the event of failure of storage 110. Data deduplication may usefully reduce the size of a collection of data by elimination of duplicate instances of data within the collection, to thereby reduce the amount of computer storage required to store the collection of data. Thus, in examples described in further detail herein, backup storage and data deduplication system 105 may periodically store a copy of data stored on computer storage system 104, and may further perform processes of a deduplication operation for deduplicating data stored on computer storage system 104.

Backup storage and data deduplication system 105 comprises processor 112, storage 113, data deduplication module 114, input/output interface 115, and system bus 116. Processor 112 is to control the operation of the backup storage and data deduplication system 105. As will be described herein, in an example processor 112 is configured for controlling communication with computer storage system 104 and for controlling processes of a data backing-up operation in cooperation with computer storage system 104. Storage 113 is configured for non-volatile storage of data, e.g. for storing a copy of data received from computer storage system 104 for backing up. In examples, storage 113 comprises one or more disk drives. Data deduplication module 114 is for performing processes of a data deduplication procedure, for deduplicating data stored by computer storage system 104, as will be described in further detail with reference to Figures 4 to 7. Input/output interface 115 is provided for connection of backup storage and data deduplication system 105 to computer storage system 104. The components 112 to 115 are in communication via system bus 116.

Backup storage and data deduplication system 105 may be located remotely from computer storage system 104. Indeed, for the purpose of providing backup storage to computer storage system 104, it may be desirable that backup storage and data deduplication system 105 comprises physical storage resource, e.g. one or more disk drives, that is separate from the storage resource of the computer storage system 104, to avoid a risk of simultaneous failure of both storage resources.

In examples, networks 106 and 107 may each be implemented, for example, by wide area networks (WANs) such as the Internet, local area networks (LANs), metropolitan area networks (MANs), and/or personal area networks (PANs), etc. The networks may be implemented using wired technology (e.g. Ethernet, Data Over Cable Service Interface Specification (DOCSIS), synchronous optical networking (SONET), and/or synchronous digital hierarchy (SDH), etc.) and/or wireless technology (e.g. Institute of Electrical and Electronics Engineers (IEEE) 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), Bluetooth, ZigBee, near-field communication (NFC), and/or Long-Term Evolution (LTE), etc.). The networks may include at least one device for communicating data in the network. For example, each network 106, 107 may include computing devices, routers, switches, gateways, access points, and/or modems.

Because, in the disclosure, data deduplication management functionality is provided to computer storage system 104 by backup storage and data deduplication system 105, computer storage system 104 is relieved of the computational burden involved in performing certain of the computational processes involved in data deduplication. In particular, as will be described herein, in the example backup storage and data deduplication system 105 performs a computational process of identifying duplicate instances of data stored in computer storage system 104. Thus, computer storage system 104 is not required to perform this process, thereby reducing the demand on computational resource, e.g. memory and processor time, of the computer storage system 104.

Moreover, in the example, backup storage and deduplication functionality are integrated into backup storage and data deduplication system 105. Consolidating the functions of backup storage and data deduplication management into a single system has the advantage that the computer storage system 104 is only required to send a single copy of the first data to a single recipient, i.e. the backup storage and data deduplication system 105, to achieve both data deduplication and backing-up. This may advantageously reduce the amount of communication bandwidth required for performing the backing-up and data deduplication functions. In other examples backup storage and data deduplication functionality may be performed by separate systems, for example, systems having mutually independent communication links to computer storage system 104. In other examples, data deduplication management may be provided to a computer storage system without backing-up functionality also being provided to the computer storage system.

Referring next to Figures 2 and 3, data deduplication module 114 of backup storage system 105 comprises a receiving module 201, memory 202, a determination module 203, a sending module 204, and system bus 205.

Receiving module 201 is functional to interface with system bus 116 of backup storage system 105, to thereby communicate with the computer storage system 104 via the input/output interface 115 of backup storage system 105. Receiving module 201 may thereby receive data from computer storage system 104 via input/output interface 115 of backup storage system 105. Receiving module 201 may, for example, comprise a processor for controlling communication with processor 112 of the backup storage system 105 and/or with computer storage system 104. Memory 202 is configured as non-volatile read/write memory for storage of data received from computer storage system 104 via receiving module 201. Referring in particular to Figure 3, in examples, memory 202 comprises, stored thereon, a data deduplication computer program 301 comprising machine-readable instructions for operation of the data deduplication module 114 for managing a data deduplication operation, and in particular for identifying duplicate instances of data. Memory 202 further comprises, stored thereon, an index 302, recording data stored in the storage 110 of computer storage system 104 along with locations of the respective data in the storage 110. In examples, memory 202 is flash memory, although in other examples alternative forms of memory could be substituted for flash memory.

Referring back to Figure 2, determination module 203 is functional to determine whether data received by the backup storage system 105 from the computer storage system 104 is a duplicate of other data already stored in the computer storage system, i.e. for identifying duplicate instances of data in storage 110 of computer storage system 104. As will be described with reference to later Figures, in an example, determination module 203 identifies duplicate data by reference to the index 302 stored in memory 202. Determination module 203 may, for example, comprise a processor for executing processes of the computer program 301 stored in memory 202.

Sending module 204 is functional to communicate with the computer storage system 104 via the input/output interface 115 of backup storage system 105. Sending module 204 may thereby send data to computer storage system 104 via input/output interface 115 of backup storage system 105. In examples, as will be described, sending module 204 may send to computer storage system 104 a notification identifying duplicate data stored in computer storage system 104. Sending module 204 may, for example, comprise a processor for controlling communication with processor 112 of the backup storage system and/or with computer storage system 104. The components 201 to 204 are in communication via system bus 205.

Referring next to Figure 4, in examples, a computer program stored on the memory 109 of computer storage system 104, for performance by computer storage system 104, comprises four stages.

At stage 401, the computer program causes the processor 108 to initiate a data backup and data deduplication procedure. In examples, stage 401 could be initiated periodically, for example, daily. Data on computer storage system 104 may thereby be frequently backed-up to backup storage and data deduplication system 105. Frequent performance of the method may thereby reduce the risk of significant data loss in the event of failure of computer storage system 104, and further data on computer storage system 104 may be frequently deduplicated to minimise the memory footprint of the data. In alternative examples, stage 401 could be initiated in response to a manual input by an operator to computer storage system 104.

At stage 402, the computer program causes the processor 108 to retrieve data, herein termed ‘first data’, from storage 110 for backup and deduplication, and send a copy of the first data, via input/output interface 111 and network 107, to backup storage and data deduplication system 105 for backing-up and deduplication. In examples, the computer program could cause the processor 108 to send a copy of an entirety of the data stored in storage 110, or a logical unit of storage 110, such that the backup storage and data deduplication system 105 may receive a copy of all data on the computer storage system 104. In alternative examples, the computer program could cause the processor 108 to send a copy of only part of the data stored in storage 110, for example, only data stored in storage 110 since a previous iteration of the method. In examples, at stage 402, the processor 108 also identifies and sends with the first data location information defining respective locations in the storage 110 of data items of the first data. For example, the location information may identify the number of a logical unit in which the first data is stored in the storage 110, and/or an offset of the first data in a storage device of the storage 110.
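By way of illustration only, the labelling of data items with location information at stage 402 might be sketched as follows. The names Location, LabelledItem and label_items, and the choice of fixed-size items, are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Location:
    """Hypothetical location label: logical unit number plus byte offset."""
    logical_unit: int
    offset: int


@dataclass(frozen=True)
class LabelledItem:
    """A data item of the first data together with its location in storage."""
    location: Location
    payload: bytes


def label_items(logical_unit: int, data: bytes, item_size: int) -> Iterator[LabelledItem]:
    """Split the contents of one logical unit into items, each labelled with its offset."""
    for offset in range(0, len(data), item_size):
        yield LabelledItem(Location(logical_unit, offset), data[offset:offset + item_size])


# Example: an 8-byte logical unit sent as two 4-byte labelled items.
items = list(label_items(logical_unit=3, data=b"abcdefgh", item_size=4))
```

In this sketch each item carries its own label, so the receiving system could record the location of an item without any further exchange with the sender.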

At stage 403, the computer program causes the processor 108 to await reception of a notification from backup storage system 105 identifying the existence of duplicate instances of data stored in the storage 110 of the computer storage system.

At stage 404, in dependence on the nature of the notification received at stage 403, the computer program causes the processor 108 to perform a data deduplication procedure for deduplicating data stored in storage 110.

Referring next to Figure 5, in examples, the computer program 301 stored in the memory 202 of backup storage and data deduplication system 105 comprises four stages.

At stage 501, the computer program causes the processor 112 of the system 105 to receive, via the network 107, a copy of the first data sent by the computer storage system 104 at stage 402. Stage 501 may involve the processor 112 communicating a copy of the first data to the receiving module 201 of data deduplication module 114, whereby the data deduplication module 114 may save a copy of the data in memory 202.

At stage 502, the computer program 301 causes the determination module 203 of data deduplication module 114 to identify whether the first data received at stage 501 contains duplicate data, for example, duplicate instances of other data stored in storage 110 of computer storage system 104. Example processes performed at stage 502 will be described in further detail with particular reference to Figure 6.

At stage 503, the computer program 301 causes the sending module 204 to send a notification, via network 107, to computer storage system 104, reporting the determination of stage 502, i.e. the determination as to whether the first data contains duplicate data.

At stage 504, the computer program 301 may cause the processor 112 to initiate a data backup procedure, the result of which may be storing a copy of the first data received at stage 501 in storage 113.

Referring next to Figure 6, in examples, the method of stage 502 for identifying duplicate data comprises four stages.

At stage 601, the determination module 203 performs a determination procedure to determine whether the first data, received at stage 501, matches further data stored in the computer storage system 104. Stage 601 may involve determination module 203 retrieving from memory 202 a predefined index 302, in which data stored in the storage 110 of the computer storage system 104 is associated with a location of the data in the computer storage system. For example, the index 302 may take the form of a look-up table, in which items of data stored in the storage 110 of computer storage system 104 have respective entries, each entry having associated therewith location information defining a location in the storage 110 of the respective data item. For example, entries may identify a disk drive, logical unit, and/or offset of the respective data item in the storage 110. In examples, data in the storage 110 may be listed in the index at a file-level. In other examples, data in the storage 110 may be listed in the index at the block-level. The level of granularity of the determination made by the determination module 203 at stage 601 may be set in accordance with the level of granularity at which data is recorded in the index 302. For example, where the index records data in storage 110 at the file-level, the determination performed at stage 601 may also look for matching data files. In other examples, where the index 302 records data in storage 110 at the block-level, the determination at stage 601 may similarly look for matching data blocks. The index could, for example, have been predefined by the backup storage and data deduplication system 105 by previous iterations of the data back-up procedure. For example, on occasions when the system 105 has received copy data from the computer storage system 104 for backing-up in storage 113, the system 105, e.g. the data deduplication module 114, could update the index table 302, based on the copy data, to list the data and location(s) of the data in the computer storage system 104.
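A minimal sketch of such a look-up table is given below, purely as an illustration. It assumes that each data item is represented in the index by a SHA-256 digest of its content; the disclosure does not specify how data is represented in the index, and the class name Index and its methods are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical location label: (logical unit number, byte offset).
Location = Tuple[int, int]


class Index:
    """Look-up table associating data items with their locations in the storage system."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Location]] = defaultdict(list)

    @staticmethod
    def fingerprint(data: bytes) -> str:
        # Assumption: a content digest stands in for the data item itself.
        return hashlib.sha256(data).hexdigest()

    def record(self, data: bytes, location: Location) -> None:
        """List a data item (file or block) and its location in the index."""
        locations = self._entries[self.fingerprint(data)]
        if location not in locations:
            locations.append(location)

    def locations(self, data: bytes) -> List[Location]:
        """Return every recorded location of data equal to the given item."""
        return list(self._entries.get(self.fingerprint(data), []))


index = Index()
index.record(b"block-A", (0, 0))
index.record(b"block-A", (2, 4096))
```

The same structure serves file-level or block-level granularity, depending on whether whole files or individual blocks are passed to record.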

If the determination at stage 601 is in the affirmative, indicating that the first data, e.g. a file or block thereof, does match further data stored in the storage system, at stage 602 the determination module 203 identifies the location(s) of the matching further data, e.g. the matching data file or block, in the storage 110 of the computer storage system 104, by reference to the index 302.

Alternatively, if the determination at stage 601 is in the negative, indicating that the first data, e.g. a file or block thereof, does not match further data in the computer storage system 104, or in other words that the first data is the first occurrence of that data in the computer storage system, the data deduplication module 114 may modify, i.e. update, the index 302, to include one or more entries in the index corresponding to the first data, e.g. a data file or block of the first data.

At stage 503, as previously described, the data deduplication module 114 may then send a notification to the computer storage system 104 to notify the computer storage system if the first data, or a file or block thereof, matches, i.e. is a duplicate of, further data stored in the computer storage system. The notification may identify the matching data, and further identify, based on the relevant entry in the index, the expected location of the matching data in the storage 110 of the computer storage system 104. In examples, at stage 503, the data deduplication module 114 could send a notification to the computer storage system irrespective of the determination at stage 601. In other words, in examples, if the determination at stage 601 is answered in the affirmative, indicating that matching data has been identified, the data deduplication module 114 could send a notification to computer storage system 104 identifying the matching data and its location(s) in the storage 110. And, if the determination at stage 601 is answered in the negative, indicating that matching data has not been identified, the data deduplication module 114 could send a notification reporting the same. In other examples, the data deduplication module 114 may send a notification only if matching, i.e. duplicate, data is identified. In such other examples, the computer storage system may be configured to interpret a lack of a notification from the data deduplication module as indicating that duplicate data has not been identified.
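The flow of stages 601, 602 and 503, together with the index update on a negative determination, might be sketched as follows. This is an illustrative sketch only: the function handle_item, the dictionary-based index keyed by SHA-256 digest, the notification field names and the always_notify flag are all assumptions and do not appear in the disclosure.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# Hypothetical location label: (logical unit number, byte offset).
Location = Tuple[int, int]


def handle_item(index: Dict[str, List[Location]], payload: bytes,
                location: Location, always_notify: bool = True) -> Optional[dict]:
    """Return the notification to send for one item of first data, or None."""
    key = hashlib.sha256(payload).hexdigest()  # assumed representation of the data
    recorded = index.setdefault(key, [])
    # Stage 601: does the item match further data stored at some other location?
    further = [loc for loc in recorded if loc != location]
    if further:
        # Stage 602: the index supplies the location(s) of the matching further data.
        return {"duplicate": True,
                "first_data_location": location,
                "further_data_locations": further}
    # Negative determination: first occurrence, so update the index.
    if location not in recorded:
        recorded.append(location)
    # Stage 503: either report the negative result or stay silent.
    if always_notify:
        return {"duplicate": False, "first_data_location": location}
    return None


index: Dict[str, List[Location]] = {}
first = handle_item(index, b"A", (0, 0))
second = handle_item(index, b"A", (0, 4096))
silent = handle_item(index, b"B", (0, 8192), always_notify=False)
```

With always_notify set to False, the absence of a return value corresponds to the example in which the computer storage system interprets a lack of notification as meaning that no duplicate was found.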

Referring next to Figure 7, in examples, the method of stage 601 for performing a determination procedure to determine whether the first data matches further data stored in the computer storage system comprises two stages.

At stage 701, the data deduplication module 114 performs a chunking procedure on the first data received at stage 501, comprising chunking the first data into a sequence of chunks of data. In examples, stage 701 may involve performing variable-length block deduplication, whereby a variable chunk size for segmenting the first data is determined by a chunking algorithm.

At stage 702, the data deduplication module 114, for each chunk of data generated at stage 701, determines whether the respective chunk of data matches further data stored in the computer storage system 104, using the index 302 stored in the memory 202. The locations of the matching data chunks may then be identified at stage 602, as previously described.

In other words, in an implementation, the data deduplication module 114 may perform block-level deduplication, e.g. variable-length block deduplication, whereby a chunk size for segmenting the first data is determined by the chunking algorithm. Block, or in other words chunk, level deduplication, whereby the first data is segmented into smaller blocks/chunks and each chunk is inspected for commonality with the further data already stored in the computer storage system, provides ‘higher-resolution’ deduplication, which may advantageously result in relatively higher deduplication rates, and thus advantageously lower memory footprint. Moreover, the chunking algorithm may permit variable-length deduplication, which may advantageously allow for relatively greater identification of common segments of data occurring across a data set.
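The disclosure does not name a particular chunking algorithm. As one possible illustration of variable-length chunking, the sketch below uses a simple gear-style rolling hash to place chunk boundaries according to content. The function chunk, its parameters (min_size, avg_bits, max_size) and the GEAR table are assumptions made for this sketch.

```python
import random
from typing import List

# Assumed table of pseudo-random 32-bit values, one per byte value.
# The fixed seed keeps the chunk boundaries reproducible between runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, min_size: int = 64, avg_bits: int = 8,
          max_size: int = 1024) -> List[bytes]:
    """Segment data into variable-length chunks with content-defined boundaries."""
    mask = (1 << avg_bits) - 1
    chunks: List[bytes] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes are shifted out of the 32-bit window.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        # Cut where the hash hits the boundary pattern, within the size limits.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


sample_rng = random.Random(1)
sample = bytes(sample_rng.getrandbits(8) for _ in range(10000))
pieces = chunk(sample)
```

Because boundaries depend on the content rather than on fixed offsets, an insertion near the start of the data tends to disturb only nearby chunks, which is one reason variable-length chunking may identify more common segments than fixed-size blocks.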

Referring next to Figure 8, in examples, the method of stage 404 for performing, by the computer storage system 104, data deduplication operations for deduplicating data in storage 110, comprises five stages.

At stage 801, in response to receiving, at stage 403, the notification from the backup storage and data deduplication system 105, the processor 108 inspects the received notification to determine whether the notification indicates potentially matching data in the storage 110. If the determination at stage 801 is in the affirmative, indicating that potentially matching data has been identified, at stage 802 the computer storage system 104 performs a check of the matching data. In other words, at stage 802, the processor 108 may read from the notification the potentially matching data and its indicated location in storage 110, and perform its own check, by analysing the particular location(s) in storage 110 that is indicated in the notification, to confirm whether the indicated location(s) contain data that is a duplicate of the first data, e.g. a file or block of the first data. Thus, by this check, errors made by the data deduplication module 114 in identifying duplicate data may be identified.

If the determination at stage 802 is in the affirmative, indicating that the data identified by the notification does indeed match the data at the indicated location of the storage 110, at stage 803, the processor 108 may perform a data deduplication operation, whereby the processor 108 stores in the storage 110, in the place of the identified matching portion(s) of the first data, e.g. the file or block thereof, a reference identifying the location, identified in the notification, of the matching further data in the storage 110.

Subsequently, at stage 804, following storage of the reference at stage 803, the computer storage system 104 may erase the matching first data, e.g. the matching files or blocks, from storage 110.

In the alternative, if the determination at either of stages 801, 802 is answered in the negative, indicating that the first data does not match further data in the storage 110 of computer storage system 104, the computer storage system 104 may decide not to perform deduplication of the first data, and, at stage 805, may instead allow the first data to remain in the storage 110 in non-deduplicated form.
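The check-then-replace behaviour of stages 801 to 805 might be sketched as follows, purely for illustration. Storage 110 is modelled here as a dictionary from location to either raw bytes or a reference; the Reference class, the apply_notification function and the notification field names are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

# Hypothetical location label: (logical unit number, byte offset).
Location = Tuple[int, int]


@dataclass(frozen=True)
class Reference:
    """Machine-readable reference stored in place of duplicate data."""
    target: Location


Storage = Dict[Location, Union[bytes, Reference]]


def apply_notification(storage: Storage, notification: Optional[dict]) -> bool:
    """Deduplicate one item if the notification is confirmed; return True if deduplicated."""
    # Stage 801: does the notification indicate potentially matching data?
    if not notification or not notification.get("duplicate"):
        return False  # stage 805: leave the first data as it is
    first_loc = notification["first_data_location"]
    first = storage.get(first_loc)
    if not isinstance(first, bytes):
        return False  # nothing to deduplicate at that location
    for further_loc in notification["further_data_locations"]:
        # Stage 802: the storage system's own check of the indicated location.
        if further_loc != first_loc and storage.get(further_loc) == first:
            # Stages 803 and 804: in this model, overwriting the entry both
            # stores the reference and erases the duplicate bytes.
            storage[first_loc] = Reference(further_loc)
            return True
    return False  # stage 805: check failed, data stays non-deduplicated


storage: Storage = {(0, 0): b"A", (0, 1): b"A", (0, 2): b"B"}
confirmed = apply_notification(storage, {
    "duplicate": True, "first_data_location": (0, 1), "further_data_locations": [(0, 0)]})
rejected = apply_notification(storage, {
    "duplicate": True, "first_data_location": (0, 2), "further_data_locations": [(0, 0)]})
```

The second call models a notification based on an incorrect or out-of-date index entry: the local comparison fails, so the data at the indicated location is left untouched.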

Referring finally to Figure 9, in examples, the method of stage 504 for backing-up of the first data by the backup storage and data deduplication system 105 comprises three stages.

At stage 901, the processor 112 of the system 105 determines whether the first data or a part, e.g. a file or a block thereof, matches data already stored in storage 113 of system 105.

If the determination at stage 901 is in the affirmative, indicating that the first data matches data already stored in storage 113, or in other words that storage 113 already holds a backup copy of the first data, at stage 902 the system 105 may opt not to store a further copy of the first data in the storage 113, and may take no action. Such a situation may be expected to arise, for example, where the computer storage system 104 is configured to periodically send a copy of data stored in storage 110, i.e. first data, to backup storage and data deduplication system 105, and wherein the data stored in storage 110 has not in fact changed since a previous instance of back-up of the data in storage 110 to backup storage and data deduplication system 105. In examples, in response to the affirmative determination at stage 901, at stage 902 the system 105 may erase any temporary copy of the first data that is held by the backup storage system from memory/storage, e.g. from memory 202 of data deduplication module 114.

Alternatively, if the determination at stage 901 is in the negative, indicating that the system 105 does not already hold a copy of the first data in storage 113, at stage 903, the system 105 may save a copy of the first data to storage 113.
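The determination of stage 901 and its two outcomes might be sketched as follows. This is an illustrative sketch only; modelling storage 113 as a dictionary keyed by a SHA-256 digest of the content, and the function name back_up, are assumptions not found in the disclosure.

```python
import hashlib
from typing import Dict


def back_up(backup_store: Dict[str, bytes], first_data: bytes) -> bool:
    """Store first_data unless an identical copy is already held; return True if stored."""
    key = hashlib.sha256(first_data).hexdigest()  # assumed way of detecting a match
    # Stage 901: does the first data match data already held in the backup storage?
    if key in backup_store:
        return False  # affirmative: no further copy is stored
    backup_store[key] = first_data  # negative: save a copy
    return True


backup_store: Dict[str, bytes] = {}
stored_first_time = back_up(backup_store, b"unchanged file")
stored_second_time = back_up(backup_store, b"unchanged file")
```

The second call corresponds to a periodic backup in which the data has not changed since the previous iteration, so no additional copy is written.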

Aspects of the disclosure, relating to data deduplication, have been described herein in the context of backup storage and data deduplication system 105, which comprises functionality both for backing-up of data, and also for executing certain procedures of a data deduplication operation. As described herein, this dual-functionality has certain practical advantages. However, aspects of the disclosure have broader utility than this configuration. For example, in simpler examples of an aspect of the disclosure, backup storage and data deduplication system 105 may be replaced by a data deduplication system that does not also include data backing-up functionality.

For example, in simpler examples, a ‘stand-alone’ data deduplication system may be provided, that is suitable for communication with a computer storage system, i.e. a computing device comprising data storage functionality, such as computer storage system 104. Such a data deduplication system may comprise: a receiving module, for example, receiving module 201, suitable for communication with a computer storage system to receive a copy of first data stored in the computer storage system; computer memory, for example, memory 202, comprising a machine-readable index stored in the computer memory in which data stored in the computer storage system is associated with a location of the data in the computer storage system; a determination module, for example, determination module 203, configured to determine, using the index, whether the copy of first data received by the receiving module from the computer storage system matches further data stored in the computer storage system, and in response to a determination that the first data matches further data stored in the computer storage system, identify, using the index, a location of the further data in the computer storage system, and generate a notification identifying the location of the further data in the computer storage system; and a sending module, for example, sending module 204, suitable for sending the notification to the computer storage system. In this example, a technical advantage resulting from the data deduplication system is that the process of determining whether some data, termed herein ‘first data’, matches other data, termed herein ‘further data’, stored in the computer storage system, does not need to be performed by the computer storage system, thereby conserving computational resource, such as processor and/or memory time, of the computer storage system. In examples, the computer storage system could be a storage resource shared between numerous client devices, e.g. numerous mutually remotely located devices. In other examples, the computer storage system could be a device having storage functionality dedicated to a processor integrated with the device. For example, the computer storage system could be a desktop/portable computer, or a smart-phone.

Although aspects of the present disclosure and their associated advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. Although processes of methods of the present disclosure have been described herein as occurring in a particular order, in other examples of the disclosure, processes of the methods may be performed in alternative orders, or may even be omitted from the method.

Claims

1. A method for managing data deduplication in a data deduplication system, the method comprising: receiving, by the data deduplication system, a copy of first data stored in a computer storage system in communication with the data deduplication system, determining, by the data deduplication system, whether the first data matches further data stored in the computer storage system, using an index stored in the data deduplication system in which data stored in the computer storage system is associated with a location of the data in the computer storage system, and in response to a determination that the first data matches further data stored in the computer storage system, identifying, by the data deduplication system using the index, a location of the further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the further data in the computer storage system.

2. The method as claimed in claim 1, wherein the receiving, by the data deduplication system, a copy of first data stored in the computer storage system comprises, receiving a copy of the first data labelled with a location of the first data in the computer storage system, and the method further comprises, in response to a determination that the first data does not match further data stored in the computer storage system, modifying, by the data deduplication system, the index, to include a representation of the first data associated with the location of the first data in the computer storage system.

3. The method as claimed in claim 1 or claim 2, wherein: the determining, by the data deduplication system, whether the first data matches further data stored in the computer storage system, comprises chunking, by the data deduplication system, the first data into a sequence of chunks of data using a chunking algorithm, and, for each chunk of data, determining, by the data deduplication system, whether the respective chunk of data matches further data stored in the computer storage system, using the index stored in the data deduplication system, and the identifying, by the data deduplication system using the index, a location of the further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the further data in the computer storage system, comprises for one or more of the chunks of data, identifying, by the data deduplication system using the index, a location of the respective further data in the computer storage system, and sending, by the data deduplication system, a notification to the computer storage system identifying the location associated with the respective further data in the computer storage system.

4. The method as claimed in any one of the preceding claims, further comprising storing a copy of the first data by the data deduplication system.

5. The method as claimed in claim 4, wherein storing the copy of the first data in the data deduplication system comprises, determining, by the data deduplication system, whether the first data matches stored data stored in the data deduplication system, and in response to a determination that the first data does not match stored data stored in the data deduplication system, storing the copy of the first data in the data deduplication system.

6. The method of claim 4 or claim 5, wherein the receiving, by the data deduplication system, a copy of first data stored in the computer storage system, comprises receiving a copy of the first data labelled with a location of the first data in the computer storage system, and the method further comprises storing in the data deduplication system a label associated with the copy of the first data identifying the location of the first data in the computer storage system.

7. The method of any one of the preceding claims, further comprising, sending, by the computer storage system, the copy of the first data stored in the computer storage system.

8. The method as claimed in any one of the preceding claims, further comprising, in response to receiving, using the computer storage system, the notification, storing in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

9. The method as claimed in any one of the preceding claims, further comprising, in response to receiving, using the computer storage system, the notification, checking, using the computer storage system, whether the first data matches the further data stored in the identified location of the computer storage system, and in response to a determination that the first data matches the further data, storing in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

10. The method as claimed in any one of claims 3 to 9, wherein the first data comprises data stored as a plurality of blocks in the computer storage system, and the method further comprises, in response to receiving, using the computer storage system, the notification, for each chunk of data, determining, using the computer storage system, one or more blocks of data comprising the entirety of the respective chunk of data.

11. A computer program comprising instructions, which, when executed by a computing system, cause the computing system to carry out the method of any one of claims 1 to 10.

12. A computer-readable data carrier having the computer program of claim 11 stored thereon.

13. A computing system comprising a data deduplication system, wherein the data deduplication system comprises: a receiving module suitable for communication with a computer storage system to receive a copy of first data stored in the computer storage system, computer memory comprising a machine-readable index stored in the computer memory in which data stored in the computer storage system is associated with a location of the data in the computer storage system, a determination module configured to determine, using the index, whether the copy of first data received by the receiving module from the computer storage system matches further data stored in the computer storage system, and in response to a determination that the first data matches further data stored in the computer storage system, identify, using the index, a location of the further data in the computer storage system, and generate a notification identifying the location of the further data in the computer storage system, and a sending module suitable for sending the notification to the computer storage system.

14. The computing system as claimed in claim 13, wherein the receiving module is further suitable for receiving from the computer storage system location data identifying a location of the first data in the computer storage system, and the determination module is further configured to, in response to a determination that the first data does not match further data stored in the computer storage system, modify the index to include a representation of the first data associated with the location of the first data in the computer storage system.

15. The computing system as claimed in claim 13 or claim 14, wherein the determination module is configured to chunk a copy of first data received from the computer storage system into a sequence of chunks of data using a chunking algorithm, and, for each chunk of data, determine, using the index, whether the respective chunk of data matches further data stored in the computer storage system, and for one or more of the chunks of data, identify, using the index, a location of the respective further data in the computer storage system, and generate a notification identifying the location associated with the respective further data in the computer storage system.

16. The computing system as claimed in any one of claims 13 to 15, wherein the determination module is further configured to store a copy of the first data in the computer memory.

17. The computing system as claimed in claim 16, wherein the determination module is configured to determine whether the first data matches stored data stored in the computer memory, and in response to a determination that the first data does not match stored data stored in the computer memory, store a copy of the first data in the computer memory.

18. The computing system of any one of claims 14 to 17, wherein the determination module is further configured to store in the computer memory of the data deduplication system a label associated with the copy of the first data identifying the location of the first data in the computer storage system.

19. The computing system of any one of claims 13 to 18, further comprising the computer storage system in communication with the receiving module of the data deduplication system.

20. The computing system of claim 19, wherein the computer storage system is configured to send the copy of first data stored in the computer storage system.

21. The computing system of claim 19 or claim 20, wherein the computer storage system is configured to, in response to receiving the notification, store in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

22. The computing system as claimed in any one of claims 19 to 21, wherein the computer storage system is configured to, in response to receiving the notification, determine whether the first data matches the further data stored in the identified location of the computer storage system, and in response to a determination that the first data matches the further data, store in place of the first data in the computer storage system a reference identifying the identified location of the further data in the computer storage system.

23. The computing system as claimed in any one of claims 19 to 22, wherein the first data comprises data stored as a plurality of blocks in the computer storage system, and the computer storage system is configured to, in response to receiving the notification, for each chunk of data, determine one or more blocks of data comprising the entirety of the respective chunk of data.
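
By way of non-limiting illustration only, the interaction recited in the claims above (chunking the received copy, looking each chunk up in an index that associates data with its location in the computer storage system, notifying the storage system of a match, and the storage system checking the match before storing a reference) might be sketched as follows. The sketch is not the claimed implementation. It assumes fixed-size chunking and SHA-256 digests as the index key, neither of which is specified by the claims, and the names `DedupSystem`, `StorageSystem`, `Notification` and `CHUNK_SIZE` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # hypothetical fixed chunk size, kept tiny for illustration


@dataclass
class Notification:
    duplicate_location: int  # location of the newly received chunk in storage
    original_location: int   # location of the matching further data in storage
    length: int              # chunk length in bytes


@dataclass
class DedupSystem:
    # Index: chunk digest -> location of that data in the storage system.
    index: dict = field(default_factory=dict)

    def receive(self, data: bytes, location: int) -> list:
        """Chunk a received copy and return one notification per duplicate chunk."""
        notifications = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            chunk_location = location + offset
            digest = hashlib.sha256(chunk).hexdigest()
            known = self.index.get(digest)
            if known is None:
                # No match: record the chunk and its location in the index.
                self.index[digest] = chunk_location
            elif known != chunk_location:
                # Match: report where the further data already resides.
                notifications.append(
                    Notification(chunk_location, known, len(chunk)))
        return notifications


@dataclass
class StorageSystem:
    blob: bytes
    # Duplicate location -> location of the data that a reference points to.
    references: dict = field(default_factory=dict)

    def apply(self, note: Notification) -> bool:
        """Check the match byte for byte before storing a reference."""
        dup = self.blob[note.duplicate_location:
                        note.duplicate_location + note.length]
        orig = self.blob[note.original_location:
                         note.original_location + note.length]
        if dup != orig:
            return False  # stale or colliding index entry; keep the data
        self.references[note.duplicate_location] = note.original_location
        return True


storage = StorageSystem(b"AAAABBBBAAAA")
dedup = DedupSystem()
notes = dedup.receive(storage.blob, location=0)
applied = [storage.apply(n) for n in notes]
```

In this example the third chunk duplicates the first, so a single notification is produced identifying location 0 as the location of the further data, and the storage system records a reference from location 8 to location 0 after its own comparison succeeds. A real storage system would additionally reclaim the space occupied by the duplicate and map chunks onto its block layout, which the sketch omits.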