WO2022258184A1 - Method, device, and system of data management for peer-to-peer data transfer - Google Patents

Method, device, and system of data management for peer-to-peer data transfer

Info

Publication number
WO2022258184A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
file
blocks
data storage
constituent
Prior art date
Application number
PCT/EP2021/065636
Other languages
French (fr)
Inventor
Idan Zach
Assaf Natanzon
Michael Sternberg
Yossi Siles
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/065636 priority Critical patent/WO2022258184A1/en
Publication of WO2022258184A1 publication Critical patent/WO2022258184A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata
    • G06F16/148File search processing
    • G06F16/152File search processing using file content signatures, e.g. hash values
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1834Distributed file systems implemented based on peer-to-peer networks, e.g. gnutella
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065Replication mechanisms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • the duplicate data files and the similar data files which may be stored on the same data center are de-duplicated (i.e., their backup layout is de-duplicated), and hence, the de-duped data files may be stored in a backup repository, which is not desirable and results in storage inefficiency.
  • same portions of the same file may be stored in different storage systems.
  • recovery is done directly from one location.
  • the entire object/file from the same backup location is recovered.
  • there are several problems associated with this approach as there is inferior recovery resilience due to a lack of redundancy. In other words, when all the data is available in one place, the risk of not being able to recover the data is much higher. Further, the conventional approach is limited to one type of data. Thus, there is a technical problem of high recovery time and limited utilization of bandwidth resources.
  • dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size.
  • dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks using a variable chunking algorithm, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks using the same variable chunking algorithm.
  • determining the one or more locations includes querying the generated index.
  • the querying of the generated index enables generation of a list of all hash values of the blocks that make up the entire requested file.
  • determining the one or more locations comprises determining all available locations for each constituent block.
  • transferring the data includes establishing a P2P connection to a new location for a constituent block if a transfer for the constituent block fails.
  • the method 100 comprises dividing each file in the data storage system into a plurality of blocks. For example, a video file or an audio file is divided into the plurality of blocks while saving the video file or the audio file in the data storage system.
  • Each file in the data storage system is divided into the plurality of blocks to enable improved recovery, such as when a portion of a file, i.e., some block of the file, is required by a given data storage unit. Further, dividing each file into the plurality of blocks also enables data deduplication, i.e., duplicate blocks in the file may be identified easily.
  • the index can run different types of queries and perform analytics to supply insights on a customer's enterprise storage, for example, finding hot and cold data according to customer policy.
  • the index provides a service of unstructured data management. The index provides a single pane of glass over the entire enterprise storage.
  • the method 100 further comprises in response to the request for the file at the specified location, establishing a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location.
  • the method 100 comprises setting-up a peer-to-peer file-sharing protocol and providing all the meta-data information for the requested file.
  • the peer-to-peer protocol is used to recover/restore the requested file based on the meta-data information stored in the index.
  • the peer-to-peer file sharing may refer to the distribution and sharing of digital media using peer-to-peer networking technology.
  • the peer-to-peer file sharing allows users to access media files such as books, music, movies, and games using a peer-to-peer software program that searches for other connected computers on a peer-to-peer network to locate the desired content in a distributed application architecture that partitions tasks or workloads between peers.
  • the method 100 comprising dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size.
  • the method 100 comprises splitting each file into a series of non-overlapping fixed-size blocks of a defined size in bytes. Further, hash values of such fixed-size blocks are calculated for storage in the index.
  • the method 100 comprises dividing the requested file into the plurality of blocks having the same common size. Further, hash values of such plurality of blocks may be determined and matched against the hash values already (i.e., previously) stored in the index. By virtue of dividing each file into a plurality of blocks having a common size, indexing of the file is improved for storage in the index. Further, deduplication can be implemented to identify duplicate blocks.
  • the method 100 comprising determining the one or more locations includes querying the generated index.
  • a search may be executed on the index to determine the plurality of constituent blocks that make up the requested file.
  • the method 100 comprises executing a query on the index and generating a list of all fingerprints (i.e., hash values) of blocks that build the entire requested file.
  • the present disclosure provides a computer-readable medium configured to store instruction which, when executed by a processor, cause the processor to perform the method 100.
  • a computer program product comprising a non-transitory computer-readable medium having computer instructions stored thereon, the computer instructions being executable by the processor to execute the method 100.
  • Examples of implementation of the non-transitory computer-readable storage medium include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory.
  • a computer readable storage medium for providing a non-transient memory may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • FIG. 2 is a block diagram that illustrates various exemplary components of a data management device, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is described in conjunction with elements from FIG. 1.
  • a block diagram 200 of a data management device 202 for a data storage system that includes a plurality of data storage units 204, a memory 206 and a processor 208.
  • the memory 206 may further include an index 210.
  • the plurality of data storage units 204 and the memory 206 may be communicatively coupled to the processor 208.
  • the processor 208 of the data management device 202 executes the method 100 (of FIG. 1).
  • the data storage system is described in detail, for example, in FIG. 3.
  • Examples of implementation of the memory 206 may include, but are not limited to, an Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory.
  • The memory 206 may store an operating system or other program products (including one or more operation algorithms) to operate the data management device 202.
  • the data management device 202 is configured to divide each file in the data storage system into a plurality of blocks. For example, a video file or an audio file is divided into the plurality of blocks while saving the video file or the audio file in the data storage system. Each file in the data storage system is divided into the plurality of blocks to enable improved recovery, such as when a portion of a file, i.e., some block of the file, is required by a given data storage unit. Further, dividing each file into the plurality of blocks also enables data deduplication, i.e., duplicate blocks in the file may be identified easily.
  • the data management device 202 is further configured to generate an index 210 comprising the calculated hash value and an associated address for each block in the data storage system.
  • the index 210 is generated to store the calculated hash value and associated address for each block to enable quick recovery when a similar block is required by any data storage unit.
  • a data storage unit identifier may also be stored in the index 210 to enable identifying the data storage unit storing the respective block.
  • the index 210 is used to store meta data information about each file and hash values of all the blocks of the file.
  • the index 210 may further store data such as an identifier along with the hash value of each block to identify the file each block is associated with.
  • the index 210 can run different types of queries and perform analytics to supply insights on a customer's enterprise storage, for example, finding hot and cold data according to customer policy.
  • the index 210 provides a service of unstructured data management.
  • the index 210 provides a single pane of glass over the entire enterprise storage.
  • the data management device 202 is further configured to, in response to a request for a file at a specified location, determine a plurality of constituent blocks that make up the requested file.
  • the plurality of constituent blocks that make up the requested file are determined based on the data stored in the index 210.
  • the index 210 stores meta-data information about each file and identifiers to enable identification of blocks associated with each file.
  • the plurality of constituent blocks that make up the requested file are determined from the index 210.
  • the data management device 202 is further configured to, in response to a request for a file at a specified location, determine one or more locations for each constituent block.
  • the index 210 stores the associated address for each block in the data storage system.
  • the one or more locations for each constituent block of the requested file is determined based on the index 210.
  • the available locations of that fingerprint are found inside the index 210.
  • the same fingerprint may exist multiple times in multiple different files.
  • the data management device 202 of the present disclosure provides an improved data transfer in comparison to conventional approaches.
  • the data management device 202 enables peer-to-peer connection between the plurality of data storage units for data sharing or data transfer.
  • the data management device 202 provides a de-centralized way of data sharing among the plurality of data storage units in comparison to the conventional approach where data sharing is very centralized i.e., data stored in one location is provided to all data storage units conventionally.
  • the data management device 202 enables recovery of data from multiple locations (i.e., the data storage units) based on the index.
  • the data management device 202 provides an improved (i.e., reduced) recovery time for recovering constituent blocks of the requested file.
  • the data management device 202 enables improved utilization of computational resources as the data management device 202 allows recovering partial blocks from different locations with different latencies and resource limitations.
  • hash values of such plurality of blocks may be determined and matched against the hash values already (i.e., previously) stored in the index 210.
  • indexing of the file is improved for storage in the index 210.
  • deduplication can be implemented to identify duplicate blocks.
  • in the data management device 202, determining the one or more locations comprises determining all available locations for each constituent block.
  • the data management device 202 is configured to determine the one or more locations for each constituent block in the plurality of data storage units 204.
  • Each of the constituent blocks may be stored at different locations in the plurality of data storage units 204.
  • The locations of all such constituent blocks are determined and then used for transferring the blocks based on the requested file.
  • the transferring of the blocks is faster as compared to the conventional approach.
  • the data management device 202 is further configured to calculate one or more new hash values and update the index 210 for any updated files in the data storage system.
  • the blocks which have hash values stored in the index 210 may get changed or updated; in such a case, the new hash values for such blocks are calculated and updated in the index 210.
  • the changed or updated blocks can be recovered.
  • FIG. 4 is a network environment diagram of communication between the data storage system and the data management device, in accordance with an embodiment of the present disclosure.
  • FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3.
  • a network environment 400 that includes data centers 402A and 402B, and an AI (Artificial Intelligence) catalogue service 404.
  • the data centers 402A and 402B further include NAS (Network Attached Storage) 406A and 406B, a data object 408, a production ESX server 410, a production MSSQL 412, and a production Oracle 414.
  • the NAS 406A and 406B include collectors 416.
  • the AI catalogue service 404 includes an AI catalogue engine 418, a database 420 and an internal collector 422.
  • FIG. 5 is an illustration of an exemplary scenario of implementation of the data storage system and the data management device, in accordance with an embodiment of the present disclosure.
  • FIG. 5 is described in conjunction with elements from FIGs. 1, 2, 3, and 4.
  • FIG. 5 there is shown an implementation scenario 500.
  • There are shown storage systems 502A to 502N, a target system 504, a catalog 506, and a peer-to-peer file-sharing protocol 508.
  • Each of the storage systems 502A to 502N includes storage disks 510A to 510N.
  • blocks 512A to 512N are further shown.
  • the storage systems 502A to 502N may also be referred to as data storage units such as data storage units 204.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-implemented method of data management in a data storage system comprising data storage units. The method comprises dividing each file in the data storage system into blocks; calculating a hash value for each block using a hash algorithm; and generating an index comprising the calculated hash value and an associated address for each block in the data storage system. The method further comprises, in response to a request for a file at a specified location: determining the constituent blocks that make up the requested file, determining one or more locations for each constituent block, establishing a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transferring the data for each constituent block to the specified location. The method provides improved recovery time for recovering blocks of the requested file. Further, the method also enables improved utilization of computational resources.

Description

METHOD, DEVICE, AND SYSTEM OF DATA MANAGEMENT FOR PEER-TO-PEER DATA TRANSFER
TECHNICAL FIELD
The present disclosure relates generally to the field of data management; and more specifically, to a method, a data management device, and a data storage system comprising data storage units for peer-to-peer connection to transfer data.
BACKGROUND
Typically, data related to an entity, such as an enterprise or an organization, is stored as multiple data files and data objects, which are further stored on various physical machines, or on virtual machines located on different hosts (e.g., VMware, Hyper-V, and the like) and data centers (e.g., network attached servers (NAS)). When data is stored as multiple data files and data objects, duplicate data files may exist, and similarity (i.e., partially same data) may also exist among the multiple data files, while saving the data related to the entire enterprise or organization. Currently, certain approaches are available to reduce the storage of duplicate data files and similar (i.e., partially same) data files, such as the use of a secondary storage cluster, data de-duplication, or a backup layout. In case of use of a secondary storage cluster, depending on the scalability of the secondary storage cluster and the size of the data center, more than one secondary storage cluster may be needed in order to protect one data center. Data de-duplication may be defined as a technique to eliminate duplicate copies of repeating data. In the backup layout, different layouts may be used to store the backed-up data, such as incremental block level, file level, and data de-duplication. In all these layouts, the data to be stored is either compressed or encrypted. In this way, the duplicate data files and the similar data files which may be stored on the same data center are de-duplicated (i.e., their backup layout is de-duplicated), and hence the de-duped data files may be stored in a backup repository, which is not desirable and results in storage inefficiency. In a large multi-system storage environment, the same portions of the same file may be stored in different storage systems. In most existing storage systems, recovery is done directly from one location. As a result, during recovery of data, the entire object/file is recovered from the same backup location. However, there are several problems associated with this approach, as there is inferior recovery resilience due to a lack of redundancy. In other words, when all the data is available in one place, the risk of not being able to recover the data is much higher. Further, the conventional approach is limited to one type of data. Thus, there is a technical problem of high recovery time and limited utilization of bandwidth resources.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional methods of data transfer.
SUMMARY
The present disclosure seeks to provide a method and a system comprising data storage units for peer-to-peer connection to transfer data. The present disclosure seeks to provide a solution to the existing problem of inefficient data transfer, i.e., how to further reduce recovery time and improve resource utilization as compared to conventional methods and systems. An aim of the present disclosure is to provide a solution that overcomes, at least partially, the problems encountered in the prior art, and provides improved methods and systems that offer efficient and reliable data transfer with reduced recovery time and improved resource utilization.
The one or more objects of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In one aspect, the present disclosure provides a computer-implemented method of data management in a data storage system comprising a plurality of data storage units, the method comprising: dividing each file in the data storage system into a plurality of blocks; calculating a hash value for each block using a hash algorithm; generating an index comprising the calculated hash value and an associated address for each block in the data storage system; and in response to a request for a file at a specified location: determining a plurality of constituent blocks that make up the requested file, determining one or more locations for each constituent block, establishing a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transferring the data for each constituent block to the specified location.
The method of the present disclosure provides an improved data transfer in comparison to conventional approaches. The method enables peer-to-peer connection between the plurality of data storage units for data sharing or data transfer. Thus, the method provides a de-centralized way of data sharing among the plurality of data storage units in comparison to the conventional approach where data sharing is very centralized i.e., data stored in one location is provided to all data storage units conventionally. The method enables recovery of data from multiple locations (i.e., the data storage units) based on the index. Thus, the method provides an improved (i.e., reduced) recovery time for recovering constituent blocks of the requested file. Further, the method enables improved utilization of computational resources as the method allows recovering partial blocks from different locations with different latencies and resource limitations.
In an implementation form, dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size.
By virtue of dividing each file into a plurality of blocks having a common size, indexing of the file is improved for storage in the index. Further, deduplication can be implemented to identify duplicate blocks.
In a further implementation form, dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks using a variable chunking algorithm, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks using the same variable chunking algorithm.
By virtue of dividing each file into a plurality of blocks using a variable chunking algorithm, indexing of the file is improved for storage in the index. Further, deduplication can be implemented to identify duplicate blocks.
In a further implementation form, determining the one or more locations includes querying the generated index. Querying the generated index enables generation of a list of all hash values of the blocks that make up the entire requested file.
In a further implementation form, determining the one or more locations comprises determining all available locations for each constituent block.
The one or more locations of all constituent blocks are determined and then used for transferring the blocks based on the requested file. Thus, the transferring of the blocks is faster as compared to the conventional approach.
In a further implementation form, transferring the data includes transferring data for one constituent block from a plurality of locations concurrently and stopping the remaining transfers for the constituent block when one of the plurality of transfers is complete.
Thus, the method enables recovering the same block from multiple locations concurrently and stopping the recovery when a first block is received. This improves the entire file recovery and end-to-end latency.
In a further implementation form, transferring the data includes establishing a P2P connection to a new location for a constituent block if a transfer for the constituent block fails.
Beneficially, this ensures all constituent blocks are restored and no constituent block is lost during transfer of the data.
In a further implementation form, the method further comprises calculating one or more new hash values and updating the index for any updated files in the data storage system.
The new hash values for such blocks are calculated and updated in the index to enable recovery of the changed or updated blocks when there is a file request.
In another aspect, the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method of the previous aspect.
The computer-readable medium achieves all the advantages and effects of the method of the present disclosure. In another aspect, the present disclosure provides a data management device for a data storage system comprising a plurality of data storage units, the data management device configured to: divide each file in the data storage system into a plurality of blocks; calculate a hash value for each block using a hash algorithm; generate an index comprising the calculated hash value and an associated address for each block in the data storage system; and in response to a request for a file at a specified location: determine a plurality of constituent blocks that make up the requested file, determine one or more locations for each constituent block, establish a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transfer the data for each constituent block to the specified location.
The data management device achieves all the advantages and effects of the method of the present disclosure.
In another aspect, the present disclosure provides a data storage system comprising: one or more interconnected data storage units; and the data management device of the previous aspect.
The data storage system achieves all the advantages and effects of the data management device as well as the method of the present disclosure.
It is to be appreciated that all the aforementioned implementation forms can be combined. It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims. Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a flowchart of a method for establishing peer-to-peer connection to transfer data, in accordance with an embodiment of the present disclosure;
FIG. 2 is a block diagram that illustrates various exemplary components of a data management device, in accordance with an embodiment of the present disclosure;
FIG. 3 is a block diagram that illustrates various exemplary components of a data storage system, in accordance with an embodiment of the present disclosure;
FIG. 4 is a network environment diagram of communication between the data storage system and the data management device, in accordance with an embodiment of the present disclosure; and
FIG. 5 is an illustration of an exemplary scenario of implementation of the data storage system and the data management device, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
FIG. 1 is a flowchart of a method for establishing peer-to-peer connection to transfer data, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a computer implemented method 100 of data management in a data storage system comprising a plurality of data storage units. The method 100 is executed at a data management device described, for example, in FIG. 2. The method 100 includes steps 102 to 114.
In one aspect, the present disclosure provides a computer-implemented method 100 of data management in a data storage system comprising a plurality of data storage units, the method 100 comprising: dividing each file in the data storage system into a plurality of blocks; calculating a hash value for each block using a hash algorithm; generating an index comprising the calculated hash value and an associated address for each block in the data storage system; and in response to a request for a file at a specified location: determining a plurality of constituent blocks that make up the requested file, determining one or more locations for each constituent block, establishing a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transferring the data for each constituent block to the specified location.
The present disclosure provides a computer-implemented method 100 of data management in a data storage system comprising a plurality of data storage units. The data storage unit may include all types of sources, such as NAS (Network-attached storage), VM (Virtual Machine), S3 Object Storage, Kubernetes, storage servers, production servers, and the like in an entire enterprise storage. The data storage system represents a storage system of the entire enterprise storage or an organization. The method 100 of the present disclosure enables peer-to-peer connection between the plurality of data storage units for data sharing or data transfer. In other words, the present method 100 provides a de-centralized method of data sharing among the plurality of data storage units, in comparison to the conventional approach where data sharing is very centralized, i.e., data stored in one location is provided to all data storage units conventionally. The method 100 enables recovery of data from multiple locations (i.e., the data storage units) concurrently by leveraging an index or a catalog. Thus, the method 100 improves recovery time and reduces computational resources, as it allows recovering partial blocks from different locations with different latencies and resource limitations.
Each of the data storage units generates a file whenever new data is created or existing data is updated by virtual machines or physical machines associated with servers or network attached storages. Such a file may further be requested by any of the data storage units.
At step 102, the method 100 comprises dividing each file in the data storage system into a plurality of blocks. For example, a video file or an audio file is divided into the plurality of blocks while saving the video file or the audio file in the data storage system. Each file in the data storage system is divided into the plurality of blocks to enable improved recovery, such as when a portion of a file, i.e., some block of the file, is required by a given data storage unit. Further, dividing each file into the plurality of blocks also enables data deduplication, i.e., duplicate blocks in the file may be identified easily.
In an example, the data storage system may further comprise one or more collectors. Such collectors enable collection of the files from the data storage units. The collection can be performed periodically or as a live (continuous) update.
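By way of illustration only, the following Python sketch shows one possible form such a collector could take: it periodically scans a directory tree and hands new or modified files to the indexing step. The root path, scan interval, and callback are assumptions made for this sketch and are not part of the disclosure.

    import os
    import time

    def collect_files(root, interval_seconds, on_file):
        """Periodically scan 'root' and report new or modified files to 'on_file'."""
        last_seen = {}  # path -> last observed modification time
        while True:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    mtime = os.path.getmtime(path)
                    if last_seen.get(path) != mtime:
                        last_seen[path] = mtime
                        on_file(path)  # hand the file to the block-splitting and hashing steps
            # periodic collection; a live (continuous) update could instead use
            # filesystem change notifications rather than polling
            time.sleep(interval_seconds)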
At step 104, the method 100 further comprises calculating a hash value for each block using a hash algorithm. In other words, the method 100 further comprises determining a fingerprint (i.e., hash value) for each block of the plurality of blocks of the file. Such hash values are distinct for every block to enable identification of different blocks. In an example, duplicate blocks may have the same hash values. In an example, a secure hash algorithm 1 (SHA-1) may be used to generate the hash for each block of the plurality of blocks. Generally, the secure hash algorithm 1 (SHA-1) may be defined as a cryptographic hash function which takes an input and generates a 160-bit hash value as output.
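By way of example only, the short Python sketch below illustrates steps 102 and 104 together: it splits a file into non-overlapping fixed-size blocks and computes a SHA-1 fingerprint for each block. The 4 MiB block size is an assumption chosen for illustration; the disclosure does not prescribe a particular size.

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # assumed fixed block size, for illustration only

    def hash_blocks(path, block_size=BLOCK_SIZE):
        """Yield (offset, sha1_hex) for each non-overlapping block of the file at 'path'."""
        offset = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                # SHA-1 is used here as in the example above; duplicate blocks yield the same fingerprint
                yield offset, hashlib.sha1(block).hexdigest()
                offset += len(block)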
At step 106, the method 100 further comprises generating an index comprising the calculated hash value and an associated address for each block in the data storage system. The index may also be referred to as a catalog, a catalog engine, or an artificial intelligence (AI) catalog engine. The index is generated to store the calculated hash value and associated address for each block to enable quick recovery when a similar block is required by any data storage unit. In an example, a data storage unit identifier may also be stored in the index to enable identifying the data storage unit storing the respective block. In other words, the index is used to store meta-data information about each file and the hash values of all the blocks of the file. In an example, the index may further store data such as an identifier along with the hash value of each block to identify the file each block is associated with. Thus, when a recovery of some file is required, the meta-data information as well as the list of blocks of the file are retrieved from the index. In an example, the index can run different types of queries and perform analytics to supply insights on a customer's enterprise storage, for example, finding hot and cold data according to customer policy. In an example, the index provides a service of unstructured data management. The index provides a single pane of glass over the entire enterprise storage.
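One possible in-memory shape for such an index (catalog) is sketched below: per-file metadata keeps the ordered list of block fingerprints, and a reverse map records every location at which a given fingerprint is available. The class name, field names, and location tuple are illustrative assumptions, not the claimed structure.

    from collections import defaultdict

    class CatalogIndex:
        """Minimal sketch of an index mapping files to block hashes and hashes to locations."""

        def __init__(self):
            self.file_blocks = {}                    # file id -> ordered list of (offset, block hash)
            self.block_locations = defaultdict(set)  # block hash -> {(storage unit id, path, offset), ...}

        def add_file(self, file_id, storage_unit_id, path, blocks):
            """Register a file given its (offset, hash) block list collected from a storage unit."""
            self.file_blocks[file_id] = list(blocks)
            for offset, block_hash in self.file_blocks[file_id]:
                self.block_locations[block_hash].add((storage_unit_id, path, offset))

        def constituent_blocks(self, file_id):
            """Ordered (offset, hash) list that makes up the requested file."""
            return self.file_blocks[file_id]

        def locations(self, block_hash):
            """All available locations for one constituent block."""
            return self.block_locations[block_hash]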
The method 100 may further comprise receiving a request for a given file by a data storage unit. In response to the request, the method 100 may provide the given file from different locations, in comparison to the conventional approach where such files are provided from a single location. Thus, the method 100 of the present disclosure is more reliable, faster, and has lower latency. In other words, the method 100 stores and retrieves data in a decentralized manner, in comparison to conventional approaches where data is stored at a single location, i.e., at a secondary storage.
At step 108, the method 100 further comprises, in response to the request for the file at a specified location, determining a plurality of constituent blocks that make up the requested file. The plurality of constituent blocks that make up the requested file are determined based on the data stored in the index. In other words, the index stores meta-data information about each file and identifiers to enable identification of blocks associated with each file. Thus, based on the request for the file, the plurality of constituent blocks that make up the requested file are determined from the index.
At step 110, the method 100 further comprises, in response to the request for the file at the specified location, determining one or more locations for each constituent block. The index stores the associated address for each block in the data storage system. Thus, the one or more locations for each constituent block of the requested file are determined based on the index. In other words, for each fingerprint (i.e., hash value), the available locations of that fingerprint are found inside the index. In an example, the same fingerprint may exist multiple times in multiple different files.
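In terms of the hypothetical CatalogIndex sketched above, steps 108 and 110 reduce to two lookups: the ordered block list of the requested file, followed by every available location of each block fingerprint. A minimal query helper might look as follows; the function and variable names are assumptions.

    def plan_recovery(index, file_id):
        """Return [(offset, block_hash, [candidate locations...]), ...] for the requested file."""
        plan = []
        for offset, block_hash in index.constituent_blocks(file_id):
            # the same fingerprint may exist multiple times, in multiple files and storage units
            candidates = sorted(index.locations(block_hash))
            plan.append((offset, block_hash, candidates))
        return plan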
At step 112, the method 100 further comprises, in response to the request for the file at the specified location, establishing a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location. In other words, the method 100 comprises setting up a peer-to-peer file-sharing protocol and providing all the meta-data information for the requested file. The peer-to-peer protocol is used to recover/restore the requested file based on the meta-data information stored in the index. Peer-to-peer file sharing may refer to the distribution and sharing of digital media using peer-to-peer networking technology. Peer-to-peer file sharing allows users to access media files such as books, music, movies, and games using a peer-to-peer software program that searches for other connected computers on a peer-to-peer network to locate the desired content in a distributed application architecture that partitions tasks or workloads between peers.
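The disclosure leaves the file-sharing protocol open (exemplary protocols such as BitTorrent or IPFS are named later). Purely as an illustrative stand-in, the sketch below requests a single block from a peer by its fingerprint over a TCP connection; the message format, framing, and size handling are invented for this sketch and do not represent the BitTorrent or IPFS wire formats.

    import socket

    def fetch_block_from_peer(host, port, block_hash, expected_size):
        """Toy request/response exchange: send the block hash, read back the raw block bytes."""
        with socket.create_connection((host, port), timeout=10) as conn:
            conn.sendall(b"GET_BLOCK " + block_hash.encode("ascii") + b"\n")
            chunks = []
            remaining = expected_size
            while remaining > 0:
                data = conn.recv(min(65536, remaining))
                if not data:
                    raise ConnectionError("peer closed the connection before sending the full block")
                chunks.append(data)
                remaining -= len(data)
        return b"".join(chunks)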
At step 114, the method 100 further comprises, in response to the request for the file at the specified location, transferring the data for each constituent block to the specified location. Thus, the method 100 comprises recovery of data (i.e., the requested file) from one or more locations concurrently. As the data is transferred from the one or more locations concurrently, the total time of the file recovery is reduced dramatically. Beneficially, the method 100 improves recovery time and reduces utilization of computational resources as it allows recovery of partial blocks from different locations with different latencies and resource limitations, in comparison to conventional approaches where such data is stored and recovered from one location, i.e., a secondary storage, which is time consuming and utilizes more computational resources. Thus, the method 100 of the present disclosure provides an improved data transfer in comparison to conventional approaches. The method 100 enables peer-to-peer connection between the plurality of data storage units for data sharing or data transfer. Thus, the method 100 provides a de-centralized way of data sharing among the plurality of data storage units, which increases reliability of recovery, in comparison to the conventional approach where data sharing is very centralized, i.e., data stored in one location is provided to all data storage units conventionally. The method 100 enables recovery of data from multiple locations (i.e., the data storage units) based on the index. Thus, the method 100 provides an improved (i.e., reduced) recovery time for recovering constituent blocks of the requested file while still increasing reliability of recovery. Further, the method 100 enables improved utilization of computational resources as the method 100 allows recovering partial blocks from different locations with different latencies and resource limitations.
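Tying the previous sketches together, a restore at the specified location can dispatch one transfer per constituent block in parallel and place each received block at its offset in the rebuilt file. The thread pool, the fetch callable, and the verification step are assumptions of this sketch; an actual implementation would ride on the chosen peer-to-peer protocol.

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    def restore_file(plan, fetch, out_path):
        """plan: [(offset, block_hash, [locations...])]; fetch(location, block_hash) -> bytes."""
        def recover(entry):
            offset, block_hash, locations = entry
            data = fetch(locations[0], block_hash)  # first candidate; failover is sketched further below
            assert hashlib.sha1(data).hexdigest() == block_hash  # verify the fingerprint of the received block
            return offset, data

        with open(out_path, "wb") as out, ThreadPoolExecutor(max_workers=8) as pool:
            for offset, data in pool.map(recover, plan):
                out.seek(offset)   # store each received constituent block at its specified offset
                out.write(data)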
According to an embodiment, in the method 100, dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size. In other words, the method 100 comprises splitting each file into a series of non-overlapping fixed-size blocks of a defined size in bytes. Further, hash values of such fixed-size blocks are calculated for storage in the index. In an example, the method 100 comprises dividing the requested file into the plurality of blocks having the same common size. Further, hash values of such plurality of blocks may be determined and matched against the hash values already (i.e., previously) stored in the index. By virtue of dividing each file into a plurality of blocks having a common size, indexing of the file is improved for storage in the index. Further, deduplication can be implemented to identify duplicate blocks.
According to an embodiment, in the method 100, dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks using a variable chunking algorithm, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks using the same variable chunking algorithm. In other words, the method 100 comprises splitting each file into a series of non-overlapping variable-length blocks using the variable chunking algorithm. Further, hash values of such variable-length blocks are calculated for storage in the index. In an example, the method 100 comprises dividing the requested file into the plurality of blocks having variable size. Further, hash values of such plurality of blocks may be determined and matched against the hash values already (i.e., previously) stored in the index. The method 100 may use a variable chunking algorithm which is already known in the art. By virtue of dividing each file into a plurality of blocks using a variable chunking algorithm, indexing of the file is improved for storage in the index. Further, deduplication can be implemented to identify duplicate blocks.
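The variable chunking algorithm itself is not specified in the disclosure. One common family known in the art is content-defined chunking with a rolling checksum; the simplified sketch below cuts a block whenever the checksum satisfies a boundary condition, so that block boundaries follow content rather than fixed offsets. The checksum, mask, and size limits are illustrative assumptions and not the claimed algorithm.

    def variable_chunks(data, mask=(1 << 13) - 1, min_size=2048, max_size=65536):
        """Content-defined chunking sketch: yield (offset, block bytes) with content-driven boundaries."""
        start = 0
        rolling = 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF  # crude rolling checksum, for illustration only
            size = i - start + 1
            at_boundary = size >= min_size and (rolling & mask) == 0
            if at_boundary or size >= max_size:
                yield start, data[start:i + 1]
                start = i + 1
                rolling = 0
        if start < len(data):
            yield start, data[start:]  # trailing block shorter than the boundary condition requires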
According to an embodiment, in the method 100, determining the one or more locations includes querying the generated index. In an example, a search may be executed on the index to determine the plurality of constituent blocks that make up the requested file. In other words, the method 100 comprises executing a query on the index and generating a list of all fingerprints (i.e., hash values) of blocks that build the entire requested file.
According to an embodiment, in the method 100, determining the one or more locations comprises determining all available locations for each constituent block. In other words, the method 100 comprises determining the one or more locations for each constituent block in the plurality of data storage units. Each of the constituent blocks may be stored at different locations in the plurality of data storage units. Thus, the locations of all such constituent blocks are determined and then used for transferring the blocks based on the requested file. Thus, the transferring of the blocks is faster as compared to the conventional approach.
According to an embodiment, in the method 100, transferring the data includes transferring data for one constituent block from a plurality of locations concurrently and stopping the remaining transfers for the constituent block when one of the plurality of transfers is complete. In other words, the method 100 comprises concurrently accessing different locations (i.e., one or more locations) for transferring data for one constituent block. Thus, the method 100 enables recovering the same block from multiple locations concurrently. Further, the method 100 stops recovering when a first block is received. Beneficially, this improves the entire file recovery and end-to-end latency.
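The behaviour described in this implementation form, starting the same block transfer from several locations and stopping the remaining transfers once one copy arrives, can be sketched with Python's concurrent.futures as below; the fetch callable and location values are assumptions carried over from the earlier sketches, and cancellation of already-running transfers is best effort.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def fetch_first(locations, block_hash, fetch):
        """Start one transfer of the block per location and return the first copy that completes."""
        pool = ThreadPoolExecutor(max_workers=max(1, len(locations)))
        try:
            futures = [pool.submit(fetch, location, block_hash) for location in locations]
            done, _pending = wait(futures, return_when=FIRST_COMPLETED)
            return next(iter(done)).result()  # the first completed transfer wins
        finally:
            # stop the remaining transfers; cancel_futures requires Python 3.9+
            pool.shutdown(wait=False, cancel_futures=True)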
According to an embodiment, in the method 100, transferring the data includes storing each received constituent block at a specified offset to build the requested file. In an example, the received constituent block may be stored in a defined sequential manner to enable building of the requested file. In an example, a given constituent block which is to be sequentially arranged prior to other constituent blocks may also be received at the offset prior to the other constituent blocks. According to an embodiment, in the method 100, transferring the data includes establishing a P2P connection to a new location for a constituent block if a transfer for the constituent block fails. In other words, if some failure occurs while transferring the data, the method 100 comprises trying a different location for the failed constituent block. Further, once all the blocks are received, the recovery/restore is done. Thus, the method 100 ensures all constituent blocks are restored and no constituent block is lost during transfer of the data.
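A per-block failover loop matching this implementation form might look like the following sketch: each received block is verified and stored at its specified offset, and a new location is tried if a transfer fails. Error handling is deliberately simplified and the helper names are assumptions.

    import hashlib

    def recover_block(out_file, offset, block_hash, locations, fetch):
        """Try each available location in turn until one transfer of the block succeeds."""
        for location in locations:
            try:
                data = fetch(location, block_hash)
            except Exception:
                continue  # transfer failed: establish a connection to a new location
            if hashlib.sha1(data).hexdigest() != block_hash:
                continue  # corrupted copy: try another location
            out_file.seek(offset)  # store the constituent block at its specified offset
            out_file.write(data)
            return True
        return False  # no available location could supply this constituent block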
According to an embodiment, in the method 100, the P2P connection is established using a BitTorrent protocol or an IPFS protocol. The BitTorrent protocol is a communication protocol for peer-to-peer file sharing (P2P), which enables users to distribute data and electronic files over the Internet in a decentralized manner. The IPFS protocol is a P2P hypermedia protocol designed to make the World Wide Web faster. The BitTorrent protocol and the IPFS protocol are exemplary protocols, and any other protocol which is known in the art may be used.
According to an embodiment, in the method 100, the request for a file is a restore request or a file copy request. In an example, a given data storage unit which already has a file or had previously provided the file for hash value storage in the index, may provide the file copy request. In an example, a given data storage unit which previously did not have the file, may provide a restore request.
According to an embodiment, the method 100 further comprises calculating one or more new hash values and updating the index for any updated files in the data storage system. The blocks which have hash values stored in the index may get changed or updated; in such a case, the new hash values for such blocks are calculated and updated in the index. Thus, when there is a file request, the changed or updated blocks can be recovered.
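Updating the index for a changed file can be sketched as re-hashing the file's current blocks and recording only the fingerprints that differ from the previous ones; the CatalogIndex and hash_blocks helpers referred to here are the hypothetical ones from the earlier sketches, and removal of stale entries is omitted for brevity.

    def update_index_for_file(index, file_id, storage_unit_id, path):
        """Recompute block hashes for an updated file and refresh the changed index entries."""
        previous = dict(index.file_blocks.get(file_id, []))  # offset -> previously stored hash
        current = list(hash_blocks(path))                    # (offset, hash) for the current content
        index.file_blocks[file_id] = current
        for offset, block_hash in current:
            if previous.get(offset) != block_hash:           # only new or changed fingerprints are recorded
                index.block_locations[block_hash].add((storage_unit_id, path, offset))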
According to an embodiment, in the method 100, the data storage system is a network-based system, such as a centralized network attached server (NAS), or a cloud network which is linked with a network, for example, the Internet. In an example, for the first time, the method 100 calculates fingerprints for all the blocks of the object/file. Further, for subsequent times, the method 100 calculates fingerprints only for the changed blocks. This ensures that the hash values of duplicate blocks are not stored in the index. In an example, the data storage system may have an internal deduplication system, and thus there is no need for calculating fingerprints and there is only a need for collecting the fingerprints from the deduplication system. According to an embodiment, the present method 100 may be used to create an inter-planetary large file-system (consisting of a large number of files) for fast recovery purposes. The meta-data of each file (e.g., absolute path, uid/gid, permissions, etc.) is generated directly from the Catalog (or multiple Catalogs system) and the constituent blocks of each file are shared and restored by using the P2P protocol as described by the recovery procedure above.
In another aspect, the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method 100. In another aspect, a computer program product is provided comprising a non-transitory computer-readable medium having computer instructions stored thereon, the computer instructions being executable by the processor to execute the method 100. Examples of implementation of the non-transitory computer-readable storage medium include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory. A computer readable storage medium for providing a non-transient memory may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
FIG. 2 is a block diagram that illustrates various exemplary components of a data management device, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements from FIG. 1. With reference to FIG. 2, there is shown a block diagram 200 of a data management device 202 for a data storage system that includes a plurality of data storage units 204, a memory 206 and a processor 208. The memory 206 may further include an index 210. The plurality of data storage units 204 and the memory 206 may be communicatively coupled to the processor 208. The processor 208 of the data management device 202 executes the method 100 (of FIG. 1). The data storage system is described in detail, for example, in FIG. 3.
In another aspect, the present disclosure provides a data management device 202 for a data storage system comprising a plurality of data storage units 204, the data management device 202 configured to: divide each file in the data storage system into a plurality of blocks; calculate a hash value for each block using a hash algorithm; generate an index 210 comprising the calculated hash value and an associated address for each block in the data storage system; and in response to a request for a file at a specified location: determine a plurality of constituent blocks that make up the requested file, determine one or more locations for each constituent block, establish a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transfer the data for each constituent block to the specified location.
The data management device 202 includes suitable logic, circuitry, interfaces, or code that is configured to collect information about unstructured data that may be stored on different data storage units of the entire enterprise storage, such as a data center (e.g., a network attached server (NAS)), a data object, a virtual machine (VM), or a physical machine, and the like. The information may be collected either periodically or as a "live update" by use of the plurality of data storage units 204 (e.g., a collector). The plurality of data storage units 204 (e.g., the collector) may run in-host or outside of the host. The collected information may be native metadata together with additional synthetic data; for example, the input/output temperature of a file may be obtained by calculating the number of reads of the file in the last 24 hours, or any other synthetic information may be derived. In an example, the data management device 202 may be a single computing device or an electronic device. In another example, the data management device 202 may be a plurality of computing devices, or electronic devices, operating in a parallel or distributed architecture. In yet another example, the data management device 202 may be implemented as a computer program that provides various services (such as database management) to other devices, modules or apparatus.
The plurality of data storage units 204 includes suitable logic, circuitry, interfaces, or code that is configured to store data. Examples of the plurality of data storage units 204 include, but are not limited to, a primary storage, a secondary storage such as a memory of a virtual machine, a network attached server (NAS), a collector, and the like. The virtual machine may be located on different hosts or host types, such as VMware, Hyper-V, and the like. The memory 206 includes suitable logic, circuitry, interfaces, or code that is configured to store data and the instructions executable by the processor 208. Examples of implementation of the memory 206 may include, but are not limited to, an Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory. The memory 206 may store an operating system or other program products (including one or more operation algorithms) to operate the data management device 202.
The processor 208 includes suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 206. In an example, the processor 208 may be a general-purpose processor. Other examples of the processor 208 may include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry. Moreover, the processor 208 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the data management device 202.
The index 210 may also be referred to as a catalog, a catalog engine, or an artificial intelligence (AI) catalog engine. The index 210 is configured to store the calculated hash value and associated address for each block to enable quick recovery when a similar block is required by any data storage unit.
In operation, the data management device 202 enables peer-to-peer connection between the plurality of data storage units 204 for data sharing or data transfer. In other words, the data management device 202 provides a de-centralized way of data sharing among the plurality of data storage units 204, in contrast to the conventional approach where data sharing is highly centralized, i.e., data stored in one location is provided to all data storage units. The data management device 202 enables recovery of data from multiple locations (i.e., the data storage units) concurrently by leveraging an index 210 or a catalog. Thus, the data management device 202 improves recovery time and reduces computational resources, as it allows partial blocks to be recovered from different locations with different latencies and resource limitations. Each of the data storage units 204 generates a file whenever new data is created or existing data is updated by virtual machines or physical machines associated with servers or network attached storages. Such a file may further be requested by any of the data storage units 204.
The data management device 202 is configured to divide each file in the data storage system into a plurality of blocks. For example, a video file or an audio file is divided into the plurality of blocks while the video file or the audio file is saved in the data storage system. Each file in the data storage system is divided into the plurality of blocks to enable improved recovery, such as when a portion of a file, i.e., some blocks of the file, is required by a given data storage unit. Further, dividing each file into the plurality of blocks also enables data deduplication, i.e., duplicate blocks in the file may be identified easily.
The data management device 202 is further configured to calculate a hash value for each block using a hash algorithm. In other words, the data management device 202 is configured to determine a fingerprint (i.e., hash value) for each block of the plurality of blocks of the file. Such hash values are distinct for distinct blocks, which enables different blocks to be identified; identical blocks have the same hash value. In an example, the secure hash algorithm 1 (SHA-1) may be used to generate the hash for each block of the plurality of blocks. Generally, the secure hash algorithm 1 (SHA-1) may be defined as a cryptographic hash function which takes an input and produces a 160-bit hash value.
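For illustration only, the following minimal Python sketch splits a file into fixed-size blocks and computes an SHA-1 fingerprint for each block. The 4 MiB block size and the function name fingerprint_blocks are assumptions for the example; the disclosure does not prescribe a particular block size.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB block size; not prescribed by the disclosure


def fingerprint_blocks(path, block_size=BLOCK_SIZE):
    """Yield (offset, sha1_hex) for each non-overlapping fixed-size block of the file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield offset, hashlib.sha1(block).hexdigest()
            offset += len(block)
```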
The data management device 202 is further configured to generate an index 210 comprising the calculated hash value and an associated address for each block in the data storage system. The index 210 is generated to store the calculated hash value and associated address for each block to enable quick recovery when a similar block is required by any data storage unit. In an example, a data storage unit identifier may also be stored in the index 210 to enable identification of the data storage unit storing the respective block. In other words, the index 210 is used to store metadata information about each file and the hash values of all the blocks of the file. In an example, the index 210 may further store data such as an identifier along with the hash value of each block to identify the file each block is associated with. Thus, when recovery of some file is required, the metadata information as well as the list of blocks of the file are retrieved from the index 210. In an example, the index 210 can run different types of queries, perform analytics, and supply insights about a customer's enterprise storage; for example, it can find hot and cold data according to customer policy. In an example, the index 210 provides a service of unstructured data management. The index 210 provides a single pane of glass over the entire enterprise storage.
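The data model of the index 210 is not limited to any particular structure; one possible in-memory sketch, with illustrative names such as BlockIndex and add_file that are not taken from the disclosure, maps each fingerprint to its known locations and keeps the ordered block list and metadata per file:

```python
from collections import defaultdict


class BlockIndex:
    """Minimal catalog sketch: fingerprint -> known locations, plus per-file block lists."""

    def __init__(self):
        self.locations = defaultdict(set)  # hash -> {(storage_unit_id, path, offset), ...}
        self.files = {}                    # path -> {"meta": {...}, "blocks": [hash, ...]}

    def add_file(self, unit_id, path, block_fingerprints, meta=None):
        hashes = []
        for offset, block_hash in block_fingerprints:
            self.locations[block_hash].add((unit_id, path, offset))
            hashes.append(block_hash)
        self.files[path] = {"meta": meta or {}, "blocks": hashes}

    def blocks_of(self, path):
        return self.files[path]["blocks"]

    def locations_of(self, block_hash):
        return list(self.locations[block_hash])
```

A production catalog would persist this mapping and additionally carry the synthetic metadata described above; the sketch only shows the hash-to-location relationship that the later steps rely on.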
The data management device 202 may be further configured to receive a request for a given file by a data storage unit. In response to the request, the data management device 202 may provide the given file from different locations, in contrast to the conventional approach where such files are provided from a single location. Thus, the data management device 202 of the present disclosure is more reliable, faster, and has lower latency. In other words, the data management device 202 stores and retrieves data in a decentralized manner, in contrast to conventional approaches where data is stored at a single location, i.e., at a secondary storage.
The data management device 202 is further configured to, in response to a request for a file at a specified location, determine a plurality of constituent blocks that make up the requested file. The plurality of constituent blocks that make up the requested file are determined based on the data stored in the index 210. In other words, the index 210 stores metadata information about each file and identifiers to enable identification of the blocks associated with each file. Thus, based on the request for the file, the plurality of constituent blocks that make up the requested file are determined from the index 210.
The data management device 202 is further configured to, in response to a request for a file at a specified location, determine one or more locations for each constituent block. The index 210 stores the associated address for each block in the data storage system. Thus, the one or more locations for each constituent block of the requested file are determined based on the index 210. In other words, for each fingerprint (i.e., hash value), the available locations of that fingerprint are found inside the index 210. In an example, the same fingerprint may occur multiple times in multiple different files.
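For illustration, resolving a requested file against the catalog sketch above into (fingerprint, locations) pairs could look as follows; resolve_file is an assumed helper name, not terminology from the disclosure:

```python
def resolve_file(index, requested_path):
    """Return, for each constituent block of the requested file (in order),
    the block fingerprint and every known location of that fingerprint."""
    return [(block_hash, index.locations_of(block_hash))
            for block_hash in index.blocks_of(requested_path)]
```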
The data management device 202 is further configured to, in response to a request for a file at a specified location, establish a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location. In other words, the data management device 202 is configured to set up a peer-to-peer file-sharing protocol and provide all the metadata information for the requested file. The peer-to-peer protocol is used to recover/restore the requested file based on the metadata information stored in the index 210. Peer-to-peer file sharing may refer to the distribution and sharing of digital media using peer-to-peer networking technology. Peer-to-peer file sharing allows users to access media files such as books, music, movies, and games using a peer-to-peer software program that searches for other connected computers on a peer-to-peer network to locate the desired content in a distributed application architecture that partitions tasks or workloads between peers.
The data management device 202 is further configured to, in response to a request for a file at a specified location, transfer the data for each constituent block to the specified location. Thus, the data management device 202 is configured to recover data (i.e., the requested file) from one or more locations concurrently. Beneficially, this improves recovery time and reduces utilization of computational resources, as it allows recovery of partial blocks from different locations with different latencies and resource limitations. This is in contrast to conventional approaches, where such data is stored and recovered from one location, i.e., a secondary storage, and which are therefore time consuming and utilize more computational resources.
Thus, the data management device 202 of the present disclosure provides an improved data transfer in comparison to conventional approaches. The data management device 202 enables peer-to-peer connection between the plurality of data storage units for data sharing or data transfer. Thus, the data management device 202 provides a de-centralized way of data sharing among the plurality of data storage units in comparison to the conventional approach where data sharing is very centralized i.e., data stored in one location is provided to all data storage units conventionally. The data management device 202 enables recovery of data from multiple locations (i.e., the data storage units) based on the index. Thus, the data management device 202 provides an improved (i.e., reduced) recovery time for recovering constituent blocks of the requested file. Further, the data management device 202 enables improved utilization of computational resources as the data management device 202 allows recovering partial blocks from different locations with different latencies and resource limitations.
According to an embodiment, in the data management device 202, dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size. In other words, the data management device 202 is configured to split each file into a series of non-overlapping fixed-size blocks of a defined size in bytes, or into variable-size blocks. Further, hash values of such fixed-size or variable-size blocks are calculated for storage in the index 210. In an example, the data management device 202 is configured to divide the requested file into the plurality of blocks having the same common size. Further, hash values of such plurality of blocks may be determined and compared against the hash values already (i.e., previously) stored in the index 210. By virtue of dividing each file into a plurality of blocks having a common size or of variable size, indexing of the file for storage in the index 210 is improved. Further, deduplication can be implemented to identify duplicate blocks.
According to an embodiment, in the data management device 202, dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks using a variable chunking algorithm, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks using the same variable chunking algorithm. In other words, the data management device 202 splits each file into a series of non-overlapping variable-length blocks using the variable chunking algorithm. Further, hash values of such variable-length blocks are calculated for storage in the index 210. In an example, the data management device 202 divides the requested file into the plurality of blocks having variable size. Further, hash values of such plurality of blocks may be determined and compared against the hash values already (i.e., previously) stored in the index 210. The data management device 202 may use a variable chunking algorithm already known in the art. By virtue of dividing each file into a plurality of blocks using a variable chunking algorithm, indexing of the file for storage in the index 210 is improved. Further, deduplication can be implemented to identify duplicate blocks.
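One way to realise variable (content-defined) chunking is sketched below; the rolling value, the mask test, and the size parameters are illustrative assumptions, and a production chunker would typically use a sliding-window Rabin or Buzhash function rather than this simplified accumulator.

```python
def variable_chunks(data, min_size=2048, avg_size=8192, max_size=65536):
    """Cut a block boundary where a rolling value over the bytes since the last
    boundary hits a target bit pattern, so an insertion shifts boundaries only
    locally instead of re-aligning every subsequent fixed-size block."""
    mask = avg_size - 1  # avg_size assumed to be a power of two
    chunks, start, rolling = [], 0, 0
    for i in range(len(data)):
        rolling = ((rolling << 1) ^ data[i]) & 0xFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue
        if (rolling & mask) == mask or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```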
According to an embodiment, in the data management device 202, determining the one or more locations includes querying the generated index 210. In an example, a search may be executed on the index 210 to determine the plurality of constituent blocks that make up the requested file. In other words, the data management device 202 is configured to execute a query on the index 210 and generate a list of all fingerprints (i.e., hash values) that build the entire requested file.
According to an embodiment, in the data management device 202, determining the one or more locations comprises determining all available locations for each constituent block. In other words, the data management device 202 is configured to determine the one or more locations for each constituent block in the plurality of data storage units 204. Each of the constituent blocks may be stored at different locations in the plurality of data storage units 204. Thus, the locations of all such constituent blocks are determined and then used for transferring the blocks of the requested file, which makes the transfer faster than in the conventional approach.
According to an embodiment, in the data management device 202, transferring the data includes transferring data for one constituent block from a plurality of locations concurrently and stopping the remaining transfers for the constituent block when one of the plurality of transfers is complete. In other words, the data management device 202 is configured to concurrently access different locations (i.e., one or more locations) for transferring data for one constituent block. Thus, the data management device 202 enables recovering the same block from multiple locations concurrently. Further, the data management device 202 stops the remaining transfers once the first copy of the block is received. Beneficially, this improves the overall file recovery time and end-to-end latency.
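A minimal sketch of this first-transfer-wins behaviour, assuming a transport callback fetch_fn(location, block_hash) that returns the block bytes (the callback and its signature are not defined by the disclosure), is given below.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fetch_block_first_wins(block_hash, locations, fetch_fn):
    """Request the same block from several peers concurrently and return the first
    successful reply; the remaining transfers are abandoned on a best-effort basis."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(locations)))
    pending = {pool.submit(fetch_fn, loc, block_hash) for loc in locations}
    try:
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                try:
                    return future.result()  # first completed transfer wins
                except Exception:
                    continue  # this peer failed; keep waiting on the others
        raise IOError(f"block {block_hash} could not be fetched from any location")
    finally:
        # Python 3.9+: drop queued transfers and return without waiting for slow peers.
        pool.shutdown(wait=False, cancel_futures=True)
```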
According to an embodiment, in the data management device 202, transferring the data includes storing each received constituent block at a specified offset to build the requested file. In an example, the received constituent blocks may be stored in a defined sequential manner to enable building of the requested file. In an example, a given constituent block which is sequentially arranged before other constituent blocks may also be received at its offset before those other constituent blocks.
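Because each block carries its offset in the catalog, received blocks can be written directly into place even when they arrive out of order; a minimal sketch (with simplified file-open handling) is:

```python
import os


def write_block_at_offset(path, offset, data):
    """Place a received constituent block at its offset in the file being rebuilt."""
    mode = "r+b" if os.path.exists(path) else "wb"
    with open(path, mode) as f:
        f.seek(offset)
        f.write(data)
```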
According to an embodiment, in the data management device 202, transferring the data includes establishing a P2P connection to a new location for a constituent block if a transfer for the constituent block fails. In other words, if some failure occurs while transferring the data, the data management device 202 is configured to try a different location for the failed constituent block. Further, once all the blocks are received, the recovery/restore is done. Thus, the data management device 202 ensures that all constituent blocks are restored and no constituent block is lost during transfer of the data.
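A simpler, sequential variant of this fallback, trying the next known location whenever a transfer fails, might look like the following (illustrative only; fetch_fn is the same assumed transport callback as above):

```python
def fetch_block_with_fallback(block_hash, locations, fetch_fn):
    """Try each known location in turn; on failure, establish a P2P connection
    to the next location holding the same constituent block."""
    last_error = None
    for location in locations:
        try:
            return fetch_fn(location, block_hash)
        except Exception as error:
            last_error = error  # transfer failed; fall through to the next peer
    raise IOError(f"all locations failed for block {block_hash}") from last_error
```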
According to an embodiment, in the data management device 202, the P2P connection is established using a BitTorrent protocol or an IPFS protocol. The BitTorrent protocol is a communication protocol for peer-to-peer (P2P) file sharing, which enables users to distribute data and electronic files over the Internet in a decentralized manner. The IPFS protocol is a P2P hypermedia protocol designed to make the World Wide Web faster. The BitTorrent protocol and the IPFS protocol are exemplary protocols, and any other protocol known in the art may be used. According to an embodiment, in the data management device 202, the request for a file is a restore request or a file copy request. In an example, a given data storage unit which already has a file, or had previously provided the file for hash value storage in the index 210, may provide the file copy request. In an example, a given data storage unit which previously did not have the file may provide a restore request.
According to an embodiment, the data management device 202 is further configured to calculate one or more new hash values and update the index 210 for any updated files in the data storage system. Blocks whose hash values are stored in the index 210 may be changed or updated; in such a case, new hash values for those blocks are calculated and the index 210 is updated accordingly. Thus, when there is a file request, the changed or updated blocks can be recovered.
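Building on the fingerprinting and catalog sketches above, an incremental refresh might recompute fingerprints for an updated file and add only the entries that are not yet indexed, so unchanged (duplicate) blocks cost nothing beyond the comparison; names and behaviour are illustrative assumptions, not the disclosure's procedure.

```python
def refresh_file(index, unit_id, path):
    """Recompute fingerprints for a possibly updated file and update the catalog,
    adding location entries only for fingerprints not already recorded for this file."""
    new_fps = list(fingerprint_blocks(path))  # from the fingerprinting sketch above
    previous = set(index.files.get(path, {}).get("blocks", []))
    for offset, block_hash in new_fps:
        if block_hash not in previous:
            index.locations[block_hash].add((unit_id, path, offset))
    meta = index.files.get(path, {}).get("meta", {})
    index.files[path] = {"meta": meta, "blocks": [h for _, h in new_fps]}
```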
According to an embodiment, the data storage system is a network-based system, such as a centralized network attached server (NAS), or a cloud network which is linked with a network, for example, the internet. In an example, the first time an object or file is processed, the data management device 202 calculates fingerprints for all of its blocks; subsequently, the data management device 202 calculates fingerprints only for the changed blocks. This ensures that the hash values of duplicate blocks are not stored in the index 210. In an example, the data storage system may have an internal deduplication system, in which case there is no need to calculate fingerprints and they only need to be collected from the deduplication system.
FIG. 3 is a block diagram that illustrates various exemplary components of a data storage system, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIGs. 1, and 2. With reference to FIG. 3, there is shown a block diagram 300 of a data storage system 302 that includes one or more interconnected data storage units 304, a memory 306, a processor 308 and the data management device 202. The one or more interconnected data storage units 304, and the memory 306 may be communicatively coupled to the processor 308. The processor 308 of the data storage system 302 executes the method 100 (of FIG. 1).
In another aspect, the present disclosure provides a data storage system 302 comprising: one or more interconnected data storage units 304; and the data management device 202. The data storage system 302 includes suitable logic, circuitry, interfaces, or code that is configured to store data. The data storage system 302 may have one or more data storage units (or devices), such as physical machines, virtual machines, data centers (or network attached servers), collectors, and the like. The data storage system 302 represents a storage system of the entire enterprise storage, or of a subset of the entire enterprise or an organization. Examples of the data storage system 302 may include, but are not limited to, a data center, a network attached server, and the like.
The one or more interconnected data storage units 304 correspond to the plurality of data storage units 204 of the data management device 202 (of FIG. 2).
The memory 306 includes suitable logic, circuitry, interfaces, or code that is configured to store data and the instructions executable by the processor 308. Examples of implementation of the memory 306 may include, but are not limited to, an Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory. The memory 306 may store an operating system or other program products (including one or more operation algorithms) to operate the data storage system 302.
The processor 308 includes suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 306. In an example, the processor 308 may be a general-purpose processor. Other examples of the processor 308 may include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry. Moreover, the processor 308 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the data storage system 302.
FIG. 4 is a network environment diagram of communication between the data storage system and the data management device, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3. With reference to FIG. 4, there is shown a network environment 400 that includes data centers 402A and 402B and an AI (Artificial Intelligence) catalogue service 404. The data centers 402A and 402B further include NAS (Network Attached Storage) 406A and 406B, a data object 408, a production ESX server 410, a production MSSQL 412, and a production Oracle 414. The NAS 406A and 406B include collectors 416. The AI catalogue service 404 includes an AI catalogue engine 418, a database 420 and an internal collector 422.
The data centers 402A and 402B refer to production environments where data associated with virtual or physical machines is stored. The data centers 402A and 402B may together form an enterprise storage. The NAS 406A and 406B include suitable logic, circuitry, and interfaces that may be configured to store the data received from the one or more virtual or physical machines. The NAS 406A and 406B may store data received from the one or more virtual or physical machines along with a client identifier. In an example, the NAS 406A and 406B may be a primary data center that includes one or more hard disk drives, often arranged into logical, redundant storage containers or a Redundant Array of Inexpensive Disks (RAID).
The data object 408 includes suitable logic, circuitry, and interfaces that may be configured to provide object storage. In an example, the data object 408 may be Amazon S3 or Amazon Simple Storage Service.
The production ESX server 410 includes suitable logic, circuitry, and interfaces that may be configured to virtualize processor, memory, storage, and networking resources into multiple virtual machines. The production MSSQL 412 includes suitable logic, circuitry, and interfaces that may be configured to store the data that is used for production tasks, such as creating and updating features of virtual or physical machines. The production Oracle 414 includes suitable logic, circuitry, and interfaces that may be configured to provide data management for virtual or physical machines associated therewith. The production ESX server 410, the production MSSQL 412 and the production Oracle 414 are used here for exemplary purposes, and any other servers may be used.
The collectors 416 include suitable logic, circuitry, and interfaces that may be configured to collect information on the unstructured data from all types of sources, such as the NAS 406A and 406B, the data object 408, virtual machines, and the like in the entire enterprise storage. In other words, the collectors 416 are agents that run inside or outside a host (e.g., a NAS) and collect the file system metadata. The collection can be made by the collectors 416 periodically, or a "live update" may be executed. The collectors 416 collect native metadata and additional synthetic data. The AI catalogue service 404 enables data management in a data storage system such as the data storage system 302. The AI catalogue service 404 enables peer-to-peer data transfer among the data storage units, such as the NAS 406A and 406B.
The AI catalogue engine 418 may also be referred to as an index, such as the index 210, or as a catalog or a catalog engine. The AI catalogue engine 418 is configured to store the calculated hash value and associated address for each block to enable quick recovery when a similar block is required by any data storage unit.
The database 420 enables querying the AI catalogue engine 418 to determine one or more locations for each constituent block associated with a requested file. In other words, the database 420 enables executing a query on the AI catalogue engine 418 and generating a list of all fingerprints (i.e., hash values) that build the entire requested file.
The internal collector 422 includes suitable logic, circuitry, and interfaces that may be configured to collect information on the unstructured data from all of the collectors 416. The collection can be made by the internal collector 422 periodically, or a "live update" can be executed.
In operation, the collectors 416 are configured to provide files from the data centers 402A and 402B to the internal collector 422. A processor, such as the processor 208, may divide each of the files into a plurality of blocks. Further, hash values for each block are calculated. The AI catalogue engine 418 is configured to store the calculated hash value and an associated address for each block. Further, when a request for a file is made, the database 420 determines a plurality of constituent blocks that make up the requested file and one or more locations for each constituent block. Further, a peer-to-peer connection is established to transfer the data for each constituent block concurrently to a specified location. Further, the data for each constituent block is transferred to the specified location.
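Tying the earlier sketches together, an end-to-end restore corresponding to this operational flow could, under the same assumptions (fixed-size blocks and an assumed fetch_fn transport callback), be written as:

```python
def restore_file(index, requested_path, destination, fetch_fn, block_size=BLOCK_SIZE):
    """Resolve the constituent blocks of the requested file from the catalog, fetch
    each block over P2P from any of its known locations, and write it at its offset
    in the destination file being rebuilt."""
    for position, (block_hash, locations) in enumerate(resolve_file(index, requested_path)):
        data = fetch_block_first_wins(block_hash, locations, fetch_fn)
        write_block_at_offset(destination, position * block_size, data)
```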
FIG. 5 is an illustration of an exemplary scenario of implementation of the data storage system and the data management device, in accordance with an embodiment of the present disclosure. FIG. 5 is described in conjunction with elements from FIGs. 1, 2, 3, and 4. With reference to FIG. 5, there is shown an implementation scenario 500. With reference to FIG. 5, there are shown storage systems 502A to 502N, a target system 504, a catalog 506, and a peer-to-peer file sharing protocol 508. Each of the storage systems 502A to 502N includes storage disks 510A to 510N. There are further shown blocks 512A to 512N. The storage systems 502A to 502N may also be referred to as data storage units, such as the data storage units 204. The target system 504 is configured to request a file from the storage systems 502A to 502N. The catalog 506 may also be referred to as the index 210. The catalog 506 is configured to store hash values and an associated address for each block in the storage systems 502A to 502N. Based on the file request from the target system 504, the catalog 506 determines a plurality of constituent blocks that make up the requested file and one or more locations for each constituent block. Further, the peer-to-peer file sharing protocol 508 is established to transfer the data for each constituent block concurrently to the target system 504. In an example, the storage systems 502A, 502B and 502N provide the data for each constituent block concurrently to the target system 504. The storage systems 502A to 502N provide the data for each constituent block stored in the respective storage disks 510A to 510N. The data for each constituent block is transferred to the blocks 512A to 512N. The blocks 512A to 512N form the requested file.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A computer-implemented method (100) of data management in a data storage system (302) comprising a plurality of data storage units (204, 304), the method (100) comprising: dividing each file in the data storage system (302) into a plurality of blocks; calculating a hash value for each block using a hash algorithm; generating an index (210) comprising the calculated hash value and an associated address for each block in the data storage system (302); and in response to a request for a file at a specified location: determining a plurality of constituent blocks that make up the requested file, determining one or more locations for each constituent block, establishing a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transferring the data for each constituent block to the specified location.
2. The method (100) of claim 1, wherein dividing each file in the data storage system (302) into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size.
3. The method (100) of claim 1, wherein dividing each file in the data storage system (302) into a plurality of blocks comprises dividing each file into a plurality of blocks using a variable chunking algorithm, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks using the same variable chunking algorithm.
4. The method (100) of any preceding claim, wherein determining the one or more locations includes querying the generated index (210).
5. The method (100) of any preceding claim, wherein determining the one or more locations comprises determining all available locations for each constituent block.
6. The method (100) of any preceding claim, wherein transferring the data includes transferring data for one constituent block from a plurality of locations concurrently and stopping the remaining transfers for the constituent block when one of the plurality of transfers is complete.
7. The method (100) of any preceding claim, wherein transferring the data includes storing each received constituent block at a specified offset to build the requested file.
8. The method (100) of any preceding claim, wherein transferring the data includes establishing a P2P connection to a new location for a constituent block if a transfer for the constituent block fails.
9. The method (100) of any preceding claim, wherein the P2P connection is established using a BitTorrent protocol or an IPFS protocol.
10. The method (100) of any preceding claim, wherein the request for a file is a restore request or a file copy request.
11. The method (100) of any preceding claim, further comprising calculating one or more new hash values and updating the index (210) for any updated files in the data storage system (302).
12. The method (100) of any preceding claim, wherein the data storage system (302) is a network based system.
13. A computer-readable medium configured to store instruction which, when executed by a processor (208), cause the processor (208) to perform the method (100) of any preceding claim.
14. A data management device (202) for a data storage system (302) comprising a plurality of data storage units (204, 304), the data management device (202) configured to: divide each file in the data storage system (302) into a plurality of blocks; calculate a hash value for each block using a hash algorithm; generate an index (210) comprising the calculated hash value and an associated address for each block in the data storage system (302); and in response to a request for a file at a specified location: determine a plurality of constituent blocks that make up the requested file, determine one or more locations for each constituent block, establish a peer-to-peer, P2P, connection to transfer the data for each constituent block concurrently to the specified location, and transfer the data for each constituent block to the specified location.
15. The device (202) of claim 14, wherein dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks having a common size, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks having the same common size.
16. The device of claim 14, wherein dividing each file in the data storage system into a plurality of blocks comprises dividing each file into a plurality of blocks using a variable chunking algorithm, and determining the plurality of constituent blocks comprises dividing the requested file into a plurality of blocks using the same variable chunking algorithm.
17. The device (202) of any one of claims 14 to 16, wherein determining the one or more locations includes querying the generated index (210).
18. The device (202) of any one of claims 14 to 17, wherein determining the one or more locations comprises determining all available locations for each constituent block.
19. The device (202) of any one of claims 14 to 18, wherein transferring the data includes transferring data for one constituent block from a plurality of locations concurrently and stopping the remaining transfers for the constituent block when one of the plurality of transfers is complete.
20. The device (202) of any one of claims 14 to 19, wherein transferring the data includes storing each received constituent block at a specified offset to build the requested file.
21. The device (202) of any one of claims 14 to 20, wherein transferring the data includes establishing a P2P connection to a new location for a constituent block if a transfer for the constituent block fails.
22. The device (202) of any one of claims 14 to 21, wherein the P2P connection is established using a BitTorrent protocol or an IPFS protocol.
23. The device (202) of any one of claims 14 to 22, wherein the request for a file is a restore request or a file copy request.
24. The device (202) of any one of claims 14 to 23, further comprising calculating one or more new hash values and updating the index (210) for any updated files in the data storage system (302).
25. The device (202) of any one of claims 14 to 24, wherein the data storage system (302) is a network based system.
26. A data storage system (302) comprising: one or more interconnected data storage units (204, 304); and the data management device (202) of any one of claims 14 to 25.
PCT/EP2021/065636 2021-06-10 2021-06-10 Method, device, and system of data management for peer-to-peer data transfer WO2022258184A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/065636 WO2022258184A1 (en) 2021-06-10 2021-06-10 Method, device, and system of data management for peer-to-peer data transfer

Publications (1)

Publication Number Publication Date
WO2022258184A1 true WO2022258184A1 (en) 2022-12-15

Family

ID=76502723


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3404891A1 (en) * 2016-08-15 2018-11-21 Huawei Technologies Co., Ltd. Method and system for distributing digital content in peer-to-peer network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JAHN ARNE JOHNSEN ET AL: "Peer-to-peer networking with BitTorrent", INTERNET CITATION, 31 December 2005 (2005-12-31), pages 1 - 22, XP002740292, Retrieved from the Internet <URL:http://www.cs.ucla.edu/classes/cs217/05BitTorrent.pdf> [retrieved on 20150528] *

Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21733097; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21733097; Country of ref document: EP; Kind code of ref document: A1)