WO2020215580A1 - Distributed global data deduplication method and device - Google Patents

Distributed global data deduplication method and device

Info

Publication number
WO2020215580A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
target data
data
fingerprint
storage node
Prior art date
Application number
PCT/CN2019/104330
Other languages
English (en)
French (fr)
Inventor
齐泽青
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020215580A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • This application relates to the field of big data technology, and in particular to a distributed global data deduplication method and device.
  • A distributed storage system stores data across multiple independent devices.
  • A traditional network storage system uses a centralized storage server to store all data.
  • That storage server becomes the bottleneck of system performance and the focal point for reliability and security, so it cannot meet the needs of large-scale storage applications.
  • A distributed network storage system adopts a scalable architecture, uses multiple storage servers to share the storage load, and uses location servers to locate stored information. This improves the reliability, availability, and access efficiency of the system and makes it easy to expand.
  • The embodiments of the present application provide a distributed global data deduplication method and device to solve the prior-art problem that a large amount of redundant, duplicated information in a distributed storage system consumes storage space and lowers storage efficiency.
  • An embodiment of the present application provides a distributed global data deduplication method.
  • The method is applied to a storage system.
  • The method includes: a storage gateway receives a client's write request for target data, determines the target object number corresponding to the target data according to a first preset rule, and stores the correspondence between the target data and the target object number in a metadata list; the target storage node corresponding to the target object number is determined according to a second preset rule, and the storage gateway writes the target data into the cache layer of the target storage node, the second preset rule being the correspondence rule between object numbers and storage nodes; it is judged whether a data fingerprint needs to be calculated for the target data; if so, the data fingerprint of the target data is calculated according to a preset algorithm to obtain the target data fingerprint;
  • it is determined whether the storage layer of the target storage node has already stored the target data fingerprint; if it has not, the target data fingerprint is stored in the storage layer of the target storage node, and prompt information indicating a successful write is returned to the storage gateway, the prompt information carrying the target data fingerprint;
  • if the storage layer of the target storage node has already stored the target data fingerprint, prompt information indicating a successful write and carrying the target data fingerprint is returned to the storage gateway, the target data stored in the cache layer of the target storage node is then deleted, and the reference count of the target data fingerprint stored in the storage layer of the target storage node is updated;
  • the storage gateway receives the prompt information and determines whether it carries the target data fingerprint; if it does, the correspondence between the target data and the target object number in the metadata list is updated to the correspondence among the target data, the target object number, and the target data fingerprint.
  • An embodiment of the present application provides a distributed global data deduplication method.
  • The method is executed by a client.
  • The method includes: receiving a write request for target data, determining the target object number corresponding to the target data according to a first preset rule, and storing the correspondence between the target data and the target object number in a metadata list; determining the target storage node corresponding to the target object number according to a second preset rule, the second preset rule being the correspondence rule between object numbers and storage nodes, and the target storage node being deployed in the storage system;
  • sending the target data to the target storage node; receiving the prompt information returned by the target storage node and judging whether it carries a target data fingerprint, the target data fingerprint being data generated at the target storage node from the target data;
  • if the prompt information carries the target data fingerprint, updating the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint,
  • and updating the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  • An embodiment of the present application provides a distributed global data deduplication method executed by a storage system. The method includes: receiving target data sent by a client and writing the target data to the cache layer of the target storage node; judging whether a data fingerprint needs to be calculated for the target data; if so, calculating the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint, the target data fingerprint corresponding one-to-one with the target data; determining whether the storage layer of the target storage node has already stored the target data fingerprint; if it has not, storing the target data fingerprint in the storage layer of the target storage node and returning to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint; if the storage layer of the target storage node has already stored the target data fingerprint, returning to the client prompt information indicating a successful write,
  • the prompt information carrying the target data fingerprint
  • An embodiment of the present application provides a distributed global data deduplication device.
  • The device includes: a first receiving unit, configured for the storage gateway to receive a client's write request for target data, determine the target object number corresponding to the target data according to a first preset rule, and store the correspondence between the target data and the target object number in a metadata list;
  • a first determining unit, configured to determine the target storage node corresponding to the target object number according to a second preset rule, the storage gateway writing the target data into the cache layer of the target storage node, and the second preset rule being the correspondence rule between object numbers and storage nodes;
  • a first judging unit, configured to judge whether a data fingerprint needs to be calculated for the target data;
  • a first calculation unit, configured to calculate, if a data fingerprint needs to be calculated for the target data, the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint, the target data fingerprint corresponding one-to-one with the target data;
  • a second judging unit, configured to judge whether the storage layer of the target storage node has already stored the target data fingerprint;
  • a first storage unit, configured to, if the storage layer of the target storage node does not store the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node, and prompt information
  • An embodiment of the present application provides a distributed global data deduplication device. The device includes: a first receiving unit, configured to receive a write request for target data, determine the target object number corresponding to the target data according to a first preset rule, and store the correspondence between the target data and the target object number in a metadata list; a determining unit, configured to determine the target storage node corresponding to the target object number according to a second preset rule, the second preset rule being the correspondence rule between object numbers and storage nodes, and the target storage node being deployed in a storage system; a sending unit, configured to send the target data to the target storage node; a second receiving unit, configured to receive the prompt information returned by the target storage node and determine whether it carries a target data fingerprint, the target data fingerprint being data generated at the target storage node from the target data; an update unit, configured to, if the prompt information carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to the target
  • An embodiment of the present application provides a distributed global data deduplication device. The device includes: a receiving unit, configured to receive target data sent by a client and write the target data into the cache layer of the target storage node;
  • a first judging unit, configured to judge whether a data fingerprint needs to be calculated for the target data;
  • a calculation unit, configured to calculate, if a data fingerprint needs to be calculated for the target data, the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint, the target data fingerprint corresponding one-to-one with the target data;
  • a second judging unit, configured to judge whether the storage layer of the target storage node has already stored the target data fingerprint;
  • a storage unit, configured to, if the storage layer of the target storage node does not store the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint;
  • a deleting unit, configured to return to the client prompt information indicating a successful write if the storage layer of the target storage node has stored the target
  • An embodiment of the present application provides a non-volatile computer-readable storage medium.
  • The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the aforementioned distributed global data deduplication method.
  • An embodiment of the present application provides a computer device, including a memory and a processor. The memory is configured to store information including program instructions, and the processor is configured to control the execution of the program instructions. When the program instructions are executed by the processor, the steps of the above-mentioned distributed global data deduplication method are implemented.
  • The target object number is determined from the written target data, the target storage node is determined from the target object number, the target data is written into the cache layer of the target storage node, and the target data fingerprint corresponding to the target data is generated according to the preset algorithm.
  • The target data thus corresponds to a unique storage node in the distributed storage system, and a unique target data fingerprint is generated for it, which identifies the target data uniquely within the storage system.
  • If the storage layer of the target storage node has already stored the target data fingerprint, the target data in the cache layer of the target storage node is deleted. This avoids the prior-art problems of large storage space consumption and low storage efficiency caused by a large amount of redundant, duplicated information in a distributed storage system, reducing storage space consumption and improving the storage efficiency of the storage system.
  • Fig. 1 is a flowchart of an optional distributed global data deduplication method according to an embodiment of the present application
  • Fig. 2 is a schematic diagram of an optional distributed global data deduplication device according to an embodiment of the present application.
  • Embodiment 1 of the present application provides a distributed global data deduplication method, which is applied to a storage system. As shown in FIG. 1, the method includes the following steps:
  • Step S102: The storage gateway receives the client's write request for target data, determines the target object number corresponding to the target data according to the first preset rule, and stores the correspondence between the target data and the target object number in the metadata list.
  • Step S104: The target storage node corresponding to the target object number is determined according to a second preset rule, and the storage gateway writes the target data into the cache layer of the target storage node.
  • The second preset rule is the correspondence rule between object numbers and storage nodes.
  • Step S106: It is judged whether a data fingerprint needs to be calculated for the target data.
  • Step S108: If a data fingerprint needs to be calculated, the data fingerprint of the target data is calculated according to the preset algorithm to obtain the target data fingerprint. The target data fingerprint corresponds one-to-one with the target data.
  • Step S110: It is determined whether the storage layer of the target storage node has already stored the target data fingerprint.
  • Step S112: If the storage layer of the target storage node does not store the target data fingerprint, the target data fingerprint is stored in the storage layer of the target storage node, and prompt information indicating a successful write is returned to the storage gateway. The prompt information carries the target data fingerprint.
  • Step S114: If the storage layer of the target storage node has already stored the target data fingerprint, prompt information indicating a successful write and carrying the target data fingerprint is returned to the storage gateway; the target data stored in the cache layer of the target storage node is then deleted, and the reference count of the target data fingerprint stored in the storage layer of the target storage node is updated.
  • Step S116: The storage gateway receives the prompt information and determines whether it carries the target data fingerprint.
  • Step S118: If the prompt information carries the target data fingerprint, the correspondence between the target data and the target object number in the metadata list is updated to the correspondence among the target data, the target object number, and the target data fingerprint, and the correspondence rule between the target object number and the target storage node in the second preset rule is updated to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
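Steps S102 to S118 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and attribute names (`StorageNode`, `StorageGateway`, `placement`) are hypothetical, the first preset rule is assumed to be a CRC32 of the data content, the second preset rule is assumed to be a modulo over the node count, the S106 check is omitted, and re-keying cached data by fingerprint is a design choice made only so that later lookups by fingerprint work.

```python
import hashlib
import zlib


class StorageNode:
    """Hypothetical storage node: a cache layer for data, a storage layer for fingerprints."""

    def __init__(self):
        self.cache = {}     # cache layer: key -> data bytes
        self.refcount = {}  # storage layer: fingerprint -> reference count

    def write(self, object_no, data):
        self.cache[object_no] = data                    # S104: data lands in the cache layer
        fp = hashlib.md5(data).hexdigest()              # S108: fingerprint (S106 check omitted)
        if fp not in self.refcount:                     # S110 / S112: first copy of this content
            self.refcount[fp] = 1
            self.cache[fp] = self.cache.pop(object_no)  # assumption: re-key cached data by fingerprint
        else:                                           # S114: duplicate content
            del self.cache[object_no]
            self.refcount[fp] += 1
        return fp                                       # carried back in the "write succeeded" reply


class StorageGateway:
    def __init__(self, node_count):
        self.nodes = [StorageNode() for _ in range(node_count)]
        self.metadata = {}   # name -> (object number, fingerprint or None)
        self.placement = {}  # (object number, fingerprint) -> node index

    def write(self, name, data):
        object_no = zlib.crc32(data)                    # assumed first preset rule
        self.metadata[name] = (object_no, None)         # S102
        node_index = object_no % len(self.nodes)        # assumed second preset rule
        fp = self.nodes[node_index].write(object_no, data)
        if fp is not None:                              # S116 / S118
            self.metadata[name] = (object_no, fp)
            self.placement[(object_no, fp)] = node_index
        return fp
```

In this sketch, writing the same bytes under two different names leaves a single cached copy on one node, with a reference count of 2 for its fingerprint.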
  • The preset algorithm may be the MD5 message-digest algorithm, a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to check that transmitted information is complete and consistent.
  • In that case, the target data fingerprint is the value calculated by the MD5 message-digest algorithm.
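As a small illustration of the fingerprint calculation, the following sketch uses Python's standard `hashlib` module. The function name `data_fingerprint` is hypothetical. Identical data always yields the same fingerprint; the one-to-one correspondence in the other direction relies on MD5 collisions not occurring in practice, which is an assumption of the scheme rather than a guarantee of the algorithm.

```python
import hashlib


def data_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of the data as 32 hexadecimal characters (128 bits)."""
    return hashlib.md5(data).hexdigest()


# Two writes with identical content produce identical fingerprints.
same = data_fingerprint(b"block-1") == data_fingerprint(b"block-1")
different = data_fingerprint(b"block-1") != data_fingerprint(b"block-2")
```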
  • The first preset rule is the rule that maps written data to an object number.
  • If the data fingerprint stored in the storage layer gains one client object reference, its reference count is increased by 1; if it loses one client object reference, its reference count is decreased by 1. A reference count of 0 means that the data fingerprint stored in the storage layer is no longer referenced by any client object. The data fingerprint can then be deleted from the storage layer, the corresponding data stored in the cache layer is collected as garbage, and the freed space is linked into the free list for reuse.
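The reference-counting rule can be sketched as below. The class `FingerprintTable` and its `free_list` attribute are hypothetical names; the list of reclaimable fingerprints merely stands in for the free linked list mentioned in the text, whose actual layout the source does not describe.

```python
class FingerprintTable:
    """Hypothetical storage-layer table: fingerprint -> number of client object references."""

    def __init__(self):
        self.refcount = {}
        self.free_list = []  # fingerprints whose cached data can be reclaimed

    def add_reference(self, fp):
        # One more client object refers to this fingerprint.
        self.refcount[fp] = self.refcount.get(fp, 0) + 1

    def drop_reference(self, fp):
        # One client object stops referring to this fingerprint.
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            # No references remain: delete the fingerprint and queue its data for reclamation.
            del self.refcount[fp]
            self.free_list.append(fp)
```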
  • When the target data is duplicate data, the cache layer of the target storage node stores the target data, and the storage layer of the target storage node stores the target data fingerprint and the reference count of the target data fingerprint.
  • Deduplicating the cache layer of the storage system in this way prevents redundant duplicate data from increasing storage space consumption and improves the storage efficiency of the storage space.
  • The method further includes: receiving a request from the client to read the target data; determining whether a target data fingerprint exists in the metadata list; if no target data fingerprint exists in the metadata list, determining the target storage node according to the target object number in the metadata list and the second preset rule; obtaining the target data stored in the cache layer of the target storage node according to the target object number; and returning the target data to the client.
  • The target data fingerprint and the target object number are metadata of the target data.
  • Metadata, also known as intermediary data or relay data, is data that describes data (data about data). It mainly describes data properties and supports functions such as indicating storage location, recording history, resource search, and file recording.
  • Metadata can be regarded as an electronic catalog: to serve as a catalog, it describes and collects the content or characteristics of the data, which in turn assists data retrieval.
  • The method further includes: if the target data fingerprint exists in the metadata list, determining the target storage node according to the second preset rule; searching for the target data fingerprint stored in the storage layer of the target storage node; determining the target data stored in the cache layer of the target storage node according to the target data fingerprint; and returning the target data to the client.
  • If the target data fingerprint is obtained, the target data is looked up by the target data fingerprint; otherwise, it is looked up by the target object number. Either way, the target data can be found.
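The two read paths can be sketched as a single lookup function. This is an illustrative sketch under assumptions: nodes are modelled as plain dictionaries with hypothetical `"cache"` and `"fingerprints"` entries, the second preset rule is assumed to be a modulo over the node count, and the function name `read_target_data` is not from the source.

```python
def read_target_data(name, metadata, nodes):
    """Look up data by fingerprint when one is recorded, otherwise by object number."""
    object_no, fp = metadata[name]
    node = nodes[object_no % len(nodes)]       # assumed second preset rule
    if fp is not None and fp in node["fingerprints"]:
        return node["cache"][fp]               # fingerprint path
    return node["cache"][object_no]            # object-number path


# Hypothetical two-node system: one object indexed by fingerprint, one by object number.
nodes = [
    {"cache": {4: b"plain"}, "fingerprints": set()},
    {"cache": {"fp-x": b"deduplicated"}, "fingerprints": {"fp-x"}},
]
metadata = {"with_fp": (3, "fp-x"), "without_fp": (4, None)}
```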
  • The method further includes: receiving a write request for first target data; determining object number 1 according to the first preset rule; determining whether a data fingerprint corresponding to object number 1 exists in the metadata list; if a corresponding data fingerprint a exists for object number 1, determining that the first target data is an update write; determining storage node A according to the second preset rule, and determining whether the storage system is write-priority or read-priority; if it is read-priority, writing the first target data to the cache layer of storage node A; obtaining the target data stored in the cache layer of storage node A according to data fingerprint a, and merging the first target data with that target data to obtain second target data; calculating the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1; storing data fingerprint a1 in the storage layer of storage node A, and storing the second target data in the cache layer of storage node A.
  • The method further includes: if it is write-priority, writing the first target data to the cache layer of storage node A and marking the first target data as dirty data; calculating the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2; storing data fingerprint a2 in the storage layer of storage node A and the first target data in the cache layer of storage node A; obtaining from the cache layer the target data corresponding to data fingerprint a and merging it with the first target data to obtain the second target data, which corresponds to object number 1 and data fingerprint a2; updating the reference count of data fingerprint a stored in the storage layer of storage node A; and updating data fingerprint a in the metadata list and the second preset rule to data fingerprint a2.
  • When the client initiates a write request that is an update write, the updated data is stored in the cache layer of the storage node, and data fingerprint a1 is stored in the storage layer of the storage node.
  • The reference count of data fingerprint a, which corresponds to the data before the update, is decreased by 1.
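A rough sketch of the update-write handling follows. It rests on several assumptions not stated in the source: the merge is taken to be a byte-range overlay at a given offset, the node is modelled as a dictionary with hypothetical `"cache"` and `"refcount"` entries, and the dirty-data marking of the write-priority path is left out. The only difference modelled between the two modes is which data is fingerprinted: the merged data (read-priority, fingerprint a1) or the newly written data (write-priority, fingerprint a2).

```python
import hashlib


def merge(old: bytes, new: bytes, offset: int) -> bytes:
    """Assumed merge rule: overlay `new` onto `old` starting at `offset`."""
    buf = bytearray(old)
    end = offset + len(new)
    if end > len(buf):
        buf.extend(b"\0" * (end - len(buf)))  # pad so the overlay fits
    buf[offset:end] = new
    return bytes(buf)


def update_write(node, old_fp, new, offset, read_priority=True):
    """Apply an update write on one node and return the new fingerprint."""
    merged = merge(node["cache"][old_fp], new, offset)
    # Read-priority fingerprints the merged data; write-priority fingerprints the new write.
    new_fp = hashlib.md5(merged if read_priority else new).hexdigest()
    node["cache"][new_fp] = merged
    node["refcount"][new_fp] = node["refcount"].get(new_fp, 0) + 1
    node["refcount"][old_fp] -= 1             # the pre-update data loses one reference
    return new_fp
```

The caller would then replace the old fingerprint with the returned one in the metadata list, as the text describes.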
  • Embodiment 1 of the present application provides a distributed global data deduplication device, which is used to implement the distributed global data deduplication method provided in Embodiment 1.
  • The device is deployed in a storage system, as shown in FIG. 2.
  • The device includes: a first receiving unit 10, a first determining unit 20, a first judging unit 30, a first calculation unit 40, a second judging unit 50, a first storage unit 60, a deleting unit 70, a third judging unit 80, and a first update unit 90.
  • The first receiving unit 10 is configured for the storage gateway to receive the client's write request for target data, determine the target object number corresponding to the target data according to the first preset rule, and store the correspondence between the target data and the target object number in the metadata list.
  • The first determining unit 20 is configured to determine the target storage node corresponding to the target object number according to a second preset rule, the storage gateway writing the target data into the cache layer of the target storage node. The second preset rule is the correspondence rule between object numbers and storage nodes.
  • The first judging unit 30 is configured to judge whether a data fingerprint needs to be calculated for the target data.
  • The first calculation unit 40 is configured to calculate, if a data fingerprint needs to be calculated for the target data, the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint. The target data fingerprint corresponds one-to-one with the target data.
  • The second judging unit 50 is configured to judge whether the storage layer of the target storage node has already stored the target data fingerprint.
  • The first storage unit 60 is configured to, if the storage layer of the target storage node does not store the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return prompt information indicating a successful write to the storage gateway. The prompt information carries the target data fingerprint.
  • The deleting unit 70 is configured to, if the storage layer of the target storage node has already stored the target data fingerprint, return to the storage gateway prompt information indicating a successful write and carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node.
  • The third judging unit 80 is configured for the storage gateway to receive the prompt information and determine whether it carries the target data fingerprint.
  • The first update unit 90 is configured to, if the prompt information carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  • The device further includes: a second receiving unit, configured to receive a request from the client to read the target data after the first update unit 90 updates the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node;
  • a fourth determining unit, configured to determine whether the target data fingerprint exists in the metadata list;
  • a second determining unit, configured to, if the target data fingerprint does not exist in the metadata list, determine the target storage node according to the target object number in the metadata list and the second preset rule;
  • a second storage unit, configured to obtain the target data stored in the cache layer of the target storage node according to the target object number;
  • a first return unit, configured to return the target data to the client.
  • Optionally, the device further includes: a third determining unit, configured to determine the target storage node according to the second preset rule if, after the fourth determining unit has checked the metadata list, the target data fingerprint is found to exist in the metadata list; a searching unit, configured to search for the target data fingerprint stored in the storage layer of the target storage node; a fourth determining unit, configured to determine, according to the target data fingerprint, the target data stored in the cache layer of the target storage node; and a second return unit, configured to return the target data to the client.
  • When the target data is read, if the target data fingerprint is available, the target data is looked up by the target data fingerprint; otherwise, it is looked up by the target object number. Either method can find the target data.
  • Optionally, the device further includes: a third receiving unit, configured to receive a write request for first target data after the first update unit 90 updates the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node;
  • the fifth determining unit is used to determine object number 1 according to the first preset rule, and to determine whether object number 1 has a corresponding data fingerprint in the metadata list;
  • the sixth determining unit is used to determine that the first target data is an update write if object number 1 has a corresponding data fingerprint a, to determine storage node A according to the second preset rule, and to determine whether the storage system is write priority or read priority;
  • a write unit, used to write the first target data to the cache layer of storage node A if the system is read priority;
  • a merge unit, used to obtain, according to data fingerprint a, the target data stored in the cache layer of storage node A, and to merge the first target data with that target data to obtain second target data;
  • the data fingerprint a1 of the second target data is calculated according to the preset algorithm, the data fingerprint a1 is stored in the storage layer of storage node A, and the second target data is stored in the cache layer of storage node A; the second target data corresponds to object number 1 and data fingerprint a1; the second update unit is used to update the reference count of data fingerprint a; the third update unit is used to update data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
  • Optionally, the device further includes: a marking unit, configured to write the first target data to the cache layer of storage node A and mark the first target data as dirty data if, after the sixth determining unit determines whether the storage system is write priority or read priority, the system is write priority; a third calculation unit, used to calculate the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2; a fourth storage unit, used to store data fingerprint a2 in the storage layer of storage node A and to store the first target data in the cache layer of storage node A;
  • an obtaining unit, used to obtain the target data corresponding to data fingerprint a from the cache layer and merge it with the first target data to obtain second target data, the second target data corresponding to object number 1 and data fingerprint a2;
  • a fourth update unit, used to update the reference count of data fingerprint a stored in the storage layer of storage node A; a fifth update unit, used to update data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.
  • When the client initiates a write request that is an update write, the updated data is stored in the cache layer of the storage node, data fingerprint a1 is stored in the storage layer of the storage node, and the reference count of data fingerprint a, which corresponds to the data before the update, is decremented by 1.
  • Embodiment 1 of the present application provides a computer non-volatile readable storage medium. The medium includes a stored program, and when the program runs, the device in which the medium is located is controlled to execute the distributed global data deduplication method provided in Embodiment 1.
  • Embodiment 1 of the present application provides a storage system including a memory and a processor. The memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the steps of the distributed global data deduplication method provided in Embodiment 1 are implemented when the program instructions are loaded and executed by the processor.
  • Embodiment 2 of the present application provides a distributed global data deduplication method, which is executed jointly by a client and a storage system, wherein the client executes step S202 to step S210; the storage system executes step S302 to step S312.
  • Step S202 The client receives the write request of the target data, determines the target object number corresponding to the target data according to the first preset rule, and stores the correspondence between the target data and the target object number in the metadata list.
  • Step S204 The client determines the target storage node corresponding to the target object number according to a second preset rule.
  • the second preset rule is a corresponding rule between the object number and the storage node, and the target storage node is deployed in the storage system.
  • Step S206 The client sends the target data to the target storage node.
  • Step S208 The client receives the prompt information returned by the target storage node, and determines whether the prompt information carries a target data fingerprint, and the target data fingerprint is data generated at the target storage node according to the target data.
  • Step S210: If the prompt information carries the target data fingerprint, the client updates the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and updates the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  • In this solution, the target object number is determined from the written target data, the target storage node is determined from the target object number, the target data is written into the cache layer of the target storage node, and the target data fingerprint corresponding to the target data is generated according to the preset algorithm.
  • The target data corresponds to a unique storage node in the distributed storage system, and a unique target data fingerprint is generated for the target data, indicating the uniqueness of the target data in the storage system.
  • If the storage layer of the target storage node has already stored the target data fingerprint, the target data in the cache layer of the target storage node is deleted. This avoids the high storage space consumption and low storage efficiency caused by the large amount of redundant, duplicate information in prior-art distributed storage systems, reducing storage space consumption and improving the storage efficiency of the storage system.
  • Optionally, the client also performs the following steps: before receiving the write request for the target data, slice each disk image of the client according to a preset value to obtain multiple shards; assign an object number to each of the shards; and, according to the second preset rule, assign a storage node to each object number.
  • The preset value for slicing the client's disk image can be set to a size from 8K to 64M, depending on the application scenario.
  • The mapping relationship can be, for example: objects 1 to 10 are distributed on storage node A, and objects 11 to 20 are distributed on storage node B. Assigning a unique storage node of the storage system to each shard of the client, and storing written data on the assigned storage node, facilitates storage system management and data query and improves the energy consumption management efficiency of the storage system.
  • Embodiment 2 of the present application provides a computer non-volatile readable storage medium. The medium includes a stored program, and the client is controlled to execute steps S202 to S210 above when the program runs.
  • Embodiment 2 of the present application provides a client including a memory and a processor. The memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and steps S202 to S210 above are implemented when the program instructions are loaded and executed by the processor.
  • Step S302 The storage system receives the target data sent by the client, and writes the target data into the cache layer of the target storage node.
  • Step S304: The storage system determines whether a data fingerprint needs to be calculated for the target data.
  • Step S306: If a data fingerprint needs to be calculated for the target data, the storage system calculates the data fingerprint of the target data according to the preset algorithm to obtain the target data fingerprint. There is a one-to-one correspondence between the target data fingerprint and the target data.
  • Step S308: The storage system determines whether the storage layer of the target storage node has already stored the target data fingerprint.
  • Step S310: If the storage layer of the target storage node has not stored the target data fingerprint, the storage system stores the target data fingerprint in the storage layer of the target storage node and returns to the client prompt information indicating a successful write; the prompt information carries the target data fingerprint.
  • Step S312: If the storage layer of the target storage node has already stored the target data fingerprint, the storage system returns to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint, then deletes the target data stored in the cache layer of the target storage node and updates the reference count of the target data fingerprint stored in the storage layer of the target storage node.
  • the preset algorithm may be the MD5 message digest algorithm, a widely used cryptographic hash function that can generate a 128-bit (16-byte) hash value to ensure complete and consistent information transmission.
  • the target data fingerprint is a value calculated according to the MD5 message digest algorithm.
  • Each time a client object reference is added to a data fingerprint stored in the storage layer, the reference count of the data fingerprint is incremented by 1; each time a client object reference is removed, the reference count is decremented by 1. A reference count of 0 means that the data fingerprint stored in the storage layer is no longer referenced by any client object. The data fingerprint can then be deleted from the storage layer, the data in the cache layer corresponding to that fingerprint is reclaimed as garbage, and the freed space is linked into the free list for reuse.
  • If the target data is duplicate data, it must be deleted from the cache layer of the target storage node. The cache layer of the target storage node stores the target data, and the storage layer of the target storage node stores the target data fingerprint and its reference count.
  • the deduplication of the cache layer in the storage system avoids the problem of redundant duplicate data increasing storage space consumption and improves the storage efficiency of the storage space.
  • Embodiment 2 of the present application provides a computer non-volatile readable storage medium, and the computer non-volatile readable storage medium includes a stored program, where the storage system is controlled to execute the above steps S302 to S312 when the program is running.
  • Embodiment 2 of the present application provides a storage system including a memory and a processor. The memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and steps S302 to S312 above are implemented when the program instructions are loaded and executed by the processor.
  • Embodiment 2 of the present application provides a distributed global data deduplication device, which is used to implement the distributed global data deduplication method provided in Embodiment 2.
  • The device is deployed on a client and a storage system. The following units are deployed in the client: the first receiving unit 10, the determining unit 20, the sending unit 30, the second receiving unit 40, and the update unit 50. The following units are deployed in the storage system: the receiving unit 11, the first judging unit 21, the calculation unit 31, the second judging unit 41, the storage unit 51, and the deleting unit 61.
  • the first receiving unit 10 is configured to receive a write request for target data, determine a target object number corresponding to the target data according to a first preset rule, and store the correspondence relationship between the target data and the target object number in a metadata list.
  • the determining unit 20 is configured to determine the target storage node corresponding to the target object number according to a second preset rule.
  • the second preset rule is a corresponding rule between the object number and the storage node, and the target storage node is deployed in the storage system.
  • the sending unit 30 is configured to send target data to the target storage node.
  • the second receiving unit 40 is configured to receive the prompt information returned by the target storage node, and determine whether the prompt information carries a target data fingerprint.
  • the target data fingerprint is data generated at the target storage node according to the target data.
  • the update unit 50 is configured to, if the prompt information carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  • a slicing unit and allocation units are also deployed in the client.
  • the slicing unit is used to slice each disk image of the client according to a preset value, before the first receiving unit 10 receives the write request for the target data, to obtain multiple shards; the first allocation unit is used to assign an object number to each of the shards; the second allocation unit is used to assign a storage node to each object number according to the second preset rule.
  • the receiving unit 11 is configured to receive the target data sent by the client, and write the target data into the cache layer of the target storage node.
  • the first judging unit 21 is used for judging whether the target data needs to calculate the data fingerprint.
  • the calculation unit 31 is configured to calculate the data fingerprint of the target data according to a preset algorithm if the target data needs to calculate the data fingerprint to obtain the target data fingerprint. There is a one-to-one correspondence between the target data fingerprint and the target data.
  • the second judgment unit 41 is used to judge whether the storage layer of the target storage node has stored the target data fingerprint.
  • the storage unit 51 is configured to, if the storage layer of the target storage node has not stored the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint.
  • the deleting unit 61 is configured to, if the storage layer of the target storage node has already stored the target data fingerprint, return to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
  • the above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium.
  • the above-mentioned software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
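The client-side preparation described for Embodiment 2 (slice each disk image by a preset value, number the shards, map each object number to a storage node) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 1-based numbering, and the wrap-around when there are more object ranges than nodes are assumptions made for the example; only the "objects 1 to 10 on node A, objects 11 to 20 on node B" pattern comes from the text.

```python
def slice_image(image: bytes, shard_size: int) -> dict:
    """Cut a disk image into fixed-size shards and number them from 1.

    shard_size plays the role of the preset value (8K to 64M in the text).
    The last shard may be shorter than shard_size.
    """
    return {
        index + 1: image[offset:offset + shard_size]
        for index, offset in enumerate(range(0, len(image), shard_size))
    }


def node_for_object(object_no: int, nodes: list, objects_per_node: int = 10):
    """Hypothetical second preset rule: contiguous object ranges per node.

    Objects 1..10 map to nodes[0], 11..20 to nodes[1], and so on; wrapping
    around after the last node is an assumption of this sketch.
    """
    return nodes[((object_no - 1) // objects_per_node) % len(nodes)]
```

With two nodes named "A" and "B", object numbers 1 through 10 resolve to "A" and 11 through 20 to "B", matching the example mapping in the text.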

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

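A distributed global data deduplication method and device, relating to the field of big data technology. The method includes: receiving a write request for target data and determining the target object number corresponding to the target data; determining the corresponding target storage node according to the target object number and writing the target data into the cache layer of the target storage node; determining whether a data fingerprint needs to be calculated for the target data; if so, calculating the data fingerprint of the target data to obtain the target data fingerprint; returning prompt information indicating a successful write, the prompt information carrying the target data fingerprint; and storing the target data fingerprint. The method can therefore solve the prior-art problem that a large amount of redundant, duplicate information in a distributed storage system leads to high storage space consumption and low storage efficiency.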

Description

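Distributed global data deduplication method and device
This application claims priority to Chinese patent application No. 201910327312.6, filed with the Chinese Patent Office on April 23, 2019 and entitled "Distributed global data deduplication method and device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of big data technology, and in particular to a distributed global data deduplication method and device.
Background
A distributed storage system stores data across multiple independent devices. A traditional network storage system keeps all data on a centralized storage server, which becomes the bottleneck for system performance as well as the focal point for reliability and security, and cannot meet the needs of large-scale storage applications. A distributed network storage system uses a scalable architecture in which multiple storage servers share the storage load and a location server locates stored information. This improves reliability, availability, and access efficiency, and the system is also easy to expand.
Data volumes are currently growing explosively, which poses new challenges for the capacity and energy consumption management of existing distributed storage systems. Distributed storage systems contain a large amount of redundant, duplicate information, which increases storage space consumption and lowers storage efficiency.
Summary of the Application
In view of this, the embodiments of this application provide a distributed global data deduplication method and device to solve the prior-art problem that a large amount of redundant, duplicate information in a distributed storage system leads to high storage space consumption and low storage efficiency.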
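In one aspect, an embodiment of this application provides a distributed global data deduplication method applied to a storage system. The method includes: a storage gateway receives a client's write request for target data, determines the target object number corresponding to the target data according to a first preset rule, and stores the correspondence between the target data and the target object number in a metadata list; the target storage node corresponding to the target object number is determined according to a second preset rule, and the storage gateway writes the target data into the cache layer of the target storage node, the second preset rule being a correspondence rule between object numbers and storage nodes; it is determined whether a data fingerprint needs to be calculated for the target data; if so, the data fingerprint of the target data is calculated according to a preset algorithm to obtain the target data fingerprint, there being a one-to-one correspondence between the target data fingerprint and the target data; it is determined whether the storage layer of the target storage node has already stored the target data fingerprint; if it has not, the target data fingerprint is stored in the storage layer of the target storage node, and prompt information indicating a successful write, carrying the target data fingerprint, is returned to the storage gateway; if it has, prompt information indicating a successful write, carrying the target data fingerprint, is returned to the storage gateway, after which the target data stored in the cache layer of the target storage node is deleted and the reference count of the target data fingerprint stored in the storage layer of the target storage node is updated; the storage gateway receives the prompt information and determines whether it carries the target data fingerprint; if it does, the correspondence between the target data and the target object number in the metadata list is updated to the correspondence among the target data, the target object number, and the target data fingerprint, and the correspondence rule between the target object number and the target storage node in the second preset rule is updated to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
In one aspect, an embodiment of this application provides a distributed global data deduplication method executed by a client. The method includes: receiving a write request for target data, determining the target object number corresponding to the target data according to a first preset rule, and storing the correspondence between the target data and the target object number in a metadata list; determining the target storage node corresponding to the target object number according to a second preset rule, the second preset rule being a correspondence rule between object numbers and storage nodes, and the target storage node being deployed in a storage system; sending the target data to the target storage node; receiving prompt information returned by the target storage node and determining whether the prompt information carries a target data fingerprint, the target data fingerprint being data generated at the target storage node from the target data; and, if the prompt information carries the target data fingerprint, updating the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and updating the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
In one aspect, an embodiment of this application provides a distributed global data deduplication method executed by a storage system. The method includes: receiving target data sent by a client and writing the target data into the cache layer of a target storage node; determining whether a data fingerprint needs to be calculated for the target data; if so, calculating the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint, there being a one-to-one correspondence between the target data fingerprint and the target data; determining whether the storage layer of the target storage node has already stored the target data fingerprint; if it has not, storing the target data fingerprint in the storage layer of the target storage node and returning to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint; if it has, returning to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint, then deleting the target data stored in the cache layer of the target storage node and updating the reference count of the target data fingerprint stored in the storage layer of the target storage node.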
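In one aspect, an embodiment of this application provides a distributed global data deduplication device. The device includes: a first receiving unit, by which a storage gateway receives a client's write request for target data, determines the target object number corresponding to the target data according to a first preset rule, and stores the correspondence between the target data and the target object number in a metadata list; a first determining unit, configured to determine the target storage node corresponding to the target object number according to a second preset rule, the storage gateway writing the target data into the cache layer of the target storage node, and the second preset rule being a correspondence rule between object numbers and storage nodes; a first judging unit, configured to determine whether a data fingerprint needs to be calculated for the target data; a first calculation unit, configured to calculate, if a data fingerprint is needed, the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint, there being a one-to-one correspondence between the target data fingerprint and the target data; a second judging unit, configured to determine whether the storage layer of the target storage node has already stored the target data fingerprint; a first storage unit, configured to, if the storage layer has not stored the target data fingerprint, store it in the storage layer of the target storage node and return to the storage gateway prompt information indicating a successful write, the prompt information carrying the target data fingerprint; a deleting unit, configured to, if the storage layer has already stored the target data fingerprint, return to the storage gateway prompt information indicating a successful write, the prompt information carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node; a third judging unit, by which the storage gateway receives the prompt information and determines whether it carries the target data fingerprint; and a first update unit, configured to, if the prompt information carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
In one aspect, an embodiment of this application provides a distributed global data deduplication device. The device includes: a first receiving unit, configured to receive a write request for target data, determine the target object number corresponding to the target data according to a first preset rule, and store the correspondence between the target data and the target object number in a metadata list; a determining unit, configured to determine the target storage node corresponding to the target object number according to a second preset rule, the second preset rule being a correspondence rule between object numbers and storage nodes, and the target storage node being deployed in a storage system; a sending unit, configured to send the target data to the target storage node; a second receiving unit, configured to receive prompt information returned by the target storage node and determine whether it carries a target data fingerprint, the target data fingerprint being data generated at the target storage node from the target data; and an update unit, configured to, if the prompt information carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.
In one aspect, an embodiment of this application provides a distributed global data deduplication device. The device includes: a receiving unit, configured to receive target data sent by a client and write the target data into the cache layer of a target storage node; a first judging unit, configured to determine whether a data fingerprint needs to be calculated for the target data; a calculation unit, configured to calculate, if a data fingerprint is needed, the data fingerprint of the target data according to a preset algorithm to obtain the target data fingerprint, there being a one-to-one correspondence between the target data fingerprint and the target data; a second judging unit, configured to determine whether the storage layer of the target storage node has already stored the target data fingerprint; a storage unit, configured to, if the storage layer has not stored the target data fingerprint, store it in the storage layer of the target storage node and return to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint; and a deleting unit, configured to, if the storage layer has already stored the target data fingerprint, return to the client prompt information indicating a successful write, the prompt information carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node.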
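In one aspect, an embodiment of this application provides a computer non-volatile readable storage medium. The medium includes a stored program, and when the program runs, the device in which the medium is located is controlled to execute the distributed global data deduplication method described above.
In one aspect, an embodiment of this application provides a computer device including a memory and a processor. The memory is used to store information including program instructions, the processor is used to control the execution of the program instructions, and the steps of the distributed global data deduplication method described above are implemented when the program instructions are loaded and executed by the processor.
In this solution, the target object number is determined from the written target data, the target storage node is determined from the target object number, the target data is written into the cache layer of the target storage node, and the target data fingerprint corresponding to the target data is generated according to the preset algorithm. Because the target data corresponds to a unique storage node in the distributed storage system and a unique target data fingerprint is generated for it, the uniqueness of the target data in the storage system is indicated. If the storage layer of the target storage node has already stored the target data fingerprint, the target data in the cache layer of the target storage node is deleted. This avoids the prior-art problem of high storage space consumption and low storage efficiency caused by a large amount of redundant, duplicate information in a distributed storage system, reducing storage space consumption and improving the storage efficiency of the storage system.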
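Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flowchart of an optional distributed global data deduplication method according to an embodiment of this application;
Figure 2 is a schematic diagram of an optional distributed global data deduplication device according to an embodiment of this application.
Detailed Description
For a better understanding of the technical solutions of this application, the embodiments are described in detail below with reference to the drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of this application.
The terms used in the embodiments of this application are for describing particular embodiments only and are not intended to limit this application. The singular forms "a", "said", and "the" used in the embodiments and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.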
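Embodiment 1
Embodiment 1 of this application provides a distributed global data deduplication method applied to a storage system. As shown in Figure 1, the method includes the following steps:
Step S102: The storage gateway receives a client's write request for target data, determines the target object number corresponding to the target data according to the first preset rule, and stores the correspondence between the target data and the target object number in the metadata list.
Step S104: The target storage node corresponding to the target object number is determined according to the second preset rule, and the storage gateway writes the target data into the cache layer of the target storage node. The second preset rule is a correspondence rule between object numbers and storage nodes.
Step S106: Determine whether a data fingerprint needs to be calculated for the target data.
Step S108: If a data fingerprint needs to be calculated, calculate the data fingerprint of the target data according to the preset algorithm to obtain the target data fingerprint. There is a one-to-one correspondence between the target data fingerprint and the target data.
Step S110: Determine whether the storage layer of the target storage node has already stored the target data fingerprint.
Step S112: If the storage layer of the target storage node has not stored the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return to the storage gateway prompt information indicating a successful write; the prompt information carries the target data fingerprint.
Step S114: If the storage layer of the target storage node has already stored the target data fingerprint, return to the storage gateway prompt information indicating a successful write, the prompt information carrying the target data fingerprint; then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node.
Step S116: The storage gateway receives the prompt information and determines whether it carries the target data fingerprint.
Step S118: If the prompt information carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to the correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to the correspondence rule among the target object number, the target data fingerprint, and the target storage node.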
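Steps S102 to S118 can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the patented implementation: the class names `StorageNode` and `Gateway`, the two cache dictionaries, the range-based node mapping, and the choice to key the kept copy by fingerprint are all hypothetical; the text only specifies that the cache layer holds data and the storage layer holds fingerprints with reference counts. Step S106 (deciding whether a fingerprint is needed) is omitted, so every write is fingerprinted.

```python
import hashlib


class StorageNode:
    """Toy node: a cache layer holding data, a storage layer holding fingerprints."""

    def __init__(self):
        self.cache_by_object = {}  # cache layer: object number -> freshly written data
        self.cache_by_fp = {}      # cache layer: fingerprint -> the single kept copy
        self.refcount = {}         # storage layer: fingerprint -> reference count

    def write(self, object_no, data):
        # S104: the incoming data lands in the cache layer first.
        self.cache_by_object[object_no] = data
        # S108: compute the fingerprint with the preset algorithm (MD5 here).
        fp = hashlib.md5(data).hexdigest()
        # S110: has the storage layer already stored this fingerprint?
        if fp not in self.refcount:
            # S112: new content; record the fingerprint and keep the copy.
            self.refcount[fp] = 1
            self.cache_by_fp[fp] = self.cache_by_object.pop(object_no)
        else:
            # S114: duplicate; drop the copy just written and bump the count.
            del self.cache_by_object[object_no]
            self.refcount[fp] += 1
        return fp  # carried back in the "write succeeded" prompt information


class Gateway:
    """Toy storage gateway holding the metadata list and the node mapping."""

    def __init__(self, nodes, objects_per_node=10):
        self.nodes = nodes
        self.objects_per_node = objects_per_node
        self.metadata = {}  # object number -> fingerprint (None until known)

    def node_for(self, object_no):
        # Assumed second preset rule: contiguous object ranges per node.
        index = ((object_no - 1) // self.objects_per_node) % len(self.nodes)
        return self.nodes[index]

    def write(self, object_no, data):
        self.metadata[object_no] = None                        # S102
        fp = self.node_for(object_no).write(object_no, data)   # S104 to S114
        if fp is not None:                                     # S116
            self.metadata[object_no] = fp                      # S118
        return fp
```

Writing the same bytes under object numbers 1 and 2 leaves one copy in the node's cache layer and a reference count of 2 for the shared fingerprint. Overwriting an existing object is the update-write case described later and is not handled by this sketch.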
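The preset algorithm may be the MD5 message digest algorithm, a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to ensure that information is transmitted completely and consistently.
The target data fingerprint is the value calculated according to the MD5 message digest algorithm.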
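As a small illustration of the fingerprint calculation, the sketch below uses Python's standard `hashlib` module. The function name `data_fingerprint` is hypothetical. Note also that MD5 collisions are known to be constructible, so the one-to-one correspondence between data and fingerprint should be read as holding in practice for non-adversarial data rather than as a mathematical guarantee.

```python
import hashlib


def data_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters (128 bits)."""
    return hashlib.md5(data).hexdigest()
```

Identical byte strings always yield the same fingerprint, which is what lets the storage layer detect duplicate writes by comparing fingerprints instead of the data itself.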
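The first preset rule is the correspondence rule between written data and object numbers.
Each time a client object reference is added to a data fingerprint stored in the storage layer, the reference count of the data fingerprint is incremented by 1; each time a client object reference is removed, the reference count is decremented by 1. A reference count of 0 means that the data fingerprint stored in the storage layer is no longer referenced by any client object. The data fingerprint can then be deleted from the storage layer, the data in the cache layer corresponding to that fingerprint is reclaimed as garbage, and the freed space is linked into the free list for reuse.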
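The reference-counting rule above can be sketched as follows. The class `FingerprintTable` and its method names are hypothetical, and the Python list `free_list` is only a stand-in for the free linked list mentioned in the text: it records reclaimed buffers rather than managing real disk space.

```python
class FingerprintTable:
    """Toy model of fingerprint reference counting with garbage collection."""

    def __init__(self):
        self.refcount = {}   # storage layer: fingerprint -> reference count
        self.cache = {}      # cache layer: fingerprint -> data
        self.free_list = []  # reclaimed buffers, standing in for the free list

    def add_reference(self, fp, data=None):
        # First reference: remember the data and start counting from zero.
        if fp not in self.refcount:
            self.refcount[fp] = 0
            self.cache[fp] = data
        self.refcount[fp] += 1

    def drop_reference(self, fp):
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            # No client object refers to this fingerprint any more: delete the
            # fingerprint and hand the cached data's space to the free list.
            del self.refcount[fp]
            self.free_list.append(self.cache.pop(fp))
```

Data is reclaimed only when the last reference is dropped, so a shared copy survives as long as at least one client object still points at its fingerprint.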
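If the target data is duplicate data, it must be deleted from the cache layer of the target storage node, and the correspondence for the target object number stored in the client or the storage system must be updated to the correspondence between the target object number and the target data fingerprint, so that the target data can be found on later access. The cache layer of the target storage node stores the target data, and the storage layer of the target storage node stores the target data fingerprint and its reference count.
Deleting duplicate data from the cache layer of the storage system avoids the increase in storage space consumption caused by redundant, duplicate data and improves the storage efficiency of the storage space.
In this solution, the target object number is determined from the written target data, the target storage node is determined from the target object number, the target data is written into the cache layer of the target storage node, and the target data fingerprint corresponding to the target data is generated according to the preset algorithm. Because the target data corresponds to a unique storage node in the distributed storage system and a unique target data fingerprint is generated for it, the uniqueness of the target data in the storage system is indicated. If the storage layer of the target storage node has already stored the target data fingerprint, the target data in the cache layer of the target storage node is deleted. This avoids the prior-art problem of high storage space consumption and low storage efficiency caused by a large amount of redundant, duplicate information in a distributed storage system, reducing storage space consumption and improving the storage efficiency of the storage system.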
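Optionally, after the correspondence rule between the target object number and the target storage node in the second preset rule is updated to the correspondence rule among the target object number, the target data fingerprint, and the target storage node, the method further includes: receiving a client's request to read the target data; determining whether the target data fingerprint exists in the metadata list; if the target data fingerprint does not exist in the metadata list, determining the target storage node according to the target object number in the metadata list and the second preset rule; obtaining, according to the target object number, the target data stored in the cache layer of the target storage node; and returning the target data to the client.
The target data fingerprint and the target object number are metadata of the target data.
Metadata, also known as intermediary data or relay data, is data that describes data (data about data). It mainly consists of information describing data properties and supports functions such as indicating storage location, historical data, resource search, and file recording. Metadata can be regarded as an electronic catalog: to compile the catalog, the content or characteristics of the data must be described and collected, which in turn assists data retrieval.
Optionally, after determining whether the target data fingerprint exists in the metadata list, the method further includes: if the target data fingerprint exists in the metadata list, determining the target storage node according to the second preset rule; searching for the target data fingerprint stored in the storage layer of the target storage node; determining, according to the target data fingerprint, the target data stored in the cache layer of the target storage node; and returning the target data to the client.
When the target data is read, if the target data fingerprint is available, the target data is looked up by the target data fingerprint; otherwise, it is looked up by the target object number. Either method can find the target data.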
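The two read paths can be sketched as a single lookup function. The function name `read_object` and the three dictionaries are assumptions made for illustration; the text does not say how the cache layer indexes data, so the sketch simply assumes one index by fingerprint and one by object number.

```python
def read_object(object_no, metadata, cache_by_fp, cache_by_object):
    """Return the data for object_no, preferring the fingerprint path.

    metadata:        object number -> fingerprint, or None if not yet known
    cache_by_fp:     fingerprint -> data (deduplicated copies)
    cache_by_object: object number -> data (data without a fingerprint)
    """
    fp = metadata.get(object_no)
    if fp is not None:
        # A fingerprint exists in the metadata list: look the data up by it.
        return cache_by_fp[fp]
    # No fingerprint recorded: fall back to the object number.
    return cache_by_object[object_no]
```

Because several object numbers may share one fingerprint, the fingerprint path is what allows many objects to resolve to a single stored copy.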
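Optionally, after the correspondence rule between the target object number and the target storage node in the second preset rule is updated to the correspondence rule among the target object number, the target data fingerprint, and the target storage node, the method further includes: receiving a write request for first target data; determining object number 1 according to the first preset rule; determining whether object number 1 has a corresponding data fingerprint in the metadata list; if object number 1 has a corresponding data fingerprint a, determining that the first target data is an update write; determining storage node A according to the second preset rule, and determining whether the storage system is write priority or read priority; if it is read priority, writing the first target data to the cache layer of storage node A; obtaining, according to data fingerprint a, the target data stored in the cache layer of storage node A, and merging the first target data with that target data to obtain second target data; calculating the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1; storing data fingerprint a1 in the storage layer of storage node A and storing the second target data in the cache layer of storage node A, the second target data corresponding to object number 1 and data fingerprint a1; updating the reference count of data fingerprint a; and updating data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
Optionally, after determining whether the storage system is write priority or read priority, the method further includes: if it is write priority, writing the first target data to the cache layer of storage node A and marking the first target data as dirty data; calculating the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2; storing data fingerprint a2 in the storage layer of storage node A and storing the first target data in the cache layer of storage node A; obtaining the target data corresponding to data fingerprint a from the cache layer and merging it with the first target data to obtain second target data, the second target data corresponding to object number 1 and data fingerprint a2; updating the reference count of data fingerprint a stored in the storage layer of storage node A; and updating data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.
When the client initiates a write request that is an update write, the updated data is stored in the cache layer of the storage node, data fingerprint a1 is stored in the storage layer of the storage node, and the reference count of data fingerprint a, which corresponds to the data before the update, is decremented by 1.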
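The read-priority update write can be sketched as below. The text does not define how the first target data is merged with the existing data, so this sketch assumes the update overlays the old bytes at a given byte offset; the function name `update_write`, the `offset` parameter, and the dictionary layout are all hypothetical. The write-priority branch and dirty-data marking are not modeled.

```python
import hashlib


def update_write(object_no, offset, new_bytes, metadata, cache, refcount):
    """Apply an update write and return the new fingerprint (a1 in the text).

    metadata: object number -> fingerprint
    cache:    fingerprint -> data (cache layer)
    refcount: fingerprint -> reference count (storage layer)
    """
    old_fp = metadata[object_no]          # data fingerprint a
    old = cache[old_fp]
    # Assumed merge rule: overlay new_bytes onto the old data at offset.
    merged = old[:offset] + new_bytes + old[offset + len(new_bytes):]
    new_fp = hashlib.md5(merged).hexdigest()
    if new_fp == old_fp:
        return new_fp                     # content unchanged, nothing to do
    # Store the merged data and its fingerprint, reusing an existing copy.
    if new_fp in refcount:
        refcount[new_fp] += 1
    else:
        refcount[new_fp] = 1
        cache[new_fp] = merged
    # The object no longer references the old fingerprint: decrement by 1.
    refcount[old_fp] -= 1
    if refcount[old_fp] == 0:
        del refcount[old_fp]
        del cache[old_fp]
    metadata[object_no] = new_fp          # replace a with a1 in the metadata
    return new_fp
```

If another object still references the old fingerprint, its data stays in the cache layer untouched, so an update to one object never alters data shared with other objects.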
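Embodiment 1 of this application provides a distributed global data deduplication apparatus for performing the distributed global data deduplication method of Embodiment 1. The apparatus is deployed in the storage system and, as shown in FIG. 2, includes: a first receiving unit 10, a first determining unit 20, a first judging unit 30, a first computing unit 40, a second judging unit 50, a first storage unit 60, a deletion unit 70, a third judging unit 80, and a first updating unit 90.
The first receiving unit 10 is configured so that the storage gateway receives a target data write request from the client, determines the target object number corresponding to the target data according to the first preset rule, and stores the correspondence between the target data and the target object number in the metadata list.
The first determining unit 20 is configured to determine, according to the second preset rule, the target storage node corresponding to the target object number, the storage gateway writing the target data to the cache layer of the target storage node; the second preset rule is the correspondence rule between object numbers and storage nodes.
The first judging unit 30 is configured to determine whether a data fingerprint needs to be computed for the target data.
The first computing unit 40 is configured to, if a data fingerprint needs to be computed for the target data, compute the data fingerprint of the target data according to the preset algorithm to obtain the target data fingerprint; the target data fingerprint and the target data are in one-to-one correspondence.
The second judging unit 50 is configured to determine whether the storage layer of the target storage node already stores the target data fingerprint.
The first storage unit 60 is configured to, if the storage layer of the target storage node does not store the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint.
The deletion unit 70 is configured to, if the storage layer of the target storage node already stores the target data fingerprint, return to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node.
The third judging unit 80 is configured so that the storage gateway receives the prompt message and determines whether it carries the target data fingerprint.
The first updating unit 90 is configured to, if the prompt message carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.
The preset algorithm may be the MD5 message-digest algorithm, a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to check that information has been transmitted completely and consistently.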
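As a small illustration, a fingerprint of this kind can be computed with Python's standard `hashlib` module. The helper name `fingerprint` is an assumption made for this sketch. Note that MD5 collisions are known to exist, so the one-to-one correspondence between data and fingerprint is an idealization that holds only with high probability for non-adversarial data.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the MD5 digest of a data block as 32 hex characters (128 bits)."""
    return hashlib.md5(data).hexdigest()


# Identical blocks give identical fingerprints; different blocks
# (almost always) give different ones.
same = fingerprint(b"block-A") == fingerprint(b"block-A")
different = fingerprint(b"block-A") != fingerprint(b"block-B")
```

Any other digest available in `hashlib`, such as SHA-256, could be swapped in without changing the surrounding logic; the source names only MD5.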
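Optionally, the apparatus further includes: a second receiving unit, configured to receive a request from the client to read the target data after the first updating unit 90 updates the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node; a fourth judging unit, configured to determine whether the target data fingerprint exists in the metadata list; a second determining unit, configured to, if the target data fingerprint does not exist in the metadata list, determine the target storage node from the target object number in the metadata list and the second preset rule; a second storage unit, configured to obtain, by the target object number, the target data stored in the cache layer of the target storage node; and a first return unit, configured to return the target data to the client.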
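Further, the apparatus includes: a third determining unit, configured to, after the fourth judging unit determines whether the target data fingerprint exists in the metadata list, determine the target storage node according to the second preset rule if the target data fingerprint exists in the metadata list; a lookup unit, configured to look up the target data fingerprint stored in the storage layer of the target storage node; a fourth determining unit, configured to determine, from the target data fingerprint, the target data stored in the cache layer of the target storage node; and a second return unit, configured to return the target data to the client.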
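Optionally, the apparatus further includes: a third receiving unit, configured to receive a write request for first target data after the first updating unit 90 updates the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node; a fifth determining unit, configured to determine object number 1 according to the first preset rule; a fifth judging unit, configured to determine whether object number 1 has a corresponding data fingerprint in the metadata list; a sixth determining unit, configured to determine that the first target data is an update write if object number 1 has a corresponding data fingerprint a; a sixth judging unit, configured to determine storage node A according to the second preset rule and determine whether the storage system is write-priority or read-priority; a writing unit, configured to write the first target data to the cache layer of storage node A if the system is read-priority; a merging unit, configured to obtain, by data fingerprint a, the target data stored in the cache layer of storage node A and merge the first target data with the target data to obtain second target data; a second computing unit, configured to compute the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1; a third storage unit, configured to store data fingerprint a1 in the storage layer of storage node A and store the second target data in the cache layer of storage node A, the second target data corresponding to object number 1 and data fingerprint a1; a second updating unit, configured to update the reference count of data fingerprint a; and a third updating unit, configured to update data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
Optionally, the apparatus further includes: a marking unit, configured to, after the sixth judging unit determines whether the storage system is write-priority or read-priority, write the first target data to the cache layer of storage node A and mark the first target data as dirty data if the system is write-priority; a third computing unit, configured to compute the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2; a fourth storage unit, configured to store data fingerprint a2 in the storage layer of storage node A and store the first target data in the cache layer of storage node A; an obtaining unit, configured to obtain from the cache layer the target data corresponding to data fingerprint a and merge the target data with the first target data to obtain second target data, the second target data corresponding to object number 1 and data fingerprint a2; a fourth updating unit, configured to update the reference count of data fingerprint a stored in the storage layer of storage node A; and a fifth updating unit, configured to update data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.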
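Embodiment 1 of this application provides a computer non-volatile readable storage medium that includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the distributed global data deduplication method of Embodiment 1.
Embodiment 1 of this application provides a storage system including a memory and a processor. The memory stores information including program instructions, the processor controls execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement the steps of the distributed global data deduplication method of Embodiment 1.
Embodiment 2
Embodiment 2 of this application provides a distributed global data deduplication method performed jointly by the client and the storage system: the client performs steps S202 to S210, and the storage system performs steps S302 to S312.
Step S202: the client receives a write request for target data, determines the target object number corresponding to the target data according to the first preset rule, and stores the correspondence between the target data and the target object number in the metadata list.
Step S204: the client determines, according to the second preset rule, the target storage node corresponding to the target object number. The second preset rule is the correspondence rule between object numbers and storage nodes, and the target storage node is deployed in the storage system.
Step S206: the client sends the target data to the target storage node.
Step S208: the client receives the prompt message returned by the target storage node and determines whether it carries the target data fingerprint. The target data fingerprint is generated at the target storage node from the target data.
Step S210: if the prompt message carries the target data fingerprint, the client updates the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and updates the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.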
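Steps S202 to S210 on the client side might look roughly like the following. This is a sketch under several assumptions: the class name `Client`, the use of `offset // slice_size` as the first preset rule, a plain dictionary as the second preset rule, and storage nodes modelled as callables that return a reply dictionary are all hypothetical.

```python
class Client:
    """Illustrative client for steps S202 to S210 (names are hypothetical)."""

    def __init__(self, slice_size, object_to_node, nodes):
        self.slice_size = slice_size
        self.metadata_list = {}                      # object number -> metadata
        self.object_to_node = dict(object_to_node)   # second preset rule
        self.placement = {}                          # (object number, fingerprint) -> node
        self.nodes = nodes                           # node name -> callable(obj, data) -> reply

    def write(self, offset, data):
        # S202: first preset rule (assumed here: offset divided by slice size).
        object_number = offset // self.slice_size
        self.metadata_list[object_number] = {"size": len(data)}
        # S204: second preset rule maps the object number to a storage node.
        node_name = self.object_to_node[object_number]
        # S206 / S208: send the data and wait for the prompt message.
        reply = self.nodes[node_name](object_number, data)
        # S210: record the fingerprint if the prompt message carries one.
        fp = reply.get("fingerprint")
        if fp is not None:
            self.metadata_list[object_number]["fingerprint"] = fp
            self.placement[(object_number, fp)] = node_name
        return reply
```

Keeping `placement` separate from `object_to_node` is a design choice of this sketch; it models the three-way correspondence among object number, fingerprint, and node without overwriting the original two-way rule.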
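Optionally, the client further performs the following steps: before receiving the write request for the target data, slicing each disk image of the client according to a preset value to obtain multiple slices; assigning an object number to each of the slices; and assigning a storage node to each object number according to the second preset rule.
The preset value used to slice the client's disk images can be set anywhere from 8K to 64M depending on the application scenario.
If the slices have object numbers object 1, object 2, ..., object n, the second preset rule distributes the object numbers across multiple storage nodes. For example, objects 1 to 10 may be placed on storage node A and objects 11 to 20 on storage node B. Assigning each client slice to exactly one storage node in the storage system, and storing written data on that node, simplifies management of the storage system and data lookup, and improves the energy-consumption management efficiency of the storage system.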
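The slicing and mapping just described can be illustrated with the short sketch below. The function names, the default of 10 objects per node (taken from the example ranges in the text), and the wrap-around to the first node when the node list is exhausted are assumptions for illustration.

```python
def slice_image(image: bytes, slice_size: int):
    """Split a disk image into fixed-size slices; the last one may be shorter."""
    return [image[i:i + slice_size] for i in range(0, len(image), slice_size)]


def assign_nodes(num_objects: int, node_names, objects_per_node: int = 10):
    """Map object numbers 1..n to nodes in contiguous ranges.

    With the defaults, objects 1-10 go to the first node and 11-20 to the
    second. Wrapping around when nodes run out is an assumption of this sketch.
    """
    mapping = {}
    for obj in range(1, num_objects + 1):
        idx = ((obj - 1) // objects_per_node) % len(node_names)
        mapping[obj] = node_names[idx]
    return mapping
```

For instance, `assign_nodes(20, ["A", "B"])` places objects 1 to 10 on node A and objects 11 to 20 on node B, matching the example in the text.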
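Embodiment 2 of this application provides a computer non-volatile readable storage medium that includes a stored program, wherein, when the program runs, the client is controlled to perform steps S202 to S210 above.
Embodiment 2 of this application provides a client including a memory and a processor. The memory stores information including program instructions, the processor controls execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement steps S202 to S210 above.
Step S302: the storage system receives the target data sent by the client and writes the target data to the cache layer of the target storage node.
Step S304: the storage system determines whether a data fingerprint needs to be computed for the target data.
Step S306: if a data fingerprint needs to be computed for the target data, the storage system computes the data fingerprint of the target data according to the preset algorithm to obtain the target data fingerprint; the target data fingerprint and the target data are in one-to-one correspondence.
Step S308: the storage system determines whether the storage layer of the target storage node already stores the target data fingerprint.
Step S310: if the storage layer of the target storage node does not store the target data fingerprint, the storage system stores the target data fingerprint in the storage layer of the target storage node and returns to the client a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint.
Step S312: if the storage layer of the target storage node already stores the target data fingerprint, the storage system returns to the client a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then deletes the target data stored in the cache layer of the target storage node and updates the reference count of the target data fingerprint stored in the storage layer of the target storage node.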
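Steps S302 to S312 on the storage side can be sketched as a single write routine. This is a minimal illustration rather than the patented implementation: the class name `StorageNode`, the `needs_fingerprint` flag standing in for the unspecified check in S304, the tuple-shaped cache keys, and the choice to start a new fingerprint's reference count at 1 are all assumptions.

```python
import hashlib


class StorageNode:
    """Illustrative storage node for steps S302 to S312."""

    def __init__(self):
        self.cache_layer = {}    # ("obj", number) or ("fp", fingerprint) -> data
        self.storage_layer = {}  # fingerprint -> reference count

    def write(self, object_number, data, needs_fingerprint=True):
        # S302: stage the incoming data in the cache layer.
        self.cache_layer[("obj", object_number)] = data
        # S304: decide whether to fingerprint (criterion assumed to be a flag).
        if not needs_fingerprint:
            return {"ok": True}
        # S306: compute the fingerprint with the preset algorithm (MD5 here).
        fp = hashlib.md5(data).hexdigest()
        staged = self.cache_layer.pop(("obj", object_number))
        # S308: is the fingerprint already in the storage layer?
        if fp not in self.storage_layer:
            # S310: new data, so record the fingerprint and keep one copy.
            self.storage_layer[fp] = 1
            self.cache_layer[("fp", fp)] = staged
        else:
            # S312: duplicate, so drop the staged copy and bump the count.
            self.storage_layer[fp] += 1
        return {"ok": True, "fingerprint": fp}
```

Writing the same bytes under two object numbers leaves a single cached copy with a reference count of 2, which is the space saving the method is after.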
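Embodiment 2 of this application provides a computer non-volatile readable storage medium that includes a stored program, wherein, when the program runs, the storage system is controlled to perform steps S302 to S312 above.
Embodiment 2 of this application provides a storage system including a memory and a processor. The memory stores information including program instructions, the processor controls execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement steps S302 to S312 above.
Embodiment 2 of this application provides a distributed global data deduplication apparatus for performing the distributed global data deduplication method of Embodiment 2. The apparatus is deployed across the client and the storage system. The client hosts the following units: a first receiving unit 10, a determining unit 20, a sending unit 30, a second receiving unit 40, and an updating unit 50. The storage system hosts the following units: a receiving unit 11, a first judging unit 21, a computing unit 31, a second judging unit 41, a storage unit 51, and a deletion unit 61.
The first receiving unit 10 is configured to receive a write request for target data, determine the target object number corresponding to the target data according to the first preset rule, and store the correspondence between the target data and the target object number in the metadata list.
The determining unit 20 is configured to determine, according to the second preset rule, the target storage node corresponding to the target object number. The second preset rule is the correspondence rule between object numbers and storage nodes, and the target storage node is deployed in the storage system.
The sending unit 30 is configured to send the target data to the target storage node.
The second receiving unit 40 is configured to receive the prompt message returned by the target storage node and determine whether it carries the target data fingerprint. The target data fingerprint is generated at the target storage node from the target data.
The updating unit 50 is configured to, if the prompt message carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.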
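Optionally, the client also hosts a slicing unit, a first allocation unit, and a second allocation unit. The slicing unit is configured to slice each disk image of the client according to a preset value, before the first receiving unit 10 receives the write request for the target data, to obtain multiple slices; the first allocation unit is configured to assign an object number to each of the slices; and the second allocation unit is configured to assign a storage node to each object number according to the second preset rule.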
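The receiving unit 11 is configured to receive the target data sent by the client and write the target data to the cache layer of the target storage node.
The first judging unit 21 is configured to determine whether a data fingerprint needs to be computed for the target data.
The computing unit 31 is configured to, if a data fingerprint needs to be computed for the target data, compute the data fingerprint of the target data according to the preset algorithm to obtain the target data fingerprint; the target data fingerprint and the target data are in one-to-one correspondence.
The second judging unit 41 is configured to determine whether the storage layer of the target storage node already stores the target data fingerprint.
The storage unit 51 is configured to, if the storage layer of the target storage node does not store the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return to the client a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint.
The deletion unit 61 is configured to, if the storage layer of the target storage node already stores the target data fingerprint, return to the client a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node.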
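Those skilled in the art will readily understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
An integrated unit implemented as a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely preferred embodiments of this application and do not limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.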

Claims (20)

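  1. A distributed global data deduplication method, characterized in that the method is applied to a storage system and comprises:
    a storage gateway receiving a target data write request from a client, determining a target object number corresponding to the target data according to a first preset rule, and storing the correspondence between the target data and the target object number in a metadata list;
    determining, according to a second preset rule, a target storage node corresponding to the target object number, the storage gateway writing the target data to the cache layer of the target storage node, the second preset rule being a correspondence rule between object numbers and storage nodes;
    determining whether a data fingerprint needs to be computed for the target data;
    if a data fingerprint needs to be computed for the target data, computing the data fingerprint of the target data according to a preset algorithm to obtain a target data fingerprint, the target data fingerprint and the target data being in one-to-one correspondence;
    determining whether the storage layer of the target storage node already stores the target data fingerprint;
    if the storage layer of the target storage node does not store the target data fingerprint, storing the target data fingerprint in the storage layer of the target storage node and returning to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint;
    if the storage layer of the target storage node already stores the target data fingerprint, returning to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then deleting the target data stored in the cache layer of the target storage node and updating the reference count of the target data fingerprint stored in the storage layer of the target storage node;
    the storage gateway receiving the prompt message and determining whether the prompt message carries the target data fingerprint;
    if the prompt message carries the target data fingerprint, updating the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and updating the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  2. The method according to claim 1, characterized in that, after the updating of the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node, the method further comprises:
    receiving a request from the client to read the target data;
    determining whether the target data fingerprint exists in the metadata list;
    if the target data fingerprint does not exist in the metadata list, determining the target storage node from the target object number in the metadata list and the second preset rule;
    obtaining, by the target object number, the target data stored in the cache layer of the target storage node;
    returning the target data to the client.
  3. The method according to claim 2, characterized in that, after the determining whether the target data fingerprint exists in the metadata list, the method further comprises:
    if the target data fingerprint exists in the metadata list, determining the target storage node according to the second preset rule;
    looking up the target data fingerprint stored in the storage layer of the target storage node;
    determining, from the target data fingerprint, the target data stored in the cache layer of the target storage node;
    returning the target data to the client.
  4. The method according to any one of claims 1 to 3, characterized in that, after the updating of the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node, the method further comprises:
    receiving a write request for first target data;
    determining object number 1 according to the first preset rule;
    determining whether object number 1 has a corresponding data fingerprint in the metadata list;
    if object number 1 has a corresponding data fingerprint a, determining that the first target data is an update write;
    determining storage node A according to the second preset rule, and determining whether the storage system is write-priority or read-priority;
    if it is read-priority, writing the first target data to the cache layer of storage node A;
    obtaining, by data fingerprint a, the target data stored in the cache layer of storage node A, and merging the first target data with the target data to obtain second target data;
    computing the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1;
    storing data fingerprint a1 in the storage layer of storage node A and storing the second target data in the cache layer of storage node A, the second target data corresponding to object number 1 and data fingerprint a1;
    updating the reference count of data fingerprint a;
    updating data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
  5. The method according to claim 4, characterized in that, after the determining whether the storage system is write-priority or read-priority, the method further comprises:
    if it is write-priority, writing the first target data to the cache layer of storage node A and marking the first target data as dirty data;
    computing the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2;
    storing data fingerprint a2 in the storage layer of storage node A and storing the first target data in the cache layer of storage node A;
    obtaining from the cache layer the target data corresponding to data fingerprint a, and merging the target data with the first target data to obtain the second target data, the second target data corresponding to object number 1 and data fingerprint a2;
    updating the reference count of data fingerprint a stored in the storage layer of storage node A;
    updating data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.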
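  6. A distributed global data deduplication apparatus, characterized in that the apparatus is applied to a storage system and comprises:
    a first receiving unit, configured to cause a storage gateway to receive a target data write request from a client, determine a target object number corresponding to the target data according to a first preset rule, and store the correspondence between the target data and the target object number in a metadata list;
    a first determining unit, configured to determine, according to a second preset rule, a target storage node corresponding to the target object number, the storage gateway writing the target data to the cache layer of the target storage node, the second preset rule being a correspondence rule between object numbers and storage nodes;
    a first judging unit, configured to determine whether a data fingerprint needs to be computed for the target data;
    a first computing unit, configured to, if a data fingerprint needs to be computed for the target data, compute the data fingerprint of the target data according to a preset algorithm to obtain a target data fingerprint, the target data fingerprint and the target data being in one-to-one correspondence;
    a second judging unit, configured to determine whether the storage layer of the target storage node already stores the target data fingerprint;
    a first storage unit, configured to, if the storage layer of the target storage node does not store the target data fingerprint, store the target data fingerprint in the storage layer of the target storage node and return to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint;
    a deletion unit, configured to, if the storage layer of the target storage node already stores the target data fingerprint, return to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then delete the target data stored in the cache layer of the target storage node and update the reference count of the target data fingerprint stored in the storage layer of the target storage node;
    a third judging unit, configured to cause the storage gateway to receive the prompt message and determine whether the prompt message carries the target data fingerprint;
    a first updating unit, configured to, if the prompt message carries the target data fingerprint, update the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and update the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
    a second receiving unit, configured to receive a request from the client to read the target data after the first updating unit updates the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node;
    a fourth judging unit, configured to determine whether the target data fingerprint exists in the metadata list;
    a second determining unit, configured to, if the target data fingerprint does not exist in the metadata list, determine the target storage node from the target object number in the metadata list and the second preset rule;
    a second storage unit, configured to obtain, by the target object number, the target data stored in the cache layer of the target storage node;
    a first return unit, configured to return the target data to the client.
  8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a third determining unit, configured to, after the fourth judging unit determines whether the target data fingerprint exists in the metadata list, determine the target storage node according to the second preset rule if the target data fingerprint exists in the metadata list;
    a lookup unit, configured to look up the target data fingerprint stored in the storage layer of the target storage node;
    a fourth determining unit, configured to determine, from the target data fingerprint, the target data stored in the cache layer of the target storage node;
    a second return unit, configured to return the target data to the client.
  9. The apparatus according to any one of claims 6 to 8, characterized in that the apparatus further comprises:
    a third receiving unit, configured to receive a write request for first target data after the first updating unit updates the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node;
    a fifth determining unit, configured to determine object number 1 according to the first preset rule;
    a fifth judging unit, configured to determine whether object number 1 has a corresponding data fingerprint in the metadata list;
    a sixth determining unit, configured to determine that the first target data is an update write if object number 1 has a corresponding data fingerprint a;
    a sixth judging unit, configured to determine storage node A according to the second preset rule and determine whether the storage system is write-priority or read-priority;
    a writing unit, configured to write the first target data to the cache layer of storage node A if the system is read-priority;
    a merging unit, configured to obtain, by data fingerprint a, the target data stored in the cache layer of storage node A, and merge the first target data with the target data to obtain second target data;
    a second computing unit, configured to compute the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1;
    a third storage unit, configured to store data fingerprint a1 in the storage layer of storage node A and store the second target data in the cache layer of storage node A, the second target data corresponding to object number 1 and data fingerprint a1;
    a second updating unit, configured to update the reference count of data fingerprint a;
    a third updating unit, configured to update data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    a marking unit, configured to, after the sixth judging unit determines whether the storage system is write-priority or read-priority, write the first target data to the cache layer of storage node A and mark the first target data as dirty data if the system is write-priority;
    a third computing unit, configured to compute the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2;
    a fourth storage unit, configured to store data fingerprint a2 in the storage layer of storage node A and store the first target data in the cache layer of storage node A;
    an obtaining unit, configured to obtain from the cache layer the target data corresponding to data fingerprint a, and merge the target data with the first target data to obtain the second target data, the second target data corresponding to object number 1 and data fingerprint a2;
    a fourth updating unit, configured to update the reference count of data fingerprint a stored in the storage layer of storage node A;
    a fifth updating unit, configured to update data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.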
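  11. A computer device comprising a memory and a processor, the memory being configured to store information including program instructions and the processor being configured to control execution of the program instructions, characterized in that the program instructions, when loaded and executed by the processor, implement the following steps:
    controlling a storage gateway to receive a target data write request from a client, determining a target object number corresponding to the target data according to a first preset rule, and storing the correspondence between the target data and the target object number in a metadata list;
    determining, according to a second preset rule, a target storage node corresponding to the target object number, the storage gateway writing the target data to the cache layer of the target storage node, the second preset rule being a correspondence rule between object numbers and storage nodes;
    determining whether a data fingerprint needs to be computed for the target data;
    if a data fingerprint needs to be computed for the target data, computing the data fingerprint of the target data according to a preset algorithm to obtain a target data fingerprint, the target data fingerprint and the target data being in one-to-one correspondence;
    determining whether the storage layer of the target storage node already stores the target data fingerprint;
    if the storage layer of the target storage node does not store the target data fingerprint, storing the target data fingerprint in the storage layer of the target storage node and returning to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint;
    if the storage layer of the target storage node already stores the target data fingerprint, returning to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then deleting the target data stored in the cache layer of the target storage node and updating the reference count of the target data fingerprint stored in the storage layer of the target storage node;
    the storage gateway receiving the prompt message and determining whether the prompt message carries the target data fingerprint;
    if the prompt message carries the target data fingerprint, updating the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and updating the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  12. The computer device according to claim 11, characterized in that the program instructions, when loaded and executed by the processor, further implement the following steps:
    after the updating of the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node, receiving a request from the client to read the target data;
    determining whether the target data fingerprint exists in the metadata list;
    if the target data fingerprint does not exist in the metadata list, determining the target storage node from the target object number in the metadata list and the second preset rule;
    obtaining, by the target object number, the target data stored in the cache layer of the target storage node;
    returning the target data to the client.
  13. The computer device according to claim 12, characterized in that the program instructions, when loaded and executed by the processor, further implement the following steps:
    after the determining whether the target data fingerprint exists in the metadata list, if the target data fingerprint exists in the metadata list, determining the target storage node according to the second preset rule;
    looking up the target data fingerprint stored in the storage layer of the target storage node;
    determining, from the target data fingerprint, the target data stored in the cache layer of the target storage node;
    returning the target data to the client.
  14. The computer device according to any one of claims 11 to 13, characterized in that the program instructions, when loaded and executed by the processor, further implement the following steps:
    after the updating of the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node, receiving a write request for first target data;
    determining object number 1 according to the first preset rule;
    determining whether object number 1 has a corresponding data fingerprint in the metadata list;
    if object number 1 has a corresponding data fingerprint a, determining that the first target data is an update write;
    determining storage node A according to the second preset rule, and determining whether the storage system is write-priority or read-priority;
    if it is read-priority, writing the first target data to the cache layer of storage node A;
    obtaining, by data fingerprint a, the target data stored in the cache layer of storage node A, and merging the first target data with the target data to obtain second target data;
    computing the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1;
    storing data fingerprint a1 in the storage layer of storage node A and storing the second target data in the cache layer of storage node A, the second target data corresponding to object number 1 and data fingerprint a1;
    updating the reference count of data fingerprint a;
    updating data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
  15. The computer device according to claim 14, characterized in that the program instructions, when loaded and executed by the processor, further implement the following steps:
    after the determining whether the storage system is write-priority or read-priority, if it is write-priority, writing the first target data to the cache layer of storage node A and marking the first target data as dirty data;
    computing the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2;
    storing data fingerprint a2 in the storage layer of storage node A and storing the first target data in the cache layer of storage node A;
    obtaining from the cache layer the target data corresponding to data fingerprint a, and merging the target data with the first target data to obtain the second target data, the second target data corresponding to object number 1 and data fingerprint a2;
    updating the reference count of data fingerprint a stored in the storage layer of storage node A;
    updating data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.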
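  16. A computer non-volatile readable storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the following steps:
    controlling a storage gateway to receive a target data write request from a client, determining a target object number corresponding to the target data according to a first preset rule, and storing the correspondence between the target data and the target object number in a metadata list;
    determining, according to a second preset rule, a target storage node corresponding to the target object number, the storage gateway writing the target data to the cache layer of the target storage node, the second preset rule being a correspondence rule between object numbers and storage nodes;
    determining whether a data fingerprint needs to be computed for the target data;
    if a data fingerprint needs to be computed for the target data, computing the data fingerprint of the target data according to a preset algorithm to obtain a target data fingerprint, the target data fingerprint and the target data being in one-to-one correspondence;
    determining whether the storage layer of the target storage node already stores the target data fingerprint;
    if the storage layer of the target storage node does not store the target data fingerprint, storing the target data fingerprint in the storage layer of the target storage node and returning to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint;
    if the storage layer of the target storage node already stores the target data fingerprint, returning to the storage gateway a prompt message indicating that the write succeeded, the prompt message carrying the target data fingerprint, then deleting the target data stored in the cache layer of the target storage node and updating the reference count of the target data fingerprint stored in the storage layer of the target storage node;
    the storage gateway receiving the prompt message and determining whether the prompt message carries the target data fingerprint;
    if the prompt message carries the target data fingerprint, updating the correspondence between the target data and the target object number in the metadata list to a correspondence among the target data, the target object number, and the target data fingerprint, and updating the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node.
  17. The computer non-volatile readable storage medium according to claim 16, characterized in that, when the program runs, the device on which the storage medium resides is further controlled to perform the following steps:
    after the updating of the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node, receiving a request from the client to read the target data;
    determining whether the target data fingerprint exists in the metadata list;
    if the target data fingerprint does not exist in the metadata list, determining the target storage node from the target object number in the metadata list and the second preset rule;
    obtaining, by the target object number, the target data stored in the cache layer of the target storage node;
    returning the target data to the client.
  18. The computer non-volatile readable storage medium according to claim 17, characterized in that, when the program runs, the device on which the storage medium resides is further controlled to perform the following steps:
    after the determining whether the target data fingerprint exists in the metadata list, if the target data fingerprint exists in the metadata list, determining the target storage node according to the second preset rule;
    looking up the target data fingerprint stored in the storage layer of the target storage node;
    determining, from the target data fingerprint, the target data stored in the cache layer of the target storage node;
    returning the target data to the client.
  19. The computer non-volatile readable storage medium according to any one of claims 16 to 18, characterized in that, when the program runs, the device on which the storage medium resides is further controlled to perform the following steps:
    after the updating of the correspondence rule between the target object number and the target storage node in the second preset rule to a correspondence rule among the target object number, the target data fingerprint, and the target storage node, receiving a write request for first target data;
    determining object number 1 according to the first preset rule;
    determining whether object number 1 has a corresponding data fingerprint in the metadata list;
    if object number 1 has a corresponding data fingerprint a, determining that the first target data is an update write;
    determining storage node A according to the second preset rule, and determining whether the storage system is write-priority or read-priority;
    if it is read-priority, writing the first target data to the cache layer of storage node A;
    obtaining, by data fingerprint a, the target data stored in the cache layer of storage node A, and merging the first target data with the target data to obtain second target data;
    computing the data fingerprint of the second target data according to the preset algorithm to obtain data fingerprint a1;
    storing data fingerprint a1 in the storage layer of storage node A and storing the second target data in the cache layer of storage node A, the second target data corresponding to object number 1 and data fingerprint a1;
    updating the reference count of data fingerprint a;
    updating data fingerprint a to data fingerprint a1 in the metadata list and the second preset rule.
  20. The computer non-volatile readable storage medium according to claim 19, characterized in that, when the program runs, the device on which the storage medium resides is further controlled to perform the following steps:
    after the determining whether the storage system is write-priority or read-priority, if it is write-priority, writing the first target data to the cache layer of storage node A and marking the first target data as dirty data;
    computing the data fingerprint of the first target data according to the preset algorithm to obtain data fingerprint a2;
    storing data fingerprint a2 in the storage layer of storage node A and storing the first target data in the cache layer of storage node A;
    obtaining from the cache layer the target data corresponding to data fingerprint a, and merging the target data with the first target data to obtain the second target data, the second target data corresponding to object number 1 and data fingerprint a2;
    updating the reference count of data fingerprint a stored in the storage layer of storage node A;
    updating data fingerprint a to data fingerprint a2 in the metadata list and the second preset rule.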
PCT/CN2019/104330 2019-04-23 2019-09-04 Distributed global data deduplication method and apparatus WO2020215580A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910327312.6A CN110245129B (zh) 2019-04-23 2019-04-23 Distributed global data deduplication method and apparatus
CN201910327312.6 2019-04-23

Publications (1)

Publication Number Publication Date
WO2020215580A1 true WO2020215580A1 (zh) 2020-10-29

Family

ID=67883298

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/104330 WO2020215580A1 (zh) 2019-04-23 2019-09-04 Distributed global data deduplication method and apparatus

Country Status (2)

Country Link
CN (1) CN110245129B (zh)
WO (1) WO2020215580A1 (zh)

Also Published As

Publication number Publication date
CN110245129A (zh) 2019-09-17
CN110245129B (zh) 2022-05-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926170

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926170

Country of ref document: EP

Kind code of ref document: A1