CN114442931A - Data deduplication method and system, electronic device and storage medium - Google Patents

Data deduplication method and system, electronic device and storage medium

Info

Publication number
CN114442931A
Authority
CN
China
Prior art keywords
data
information
fingerprint
deduplication
fingerprint information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111593136.4A
Other languages
Chinese (zh)
Inventor
肖露
张维杰
林洁琬
黄鹄
黄润怀
李旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202111593136.4A priority Critical patent/CN114442931A/en
Publication of CN114442931A publication Critical patent/CN114442931A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data deduplication method and system, an electronic device, and a storage medium, which are used to deduplicate data and save storage space. An embodiment of the application provides a data deduplication method comprising the following steps: a client splits target data into a plurality of data pieces according to a preset rule and sends the data pieces to a global deduplication module; the global deduplication module calculates first fingerprint information of each data piece and queries whether the first fingerprint information exists in a fingerprint library; and when the first fingerprint information exists in the fingerprint library, the client deletes the data piece corresponding to the first fingerprint information.

Description

Data deduplication method and system, electronic device and storage medium
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a data deduplication method and system, an electronic device, and a storage medium.
Background
Both cloud storage systems and traditional data storage systems hold a large amount of redundant data, and in some systems the data repetition rate is as high as 70% to 90%, so deduplicating the data in a storage system is both urgent and necessary. Deduplication technology can delete redundant data in the storage system, which saves storage space and network bandwidth and reduces the storage cost and daily energy consumption of a data center. However, conventional deduplication technology faces great challenges when deduplicating big data in a cloud storage system.
Disclosure of Invention
The embodiments of the application provide a data deduplication method and system, an electronic device, and a storage medium, which are used to deduplicate data and save storage space.
An embodiment of the application provides a data deduplication method, which comprises the following steps:
the client splits the target data into a plurality of data pieces according to a preset rule and sends the data pieces to the global deduplication module;
the global deduplication module calculates first fingerprint information of the data piece and queries whether metadata mapping information of the first fingerprint information exists in a fingerprint library;
and when the metadata mapping information of the first fingerprint information exists in the fingerprint library, the client deletes the data piece corresponding to the first fingerprint information.
In some embodiments, splitting the target data into a plurality of data pieces according to a preset rule, and sending the data pieces to the global deduplication module specifically includes:
splitting the target data into a plurality of data pieces corresponding to the deduplication nodes, and sending the data pieces to the corresponding deduplication nodes.
In some embodiments, the calculating, by the global deduplication module, of first fingerprint information of the data piece, and the querying of whether the first fingerprint information exists in the fingerprint library corresponding to the deduplication node specifically includes:
the deduplication node calculates first fingerprint information of the received data piece, and queries whether the first fingerprint information exists in the fingerprint library corresponding to the deduplication node.
In some embodiments, when the first fingerprint information exists in the fingerprint library, the method further comprises:
judging whether the identity information of the first fingerprint information is consistent with the identity information corresponding to the metadata mapping information in the fingerprint library;
if the identity information of the first fingerprint information is inconsistent with the identity information corresponding to the metadata mapping information in the fingerprint library, the global deduplication module reads the second fingerprint information that corresponds to the metadata mapping information and is stored in the storage module, takes the hot information in the first fingerprint information and the hot information in the second fingerprint information as new second fingerprint information according to the hot and cold information of the fingerprints, and updates the second fingerprint information stored in the storage module to the new second fingerprint information.
In some embodiments, the method further comprises:
when the first fingerprint information does not exist in the fingerprint library, the client sends a disk-write request for the data piece to the storage module;
the storage module sends disk-write permission information to the client, the disk-write permission information comprising the memory address of the data piece;
the client sends a storage request to the global deduplication module, wherein the storage request comprises the first fingerprint information and the memory address of the data piece;
the global deduplication module responds to the storage request and stores the metadata mapping information of the first fingerprint information into a cache queue of the storage module;
and whether to write the metadata mapping information to disk is judged according to a preset caching rule.
In some embodiments, judging whether to write the metadata mapping information to disk, that is, to store it in the storage module, according to a preset caching rule specifically includes:
judging whether the storage capacity of the cache queue of the storage module reaches a preset threshold;
if yes, the global deduplication module deletes the metadata mapping information from the cache queue;
and if not, writing the metadata mapping information to disk according to a preset disk-write order in the cache queue.
An embodiment of the present application provides a data deduplication system, where the data deduplication system includes:
the client is used for splitting the target data into a plurality of data pieces according to a preset rule and sending the data pieces to the global deduplication module;
the global deduplication module is used for calculating first fingerprint information of the data piece and querying whether the first fingerprint information exists in a fingerprint library;
the client is further configured to: when the first fingerprint information exists in the fingerprint library, delete the data piece corresponding to the first fingerprint information.
An embodiment of the present application provides a computer device, which includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the data deduplication method provided by the embodiments of the present application.
A computer-readable storage medium provided in an embodiment of the present application stores thereon a computer program, and the computer program, when executed by a processor, implements the data deduplication method provided in an embodiment of the present application.
The computer program product provided by the embodiment of the present application includes a computer program, and when the computer program is executed by a processor, the data deduplication method provided by the embodiment of the present application is implemented.
According to the data deduplication method and system, the electronic device and the storage medium provided by the embodiments of the present application, the global deduplication module calculates the first fingerprint information of the data piece; when the metadata mapping information of the first fingerprint information exists in the fingerprint library, the data piece is considered duplicate data and is deleted, which reduces the amount of data stored in the data storage system and saves storage space and storage cost.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of a data deduplication method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of another data deduplication method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a data deduplication system according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings of the embodiments of the present application. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all embodiments. And the embodiments and features of the embodiments in the present application may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the application without any inventive step, are within the scope of protection of the application.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. As used in this application, the terms "first," "second," and the like do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
It should be noted that the sizes and shapes of the figures in the drawings are not to be considered true scale, but are merely intended to schematically illustrate the present disclosure. And the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout.
An embodiment of the present application provides a data deduplication method, as shown in fig. 1, the method includes:
S101, the client splits the target data into a plurality of data pieces according to a preset rule, and sends the data pieces to a global deduplication module;
S102, the global deduplication module calculates first fingerprint information of the data piece and queries whether metadata mapping information of the first fingerprint information exists in a fingerprint library;
S103, when the metadata mapping information of the first fingerprint information exists in the fingerprint library, the client deletes the data piece corresponding to the first fingerprint information.
Note that data deduplication here means removing duplicate data.
According to the data deduplication method provided by the embodiment of the application, the global deduplication module calculates the first fingerprint information of the data piece; when the metadata mapping information of the first fingerprint information exists in the fingerprint library, the data piece is considered duplicate data and is deleted, which reduces the amount of data stored in the data storage system and saves storage space, energy consumption, and storage cost. In addition, the global deduplication module can delete duplicate data across a plurality of clients and across the whole storage system, so that global data deduplication of the Ceph distributed storage system is achieved and the performance of the storage system is easier to scale.
In some embodiments, the data deduplication method provided in the embodiments of the present application is applied to a Ceph distributed storage system. That is, the Ceph distributed storage system comprises the client and the global deduplication module.
In some embodiments, the target data and the data pieces are Input/Output (IO) data.
In some embodiments, the fingerprint library is a first fingerprint information table. The first fingerprint information table includes: identity information Poolid of the data, metadata mapping information FingerPrint of the first fingerprint information, a data number Ref, and a memory address addr of the data.
In some embodiments, the global deduplication module calculates first fingerprint information of the data piece by using a preset algorithm. The preset algorithm may be, for example, a hash algorithm.
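The fingerprint table and the hash-based fingerprint calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified hash algorithm, the table is held in an in-memory dictionary, and the names `FingerprintEntry`, `compute_fingerprint`, `library` and `lookup` are hypothetical. Only the field names Poolid, FingerPrint, Ref and addr come from the text.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FingerprintEntry:
    """One row of the first fingerprint information table (field names from the text)."""
    poolid: int        # identity information of the data
    fingerprint: str   # metadata mapping information of the fingerprint
    ref: int           # data number Ref (its exact meaning is not specified)
    addr: int          # memory address of the data


def compute_fingerprint(piece: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified "preset algorithm".
    return hashlib.sha256(piece).hexdigest()


# Hypothetical in-memory fingerprint library, keyed by fingerprint.
library: Dict[str, FingerprintEntry] = {}


def lookup(piece: bytes) -> Optional[FingerprintEntry]:
    """Return the matching entry if the piece's fingerprint is already known."""
    return library.get(compute_fingerprint(piece))
```

A data piece whose fingerprint is found by `lookup` would be treated as duplicate data; one that returns `None` would still have to be stored.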
In some embodiments, the global deduplication module comprises at least one deduplication node; splitting target data into a plurality of data pieces according to a preset rule, and sending the data pieces to a global deduplication module, specifically comprising:
and splitting the target data into a plurality of data pieces corresponding to the deduplication nodes according to a preset rule, and sending the data pieces to the corresponding deduplication nodes.
In some embodiments, when the global deduplication module includes multiple deduplication nodes, the deduplication nodes in the global deduplication module may be deployed in the following manner:
the range of the IO data that can be processed by each deduplication node is calculated by the following formula:
range = s ÷ scope % dedup_nr;
where s = lba ÷ obj_size; lba is the data size of the IO data; s is the size of the data piece fragment, i.e. the size of the data piece after being split according to the preset size obj_size; dedup_nr is the serial number of the deduplication node in the global deduplication module; and scope is the data range that each deduplication node can process: the larger the value of scope, the better the deduplication effect. Taking a global deduplication module with three deduplication nodes as an example, the ranges of IO data that the deduplication nodes can process are, in sequence: [m, m+n], [m+n+1, m+2n+1], [m+2n+2, m+3n+1], and so on, where m and n are determined according to dedup_nr and range.
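To make the arithmetic concrete, the formula can be evaluated as in the sketch below. It assumes that ÷ denotes integer division and that ÷ and % are applied left to right, neither of which the text states explicitly. The function name `node_range` is hypothetical, and dedup_nr is simply used as the modulus, exactly as the formula is written.

```python
def node_range(lba: int, obj_size: int, scope: int, dedup_nr: int) -> int:
    """Evaluate range = s / scope % dedup_nr with s = lba / obj_size."""
    s = lba // obj_size               # assumption: the division is integer division
    return (s // scope) % dedup_nr    # division and modulo applied left to right


# With obj_size = 1, scope = 4 and dedup_nr = 3, every run of four consecutive
# values of s shares one range value, and the values cycle through 0, 1, 2.
values = [node_range(lba, 1, 4, 3) for lba in range(12)]
```

For example, with lba = 40960, obj_size = 4096, scope = 4 and dedup_nr = 3, s is 10, 10 ÷ 4 gives 2, and 2 % 3 gives a range value of 2.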
In specific implementation, splitting the target data into a plurality of data pieces according to a preset rule includes: splitting the target data into a plurality of data pieces by using a consistent hashing algorithm according to the range of IO data that each deduplication node can process.
In specific implementation, each deduplication node in the global deduplication module has a corresponding node IP address. After splitting the target data into a plurality of data pieces, the client queries the node IP of the deduplication node corresponding to each data piece, and sends the data pieces to the corresponding deduplication nodes according to the node IP.
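One common way to map data pieces to node IPs with consistent hashing is a hash ring, sketched below. This is not taken from the application: the ring with virtual nodes, the use of SHA-256 for ring positions, the class name `HashRing`, and the string piece keys are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Assumption: SHA-256 is used only to spread keys evenly over the ring.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring mapping data-piece keys to deduplication node IPs."""

    def __init__(self, node_ips, vnodes=64):
        # Each node contributes several virtual points to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{ip}#{i}"), ip) for ip in node_ips for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, piece_key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(piece_key)) % len(self._points)
        return self._ring[idx][1]
```

A client could then call `HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"]).node_for("piece-42")` (hypothetical addresses and key) to find the node IP to which a data piece should be sent; the same key always maps to the same node as long as the node set is unchanged.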
In some embodiments, the calculating, by the global deduplication module, of first fingerprint information of the data piece, and the querying of whether the first fingerprint information exists in the fingerprint library corresponding to the deduplication node specifically includes:
the deduplication node calculates first fingerprint information of the received data piece, and queries whether the first fingerprint information exists in the fingerprint library corresponding to the deduplication node.
In some embodiments, when the first fingerprint information exists in the fingerprint library, the method further comprises:
judging whether the identity information of the first fingerprint information is consistent with the identity information corresponding to the metadata mapping information in the fingerprint library;
if the identity information of the first fingerprint information is inconsistent with the identity information corresponding to the metadata mapping information in the fingerprint library, the global deduplication module reads the second fingerprint information that corresponds to the metadata mapping information and is stored in the storage module, takes the hot information in the first fingerprint information and the hot information in the second fingerprint information as new second fingerprint information according to the hot and cold information of the fingerprints, and updates the second fingerprint information stored in the storage module to the new second fingerprint information.
In some embodiments, the storage module includes an Object Storage Device (OSD) storage cluster. The OSD storage cluster stores the fingerprint information of the data and the metadata mapping information corresponding to the fingerprint information. In specific implementation, the metadata mapping information stored in the OSD storage cluster includes the fingerprint information flushed from memory. The OSD storage cluster includes a metadata storage pool, Pool 1, which stores the metadata mapping information, and a data storage pool, Pool 2, which stores the IO data. Pool 1 can be deployed in a high-performance storage pool, which facilitates data deduplication and fingerprint retrieval. In specific implementation, the global deduplication module reads the second fingerprint information from the data storage pool Pool 2, and the new second fingerprint information is stored in Pool 2.
In some embodiments, the method further comprises:
when the first fingerprint information does not exist in the fingerprint library, the client sends a disk-write request for the data piece to the storage module;
the storage module sends disk-write permission information to the client, the disk-write permission information comprising the memory address of the data piece;
the client sends a storage request to the global deduplication module, wherein the storage request comprises the first fingerprint information and the memory address of the data piece;
the global deduplication module responds to the storage request and stores the metadata mapping information of the first fingerprint information into a cache queue of the storage module;
and whether to write the metadata mapping information to disk is judged according to a preset caching rule.
It should be noted that when the first fingerprint information does not exist in the fingerprint library, the data piece has not been stored, is not duplicate data, and needs to be written to disk. The client therefore sends a disk-write request to the storage module so that three copies are written; after the storage module has successfully written the three copies, it sends the disk-write permission information to the client.
It should be noted that the fingerprint information carried in the storage request sent by the client to the global deduplication module is the first fingerprint information, that is, the fingerprint information already calculated by the deduplication node. The fingerprint information of the data piece therefore does not need to be calculated again, which shortens the data storage process and avoids sending the data repeatedly.
In some embodiments, judging whether to write the metadata mapping information to disk, that is, to store it in the storage module, according to a preset caching rule specifically includes:
judging whether the storage capacity of the cache queue of the storage module reaches a preset threshold;
if yes, the global deduplication module deletes the metadata mapping information from the cache queue;
and if not, writing the metadata mapping information to disk according to a preset disk-write order in the cache queue.
In specific implementation, whether the storage capacity of the cache queue of the storage module reaches the preset threshold is judged; when it does, the deduplication module can use a background thread to evict the metadata mapping information corresponding to the fingerprint information, which prevents the memory pool from overflowing.
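The caching rule above can be sketched as a small queue with a threshold check. Several details are assumptions, because the text does not specify them: the threshold counts entries rather than bytes, the preset order is first in, first out, the evicted entry is the oldest one, and a plain list stands in for the metadata storage pool. The class and method names are hypothetical.

```python
from collections import deque


class MetadataCacheQueue:
    """Cache queue for metadata mapping information with a capacity threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold   # assumption: counted in entries
        self.queue = deque()
        self.persisted = []          # stands in for the metadata storage pool

    def add(self, mapping: str) -> None:
        self.queue.append(mapping)

    def apply_rule(self) -> str:
        """Apply the caching rule once and report which branch was taken."""
        if not self.queue:
            return "idle"
        if len(self.queue) >= self.threshold:
            # Threshold reached: delete an entry from the cache queue.
            self.queue.popleft()     # assumption: the oldest entry is evicted
            return "evicted"
        # Below the threshold: persist the next entry in FIFO order.
        self.persisted.append(self.queue.popleft())
        return "persisted"
```

In a real system `apply_rule` would presumably run on a background thread, as the text suggests for eviction; here it is called explicitly to keep the sketch deterministic.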
Next, a flow of the data deduplication method provided in the embodiment of the present application is illustrated. As shown in Fig. 2, the data deduplication method includes the following steps:
S201, splitting the target data into a plurality of data pieces corresponding to the deduplication nodes according to a preset rule, and sending the data pieces to the corresponding deduplication nodes;
S202, the deduplication node calculates first fingerprint information of the received data piece;
S203, querying whether the first fingerprint information exists in the fingerprint library corresponding to the deduplication node; if yes, returning the operation code F_EXIST to the client, which executes step S206, and executing step S204; otherwise, returning the operation code F_NO_EXIST to the client, which executes step S207;
S204, judging whether the identity information of the first fingerprint information is consistent with the identity information corresponding to the metadata mapping information in the fingerprint library, and if not, executing step S205;
S205, the global deduplication module reads the second fingerprint information that corresponds to the metadata mapping information and is stored in the storage module, takes the hot information in the first fingerprint information and the second fingerprint information as new second fingerprint information according to the hot and cold information of the fingerprints, and updates the second fingerprint information stored in the storage module to the new second fingerprint information;
S206, the client deletes the data piece corresponding to the first fingerprint information;
S207, the client sends a disk-write request for the data piece to the storage module;
S208, the storage module sends disk-write permission information to the client, the disk-write permission information comprising the memory address of the data piece;
S209, the client sends a storage request to the global deduplication module, wherein the storage request comprises the first fingerprint information and the memory address of the data piece;
S210, the global deduplication module responds to the storage request and stores the metadata mapping information of the first fingerprint information into a cache queue of the storage module;
S211, judging whether the storage capacity of the cache queue of the storage module reaches a preset threshold; if yes, executing step S212, otherwise executing step S213;
S212, deleting the metadata mapping information from the cache queue;
S213, writing the metadata mapping information to disk in time.
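The main path of this flow (S201 to S203 and S206 to S210) can be sketched end to end as follows. It is a simplified illustration, not the application's implementation: there is a single deduplication node, data is split into fixed-size pieces, SHA-256 is the assumed hash, a list index stands in for the memory address, and the identity check (S204, S205) and the caching rule (S211 to S213) are left out. Only the operation codes F_EXIST and F_NO_EXIST come from the text; every class and function name is hypothetical.

```python
import hashlib

F_EXIST = "F_EXIST"
F_NO_EXIST = "F_NO_EXIST"


class DedupNode:
    """Single deduplication node with an in-memory fingerprint library."""

    def __init__(self):
        self.library = {}   # fingerprint -> address of the stored piece

    def query(self, piece: bytes):
        fp = hashlib.sha256(piece).hexdigest()                  # S202
        code = F_EXIST if fp in self.library else F_NO_EXIST    # S203
        return code, fp

    def store(self, fp: str, addr: int) -> None:
        self.library[fp] = addr                                 # S209, S210 (simplified)


class Storage:
    """Storage module stand-in: appends pieces and returns their index as address."""

    def __init__(self):
        self.blocks = []

    def write(self, piece: bytes) -> int:                       # S207, S208
        self.blocks.append(piece)
        return len(self.blocks) - 1


def client_write(data: bytes, piece_size: int, node: DedupNode, storage: Storage) -> int:
    """Write data through the dedup path; return how many pieces were stored."""
    written = 0
    for off in range(0, len(data), piece_size):                 # S201
        piece = data[off:off + piece_size]
        code, fp = node.query(piece)
        if code == F_EXIST:
            continue                                            # S206: drop the duplicate
        addr = storage.write(piece)
        node.store(fp, addr)   # reuse the fingerprint computed by the node
        written += 1
    return written
```

Writing the bytes `b"AAAABBBBAAAA"` with a piece size of 4 through `client_write` stores two pieces, because the third piece repeats the first; writing the same data again stores nothing.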
Based on the same inventive concept, an embodiment of the present application further provides a data deduplication system, as shown in fig. 3, the data deduplication system includes:
the client 101 is used for splitting the target data into a plurality of data pieces according to a preset rule and sending the data pieces to the global deduplication module;
the global deduplication module 102 is configured to calculate first fingerprint information of a data piece, and query whether the first fingerprint information exists in a fingerprint library;
the client is further configured to: when the first fingerprint information exists in the fingerprint library, delete the data piece corresponding to the first fingerprint information.
In some embodiments, the data deduplication system is a Ceph distributed storage system.
In some embodiments, the global deduplication module comprises at least one deduplication node configured to: calculate first fingerprint information of the received data piece, and query whether the first fingerprint information exists in the fingerprint library corresponding to the deduplication node.
In some embodiments, the client comprises:
a data splitting module, configured to: split the target data into a plurality of data pieces corresponding to the deduplication nodes according to a preset rule;
a first sending and receiving module, configured to: send the data pieces to the corresponding deduplication nodes.
In specific implementation, the first sending and receiving module is further configured to query a node IP of a deduplication node corresponding to the data slice. Therefore, the first sending and receiving module can send the data pieces to the corresponding deduplication nodes according to the node IP.
In some embodiments, as shown in fig. 3, the data deduplication system further comprises:
and the storage module 103 is configured to store the second fingerprint information and the metadata mapping information.
In some embodiments, the global deduplication module is further to: when the first fingerprint information exists in the data fingerprint database, judging whether the identity information of the first fingerprint information is consistent with the identity information corresponding to the metadata mapping information in the fingerprint database; if the identity information of the first fingerprint information is inconsistent with the identity information corresponding to the metadata mapping information in the fingerprint library, the global deduplication module reads second fingerprint information corresponding to the metadata mapping information and stored in the storage module, and takes the thermal information in the first fingerprint information and the second fingerprint information as new second fingerprint information according to the cold and hot information of the fingerprints;
the storage module is further used for updating the stored second fingerprint information into new second fingerprint information.
In some embodiments, when the first fingerprint information does not exist in the fingerprint library, the first sending and receiving unit of the client is further configured to send a data slice downloading request to the storage module;
the storage module also comprises a second sending and receiving module which is used for sending the download permission information to the client; the lower-disc permission information comprises the memory address of the data slice;
the first sending and receiving unit of the client is further configured to: responding to the download permission information, and sending a storage request to a global deduplication module, wherein the storage request comprises first fingerprint information and a memory address of a data slice;
the global deduplication module further comprises:
the third sending and receiving unit is used for responding to the storage request and storing the metadata mapping information of the first fingerprint information to a cache queue of the storage module;
and the threshold detection unit is used for judging whether to download the metadata mapping information according to a preset caching rule.
In some embodiments, the threshold detection unit is configured to determine, according to the preset caching rule, whether to write the metadata mapping information to disk in the storage module, which specifically includes:
determining whether the storage capacity of the cache queue of the storage module reaches a preset threshold;
if yes, deleting the metadata mapping information from the cache queue;
and if not, writing the metadata mapping information to disk according to a preset write order in the cache queue.
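By way of illustration only, the caching rule described above might be sketched as follows. The text does not say which entry is deleted when the threshold is reached, nor what the preset order is; the sketch assumes that the most recently queued mapping is the one deleted and that entries are persisted oldest first. The function name, the `disk` dictionary standing in for persistent storage, and the returned strings are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple

Mapping = Tuple[str, int]  # (fingerprint, address of the data piece)


def apply_cache_rule(queue: Deque[Mapping], disk: Dict[str, int],
                     threshold: int) -> str:
    """Apply the rule once and report which action was taken."""
    if not queue:
        return "empty"
    if len(queue) >= threshold:
        queue.pop()  # threshold reached: delete the newest mapping (assumption)
        return "deleted"
    fingerprint, address = queue.popleft()  # assumed preset order: oldest first
    disk[fingerprint] = address
    return "persisted"
```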
Based on the same inventive concept, the embodiment of the present application further provides a computer device, which includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the data deduplication method provided by the embodiment of the application.
The electronic device may be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. In some embodiments, the storage device is, for example, a memory, and as shown in fig. 4, the electronic device may include a processor 201 and a memory 202.
The processor 201 may be a general-purpose processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in a processor.
The memory 202, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 202 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to various media that can store program code, such as a removable memory device, a Random Access Memory (RAM), a magnetic memory (e.g., a floppy disk, a hard disk, a magnetic tape, a magneto-optical disk (MO)), an optical memory (e.g., a CD, a DVD, a BD, an HVD), and a semiconductor memory (e.g., a ROM, an EPROM, an EEPROM, a non-volatile memory (NAND flash), a Solid State Disk (SSD)).
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable memory device, a Random Access Memory (RAM), a magnetic memory (e.g., a floppy disk, a hard disk, a magnetic tape, a magneto-optical disk (MO)), an optical memory (e.g., a CD, a DVD, a BD, an HVD), and a semiconductor memory (e.g., a ROM, an EPROM, an EEPROM, a non-volatile memory (NAND flash), a Solid State Disk (SSD)).
In some possible embodiments, various aspects of the methods provided by the present disclosure may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the methods according to various exemplary embodiments of the present disclosure described above in the present specification when the program product runs on the computer device, for example, the computer device may perform the data deduplication method described in the embodiments of the present disclosure. The program product may employ any combination of one or more readable media.
To sum up, according to the data deduplication method and system, the electronic device, and the storage medium provided by the embodiments of the application, the global deduplication module calculates the first fingerprint information of a data piece; when metadata mapping information of the first fingerprint information already exists in the fingerprint library, the data piece is treated as duplicate data and is deleted. This reduces the amount of data held by the data storage system and thereby saves storage space and storage cost.
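By way of illustration only, the overall flow summarized above (split, fingerprint, look up, discard duplicates) might be sketched as follows. Fixed-size splitting stands in for the preset rule, SHA-256 stands in for the fingerprint calculation, and an in-memory dictionary stands in for the fingerprint library; all three are assumptions for the sketch, as are the function names.

```python
import hashlib
from typing import Dict, List


def split_into_pieces(data: bytes, piece_size: int) -> List[bytes]:
    """Split target data into fixed-size pieces (assumed preset rule)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def deduplicate(data: bytes, fingerprint_library: Dict[str, int],
                piece_size: int = 4) -> List[bytes]:
    """Return only the pieces whose fingerprint has no mapping in the library yet."""
    kept: List[bytes] = []
    for piece in split_into_pieces(data, piece_size):
        fingerprint = hashlib.sha256(piece).hexdigest()
        if fingerprint in fingerprint_library:
            continue  # mapping exists: the piece is a duplicate and is discarded
        # Placeholder mapping value; a real system would record the piece's address.
        fingerprint_library[fingerprint] = len(fingerprint_library)
        kept.append(piece)
    return kept
```

For example, calling `deduplicate(b"aaaabbbbaaaacc", {})` keeps the pieces `b"aaaa"`, `b"bbbb"` and `b"cc"` and discards the second `b"aaaa"`.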
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for data deduplication, the method comprising:
the client splits the target data into a plurality of data pieces according to a preset rule and sends the data pieces to the global deduplication module;
the global deduplication module calculates first fingerprint information of the data piece and queries whether metadata mapping information of the first fingerprint information exists in a fingerprint library;
and when the metadata mapping information of the first fingerprint information exists in the fingerprint library, the client deletes the data piece corresponding to the first fingerprint information.
2. The method of claim 1, wherein the global deduplication module comprises at least one deduplication node; splitting target data into a plurality of data pieces according to a preset rule, and sending the data pieces to a global deduplication module, specifically comprising:
and splitting the target data into a plurality of data pieces corresponding to the deduplication nodes according to the preset rule, and sending the data pieces to the corresponding deduplication nodes.
3. The method according to claim 2, wherein the global deduplication module calculates first fingerprint information of the data piece, and queries whether metadata mapping information of the first fingerprint information exists in a fingerprint library, specifically including:
the deduplication node calculates the first fingerprint information of the received data piece, and queries whether metadata mapping information of the first fingerprint information exists in the fingerprint library corresponding to the deduplication node.
4. The method of claim 1, wherein when metadata mapping information of the first fingerprint information exists in the fingerprint library, the method further comprises:
determining whether the identity information of the first fingerprint information is consistent with the identity information corresponding to the metadata mapping information in the fingerprint library;
if the identity information of the first fingerprint information is inconsistent with the identity information corresponding to the metadata mapping information in the fingerprint library, the global deduplication module reads second fingerprint information corresponding to the metadata mapping information and stored in the storage module, takes the hotter of the first fingerprint information and the second fingerprint information as new second fingerprint information according to the hot/cold information of the fingerprints, and updates the second fingerprint information stored in the storage module to the new second fingerprint information.
5. The method of claim 1, further comprising:
when the first fingerprint information does not exist in the fingerprint library, the client sends a write-to-disk request for the data piece to a storage module;
the storage module sends write-to-disk permission information to the client, the write-to-disk permission information comprising the memory address of the data piece;
the client sends a storage request to the global deduplication module, wherein the storage request comprises the first fingerprint information of the data piece and the memory address;
the global deduplication module, in response to the storage request, stores the metadata mapping information of the first fingerprint information into a cache queue of the storage module;
and determining whether to write the metadata mapping information to disk according to a preset caching rule.
6. The method according to claim 5, wherein determining whether to write the metadata mapping information to disk according to the preset caching rule specifically includes:
determining whether the storage capacity of the cache queue of the storage module reaches a preset threshold;
if yes, the global deduplication module deletes the metadata mapping information from the cache queue;
and if not, writing the metadata mapping information to disk according to a preset write order in the cache queue.
7. A data deduplication system, the data deduplication system comprising:
a client configured to split the target data into a plurality of data pieces according to a preset rule and send the data pieces to a global deduplication module;
the global deduplication module, configured to calculate first fingerprint information of the data piece and query whether the first fingerprint information exists in a fingerprint library;
the client being further configured to: when the first fingerprint information exists in the fingerprint library, delete the data piece corresponding to the first fingerprint information.
8. A computer device, the device comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data deduplication method of any one of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a data deduplication method according to any one of claims 1-6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the data deduplication method of any one of claims 1 to 6 when executed by a processor.
CN202111593136.4A 2021-12-23 2021-12-23 Data deduplication method and system, electronic device and storage medium Pending CN114442931A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111593136.4A CN114442931A (en) 2021-12-23 2021-12-23 Data deduplication method and system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114442931A (en) 2022-05-06

Family

ID=81364541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111593136.4A Pending CN114442931A (en) 2021-12-23 2021-12-23 Data deduplication method and system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114442931A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970875A (en) * 2014-05-15 2014-08-06 华中科技大学 Parallel repeated data deleting method
US20150213049A1 (en) * 2014-01-30 2015-07-30 Netapp, Inc. Asynchronous backend global deduplication
CN106066896A (en) * 2016-07-15 2016-11-02 中国人民解放军理工大学 A kind of big Data duplication applying perception deletes storage system and method
US10037337B1 (en) * 2015-09-14 2018-07-31 Cohesity, Inc. Global deduplication
CN109800218A (en) * 2019-01-04 2019-05-24 平安科技(深圳)有限公司 Distributed memory system, memory node equipment and data duplicate removal method
CN111984203A (en) * 2020-09-27 2020-11-24 苏州浪潮智能科技有限公司 Data deduplication method and device, electronic equipment and storage medium
CN112148217A (en) * 2020-09-11 2020-12-29 北京浪潮数据技术有限公司 Caching method, device and medium for deduplication metadata of full flash storage system
CN112684975A (en) * 2019-10-17 2021-04-20 华为技术有限公司 Data storage method and device
CN112817962A (en) * 2021-03-16 2021-05-18 广州鼎甲计算机科技有限公司 Data storage method and device based on object storage and computer equipment
CN113227958A (en) * 2019-12-03 2021-08-06 华为技术有限公司 Apparatus, system, and method for optimization in deduplication

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114911419A (en) * 2022-05-07 2022-08-16 阿里巴巴达摩院(杭州)科技有限公司 Data storage method, system, storage medium and computer terminal
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220506