CN113535708A - Data deduplication method, system, storage medium and equipment - Google Patents

Info

Publication number
CN113535708A
Authority
CN
China
Prior art keywords
data
metadata
unit data
fingerprint value
deduplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111090326.4A
Other languages
Chinese (zh)
Inventor
刚亚州
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111090326.4A priority Critical patent/CN113535708A/en
Publication of CN113535708A publication Critical patent/CN113535708A/en
Priority to PCT/CN2022/078324 priority patent/WO2023040200A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data deduplication method, system, storage medium and equipment. The method comprises the following steps: in response to the host abandoning the data deduplication operation and writing data into the hard disk in units of granularity, calculating the fingerprint value of the data, writing the fingerprint value and the logical address of the data into the hard disk, and judging whether the occupied storage space in the hard disk reaches a preset threshold; if the preset threshold is reached, acquiring unit data from the hard disk and judging whether the unit data has been subjected to a deduplication operation; if the deduplication operation has not been performed, acquiring the fingerprint value of the unit data, and querying, through a metadata management module, whether original metadata containing the mapping relation between the fingerprint value of the unit data and a corresponding original physical address exists; if the original metadata exists, establishing first metadata containing the mapping relation between the original physical address and the fingerprint value of the unit data, and storing the first metadata in the metadata management module to perform the deduplication operation on the unit data. The data deduplication method can avoid affecting the performance of the storage system.

Description

Data deduplication method, system, storage medium and equipment
Technical Field
The present invention relates to the field of storage technologies, and in particular, to a data deduplication method, system, storage medium, and device.
Background
Metadata (meta data) is data that describes data (data about data). It can be understood as data with a wider scope than data in the ordinary sense: it not only represents information such as the type, name and value of the data, but also provides context information for the data, such as the domain to which the data belongs and the data source. In a data storage system, metadata is the basis for information storage, being the smallest unit of data. In recent years, the development of information technology has generated massive amounts of data, and how to manage and organize this data effectively has become a prominent problem. For large volumes of stored data, querying and analyzing the content and meaning of the data allows it to be used more effectively. In a storage system, efficient organization and management of metadata is an effective means of solving this problem, and supports the system's management and maintenance of data. Therefore, data becomes more valuable only if metadata is managed efficiently.
All-flash storage is a storage system based on an all-flash array, that is, an independent storage array or device composed entirely of solid-state storage media. It differs from traditional hard disk storage mainly in its higher performance and in faster, more stable data processing. Online data deduplication is the most important and necessary feature of an all-flash storage system: because solid state disks serve as the backend storage medium and are costly, the system requires online deduplication to reduce the storage space actually used on the backend disks. Metadata management is crucial to online deduplication in an all-flash storage system. It mainly manages the L-P (LBA → PBA), P-L (PBA → LBA) and H-P (HASHKEY → PBA) mapping relations, where LBA (Logical Block Address) is a logical block address, PBA (Physical Block Address) is a physical block address, and HASHKEY is a hash value. Compared with a traditional system that does not support online deduplication, metadata management must maintain two additional kinds of metadata, the P-L mapping and the H-P mapping, and therefore comes under greater pressure from high-volume, highly concurrent, low-latency data access.
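As an illustration of the three mapping relations named above, the following is a minimal sketch only. The class `MetadataTables`, its field names, and the choice to let one PBA map to a set of LBAs are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataTables:
    """Hypothetical in-memory stand-in for the three mapping tables."""
    l2p: dict = field(default_factory=dict)  # L-P: LBA -> PBA
    p2l: dict = field(default_factory=dict)  # P-L: PBA -> set of LBAs (assumed one-to-many)
    h2p: dict = field(default_factory=dict)  # H-P: HASHKEY -> PBA

    def map_block(self, lba: int, pba: int, hashkey: str) -> None:
        """Record that logical block `lba` is backed by physical block `pba`."""
        self.l2p[lba] = pba
        self.p2l.setdefault(pba, set()).add(lba)
        # Keep the first PBA registered for a given hash value.
        self.h2p.setdefault(hashkey, pba)


tables = MetadataTables()
tables.map_block(lba=100, pba=7, hashkey="h-a")
tables.map_block(lba=205, pba=7, hashkey="h-a")  # same content shares PBA 7
```

In this sketch two logical blocks with identical content end up pointing at one physical block, which is the space saving that deduplication aims for.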
In some special scenarios, for example when a controller fails or the data write pressure is so high that performance cannot meet requirements, the performance requirement can be met by abandoning part of the online deduplication requests. However, the data that should have been deduplicated is then left without deduplication, so more storage space on the backend solid state disks is occupied.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a data deduplication method, system, storage medium and device, so as to solve the problem in the prior art that a backend hard disk storage space is wasted due to abandoning a deduplication operation on partial data.
Based on the above purpose, the present invention provides a data deduplication method, which comprises the following steps:
in response to the host abandoning the data deduplication operation and writing data into the hard disk in units of granularity, calculating a fingerprint value of the data, writing the fingerprint value and the logical address of the data into the hard disk, and judging whether the occupied storage space in the hard disk reaches a preset threshold;
in response to the occupied storage space reaching the preset threshold, acquiring unit data from the hard disk, and judging whether the unit data has been subjected to a deduplication operation based on a fingerprint value and/or a logical address of the unit data;
in response to the unit data not having been subjected to the deduplication operation, acquiring the fingerprint value of the unit data, and querying, through a metadata management module, whether original metadata containing a mapping relation between the fingerprint value of the unit data and a corresponding original physical address exists;
and in response to the existence of the original metadata, establishing first metadata containing a mapping relation between the original physical address and the fingerprint value of the unit data, and storing the first metadata in the metadata management module to perform the deduplication operation on the unit data.
In some embodiments, determining whether a unit data has been subjected to a deduplication operation based on a fingerprint value and/or a logical address of the unit data comprises:
judging whether the unit data has a corresponding fingerprint value and a corresponding logical address;
in response to the unit data having both a corresponding fingerprint value and a corresponding logical address, determining that the unit data has not been subjected to the deduplication operation;
in response to the unit data having a corresponding fingerprint value and no corresponding logical address, it is determined that the unit data has been subjected to a deduplication operation.
In some embodiments, the method further comprises:
and in response to the existence of the original metadata, establishing a second metadata group containing the mapping relation between the original physical address and the logical address of the unit data, and storing the second metadata group to the metadata management module.
In some embodiments, the method further comprises:
and setting the logical address of the unit data as invalid in response to storing the second metadata group in the metadata management module.
In some embodiments, the method further comprises:
and in response to both the first metadata and the second metadata group being stored in the metadata management module, notifying the garbage collection module to perform garbage collection on the unit data.
In some embodiments, establishing the second metadata set containing the mapping of the original physical address to the logical address of the unit data comprises:
metadata is created that includes key-value pairs that point from the original physical address to the logical address of the unit data, and metadata that includes key-value pairs that point from the logical address of the unit data to the original physical address.
In some embodiments, the method further comprises:
and in response to the absence of the original metadata, establishing third metadata containing a mapping relation between the fingerprint value of the unit data and the physical address of the unit data, establishing fourth metadata containing a mapping relation between the logical address of the unit data and the physical address of the unit data, and storing the third metadata and the fourth metadata in the metadata management module.
In another aspect of the present invention, a data deduplication system is further provided, including:
a storage space judgment module configured to, in response to the host abandoning the data deduplication operation and writing data into the hard disk in units of granularity, calculate the fingerprint value of the data, write the fingerprint value and the logical address of the data into the hard disk, and judge whether the occupied storage space in the hard disk reaches a preset threshold;
a deduplication judging module configured to, in response to the occupied storage space reaching the preset threshold, acquire unit data from the hard disk, and judge whether the unit data has been subjected to a deduplication operation based on a fingerprint value and/or a logical address of the unit data;
an original metadata query module configured to, in response to the unit data not having been subjected to the deduplication operation, acquire the fingerprint value of the unit data, and query, through a metadata management module, whether original metadata containing a mapping relation between the fingerprint value of the unit data and a corresponding original physical address exists; and
a data deduplication module configured to, in response to the existence of the original metadata, establish first metadata containing a mapping relation between the original physical address and the fingerprint value of the unit data, and store the first metadata in the metadata management module to perform the deduplication operation on the unit data.
In yet another aspect of the present invention, there is also provided a computer readable storage medium storing computer program instructions which, when executed by a processor, implement any one of the methods described above.
In yet another aspect of the present invention, a computer device is provided, which includes a memory and a processor, the memory storing a computer program, the computer program executing any one of the above methods when executed by the processor.
The invention has at least the following beneficial technical effects:
1. according to the data deduplication method, the fingerprint value of unit data that has not been subjected to a deduplication operation is acquired; when original metadata containing the mapping relation between that fingerprint value and a corresponding original physical address exists, first metadata containing the mapping relation between the original physical address and the fingerprint value of the unit data is established and stored in the metadata management module. Online deduplication of the data is thereby achieved while the influence on the performance of the storage system is avoided, so the overall deduplication rate requirement of the storage system is met, and the method is efficient and accurate;
2. the metadata management module is arranged to improve the concurrency degree of access so as to obtain efficient metadata access.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
Fig. 1 is a schematic diagram illustrating a data deduplication method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a data deduplication system provided in an embodiment of the present invention;
Fig. 3 is a schematic diagram of a computer-readable storage medium for implementing a data deduplication method according to an embodiment of the present invention;
Fig. 4 is a schematic hardware structure diagram of a computer device for performing a data deduplication method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements and may include other steps or elements that are not listed or that are inherent to it.
In view of the foregoing, a first aspect of the embodiments of the present invention provides an embodiment of a data deduplication method. Fig. 1 is a schematic diagram illustrating an embodiment of a data deduplication method provided by the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
step S10, in response to the host abandoning the data deduplication operation and writing data into the hard disk in units of granularity, calculating the fingerprint value of the data, writing the fingerprint value and the logical address of the data into the hard disk, and judging whether the occupied storage space in the hard disk reaches a preset threshold;
step S20, in response to the occupied storage space reaching the preset threshold, acquiring unit data from the hard disk, and judging whether the unit data has been subjected to a deduplication operation based on a fingerprint value and/or a logical address of the unit data;
step S30, in response to the unit data not having been subjected to the deduplication operation, acquiring the fingerprint value of the unit data, and querying, through the metadata management module, whether original metadata containing the mapping relation between the fingerprint value of the unit data and a corresponding original physical address exists;
step S40, in response to the existence of the original metadata, establishing first metadata containing a mapping relation between the original physical address and the fingerprint value of the unit data, and storing the first metadata in the metadata management module to perform the deduplication operation on the unit data.
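Step S10 can be sketched as a write path that skips deduplication but still records what a later pass will need. This is an illustrative sketch only: the `Disk` class, the record layout, the use of SHA-256 as the fingerprint function, and the 0.5 threshold are all assumptions, since the text does not specify them.

```python
import hashlib


class Disk:
    """Hypothetical hard disk model that stores whole granularity units."""

    def __init__(self, capacity_units: int):
        self.capacity_units = capacity_units
        self.units = []  # one record per written unit

    def occupancy(self) -> float:
        """Fraction of the disk capacity that is currently occupied."""
        return len(self.units) / self.capacity_units


def write_without_dedup(disk: Disk, lba: int, data: bytes, threshold: float) -> bool:
    """S10 sketch: write the unit, persist fingerprint and LBA with it,
    and report whether the occupied space has reached the threshold."""
    fingerprint = hashlib.sha256(data).hexdigest()  # assumed fingerprint function
    disk.units.append({"data": data, "fingerprint": fingerprint, "lba": lba})
    return disk.occupancy() >= threshold


disk = Disk(capacity_units=4)
first = write_without_dedup(disk, lba=0, data=b"AAAA", threshold=0.5)
second = write_without_dedup(disk, lba=1, data=b"AAAA", threshold=0.5)
```

Here the second write pushes occupancy to the assumed threshold, which is the condition that would trigger the traversal of steps S20 to S40.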
In the embodiment of the present invention, the deduplication operation does not literally delete duplicate data. Instead, the H-P (HASHKEY → PBA) mapping is queried for data newly written into the hard disk. If the fingerprint value H (HASHKEY) is found to have a corresponding H-P mapping, data with the same content as the data to which H belongs already exists in the storage pool, so no physical address (PBA) is allocated to the new data, which avoids storing data with the same content repeatedly. It should be noted that a piece of data that has not undergone a deduplication operation has both a logical address (LBA) corresponding to a storage volume and a physical address (PBA) corresponding to a storage pool. The fingerprint value, i.e. the hash value, serves mainly as a unique identifier of the data content: if two pieces of data have the same content, their fingerprint values are also the same, but their logical addresses and physical addresses are not necessarily the same.
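The property that a fingerprint identifies content rather than location can be shown with a short sketch. SHA-256 from Python's standard `hashlib` is used here purely as an assumed fingerprint function; the text does not name a hash algorithm.

```python
import hashlib


def fingerprint(unit: bytes) -> str:
    """Return a content fingerprint for one unit of data (SHA-256 is an assumption)."""
    return hashlib.sha256(unit).hexdigest()


# Two units with the same content, imagined at different logical addresses.
fp_a = fingerprint(b"same content")
fp_b = fingerprint(b"same content")
# A unit with different content.
fp_c = fingerprint(b"other content")
```

The fingerprint depends only on the bytes, so `fp_a` and `fp_b` match regardless of where the two units are stored, while `fp_c` differs.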
In the embodiment of the present invention, the granularity (Grain) represents the minimum capacity unit of data, and the data is written according to the granularity unit. The unit data in the embodiment of the present invention represents one granularity unit of data.
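Writing in units of granularity can be pictured as cutting a host write into fixed-size pieces. The sketch below is hypothetical: the function name, the 4-byte grain used in the example, and the zero padding of a short final piece are assumptions not stated in the text.

```python
def split_into_units(data: bytes, grain_size: int) -> list:
    """Split `data` into units of exactly `grain_size` bytes.

    The last unit is zero-padded if the data does not fill it (an assumption).
    """
    if grain_size <= 0:
        raise ValueError("grain_size must be positive")
    units = []
    for offset in range(0, len(data), grain_size):
        chunk = data[offset:offset + grain_size]
        units.append(chunk.ljust(grain_size, b"\x00"))
    return units


units = split_into_units(b"abcdefghij", grain_size=4)
```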
In the embodiment of the invention, all metadata is stored in the metadata management module. When the hard disk is traversed and a unit data is acquired from it, if the unit data has not been subjected to a deduplication operation, the fingerprint value H of the unit data is acquired, and the metadata management module is queried for original metadata containing a mapping relation between the fingerprint value H and a corresponding original physical address P0. If such original metadata exists, first metadata containing a mapping relation between the original physical address P0 and the fingerprint value H of the unit data is established and stored in the metadata management module. Assume that the fingerprint value in the original metadata is H0 and that the unit data has its own physical address P. H is the same as H0, i.e. the two pieces of data have the same content. Therefore, to implement deduplication, H and P0 are combined into a mapping relation, and the first metadata containing this mapping relation is stored in the metadata management module. At this point the original metadata in the metadata management module is the same as the first metadata, which avoids repeatedly calling data with the same content through the metadata.
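The lookup described above can be sketched as follows. This is a simplified illustration: the plain dictionary `h2p` standing in for the metadata management module, the record layout of `unit`, and the function name are assumptions.

```python
def deduplicate_unit(h2p: dict, unit: dict):
    """S30/S40 sketch: return the original physical address P0 if a unit with
    the same fingerprint is already known, otherwise None."""
    h = unit["fingerprint"]
    p0 = h2p.get(h)  # S30: query the original metadata H0 -> P0
    if p0 is None:
        return None  # no data with the same content exists yet
    # S40: the first metadata maps H to P0; since H equals H0 it coincides
    # with the original entry, so storing it leaves the table unchanged.
    h2p[h] = p0
    return p0


h2p = {"h1": 10}                                    # original metadata
dup = {"fingerprint": "h1", "pba": 42, "lba": 5}    # unit written without dedup
new = {"fingerprint": "h2", "pba": 43, "lba": 6}
```

Calling `deduplicate_unit(h2p, dup)` yields the original address 10, so the copy at the hypothetical address 42 becomes redundant, while `new` yields `None`.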
According to the data deduplication method, the fingerprint value of the unit data which is not subjected to deduplication operation is obtained, when the original metadata containing the mapping relation between the fingerprint value of the unit data and the corresponding original physical address exists, the first metadata containing the mapping relation between the original physical address and the fingerprint value of the unit data is established, and the first metadata is stored in the metadata management module, so that the influence on the performance of the storage system is avoided while online deduplication of the data is achieved, and the requirement on the overall deduplication rate of the storage system is met, and the method is efficient and accurate; and the metadata management module is arranged to improve the concurrency degree of access so as to obtain efficient metadata access.
In some embodiments, determining whether a unit data has been subjected to a deduplication operation based on a fingerprint value and/or a logical address of the unit data comprises: judging whether the unit data has a corresponding fingerprint value and a corresponding logical address; in response to the unit data having both a corresponding fingerprint value and a corresponding logical address, determining that the unit data has not been subjected to the deduplication operation; in response to the unit data having a corresponding fingerprint value and no corresponding logical address, determining that the unit data has been subjected to the deduplication operation.
In this embodiment, if a unit data went through the conventional deduplication process after being sent from the host, only its fingerprint value is stored in the hard disk, and its logical address is not stored.
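The rule above reduces to a check on which fields were stored with the unit. In this sketch a missing value is represented by `None`, which is an assumption about the representation; the function name is also hypothetical.

```python
from typing import Optional


def has_been_deduplicated(fingerprint: Optional[str], lba: Optional[int]) -> bool:
    """Return True if the unit already went through deduplication.

    Fingerprint and logical address both present -> not yet deduplicated.
    Fingerprint present, logical address absent  -> already deduplicated.
    """
    if fingerprint is None:
        # The text only covers units that carry a fingerprint.
        raise ValueError("unit has no stored fingerprint")
    return lba is None


pending = has_been_deduplicated("h1", 5)    # written while dedup was abandoned
done = has_been_deduplicated("h1", None)    # written through the normal path
```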
In some embodiments, the method further comprises: and in response to the existence of the original metadata, establishing a second metadata group containing the mapping relation between the original physical address and the logical address of the unit data, and storing the second metadata group to the metadata management module.
In this embodiment, in order to ensure the validity of the management function of the metadata management module and ensure the integrity of the metadata in the metadata management module, the metadata related to the unit data is all stored in the metadata management module.
In some embodiments, the method further comprises: setting the logical address of the unit data as invalid in response to storing the second metadata group in the metadata management module.
In this embodiment, the logical address of the unit data is set to be invalid, which is also helpful for determining whether the unit data has been subjected to the deduplication operation through the logical address.
In some embodiments, the method further comprises: in response to both the first metadata and the second metadata group being stored in the metadata management module, notifying the garbage collection module to perform garbage collection on the unit data.
In this embodiment, to further prevent data with duplicate content from occupying storage space, a garbage collection mechanism is started for the unit data.
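The two follow-up actions, invalidating the logical address and notifying garbage collection, can be sketched together. The queue standing in for the garbage collection module, the flag arguments, and the use of `None` as the invalid marker are assumptions made for this example.

```python
import queue


def finish_dedup(unit: dict, first_stored: bool, second_stored: bool,
                 gc_queue: "queue.Queue") -> bool:
    """Invalidate the unit's logical address and hand its physical block to
    garbage collection once both metadata writes have completed."""
    if not (first_stored and second_stored):
        return False  # metadata not fully stored; leave the unit untouched
    unit["lba"] = None           # mark the stored logical address as invalid
    gc_queue.put(unit["pba"])    # notify the (hypothetical) garbage collector
    return True


gc_queue = queue.Queue()
unit = {"fingerprint": "h1", "lba": 5, "pba": 42}
skipped = finish_dedup(dict(unit), True, False, gc_queue)
finished = finish_dedup(unit, True, True, gc_queue)
```

Once the logical address is invalid, the fingerprint-only state of the unit matches the "already deduplicated" case used when the disk is traversed again.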
In some embodiments, establishing the second metadata set containing the mapping of the original physical address to the logical address of the unit data comprises: metadata is created that includes key-value pairs that point from the original physical address to the logical address of the unit data, and metadata that includes key-value pairs that point from the logical address of the unit data to the original physical address.
In this embodiment, in order to ensure integrity of metadata in the metadata management module, both metadata of two key value pairs including a mapping relationship between an original physical address and a logical address of unit data are stored in the metadata management module.
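The second metadata group can be pictured as two key-value pairs, one per direction. The sketch below uses a single dictionary with tagged keys as a stand-in for the metadata management module; the tags `"P2L"` and `"L2P"` and the function name are hypothetical.

```python
def store_second_metadata_group(store: dict, original_pba: int, lba: int) -> None:
    """Store both directions of the mapping between the original physical
    address and the unit's logical address."""
    store[("P2L", original_pba)] = lba   # original physical address -> logical address
    store[("L2P", lba)] = original_pba   # logical address -> original physical address


store = {}
store_second_metadata_group(store, original_pba=10, lba=5)
```

Keeping both directions means a lookup can start from either address without scanning the other table.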
In some embodiments, the method further comprises: in response to the absence of the original metadata, establishing third metadata containing a mapping relation between the fingerprint value of the unit data and the physical address of the unit data, establishing fourth metadata containing a mapping relation between the logical address of the unit data and the physical address of the unit data, and storing the third metadata and the fourth metadata in the metadata management module.
In this embodiment, if no original metadata containing a mapping relation between the fingerprint value of the unit data and a corresponding original physical address is found, the unit data is new data: no data with the same content exists, and it naturally has not been deduplicated. Therefore, mapping relations are established among the fingerprint value, the physical address and the logical address of the unit data, and the third metadata and fourth metadata containing these mapping relations are stored in the metadata management module, so that when other new data arrives, querying the metadata management module reveals whether a fingerprint value matching the content of that data already exists. The fourth metadata in this embodiment includes metadata of a key-value pair pointing from the logical address of the unit data to its physical address, and also metadata of a key-value pair pointing from the physical address of the unit data to its logical address.
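The branch for new data can be sketched as registering the unit under its own addresses so that later units can find it. The nested-dictionary layout of `store` and the function name are assumptions for illustration.

```python
def register_new_unit(store: dict, fingerprint: str, lba: int, pba: int) -> None:
    """Record third and fourth metadata for a unit whose content is new."""
    store["h2p"][fingerprint] = pba  # third metadata: fingerprint -> physical address
    store["l2p"][lba] = pba          # fourth metadata: logical -> physical
    store["p2l"][pba] = lba          # fourth metadata: physical -> logical


store = {"h2p": {}, "l2p": {}, "p2l": {}}
register_new_unit(store, fingerprint="h2", lba=6, pba=43)

# A later unit with the same content now finds an existing entry.
later_lookup = store["h2p"].get("h2")
```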
In a second aspect of the embodiments of the present invention, a data deduplication system is further provided. Fig. 2 is a schematic diagram illustrating an embodiment of a data deduplication system provided in the present invention. As shown in fig. 2, a data deduplication system includes: a storage space judgment module 10 configured to respond to a host abandoning a data deduplication operation and write data in units of granularity into a hard disk, calculate a fingerprint value of the data, write the fingerprint value of the data and a logical address thereof into the hard disk, and judge whether an occupied storage space in the hard disk reaches a preset threshold; a deduplication determining module 20 configured to, in response to the occupied storage space reaching a preset threshold, obtain unit data from the hard disk, and determine whether the unit data has been subjected to deduplication operation based on a fingerprint value and/or a logical address of the unit data; an original metadata query module 30 configured to respond to the unit data not being subjected to deduplication operation, obtain a fingerprint value of the unit data, and query whether there is original metadata containing a mapping relationship between the fingerprint value of the unit data and a corresponding original physical address through a metadata management module; and a data deduplication module 40 configured to, in response to the existence of the original metadata, establish first metadata including a mapping relationship between an original physical address and a fingerprint value of the unit data, and store the first metadata to the metadata management module to perform a deduplication operation of the unit data.
In some embodiments, the deduplication determining module 20 includes a fingerprint value and logical address determining module configured to: determine whether the unit data has a corresponding fingerprint value and a corresponding logical address; in response to the unit data having both a corresponding fingerprint value and a corresponding logical address, determine that the unit data has not been subjected to the deduplication operation; in response to the unit data having a corresponding fingerprint value and no corresponding logical address, determine that the unit data has been subjected to the deduplication operation.
In some embodiments, the system further comprises a second metadata set storage module configured to establish a second metadata set containing a mapping relationship between the original physical address and the logical address of the unit data in response to the original metadata being present, and store the second metadata set to the metadata management module.
In some embodiments, the system further includes a logical address invalidation module configured to invalidate the logical address of the unit data in response to storing the second metadata group to the metadata management module.
In some embodiments, the system further includes a garbage collection module configured to notify the garbage collection module to garbage collect the unit data in response to depositing both the first metadata and the second metadata group to the metadata management module.
In some embodiments, the second metadata set deposit module includes a key-value pair module configured to create metadata including key-value pairs pointed to by the original physical address to the logical address of the unit data, and metadata including key-value pairs pointed to by the logical address of the unit data to the original physical address.
In some embodiments, the system further includes a metadata storage module configured to, in response to absence of the original metadata, establish third metadata including a mapping relationship between a fingerprint value of the unit data and a physical address thereof, establish fourth metadata including a mapping relationship between a logical address of the unit data and a physical address thereof, and store the third metadata and the fourth metadata to the metadata management module.
In a third aspect of the embodiment of the present invention, a computer-readable storage medium is further provided, and fig. 3 is a schematic diagram of a computer-readable storage medium for implementing a data deduplication method according to an embodiment of the present invention. As shown in fig. 3, the computer-readable storage medium 3 stores computer program instructions 31. The computer program instructions 31, when executed by a processor, implement the method of any of the embodiments described above.
It should be understood that all the embodiments, features and advantages set forth above with respect to the data deduplication method according to the present invention are equally applicable to the data deduplication system and the storage medium according to the present invention, without conflicting therewith.
In a fourth aspect of the embodiments of the present invention, there is further provided a computer device, including a memory 402 and a processor 401 as shown in fig. 4, where the memory 402 stores therein a computer program, and the computer program, when executed by the processor 401, implements the method of any one of the above embodiments.
Fig. 4 is a schematic hardware structure diagram of an embodiment of a computer device for performing a data deduplication method according to the present invention. Taking the computer device shown in fig. 4 as an example, the computer device includes a processor 401 and a memory 402, and may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus. The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the data deduplication system. The output device 404 may include a display device such as a display screen.
The memory 402, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the data deduplication method in the embodiments of the present application. The memory 402 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created by use of the data deduplication method, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the local modules via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
By running the non-volatile software programs, instructions, and modules stored in the memory 402, the processor 401 executes the various functional applications and data processing of the server, that is, implements the data deduplication method of the above method embodiments.
Finally, it should be noted that the computer-readable storage medium (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbering of the embodiments disclosed herein is for description only and does not indicate the relative merits of the embodiments.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A data deduplication method, characterized by comprising the following steps:
in response to an operation in which a host forgoes data deduplication and writes data to a hard disk in units of a granularity, calculating a fingerprint value of the data, writing the fingerprint value of the data and a logical address thereof to the hard disk, and determining whether the occupied storage space in the hard disk reaches a preset threshold;
in response to the occupied storage space reaching the preset threshold, acquiring unit data from the hard disk, and determining whether the unit data has undergone a deduplication operation based on a fingerprint value and/or a logical address of the unit data;
in response to the unit data not having undergone the deduplication operation, acquiring the fingerprint value of the unit data, and querying, through a metadata management module, whether original metadata containing a mapping relationship between the fingerprint value of the unit data and a corresponding original physical address exists; and
in response to the existence of the original metadata, establishing first metadata containing a mapping relationship between the original physical address and the fingerprint value of the unit data, and storing the first metadata in the metadata management module to perform the deduplication operation on the unit data.
2. The method of claim 1, wherein determining whether the unit data has undergone the deduplication operation based on the fingerprint value and/or the logical address of the unit data comprises:
determining whether the unit data has a corresponding fingerprint value and a corresponding logical address;
in response to the unit data having both a corresponding fingerprint value and a corresponding logical address, confirming that the unit data has not undergone the deduplication operation; and
in response to the unit data having a corresponding fingerprint value but no corresponding logical address, confirming that the unit data has undergone the deduplication operation.
3. The method of claim 1, further comprising:
in response to the existence of the original metadata, establishing a second metadata group containing a mapping relationship between the original physical address and the logical address of the unit data, and storing the second metadata group in the metadata management module.
4. The method of claim 3, further comprising:
in response to the second metadata group being stored in the metadata management module, setting the logical address of the unit data as invalid.
5. The method of claim 3, further comprising:
in response to the first metadata and the second metadata group being stored in the metadata management module, notifying a garbage collection module to perform garbage collection on the unit data.
6. The method of claim 3, wherein establishing a second metadata group containing a mapping relationship between the original physical address and the logical address of the unit data comprises:
establishing metadata containing a key-value pair in which the original physical address points to the logical address of the unit data, and metadata containing a key-value pair in which the logical address of the unit data points to the original physical address.
7. The method of claim 1, further comprising:
in response to the absence of the original metadata, establishing third metadata containing a mapping relationship between the fingerprint value of the unit data and a physical address thereof, establishing fourth metadata containing a mapping relationship between the logical address of the unit data and the physical address thereof, and storing the third metadata and the fourth metadata in the metadata management module.
8. A data deduplication system, comprising:
a storage space judging module configured to, in response to an operation in which a host forgoes data deduplication and writes data to a hard disk in units of a granularity, calculate a fingerprint value of the data, write the fingerprint value of the data and a logical address thereof to the hard disk, and determine whether the occupied storage space in the hard disk reaches a preset threshold;
a deduplication judging module configured to, in response to the occupied storage space reaching the preset threshold, acquire unit data from the hard disk, and determine whether the unit data has undergone a deduplication operation based on a fingerprint value and/or a logical address of the unit data;
an original metadata query module configured to, in response to the unit data not having undergone the deduplication operation, acquire the fingerprint value of the unit data, and query, through a metadata management module, whether original metadata containing a mapping relationship between the fingerprint value of the unit data and a corresponding original physical address exists; and
a data deduplication module configured to, in response to the existence of the original metadata, establish first metadata containing a mapping relationship between the original physical address and the fingerprint value of the unit data, and store the first metadata in the metadata management module to perform the deduplication operation on the unit data.
9. A computer-readable storage medium, characterized in that computer program instructions are stored which, when executed by a processor, implement the method according to any one of claims 1-7.
10. A computer device comprising a memory and a processor, characterized in that the memory has stored therein a computer program which, when executed by the processor, performs the method according to any one of claims 1-7.
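
By way of non-limiting illustration, the flow recited in claims 1 to 7 can be sketched as follows. Everything named here is hypothetical and chosen for the sketch: the `Unit` and `MetadataManager` classes, the use of SHA-256 as the fingerprint function, modeling the occupied storage space as a count of written units, using the list index as the physical address, and representing the garbage collection module as a plain list. The check that the original physical address differs from the unit's own address is likewise an assumption of the sketch, added so that a repeated pass does not treat a unit as a duplicate of itself.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Unit:
    pba: int             # physical address of this unit on the disk
    data: bytes
    fp: str              # fingerprint value written alongside the data
    lba: Optional[int]   # logical address; None once it has been set invalid


@dataclass
class MetadataManager:
    fp_to_pba: dict = field(default_factory=dict)    # original / third metadata
    pba_to_fp: dict = field(default_factory=dict)    # first metadata
    lba_to_pba: dict = field(default_factory=dict)   # second / fourth metadata
    pba_to_lbas: dict = field(default_factory=dict)  # second metadata, reverse direction


def fingerprint(data: bytes) -> str:
    # SHA-256 is an assumption; the claims do not name a hash function.
    return hashlib.sha256(data).hexdigest()


def write_without_dedup(disk: list, lba: int, data: bytes, threshold: int) -> bool:
    """Host write that skips deduplication; returns True once the threshold is reached."""
    disk.append(Unit(pba=len(disk), data=data, fp=fingerprint(data), lba=lba))
    return len(disk) >= threshold


def is_deduplicated(unit: Unit) -> bool:
    """Claim 2: a fingerprint without a logical address means already deduplicated."""
    return unit.fp is not None and unit.lba is None


def deferred_dedup(disk: list, mgr: MetadataManager, gc_queue: list) -> None:
    for unit in disk:
        if is_deduplicated(unit):
            continue
        original = mgr.fp_to_pba.get(unit.fp)
        if original is not None and original != unit.pba:
            mgr.pba_to_fp[original] = unit.fp                           # first metadata
            mgr.lba_to_pba[unit.lba] = original                         # second metadata, LBA -> PBA
            mgr.pba_to_lbas.setdefault(original, set()).add(unit.lba)   # second metadata, PBA -> LBA
            unit.lba = None                                             # claim 4: logical address invalid
            gc_queue.append(unit.pba)                                   # claim 5: notify garbage collection
        else:
            mgr.fp_to_pba[unit.fp] = unit.pba                           # third metadata
            mgr.lba_to_pba[unit.lba] = unit.pba                         # fourth metadata


# Usage: three writes, two of which carry identical data.
disk, mgr, gc_queue = [], MetadataManager(), []
for lba, payload in [(0, b"A"), (1, b"B"), (2, b"A")]:
    full = write_without_dedup(disk, lba, payload, threshold=3)
if full:
    deferred_dedup(disk, mgr, gc_queue)
```

In this usage, the third write duplicates the first, so after the pass its logical address maps to the physical address of the first unit and its own physical address is queued for garbage collection.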
CN202111090326.4A 2021-09-17 2021-09-17 Data deduplication method, system, storage medium and equipment Pending CN113535708A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111090326.4A CN113535708A (en) 2021-09-17 2021-09-17 Data deduplication method, system, storage medium and equipment
PCT/CN2022/078324 WO2023040200A1 (en) 2021-09-17 2022-02-28 Data deduplication method and system, and storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111090326.4A CN113535708A (en) 2021-09-17 2021-09-17 Data deduplication method, system, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN113535708A true CN113535708A (en) 2021-10-22

Family

ID=78093359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111090326.4A Pending CN113535708A (en) 2021-09-17 2021-09-17 Data deduplication method, system, storage medium and equipment

Country Status (2)

Country Link
CN (1) CN113535708A (en)
WO (1) WO2023040200A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138198A (en) * 2021-11-29 2022-03-04 苏州浪潮智能科技有限公司 Method, device and equipment for data deduplication and readable medium
CN114253472A (en) * 2021-11-29 2022-03-29 郑州云海信息技术有限公司 Metadata management method, equipment and storage medium
CN115437579A (en) * 2022-11-04 2022-12-06 苏州浪潮智能科技有限公司 Metadata management method and device, computer equipment and readable storage medium
WO2023040200A1 (en) * 2021-09-17 2023-03-23 苏州浪潮智能科技有限公司 Data deduplication method and system, and storage medium and device
CN117931092A (en) * 2024-03-20 2024-04-26 苏州元脑智能科技有限公司 Data deduplication adjustment method, device, equipment, storage system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339444A (en) * 2016-08-23 2017-01-18 深圳市金立通信设备有限公司 Method for instantly deleting file and terminal
CN106527973A (en) * 2016-10-10 2017-03-22 杭州宏杉科技股份有限公司 A method and device for data deduplication
CN107122130A (en) * 2017-04-13 2017-09-01 杭州宏杉科技股份有限公司 A kind of data delete method and device again
CN110727404A (en) * 2019-09-27 2020-01-24 苏州浪潮智能科技有限公司 Data deduplication method and device based on storage end and storage medium
CN110795031A (en) * 2019-10-17 2020-02-14 北京浪潮数据技术有限公司 Data deduplication method, device and system based on full flash storage

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535708A (en) * 2021-09-17 2021-10-22 苏州浪潮智能科技有限公司 Data deduplication method, system, storage medium and equipment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023040200A1 (en) * 2021-09-17 2023-03-23 苏州浪潮智能科技有限公司 Data deduplication method and system, and storage medium and device
CN114138198A (en) * 2021-11-29 2022-03-04 苏州浪潮智能科技有限公司 Method, device and equipment for data deduplication and readable medium
CN114253472A (en) * 2021-11-29 2022-03-29 郑州云海信息技术有限公司 Metadata management method, equipment and storage medium
CN114253472B (en) * 2021-11-29 2023-09-22 郑州云海信息技术有限公司 Metadata management method, device and storage medium
CN114138198B (en) * 2021-11-29 2024-05-28 苏州浪潮智能科技有限公司 Method, device, equipment and readable medium for deleting data
CN115437579A (en) * 2022-11-04 2022-12-06 苏州浪潮智能科技有限公司 Metadata management method and device, computer equipment and readable storage medium
CN115437579B (en) * 2022-11-04 2023-03-24 苏州浪潮智能科技有限公司 Metadata management method and device, computer equipment and readable storage medium
WO2024093090A1 (en) * 2022-11-04 2024-05-10 苏州元脑智能科技有限公司 Metadata management method and apparatus, computer device, and readable storage medium
CN117931092A (en) * 2024-03-20 2024-04-26 苏州元脑智能科技有限公司 Data deduplication adjustment method, device, equipment, storage system and storage medium
CN117931092B (en) * 2024-03-20 2024-05-24 苏州元脑智能科技有限公司 Data deduplication adjustment method, device, equipment, storage system and storage medium

Also Published As

Publication number Publication date
WO2023040200A1 (en) 2023-03-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211022