CN114138552A - Data dynamic deduplication method, system, terminal and storage medium - Google Patents

Data dynamic deduplication method, system, terminal and storage medium

Info

Publication number
CN114138552A
Authority
CN
China
Prior art keywords
data
backup
characteristic
value
backup data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111335541.6A
Other languages
Chinese (zh)
Other versions
CN114138552B (en)
Inventor
朱箫鸣
冀国威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111335541.6A priority Critical patent/CN114138552B/en
Publication of CN114138552A publication Critical patent/CN114138552A/en
Application granted granted Critical
Publication of CN114138552B publication Critical patent/CN114138552B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G06F11/1453 Management of the data involved in backup or backup restore using de-duplication of the data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, a system, a terminal and a storage medium for dynamic data deduplication. The method comprises: extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular sizes based on the characteristics; calculating the hash value of each data block, and searching the stored data for a matching data block based on the hash value; and deleting the data blocks for which a matching data block exists. The method addresses the problems of fixed-block division, greatly reduces the client computing resources consumed by data division, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more space during data storage.

Description

Data dynamic deduplication method, system, terminal and storage medium
Technical Field
The invention relates to the technical field of data storage, in particular to a method, a system, a terminal and a storage medium for dynamic data deduplication.
Background
With the development of science and technology, data has begun to grow exponentially, and data security has become a key concern of governments and enterprises. In backup protection, however, a large amount of redundant data always occupies storage space. In backup and disaster recovery products, data deduplication has therefore become one of the indexes used to evaluate whether a product is superior in technical content, operating performance, product quality and the like.
To implement data deduplication, manufacturers generally first divide the data into blocks: the backup data is divided into non-overlapping fixed-length data blocks. Commonly used block sizes are 4K/8K/16K/32K/128K, and different manufacturers select different fixed lengths. A hash algorithm is then used to establish fingerprint information for each data block, and the system judges whether the data block duplicates existing "metadata" by calculating and checking the block's "fingerprint": if it does, only a pointer to the "metadata" needs to be retained; if the fingerprint shows that the data block is brand new, the data block is kept, and the relevant information is extracted and stored as metadata for subsequent verification and comparison.
In this process the size of the data block is clearly a crucial issue, because it affects both the operating performance and the deduplication rate: with large data blocks, deduplication runs fast, but the deduplication rate is low and accuracy is reduced; with small data blocks, deduplication runs slowly, but the deduplication rate is high and accuracy is improved. Moreover, in scenarios with a large number of small files the data changes in complex ways. When data is inserted into or deleted from a source data object, fixed-length segmentation requires the data to be segmented into blocks again, which increases the amount of computation while lowering the deduplication rate.
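The weakness of fixed-length blocking described above can be illustrated with a minimal sketch. The function names, the 4-byte block size and the use of SHA-256 as the fingerprint are assumptions chosen for illustration; the text only speaks of "a hash algorithm". Inserting a single byte at the front shifts every block boundary, so none of the new blocks matches a stored fingerprint.

```python
import hashlib


def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into non-overlapping fixed-length blocks; the last may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> str:
    """SHA-256 is an assumed choice of hash; the source does not name one."""
    return hashlib.sha256(block).hexdigest()


original = b"abcdefghijklmnop"   # 16 bytes -> four 4-byte blocks
shifted = b"X" + original        # one byte inserted at the front

# Fingerprints already known from the first backup.
known = {fingerprint(b) for b in fixed_chunks(original, 4)}

# Second backup: every boundary has moved by one byte.
new_blocks = fixed_chunks(shifted, 4)
duplicates = [b for b in new_blocks if fingerprint(b) in known]

print(len(new_blocks), len(duplicates))
```

Although 16 of the 17 bytes are unchanged, the second backup finds no duplicate block, which is the low deduplication rate the background section attributes to fixed-length segmentation.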
Disclosure of Invention
In view of the above-mentioned deficiencies of the prior art, the present invention provides a method, a system, a terminal and a storage medium for dynamic data deduplication, so as to solve the above-mentioned technical problems.
In a first aspect, the present invention provides a method for dynamic data deduplication, including:
extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;
calculating the hash value of each data block, and searching a matched data block of the data block from the stored data based on the hash value;
deleting the data blocks for which there is a matching data block.
Further, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics, including:
using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, wherein
$\bar{x}_i$ is the average of the i-th feature over the entire data set,
$\bar{x}_i^{(+)}$ is the average of the i-th feature over the positive-class data set,
$\bar{x}_i^{(-)}$ is the average of the i-th feature over the negative-class data set,
$x_{k,i}^{(+)}$ is the characteristic value of the i-th feature of the k-th positive-class sample point,
$x_{k,i}^{(-)}$ is the characteristic value of the i-th feature of the k-th negative-class sample point,
and $n_+$ and $n_-$ are the numbers of positive-class and negative-class sample points;
sorting the F-scores of the characteristic values in descending order, and selecting the top-ranked F-scores according to the set number of characteristic values;
and taking the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
Further, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics, including:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function
[formula image BDA0003350360190000031: segmentation length calculation function]
to obtain a division length $S_n$, wherein $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;
switching the target characteristic value until all the characteristic values are traversed, to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
Further, calculating a hash value of each data block, and searching a matching block of the data block from the stored data based on the hash value, includes:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
In a second aspect, the present invention provides a system for dynamic data deduplication, including:
the data segmentation unit is used for extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into data blocks with irregular sizes based on the characteristics;
the matching search unit is used for calculating the hash value of each data block and searching the matching data block of the data block from the stored data based on the hash value;
and the deduplication unit is used for deleting the data blocks for which a matching data block exists.
Further, the data partitioning unit is configured to:
using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, wherein
$\bar{x}_i$ is the average of the i-th feature over the entire data set,
$\bar{x}_i^{(+)}$ is the average of the i-th feature over the positive-class data set,
$\bar{x}_i^{(-)}$ is the average of the i-th feature over the negative-class data set,
$x_{k,i}^{(+)}$ is the characteristic value of the i-th feature of the k-th positive-class sample point,
$x_{k,i}^{(-)}$ is the characteristic value of the i-th feature of the k-th negative-class sample point,
and $n_+$ and $n_-$ are the numbers of positive-class and negative-class sample points;
sorting the F-scores of the characteristic values in descending order, and selecting the top-ranked F-scores according to the set number of characteristic values;
and taking the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
Further, the data partitioning unit is configured to:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function
[formula image BDA0003350360190000047: segmentation length calculation function]
to obtain a division length $S_n$, wherein $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;
switching the target characteristic value until all the characteristic values are traversed, to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
Further, the matching search unit is configured to:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
In a third aspect, a terminal is provided, including:
a processor, a memory, wherein,
the memory is used for storing a computer program which,
the processor is used for calling and running the computer program from the memory so as to make the terminal execute the method described above.
In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.
The method, system, terminal and storage medium for dynamic data deduplication have the following advantages. The characteristics of the backup data are extracted by using an F-score dimensionality reduction method, and the backup data is divided into data blocks of irregular sizes based on the characteristics, which realizes dynamic division of the backup data and solves the problems of fixed-block division. Because division processing is carried out only on the changed part of the data, the client computing resources consumed by data division are greatly reduced, the deduplication rate when backing up massive numbers of small files is effectively improved, and more space is saved during data storage.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The prior art also includes a sliding-window blocking scheme. The first backup under this scheme is the same as in the fixed-length deduplication method: the whole data stream is divided into fixed-length blocks, and a hash value is calculated for each block. The fixed length is the length of the window, and the window is slid during subsequent backups to find and match identical data. Take a data modification as an example, in which the second slice changes from deab to ddab. The hash value of ddab is calculated first; because the data of this slice has changed, no fingerprint matches. Instead of moving straight on to the next slice, the window is moved forward by one unit, the hash value of the data under the window (fingerprint 2') is calculated, and matching is attempted again; this is repeated until a matching hash value is found. When data is simply overwritten, the result is the same as with fixed-length deduplication, and the same deduplication rate is obtained. However, the scheme still has to divide the data into fixed-length data blocks at the outset, which is unsuitable for small files. In addition, because the window is then slid at a specified step and the data is deduplicated again, the hash values of data blocks must be calculated many times, so the amount of computation is large and the deduplication efficiency is low.
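The sliding-window behaviour described above can be sketched as follows. This is an illustrative reading of the prior-art scheme, not code from the patent: the function name, the segment labels and the use of SHA-256 are assumptions, and the window advances one byte at a time on a miss.

```python
import hashlib


def h(block: bytes) -> str:
    """SHA-256 stands in for the unspecified fingerprint function."""
    return hashlib.sha256(block).hexdigest()


def sliding_window_dedup(data: bytes, known: set[str], w: int):
    """Return (segments, hash_count); each segment is ('dup', bytes) or ('new', bytes)."""
    segments = []
    pending = bytearray()   # bytes that matched no stored fingerprint so far
    pos = 0
    hash_count = 0
    while pos + w <= len(data):
        window = data[pos:pos + w]
        hash_count += 1
        if h(window) in known:
            if pending:
                segments.append(("new", bytes(pending)))
                pending.clear()
            segments.append(("dup", window))
            pos += w                    # jump past the matched block
        else:
            pending.append(data[pos])   # slide the window forward by one unit
            pos += 1
    pending.extend(data[pos:])          # tail shorter than one window
    if pending:
        segments.append(("new", bytes(pending)))
    return segments, hash_count


first = b"abcddeabefgh"                 # first backup: blocks abcd, deab, efgh
second = b"abcdddabefgh"                # second slice overwritten: deab -> ddab
known = {h(first[i:i + 4]) for i in range(0, len(first), 4)}
segments, hash_count = sliding_window_dedup(second, known, 4)
print(segments, hash_count)
```

For this overwrite the two unchanged blocks are recognised, as they would be with fixed-length blocking, but six window hashes are computed instead of three, which illustrates the extra computation the paragraph points out.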
In actual production, different types of service systems produce data of different sizes; for example, structured data is generally at the KB level and unstructured data at the MB level. Fixed-length data block segmentation therefore suffers from a low deduplication rate. To solve this problem, the invention discloses a data deduplication method based on variable-length data block segmentation. The method mainly determines data boundaries through a continuously sliding window and dynamically divides the backup data into data blocks of different sizes according to a characteristic function of the backup data, thereby improving the deduplication rate.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be a data dynamic deduplication system.
As shown in fig. 1, the method includes:
step 110, extracting the characteristics of the backup data by using an F score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;
step 120, calculating hash values of the data blocks, and searching matched data blocks of the data blocks from the stored data based on the hash values;
and step 130, deleting the data blocks with the matching data blocks.
In order to facilitate understanding of the present invention, the following describes the data dynamic deduplication method provided by the present invention in further detail by using the principle of the data dynamic deduplication method of the present invention and combining the process of performing dynamic deduplication on data in the embodiment.
Specifically, the data dynamic deduplication method includes:
S1, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics.
For a given data set of backup data, assume that the number of its positive-class points and the number of its negative-class points are $n_+$ and $n_-$, respectively:
using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, wherein
$\bar{x}_i$ is the average of the i-th feature over the entire data set,
$\bar{x}_i^{(+)}$ is the average of the i-th feature over the positive-class data set,
$\bar{x}_i^{(-)}$ is the average of the i-th feature over the negative-class data set,
$x_{k,i}^{(+)}$ is the characteristic value of the i-th feature of the k-th positive-class sample point,
$x_{k,i}^{(-)}$ is the characteristic value of the i-th feature of the k-th negative-class sample point.
The numerator on the right-hand side of the formula roughly reflects the degree of difference between the positive-class points and the negative-class points on the i-th feature, and the left and right terms of the denominator reflect the degree of dispersion of the positive-class points and the negative-class points on the i-th feature, respectively. The larger the value of F(i), the better the i-th feature distinguishes the two classes of points, so the F-score can be used as a criterion for selecting features.
Assuming that the number of features to be selected is d, n F-scores are calculated by using the formula: F(1), F(2), …, F(n). The n F-scores are sorted in descending order, and the top d F-scores are selected. The subscripts $k_i$ corresponding to the selected F-scores are then taken, and the features corresponding to these subscripts give the characteristic values of the backup data.
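The scoring and selection step can be sketched in a few lines. The sketch assumes the usual two-class F-score form (squared distances of the class means from the overall mean, divided by the sum of the two within-class sample variances), which matches the definitions given for the formula. The function names and the toy samples are hypothetical, and how backup data is assigned to the positive and negative classes is not specified in the text.

```python
def f_score(pos: list[list[float]], neg: list[list[float]], i: int) -> float:
    """F-score of feature i; pos/neg are lists of samples (each a list of feature values)."""
    xp = [row[i] for row in pos]
    xn = [row[i] for row in neg]
    mean_p = sum(xp) / len(xp)
    mean_n = sum(xn) / len(xn)
    mean_all = (sum(xp) + sum(xn)) / (len(xp) + len(xn))
    # Numerator: how far each class mean lies from the overall mean.
    numerator = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
    # Denominator: within-class sample variances (needs >= 2 samples per class
    # and non-zero total variance; this sketch does not guard against that).
    var_p = sum((x - mean_p) ** 2 for x in xp) / (len(xp) - 1)
    var_n = sum((x - mean_n) ** 2 for x in xn) / (len(xn) - 1)
    return numerator / (var_p + var_n)


def top_features(pos: list[list[float]], neg: list[list[float]], d: int) -> list[int]:
    """Indices of the d features with the largest F-scores, in descending order."""
    count = len(pos[0])
    scores = [f_score(pos, neg, i) for i in range(count)]
    return sorted(range(count), key=lambda i: scores[i], reverse=True)[:d]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
positive = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
negative = [[7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
selected = top_features(positive, negative, d=1)
print(f_score(positive, negative, 0), f_score(positive, negative, 1), selected)
```

In the toy data the class means of feature 0 are 2 and 8 around an overall mean of 5, giving a numerator of 18 over a variance sum of 2, while feature 1 has identical class means and scores 0, so only feature 0 is kept.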
A target characteristic value is randomly selected from all characteristic values of the backup data and taken as a segmentation starting point;
the target characteristic value is substituted into the segmentation length calculation function
[formula image BDA0003350360190000081: segmentation length calculation function]
to obtain a division length $S_n$, where $X_i$ is a sequence of random variables representing the unknowns and n is the number of characteristic values; $a_i$ is a coefficient determined by the data structure of the backup data: if the backup data is structured data, $a_i$ is a constant; if the backup data is unstructured data, $a_i$ is a row of a coefficient array.
The target characteristic value is switched until all the characteristic values have been traversed, giving the segmentation length corresponding to each characteristic value; the backup data is then divided into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
The data is thus dynamically segmented into irregular data blocks. In a scenario with a large number of small files, the files are generally 1 to 4 KB in size. Based on the variable-length data block principle, the data is divided during backup into data blocks of different sizes according to the file size. For example, if the original file is 1 KB, the 1 KB is treated as one data block and hashed to obtain the corresponding fingerprint information; if the original file is 3 KB, it is split at 2 KB into two data blocks, and each is hashed to obtain two corresponding pieces of fingerprint information.
S2, calculating the hash value of each data block, and searching the matched data block of the data block from the stored data based on the hash value.
The hash value of each data block is calculated, and a matching data block with the same hash value is retrieved from the stored data; the metadata of the matching data block is acquired and used as the metadata of the data block.
S3, deleting the data blocks for which a matching data block exists.
In step S2 the metadata of the matching data block has already been bound to the data block as its metadata, so deleting the data block saves storage resources, and the content of the data block can still be read out according to the metadata when the data block is looked up.
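Steps S2 and S3 can be sketched with an in-memory index. The class and method names are hypothetical, SHA-256 is an assumed fingerprint, and a plain dictionary stands in for the stored data and its metadata; a real backup product would persist both.

```python
import hashlib


class DedupStore:
    """Toy block store: one copy per fingerprint, plus a per-backup list of fingerprints."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}       # fingerprint -> stored block
        self.recipes: dict[str, list[str]] = {}  # backup name -> ordered fingerprints

    def backup(self, name: str, blocks: list[bytes]) -> int:
        """Store a backup; return how many blocks actually had to be written."""
        written = 0
        recipe = []
        for block in blocks:
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:   # no matching block: keep the data
                self.blocks[fp] = block
                written += 1
            recipe.append(fp)           # a match keeps only the reference
        self.recipes[name] = recipe
        return written

    def restore(self, name: str) -> bytes:
        """Rebuild a backup by following its stored references."""
        return b"".join(self.blocks[fp] for fp in self.recipes[name])


store = DedupStore()
first_written = store.backup("day1", [b"aa", b"bb", b"cc"])
second_written = store.backup("day2", [b"aa", b"XX", b"cc"])
restored = store.restore("day2")
print(first_written, second_written, restored)
```

The second backup writes only the one block whose fingerprint is new, yet it can still be restored in full from the references, which is the role the text assigns to the bound metadata.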
The embodiment effectively solves the problems existing in the fixed block segmentation, and when data is inserted into or deleted from the data object, if the changed content is not in the boundary of the data block, the data block is not changed; the boundaries between data blocks are random, dynamic, and portions of the contents of the data blocks may be repetitive. Therefore, the insertion or deletion of the content only affects one or two adjacent data blocks, and the rest of the data blocks are not affected, so that the deduplication of the data is more accurate.
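The locality claimed here can be illustrated with a deliberately simple content-defined boundary rule. This is a swapped-in technique for illustration only: the sketch cuts after a marker byte (a newline), whereas the patent derives boundaries from F-score-selected characteristic values and a segmentation length function. All names and sample data are hypothetical.

```python
def content_chunks(data: bytes, marker: int = 0x0A) -> list[bytes]:
    """Cut after every marker byte, so boundaries follow content rather than offsets."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes without a marker
    return chunks


def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Fixed-length blocking, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused(old_chunks: list[bytes], new_chunks: list[bytes]) -> int:
    """Number of new chunks that already exist among the old chunks."""
    old = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in old)


original = b"alpha\nbeta\ngamma\ndelta\n"
modified = b"alpha\nbXeta\ngamma\ndelta\n"   # one byte inserted in the second record

variable_reuse = reused(content_chunks(original), content_chunks(modified))
fixed_reuse = reused(fixed_chunks(original, 6), fixed_chunks(modified, 6))
print(variable_reuse, fixed_reuse)
```

With content-defined boundaries only the chunk containing the inserted byte changes, so three of the four chunks are reused; with 6-byte fixed blocks the insertion shifts every later boundary and only the first block is reused.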
For the backup of unstructured data, especially massive numbers of small files, a dynamic segmentation technology is adopted. Because data changes in massive small files are complex, fixed-block partitioning often requires the whole backup data to be re-partitioned. Variable-length block partitioning solves the problems of fixed-block partitioning: only the changed part of the data is partitioned again, so the client computing resources consumed by data partitioning are greatly reduced and the best deduplication effect for massive small files is ensured.
As shown in fig. 2, the system 200 includes:
a data dividing unit 210, configured to extract features of the backup data by using an F-score dimensionality reduction method, and divide the backup data into data blocks with irregular sizes based on the features;
a matching search unit 220, configured to calculate a hash value of each data block, and search a matching data block of the data block from the stored data based on the hash value;
and a deduplication unit 230 configured to delete the data block where the matching data block exists.
Optionally, as an embodiment of the present invention, the data dividing unit is configured to:
using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, wherein
$\bar{x}_i$ is the average of the i-th feature over the entire data set,
$\bar{x}_i^{(+)}$ is the average of the i-th feature over the positive-class data set,
$\bar{x}_i^{(-)}$ is the average of the i-th feature over the negative-class data set,
$x_{k,i}^{(+)}$ is the characteristic value of the i-th feature of the k-th positive-class sample point,
$x_{k,i}^{(-)}$ is the characteristic value of the i-th feature of the k-th negative-class sample point,
and $n_+$ and $n_-$ are the numbers of positive-class and negative-class sample points;
sorting the F-scores of the characteristic values in descending order, and selecting the top-ranked F-scores according to the set number of characteristic values;
and taking the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
Optionally, as an embodiment of the present invention, the data dividing unit is configured to:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function
[formula image BDA0003350360190000107: segmentation length calculation function]
to obtain a division length $S_n$, wherein $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;
switching the target characteristic value until all the characteristic values are traversed, to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
Optionally, as an embodiment of the present invention, the matching search unit is configured to:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, where the terminal 300 may be used to execute the data dynamic deduplication method according to the embodiment of the present invention.
The terminal 300 may include a processor 310, a memory 320, and a communication unit 330. The components communicate via one or more buses. Those skilled in the art will appreciate that the structure shown in the figure is not limiting: it may be a bus architecture or a star architecture, and it may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 320 may be used for storing instructions executed by the processor 310, and the memory 320 may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in memory 320, when executed by processor 310, enable terminal 300 to perform some or all of the steps in the method embodiments described above.
The processor 310 is a control center of the storage terminal, connects various parts of the entire electronic terminal using various interfaces and lines, and performs various functions of the electronic terminal and/or processes data by operating or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory. The processor may be composed of an Integrated Circuit (IC), for example, a single packaged IC, or a plurality of packaged ICs connected with the same or different functions. For example, the processor 310 may include only a Central Processing Unit (CPU). In the embodiment of the present invention, the CPU may be a single operation core, or may include multiple operation cores.
The communication unit 330 is configured to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The present invention also provides a computer storage medium, wherein the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided by the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Therefore, the invention extracts the characteristics of the backup data by using an F-score dimensionality reduction method and divides the backup data into irregular data blocks based on the characteristics, which realizes dynamic division of the backup data and solves the problems of fixed-block division. Because division processing is performed only on the changed part of the data, the client computing resources consumed by data division are greatly reduced, the deduplication rate when backing up massive numbers of small files is effectively improved, and more space is saved during data storage. For the technical effects that this embodiment can achieve, refer to the description above; details are not repeated here.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product. The computer software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and includes instructions for enabling a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, or the like) to perform all or part of the steps of the method in the embodiments of the present invention.
For parts that are the same or similar across the embodiments in this specification, the embodiments may be referred to one another. In particular, because the terminal embodiment is essentially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the description of the method embodiment.
In the embodiments provided by the present invention, it should be understood that the disclosed system and method may be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, systems, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for dynamic deduplication of data, comprising the following steps:
extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular sizes based on the characteristics;
calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value;
deleting the data blocks for which a matching data block exists.
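The hash-and-delete steps of claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the backup data has already been split into blocks, uses SHA-256 as the hash function (the claim does not name one), and the names `deduplicate`, `restore`, `store`, and `recipe` are hypothetical.

```python
import hashlib


def deduplicate(blocks, store):
    """Keep only blocks whose hash is not already present in `store`.

    `blocks` is an iterable of bytes objects (output of the splitting step).
    `store` maps a hex digest to the stored block and is updated in place.
    Returns the ordered list of digests needed to rebuild the original data.
    """
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()  # SHA-256 is an assumption
        if digest not in store:
            store[digest] = block  # first occurrence: keep the data
        # Otherwise a matching block exists and the duplicate is discarded;
        # only its digest is recorded.
        recipe.append(digest)
    return recipe


def restore(recipe, store):
    """Reassemble the original byte stream from the recorded digests."""
    return b"".join(store[digest] for digest in recipe)
```

Recording the digest of every block, including discarded duplicates, is what allows the backup to be restored after the duplicate blocks have been deleted.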
2. The method of claim 1, wherein extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into data blocks of irregular sizes based on the characteristics comprises:
using the F-score function:
[Formula: Figure FDA0003350360180000011]
calculating the F-scores of K features of the backup data, wherein:
[Figure FDA0003350360180000012] is the average of the i-th feature over the entire data set,
[Figure FDA0003350360180000013] is the average of the i-th feature over the positive-class data set,
[Figure FDA0003350360180000014] is the average of the i-th feature over the negative-class data set,
[Figure FDA0003350360180000015] is the characteristic value of the i-th feature of the k-th positive-class sample point, and
[Figure FDA0003350360180000016] is the characteristic value of the i-th feature of the k-th negative-class sample point;
sorting the F-scores of the characteristic values in descending order, and selecting the top-ranked F-scores, the number selected being equal to the preset number of characteristic values;
and taking the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
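The formula referenced in this claim survives only as a figure reference. The symbol definitions given match the F-score commonly used for feature selection, so, as an assumption about the missing figure, the function is presumably of the following form. Here $n_+$ and $n_-$ denote the numbers of positive-class and negative-class samples; these two symbols are not defined in the claim text and are introduced only for this sketch.

```latex
F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}
            {\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2
             + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}
```

Under this reading, the numerator measures how far the two class means lie from the overall mean and the denominator is the sum of the within-class sample variances, so a larger $F(i)$ indicates a more discriminative feature.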
3. The method of claim 2, wherein extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into data blocks of irregular sizes based on the characteristics comprises:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into the segmentation length calculation function
[Formula: Figure FDA0003350360180000021]
to obtain a segmentation length $S_n$, wherein $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and $n$ is the number of characteristic values;
switching the target characteristic value until all characteristic values have been traversed, to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the segmentation lengths corresponding to the characteristic values.
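The final dividing step can be illustrated with a short sketch that cuts a byte string into variable-size blocks from a list of segmentation lengths. Several points are assumptions, because the claim text does not fix them: the length function `segment_length` is written as a weighted sum of $a_i X_i$ (the actual function is in a figure that is not reproduced), blocks are cut consecutively from the start of the data, and the list of lengths is reused cyclically. All function names are hypothetical.

```python
def segment_length(coeffs, xs, min_len=1):
    """Hypothetical length function: weighted sum of a_i * X_i, at least min_len."""
    total = sum(a * x for a, x in zip(coeffs, xs))
    return max(min_len, int(round(total)))


def split_variable(data, lengths):
    """Cut `data` into consecutive blocks whose sizes follow `lengths`.

    The lengths are reused cyclically until the data is exhausted, so the
    final block may be shorter than the requested length.
    """
    if not lengths or any(n <= 0 for n in lengths):
        raise ValueError("lengths must be a non-empty list of positive integers")
    blocks, pos, i = [], 0, 0
    while pos < len(data):
        n = lengths[i % len(lengths)]
        blocks.append(data[pos:pos + n])
        pos += n
        i += 1
    return blocks
```

Whatever rule produces the lengths, the blocks must cover the data exactly once so that the original can be rebuilt by concatenation.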
4. The method of claim 1, wherein calculating the hash value of each data block and searching the stored data for a matching data block of the data block based on the hash value comprises:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
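The metadata reuse described in claim 4 can be sketched as an index lookup. This is an illustration under stated assumptions: the metadata fields (`digest`, `offset`, `length`), the `BlockMeta` record, the function `lookup_or_register`, and the choice of SHA-256 are all hypothetical and are not taken from the patent.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical metadata record for a stored block."""
    digest: str
    offset: int  # position of the block in the backing store
    length: int


def lookup_or_register(block, index, next_offset):
    """Return (metadata, is_duplicate) for `block`.

    `index` maps digest -> BlockMeta for blocks already stored. On a hit the
    stored block's metadata is reused for the new block; on a miss a new
    record is created at `next_offset` and added to the index.
    """
    digest = hashlib.sha256(block).hexdigest()
    meta = index.get(digest)
    if meta is not None:
        return meta, True
    meta = BlockMeta(digest, next_offset, len(block))
    index[digest] = meta
    return meta, False
```

Because a duplicate block receives the metadata of its match, both references point at the same stored bytes and the duplicate itself never needs to be written.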
5. A system for dynamic deduplication of data, comprising:
a data segmentation unit, configured to extract the characteristics of the backup data by using an F-score dimensionality reduction method and to divide the backup data into data blocks of irregular sizes based on the characteristics;
a matching search unit, configured to calculate the hash value of each data block and to search the stored data for a matching data block of the data block based on the hash value;
and a deduplication unit, configured to delete the data blocks for which a matching data block exists.
6. The system of claim 5, wherein the data segmentation unit is configured to:
use the F-score function:
[Formula: Figure FDA0003350360180000022]
to calculate the F-scores of K features of the backup data, wherein:
[Figure FDA0003350360180000023] is the average of the i-th feature over the entire data set,
[Figure FDA0003350360180000031] is the average of the i-th feature over the positive-class data set,
[Figure FDA0003350360180000032] is the average of the i-th feature over the negative-class data set,
[Figure FDA0003350360180000033] is the characteristic value of the i-th feature of the k-th positive-class sample point, and
[Figure FDA0003350360180000034] is the characteristic value of the i-th feature of the k-th negative-class sample point;
sort the F-scores of the characteristic values in descending order, and select the top-ranked F-scores, the number selected being equal to the preset number of characteristic values;
and take the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
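The score-sort-select procedure performed by the data segmentation unit can be sketched in a few lines. The sketch assumes the widely used F-score definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances), since the claimed formula is only available as a figure reference. How samples are labeled positive or negative is not stated in the claim, so the function simply takes two labeled sample lists. The names `f_score` and `select_top_features` are hypothetical.

```python
def f_score(pos, neg):
    """F-score of one feature, given its values in the positive and negative class."""
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Between-class term: distance of each class mean from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Within-class term: sum of the two sample variances (needs >= 2 samples each).
    within = (sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
              + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1))
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def select_top_features(pos_samples, neg_samples, count):
    """Return (indices of the `count` highest-scoring features, all scores).

    `pos_samples` and `neg_samples` are lists of equal-length feature vectors.
    """
    num_features = len(pos_samples[0])
    scores = [
        f_score([s[i] for s in pos_samples], [s[i] for s in neg_samples])
        for i in range(num_features)
    ]
    ranked = sorted(range(num_features), key=lambda i: scores[i], reverse=True)
    return ranked[:count], scores
```

Sorting feature indices rather than the scores themselves keeps the link between each selected F-score and the characteristic value it belongs to, which the last step of the claim relies on.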
7. The system of claim 6, wherein the data segmentation unit is configured to:
randomly select a target characteristic value from all characteristic values of the backup data, and take the target characteristic value as a segmentation starting point;
substitute the target characteristic value into the segmentation length calculation function
[Formula: Figure FDA0003350360180000035]
to obtain a segmentation length $S_n$, wherein $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and $n$ is the number of characteristic values;
switch the target characteristic value until all characteristic values have been traversed, to obtain the segmentation length corresponding to each characteristic value;
and divide the backup data into a plurality of data blocks according to the characteristic values of the backup data and the segmentation lengths corresponding to the characteristic values.
8. The system of claim 5, wherein the matching search unit is configured to:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
9. A terminal, comprising:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method of any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN202111335541.6A 2021-11-11 2021-11-11 Data dynamic deduplication method, system, terminal and storage medium Active CN114138552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111335541.6A CN114138552B (en) 2021-11-11 2021-11-11 Data dynamic deduplication method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN114138552A true CN114138552A (en) 2022-03-04
CN114138552B CN114138552B (en) 2024-01-12

Family

ID=80393713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111335541.6A Active CN114138552B (en) 2021-11-11 2021-11-11 Data dynamic deduplication method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114138552B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323958A (en) * 2011-10-27 2012-01-18 上海文广互动电视有限公司 Data de-duplication method
CN109299727A (en) * 2018-08-04 2019-02-01 辽宁大学 Fault diagnosis method using an improved extreme learning machine with signal reconstruction
CN109948740A (en) * 2019-04-26 2019-06-28 中南大学湘雅医院 Classification method based on resting-state brain images

Also Published As

Publication number Publication date
CN114138552B (en) 2024-01-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant