
CN114138552A - Data dynamic deduplication method, system, terminal and storage medium

Info

Publication number: CN114138552A
Application number: CN202111335541.6A
Authority: CN
Inventors: 朱箫鸣; 冀国威
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2022-03-04
Anticipated expiration: 2041-11-11
Also published as: CN114138552B

Abstract

The invention provides a dynamic data deduplication method, system, terminal and storage medium. The method comprises: extracting features of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular sizes based on the features; calculating the hash value of each data block, and searching the stored data for a matching data block based on the hash value; and deleting each data block for which a matching data block exists. The method addresses the problems of fixed-length block division, greatly reduces the client computing resources consumed by data division, can effectively improve the deduplication rate when backing up massive numbers of small files, and saves storage space.

Description

Data dynamic deduplication method, system, terminal and storage medium

Technical Field

The invention relates to the technical field of data storage, and in particular to a dynamic data deduplication method, system, terminal and storage medium.

Background

With the development of science and technology, data has begun to grow exponentially, and data security has become a focus of attention for governments and enterprises. In backup protection, however, a large amount of redundant data occupies storage space. Data deduplication has therefore become one of the criteria for evaluating backup and disaster recovery products in terms of technical sophistication, operating performance and product quality.

To implement data deduplication, vendors generally first divide the backup data into blocks, that is, into non-overlapping fixed-length data blocks. Commonly used block sizes are 4K/8K/16K/32K/128K, and different vendors choose different sizes. A hash algorithm is then used to build fingerprint information for each data block, and the system determines whether a data block duplicates existing "metadata" by calculating and checking the block's "fingerprint": if so, only a pointer to the "metadata" needs to be retained; if the fingerprint shows that the data block is brand new, the data block is kept, and the relevant information is extracted and stored as metadata for subsequent verification and comparison.
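The fixed-length scheme described above can be sketched as follows. This is a minimal illustration rather than any vendor's implementation: the function names `fixed_chunks` and `dedup_fixed`, the in-memory `store` dictionary, and the use of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib

def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into non-overlapping fixed-length blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_fixed(data: bytes, size: int, store: dict[str, bytes]) -> list[str]:
    """Return the fingerprints (pointers) that describe data; store only new blocks."""
    recipe = []
    for block in fixed_chunks(data, size):
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = block   # brand-new block: keep its content
        recipe.append(fingerprint)       # repeated block: keep only the pointer
    return recipe

data = b"abcdabcdabcdefgh"
store: dict[str, bytes] = {}
recipe = dedup_fixed(data, 4, store)
restored = b"".join(store[fp] for fp in recipe)
```

Here four blocks are described by four pointers, but only the two distinct blocks are stored, and the original data can be rebuilt from the pointers.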

Throughout this process, the size of the data block is a crucial issue, since it affects both the operating performance and the deduplication rate: with large blocks, deduplication runs fast, but the deduplication rate is low and accuracy drops; with small blocks, deduplication runs slowly, but the deduplication rate is high and accuracy improves. Moreover, in scenarios with many small files, data changes are complex. When data is inserted into or deleted from a source data object, fixed-length segmentation shifts the block boundaries, so the data must be segmented again, which increases the amount of computation while lowering the deduplication rate.
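The effect of an insertion on fixed-length blocks can be seen in a small experiment. The sketch below assumes SHA-256 fingerprints and a 4-byte block size purely for illustration; the helper name `fingerprints` is hypothetical.

```python
import hashlib

def fingerprints(data: bytes, size: int) -> list[str]:
    """Fingerprint every fixed-length block of data."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = b"aaaabbbbccccdddd"
modified = b"X" + original           # a single byte inserted at the front

before = set(fingerprints(original, 4))
after = fingerprints(modified, 4)

# Count how many blocks of the modified data match a stored fingerprint.
reused = sum(fp in before for fp in after)
```

Because the inserted byte shifts every later boundary, none of the five blocks of the modified data matches a stored fingerprint, even though sixteen of its seventeen bytes are unchanged.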

Disclosure of Invention

In view of the above deficiencies of the prior art, the present invention provides a dynamic data deduplication method, system, terminal and storage medium to solve the above technical problems.

In a first aspect, the present invention provides a dynamic data deduplication method, including:

extracting features of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular sizes based on the features;

calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value;

deleting the data blocks for which a matching data block exists.

Further, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics, including:

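using the F-score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F-scores of K features of the backup data, wherein

$\bar{x}_i$ is the average of the ith feature over the entire data set,

$\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set,

$\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set,

$x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive class sample point,

$x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative class sample point,

and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points;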

sorting the F-scores of the features in descending order, and selecting the top-ranked F-scores according to the set number of features;

and taking the features corresponding to the selected F-scores as the characteristic values of the backup data.

randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;

substituting the target characteristic value into a segmentation length calculation function

to obtain a division length $S_n$, wherein $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;

switching the target characteristic values until all the characteristic values are traversed to obtain the segmentation length corresponding to each characteristic value;

and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.

Further, calculating a hash value of each data block, and searching a matching block of the data block from the stored data based on the hash value, includes:

calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;

and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.

In a second aspect, the present invention provides a dynamic data deduplication system, including:

the data partitioning unit is used for extracting features of the backup data by using an F-score dimensionality reduction method and dividing the backup data into data blocks of irregular sizes based on the features;

the matching search unit is used for calculating the hash value of each data block and searching the stored data for a matching data block of the data block based on the hash value;

and the deduplication unit is used for deleting the data blocks for which a matching data block exists.

Further, the data partitioning unit is configured to:

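using the F-score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F-scores of K features of the backup data, wherein

$\bar{x}_i$ is the average of the ith feature over the entire data set,

$\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set,

$\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set,

$x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the values of the ith feature of the kth positive class and kth negative class sample points, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points.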

Further, the data partitioning unit is configured to:

Further, the matching search unit is configured to:

In a third aspect, a terminal is provided, including:

a processor, a memory, wherein,

the memory is used for storing a computer program which,

the processor is used for calling and running the computer program from the memory, so that the terminal executes the method described above.

In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.

The dynamic data deduplication method, system, terminal and storage medium provided by the invention have the following advantages. Features of the backup data are extracted by using an F-score dimensionality reduction method, and the backup data is divided into data blocks of irregular sizes based on the features, which realizes dynamic division of the backup data and solves the problems of fixed-length block division. Division is performed only on the changed part of the data, which greatly reduces the client computing resources consumed by data division, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more storage space.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The prior art also includes a sliding window blocking scheme. The first backup under this scheme is the same as in fixed-length deduplication: the whole string of data is divided into blocks of a fixed length, and a hash value is calculated for each block. The fixed length is the length of the window, and the window is slid during subsequent backups to find and match identical data. Take a data modification as an example, in which the second slice changes from deab to ddab. First the hash value of ddab is calculated; because the data of this slice has changed, no fingerprint matches. At this point the next data slice is not processed immediately. Instead, the window is moved forward by one unit, the hash value of the data under the window (fingerprint 2') is calculated and matching is attempted again, and so on until a matching hash value is found. When data is overwritten in place, the result is the same as with fixed-length deduplication, and the same deduplication rate is obtained. However, this method still has to divide the data into fixed-length blocks at the start and is therefore unsuitable for small files. In addition, sliding the window by the specified step and deduplicating again requires the hash values of data blocks to be calculated many times, so the amount of computation is large and deduplication efficiency is low.
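A minimal sketch of the sliding window scheme, using the deab to ddab modification as input, makes the extra hashing visible. The function names, the SHA-256 fingerprint, the one-byte step, and the surrounding sample bytes are assumptions chosen for illustration.

```python
import hashlib

def fp(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def sliding_window_match(data: bytes, window: int, known: set[str]):
    """Scan data and return (matched_blocks, unmatched_bytes, hash_count)."""
    pos, matched, literal, hashes = 0, [], bytearray(), 0
    while pos + window <= len(data):
        block = data[pos:pos + window]
        hashes += 1
        if fp(block) in known:
            matched.append(block)
            pos += window              # jump past the matched block
        else:
            literal.append(data[pos])  # this byte belongs to no known block
            pos += 1                   # slide the window forward by one unit
    literal.extend(data[pos:])         # tail shorter than one window
    return matched, bytes(literal), hashes

first = b"abcddeabfghi"                # first backup: abcd | deab | fghi
known = {fp(first[i:i + 4]) for i in range(0, len(first), 4)}
second = b"abcdddabfghi"               # second slice changed: deab -> ddab
matched, literal, hashes = sliding_window_match(second, 4, known)
```

In this run the two unchanged blocks are found again, but six window hashes are computed for data that fixed-length blocking would hash three times, which is the overhead the paragraph above refers to.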

In actual production, data sizes differ across types of service systems: structured data is generally at the KB level, while unstructured data is at the MB level. Fixed-length data block segmentation therefore suffers from a low deduplication rate. To solve this problem, the invention discloses a data deduplication method based on variable-length data block segmentation. The method determines data boundaries through a continuously sliding window and dynamically divides the backup data into data blocks of different sizes according to a characteristic function of the backup data, thereby improving the deduplication rate.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be a data dynamic deduplication system.

As shown in fig. 1, the method includes:

step 110, extracting the characteristics of the backup data by using an F score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;

step 120, calculating hash values of the data blocks, and searching matched data blocks of the data blocks from the stored data based on the hash values;

and step 130, deleting the data blocks with the matching data blocks.

To facilitate understanding, the dynamic data deduplication method provided by the invention is described in further detail below, based on its principle and in combination with the deduplication process of the embodiment.

Specifically, the data dynamic deduplication method includes:

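S1, extracting features of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular sizes based on the features.

For a given data set of backup data, assume that the numbers of its positive class points and negative class points are $n_+$ and $n_-$, respectively:

using the F-score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F-scores of K features of the backup data, wherein

$\bar{x}_i$ is the average of the ith feature over the entire data set,

$\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set,

$\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set,

$x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive class sample point,

$x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative class sample point.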

The numerator of the formula roughly reflects the degree of difference between the positive class points and the negative class points on the ith feature, and the left and right terms of the denominator reflect the dispersion of the positive class points and the negative class points on the ith feature, respectively. Therefore, the larger the value of F(i), the better the ith feature distinguishes the two classes of points, and the F-score can be used as a criterion for feature selection.
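As a worked illustration of this ratio, the sketch below computes the F-score of a single feature, assuming the standard F-score definition (between-class separation of the means divided by the sum of the sample variances of the two classes). The function name `f_score` and the sample values are hypothetical, and how backup data is labelled into positive and negative classes is not specified here.

```python
def f_score(pos: list[float], neg: list[float]) -> float:
    """F-score of one feature from its values on positive and negative samples.

    Each class needs at least two samples, and the two variances must not
    both be zero, otherwise the denominator is undefined.
    """
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    mean_all = (sum(pos) + sum(neg)) / (len(pos) + len(neg))
    # Numerator: how far each class mean lies from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sample variance (dispersion) inside each class.
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1)
    return between / (var_pos + var_neg)

separated = f_score([9.0, 10.0, 11.0], [1.0, 2.0, 3.0])    # 16.0
overlapping = f_score([1.0, 5.0, 9.0], [2.0, 5.0, 8.0])    # 0.0
```

The well-separated feature scores 16.0, while the feature whose class means coincide scores 0.0, matching the reading that a larger F-score means better discrimination.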

Assuming that the number of features to be selected is d, n F-scores are calculated by using the formula: F(1), F(2), …, F(n). The n F-scores are sorted in descending order and the top d are selected. The subscripts $k_i$ corresponding to the selected F-scores are taken, and the features with those subscripts are the characteristic values of the backup data.
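The selection step amounts to a sort followed by a cut. A minimal sketch, in which the function name `select_top_features` and the example scores are hypothetical and indices are 0-based rather than 1-based:

```python
def select_top_features(scores: list[float], d: int) -> list[int]:
    """Return the indices of the d highest F-scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:d]

scores = [0.2, 16.0, 0.0, 3.5]       # hypothetical F-scores of four features
chosen = select_top_features(scores, 2)
```

With these scores the features at indices 1 and 3 are kept.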

Substituting the target characteristic value into the segmentation length calculation function

to obtain a division length $S_n$, where $X_i$ is a random variable sequence representing an unknown number and n is the number of characteristic values; $a_i$ is a coefficient determined by the data structure of the backup data: if the backup data is structured data, $a_i$ is a constant; if the backup data is unstructured data, $a_i$ is an array of coefficients.

The target characteristic value is switched until all characteristic values have been traversed, which yields the segmentation length corresponding to each characteristic value; the backup data is then divided into a plurality of data blocks according to the characteristic values of the backup data and their corresponding division lengths.
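Once a division length is available for each characteristic value, the final cut is mechanical. The sketch below only illustrates that last step: the lengths are supplied by hand, the function name `split_by_lengths` is hypothetical, and treating the lengths as consecutive segments starting at offset 0 is an assumption, since the text does not state how characteristic values map to positions in the data.

```python
def split_by_lengths(data: bytes, lengths: list[int]) -> list[bytes]:
    """Cut data into consecutive blocks of the given lengths.

    Any bytes left after the last length form one final block.
    """
    blocks, pos = [], 0
    for length in lengths:
        if length <= 0:
            raise ValueError("segment lengths must be positive")
        if pos >= len(data):
            break
        blocks.append(data[pos:pos + length])
        pos += length
    if pos < len(data):
        blocks.append(data[pos:])
    return blocks

data = b"0123456789"
blocks = split_by_lengths(data, [3, 5, 1])
```

The blocks have irregular sizes (3, 5, 1 and 1 bytes here) and concatenate back to the original data.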

The data is dynamically segmented into irregular data blocks. In a scenario with many small files, the files generally range from 1 to 4 KB in size. During backup, the data is divided into data blocks of different sizes according to file size, following the variable-length block principle. For example, if an original file is 1 KB, the whole 1 KB is treated as one data block and hashed to obtain the corresponding fingerprint information; if the original file is 3 KB, it is divided into two data blocks of at most 2 KB, each of which is hashed to obtain two corresponding pieces of fingerprint information.

S2, calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value.

The hash value of each data block is calculated, and a matching data block with the same hash value is retrieved from the stored data; the metadata of the matching data block is acquired and used as the metadata of the data block.
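One way to picture the lookup and the reuse of metadata is a hash-keyed index. This is a sketch under assumptions: the `BlockMeta` fields, the `BlockIndex` class, the append-only in-memory storage, and SHA-256 are all hypothetical choices, not details given in the text.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockMeta:
    fingerprint: str
    offset: int      # hypothetical: position of the block in the store
    length: int

class BlockIndex:
    def __init__(self) -> None:
        self._by_hash: dict[str, BlockMeta] = {}
        self._storage = bytearray()

    def put(self, block: bytes) -> tuple[BlockMeta, bool]:
        """Return (metadata, is_duplicate); append the block only if it is new."""
        fingerprint = hashlib.sha256(block).hexdigest()
        meta = self._by_hash.get(fingerprint)
        if meta is not None:
            return meta, True            # reuse the matching block's metadata
        meta = BlockMeta(fingerprint, len(self._storage), len(block))
        self._storage += block
        self._by_hash[fingerprint] = meta
        return meta, False

    def read(self, meta: BlockMeta) -> bytes:
        """Read a block's content back through its metadata."""
        return bytes(self._storage[meta.offset:meta.offset + meta.length])

index = BlockIndex()
first_meta, first_dup = index.put(b"hello")
second_meta, second_dup = index.put(b"hello")
```

The second `put` finds the existing fingerprint, returns the same metadata and stores nothing, so the duplicate block can be dropped while its content stays readable through `read`.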

S3, deleting the data blocks for which a matching data block exists.

In step S2, the metadata of the matching data block has been bound to the data block, so deleting the data block saves storage resources, and when the data block is looked up later its content can still be read through the metadata.

This embodiment effectively solves the problems of fixed-length block segmentation. When data is inserted into or deleted from the data object, a data block is unchanged as long as the changed content does not cross its boundary; the boundaries between data blocks are random and dynamic, and parts of the block contents may repeat. An insertion or deletion therefore affects only one or two adjacent data blocks and leaves the remaining blocks untouched, which makes deduplication more accurate.
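The locality property can be demonstrated with any chunker whose boundaries depend on content rather than on byte offsets. The sketch below deliberately swaps in the simplest such rule, cutting after a delimiter byte, as a stand-in for the feature-based boundaries of the embodiment; the function names and the sample data are hypothetical.

```python
import hashlib

def content_chunks(data: bytes, delimiter: bytes = b"\n") -> list[bytes]:
    """Cut after every delimiter byte, so boundaries follow content, not offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        if data[i:i + 1] == delimiter:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

original = b"alpha\nbravo\ncharlie\ndelta\n"
modified = b"alpha\nbraXvo\ncharlie\ndelta\n"   # one byte inserted mid-block

stored = {fingerprint(c) for c in content_chunks(original)}
new_chunks = content_chunks(modified)
changed = [c for c in new_chunks if fingerprint(c) not in stored]
```

Only the block containing the inserted byte is new; the other three blocks keep their fingerprints and are deduplicated against the stored data.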

Dynamic segmentation is adopted for backing up unstructured data, especially massive numbers of small files. Because data changes in massive small files are complex, fixed-length block segmentation often requires the entire backup data to be re-segmented. Variable-length block segmentation solves this problem: only the changed part of the data is re-segmented, which greatly reduces the client computing resources consumed by data segmentation and ensures the best deduplication effect for massive small files.

As shown in fig. 2, the system 200 includes:

a data dividing unit 210, configured to extract features of the backup data by using an F-score dimensionality reduction method, and divide the backup data into data blocks with irregular sizes based on the features;

a matching search unit 220, configured to calculate a hash value of each data block, and search a matching data block of the data block from the stored data based on the hash value;

and a deduplication unit 230 configured to delete the data block where the matching data block exists.

Optionally, as an embodiment of the present invention, the data dividing unit is configured to:

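using the F-score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F-scores of K features of the backup data, wherein

$\bar{x}_i$ is the average of the ith feature over the entire data set,

$\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set,

$\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set,

$x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the values of the ith feature of the kth positive class and kth negative class sample points, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points.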

Optionally, as an embodiment of the present invention, the matching search unit is configured to:

Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, where the terminal 300 may be used to execute the data dynamic deduplication method according to the embodiment of the present invention.

The terminal 300 may include: a processor 310, a memory 320, and a communication unit 330. The components communicate via one or more buses. Those skilled in the art will appreciate that the structure shown in the figure does not limit the invention: it may be a bus structure or a star structure, and may include more or fewer components than shown, combine certain components, or arrange the components differently.

The memory 320 may be used for storing instructions executed by the processor 310, and may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in the memory 320, when executed by the processor 310, enable the terminal 300 to perform some or all of the steps of the method embodiments described above.

The processor 310 is the control center of the terminal. It connects the various parts of the terminal through various interfaces and lines, and performs the terminal's functions and/or processes data by running or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory. The processor may consist of integrated circuits (ICs), for example a single packaged IC or a plurality of connected packaged ICs with the same or different functions. For example, the processor 310 may include only a central processing unit (CPU). In the embodiment of the present invention, the CPU may have a single computing core or multiple computing cores.

The communication unit 330 is configured to establish a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to them.

The present invention also provides a computer storage medium, wherein the computer storage medium may store a program which, when executed, may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM).

Therefore, the invention extracts features of the backup data by using an F-score dimensionality reduction method and divides the backup data into irregular data blocks based on the features, realizing dynamic division of the backup data and solving the problems of fixed-length block division. Division is performed only on the changed part of the data, which greatly reduces the client computing resources consumed by data division, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more storage space. For the technical effects achievable by this embodiment, refer to the description above; details are not repeated here.

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product. The computer software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and includes instructions for enabling a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, or the like) to perform all or part of the steps of the method in the embodiments of the present invention.

The same and similar parts in the various embodiments in this specification may be referred to each other. Especially, for the terminal embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the description in the method embodiment.

In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and these modifications or substitutions fall within the scope of the present invention; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention likewise falls within its scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

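1. A dynamic data deduplication method, characterized by comprising:

extracting features of backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular sizes based on the features;

calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value;

deleting the data blocks for which a matching data block exists.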

2. The method of claim 1, wherein extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into irregular-sized data blocks based on the characteristics comprises:

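using the F-score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F-scores of K features of the backup data, wherein

$\bar{x}_i$ is the average of the ith feature over the entire data set,

$\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set,

$\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set,

$x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the values of the ith feature of the kth positive class and kth negative class sample points, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points.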

3. The method of claim 2, wherein extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into irregular-sized data blocks based on the characteristics comprises:

4. The method of claim 1, wherein computing a hash value for each chunk of data and finding a matching chunk of chunks of data from stored data based on the hash value comprises:

5. A system for dynamic deduplication of data, comprising:

6. The system of claim 5, wherein the data partitioning unit is configured to:

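using the F-score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F-scores of K features of the backup data, wherein

$\bar{x}_i$ is the average of the ith feature over the entire data set,

$\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set,

$\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set,

$x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the values of the ith feature of the kth positive class and kth negative class sample points, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points.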

7. The system of claim 6, wherein the data partitioning unit is configured to:

8. The system of claim 5, wherein the match lookup unit is configured to:

9. A terminal, comprising:

a processor;

a memory for storing instructions for execution by the processor;

wherein the processor is configured to perform the method of any one of claims 1-4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.