CN114138552B - Data dynamic deduplication method, system, terminal and storage medium

Info

Publication number: CN114138552B
Application number: CN202111335541.6A
Authority: CN
Inventors: 朱箫鸣; 冀国威
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2024-01-12
Anticipated expiration: 2041-11-11
Also published as: CN114138552A

Abstract

The invention provides a data dynamic deduplication method, system, terminal and storage medium, comprising the following steps: extracting features of the backup data by using an F-score dimension reduction method, and dividing the backup data into irregularly sized data blocks based on the features; calculating a hash value of each data block, and searching the stored data for a matching data block based on the hash value; and deleting the data blocks for which a matching data block exists. The invention solves the problems of fixed block segmentation, greatly reduces the client computing resources occupied by data segmentation, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more space during data storage.

Description

Data dynamic deduplication method, system, terminal and storage medium

Technical Field

The invention relates to the technical field of data storage, in particular to a data dynamic deduplication method, a system, a terminal and a storage medium.

Background

With the development of science and technology, data has begun to grow exponentially, and data security has become a focus of attention for governments and enterprises. In backup protection, however, a large amount of redundant data always occupies storage space, so attention has turned to "data deduplication" technology in the hope of saving a large amount of storage space. In backup and disaster recovery products, data deduplication has therefore become one of the criteria for assessing a product's technical content, running performance, and quality.

To implement data deduplication, manufacturers generally first partition the data, that is, they divide the backup data into non-overlapping fixed-length data blocks. Common block sizes are 4K, 8K, 16K, 32K, 128K, and so on, and different manufacturers select different fixed block sizes. A hash algorithm then builds fingerprint information for each data block, and the system determines whether a data block duplicates existing metadata by calculating and checking the block's fingerprint: if so, only a pointer to that metadata needs to be kept; if the fingerprint shows that the data block is brand new, the data block is retained, and the relevant information is extracted and saved as metadata for subsequent verification and comparison.
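The fixed-length scheme described above can be sketched as follows. This is a minimal illustration, not any manufacturer's implementation: the function names, the use of SHA-256 as the fingerprint, and the in-memory dictionary standing in for the metadata store are all assumptions made for the example.

```python
import hashlib

def fixed_chunks(data: bytes, size: int):
    """Split data into non-overlapping fixed-length blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_store(data: bytes, size: int, store: dict):
    """Return the list of fingerprints for data, adding only new blocks to store."""
    recipe = []
    for block in fixed_chunks(data, size):
        fp = hashlib.sha256(block).hexdigest()  # fingerprint of the block
        if fp not in store:
            store[fp] = block                   # brand-new block: keep it
        recipe.append(fp)                       # duplicate or not, record a pointer
    return recipe

# Hypothetical example: the 4-byte block "abcd" occurs three times.
data = b"abcdabcdabcdwxyz"
store = {}
recipe = dedup_store(data, 4, store)
print(len(recipe), len(store))  # 4 pointers, 2 stored blocks
```

The list of fingerprints is enough to rebuild the original data from the store, which is why duplicate blocks need not be stored again.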

Throughout this process, the size of the data blocks is clearly a critical issue, because it affects both the running performance and the deduplication rate: with large blocks, deduplication runs faster, but the deduplication rate is low and accuracy decreases; with small blocks, deduplication runs slower, but the deduplication rate is high and accuracy improves. Moreover, in a massive-small-file scenario, data changes are complex. When data is inserted into or deleted from a source data object, fixed-length segmentation forces the data to be segmented again, which increases the amount of computation and lowers the deduplication rate.
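The effect of an insertion on fixed-length blocks can be seen in a small experiment. The sketch below is illustrative only; the helper names and the 8-byte block size are assumptions. It compares how many block fingerprints survive an in-place overwrite versus a one-byte insertion.

```python
import hashlib

def block_fingerprints(data: bytes, size: int) -> set:
    """Fingerprints of all fixed-length blocks of data."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

def shared_blocks(old: bytes, new: bytes, size: int) -> int:
    """Number of block fingerprints that old and new have in common."""
    return len(block_fingerprints(old, size) & block_fingerprints(new, size))

original = bytes(range(64))            # 8 distinct blocks of 8 bytes
overwritten = b"\xff" + original[1:]   # first byte overwritten, length unchanged
inserted = b"\xff" + original          # one byte inserted at the front

print(shared_blocks(original, overwritten, 8))  # 7: only the first block changed
print(shared_blocks(original, inserted, 8))     # 0: every block boundary shifted
```

An overwrite leaves seven of the eight blocks reusable, whereas a single inserted byte shifts every boundary so that no old fingerprint matches.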

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a data dynamic deduplication method, a system, a terminal and a storage medium, so as to solve the technical problems.

In a first aspect, the present invention provides a data dynamic deduplication method, including:

extracting the characteristics of the backup data by using an F-score dimension reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;

calculating hash values of all the data blocks, and searching matched data blocks of the data blocks from stored data based on the hash values;

and deleting the data blocks for which matching data blocks exist.

Further, extracting features of the backup data by using an F-score dimension reduction method, and dividing the backup data into irregularly sized data blocks based on the features, comprises:

using the F score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F scores for the features of the backup data, where $\bar{x}_i$ is the mean of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the mean of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the mean of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive sample point, $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative sample point, and $n_+$ and $n_-$ are the numbers of positive and negative class points;

sorting the F scores in descending order, and selecting the top-ranked F scores up to the preset number of feature values;

taking the feature values corresponding to the selected F scores as the feature values of the backup data;

randomly selecting a target feature value from all feature values of the backup data, and taking the target feature value as a segmentation starting point;

substituting the target feature value into the segmentation length calculation function to obtain the segmentation length S_n, where X_i is a random variable sequence, a_i is a coefficient determined by the data structure of the backup data, and n is the number of feature values;

switching the target feature value until all feature values have been traversed, obtaining the segmentation length corresponding to each feature value;

and dividing the backup data into a plurality of data blocks according to the feature values of the backup data and the segmentation length corresponding to each feature value.

Further, calculating a hash value of each data block, and searching for a matching block of the data block from the stored data based on the hash value, includes:

calculating a hash value of a data block, and retrieving a matching data block with the same hash value as the data block from stored data;

and acquiring metadata of the matched data block, and taking the metadata as metadata of the data block.

In a second aspect, the present invention provides a data dynamic deduplication system, including:

the data segmentation unit is used for extracting the characteristics of the backup data by using an F-score dimension reduction method and dividing the backup data into data blocks with irregular sizes based on the characteristics;

a matching search unit for calculating hash values of the respective data blocks and searching for matching data blocks of the data blocks from the stored data based on the hash values;

and the deduplication unit is used for deleting the data blocks for which matching data blocks exist.

Further, the data dividing unit is configured to:

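using the F score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F scores for the features of the backup data, where $\bar{x}_i$ is the mean of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the mean of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the mean of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive sample point, $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative sample point, and $n_+$ and $n_-$ are the numbers of positive and negative class points;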

Further, the data dividing unit is configured to:

Further, the matching search unit is configured to:

In a third aspect, a terminal is provided, including:

a processor, a memory, wherein,

the memory is used for storing a computer program,

the processor is configured to call and run the computer program from the memory, so that the terminal performs the method described above.

In a fourth aspect, there is provided a computer storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the above aspects.

The data dynamic deduplication method, system, terminal and storage medium have the following advantages. The F-score dimension reduction method is used to extract features of the backup data, and the backup data is divided into irregularly sized data blocks based on those features, which realizes dynamic segmentation of the backup data and solves the problems of fixed block segmentation. Because only the changed part of the data is segmented, the client computing resources occupied by data segmentation are greatly reduced, the deduplication rate when backing up massive numbers of small files is effectively improved, and more space is saved during data storage.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.

FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

The prior art includes a sliding-window blocking scheme. For the first backup, this scheme is the same as fixed-length deduplication: a fixed length is selected, the whole data string is divided into blocks of that length, and the hash value of each block is calculated. The selected fixed length is the window length, and during the second backup the window is slid in an attempt to find and match identical data. Take a data modification as an example, in which the second slice changes from deab to ddab. The hash value of ddab is calculated first; because the data in the slice has changed, no fingerprint matches. The next data slice is not processed immediately. Instead, the window is moved forward by one unit, the hash value of the data under the new window position is calculated, and a match is attempted, and so on until a matching hash value is found. When data is merely overwritten, the effect is the same as fixed-length deduplication, and the same deduplication rate is obtained. This method still needs to divide the data into fixed-length data blocks at the outset and is not suitable for segmenting small files. In addition, because the window is slid by a designated step and deduplication is attempted again each time, the hash value of the data block must be calculated many times, so the amount of computation is large and deduplication efficiency is low.
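The sliding-window behaviour described above can be sketched as follows. This is an illustrative reading of the prior-art scheme, not code from the patent: the function names, SHA-256 fingerprints, a one-byte step, and the sample strings are assumptions. The sketch only reports where matches are found and how many hashes were computed; it does not store the unmatched bytes.

```python
import hashlib

def fp(block: bytes) -> str:
    """Fingerprint of a block."""
    return hashlib.sha256(block).hexdigest()

def sliding_window_match(data: bytes, window: int, known: set):
    """Return (offsets of matched windows, number of hash computations)."""
    pos, matched, hashes = 0, [], 0
    while pos + window <= len(data):
        hashes += 1
        if fp(data[pos:pos + window]) in known:
            matched.append(pos)
            pos += window   # matched: jump past the whole block
        else:
            pos += 1        # no match: slide the window forward one unit
    return matched, hashes

first = b"abcdefghijkl"                       # first backup: abcd | efgh | ijkl
known = {fp(first[i:i + 4]) for i in range(0, len(first), 4)}

overwrite = b"abcdeXghijkl"                   # one byte of the second block changed
insertion = b"abcdXefghijkl"                  # one byte inserted before the second block

print(sliding_window_match(overwrite, 4, known))  # ([0, 8], 6)
print(sliding_window_match(insertion, 4, known))  # ([0, 5, 9], 4)
```

In the overwrite case the window slides across the damaged block and hashes six windows to recover two matches, which illustrates the repeated hash computation that the text identifies as the cost of this scheme.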

In actual production, different types of business systems produce data of different sizes; for example, structured data is generally at the KB level and unstructured data at the MB level, so fixed-length data block segmentation can lead to a low deduplication rate. To solve this problem, the invention discloses a data deduplication method based on variable-length data block segmentation. The method determines the data segmentation boundaries mainly through a continuously sliding window, and dynamically divides the backup data into data blocks of different sizes according to the characteristic function of those boundaries, thereby improving the deduplication rate.

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention. The execution body of fig. 1 may be a data dynamic deduplication system.

As shown in fig. 1, the method includes:

step 110, extracting the characteristics of the backup data by using an F-score dimension reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;

step 120, calculating hash values of the data blocks, and searching matching data blocks of the data blocks from the stored data based on the hash values;

and step 130, deleting the data blocks for which matching data blocks exist.

To facilitate understanding of the present invention, the data dynamic deduplication method is further described below in terms of its principle and the dynamic deduplication process of the embodiment.

Specifically, the data dynamic deduplication method comprises the following steps:

S1, extracting features of the backup data by using an F-score dimension reduction method, and dividing the backup data into irregularly sized data blocks based on the features.

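For a given data set of backup data, assume that the number of positive class points and the number of negative class points are $n_+$ and $n_-$, respectively:

using the F score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F scores for the features of the backup data, where $\bar{x}_i$ is the mean of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the mean of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the mean of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive sample point, and $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative sample point.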

The numerator on the right side of the formula roughly reflects the degree of difference between the positive class points and the negative class points on the ith feature, while the left and right terms of the denominator reflect the dispersion of the positive class points and the negative class points on the ith feature, respectively. The larger the value of F(i), the better the ith feature distinguishes the two classes of points, so the F-score can be used as a criterion for selecting features.

Assume that the number of features to be selected, set in advance, is d. Use the above formula to calculate n F scores: F(1), F(2), …, F(n). Sort the n F scores in descending order and select the top d of them. Extract the subscripts k_i corresponding to the selected F scores; the features corresponding to these subscripts are the feature values of the backup data.
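The scoring and ranking step can be sketched in a few lines. The sketch assumes the widely used F-score definition (squared distances of the class means from the overall mean, divided by the sum of the two sample variances), which matches the symbol definitions given in the text. How backup data is mapped to feature vectors and labelled positive or negative is not specified in the source, so the sample vectors and function names below are purely hypothetical.

```python
from statistics import mean

def f_score(pos, neg, i):
    """F-score of feature i, given positive and negative samples (lists of feature vectors)."""
    xp = [s[i] for s in pos]
    xn = [s[i] for s in neg]
    m_all, m_pos, m_neg = mean(xp + xn), mean(xp), mean(xn)
    numerator = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    var_pos = sum((x - m_pos) ** 2 for x in xp) / (len(xp) - 1)
    var_neg = sum((x - m_neg) ** 2 for x in xn) / (len(xn) - 1)
    return numerator / (var_pos + var_neg)  # undefined if both variances are zero

def top_features(pos, neg, d):
    """Indices of the d features with the largest F-scores, plus all scores."""
    n = len(pos[0])
    scores = [f_score(pos, neg, i) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:d], scores

# Hypothetical samples with two features: feature 0 separates the classes, feature 1 does not.
pos = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
neg = [[7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
selected, scores = top_features(pos, neg, d=1)
print(selected, scores)  # [0] [9.0, 0.0]
```

Feature 0 has class means of 2 and 8 around an overall mean of 5 with unit variance in each class, giving F(0) = 18 / 2 = 9, while feature 1 has identical class means and scores 0, so only feature 0 is kept.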

The target feature value is substituted into the segmentation length calculation function to obtain the segmentation length S_n, where X_i is a random variable sequence representing the unknowns and n is the number of feature values; a_i is a coefficient determined by the data structure of the backup data: if the backup data is structured data, a_i is a constant, and if the backup data is unstructured data, a_i is a list of coefficient arrays.

Switching the target characteristic values until all the characteristic values are traversed, and obtaining the corresponding segmentation length of each characteristic value; and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the corresponding dividing lengths of the characteristic values.
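Once a segmentation length has been obtained for each feature value, the final split is mechanical. The sketch below assumes the lengths have already been computed by some means; it does not implement the segmentation length calculation function, and the function name and sample lengths are hypothetical.

```python
def split_by_lengths(data: bytes, lengths):
    """Cut data into consecutive blocks of the given lengths; any remainder becomes a final block."""
    blocks, pos = [], 0
    for n in lengths:
        if n <= 0:
            raise ValueError("segmentation lengths must be positive")
        if pos >= len(data):
            break                       # data exhausted before all lengths were used
        blocks.append(data[pos:pos + n])
        pos += n
    if pos < len(data):
        blocks.append(data[pos:])       # leftover bytes form the last block
    return blocks

# Hypothetical lengths 3, 2 and 4 applied to ten bytes of data.
blocks = split_by_lengths(b"abcdefghij", [3, 2, 4])
print(blocks)  # [b'abc', b'de', b'fghi', b'j']
```

Concatenating the blocks always reproduces the input, so the split itself loses no data regardless of how the lengths were chosen.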

In this way the data is dynamically divided into irregular data blocks. In a massive-small-file scenario, the data generally consists of files of different sizes between 1 KB and 4 KB. Based on the principle of variable-length data blocks, files are divided during backup into data blocks of different sizes according to file size. For example, if the original file is 1 KB, the whole 1 KB is treated as one data block and hashed to obtain its fingerprint information; if the original file is 3 KB, it is divided into two data blocks, such as a 2 KB block and the remainder, and the hash algorithm is applied to obtain two corresponding pieces of fingerprint information.

S2, calculating hash values of the data blocks, and searching matched data blocks of the data blocks from stored data based on the hash values.

Calculating a hash value of the data block, and retrieving a matching data block having the same hash value as the data block from the stored data; metadata of the matching data block is acquired, and the acquired metadata is used as metadata of the data block.

S3, deleting the data blocks for which matching data blocks exist.

In step S2, the metadata of the matching data block has already been bound to the data block as its metadata, so deleting the data block saves storage resources, and the content of the data block can still be read out according to the metadata when the data block is looked up.
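Steps S2 and S3 can be sketched as a small block store in which a duplicate block reuses the metadata of its match instead of being written again. The class and field names, the SHA-256 fingerprint, and the in-memory byte buffer are assumptions made for illustration; the source does not specify the metadata layout.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockMeta:
    fingerprint: str   # hash value of the block
    offset: int        # where the block's bytes live in the store
    length: int

class BlockStore:
    def __init__(self):
        self._blob = bytearray()   # stands in for the stored data
        self._index = {}           # fingerprint -> BlockMeta

    def put(self, block: bytes) -> BlockMeta:
        """Store a block, or return the matching block's metadata if it already exists."""
        key = hashlib.sha256(block).hexdigest()
        meta = self._index.get(key)
        if meta is None:                       # no match: keep the block
            meta = BlockMeta(key, len(self._blob), len(block))
            self._blob += block
            self._index[key] = meta
        return meta                            # match: the duplicate is dropped

    def get(self, meta: BlockMeta) -> bytes:
        """Read a block's content back through its metadata."""
        return bytes(self._blob[meta.offset:meta.offset + meta.length])

    @property
    def stored_bytes(self) -> int:
        return len(self._blob)

store = BlockStore()
m1 = store.put(b"hello")
m2 = store.put(b"world")
m3 = store.put(b"hello")   # duplicate of the first block
print(m3 == m1, store.stored_bytes, store.get(m3))  # True 10 b'hello'
```

The third block occupies no additional space, yet its content remains readable because its metadata points at the bytes stored for the first block.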

The embodiment effectively solves the problems in fixed block segmentation, and when data is inserted into or deleted from a data object, if the changed content is not within the boundary of the data block, the data block is not changed; the boundaries between data blocks are random, dynamic, and portions of the content of the data blocks may be repetitive. Therefore, the insertion or deletion of the content only affects one or two adjacent data blocks, and the rest of the data blocks are not affected, so that the data de-duplication is more accurate.
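The locality property claimed here can be illustrated with a generic content-defined chunker. Note that this sketch swaps in a toy rolling checksum over a small window to choose boundaries; it is not the F-score based segmentation of the embodiment, and the window size, divisor, and sample data are assumptions. It only shows why boundaries that depend on content, rather than on position, keep an insertion from disturbing distant blocks.

```python
import random

def cdc_chunks(data: bytes, window: int = 4, divisor: int = 16):
    """Cut after position j whenever the sum of the last `window` bytes hits a target value."""
    chunks, start, rolling = [], 0, 0
    for j, byte in enumerate(data):
        rolling += byte
        if j >= window:
            rolling -= data[j - window]      # drop the byte that left the window
        if j >= window - 1 and rolling % divisor == divisor - 1:
            chunks.append(data[start:j + 1])
            start = j + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing bytes form the last chunk
    return chunks

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(2000))
modified = original[:1000] + b"\x00" + original[1000:]   # one byte inserted mid-stream

before, after = cdc_chunks(original), cdc_chunks(modified)
after_set = set(after)
changed = [c for c in before if c not in after_set]
print(len(before), len(changed))
```

Every chunk that ends before the insertion point, or starts at least one window length after it, reappears unchanged in the modified stream, because each boundary decision looks only at the bytes inside its own window.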

For unstructured data, and especially for backups of massive numbers of small files, the dynamic segmentation technique is adopted. Because data changes in a large number of small files are complex, fixed block segmentation often forces the entire backup data to be re-blocked. Variable-length block segmentation solves this problem: only the changed part of the data is segmented, which greatly reduces the client computing resources occupied by data segmentation and ensures the best deduplication effect for massive numbers of small files.

As shown in fig. 2, the system 200 includes:

a data dividing unit 210, configured to extract features of the backup data by using an F-score dimension reduction method, and divide the backup data into data blocks with irregular sizes based on the features;

a matching search unit 220 for calculating hash values of the respective data blocks and searching for matching data blocks of the data blocks from the stored data based on the hash values;

and a de-duplication unit 230 for deleting the data blocks where the matching data blocks exist.

Optionally, as an embodiment of the present invention, the data dividing unit is configured to:

using the F score function:

Optionally, as an embodiment of the present invention, the matching search unit is configured to:

Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, where the terminal 300 may be used to execute the data dynamic deduplication method according to the embodiment of the present invention.

The terminal 300 may include: a processor 310, a memory 320 and a communication unit 330. The components may communicate via one or more buses, and it will be appreciated by those skilled in the art that the configuration of the server as shown in the drawings is not limiting of the invention, as it may be a bus-like structure, a star-like structure, or include more or fewer components than shown, or may be a combination of certain components or a different arrangement of components.

The memory 320 may be used to store instructions for execution by the processor 310, and may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. When executed by the processor 310, the instructions in the memory 320 enable the terminal 300 to perform some or all of the steps in the method embodiments described above.

The processor 310 is a control center of the storage terminal, connects various parts of the entire electronic terminal using various interfaces and lines, and performs various functions of the electronic terminal and/or processes data by running or executing software programs and/or modules stored in the memory 320, and invoking data stored in the memory. The processor may be comprised of an integrated circuit (Integrated Circuit, simply referred to as an IC), for example, a single packaged IC, or may be comprised of a plurality of packaged ICs connected to the same function or different functions. For example, the processor 310 may include only a central processing unit (Central Processing Unit, simply CPU). In the embodiment of the invention, the CPU can be a single operation core or can comprise multiple operation cores.

The communication unit 330 is used to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.

The present invention also provides a computer storage medium in which a program may be stored; when executed, the program may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Therefore, the invention extracts features of the backup data by using the F-score dimension reduction method and divides the backup data into irregularly sized data blocks based on those features, which realizes dynamic segmentation of the backup data and solves the problems of fixed block segmentation. Because only the changed part of the data is segmented, the client computing resources occupied by data segmentation are greatly reduced, the deduplication rate when backing up massive numbers of small files is effectively improved, and more space is saved during data storage.

It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution in the embodiments of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or another medium capable of storing program code, and includes several instructions for causing a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to execute all or part of the steps of the method described in the embodiments of the present invention.

The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the terminal embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description in the method embodiment for relevant points.

In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Those skilled in the art may make various equivalent modifications and substitutions to the embodiments of the present invention without departing from its spirit and scope, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data dynamic deduplication method, characterized by comprising the following steps:

deleting the data blocks for which matching data blocks exist;

extracting features of the backup data by using an F-score dimension reduction method, dividing the backup data into data blocks with irregular sizes based on the features, and comprising the following steps:

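using the F score function:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

calculating F scores for the features of the backup data, where $\bar{x}_i$ is the mean of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the mean of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the mean of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive sample point, and $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative sample point; $n_+$ is the number of positive class points, $n_-$ is the number of negative class points, and n is the number of F scores;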

taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data;

2. The method of claim 1, wherein calculating a hash value for each data block and searching for a matching block of data blocks from stored data based on the hash value comprises:

3. A data dynamic deduplication system, comprising:

a deduplication unit configured to delete a data block in which a matching data block exists;

the data dividing unit is used for:

using the F score function:

the data dividing unit is used for:

4. A system according to claim 3, wherein the match finding unit is configured to:

5. A terminal, comprising:

a processor;

a memory for storing execution instructions of the processor;

wherein the processor is configured to perform the method of any of claims 1-2.

6. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-2.