Disclosure of Invention
In view of the defects in the prior art, the invention provides a data dynamic deduplication method, system, terminal and storage medium to solve the above technical problems.
In a first aspect, the present invention provides a data dynamic deduplication method, including:
extracting the characteristics of the backup data by using an F-score dimension reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;
calculating hash values of all the data blocks, and searching matched data blocks of the data blocks from stored data based on the hash values;
and deleting the data blocks for which matching data blocks exist.
Further, extracting features of the backup data by using an F-score dimension reduction method, and dividing the backup data into irregularly-sized data blocks based on the features, wherein the method comprises the following steps:
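using the F score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
to calculate F scores for the K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the feature value of the ith feature of the kth positive sample point, $x_{k,i}^{(-)}$ is the feature value of the ith feature of the kth negative sample point, and $n_+$ and $n_-$ are the numbers of positive class and negative class points;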
sorting the F scores of the feature values in descending order, and selecting the top-ranked F scores in a number equal to the preset number of feature values;
and taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data.
Further, extracting features of the backup data by using an F-score dimension reduction method, and dividing the backup data into irregularly-sized data blocks based on the features, wherein the method comprises the following steps:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target feature value into a segmentation length calculation function to obtain the segmentation length $S_n$, where $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of feature values;
switching the target characteristic values until all the characteristic values are traversed, and obtaining the corresponding segmentation length of each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the corresponding dividing lengths of the characteristic values.
Further, calculating a hash value of each data block, and searching for a matching block of the data block from the stored data based on the hash value, includes:
calculating a hash value of a data block, and retrieving a matching data block with the same hash value as the data block from stored data;
and acquiring metadata of the matched data block, and taking the metadata as metadata of the data block.
In a second aspect, the present invention provides a data dynamic deduplication system, including:
the data segmentation unit is used for extracting the characteristics of the backup data by using an F-score dimension reduction method and dividing the backup data into data blocks with irregular sizes based on the characteristics;
a matching search unit for calculating hash values of the respective data blocks and searching for matching data blocks of the data blocks from the stored data based on the hash values;
and a de-duplication unit, configured to delete the data blocks for which matching data blocks exist.
Further, the data dividing unit is configured to:
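using the F score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
to calculate F scores for the K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the feature value of the ith feature of the kth positive sample point, $x_{k,i}^{(-)}$ is the feature value of the ith feature of the kth negative sample point, and $n_+$ and $n_-$ are the numbers of positive class and negative class points;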
sorting the F scores of the feature values in descending order, and selecting the top-ranked F scores in a number equal to the preset number of feature values;
and taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data.
Further, the data dividing unit is configured to:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target feature value into a segmentation length calculation function to obtain the segmentation length $S_n$, where $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of feature values;
switching the target characteristic values until all the characteristic values are traversed, and obtaining the corresponding segmentation length of each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the corresponding dividing lengths of the characteristic values.
Further, the matching search unit is configured to:
calculating a hash value of a data block, and retrieving a matching data block with the same hash value as the data block from stored data;
and acquiring metadata of the matched data block, and taking the metadata as metadata of the data block.
In a third aspect, a terminal is provided, including:
a processor, a memory, wherein,
the memory is used for storing a computer program,
the processor is configured to call and run the computer program from the memory, so that the terminal performs the method described above.
In a fourth aspect, there is provided a computer storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the above aspects.
The data dynamic deduplication method, system, terminal and storage medium provided by the invention have the following beneficial effects. Features of the backup data are extracted with the F-score dimension reduction method, and the backup data is divided into irregularly sized data blocks based on those features, which realizes dynamic segmentation of the backup data and solves the problems of fixed block segmentation. Because only the changed part of the data is segmented, the client's computing resource usage during data segmentation is greatly reduced, the deduplication rate when backing up massive numbers of small files is effectively improved, and more space is saved during data storage.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
The prior art includes a sliding window blocking scheme. For the first backup, this scheme works the same way as fixed-length deduplication: a fixed length is selected, the whole string of data is divided into blocks of that length, and the hash value of each block is calculated. The selected fixed length is the window length, and during a second backup the window is slid to try to find and match identical data. Take a data modification as an example, in which the second slice changes from deab to ddab. The hash value of ddab is calculated first; because the data of the slice has changed, no fingerprint matches. At this point the next data slice is not processed immediately. Instead, the window is moved forward by one unit, the hash value of the data under the new window is calculated and a match is attempted, and so on until a matching hash value is found. When data is overwritten in place, the result is the same as with fixed-length deduplication, and the same deduplication rate is obtained. This method still has to divide the data into fixed-length blocks at the outset, so it is not suitable for segmenting small files. In addition, because the window slides by a specified step before deduplication is attempted again, the hash value of a data block has to be calculated many times, so the amount of computation is large and the deduplication efficiency is low.
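The sliding window behaviour described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation itself: the function names, the 4-byte window, the sample data and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    # SHA-256 is an assumption; the text does not name a hash function.
    return hashlib.sha256(block).hexdigest()

def index_fixed_blocks(data: bytes, window: int) -> set:
    """First backup: fingerprint every fixed-length block."""
    return {fingerprint(data[i:i + window]) for i in range(0, len(data), window)}

def sliding_window_scan(data: bytes, window: int, known: set):
    """Second backup: return (offsets of matched windows, hashes computed)."""
    matches, hashes, pos = [], 0, 0
    while pos + window <= len(data):
        hashes += 1
        if fingerprint(data[pos:pos + window]) in known:
            matches.append(pos)
            pos += window   # matched: jump to the next block
        else:
            pos += 1        # no match: slide the window forward by one unit
    return matches, hashes

first = b"aaaabbbbccccdddd"
second = b"aaaabXbbccccdddd"  # one byte of the second block overwritten
known = index_fixed_blocks(first, 4)
matches, hashes = sliding_window_scan(second, 4, known)
```

On this toy input the scan finds the unchanged blocks at offsets 0, 8 and 12 but computes 7 hashes for 4 blocks of data, which illustrates the repeated hash calculation the paragraph identifies as a drawback.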
In actual production, data sizes differ across types of business systems: structured data is generally at the KB level, while unstructured data is at the MB level, so fixed-length data block segmentation can suffer from a low deduplication rate. To solve this problem, the invention discloses a data deduplication method based on variable-length data block segmentation. The method mainly determines the data segmentation boundaries through a continuously sliding window, and dynamically divides the backup data into data blocks of different sizes according to the characteristic function of those boundaries, thereby improving the deduplication rate.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention. The execution body of fig. 1 may be a data dynamic deduplication system.
As shown in fig. 1, the method includes:
step 110, extracting the characteristics of the backup data by using an F-score dimension reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;
step 120, calculating hash values of the data blocks, and searching matching data blocks of the data blocks from the stored data based on the hash values;
and step 130, deleting the data blocks for which matching data blocks exist.
To facilitate understanding of the present invention, the data dynamic deduplication method is further described below in terms of its principle and in combination with the process of dynamically deduplicating data in the embodiment.
Specifically, the data dynamic deduplication method comprises the following steps:
S1, extracting features of the backup data by using the F-score dimension reduction method, and dividing the backup data into data blocks with irregular sizes based on the features.
For a given data set of backup data, assume that the number of positive class points and the number of negative class points are $n_+$ and $n_-$, respectively.
Using the F score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
F scores are calculated for the K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the feature value of the ith feature of the kth positive sample point, and $x_{k,i}^{(-)}$ is the feature value of the ith feature of the kth negative sample point.
The numerator on the right side of the formula approximately reflects the degree of difference between the positive class points and the negative class points on the ith feature, while the left and right terms of the denominator reflect the degree of dispersion of the positive class points and of the negative class points on the ith feature, respectively. Therefore, the larger the value of F(i), the better the ith feature distinguishes the two classes of points, so the F-score method can be used as a criterion for selecting features.
Assuming that the number of features to be selected is preset to d, the n F scores F(1), F(2), …, F(n) are calculated using the above formula. The n F scores are sorted in descending order and the top d are selected. The subscripts $k_i$ corresponding to the selected F scores are extracted, and the features corresponding to these subscripts are the feature values of the backup data.
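The scoring and top-d selection can be sketched in a few lines. This is an illustrative sketch only: the function names and sample data are hypothetical, and how backup data is labelled into positive and negative classes is not specified in the text, so the two sample lists are simply assumed to be given.

```python
from statistics import mean

def f_score(pos, neg, i):
    """F score of feature i; pos and neg are lists of feature vectors."""
    p = [row[i] for row in pos]
    n = [row[i] for row in neg]
    mean_all, mean_p, mean_n = mean(p + n), mean(p), mean(n)
    # Numerator: separation of the two class means from the overall mean.
    numerator = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
    # Denominator: sample variance within each class.
    var_p = sum((x - mean_p) ** 2 for x in p) / (len(p) - 1)
    var_n = sum((x - mean_n) ** 2 for x in n) / (len(n) - 1)
    denominator = var_p + var_n
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

def top_features(pos, neg, d):
    """Return the indices of the d highest-scoring features and all scores."""
    count = len(pos[0])
    scores = [f_score(pos, neg, i) for i in range(count)]
    ranked = sorted(range(count), key=lambda i: scores[i], reverse=True)
    return ranked[:d], scores

# Hypothetical samples with two features each; feature 0 separates the classes.
positive = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
negative = [[7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
selected, scores = top_features(positive, negative, d=1)
```

With these samples feature 0 scores 9.0 and feature 1 scores 0.0, so only feature 0 is kept when d = 1.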
Randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target feature value into the segmentation length calculation function to obtain the segmentation length $S_n$, where $X_i$ is a random variable sequence representing the unknowns and n is the number of feature values; $a_i$ is a coefficient determined by the data structure of the backup data: if the backup data is structured data, $a_i$ is a constant, and if the backup data is unstructured data, $a_i$ is a coefficient array.
Switching the target characteristic values until all the characteristic values are traversed, and obtaining the corresponding segmentation length of each characteristic value; and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the corresponding dividing lengths of the characteristic values.
In this way, the data is dynamically divided into irregular data blocks. In a scenario with massive numbers of small files, the files generally range from 1 KB to 4 KB in size. During backup, based on the variable-length data block principle, each file is divided into data blocks according to its size. For example, if the original file is 1 KB, the whole 1 KB is treated as one data block and a hash algorithm is applied to it to obtain the corresponding fingerprint information; if the original file is 3 KB, it is divided into two data blocks with 2 KB, and the hash algorithm is applied to obtain two corresponding pieces of fingerprint information.
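Once a segmentation length is available for each feature value, dividing the backup data into blocks of differing sizes is mechanical. The sketch below assumes the lengths have already been computed and are consumed in order, with any remaining bytes forming a final block; the function name and these conventions are illustrative assumptions, and the segmentation length function itself is not reproduced here.

```python
def split_by_lengths(data: bytes, lengths: list) -> list:
    """Cut data into consecutive blocks whose sizes follow the given lengths."""
    blocks, pos = [], 0
    for length in lengths:
        if pos >= len(data):
            break
        if length <= 0:
            raise ValueError("segmentation lengths must be positive")
        blocks.append(data[pos:pos + length])
        pos += length
    if pos < len(data):
        # Assumption: leftover bytes become one trailing block.
        blocks.append(data[pos:])
    return blocks

blocks = split_by_lengths(b"abcdefghij", [3, 2, 4])
```

Concatenating the resulting blocks always reproduces the original data, which is the property the later hash lookup and restore steps rely on.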
S2, calculating hash values of the data blocks, and searching matched data blocks of the data blocks from stored data based on the hash values.
A hash value of the data block is calculated, and a matching data block having the same hash value is retrieved from the stored data; metadata of the matching data block is acquired and used as the metadata of the data block.
S3, deleting the data blocks for which matching data blocks exist.
Because the metadata of the matching data block was bound to the data block in step S2, deleting the data block saves storage resources, and the content of the data block can still be read out through the metadata when the data block is looked up.
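Steps S2 and S3 can be sketched with an in-memory index. The class name, the metadata fields and the use of SHA-256 are assumptions for illustration; the text does not specify a hash function or a metadata layout.

```python
import hashlib

class BlockStore:
    """Keeps one copy of each distinct block, indexed by its hash value."""

    def __init__(self):
        self.index = {}    # hash value -> metadata of the stored block
        self.blocks = []   # stored unique blocks

    def put(self, block: bytes) -> dict:
        digest = hashlib.sha256(block).hexdigest()
        meta = self.index.get(digest)
        if meta is None:
            # No matching block: store it and create its metadata.
            meta = {"hash": digest, "slot": len(self.blocks), "size": len(block)}
            self.blocks.append(block)
            self.index[digest] = meta
        # Matching block found: the duplicate is dropped and the
        # existing metadata is reused for it.
        return meta

    def get(self, meta: dict) -> bytes:
        return self.blocks[meta["slot"]]

def backup(store, blocks):
    """Return the list of metadata entries needed to rebuild the data."""
    return [store.put(block) for block in blocks]

def restore(store, recipe):
    return b"".join(store.get(meta) for meta in recipe)

store = BlockStore()
recipe = backup(store, [b"aa", b"bb", b"aa"])
```

Here the third block is never stored a second time, yet `restore(store, recipe)` still returns the full original byte string because its metadata points at the first copy.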
The embodiment effectively solves the problems in fixed block segmentation, and when data is inserted into or deleted from a data object, if the changed content is not within the boundary of the data block, the data block is not changed; the boundaries between data blocks are random, dynamic, and portions of the content of the data blocks may be repetitive. Therefore, the insertion or deletion of the content only affects one or two adjacent data blocks, and the rest of the data blocks are not affected, so that the data de-duplication is more accurate.
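The difference between fixed and content-dependent boundaries can be illustrated with a toy comparison. The content-defined rule used here (cut after every newline byte) is a deliberately simple stand-in chosen for the example, not the F-score-based segmentation of the embodiment; the function names and sample data are likewise hypothetical.

```python
def fixed_chunks(data: bytes, size: int) -> list:
    """Fixed block segmentation: cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def delimiter_chunks(data: bytes, delimiter: bytes = b"\n") -> list:
    """Toy content-defined segmentation: cut after each delimiter byte."""
    blocks, start = [], 0
    for i in range(len(data)):
        if data[i:i + 1] == delimiter:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])
    return blocks

def reused(old_blocks: list, new_blocks: list) -> int:
    """Count blocks of the new version that already exist in the old one."""
    known = set(old_blocks)
    return sum(1 for block in new_blocks if block in known)

old = b"alpha\nbeta\ngamma\ndelta\n"
new = b"alpha\nbXeta\ngamma\ndelta\n"   # one byte inserted into "beta"

fixed_reused = reused(fixed_chunks(old, 4), fixed_chunks(new, 4))
variable_reused = reused(delimiter_chunks(old), delimiter_chunks(new))
```

After the one-byte insertion, only 1 of the 6 fixed-size blocks still matches, because every later boundary shifts, whereas 3 of the 4 content-defined blocks match and only the block containing the insertion changes.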
For unstructured data, especially backups of massive numbers of small files, a dynamic segmentation technique is adopted. Because data changes across a large number of small files are complex, fixed block segmentation often causes the entire backup data to be re-blocked. Variable-length block segmentation solves this problem: only the changed part of the data is segmented, which greatly reduces the client's computing resource usage during data segmentation and ensures the optimal deduplication effect for large numbers of small files.
As shown in fig. 2, the system 200 includes:
a data dividing unit 210, configured to extract features of the backup data by using an F-score dimension reduction method, and divide the backup data into data blocks with irregular sizes based on the features;
a matching search unit 220 for calculating hash values of the respective data blocks and searching for matching data blocks of the data blocks from the stored data based on the hash values;
and a de-duplication unit 230 for deleting the data blocks where the matching data blocks exist.
Optionally, as an embodiment of the present invention, the data dividing unit is configured to:
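using the F score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
to calculate F scores for the K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire dataset, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class dataset, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class dataset, $x_{k,i}^{(+)}$ is the feature value of the ith feature of the kth positive sample point, $x_{k,i}^{(-)}$ is the feature value of the ith feature of the kth negative sample point, and $n_+$ and $n_-$ are the numbers of positive class and negative class points;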
sorting the F scores of the feature values in descending order, and selecting the top-ranked F scores in a number equal to the preset number of feature values;
and taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data.
Optionally, as an embodiment of the present invention, the data dividing unit is configured to:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target feature value into a segmentation length calculation function to obtain the segmentation length $S_n$, where $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of feature values;
switching the target characteristic values until all the characteristic values are traversed, and obtaining the corresponding segmentation length of each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the corresponding dividing lengths of the characteristic values.
Optionally, as an embodiment of the present invention, the matching search unit is configured to:
calculating a hash value of a data block, and retrieving a matching data block with the same hash value as the data block from stored data;
and acquiring metadata of the matched data block, and taking the metadata as metadata of the data block.
Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, where the terminal 300 may be used to execute the data dynamic deduplication method according to the embodiment of the present invention.
The terminal 300 may include a processor 310, a memory 320 and a communication unit 330. These components may communicate via one or more buses. Those skilled in the art will appreciate that the configuration of the terminal shown in the drawings does not limit the invention: it may be a bus structure or a star structure, and it may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 320 may be used to store instructions for execution by the processor 310, and the memory 320 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. When the instructions in the memory 320 are executed by the processor 310, the terminal 300 is enabled to perform some or all of the steps in the method embodiments described above.
The processor 310 is the control center of the storage terminal. It connects the various parts of the entire electronic terminal using various interfaces and lines, and performs the functions of the electronic terminal and/or processes data by running or executing software programs and/or modules stored in the memory 320 and invoking data stored in the memory. The processor may be composed of integrated circuits (ICs), for example a single packaged IC, or a plurality of connected packaged ICs with the same or different functions. For example, the processor 310 may include only a central processing unit (CPU). In the embodiment of the invention, the CPU may have a single computing core or multiple computing cores.
The communication unit 330 is used to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The present invention also provides a computer storage medium in which a program may be stored; when executed, the program may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Therefore, the invention extracts the characteristics of the backup data by using the F-score dimension reduction method and divides the backup data into data blocks with irregular sizes based on the characteristics, realizes the dynamic division of the backup data, solves the problems in the fixed block division, and only performs division processing on the data of the changed part, thereby greatly reducing the occupation of computing resources of a client in the data division processing, effectively improving the data deletion rate in the process of backing up massive small files, reducing the occupation of computing resources, and simultaneously saving more space in the process of data storage.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented in software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution in the embodiments of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and includes several instructions for causing a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to execute all or part of the steps of the method described in the embodiments of the present invention.
The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the terminal embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description in the method embodiment for relevant points.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Those skilled in the art may make various equivalent modifications and substitutions to the embodiments of the present invention without departing from its spirit and scope, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.