CN114138552A - Data dynamic deduplication method, system, terminal and storage medium - Google Patents
- Publication number
- CN114138552A (CN202111335541.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
- G06F11/1453—Management of the data involved in backup or backup restore using de-duplication of the data
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method, a system, a terminal and a storage medium for dynamic data deduplication. The method comprises the following steps: extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics; calculating the hash value of each data block, and searching the stored data for a matching data block based on the hash value; and deleting the data blocks for which a matching data block exists. The method solves the problems of fixed block division, greatly reduces the client computing resources consumed by data division, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more space during data storage.
Description
Technical Field
The invention relates to the technical field of data storage, in particular to a method, a system, a terminal and a storage medium for dynamic data deduplication.
Background
With the development of science and technology, data has begun to grow exponentially, and data security has become a focus of attention for governments and enterprises. In backup protection, however, large amounts of redundant data occupy storage space. In backup and disaster recovery products, data deduplication technology has therefore become one of the indexes used to evaluate a product's technical content, operating performance and quality.
In implementing data deduplication, vendors generally first chunk the data: the backup data is divided into non-overlapping fixed-length data blocks. Common block sizes are 4K/8K/16K/32K/128K, and different vendors choose different sizes. A hash algorithm is then used to build fingerprint information for each data block, and the system determines whether a block duplicates existing "metadata" by computing and checking its fingerprint: if so, only a pointer to the "metadata" is retained; if the fingerprint shows that the block is new, the block is stored, and its relevant information is extracted and saved as metadata for subsequent data verification and comparison.
Throughout this process, block size is the critical factor, because it affects both the operating performance and the deduplication rate: with large blocks, deduplication runs fast, but the deduplication rate is low and accuracy drops; with small blocks, deduplication runs slowly, but the deduplication rate is high and accuracy improves. In addition, in scenarios with large numbers of small files, data changes are complex: when data is inserted into or deleted from a source data object, fixed-length segmentation forces the data to be segmented again, which increases the amount of calculation while lowering the deduplication rate.
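The boundary-shift problem described above can be illustrated with a minimal sketch. The helper names (`fixed_chunks`, `fingerprint`), the 4096-byte block size, the synthetic sample data and the choice of SHA-256 are assumptions made for illustration only; the source does not name a specific hash algorithm.

```python
import hashlib

def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into non-overlapping blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprint(block: bytes) -> str:
    """SHA-256 digest used as the block fingerprint (hash choice is an assumption)."""
    return hashlib.sha256(block).hexdigest()

# Deterministic, non-repeating sample data (16 KiB), standing in for a backup object.
original = bytes((31 * i + i // 256) % 256 for i in range(16384))
# One byte inserted at the front shifts every later fixed-length boundary.
modified = b"\x00" + original

known = {fingerprint(b) for b in fixed_chunks(original, 4096)}
new_blocks = fixed_chunks(modified, 4096)
reused = sum(fingerprint(b) in known for b in new_blocks)
print(f"{reused} of {len(new_blocks)} blocks matched the first backup")
```

With this sample data, none of the five blocks of the modified object matches a block of the first backup, even though only one byte was inserted.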
Disclosure of Invention
In view of the above-mentioned deficiencies of the prior art, the present invention provides a method, a system, a terminal and a storage medium for dynamic data deduplication, so as to solve the above-mentioned technical problems.
In a first aspect, the present invention provides a method for dynamic data deduplication, including:
extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;
calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value;
deleting the data blocks for which a matching data block exists.
Further, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics, including:
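using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire data set, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive class sample point, $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative class sample point, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points, respectively;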
sorting the F scores of the characteristic values according to a rule from big to small, and selecting the F scores of the corresponding number which are ranked in the front according to the set quantity of the characteristic values;
and taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data.
Further, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics, including:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function to obtain a division length $S_n$, where $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;
switching the target characteristic values until all the characteristic values are traversed to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
Further, calculating a hash value of each data block, and searching a matching block of the data block from the stored data based on the hash value, includes:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
In a second aspect, the present invention provides a system for dynamic data deduplication, including:
the data segmentation unit is used for extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into data blocks with irregular sizes based on the characteristics;
the matching search unit is used for calculating the hash value of each data block and searching the stored data for a matching data block of the data block based on the hash value;
and the deduplication unit is used for deleting the data blocks for which a matching data block exists.
Further, the data partitioning unit is configured to:
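using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire data set, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive class sample point, $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative class sample point, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points, respectively;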
sorting the F scores of the characteristic values according to a rule from big to small, and selecting the F scores of the corresponding number which are ranked in the front according to the set quantity of the characteristic values;
and taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data.
Further, the data partitioning unit is configured to:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function to obtain a division length $S_n$, where $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;
switching the target characteristic values until all the characteristic values are traversed to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
Further, the matching search unit is configured to:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
In a third aspect, a terminal is provided, including:
a processor, a memory, wherein,
the memory is used for storing a computer program which,
the processor is used for calling and running the computer program from the memory so as to make the terminal execute the method described above.
In a fourth aspect, a computer storage medium is provided having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.
The method, system, terminal and storage medium for dynamic data deduplication provided by the invention have the following beneficial effects: the characteristics of the backup data are extracted by using an F-score dimensionality reduction method, and the backup data is divided into data blocks with irregular sizes based on the characteristics, which realizes dynamic division of the backup data and solves the problems of fixed block division. Only the changed part of the data is divided again, which greatly reduces the client computing resources consumed by data division, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more space during data storage.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The prior art also includes a sliding-window chunking scheme. Its first backup is identical to fixed-length deduplication: the whole data stream is divided into fixed-length blocks and the hash value of each block is calculated. The fixed length is the window length, and the window is slid during a second backup to try to find and match identical data. Take a data modification as an example, in which the second slice changes from deab to ddab. The hash value of ddab is calculated first; because the data of this slice has changed, no fingerprint matches. Instead of moving straight to the next slice, the window is moved forward by one unit, the hash value of the data under the window (fingerprint 2') is calculated, and matching is attempted again; this repeats until a matching hash value is found. When data is overwritten, the result is the same as with fixed-length deduplication, and the same deduplication rate is obtained. This method still has to divide the data into fixed-length blocks at the start, so it is unsuitable for small files; and because the window is then slid at a specified step and deduplication is attempted again, block hash values must be calculated many times, so the amount of calculation is large and deduplication efficiency is low.
In actual production, data sizes differ across types of business systems: structured data is generally at the KB level and unstructured data at the MB level, so fixed-length data block segmentation yields a low deduplication rate. To solve this problem, the invention discloses a data deduplication method based on variable-length data block segmentation. The method determines data boundaries mainly through a continuously sliding window and dynamically divides the backup data into data blocks of different sizes according to a characteristic function of the backup data, thereby improving the deduplication rate.
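The prior-art sliding-window behaviour described above can be sketched as follows, using the deab to ddab modification as sample data. The function names, the 4-byte window and the use of SHA-256 are illustrative assumptions; the sketch counts hash computations to show where the extra cost comes from.

```python
import hashlib

def digest(block: bytes) -> str:
    """Fingerprint of a block (SHA-256 is an assumed choice)."""
    return hashlib.sha256(block).hexdigest()

def first_backup(data: bytes, window: int) -> set[str]:
    """First backup: plain fixed-length chunking, one hash per block."""
    return {digest(data[i:i + window]) for i in range(0, len(data), window)}

def second_backup(data: bytes, window: int, known: set[str]):
    """Slide the window one byte at a time until a known fingerprint is found."""
    pos, matched, hashes = 0, 0, 0
    unmatched = bytearray()           # bytes that must be stored as new data
    while pos < len(data):
        block = data[pos:pos + window]
        hashes += 1                   # every window position costs one hash
        if digest(block) in known:
            matched += 1
            pos += len(block)         # jump past the matched block
        else:
            unmatched.append(data[pos])
            pos += 1                  # move the window forward by one unit
    return matched, bytes(unmatched), hashes

known = first_backup(b"abcddeabfghi", 4)            # blocks: abcd, deab, fghi
matched, unmatched, hashes = second_backup(b"abcdddabfghi", 4, known)
print(matched, unmatched, hashes)
```

In this run two blocks are matched and the four changed bytes are stored as new data, but six hashes are computed for a three-block object, which is the repeated-hashing cost noted above.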
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in fig. 1 may be a data dynamic deduplication system.
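As shown in fig. 1, the method includes:
step 110, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics;
step 120, calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value;
step 130, deleting the data blocks for which a matching data block exists.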
In order to facilitate understanding of the present invention, the following describes the data dynamic deduplication method provided by the present invention in further detail by using the principle of the data dynamic deduplication method of the present invention and combining the process of performing dynamic deduplication on data in the embodiment.
Specifically, the data dynamic deduplication method includes:
S1, extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks with irregular sizes based on the characteristics.
For a given data set of backup data, assume that the numbers of its positive class points and negative class points are $n_+$ and $n_-$, respectively.
Using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
the F-scores of K features of the backup data are calculated, where $\bar{x}_i$ is the average of the ith feature over the entire data set, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive class sample point, and $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative class sample point.
The numerator of the formula roughly reflects the degree of difference between the positive class points and the negative class points on the ith feature, and the left and right terms of the denominator reflect the dispersion of the positive class points and the negative class points, respectively, on that feature. A larger value of F(i) therefore means that the ith feature distinguishes the two classes of points better, so the F-score can be used as a criterion for selecting features.
Assuming that the number of features to be selected is d, n F-scores are calculated by using the formula: F(1), F(2), …, F(n). The n F-scores are sorted in descending order and the top d are selected. The subscripts $k_i$ corresponding to the selected F-scores are extracted, and the features with these subscripts are taken as the characteristic values of the backup data.
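The ranking and top-d selection can be sketched as below, assuming the standard F-score form (squared distances of the class means from the overall mean, divided by the sum of the two within-class sample variances). The function names and sample values are hypothetical, and how backup data is labelled into positive and negative classes is not specified in the source, so the two classes are simply passed in by the caller.

```python
def f_score(pos: list[list[float]], neg: list[list[float]], i: int) -> float:
    """F-score of feature i given positive and negative class sample points."""
    xp = [row[i] for row in pos]
    xn = [row[i] for row in neg]
    mean_p = sum(xp) / len(xp)
    mean_n = sum(xn) / len(xn)
    mean_all = (sum(xp) + sum(xn)) / (len(xp) + len(xn))
    # Numerator: how far each class mean lies from the overall mean.
    num = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
    # Denominator: within-class sample variances (n - 1 normalisation).
    var_p = sum((x - mean_p) ** 2 for x in xp) / (len(xp) - 1)
    var_n = sum((x - mean_n) ** 2 for x in xn) / (len(xn) - 1)
    den = var_p + var_n
    if den == 0:  # guard added for the sketch; not discussed in the source
        return float("inf") if num > 0 else 0.0
    return num / den

def select_features(pos, neg, d: int) -> list[int]:
    """Indices of the d features with the largest F-scores, in descending order."""
    n = len(pos[0])
    scores = [f_score(pos, neg, i) for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:d]

# Hypothetical sample points with two features each.
pos = [[1.0, 5.0], [2.0, 1.0], [3.0, 9.0]]
neg = [[7.0, 4.0], [8.0, 6.0], [9.0, 5.0]]
print(f_score(pos, neg, 0), f_score(pos, neg, 1), select_features(pos, neg, 1))
```

Feature 0 separates the two classes cleanly and gets an F-score of 9, while feature 1 has identical class means and scores 0, so only feature 0 is selected when d is 1.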
A target characteristic value is randomly selected from all characteristic values of the backup data and taken as a segmentation starting point.
The target characteristic value is substituted into the segmentation length calculation function to obtain a division length $S_n$, where $X_i$ is a random variable sequence representing an unknown quantity and n is the number of characteristic values; $a_i$ is a coefficient determined by the data structure of the backup data: if the backup data is structured data, $a_i$ is a constant; if the backup data is unstructured data, $a_i$ is a row of a coefficient array.
The target characteristic value is switched until all characteristic values have been traversed, which yields the division length corresponding to each characteristic value. The backup data is then divided into a plurality of data blocks according to its characteristic values and the division lengths corresponding to them.
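The expression of the segmentation length calculation function is not reproduced in this text. The sketch below therefore assumes, purely for illustration, a weighted sum $S_n = \sum_{i=1}^{n} a_i X_i$, with $a_i$ either a single constant (structured data) or a row of coefficients (unstructured data). The function names and the way lengths are turned into consecutive blocks are also assumptions, not the patented procedure.

```python
from typing import Sequence, Union

def division_length(x: Sequence[float],
                    a: Union[float, Sequence[float]]) -> float:
    """Assumed form S_n = sum(a_i * X_i); the source formula is not available."""
    if isinstance(a, (int, float)):
        a = [a] * len(x)              # structured data: one constant coefficient
    if len(a) != len(x):
        raise ValueError("need one coefficient per value")
    return sum(ai * xi for ai, xi in zip(a, x))

def split_by_lengths(data: bytes, lengths: Sequence[int]) -> list[bytes]:
    """Cut data into consecutive blocks of the given lengths; keep any remainder."""
    blocks, pos = [], 0
    for length in lengths:
        if length <= 0:
            raise ValueError("block lengths must be positive")
        if pos >= len(data):
            break
        blocks.append(data[pos:pos + length])
        pos += length
    if pos < len(data):
        blocks.append(data[pos:])     # trailing bytes form a final block
    return blocks

print(division_length([1, 2, 3], 2))          # constant coefficient
print(division_length([1, 2, 3], [1, 0, 2]))  # per-value coefficients
print(split_by_lengths(b"abcdefghij", [3, 5]))
```

The point of the sketch is only that different characteristic values yield different lengths, so the resulting blocks are irregular in size.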
The data is thus dynamically segmented into irregular data blocks. In a scenario with a large number of small files, the files generally range from 1 KB to 4 KB in size, and during backup the data is divided into blocks of different sizes according to file size, based on the principle of variable-length data blocks. For example, if the original file is 1 KB, the 1 KB is treated as one data block and hashed to obtain the corresponding fingerprint information; if the original file is 3 KB, it is divided into two data blocks on a 2 KB basis, and the hash algorithm is applied to obtain two corresponding pieces of fingerprint information.
S2, calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value.
The hash value of each data block is calculated, and a matching data block with the same hash value is retrieved from the stored data; the metadata of the matching data block is acquired and used as the metadata of the data block.
S3, deleting the data blocks for which a matching data block exists.
In step S2, the metadata of the matching data block has already been bound to the data block, so deleting the data block saves storage resources, and the content of the data block can still be read through the metadata when the data block is looked up.
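Steps S2 and S3 can be sketched as a small in-memory store. The class name `DedupStore`, the use of SHA-256, and the representation of metadata as a list of fingerprints (a "recipe") are illustrative assumptions rather than the structures used in the patent.

```python
import hashlib

class DedupStore:
    """Keeps one copy of each distinct block, indexed by its fingerprint."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}   # fingerprint -> stored block

    def put(self, block: bytes) -> str:
        """Store the block only if no matching block exists; return its fingerprint."""
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:            # S2: look up a matching block by hash
            self.blocks[fp] = block
        return fp                            # S3: a duplicate is dropped, only the pointer is kept

    def backup(self, chunks: list[bytes]) -> list[str]:
        """Return the metadata needed to rebuild the object from stored blocks."""
        return [self.put(c) for c in chunks]

    def restore(self, recipe: list[str]) -> bytes:
        """Read the content back through the metadata."""
        return b"".join(self.blocks[fp] for fp in recipe)

store = DedupStore()
recipe = store.backup([b"aa", b"bb", b"aa"])
print(len(store.blocks), store.restore(recipe))
```

Three blocks are backed up but only two are stored, and the original content is still recoverable from the recipe.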
This embodiment effectively solves the problems of fixed block segmentation. When data is inserted into or deleted from a data object, a data block does not change as long as the changed content does not fall on its boundary; the boundaries between data blocks are random and dynamic, and parts of the contents of the data blocks may repeat. Therefore, an insertion or deletion affects only one or two adjacent data blocks and leaves the remaining data blocks unaffected, which makes deduplication more accurate.
Dynamic segmentation is adopted for the backup of unstructured data, especially massive numbers of small files. Because data changes in massive small files are complex, fixed block partitioning often forces the whole backup data to be re-partitioned. Variable-length partitioning solves this problem: only the changed part of the data is re-partitioned, which greatly reduces the client computing resources consumed by data partitioning and ensures a good deduplication effect for massive small files.
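The locality property described above can be demonstrated with a deliberately simple stand-in for content-defined boundaries: a cut after every newline byte. This delimiter rule is not the F-score-based segmentation of the invention; it is swapped in only because it makes the effect easy to verify by hand. The function names and sample data are hypothetical.

```python
import hashlib

def content_chunks(data: bytes, delimiter: int = 0x0A) -> list[bytes]:
    """Cut after each delimiter byte, so boundaries follow content, not offsets."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == delimiter:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing data without a delimiter
    return chunks

def fp(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

before = b"alpha\nbeta\ngamma\ndelta\n"
after = b"alpha\nbeXta\ngamma\ndelta\n"   # one byte inserted inside the second block

known = {fp(c) for c in content_chunks(before)}
changed = [c for c in content_chunks(after) if fp(c) not in known]
print(changed)
```

Only the block containing the inserted byte is new; the other three blocks keep their fingerprints and are deduplicated against the earlier backup.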
As shown in fig. 2, the system 200 includes:
a data dividing unit 210, configured to extract features of the backup data by using an F-score dimensionality reduction method, and divide the backup data into data blocks with irregular sizes based on the features;
a matching search unit 220, configured to calculate a hash value of each data block, and search a matching data block of the data block from the stored data based on the hash value;
and a deduplication unit 230 configured to delete the data block where the matching data block exists.
Optionally, as an embodiment of the present invention, the data dividing unit is configured to:
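using the F-score function:
$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating F-scores of K features of the backup data, where $\bar{x}_i$ is the average of the ith feature over the entire data set, $\bar{x}_i^{(+)}$ is the average of the ith feature over the positive class data set, $\bar{x}_i^{(-)}$ is the average of the ith feature over the negative class data set, $x_{k,i}^{(+)}$ is the value of the ith feature of the kth positive class sample point, $x_{k,i}^{(-)}$ is the value of the ith feature of the kth negative class sample point, and $n_+$ and $n_-$ are the numbers of positive class and negative class sample points, respectively;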
sorting the F scores of the characteristic values according to a rule from big to small, and selecting the F scores of the corresponding number which are ranked in the front according to the set quantity of the characteristic values;
and taking the characteristic value corresponding to the selected F score as the characteristic value of the backup data.
Optionally, as an embodiment of the present invention, the data dividing unit is configured to:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function to obtain a division length $S_n$, where $X_i$ is a random variable sequence, $a_i$ is a coefficient determined by the data structure of the backup data, and n is the number of characteristic values;
switching the target characteristic values until all the characteristic values are traversed to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the division lengths corresponding to the characteristic values.
Optionally, as an embodiment of the present invention, the matching search unit is configured to:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, where the terminal 300 may be used to execute the data dynamic deduplication method according to the embodiment of the present invention.
The terminal 300 may include a processor 310, a memory 320, and a communication unit 330. The components communicate via one or more buses. Those skilled in the art will appreciate that the structure of the terminal shown in the figure does not limit the invention: it may be a bus structure or a star structure, and may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The memory 320 may be used for storing instructions executed by the processor 310, and may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The executable instructions in memory 320, when executed by processor 310, enable terminal 300 to perform some or all of the steps in the method embodiments described above.
The processor 310 is the control center of the terminal. It connects the various parts of the entire terminal using various interfaces and lines, and performs the functions of the terminal and/or processes data by running or executing software programs and/or modules stored in the memory 320 and calling data stored in the memory. The processor may be composed of an Integrated Circuit (IC), for example a single packaged IC, or a plurality of packaged ICs with the same or different functions connected together. For example, the processor 310 may include only a Central Processing Unit (CPU). In the embodiment of the present invention, the CPU may have a single operation core or multiple operation cores.
The communication unit 330 is configured to establish a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The present invention also provides a computer storage medium, wherein the computer storage medium may store a program which, when executed, may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Therefore, the invention extracts the characteristics of the backup data by using an F-score dimensionality reduction method and divides the backup data into irregular data blocks based on the characteristics, which realizes dynamic division of the backup data and solves the problems of fixed block division. Only the changed part of the data is divided again, which greatly reduces the client computing resources consumed by data division, effectively improves the deduplication rate when backing up massive numbers of small files, and saves more space during data storage. For the technical effects achievable by this embodiment, refer to the description above; details are not repeated here.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, where the computer software product is stored in a storage medium, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like, and the storage medium can store program codes, and includes instructions for enabling a computer terminal (which may be a personal computer, a server, or a second terminal, a network terminal, and the like) to perform all or part of the steps of the method in the embodiments of the present invention.
For parts that are the same or similar among the embodiments in this specification, the embodiments may be referred to one another. In particular, the terminal embodiment is described relatively briefly because it is basically similar to the method embodiment; for the relevant points, refer to the description of the method embodiment.
In the embodiments provided in the present invention, it should be understood that the disclosed system and method can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, systems or units, and may be electrical, mechanical or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope. Such modifications or substitutions, as well as any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention, fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A dynamic data deduplication method, characterized by comprising the following steps:
extracting the characteristics of the backup data by using an F-score dimensionality reduction method, and dividing the backup data into data blocks of irregular size based on the characteristics;
calculating the hash value of each data block, and searching the stored data for a matching data block of the data block based on the hash value;
deleting the data blocks for which a matching data block exists.
2. The method of claim 1, wherein extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into irregular-sized data blocks based on the characteristics comprises:
using the F-score function:
$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating the F-scores of the K features of the backup data, where $\bar{x}_i$ is the average of the i-th feature over the entire data set, $\bar{x}_i^{(+)}$ is the average of the i-th feature over the positive-class data set, $\bar{x}_i^{(-)}$ is the average of the i-th feature over the negative-class data set, $x_{k,i}^{(+)}$ is the characteristic value of the i-th feature of the k-th positive-class sample point, $x_{k,i}^{(-)}$ is the characteristic value of the i-th feature of the k-th negative-class sample point, and $n_+$ and $n_-$ are the numbers of positive-class and negative-class sample points;
sorting the F-scores of the characteristic values in descending order, and selecting the top-ranked F-scores, as many as the preset number of characteristic values;
and taking the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
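The scoring and selection steps of this claim can be sketched as follows. The sketch uses the commonly used F-score definition (between-class separation of the feature means divided by the sum of the within-class sample variances), which matches the quantities the claim defines. How backup samples are labelled as positive or negative is not stated in this section, so the toy samples, the function names `f_scores` and `top_features`, and the handling of a zero denominator are all assumptions.

```python
from statistics import mean


def f_scores(pos: list[list[float]], neg: list[list[float]]) -> list[float]:
    """Return one F-score per feature; each class needs at least two samples."""
    scores = []
    for i in range(len(pos[0])):
        p = [sample[i] for sample in pos]
        n = [sample[i] for sample in neg]
        mean_all, mean_p, mean_n = mean(p + n), mean(p), mean(n)
        # Numerator: how far each class mean lies from the overall mean.
        between = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
        # Denominator: sample variance inside each class.
        within = (sum((x - mean_p) ** 2 for x in p) / (len(p) - 1)
                  + sum((x - mean_n) ** 2 for x in n) / (len(n) - 1))
        if within == 0:
            # Assumed convention: a constant feature scores 0 unless it separates the classes.
            scores.append(float("inf") if between > 0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def top_features(scores: list[float], count: int) -> list[int]:
    """Indices of the `count` features with the largest F-scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:count]


# Toy data: feature 0 separates the classes, feature 1 does not.
positive = [[1.0, 5.0], [2.0, 5.5], [1.5, 4.5]]
negative = [[8.0, 5.0], [9.0, 4.8], [8.5, 5.2]]
scores = f_scores(positive, negative)
selected = top_features(scores, 1)
print(scores, selected)
```

Sorting indices rather than the scores themselves keeps the link between each selected F-score and the feature it belongs to, which the last step of the claim relies on.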
3. The method of claim 2, wherein extracting the characteristics of the backup data by using an F-score dimensionality reduction method and dividing the backup data into irregular-sized data blocks based on the characteristics comprises:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function to obtain a segmentation length $S_n$, where $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and $n$ is the number of characteristic values;
switching the target characteristic value until all characteristic values have been traversed, to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the segmentation lengths corresponding to the characteristic values.
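The traversal and splitting steps of this claim can be sketched under several stated assumptions. The segmentation length calculation function is not reproduced in this text, so the sketch takes it as a caller-supplied callable (`length_of`, hypothetical). It also models each characteristic value as a byte offset into the backup data, skips a starting point that falls inside an already emitted block, and emits the bytes between two segments as a block of their own; none of these choices is stated in the claim.

```python
import random
from typing import Callable


def split_by_features(
    data: bytes,
    starts: list[int],
    length_of: Callable[[int], int],
    seed: int = 0,
) -> list[bytes]:
    """Cut data into variable-sized blocks, one per segmentation starting point."""
    rng = random.Random(seed)
    # Visit the starting points in random order until all have been traversed.
    order = rng.sample(starts, len(starts))
    lengths = {s: max(1, length_of(s)) for s in order}

    blocks, cursor = [], 0
    for s in sorted(lengths):
        if s < cursor or s >= len(data):
            continue  # assumed policy: ignore overlapping or out-of-range starts
        if s > cursor:
            blocks.append(data[cursor:s])  # bytes between two segments
        end = min(len(data), s + lengths[s])
        blocks.append(data[s:end])
        cursor = end
    if cursor < len(data):
        blocks.append(data[cursor:])
    return blocks


data = b"abcdefghijklmnopqrstuvwxyz"
# Hypothetical length rule, used only to produce irregular block sizes.
blocks = split_by_features(data, [3, 10, 20], lambda s: s // 2 + 2)
print([len(b) for b in blocks])
```

Because the lengths depend only on the starting points, the random visiting order does not change the resulting blocks, and concatenating the blocks always reproduces the original data.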
4. The method of claim 1, wherein calculating the hash value of each data block and searching the stored data for a matching data block of the data block based on the hash value comprises:
calculating the hash value of each data block, and retrieving from the stored data a matching data block whose hash value is the same as that of the data block;
and acquiring the metadata of the matching data block, and taking that metadata as the metadata of the data block.
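The lookup and metadata reuse described in this claim can be sketched with an in-memory index keyed by hash value. The claim does not name a hash algorithm or a metadata layout, so SHA-256, the `BlockMeta` fields and the `BlockStore` class are illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical metadata kept for each stored block."""
    block_id: int
    size: int


class BlockStore:
    """Stores each distinct block once and indexes it by its hash value."""

    def __init__(self) -> None:
        self._index: dict[str, BlockMeta] = {}
        self._blocks: dict[int, bytes] = {}

    def put(self, block: bytes) -> tuple[BlockMeta, bool]:
        """Return (metadata, was_duplicate) for the given block."""
        digest = hashlib.sha256(block).hexdigest()
        meta = self._index.get(digest)
        if meta is not None:
            # A matching block exists: reuse its metadata and drop the new payload.
            return meta, True
        meta = BlockMeta(block_id=len(self._blocks), size=len(block))
        self._index[digest] = meta
        self._blocks[meta.block_id] = block
        return meta, False

    def stored_count(self) -> int:
        return len(self._blocks)


store = BlockStore()
first, dup_first = store.put(b"block-A")
second, dup_second = store.put(b"block-B")
again, dup_again = store.put(b"block-A")
print(dup_first, dup_second, dup_again, store.stored_count())
```

A production system would typically also compare block contents or sizes on a hash match to guard against collisions; the sketch omits this for brevity.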
5. A system for dynamic deduplication of data, comprising:
a data segmentation unit, configured to extract the characteristics of the backup data by using an F-score dimensionality reduction method and to divide the backup data into data blocks of irregular size based on the characteristics;
a matching search unit, configured to calculate the hash value of each data block and to search the stored data for a matching data block of the data block based on the hash value;
and a deduplication unit, configured to delete the data blocks for which a matching data block exists.
6. The system of claim 5, wherein the data segmentation unit is configured for:
using the F-score function:
$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
calculating the F-scores of the K features of the backup data, where $\bar{x}_i$ is the average of the i-th feature over the entire data set, $\bar{x}_i^{(+)}$ is the average of the i-th feature over the positive-class data set, $\bar{x}_i^{(-)}$ is the average of the i-th feature over the negative-class data set, $x_{k,i}^{(+)}$ is the characteristic value of the i-th feature of the k-th positive-class sample point, $x_{k,i}^{(-)}$ is the characteristic value of the i-th feature of the k-th negative-class sample point, and $n_+$ and $n_-$ are the numbers of positive-class and negative-class sample points;
sorting the F-scores of the characteristic values in descending order, and selecting the top-ranked F-scores, as many as the preset number of characteristic values;
and taking the characteristic values corresponding to the selected F-scores as the characteristic values of the backup data.
7. The system of claim 6, wherein the data segmentation unit is configured for:
randomly selecting a target characteristic value from all characteristic values of the backup data, and taking the target characteristic value as a segmentation starting point;
substituting the target characteristic value into a segmentation length calculation function to obtain a segmentation length $S_n$, where $X_i$ is a sequence of random variables, $a_i$ is a coefficient determined by the data structure of the backup data, and $n$ is the number of characteristic values;
switching the target characteristic value until all characteristic values have been traversed, to obtain the segmentation length corresponding to each characteristic value;
and dividing the backup data into a plurality of data blocks according to the characteristic values of the backup data and the segmentation lengths corresponding to the characteristic values.
8. The system of claim 5, wherein the match lookup unit is configured to:
calculating hash values of the data blocks, and retrieving matched data blocks with the same hash values as the data blocks from the stored data;
and acquiring metadata of the matched data block, and taking the metadata as the metadata of the data block.
9. A terminal, comprising:
a processor;
a memory for storing instructions for execution by the processor;
wherein the processor is configured to perform the method of any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111335541.6A CN114138552B (en) | 2021-11-11 | 2021-11-11 | Data dynamic repeating and deleting method, system, terminal and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114138552A (en) | 2022-03-04 |
CN114138552B (en) | 2024-01-12 |
Family
ID=80393713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111335541.6A (Active, CN114138552B) | Data dynamic repeating and deleting method, system, terminal and storage medium | 2021-11-11 | 2021-11-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114138552B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102323958A (en) * | 2011-10-27 | 2012-01-18 | 上海文广互动电视有限公司 | Data de-duplication method |
CN109299727A (en) * | 2018-08-04 | 2019-02-01 | 辽宁大学 | The improvement extreme learning machine method for diagnosing faults of signal reconstruct |
CN109948740A (en) * | 2019-04-26 | 2019-06-28 | 中南大学湘雅医院 | A kind of classification method based on tranquillization state brain image |
Also Published As
Publication number | Publication date |
---|---|
CN114138552B (en) | 2024-01-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||