CN117707435B - Solid-state disk data deduplication method

Info

Publication number: CN117707435B
Application number: CN202410161893.1A
Authority: CN
Inventors: 刘峰; 李玉雪; 孙永升; 叶昕
Original assignee: Chaoyue Technology Co Ltd
Current assignee: Chaoyue Technology Co Ltd
Priority date: 2024-02-05
Filing date: 2024-02-05
Publication date: 2024-05-03
Anticipated expiration: 2044-02-05
Also published as: CN117707435A

Abstract

The invention relates to the technical field of computer hardware, and in particular to a solid-state disk data deduplication method. The method is an offline data deduplication method. When the host sends data, the solid-state disk does not immediately deduplicate it; instead, like an ordinary solid-state disk, it writes the data directly into the corresponding NAND FLASH physical pages as indicated by the address mapping table. After the solid-state disk has worked normally for a period of time, the wear-leveling method, working from the address mapping table, divides the written data into cold data and hot data according to erase frequency, and the solid-state disk deduplicates the identified cold data. Because cold data generally accounts for about 80% of the data stored in a solid-state disk, deduplicating it makes duplicate data easier to identify; in addition, less data has to be moved when the solid-state disk exchanges cold and hot data, which lowers the write amplification factor.

Description

Solid-state disk data deduplication method

Technical Field

The invention relates to the technical field of computer hardware, in particular to research on a solid-state disk offline data deduplication method, and discloses a solid-state disk data deduplication method.

Background

The reliability of a solid-state disk, a data storage device that uses NAND FLASH as its storage medium, is affected by NAND FLASH erasures. As the usage time of the solid-state disk increases, the erase count of the NAND FLASH data blocks increases, which lowers the reliability of the solid-state disk. Although solid-state disk wear leveling can effectively balance the erase counts of the blocks in NAND FLASH, it cannot fundamentally reduce the number of NAND FLASH erasures. With the continuous improvement of solid-state disk controller performance and the continuous expansion of peripheral hardware resources, many methods that previously could only be implemented on the host side are now implemented in the solid-state disk controller. Data deduplication is important for improving solid-state disk reliability because it effectively reduces the amount of data actually written and therefore the number of NAND FLASH erasures. Data deduplication methods have already been introduced inside solid-state disks and are widely studied as a key technology for solid-state disk data storage.

Solid-state disk data deduplication means that the solid-state disk controller checks the data to be written into the storage medium for duplicates and deletes the duplicate data it identifies. This effectively reduces the amount of data written into NAND FLASH, reduces NAND FLASH wear, and improves the reliability of the solid-state disk. In addition, establishing an efficient data deduplication architecture in the solid-state disk controller, for example by using an FPGA to accelerate deduplication, can effectively improve the read/write performance of the solid-state disk.

One current development trend for solid-state disks is to introduce the concept of edge computing and offload to the device side some methods that previously could only be implemented on the host side, so that the device-side processor is fully utilized and the host-side processor is freed for other tasks. Solid-state disk data deduplication can therefore not only relieve the computing pressure on the host processor, but also improve the storage space utilization and reliability of the solid-state disk.

Analysis of large amounts of data shows that the data currently written into storage devices is highly repetitive. For example, Microsoft technicians analyzed the data stored in personal computers and found that about 40% of it is duplicate data; EMC research on backup storage systems found that 60%-90% of backup data is duplicate data; German researchers analyzed the data stored in an HPC data center and found that about 30% of it is duplicate data. Because data storage devices hold a large amount of duplicate data in many application scenarios, it is necessary for the solid-state disk to identify and remove duplicate stored data.

Existing solid-state disk data deduplication methods are generally online: deduplication is performed immediately after the host sends data, which involves generating fingerprint information, comparing fingerprint information, and deleting duplicate data. The workload of online deduplication is therefore large, and it reduces the read/write efficiency of the solid-state disk.

Disclosure of Invention

The invention aims to provide an offline solid-state disk data deduplication method that divides data into cold data and hot data according to erase frequency. Because cold data generally accounts for about 80% of the data stored in a solid-state disk, deduplicating it makes duplicate data easier to identify; in addition, less data has to be moved when the solid-state disk exchanges cold and hot data, which lowers the write amplification factor.

In order to solve the technical problems, the invention adopts the following technical scheme: a solid state disk data deduplication method comprising the steps of:

S01), the host sends data, and the data is written directly into the corresponding NAND FLASH physical page as indicated by the address mapping table;

S02), after the solid-state disk has worked for a period of time, the wear-leveling method, working from the address mapping table, divides the written data into cold data and hot data according to erase frequency, and the cold data is then deduplicated;

The deduplication of the cold data is offline deduplication, and the specific steps are as follows:

S21), modifying the address mapping table by adding the fields PPA2, FTPA, EC, ST and P, wherein the field PPA2 is the physical page location of the duplicate data; the field ST indicates the state of the physical page data, namely a free state, a valid state or an invalid state: when data is written to a physical page, the page is in the valid state; when the physical page data is updated or deduplicated, the page is in the invalid state; a physical page in the invalid state is eventually erased and reclaimed, and its state returns to free; the field EC represents the block erase count, the field FTPA represents the physical page address of the fingerprint information, and the field P represents the data correlation;

S22), when the erase count of a block in the NAND FLASH chip exceeds a preset K value, starting static wear leveling of the solid-state disk;

S23), searching for time-correlated cold data according to the value of P in the address mapping table;

S24), generating fingerprint information from the time-correlated cold data and comparing it; if fingerprints are identical, deleting the duplicate data and updating the address mapping table, which completes the data deduplication; if the fingerprints differ, returning to step S23) and continuing to search for time-correlated cold data until the set number of searches is reached.

Further, in step S24), updating the address mapping table specifically includes: pointing the LPA values of the blocks with the same fingerprint information to the same PPA1, modifying the ST of the remaining blocks to the invalid state, and allocating a physical address to the blocks for which fingerprint information was generated and whose PPA1 values were not modified, where LPA represents a logical page address and PPA1 represents a physical page address.

Furthermore, cold and hot data are exchanged after the data deduplication is completed, so that wear balance is realized.

Further, in step S02), warm data is distinguished in addition to cold data and hot data according to erase frequency, and after the deduplication of the cold data is completed, the warm data is deduplicated.

Furthermore, two ARM processors are arranged in the solid-state disk, namely ARM-1 and ARM-2 respectively, two buffers are added in the solid-state disk, namely a buffer M and a buffer D respectively, wherein the ARM-1 is connected with the buffer M through an AXI bus, the ARM-2 is connected with the buffer D through the AXI bus, the buffer M is used for buffering an address mapping table and partial hot data, the buffer D is used for buffering partial cold data and fingerprint information generated by the partial cold data, the ARM-1 is used for managing the address mapping table in the solid-state disk, and the ARM-2 is used for generating cold data fingerprint information and judging the repeated condition of the data; when the static wear-leveling is carried out on the solid-state disk, ARM-2 starts to work, fingerprint information generation and repeated data judgment are carried out, and ARM-1 correspondingly modifies the address mapping table according to the result of ARM-2 operation.

Further, if the host side updates or deletes the hot data in the buffer M during the data deduplication process, the solid state disk controller firstly transmits all the cold data on the bus to the buffer D, and empties the data on the bus; secondly, the ARM-2 processor generates corresponding fingerprint information for the existing data in the buffer D, and meanwhile, the ARM-1 finishes updating the hot data in the buffer M; finally, after the update of the hot data is completed, the data deduplication work is continuously completed.

Further, if the situation that the host side updates or deletes the existing cold data occurs in the data deduplication process, the solid state disk controller firstly transmits all the cold data on the bus to the buffer D, and the data on the bus is emptied; secondly, judging whether the updated cold data is in the fingerprint generator at present, if not, directly updating the cold data in NAND FLASH; if the data is already in the buffer D or the fingerprint information generator, the corresponding cold data in the buffer D, the fingerprint information calculator and NAND FLASH are all updated.

Further, if the situation that the host reads data or writes new data occurs in the data deduplication process, the solid state disk firstly transmits all the cold data on the bus to the buffer D, and then directly reads the corresponding data from NAND FLASH to the host or writes the new data into the corresponding NAND FLASH.

Further, during the n-th data deduplication, the fingerprint information generated during the (n-1)-th data deduplication is still valid.

The invention has the beneficial effects that: the invention designs a solid-state disk offline data deduplication method based on the existing embedded storage controller. The method comprises the steps of firstly utilizing an improved wear-leveling algorithm to identify cold and hot data and generate corresponding fingerprints, comparing all fingerprints and deleting cold data corresponding to the same fingerprints, and then exchanging the cold data after duplication removal with the hot data.

The method effectively reduces the amount of data written into NAND FLASH and fundamentally reduces NAND FLASH wear. Its average duplicate data identification rate is close to that of online data deduplication methods, which are structurally complex and costly. Compared with a solid-state disk without a data deduplication function, the read/write speed of a solid-state disk adopting the method is improved by 15%, and its read/write performance is similar to that of a solid-state disk adopting online data deduplication.

Drawings

FIG. 1 is a diagram of a solid state disk data deduplication system architecture;

FIG. 2 is an internal logic diagram of the data deduplication solid-state disk controller;

FIG. 3 is a solid state disk offline data deduplication flowchart;

FIG. 4 is a graph of repeated data identification rate comparison;

FIG. 5 is a graph showing the average write latency performance test results.

Detailed Description

The invention will be further described with reference to the drawings and the specific examples.

Example 1

This embodiment discloses a solid-state disk data deduplication method, which is an offline data deduplication method. When the host sends data, the solid-state disk does not immediately deduplicate it; instead, like an ordinary solid-state disk, it writes the data directly into the corresponding NAND FLASH physical pages as indicated by the address mapping table. After the solid-state disk has worked normally for a period of time, the wear-leveling method in the FTL divides the written data into cold data and hot data according to erase frequency. The solid-state disk then deduplicates the identified cold data. Because cold data generally accounts for about 80% of the data stored in a solid-state disk, deduplicating it makes duplicate data easier to identify; in addition, less data has to be moved when the solid-state disk exchanges cold and hot data, which lowers the write amplification factor.

When the data is de-duplicated on the solid-state disk, the MD5 or SHA-1 method is generally adopted for data fingerprint generation, and the SHA-1 method is adopted in the invention. The data deduplication solid state disk system architecture is shown in FIG. 1.

Compared with a common solid-state disk without data deduplication, the solid-state disk with the data deduplication architecture designed by the invention has 2 groups of data buffer areas with different functions, namely a buffer M and a buffer D, which are called as Cache-M and Cache-D in the following. The Cache-M is used for caching the address mapping table and part of hot data as the Cache in the common solid-state disk, and the Cache-D is used for caching part of cold data and fingerprint information generated by the cold data.

FIG. 2 is a logic diagram within a data deduplication solid state disk controller. In the overall system architecture, dual-core ARM processors, ARM-1 and ARM-2, are employed. The ARM-1 is used for managing FTL in the solid-state disk, and comprises address mapping, wear leveling, garbage collection and the like, and the ARM-1 is connected with the Cache-M through an AXI bus. ARM-2 is mainly used for generating cold data fingerprint information and judging the repeated condition of the data, and ARM-2 is connected with Cache-D through an AXI bus. When the solid-state disk needs static wear-leveling, ARM-2 starts working, and works such as fingerprint information generation, repeated data judgment and the like are carried out; ARM-1 then modifies the address mapping table according to the result of ARM-2 operation. In ARM-2, SHA-1 calculator is used for generating fingerprint information, and Hash manager is used for managing fingerprint information.

The working principle of solid-state disk offline data deduplication is described below in three parts: first, the structure of the address mapping table used by the wear-leveling-based deduplication method; second, the offline data deduplication process of the solid-state disk; finally, the handling of special cases that arise during offline data deduplication.

1) Address mapping table structure

As a solid-state disk with a data deduplication function, whether offline data deduplication or online data deduplication is adopted, a corresponding relationship between a logical address and a physical address is finally required to be established through an address mapping table. In order to meet the designed offline data deduplication function, the traditional address mapping table needs to be modified, and corresponding fields are added to meet the requirements of cold and hot data identification, repeated data relocation and fingerprint information storage.

The address mapping table adopts a page-level mapping mode. The optimized address mapping table (OLD-FTL) and the meaning of each field in the table are shown in table 1:

TABLE 1 OLD-FTL Address mapping entry

LPA indicates the logical page address of the data sent by the host. The solid-state disk converts the LPA into the corresponding physical address PPA1 by the address mapping method. PPA2 is the physical page address of the duplicate data. ST represents the state of the physical page: Free, Valid, or Invalid. When there is no data in the physical page, ST is Free; when data is written to a physical page, the page is Valid; when the physical page data is updated or deduplicated, ST is Invalid. A physical page in the Invalid state is eventually erased and reclaimed, and its state returns to Free. EC records the erase count of each block. A parameter K is defined here: when EC is greater than K, the data on the physical page corresponding to EC is considered hot data; otherwise it is considered cold data.
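As an illustration only, the fields described above could be modeled as follows. This is a minimal Python sketch, not the patent's implementation: the names `PageState`, `MapEntry` and `is_hot`, the use of 0 as an "unset" value for PPA2, and the sample EC values in the usage note are assumptions. The field meanings and the rule that EC greater than K marks hot data come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class PageState(Enum):
    FREE = "free"        # no data in the physical page
    VALID = "valid"      # data has been written to the page
    INVALID = "invalid"  # data was updated or deduplicated; awaits erase


@dataclass
class MapEntry:
    """One page-level mapping entry; field names follow the text."""
    lpa: int                          # logical page address sent by the host
    ppa1: int                         # physical page address the LPA maps to
    ppa2: int = 0                     # physical page address of duplicate data (0 = unset, assumed)
    ftpa: int = 0                     # physical page address of the fingerprint (0 = none generated)
    ec: int = 0                       # erase count of the block
    st: PageState = PageState.FREE    # physical page state
    p: int = 0                        # time-correlation tag


def is_hot(entry: MapEntry, k: int) -> bool:
    """Data is hot when EC is greater than K, otherwise cold."""
    return entry.ec > k
```

With K = 280, a hypothetical entry with EC = 300 would be classified as hot and one with EC = 120 as cold.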

Solid-state disk data deduplication mainly comprises three steps: fingerprint generation, fingerprint comparison, and duplicate data deletion. Each piece of data yields a hash value, which serves as its fingerprint. Two pieces of data are judged to be duplicates if their hash values are equal. To reduce the number of times fingerprint information has to be generated and to improve deduplication efficiency, the generated fingerprints are stored in the reserved over-provisioning (OP) space of NAND FLASH for use in later deduplication. A mapping between fingerprint information and its physical storage address must therefore be established, so that the solid-state disk controller can find previously generated fingerprints when it performs data deduplication. The field FTPA holds the physical page address of the fingerprint information, that is, the mapping between the fingerprint information and its physical storage address.
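The fingerprint generation and comparison steps can be sketched in Python with the standard `hashlib` module. This is only an illustration run on a host; the function names `fingerprint` and `is_duplicate` are hypothetical, and the 4KB page size is taken from the sizing discussion later in the description.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; 4KB page assumed for the example


def fingerprint(page: bytes) -> bytes:
    """Return the 20-byte SHA-1 digest used as the page fingerprint."""
    return hashlib.sha1(page).digest()


def is_duplicate(page_a: bytes, page_b: bytes) -> bool:
    """Pages are judged duplicates when their fingerprints are equal."""
    return fingerprint(page_a) == fingerprint(page_b)
```

Comparing 20-byte digests instead of full 4KB pages is what keeps the comparison step cheap.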

In a solid-state disk, cold data accounts for about 80% of the total stored data. The solid-state disk cannot generate and compare hash values for all cold data at once. According to related studies, duplicate data tends to be written close together in time; that is, data with a high degree of duplication is time-correlated. The invention therefore introduces a parameter P that records the correlation of each page of data, and deduplication is performed only among data that is time-correlated. This both raises the proportion of duplicates in the data examined and reduces the amount of fingerprint information. Table 2 shows the state of an OLD-FTL mapping table before data deduplication has been performed.

TABLE 2 OLD-FTL mapping table without data deduplication

In Table 2, the entries with LPA 0001 and 0002 both have a P value of 0001, which indicates that the data on these two physical pages is time-correlated. An FTPA of 0000 indicates that no hash value has yet been generated for the data on the corresponding physical page.
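Selecting time-correlated candidates amounts to grouping mapping entries by their P value. A minimal Python sketch, assuming entries are given as `(lpa, p)` pairs; the function name `group_by_correlation` and the sample values in the usage note are hypothetical.

```python
from collections import defaultdict


def group_by_correlation(entries):
    """Group LPAs by P value; pages sharing a P value are time-correlated.

    entries: iterable of (lpa, p) pairs taken from the mapping table.
    Returns {p: [lpa, ...]} with LPAs in their original order.
    """
    groups = defaultdict(list)
    for lpa, p in entries:
        groups[p].append(lpa)
    return dict(groups)
```

For example, `group_by_correlation([(1, 1), (2, 1), (3, 2)])` places LPA 1 and LPA 2 in the same group, so only those two pages would be compared with each other.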

2) Offline data deduplication process

After the optimization design of the address mapping table is completed, the solid-state disk can utilize the optimized address mapping table to perform data deduplication. The embodiment designs a solid-state disk offline data deduplication method based on tristate data. The method adopts an ARM processor to generate fingerprint information.

The biggest difference between offline and online data deduplication is that fingerprint information is not generated immediately when data is written. Therefore, when a solid-state disk with the offline data deduplication function is in its initial state and starts to store data, it does not immediately perform fingerprint generation, fingerprint comparison, or data deduplication. Data is written and read in the same way as on an ordinary solid-state disk.

After the solid state disk is normally operated for a period of time, each data block in the NAND FLASH chips has different erasing times. At this time, the wear leveling method in the solid-state disk has been able to identify hot data, cold data, and warm data according to the different number of erasures per block. When the erasing times of a certain block in NAND FLASH chips exceeds a preset K value, the solid-state disk starts a corresponding static wear-leveling method. In order to effectively improve the cold and hot data identification precision, a wear-leveling method is adopted. FIG. 3 is a solid state disk offline data deduplication process based on tristate data.

After the solid-state disk is started to perform static wear leveling, the data in the solid-state disk are divided into three types of cold, warm and hot according to different erasing frequencies. After division, the data in the cold block and the hot block are not exchanged immediately, but the following processing is performed to complete the solid-state disk data deduplication:

(1) The ARM-2 processor reads the cold data which are close in time into the Cache-D from NAND FLASH in sequence by taking pages as units according to the value of P in the address mapping table;

(2) Inputting data in the Cache-D into an SHA-1 fingerprint calculator to generate corresponding fingerprint information;

(3) Fingerprint information is input into a Hash manager for comparison. If the fingerprint information is the same, modifying an address mapping table through ARM-1, pointing different LPA values to the same PPA1, and modifying ST corresponding to the rest PPA1 into an invalid state;

(4) If the same data does not appear in the fingerprint information comparison, the group of data is indicated to have no duplicate data which can be deleted. At this point, the solid state disk controller will read another set of cold data with time correlation from NAND FLASH, and proceed with fingerprint information generation and comparison. In order to avoid repeated reading of cold data for data deduplication operation and ensure the wear leveling method and the performance of the solid-state disk, the patent sets that at most 3 groups of data are read for data deduplication in one wear leveling, and if no repeated data exist in the 3 groups of data, the solid-state disk controller stops the data deduplication operation and directly performs the solid-state disk wear leveling.

In the static wear-leveling method, the data stored in the solid-state disk is finally classified into three types (cold, warm, and hot) by identification and merging. The offline data deduplication method preferentially selects the coldest time-correlated data for deduplication. If, after multiple rounds of deduplication, no duplicate data remains in the cold blocks, the method goes on to identify and remove duplicate data in the warm blocks. The deduplication process for warm blocks is the same as that for cold blocks and is not described again here.

(5) After fingerprint information comparison and address mapping table updating are completed, the solid-state disk exchanges cold and hot data, and at the moment, the solid-state disk controller has deleted the repeated part of the cold data, so that the exchange amount of the cold and hot data is reduced, the static loss balance efficiency of the solid-state disk is improved, and the performance of the whole solid-state disk is indirectly improved.
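Steps (1) to (4) can be sketched as a bounded search loop. The Python below is a simplified model rather than controller firmware: `dedup_pass`, `read_page` and the dictionary-based return value are hypothetical, and the flash contents in the usage note are made up. The limit of 3 groups per wear-leveling pass comes from the text.

```python
import hashlib

MAX_GROUPS = 3  # at most 3 groups are read per wear-leveling pass


def dedup_pass(groups, read_page):
    """Examine up to MAX_GROUPS groups of time-correlated cold pages.

    groups: list of lists of physical page addresses, one list per P value.
    read_page: callable returning the bytes stored at a physical page.
    Returns {duplicate_ppa: kept_ppa} for the first group that contains
    duplicates, or {} if none of the examined groups does.
    """
    for group in groups[:MAX_GROUPS]:
        seen = {}        # fingerprint -> first PPA holding that data
        duplicates = {}  # duplicate PPA -> PPA that is kept
        for ppa in group:
            fp = hashlib.sha1(read_page(ppa)).digest()
            if fp in seen:
                duplicates[ppa] = seen[fp]
            else:
                seen[fp] = ppa
        if duplicates:
            return duplicates  # deduplication done for this pass
    return {}  # give up and proceed directly to wear leveling
```

With `flash = {1: b"A" * 4096, 2: b"A" * 4096, 3: b"B" * 4096}`, the call `dedup_pass([[1, 2, 3]], flash.__getitem__)` reports page 2 as a duplicate of page 1.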

Table 3 shows the state of the address mapping table of a data deduplication solid-state disk after it has been operating normally for a period of time. The EC values of the data at LPA 0004 and 0005 are greater than K (assuming a K value of 280), which indicates that the data at these two logical addresses is hot data, while the data at LPA 0001, 0002 and 0003 is cold data.

TABLE 3 Address mapping table after a period of operation

At this time, according to the value of the P field in the address mapping table, the solid-state disk first reads the cold data at PPA1 0001, 0002 and 0003 into the fingerprint generator (the P values of all three addresses are 0001, which indicates that the data stored at the three addresses is time-correlated) and generates the corresponding fingerprint information. Assuming that the data at PPA1 0001 and 0002 is identical, the generated fingerprints are also identical.

After the fingerprint comparison is completed, duplicate data with the same fingerprint information can be deleted. When the duplicate data is deleted, the physical page holding it is marked invalid and the address mapping table is updated. The result of the first update of the address mapping table is shown in Table 4. Modified values are shown in bold in the table.

TABLE 4 Address mapping table after the first update

When the address mapping table is updated for the first time, the updated content comprises:

(1) The PPA1 value corresponding to LPA 0002 is modified to 0001, where both different logical addresses LPA (0001) and LPA (0002) correspond to physical pages of PPA1 (0001).

(2) The data of LPA (0001) and LPA (0003) generate fingerprint information, and thus the physical address corresponding to the fingerprint information is added to FTPA. Since the fingerprint information of the LPA (0002) corresponding data is identical to the fingerprint information of the LPA (0001) corresponding data, only one fingerprint information need be stored.

(3) The state of the physical page corresponding to PPA2 (0002) is modified to Invalid. After the data of the physical page is detected as the duplicate data, the data corresponding to PPA2 (0002) has been formally deleted by changing the pointing direction of the logical address. The physical page state corresponding to PPA2 (0002) is modified into Invalid, so that the solid-state disk can recover the page data when garbage collection is performed, and final repeated data deletion is realized from the physical state.
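The first update can be sketched in Python as a redirect over a dictionary-based mapping table. This is a simplified model: `apply_dedup` is a hypothetical name, and recording the old physical page in PPA2 is an assumption inferred from the reference to "PPA2 (0002)" above, since Table 4 itself is not reproduced here.

```python
def apply_dedup(table, duplicates):
    """Redirect logical pages whose physical page was found to be a duplicate.

    table: {lpa: {"ppa1": int, "ppa2": int, "st": str}}
    duplicates: {duplicate_ppa: kept_ppa}
    """
    for entry in table.values():
        old = entry["ppa1"]
        if old in duplicates:
            entry["ppa2"] = old              # remember the duplicate page (assumed use of PPA2)
            entry["ppa1"] = duplicates[old]  # point the LPA at the page that is kept
            entry["st"] = "invalid"          # duplicate page awaits garbage collection
    return table
```

Applied to entries for LPA 0001 to 0003 with `duplicates = {2: 1}`, only the entry for LPA 0002 changes: its PPA1 becomes 1 and its state becomes invalid.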

After the solid-state disk controller completes the first mapping table update to realize the duplication removal of cold data, the storage positions of the cold and hot data in the solid-state disk can be exchanged, and the wear balance of the solid-state disk is completed.

The exchange of cold and hot data storage locations is divided into two parts. The first part is the second update of the address mapping table; the table after the second update is shown in Table 5. The second part is the data exchange in NAND FLASH, which is not discussed here.

TABLE 5 Address mapping table after the second update

Through the second update of the address mapping table, the cold data at logical address LPA (0001) and the hot data at LPA (0004) exchange physical locations. After the exchange, the PPA1 value corresponding to LPA (0001) changes from the original 0001 to 0004, and the PPA1 value corresponding to LPA (0004) changes from the original 0004 to 0001. Similarly, the cold data at logical address LPA (0003) and the hot data at LPA (0005) exchange physical locations.
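The exchange of physical locations can be sketched in Python on a simplified `{lpa: ppa1}` table. The function name `swap_pages` is hypothetical. The text only describes LPA (0001), LPA (0003), LPA (0004) and LPA (0005); redirecting every logical page that shares a deduplicated physical page (here LPA 0002) is an assumption made to keep the mapping consistent.

```python
def swap_pages(table, ppa_a, ppa_b):
    """Exchange two physical pages in an {lpa: ppa1} mapping.

    Every LPA that pointed at ppa_a now points at ppa_b and vice versa,
    so logical pages sharing a deduplicated page move together.
    """
    for lpa, ppa in table.items():
        if ppa == ppa_a:
            table[lpa] = ppa_b
        elif ppa == ppa_b:
            table[lpa] = ppa_a
    return table
```

Starting from `{1: 1, 2: 1, 3: 3, 4: 4, 5: 5}`, swapping pages 1 and 4 and then pages 3 and 5 gives `{1: 4, 2: 4, 3: 5, 4: 1, 5: 3}`.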

3) Discussion of problems associated with data deduplication methods

In general, a solid-state disk performs static wear leveling and offline deduplication of cold data only when it is idle. However, because host read/write requests cannot be predicted, there is no guarantee that the host will not send a read/write request while static wear leveling and offline cold data deduplication are in progress, even if the solid-state disk was idle when they started. The solid-state disk therefore needs to handle such special cases during static wear leveling and offline deduplication of cold data. Some of these special cases are discussed below:

(1) The host updates or deletes hot data in the Cache-M. The solid-state disk controller first transfers all the cold data on the bus to the Cache-D and empties the bus. Next, the ARM-2 processor generates fingerprint information for the data already in the Cache-D while ARM-1 completes the update of the hot data in the Cache-M. Finally, after the hot data update is complete, the data deduplication work continues.

(2) The host updates or deletes existing cold data. First, the solid-state disk controller transfers all the cold data on the bus to the Cache-D and empties the bus. Second, it determines whether the cold data being updated is currently in the fingerprint generator. If not, the cold data in NAND FLASH is updated directly. If the data is already in the Cache-D or the fingerprint generator, the corresponding cold data in the Cache-D, the SHA-1 calculator and NAND FLASH must all be updated to avoid data inconsistency during the cold and hot data exchange.

(3) The host reads data or writes new data. Because this does not involve updating the existing cold or hot data, the solid-state disk only needs to transfer all the cold data on the bus to the Cache-D, and then either read the requested data from NAND FLASH to the host or write the new data into the corresponding NAND FLASH location.

(4) After data deduplication is complete, some 1-to-1 mappings in the original address mapping table become n-to-1 mappings, in which n LPAs correspond to 1 PPA. If the host then updates the i-th (0 < i ≤ n) LPA, it is only necessary to map LPA (i) to a new PPA. If the host needs to delete the data corresponding to LPA (i), LPA (i) only needs to be pointed to an adjacent invalid physical page.

(5) The invention mainly aims at offline deduplication of cold data in a solid-state disk, and the data deduplication mode has two characteristics: (1) fingerprint information is generated within a specific time; (2) The cold data is stored in the solid-state disk for a long time, so that fingerprint information generated during the n-1 th data deduplication still can be effective during the n-th data deduplication. Based on the two characteristics, fingerprint information needs to be stored.

For a typical 4 KB data page, SHA-1 generates 20 B of fingerprint information. Taking a 2 TB solid-state disk as an example, fingerprint information for all of the data would occupy 10 GB; assuming that about 80% of the data in the disk is cold data, the fingerprint information actually stored for a 2 TB disk is about 8 GB. This fingerprint information is stored in a dedicated area of the solid-state disk and located through the FTPA values in the address mapping table. At the next deduplication, fingerprints that have already been generated can be read directly from the physical address given by FTPA instead of being generated again.
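The storage estimate above can be checked with a short calculation. The sketch below assumes binary units (1 KB = 1024 B, 1 TB = 1024^4 B) and one fingerprint per 4 KB page; the function name `fingerprint_store_bytes` is hypothetical.

```python
import hashlib

PAGE_SIZE = 4 * 1024           # 4 KB data page
DISK_CAPACITY = 2 * 1024**4    # 2 TB disk (binary units assumed)
COLD_PERCENT = 80              # share of cold data assumed in the text

# A SHA-1 digest is 20 bytes long regardless of the input size.
FINGERPRINT_SIZE = hashlib.sha1(bytes(PAGE_SIZE)).digest_size


def fingerprint_store_bytes(capacity, percent=100):
    """Bytes needed to keep one fingerprint per page for `percent` of the data."""
    pages = capacity // PAGE_SIZE
    return pages * FINGERPRINT_SIZE * percent // 100


GIB = 1024**3
all_data = fingerprint_store_bytes(DISK_CAPACITY)
cold_only = fingerprint_store_bytes(DISK_CAPACITY, COLD_PERCENT)
print(all_data / GIB, cold_only / GIB)   # prints 10.0 8.0
```

The 2 TB disk holds 2^29 pages of 4 KB, so 20 B per page gives 10 GB for all data and 8 GB for the 80% assumed to be cold.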

The method is suitable for the fields of industrial control, automation equipment, data centers, aerospace and the like.

In order to measure the effect of the solid-state disk data deduplication method, a data repetition rate identification test and a data deduplication performance test are performed.

1. Data repetition rate identification test

The duplicate data recognition rate is an important index of how accurately a solid-state disk data deduplication method identifies duplicate data: the higher the recognition rate, the more duplicate data the method can identify. To test how different data deduplication methods perform at identifying duplicate data in different application environments, three relatively common traces were selected for testing. Table 6 describes the characteristics of these three traces.

Table 6 Details of the trace files

Of the three trace test cases, trace1 involves a small amount of written data and few data updates, so most of its data can be regarded as cold data. In trace2, the total amount of test data is not large, but the data is hot data that is updated very frequently, which easily degrades solid-state disk performance and produces a very high write amplification factor. The trace3 test case has the largest data volume, with a relatively even split between cold and hot data. These three traces represent several common application scenarios for current solid-state disks, so testing with them effectively reflects the validity of various solid-state disk data deduplication methods.

Fig. 4 shows the test results of the present method for duplicate data recognition.

The duplicate data identification rate test data is shown in table 7.

Table 7 Duplicate data identification rate test data

The duplicate data recognition test results show that the present method achieves a lower recognition rate than the solid-state disk data deduplication method proposed by Kim. This is because the method of this patent does not generate fingerprints, compare them, and deduplicate for all incoming data; it deduplicates only the cold data in the solid-state disk.

1) In the trace1 test, the WinXP system files are written to the solid-state disk only once and, if few operations are performed on them afterwards, quickly become cold data. The recognition rate of the present method is therefore relatively close to that of the method proposed by Kim in the trace1 test.

2) In the trace2 test, write requests to Word documents increase and the data is modified more frequently, so the Word documents contain less cold data. The method of this patent deduplicates only cold and warm data blocks, not hot data, so its duplicate data recognition rate drops in the trace2 test.

3) In the trace3 test, the proportion of cold data in trace3 is lower than in trace2 and higher than in trace1. The duplicate data recognition rate of the offline deduplication method of this patent is therefore higher than in the trace1 test and lower than in the trace2 test.

2. Data deduplication performance test

As a data storage device, a solid-state disk has read-write performance that directly affects the performance of the whole computer system, and average write latency is a key indicator of that performance. Compared with an ordinary solid-state disk without data deduplication, a solid-state disk that uses a data deduplication method adds three steps to the data writing process: fingerprint generation, fingerprint comparison, and duplicate data deletion. The method of this patent uses an ARM Cortex-A9 processor for all three steps. Fig. 5 compares the average write latency of the present method, the method proposed by Kim, and an ordinary solid-state disk without data deduplication. The test cases are those listed in Table 6.

The average write latency test data is shown in table 8.

Table 8 average write latency test data

The average write latency performance test results showed that:

1) In the trace1 test, the write latencies of the three solid-state disks are similar because the repetition rate of the written data is low.

2) In the trace2 test, the data repetition rate increases. However, trace2 combines a high data repetition rate with frequent updates, and the repetition rate is lower in the frequently updated part. The offline deduplication method skips the hot data, whose repetition rate is low, and performs fingerprint generation and comparison only on the cold data, whose repetition rate is high. In addition, during the trace2 test the solid-state disk reaches its static wear-leveling trigger condition several times; because the present method deduplicates cold data while wear leveling is in progress, it indirectly improves the wear-leveling efficiency of the solid-state disk.

3) In the trace3 test, the trace contains more duplicate data, but compared with trace2 there is not enough hot data, and static wear leveling of the solid-state disk is not triggered many times, so the method of this patent shows no advantage.

The foregoing describes only the basic principles and preferred embodiments of the present invention. Modifications and substitutions made by those skilled in the art on the basis of the present invention fall within its scope of protection, as defined by the appended claims.

Claims

1. A solid state disk data deduplication method, characterized in that the method comprises the following steps:

The cold data de-duplication is off-line de-duplication, which comprises the following specific steps:

S21) modifying the address mapping table by adding the fields PPA2, FTPA, EC, ST, and P, wherein the field PPA2 is the position of the physical page holding duplicate data; the field ST indicates the state of the physical page data, which is an idle state, a valid state, or an invalid state: when data is written to a physical page, the page is in the valid state; when the data of a physical page is updated or deduplicated, the page is in the invalid state, and a physical page in the invalid state is eventually erased and reclaimed, returning it to the idle state; the field EC represents the number of block erasures, the field FTPA represents the physical page address of the fingerprint information, and the field P represents the data dependency;

2. The solid state disk data deduplication method of claim 1, wherein: the address mapping table is updated in step S24) as follows: the LPA values of blocks with the same fingerprint information are pointed to the same PPA1, the ST of the remaining blocks is changed to the invalid state, and a physical address is allocated to blocks for which fingerprint information has been generated but whose PPA1 value has not been modified, where LPA represents a logical page address and PPA1 represents a physical page address.

3. The solid state disk data deduplication method of claim 1, wherein: and exchanging cold data and hot data after the data deduplication is completed.

4. The solid state disk data deduplication method of claim 1, wherein: in step S02), according to differences in erase frequency, warm data is distinguished in addition to cold data and hot data, and after deduplication of the cold data is completed, the warm data is deduplicated.

5. The solid state disk data deduplication method of claim 1, wherein: the solid-state disk is provided with two ARM processors, namely ARM-1 and ARM-2, two buffers are added in the solid-state disk, namely buffer M and buffer D, ARM-1 is connected with buffer M through an AXI bus, ARM-2 is connected with buffer D through the AXI bus, buffer M is used for buffering an address mapping table and partial hot data, buffer D is used for buffering partial cold data and fingerprint information generated by the partial cold data, ARM-1 is used for managing the address mapping table in the solid-state disk, ARM-2 is used for generating cold data fingerprint information and judging the data repetition condition; when the static wear-leveling is carried out on the solid-state disk, ARM-2 starts to work, fingerprint information generation and repeated data judgment are carried out, and ARM-1 modifies the address mapping table according to the result of ARM-2 operation.

6. The solid state disk data deduplication method of claim 5, wherein: if the situation that the host end updates or deletes the hot data in the buffer M occurs in the data deduplication process, the solid state disk controller firstly transmits all cold data on the bus to the buffer D, and empties the data on the bus; secondly, the ARM-2 processor generates corresponding fingerprint information for the existing data in the buffer D, and meanwhile, the ARM-1 finishes updating the hot data in the buffer M; finally, after the update of the hot data is completed, the data deduplication work is continuously completed.

7. The solid state disk data deduplication method of claim 5, wherein: if the condition that the host end updates or deletes the existing cold data occurs in the data deduplication process, the solid state disk controller firstly transmits all the cold data on the bus to the buffer D, and empties the data on the bus; secondly, judging whether the updated cold data is in the fingerprint generator at present, if not, directly updating the cold data in NAND FLASH; if the data is already in the buffer D or the fingerprint information generator, the corresponding cold data in the buffer D, the fingerprint information calculator and NAND FLASH are all updated.

8. The solid state disk data deduplication method of claim 5, wherein: if the situation that the host side reads data or writes new data occurs in the data deduplication process, the solid state disk controller firstly transmits all the cold data on the bus to the buffer D, and then directly reads the corresponding data from NAND FLASH to the host side or writes the new data into the corresponding NAND FLASH.

9. The solid state disk data deduplication method of claim 1, wherein: during the n-th data deduplication, fingerprint information generated during the (n-1)-th data deduplication remains valid.