CN113721836A

CN113721836A - Data deduplication method and device

Info

Publication number: CN113721836A
Application number: CN202110661794.6A
Authority: CN
Inventors: 周文
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2021-06-15
Filing date: 2021-06-15
Publication date: 2021-11-30

Abstract

The application discloses a data deduplication method and device. The method comprises: acquiring first data to be stored in a target storage space, wherein M data blocks have been stored in the target storage space and M is a positive integer; calculating a checksum of the data content of the first data; in a case where the checksum of the data content of the first data is determined to differ from the checksums of the data content of all M data blocks, allocating a first data block to the first data and storing the first data in the first data block; and in a case where the checksum of the data content of the first data is determined to be identical to the checksum of the data content of a second data block of the M data blocks, and the data content of the second data block is identical to the data content of the first data, recording address information of the second data block in the index node as the address information of the data block storing the first data. The method improves the utilization of the storage space while reducing performance loss.

Description

Data deduplication method and device

Technical Field

The present application relates to the field of computer storage, and in particular, to a data deduplication method and apparatus.

Background

With the development of science and technology, electronic products are used in more and more scenarios, more and more data needs to be stored, and users demand more and more storage space. Data deduplication technology has emerged in response.

In current data deduplication technology, a unique identifier is calculated, using an algorithm such as a hash algorithm or a fingerprint algorithm, for the data to be stored and for each data block already stored in a storage space, and the storage space is searched for a target data block whose identifier is identical to that of the data to be stored. If such a block exists, only a pointer to the location where the target data block is stored is saved. If not, the data to be stored is written into the storage space. In this way, only one copy of identical data in a file is stored, and identical data shares the same data block.

However, calculating the identifiers of the data blocks in the storage space other than the target data block can be regarded as invalid calculation. In a data deduplication method that uses a hash algorithm, a fingerprint algorithm, or the like to search the storage space for the target data block, this invalid calculation incurs a large amount of computation and power consumption, especially when most of the data in a file is different.

Therefore, how to provide a data deduplication method to reduce performance loss while improving the utilization rate of storage space becomes an important research topic in the technical field.

Disclosure of Invention

The application provides a data deduplication method and device, which are used for reducing the calculation amount and power consumption generated by comparing data content of data to be stored with data blocks stored in a target storage space, improving the utilization rate of the storage space and reducing performance loss.

In a first aspect, an embodiment of the present application provides a data deduplication method, the method including: acquiring first data to be stored in a target storage space, where M data blocks have been stored in the target storage space and M is a positive integer; calculating a checksum of the data content of the first data; in a case where the checksum of the data content of the first data is determined to differ from the checksums of the data content of all M data blocks, allocating a first data block to the first data and storing the first data in the first data block; and in a case where the checksum of the data content of the first data is determined to be the same as the checksum of the data content of a second data block of the M data blocks, and the data content of the second data block is determined to be the same as the data content of the first data, recording the address information of the second data block in an index node as the address information of the data block storing the first data, where the index node is used for recording the association between data and the address information of the data block storing the data.

It is understood that the first data may be a part of text content in a document, or the first data may be a part of image frame in a video, and the like, which is not limited in this embodiment of the present application.

It is understood that the target storage space may be a storage partition of a storage medium that stores data by using data blocks, and specifically, the storage medium that stores data by using data blocks may be a magnetic disk or a solid state disk.

In the method provided by the first aspect, if the checksum of the first data differs from the checksums of all M data blocks, the data content of the first data is certainly different from the data content of each of the M data blocks. In this case the data content of the first data is not compared with that of the M data blocks, which saves the computation and power that such a comparison would consume when the contents are certain to differ. If the checksum of a second data block is the same as the checksum of the first data, the data content of the first data and the data content of the second data block may be the same, so it is further confirmed whether they are the same. If they are, the first data and the second data block share the address information of the same data block and no new data block needs to be allocated to the first data, which improves the utilization of the storage space while reducing performance loss.
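The flow of the first aspect can be illustrated with a minimal in-memory sketch. The class `DedupStore`, its fields, and the use of `zlib.crc32` as the checksum are assumptions made for illustration only; in particular, allocating a new block when the checksums match but the contents differ is an assumed handling of a case the paragraph above does not spell out.

```python
import zlib


class DedupStore:
    """Toy model of the first-aspect flow (all names are hypothetical)."""

    def __init__(self):
        self.blocks = []        # block address -> data content
        self.by_checksum = {}   # checksum -> addresses of blocks with that checksum
        self.inode = {}         # data key -> address of the block storing it

    def store(self, key, data: bytes) -> int:
        checksum = zlib.crc32(data)
        # Only blocks with an equal checksum can hold identical content,
        # so every other block is skipped without a content comparison.
        for addr in self.by_checksum.get(checksum, []):
            if self.blocks[addr] == data:   # confirm with a full comparison
                self.inode[key] = addr      # share the existing block
                return addr
        # No block matches in both checksum and content: allocate a new block
        # (assumed behaviour for the "same checksum, different content" case).
        addr = len(self.blocks)
        self.blocks.append(data)
        self.by_checksum.setdefault(checksum, []).append(addr)
        self.inode[key] = addr
        return addr
```

Storing the same content under two keys leaves a single block in `blocks`, with both keys in `inode` pointing at the same address.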

In a possible implementation, the recording of the address information of the second data block in an index node as the address information of the data block storing the first data, in a case where it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of a second data block of the M data blocks and the data content of the second data block is the same as the data content of the first data, includes: in a case where it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of a second data block of the M data blocks, the data content of the second data block is the same as the data content of the first data, and a first reference count of the second data block is smaller than a first threshold, recording the address information of the second data block in the index node as the address information of the data block storing the first data, and adding 1 to the first reference count. The first reference count is the number of times the address information of the second data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording the association between a data block and the number of times the address information of that data block is repeatedly referenced.

It is understood that the second data block may be any one of the M data blocks.

Optionally, determining whether the checksum of the data content of the first data is the same as the checksum of the data content of the second data block of the M data blocks, and determining whether the first reference count is smaller than the first threshold, may be performed simultaneously or one after the other, in either order. For example, whether the first reference count is smaller than the first threshold may be determined after it is determined that the checksum of the first data is the same as the checksum of the second data block of the M data blocks. Alternatively, whether the checksums are the same may be determined after it is determined that the first reference count is smaller than the first threshold.

Optionally, determining whether the data content of the first data is the same as the data content of the second data block, and determining whether the first reference count is smaller than the first threshold, may likewise be performed simultaneously or one after the other, in either order. For example, whether the first reference count is smaller than the first threshold may be determined after it is determined that the data content of the first data is the same as the data content of the second data block. Alternatively, whether the data contents are the same may be determined after it is determined that the first reference count is smaller than the first threshold.

In one possible implementation, the method further includes: in a case where it is determined that the checksums of the data content of N second data blocks of the M data blocks are the same as the checksum of the data content of the first data, and the first reference counts of the N second data blocks are all greater than or equal to the first threshold, allocating a first data block to the first data and storing the first data in the first data block. N is an integer less than or equal to M, the first reference count is the number of times the address information of a second data block has been referenced, as recorded in the repeat count table, and the repeat count table is used for recording the association between a data block and the number of times the address information of that data block is repeatedly referenced.

In the embodiment of the present application, a data deduplication upper limit is set for the data deduplication method (the first threshold is the data deduplication upper limit). On the one hand, if no upper limit is set, the reference count of the target data block corresponding to the second data keeps increasing, so the target data block is accessed more and more frequently; repeatedly reading the same target data block in the target storage space over a long period may accelerate damage to the storage medium (the target storage space) storing the target data block and shorten its service life. Setting the upper limit avoids this accelerated damage caused by frequent access to the target data block. On the other hand, the more times the target data block is referenced, the higher the degree of repetition, which may make indexing the target direct index table more difficult and slow down modification operations. Setting the upper limit therefore also avoids the indexing difficulty and the reduced modification efficiency caused by an excessively high degree of repetition.
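The effect of the deduplication upper limit can be sketched as follows. The function `store_with_limit`, the dictionary layout of a block, and the choice to give a newly allocated block a reference count of 1 are illustrative assumptions, not details taken from the application.

```python
import zlib


def store_with_limit(blocks, inode, key, data, threshold):
    """Share an existing block only while its reference count is below threshold.

    blocks: list of dicts {"checksum": int, "data": bytes, "refs": int}
            (hypothetical stand-in for the data blocks plus the repeat count table)
    inode:  dict mapping a data key to a block address
    """
    checksum = zlib.crc32(data)
    for addr, blk in enumerate(blocks):
        if blk["checksum"] != checksum:
            continue                      # contents are certainly different
        if blk["refs"] >= threshold:
            continue                      # upper limit reached, do not share
        if blk["data"] == data:
            blk["refs"] += 1              # add 1 to the reference count
            inode[key] = addr
            return addr
    # No shareable block found: allocate a new one (initial count of 1 is assumed).
    blocks.append({"checksum": checksum, "data": data, "refs": 1})
    inode[key] = len(blocks) - 1
    return inode[key]
```

With a threshold of 2, four writes of identical content end up in two blocks, each referenced twice, instead of one block referenced four times.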

In one possible implementation, before the acquiring of the first data to be stored in the target storage space, the method further includes: after an instruction to modify third data into fourth data is received, acquiring a second reference count of a third data block corresponding to the third data, where the second reference count is the number of times the address information of the third data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording the association between a data block and the number of times the address information of that data block is repeatedly referenced; in a case where it is determined that the second reference count is equal to 1, taking the fourth data as the first data and deleting information related to the third data; and in a case where it is determined that the second reference count is greater than 1, taking the fourth data as the first data and subtracting 1 from the second reference count in the repeat count table.

Optionally, the deleting the information related to the third data may be to replace (modify) address information of the data block of the third data recorded in the index node with address information of the data block of the fourth data; deleting the data block corresponding to the third data; and deleting the record related to the third data in the repeat count table.

Optionally, the deleting of the information related to the third data may be: deleting the data block information corresponding to the third data recorded in the index node (the data block information includes the association, recorded in the index node, between the third data and the address information of the data block corresponding to the third data); adding to the index node the association between the fourth data and the address information of the data block of the fourth data; deleting the data block corresponding to the third data; and deleting the record related to the third data in the second information table.

Optionally, if no data block in the storage space has the same data content as the fourth data, the deleting of the information related to the third data may instead be: replacing (modifying) the content of the data block of the third data with the data content of the fourth data, and replacing (modifying) the checksum corresponding to the third data in the second information table with the checksum corresponding to the fourth data.

In the embodiment of the application, on the one hand, if the third data is referenced at multiple locations when it is modified, only the second reference count is reduced by 1 and the data content of the third data block is not operated on directly. This ensures that other data that uses the address information of the same data block is not corrupted while the third data is modified, so the integrity of the data under the deduplication mechanism is preserved. On the other hand, the modified fourth data is treated as the first data and goes through the deduplication check, that is, it is checked whether a data block with the same data content as the fourth data is already stored in the target storage space. Deduplication is thus performed in every scenario where data needs to be stored, which further improves the utilization of the storage space.
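The modification path can be sketched as below. The `Store` class and its dictionaries are hypothetical stand-ins for the data blocks, the repeat count table, and the index node; the checksum is again assumed to be `zlib.crc32`.

```python
import itertools
import zlib


class Store:
    """Toy store with a repeat count table (all names are hypothetical)."""

    def __init__(self):
        self.blocks = {}                 # address -> (checksum, content)
        self.refs = {}                   # repeat count table: address -> count
        self.inode = {}                  # data key -> address
        self._next = itertools.count()   # source of fresh block addresses

    def write(self, key, data):
        checksum = zlib.crc32(data)
        for addr, (c, content) in self.blocks.items():
            if c == checksum and content == data:
                self.refs[addr] += 1
                self.inode[key] = addr
                return addr
        addr = next(self._next)
        self.blocks[addr] = (checksum, data)
        self.refs[addr] = 1
        self.inode[key] = addr
        return addr

    def modify(self, key, new_data):
        old = self.inode[key]
        if self.refs[old] == 1:
            # The old block is referenced only here: remove it and its record.
            del self.blocks[old]
            del self.refs[old]
        else:
            # The old block is shared: leave its content alone, drop one reference.
            self.refs[old] -= 1
        # The modified data then goes through the normal deduplicating write.
        return self.write(key, new_data)
```

Modifying one of two keys that share a block leaves the shared block untouched for the other key, and the new content is itself deduplicated against existing blocks.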

In some possible embodiments, the method further includes: after an instruction to delete fifth data is received, acquiring a third reference count of a fifth data block corresponding to the fifth data, where the third reference count is the number of times the address information of the fifth data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording the association between a data block and the number of times the address information of that data block is repeatedly referenced; deleting information related to the fifth data in a case where it is determined that the third reference count is equal to 1; and subtracting 1 from the third reference count in the repeat count table in a case where it is determined that the third reference count is greater than 1.

In the embodiment of the application, when the fifth data is referenced at multiple locations, only the third reference count is reduced by 1 and the data content of the fifth data block is not operated on directly. This ensures that the other data referencing the target data block corresponding to the fifth data is not affected, so the integrity of the data under the deduplication mechanism is preserved.
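A minimal sketch of the deletion rule follows. The function name `delete_data` and the three dictionaries are hypothetical, and removing the key from the index node in both branches is an assumption about what deleting the data entails.

```python
def delete_data(blocks, refs, inode, key):
    """Delete the data stored under `key`.

    blocks: address -> content, refs: address -> reference count,
    inode: data key -> address (all hypothetical structures).
    """
    addr = inode.pop(key)        # the deleted data no longer points at the block
    if refs[addr] == 1:
        del blocks[addr]         # last reference: the block itself can go
        del refs[addr]
    else:
        refs[addr] -= 1          # still shared: only drop one reference
```

For example, deleting one of two keys that share a block only lowers its count, while deleting the sole owner of a block frees the block.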

In some possible embodiments, the recording, in the case that it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of a second data block of the M data blocks and the data content of the second data block is the same as the data content of the first data, the address information of the second data block as the address information of the data block storing the first data into an index node includes: calculating a first identifier of the data content of the first data and a second identifier of the data content of a second data block of the M data blocks in the case that it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of the second data block; determining that the data content of the second data block is the same as the data content of the first data in the case that the first identifier is determined to be the same as the second identifier; and in the case that the data content of the second data block is determined to be the same as the data content of the first data, recording the address information of the second data block into an index node as the address information of the data block storing the first data.

Optionally, whether the data content of the first data is the same as the data content of the second data block may be determined by checking whether the first identifier is the same as the second identifier. It may also be determined by calling a compare function or an equals method, which is widely applicable.

In the embodiment of the application, on the one hand, when identifiers are used to compare data content, a calculated identifier can be stored, so it is calculated once and used many times, which reduces unnecessary repeated calculation. On the other hand, comparing identifiers consumes less performance than directly comparing two pieces of data of data-block size, for example by calling a compare function or an equals method. Comparing identifiers also places lower technical demands on implementers than comparing data content by calling a compare function or an equals method.
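The idea of calculating an identifier once and reusing it can be sketched as follows. The application does not name the identifier algorithm here, so the use of SHA-256 (via `hashlib`) is an assumption, as are the names `Block` and `same_content`.

```python
import hashlib
import zlib


class Block:
    """A stored data block with a lazily computed, cached identifier."""

    def __init__(self, data: bytes):
        self.data = data
        self.checksum = zlib.crc32(data)
        self._identifier = None          # filled in on first use, then reused

    def identifier(self) -> bytes:
        if self._identifier is None:
            self._identifier = hashlib.sha256(self.data).digest()
        return self._identifier


def same_content(first_data: bytes, block: Block) -> bool:
    # Cheap filter first: different checksums mean different content.
    if zlib.crc32(first_data) != block.checksum:
        return False
    # Checksums match: compare identifiers instead of the raw content.
    return hashlib.sha256(first_data).digest() == block.identifier()
```

The identifier of a stored block is only calculated when a checksum match makes it necessary, and later comparisons against the same block reuse the cached value.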

In some possible embodiments, the checksum comprises a cyclic redundancy check code.

Optionally, the checksum of the first data may be calculated as a cyclic redundancy check code (for example, CRC32), in which case the CRC code is the value of the checksum. Optionally, the checksum of the first data may also be calculated as follows: the data content of the first data is divided into 2-byte units, each of which forms a 16-bit value, and if a single byte remains at the end, one zero byte is appended to form a 2-byte unit; all 16-bit values are accumulated into a 32-bit value; the upper 16 bits and the lower 16 bits of the 32-bit value are added to obtain a new 32-bit value, and if the new value is greater than 0xFFFF, its upper 16 bits and lower 16 bits are added again; finally, the result is inverted bit by bit to obtain the value of the checksum.
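The 16-bit accumulation described above resembles the ones'-complement Internet checksum, and can be sketched as follows. The function name `checksum16` and the big-endian byte order within each 2-byte unit are assumptions; the text does not fix a byte order.

```python
def checksum16(data: bytes) -> int:
    """16-bit folded-sum checksum following the steps described above."""
    if len(data) % 2:
        data += b"\x00"                          # pad a trailing single byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # big-endian 16-bit value (assumed)
    # Python integers do not overflow, so folding is repeated until the
    # value fits in 16 bits rather than relying on a fixed 32-bit register.
    while total > 0xFFFF:
        total = (total >> 16) + (total & 0xFFFF)
    return ~total & 0xFFFF                       # bitwise inversion, kept to 16 bits
```

For the bytes `00 01 f2 03 f4 f5 f6 f7` the 16-bit values sum to `0x2DDF0`, which folds to `0xDDF2` and inverts to `0x220D`.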

In a second aspect, an embodiment of the present application provides a data deduplication apparatus, including:

a first acquisition unit, configured to acquire first data to be stored in a target storage space, where M data blocks have been stored in the target storage space, and M is a positive integer;

a calculation unit, configured to calculate a checksum of data content of the first data;

a first storage unit, configured to, when it is determined that the checksum of the data content of the first data is not the same as the checksum of the data content of the M data blocks, allocate a first data block to the first data, and store the first data in the first data block;

and a second storage unit, configured to, when it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of a second data block of the M data blocks and the data content of the second data block is the same as the data content of the first data, record address information of the second data block as address information of the data block storing the first data in an index node, where the index node is configured to record an association relationship between data and address information of the data block storing the data.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors and memory; the memory coupled with the one or more processors is configured to store computer program code comprising computer instructions that are invoked by the one or more processors to cause the electronic device to perform the first aspect or the method of any possible implementation of the first aspect.

In a fourth aspect, an embodiment of the present application provides a chip system, where the chip system is applied to an electronic device, and the chip system includes one or more processors, and the processors are configured to invoke computer instructions to cause the electronic device to execute the method shown in the first aspect or any possible implementation manner of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product containing instructions, which when run on an electronic device, cause the electronic device to perform the method of the first aspect or any possible implementation manner of the first aspect.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes instructions, and is characterized in that when the instructions are executed on an electronic device, the electronic device is caused to execute the method shown in the first aspect or any possible implementation manner of the first aspect.

Drawings

FIG. 1A is a diagram illustrating an index structure of an address entry of an inode of a first file according to an embodiment of the present disclosure;

FIGS. 1B to 1C are schematic diagrams illustrating that data in a file A is stored by using a data deduplication method provided in the present application according to an embodiment of the present application;

FIGS. 2A-2G are schematic diagrams of user interfaces provided by embodiments of the present application;

FIG. 3A is a block diagram of a system for data deduplication according to an embodiment of the present disclosure;

FIG. 3B is a schematic diagram of a first information table according to an embodiment of the present application;

FIG. 3C is a schematic diagram of a second information table according to an embodiment of the present application;

FIG. 3D is a diagram illustrating a second information table according to an embodiment of the present application;

FIG. 3E is a schematic diagram of initial values of a data block table and a second information table according to an embodiment of the present application;

FIGS. 4A-4B are schematic flow charts of a data deduplication method according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a data deduplication method for modifying third data according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a data deduplication method for deleting fifth data according to an embodiment of the present application;

FIG. 7 is a schematic flowchart of another data deduplication method provided in the embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clear, the present application will be further described with reference to the accompanying drawings.

The terms "first" and "second," and the like in the description, claims, and drawings of the present application are used solely to distinguish between different objects and not to describe a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the steps or elements listed, but may optionally include other steps or elements that are not listed or that are inherent to such a process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

In this application, "at least one" means one or more, "a plurality" means two or more, and "at least two" means two, or three or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c".

The terms referred to in the present application are described in detail below.

(1) Checksum:

In the embodiment of the present application, the checksum is the sum of a group of data items, used for verification in the fields of data processing and data communication. In the embodiment of the application, when first data needs to be stored in a target storage space, it is determined whether the checksum of the first data is the same as the checksum of a second data block in the target storage space, so as to determine whether the data content of the first data and the data content of the second data block may be the same.

Optionally, the checksum of the first data may be calculated as a cyclic redundancy check code (for example, CRC32), in which case the CRC code is the value of the checksum. Specifically, suppose the information field corresponding to the first data (the information field is the binary code sequence of the first data) is a K-bit binary code sequence, the check field is an R-bit binary code sequence, and the generator polynomial of degree R is g(x). The CRC code is calculated by appending R zeros to the K-bit information field and dividing the result, using modulo-2 division, by the binary code sequence corresponding to g(x); the remainder is the binary code sequence corresponding to the CRC code (this sequence should be R bits long; if it is shorter, zeros are padded at the high-order end).
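The division step can be made concrete with a small bit-string sketch. The function name `crc_remainder` and the string representation of bits are illustrative assumptions. This is the plain textbook division only; practical CRC32 implementations such as `zlib.crc32` add further conventions (initial value, bit reflection, final XOR) and will not produce the same bits.

```python
def crc_remainder(message_bits: str, generator_bits: str) -> str:
    """Modulo-2 long division of message_bits (with R zeros appended)
    by generator_bits, returning the R-bit remainder as a bit string."""
    r = len(generator_bits) - 1               # degree R of g(x)
    work = list(message_bits + "0" * r)       # append R zeros to the K-bit field
    for i in range(len(message_bits)):
        if work[i] == "1":                    # divisor "goes into" this position
            for j, g in enumerate(generator_bits):
                # XOR the generator into the working bits (modulo-2 subtraction)
                work[i + j] = "0" if work[i + j] == g else "1"
    return "".join(work[-r:])                 # the last R bits are the remainder
```

For the 14-bit message `11010011101100` and g(x) = x³ + x + 1 (bits `1011`), the 3-bit remainder is `100`.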

Optionally, the checksum of the first data may also be calculated as follows: the data content of the first data is divided into 2-byte units, each of which forms a 16-bit value, and if a single byte remains at the end, one zero byte is appended to form a 2-byte unit; all 16-bit values are accumulated into a 32-bit value; the upper 16 bits and the lower 16 bits of the 32-bit value are added to obtain a new 32-bit value, and if the new value is greater than 0xFFFF, its upper 16 bits and lower 16 bits are added again; finally, the result is inverted bit by bit to obtain the checksum value.

As can be seen from the above calculation methods, two pieces of data with different checksums necessarily have different data contents, whereas two pieces of data with the same checksum may have the same or different data contents. It is to be understood that the checksum may also be calculated by other methods, which is not limited in the embodiment of the present application.

For example, if the checksum of the first data is different from the checksum of the second data block, the data content of the first data is necessarily different from the data content of the second data block; if the checksum of the first data is the same as the checksum of the second data block, the data content of the first data may be the same as or different from the data content of the second data block.
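The asymmetry between "different checksum" and "same checksum" can be seen with a deliberately weak checksum. The byte-sum function `byte_sum` below is a simplified stand-in chosen so that a collision is easy to show; it is not one of the calculation methods described above.

```python
def byte_sum(data: bytes) -> int:
    """Deliberately weak checksum: sum of all bytes, kept to 16 bits."""
    return sum(data) & 0xFFFF


def may_be_same(x: bytes, y: bytes) -> bool:
    """True when a full content comparison is still needed."""
    return byte_sum(x) == byte_sum(y)


a, b, c = b"ab", b"ba", b"ac"

# Different checksums: the contents are certainly different, no comparison needed.
certainly_different = not may_be_same(a, c)

# Same checksum but different content: only a content comparison can tell.
collision = may_be_same(a, b) and a != b
```

Here `a` and `c` are ruled out by the checksum alone, while `a` and `b` pass the checksum filter and are only told apart by comparing their contents.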

Understandably, the second data block is a data block in which second data is stored in the target storage space, and the data content of the second data block is the data content of the second data. It is understood that the checksum of the first data refers to the checksum of the data content of the first data (the definition of "the checksum of the first data" is the same herein), and the checksum of the second data block refers to the checksum of the data content of the second data block (the definition of "the checksum of the second data block" is the same herein).

(2) Index node (inode):

Understandably, one first file corresponds to one inode, and the inode includes the meta-information and the address entries of the first file. The meta-information of the first file includes the creator of the first file, the creation date of the first file, the size of the first file, the inode number of the first file, and the like. The inode number of the first file is what allows a user to open the first file by its file name. Specifically, the file system looks up the inode number corresponding to the file name, obtains the inode information through the inode number, locates the data blocks where the first file is stored according to the inode information, and reads out the data.
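The lookup chain from file name to data can be sketched with three dictionaries. The names `directory`, `inodes`, `data_blocks`, and `read_file`, as well as the sample values, are invented for illustration and do not describe a real file system's on-disk layout.

```python
# Hypothetical in-memory stand-ins for file system structures.
directory = {"report.txt": 12}                   # file name -> inode number
inodes = {12: {"size": 8, "blocks": [3, 7]}}     # inode number -> inode information
data_blocks = {3: b"dedu", 7: b"ped!"}           # block address -> block content


def read_file(name: str) -> bytes:
    inode_number = directory[name]               # 1. file name -> inode number
    inode = inodes[inode_number]                 # 2. inode number -> inode information
    content = b"".join(data_blocks[addr]         # 3. follow the address entries
                       for addr in inode["blocks"])
    return content[:inode["size"]]               # 4. trim to the recorded file size
```

Because the inode only records block addresses, two inodes (or two address entries of one inode) can point at the same address, which is what the deduplication method relies on.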

In the embodiment of the present application, the address entry of the inode is used to record address information of the data block corresponding to a logical block number in the target storage space. The data block corresponding to the logical block number stores the data content corresponding to that logical block number in the first file. The index structure of the address entries included in the inode of the first file may be a direct index structure, a primary indirect index structure, a secondary indirect index structure, or a tertiary indirect index structure.

Understandably, a direct index structure is an index structure that includes one or more direct index blocks; a primary indirect index structure is an index structure that includes one or more direct index blocks and one or more primary indirect index blocks; a secondary indirect index structure is an index structure that includes one or more direct index blocks, one or more primary indirect index blocks, and one or more secondary indirect index blocks; and a tertiary indirect index structure is an index structure that includes one or more direct index blocks, one or more primary indirect index blocks, one or more secondary indirect index blocks, and one or more tertiary indirect index blocks.

The direct index block is used for recording address information of a target data block, and the target data block is a data block storing data content; that is, one direct index block is an address entry for recording address information of the target data block. The primary indirect index block is used for recording index block address information of a first index block, and the first index block is used for recording address information of a plurality of target data blocks (namely, one first index block includes a plurality of address entries). The secondary indirect index block is used for recording index block address information of a second index block, and the index block address information of a plurality of first index blocks is recorded in the second index block. The tertiary indirect index block is used for recording index block address information of a third index block, and the index block address information of a plurality of second index blocks is recorded in the third index block.

Specifically, among the address entries included in the inode of the file system, M address entries are direct index blocks, J address entries are primary indirect index blocks, G address entries are secondary indirect index blocks, and L address entries are tertiary indirect index blocks. One address entry in the file system is X bytes in size, an index block of the target storage space is N in size, and a data block is H in size, so one index block can store N/X address entries. It is understood that the address entry size refers to the size of an address entry used for storing the address information of an index block or the address information of a data block. Here M is an integer greater than 0; J, G and L are integers greater than or equal to 0; and N, X and H are greater than 0.

Specifically, the M address entries that are direct index blocks may be used to record the address information of the data blocks with logical block numbers 0 to (M-1), and the data blocks with logical block numbers 0 to (M-1) can accommodate at most M × H of data content in total.

Specifically, the J address entries that are primary indirect index blocks may be used to record the index block address information of J first index blocks. Each first index block can record N/X address entries, so the J first index blocks can be used to record the address information of the data blocks with logical block numbers M to J × N/X + (M-1), and these data blocks can accommodate at most J × N/X × H of data content in total.

Specifically, the G address entries that are secondary indirect index blocks may be used to record the index block address information of G second index blocks. Each second index block can record the index block address information of N/X first index blocks, and each first index block can record N/X address entries, so each second index block can be used to record (N/X)² address entries. That is, the G second index blocks can be used to record the address information of the data blocks with logical block numbers J × N/X + M to G × (N/X)² + J × N/X + (M-1), and these data blocks can accommodate at most G × (N/X)² × H of data content in total.

Specifically, the L address entries that are tertiary indirect index blocks may be used to record the index block address information of L third index blocks. Each third index block can record the index block address information of N/X second index blocks, each second index block can record the index block address information of N/X first index blocks, and each first index block can record N/X address entries, so each third index block can be used to record (N/X)³ address entries. That is, the L third index blocks can be used to record the address information of the data blocks with logical block numbers G × (N/X)² + J × N/X + M to L × (N/X)³ + G × (N/X)² + J × N/X + (M-1), and these data blocks can accommodate at most L × (N/X)³ × H of data content in total.
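The logical block number ranges covered by the four kinds of address entries can be computed with a short sketch like the one below. The function name and the small parameter values are hypothetical and chosen only to keep the numbers easy to check by hand; they are not values used by any real file system.

```python
def logical_block_ranges(M, J, G, L, N, X):
    """Return the inclusive (first, last) logical block numbers covered by each
    level, or None for a level that has no address entries."""
    per_index_block = N // X                       # address entries per index block
    counts = {
        "direct": M,
        "primary": J * per_index_block,
        "secondary": G * per_index_block ** 2,
        "tertiary": L * per_index_block ** 3,
    }
    ranges, first = {}, 0
    for level, count in counts.items():
        ranges[level] = (first, first + count - 1) if count else None
        first += count                             # next level starts where this one ends
    return ranges

# Hypothetical small parameters: N/X = 16/4 = 4 address entries per index block.
ranges = logical_block_ranges(M=2, J=1, G=1, L=1, N=16, X=4)
```

With these parameters the direct entries cover blocks 0 to 1, the primary indirect entry covers the next 4 blocks, the secondary the next 16, and the tertiary the next 64.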

As can be seen from the above, when the size of the first file is less than or equal to M × H, the address entries of the inode select the direct index structure; when the size of the first file is greater than M × H and less than or equal to M × H + J × N/X × H, the address entries of the inode select the primary indirect index structure; when the size of the first file is greater than M × H + J × N/X × H and less than or equal to M × H + J × N/X × H + G × (N/X)² × H, the secondary indirect index structure is selected; when the size of the first file is greater than M × H + J × N/X × H + G × (N/X)² × H and less than or equal to M × H + J × N/X × H + G × (N/X)² × H + L × (N/X)³ × H (for convenience of description, this value is denoted as F), the tertiary indirect index structure is selected (hereinafter, selecting the index structure of the address entries of the inode of a file according to the file size has the same meaning).
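The size thresholds above can be expressed as a cumulative comparison. The following is a minimal sketch under the same definitions of M, J, G, L, N, X and H; the function name and the error raised for sizes above F are illustrative assumptions.

```python
def select_index_structure(file_size, M, J, G, L, N, X, H):
    """Pick the index structure for a file of the given size (same unit as H)."""
    per_index_block = N // X
    capacities = [
        ("direct", M * H),
        ("primary indirect", J * per_index_block * H),
        ("secondary indirect", G * per_index_block ** 2 * H),
        ("tertiary indirect", L * per_index_block ** 3 * H),
    ]
    limit = 0
    for name, capacity in capacities:
        limit += capacity                  # cumulative threshold for this level
        if file_size <= limit:
            return name
    raise ValueError("file size exceeds F, the maximum the index structure can record")

# Hypothetical small parameters: thresholds are 32, 96, 352 and F = 1376.
params = dict(M=2, J=1, G=1, L=1, N=16, X=4, H=16)
choice = select_index_structure(97, **params)
```

A level whose count is 0 adds nothing to the running threshold, so it is never selected.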

Understandably, when the maximum size addressable with an address entry of X bytes, namely 2^(8X) × H, is greater than or equal to the value F, the data content that can be accommodated is calculated by the above formula; when 2^(8X) × H is smaller than the value F, the maximum data content that can be accommodated equals 2^(8X) × H, the maximum that the address entry can record. Illustratively, if the size of the address entry is 4 bytes (4 bytes equals 32 bits) and the size of the data block is 4k, the data block address information recordable in the address entry ranges from 0 to 2³² - 1, that is, 2³² data blocks, and the data blocks corresponding to this address information can accommodate at most 2³² × 4k of data content. If the maximum data content F that can be recorded by the index structure is less than or equal to 2³² × 4k, the maximum data content that can be recorded by the index structure is given by the above calculation formula; if F is greater than 2³² × 4k, the maximum data content that can be recorded by the index structure is 2³² × 4k.
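The interaction between the index structure capacity F and the width of an address entry can be sketched as a simple minimum. The function name is hypothetical, sizes are assumed to be in bytes, and the parameter values in the usage lines are illustrative rather than those of a particular file system.

```python
def max_data_content(M, J, G, L, N, X, H):
    """Maximum data content (in bytes) a file can hold, limited both by the
    index structure (F) and by what an X-byte address entry can address."""
    per_index_block = N // X
    F = (M + J * per_index_block + G * per_index_block ** 2
         + L * per_index_block ** 3) * H
    addressable = 2 ** (8 * X) * H         # 2^(8X) data blocks of H bytes each
    return min(F, addressable)

# With one tertiary entry the index structure is the limit;
# with eight tertiary entries the 4-byte address entry becomes the limit.
limited_by_index = max_data_content(M=12, J=1, G=1, L=1, N=4096, X=4, H=4096)
limited_by_address = max_data_content(M=12, J=1, G=1, L=8, N=4096, X=4, H=4096)
```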

Illustratively, as shown in fig. 1A, the inode of the file system includes 8 address entries, where 4 address entries are direct index blocks, 2 address entries are primary indirect index blocks, 1 address entry is a secondary indirect index block, and 1 address entry is a tertiary indirect index block. Each address entry is 4 bytes in size, and the index block size and the data block size are both 1k. One index block may store 1k/4 = 256 address entries.

The 4 address entries that are direct index blocks may be used to record the address information of the data blocks with logical block numbers 0 to 3, namely block0, block1, block2, and block3. For example, if the address entry for logical block number block0 stores addr1, the data block whose address information is addr1 is the data block pointed to by that address entry. The data blocks with logical block numbers 0 to 3 can accommodate at most 4 × 1k = 4k of data content in total.

The first of the 2 address entries that are primary indirect index blocks may be used to record the index block address information of a first index block, and that first index block may be used to record the address information of 256 data blocks, namely the data blocks with logical block numbers 4 to 259 (256 + 3 = 259). The second of the 2 address entries may be used to record the index block address information of another first index block, which records the address information of the data blocks with logical block numbers 260 to 515 (2 × 256 + 3 = 515). That is, the primary indirect index blocks can record the address information of 2 × 256 = 512 data blocks in total, namely the data blocks with logical block numbers 4 to 515. The data blocks with logical block numbers 4 to 515 can accommodate at most 512 × 1k = 512k of data content in total.

The 1 address entry that is a secondary indirect index block is used to record the index block address information of a second index block. The second index block records the index block address information of 256 first index blocks, and each first index block records the address information of 256 data blocks. That is, the secondary indirect index block records the address information of 1 × 256 × 256 = 65536 data blocks in total, namely the data blocks with logical block numbers 516 to 66051 (65536 + 515 = 66051). The data blocks with logical block numbers 516 to 66051 can accommodate at most 65536 × 1k = 65536k of data content in total.

The 1 address entry that is a tertiary indirect index block is used to record the index block address information of a third index block. The third index block records the index block address information of 256 second index blocks, each second index block records the index block address information of 256 first index blocks, and each first index block records the address information of 256 data blocks. That is, the tertiary indirect index block records the address information of 1 × 256 × 256 × 256 = 16777216 data blocks in total, namely the data blocks with logical block numbers 66052 to 16843267 (16777216 + 66051 = 16843267). The data blocks with logical block numbers 66052 to 16843267 can accommodate at most 16777216 × 1k = 16777216k of data content in total.

From the above, when the first file size is less than or equal to 4k, the address entries of the inode select the direct index structure. When the first file size is greater than 4k and less than or equal to 512 + 4 = 516k, the address entries of the inode select the primary indirect index structure; when the first file size is greater than 516k and less than or equal to 65536 + 516 = 66052k, the address entries of the inode select the secondary indirect index structure; when the first file size is greater than 66052k and less than or equal to 16777216 + 66052 = 16843268k, the tertiary indirect index structure is selected.

It will be appreciated that the values of M, J, G, L, N, X and H depend on the particular file system. For example, for the EXT4 file system, M is 12, J is 1, G is 1, L is 1, N is 1k, 2k, or 4k, X is 4 bytes, and H is 1k, 2k, or 4k. For the F2FS file system, M is 929, J is 2, G is 2, L is 1, N is 4k, X is 4 bytes, and H is 4k.

In this embodiment, the data content of the first file corresponding to a logical block number is the data content from a target start byte to a target end byte in the first file. That is, there is an association between the logical block number and the data content from the target start byte to the target end byte in the first file. It is understood that the values of the target start byte and the target end byte are counted relative to the 0th byte of the first file. Specifically, logical block number 0 is associated with the data content from target start byte 0 to target end byte H in the first file; logical block number 1 is associated with the data content from target start byte (H+1) to target end byte 2H in the first file; and so on. According to this association and the logical block number, the data block in the target storage space that stores target data of the first file can be found. Specifically, the target logical block number in which the target data is located is determined according to the target start byte and the target end byte of the target data, and the address information of the data block corresponding to the target logical block number is then looked up in the address entries of the inode of the first file. It can be understood that the address information of a target data block is the offset of the target data block in a data block table (the data block table records the offset of each data block in the table together with the corresponding data content). Illustratively, the target storage space includes 100 data blocks, which are numbered 0 to 99 in the data block table and stored row by row. If the offset of the data block numbered 0 (i.e. the first row of the data block table) is 0, the address information of the data block numbered 0 is 0; if the offset of the data block numbered 1 (i.e. the second row of the data block table) is 1, the address information of the data block numbered 1 is 1; and so on. Therefore, the data content corresponding to a target data block can be found in the data block table through the address information of the target data block. (Hereinafter, looking up the address information of the target data block according to the target logical block number, and then looking up the data content of the target data block according to that address information, has the same meaning as described here.)

For example, in the case that the structure of the address entries of the inode is as shown in fig. 1A, if the target start byte of the target data is byte 500 and the target end byte is byte 1000, then since 500 bytes > 0 bytes and 1000 bytes < 1024 bytes, and logical block number 0 is associated with the data content from target start byte 0 to target end byte 1024 (1024 bytes = 1k), the target logical block number occupied by the target data is 0. The address entry corresponding to logical block number 0 stores the address information of the corresponding data block, addr1, and the data block storing the target data can be found according to this address information.

Illustratively, if the target start byte of the target data is byte 513025 and the target end byte is byte 514030, then since 513025 bytes > 501k and 514030 bytes < 502k, and logical block number 501 is associated with the data content from target start byte (501k + 1) to target end byte 502k, the target logical block number corresponding to the target data is 501.

It will be appreciated that the target data may correspond to two or more logical block numbers. Illustratively, suppose the target start byte of the target data is byte 500 and the target end byte is byte 3000. Since 1k > 500 bytes > 0 bytes and 3000 bytes < 3k, and logical block number 0 is associated with the data content from target start byte 0 to target end byte 1k, logical block number 1 is associated with the data content from target start byte (1k + 1) to target end byte 2k, and logical block number 2 is associated with the data content from target start byte (2k + 1) to target end byte 3k, the target logical block numbers occupied by the target data are 0, 1 and 2.
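The mapping from a byte range to the logical block numbers it occupies can be sketched as below. The function name is hypothetical, and the arithmetic follows the byte convention used in the examples above (logical block number 0 holds bytes 0 to H, logical block number n holds bytes n × H + 1 to (n + 1) × H); a file system that counts bytes from 0 with half-open ranges would simply use integer division by H instead.

```python
def logical_block_numbers(start_byte, end_byte, H):
    """Logical block numbers occupied by the bytes start_byte..end_byte,
    where block 0 holds bytes 0..H and block n holds bytes n*H+1..(n+1)*H."""
    def block_of(byte):
        return max(0, (byte - 1) // H)     # byte 0 is assigned to block 0
    return list(range(block_of(start_byte), block_of(end_byte) + 1))

H = 1024                                   # 1k data blocks, as in fig. 1A
single = logical_block_numbers(500, 1000, H)
far = logical_block_numbers(513025, 514030, H)
several = logical_block_numbers(500, 3000, H)
```

The three calls correspond to the three examples above and give logical block numbers 0, then 501, then 0, 1 and 2.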

It can be understood that the data content corresponding to a logical block number may be modified, or the data content corresponding to an earlier logical block number may be modified or deleted, so that the first target start byte of the data content corresponding to the logical block number in the first file changes to a second target start byte and the first target end byte changes to a second target end byte. In that case, a new association is established between the logical block number and the data content from the second target start byte to the second target end byte.

Illustratively, suppose the data content corresponding to the logical block number originally has a first target start byte of 50 and a first target end byte of 100 in the first file (i.e. 50 bytes of data content in total, from byte 50 to byte 100). If this data content is modified so that it totals 20 bytes, and 5 bytes are deleted from the data content of the preceding logical block number, the second target start byte becomes 45 (50 - 5 = 45) and the second target end byte becomes 65 (45 + 20 = 65). The association between the logical block number and the data content from the first target start byte (50) to the first target end byte (100) is thus changed to the association between the logical block number and the data content from the second target start byte (45) to the second target end byte (65).
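The arithmetic of this re-association can be written out as a small helper. The function and parameter names are hypothetical and serve only to restate the example; the byte convention is the one used in the example itself.

```python
def reassociate(first_start, bytes_deleted_before, new_length):
    """Return the (second target start byte, second target end byte) after
    bytes are deleted before the block and the block's content is resized."""
    second_start = first_start - bytes_deleted_before
    second_end = second_start + new_length
    return second_start, second_end

# Example from the text: start byte 50, 5 bytes deleted earlier, content now 20 bytes.
second_start, second_end = reassociate(first_start=50, bytes_deleted_before=5, new_length=20)
```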

In some implementations of data deduplication, a unique identifier is formed for a first data block to be stored in a file and a second data block already stored in a target storage space by a hash algorithm or a fingerprint algorithm, and whether the data is duplicated is determined by the identifier. Specifically, when the identifier of the second data block is needed, calculating a second identifier of the data content of the second data block, and comparing the second identifier with the first identifier of the first data until a target second identifier equal to the first identifier is found; or, the second identifiers of all the second data blocks are calculated in advance, the association relationship between the second data blocks and the second identifiers is recorded in the target table, and the target second identifiers equal to the first identifiers are searched by traversing the target table.

If the calculation of the second identifier other than the target second identifier is referred to as invalid calculation, the calculation amount and power consumption of the invalid calculation are large in any of the above manners.

By adopting the data deduplication method provided by the application, whether the data content of the first data and the data content of the second data block are possibly the same is judged through the checksum; in the case where the data content of the first data and the data content of the second data block may be the same, it is further confirmed whether the data content of the first data and the data content of the second data block are the same. The calculation amount and the power consumption generated by comparing the data content of the first data with the data content of the second data block under the condition that the data content of the first data is different from the data content of the second data block are reduced, the utilization rate of a storage space is improved, and meanwhile, the performance loss is reduced.
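The two-stage check described above (a checksum comparison first, a confirmation of the data content only on a checksum match) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the additive 16-bit checksum, the `BlockStore` class, the in-memory checksum table, and the use of a direct byte comparison for the confirmation step are all choices made for the sketch.

```python
def checksum(data: bytes) -> int:
    """Assumed checksum: sum of all bytes, truncated to 16 bits."""
    return sum(data) & 0xFFFF

class BlockStore:
    def __init__(self):
        self.blocks = []         # data block table: address -> data content
        self.by_checksum = {}    # checksum -> addresses of blocks with that checksum

    def store(self, data: bytes) -> int:
        """Store data and return the data block address to record in the inode."""
        bucket = self.by_checksum.setdefault(checksum(data), [])
        for address in bucket:
            if self.blocks[address] == data:   # confirm: equal checksums may still differ
                return address                 # duplicate: reuse the existing data block
        self.blocks.append(data)               # no duplicate: allocate a new data block
        bucket.append(len(self.blocks) - 1)
        return len(self.blocks) - 1

store = BlockStore()
first = store.store(b"ab")       # new content
second = store.store(b"ba")      # same checksum as b"ab", different content
third = store.store(b"ab")       # duplicate of the first block
```

In this run `b"ba"` collides with `b"ab"` on the checksum, so it reaches the confirmation step and is stored as a new block, while the second `b"ab"` is resolved to the address of the first.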

Illustratively, suppose the target storage space includes 100 second data blocks in which data content is stored, and the data content of the first data is the same as the data content of the 80th second data block. If a hash algorithm or a fingerprint algorithm is used to judge whether the data content is duplicated, the identifiers of at least 80 second data blocks must be calculated before the target second identifier equal to the first identifier is found, and the invalid calculation includes calculating the identifiers of at least 79 second data blocks. As can be seen, the calculation amount and power consumption of the invalid calculation are large. With the method provided by the present application, the checksums of 80 second data blocks are calculated; if the checksums of 10 of those second data blocks are the same as the checksum of the first data, the identifiers of only those 10 second data blocks are then calculated, so the invalid calculation includes calculating the checksums of 80 data blocks and the identifiers of 9 data blocks.

Because calculating a checksum involves only logical operations on binary bits, the calculation amount generated by invalid calculation is very small; the checksum calculation is also simple and easy to understand, and places low technical demands on developers. In contrast, when a unique identifier is calculated with a hash algorithm, a fingerprint algorithm, or a similar algorithm, the invalid calculation involves algorithms of high complexity, a large calculation amount, and high power consumption.
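To make the contrast concrete, the sketch below places two simple checksums next to a hash-based identifier. The text above does not fix a particular checksum or hash algorithm, so the additive checksum, the XOR checksum, and the use of SHA-256 from Python's standard `hashlib` module are assumptions made only for illustration.

```python
import hashlib

def additive_checksum(data: bytes) -> int:
    """Sum of all bytes, truncated to 32 bits (one addition per byte)."""
    return sum(data) & 0xFFFFFFFF

def xor_checksum(data: bytes) -> int:
    """XOR of all bytes (one bitwise operation per byte)."""
    value = 0
    for byte in data:
        value ^= byte
    return value

def identifier(data: bytes) -> bytes:
    """Hash-based identifier; far more work per byte than either checksum."""
    return hashlib.sha256(data).digest()

sample = b"\x01\x02\x03"
sums = (additive_checksum(sample), xor_checksum(sample))
```

Both checksums give the same value for `b"ab"` and `b"ba"`, which is why a checksum match only indicates that two data contents may be the same and must be followed by a confirmation step.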

In particular, the method provided by the present application is most beneficial when most of the data in a file is not duplicated and the checksums of different data contents happen to differ. Illustratively, suppose the target storage space includes 1000 second data blocks already used for storing data content, and none of them has the same data content as the first data. Using a hash algorithm or a fingerprint algorithm to judge whether the data content is duplicated then requires calculating the identifiers of all 1000 second data blocks, and all of this calculation is invalid, which results in a huge calculation amount and power consumption. If the checksums of the 1000 second data blocks all differ from the checksum of the first data, the method provided by the present application only needs to calculate the checksums of the 1000 second data blocks; because calculating a checksum is simpler than calculating a unique identifier with a hash algorithm or a fingerprint algorithm, the calculation amount and power consumption generated by invalid calculation are greatly reduced.
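The operation counts in the two illustrations above (100 stored blocks with a match at the 80th, and 1000 stored blocks with no match) can be reproduced with a small counting sketch. The block contents are synthetic and constructed so that exactly 9 non-matching blocks share the checksum of the first data; the additive checksum and SHA-256 identifier are assumptions, as are all function names.

```python
import hashlib

def checksum(data: bytes) -> int:
    return sum(data) & 0xFFFF                    # assumed additive checksum

def identifier(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()         # assumed identifier algorithm

def identifier_only(stored, first_data):
    """Identifiers computed over stored blocks until a match (or the end)."""
    target, computed = identifier(first_data), 0
    for block in stored:
        computed += 1
        if identifier(block) == target:
            break
    return computed

def checksum_first(stored, first_data):
    """(checksums, identifiers) computed over stored blocks until a match (or the end)."""
    target_sum, target_id = checksum(first_data), identifier(first_data)
    sums = ids = 0
    for block in stored:
        sums += 1
        if checksum(block) == target_sum:        # only then pay for an identifier
            ids += 1
            if identifier(block) == target_id:
                break
    return sums, ids

# Synthetic blocks: "colliders" share the checksum of first_data but not its
# content; "others" never share its checksum.
first_data = bytes([0, 0, 9])
colliders = [bytes([k, 9 - k, 0]) for k in range(9)]
others = [bytes([100]) + i.to_bytes(2, "big") for i in range(1000)]

stored_100 = colliders + others[:70] + [first_data] + others[70:90]   # match at the 80th block
with_match = (identifier_only(stored_100, first_data), checksum_first(stored_100, first_data))
without_match = (identifier_only(others, first_data), checksum_first(others, first_data))
```

In the first scenario the identifier-only approach computes 80 identifiers, while the checksum-first approach computes 80 checksums and 10 identifiers; in the second scenario the counts are 1000 identifiers versus 1000 checksums and no identifiers.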

Illustratively, suppose the data contents stored in file A and file B are as shown in fig. 1B, where each of the data contents s1 to s6 has the same size as the data block capacity, namely 1k (it is understood that 1k of data may be 1024 English characters, or 512 Chinese characters, etc., which is not limited herein). The data of file A and file B are read into the memory in sequence, and data blocks for storing the data contents of file A and file B are requested from the target storage space. Assuming that the target storage space stores no other data content before file A and file B are stored, storing file A and file B with the method of the present application yields the stored data contents shown in fig. 1B: the target storage space uses 6 data blocks to store s1, s2, s3, s4, s5, and s6 respectively, and only one copy is stored of the duplicate data s1, s3, and s5, whether the duplication occurs within the same file (within file A or within file B) or across files (between file A and file B).

Specifically, as shown in fig. 1C, for the duplicate data s3 in file A and file B, the data block address information recorded for s3 in the inode of file A and in the inode of file B points to the same data block.

Understandably, the data of file A and file B total 10k, while the data actually stored in the target storage space is 6k, so 4k of storage space is saved. The calculation incurred in storing file A and file B includes calculating the checksums of 10 strings and, at minimum, calculating the identifiers of only 3 strings or performing only 4 comparisons of data content.
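The storage saving and the operation counts for file A and file B can be checked with the sketch below. The exact order of s1 to s6 inside each file is given only in fig. 1B, so the layout used here (five blocks per file, with s1, s3 and s5 duplicated) is a hypothetical arrangement consistent with the totals stated above; the additive checksum, the byte comparison used as the confirmation step, and all names are likewise assumptions.

```python
def checksum(data: bytes) -> int:
    return sum(data) & 0xFFFF                    # assumed additive checksum

def store_files(files):
    """Store each file block by block; return the data block table, the address
    entries recorded per inode, and counters for the work performed."""
    blocks, by_checksum, inodes = [], {}, {}
    stats = {"checksums": 0, "comparisons": 0}
    for name, contents in files.items():
        addresses = []
        for data in contents:
            stats["checksums"] += 1
            bucket = by_checksum.setdefault(checksum(data), [])
            address = None
            for candidate in bucket:             # only blocks with an equal checksum
                stats["comparisons"] += 1
                if blocks[candidate] == data:
                    address = candidate          # duplicate: share the data block
                    break
            if address is None:
                blocks.append(data)              # new content: allocate a data block
                address = len(blocks) - 1
                bucket.append(address)
            addresses.append(address)
        inodes[name] = addresses
    return blocks, inodes, stats

# Hypothetical layout: six distinct 1k contents, ten blocks across the two files.
s1, s2, s3, s4, s5, s6 = (bytes([i]) * 1024 for i in range(1, 7))
files = {"A": [s1, s2, s3, s1, s3], "B": [s3, s4, s5, s6, s5]}
blocks, inodes, stats = store_files(files)
```

With this layout 6 data blocks are allocated for 10k of file data, the address entries for s3 in both inodes hold the same data block address, and the run performs 10 checksum calculations and 4 content comparisons.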

The user interface provided by the embodiments of the present application is described below.

It can be understood that the method provided by the embodiment of the present application can be executed by any electronic device that stores data by using data blocks in a target storage space. Exemplary electronic devices include mobile terminals, tablet computers, desktop computers, laptop computers, handheld computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, and cellular phones, among others. For convenience of description, a mobile terminal is taken as an example of the electronic device, and the user interface provided by the embodiment of the application is introduced.

First, a user interface involved in the data deduplication function is introduced. Fig. 2A to fig. 2G are schematic diagrams of a user interface according to an embodiment of the present application. As shown in fig. 2A, the electronic device displays a home screen interface 10, which includes a calendar widget 101, a weather widget 102, application icons 103, a status bar 104, and a navigation bar 105. Wherein:

The calendar widget 101 may be used to indicate the current time, e.g., the date, the day of the week, and the hour and minute.

The weather widget 102 may be used to indicate a weather type, such as cloudy turning sunny or light rain, and may also be used to indicate information such as the temperature and the location.

The application icons 103 may include icons of Wechat (Wechat), Twitter (Twitter), Facebook (Facebook), microblog (Sinaweibo), QQ (Tencent QQ), WPS, and memo 1031, and may further include icons of other applications, which is not limited in this embodiment. The icon of any application can be used for responding to the operation of the user, such as a touch operation, so that the electronic equipment starts the application corresponding to the icon. It will be appreciated that each of the above-described applications will generate file data accordingly, which will be stored in the data block.

The status bar 104 may include the name of the operator (e.g., China Mobile), the time, a Wi-Fi icon, the signal strength, and the current remaining power.

Navigation bar 105 may include system navigation keys such as a return key 1051, a home screen key 1052, and a task history key 1053. The home screen interface 10 is the interface displayed by the electronic device 100 after a user operation on the home screen key 1052 is detected on any user interface. When it is detected that the user clicks the return key 1051, the electronic device 100 may display the user interface previous to the current user interface. When it is detected that the user clicks the home screen key 1052, the electronic device 100 may display the home screen interface 10. When it is detected that the user clicks the task history key 1053, the electronic device 100 may display the tasks that the user has recently opened. The navigation keys may also have other names; for example, 1051 may be called a Back Button, 1052 may be called a Home Button, and 1053 may be called a Menu Button, which is not limited in this application. Each navigation key in the navigation bar 105 is not limited to a virtual key, and may also be implemented as a physical key.

As shown in fig. 2A and 2B, the electronic device may display a memo application interface 20 in response to a user operation, for example a touch operation, acting on the memo icon 1031. The memo application interface 20 may include stored memo records (e.g., "meeting record 1", "meeting record 2", etc. in the figure) and a new memo control 201. As shown in fig. 2B and 2C, in response to a user operation, such as a touch operation, applied to the new memo control 201, the electronic device may display the application interface 30. Application interface 30 may contain a save control 301 and a cancel control 302. The save control 301 is used for storing the data content of the new memo in response to a user operation; the cancel control 302 is used for discarding the edited data content of the new memo in response to a user operation, that is, returning to the user interface 20 without storing the data content of the new memo.

The user can click the save control 301. In response to the second user operation, i.e., the click operation applied to the save control 301, the electronic device starts data storage to store the data content of the new memo, and executes the data deduplication method provided by the present application during the data storage process. Illustratively, starting data storage by the electronic device includes: reading the data content of the new memo into the memory in units of the data block capacity, and requesting data blocks for the data read into the memory. For example, the data content of the new memo after being read into the memory may resemble the data content of file A read into the memory as shown in fig. 2B.

As shown in fig. 2D and 2E, in response to a click operation on the target memo in the user interface 20, the user interface 40 is entered, and the user interface 40 includes the data content of the target memo (for example, the data content of the target memo is "meeting record 8"), a save control 401, and a cancel control 402. The saving control 401 is configured to respond to a user operation to save the edited and modified data content; and a cancel control 402, configured to cancel saving of the current modified data in response to a user operation. The user may edit the data content of the target memo, for example, as shown in fig. 2E, add the data content "(important meeting)" after the data content "meeting record 8", the user clicks the save control 401, in response to the third user operation, i.e., the click operation applied to the save control 401, the electronic device starts data update storage (i.e., modifies the data content "meeting record 8" into "meeting record 8 (important meeting)") and executes the data deduplication method provided by the present application during the data storage process.

As shown in fig. 2F and 2G, in response to a long press operation (which may be other operations, such as sliding left, but is not limited thereto) on the target memo in the user interface 20, the user interface 50 is entered, and the user interface 50 includes a save control 501 and a cancel control 502. The saving control 501 is configured to save and update the data content of the target memo in response to a user operation; the cancel control 502, in response to the user operation, returns to the user interface 20 without performing a save update operation on the data content deleted by the user. The user may delete some data contents of the target memo in the user interface 50, for example, the user deletes the data contents of "meeting record 3" of the target memo and clicks the saving control 501, in response to the fourth user operation, i.e., the clicking operation applied to the saving control 501, the electronic device starts a data deletion task (i.e., all data contents included in the target memo are deleted), and executes the data deduplication method provided by the present application during the data deletion task.

The present application is further described below with reference to the accompanying drawings.

In this embodiment of the application, the target storage space may be a storage partition of a storage medium that stores data by using a data block, and specifically, the storage medium that stores data by using a data block may be a magnetic disk or a solid state disk, which is not limited herein. For convenience of description, a disk is taken as an example of the storage medium, and a target disk partition is taken as an example of the target storage space. Understandably, the target disk partition is a readable-writable partition or a read-only partition in the disk.

Referring to fig. 3A, fig. 3A is a system framework diagram of data deduplication according to an embodiment of the present application. The system framework diagram includes a file layer 301 and an index node (inode) layer 302.

The file layer 301 includes files of various applications (apps); illustratively, the file layer 301 includes an a file and a b file of application A and a c file of application B. The inode layer 302 includes a virtual file system (VFS) layer 3201 and a file system layer 3202, and the file system layer 3202 includes one or more of the flash friendly file system (F2FS), the fourth extended file system (EXT4), the extensible read-only file system (EROFS), and other file systems. Understandably, a file system defines the method and data structures by which files are stored in the target disk partition, and files are accessed by name. F2FS is a novel open source flash file system, and is mainly used for accessing flash memory data of a flash memory device (NAND) of a computer; EXT4 is a log file system, and is mainly used for accessing log files; EROFS is a super file system, and is mainly used to access system files. For example, if the a file and the b file are system files, the inodes of the a file and the b file should be stored in the EROFS file system; if the c file is a log file, the inode of the c file should be stored in the EXT4 file system.

Understandably, the target disk partition in the disk is mounted as a file system, and a file system generates a data block table correspondingly. The read-write type of the file system can be read-only or readable-writable, and the read-write type of the data stored in the target disk partition is consistent with the read-write type of the file system. Specifically, if the read-write type of the file system is read-only, the data stored in the target disk partition is also read-only data; and if the read-write type of the file system is readable and writable, the data stored in the target disk partition is also readable and writable. For example, if the target disk partition is mounted as the F2FS file system and the read-write type of the F2FS file system is set to be readable and writable, the target disk partition is also a readable and writable partition.

In the embodiment of the application, data deduplication processing is performed on a data block of which a target disk partition is a readable and writable partition or a read-only partition, in other words, the data deduplication method provided by the application is applicable to both the readable and writable partition and the read-only partition.

In addition, each file system may generate a data block table. The data block table records, for each data block in the target disk partition corresponding to the file system, the association between the offset of the data block in the data block table and the data content stored in the data block. The data block table has the same meaning in the other embodiments herein.

It is understood that the first data and the second data stored in the second data block provided in other embodiments of the present application may be data of the same file in the same file system, or data of different files in the same file system. Illustratively, the first data and the second data are both data in an a file of the EROFS; or, the first data is data in an a file of the EROFS, and the second data is data in a b file of the EROFS.

It can be understood that the VFS may use standard unix system calls to read and write different file systems located on different physical media, and provide a unified operation interface and application programming interface for the above F2FS, EXT4, EROFS, and other file systems. That is, for different file systems, the interfaces for accessing the underlying storage medium are different, that is, the interfaces for accessing the data blocks pointed by the index nodes of different file systems are different, and the VFS can provide a uniform operation interface and application programming interface for the file systems, so that the system call can work without concerning the underlying storage medium and the file system type.

Understandably, the operating systems supported by the VFS include the Linux operating system and the Windows operating system. The file system layer 3202 includes different types of file systems for different operating systems, and the operating system in which the data blocks are located is not limited herein.

In some embodiments, the system framework diagram further includes a first information table as shown in fig. 3B, the first information table being used for recording the checksum of the target data block. Understandably, the target data block refers to any one data block in the target disk partition. If the target data block is already used for storing data, the checksum of the target data block refers to the checksum of the data content of the target data block; if the target data block is not yet used for storing data, the checksum of the target data block is an initial value (illustratively, null or 0). All definitions herein regarding the checksum of a data block are the same.

In some other embodiments, the system framework diagram further includes a second information table as shown in fig. 3C, where the second information table is used to record the checksum of the target data block and the corresponding number of times that the target data block is referenced. It is understood that the number of times that the target data block is referred to refers to the number of times that the address information of the target data block is recorded in the address entry of the inode of the first file (the definition of the number of times that is referred to herein is the same).

In other embodiments, the first information table or the second information table further records the file number of the target data block. Illustratively, as shown in fig. 3D, the second information table further includes a column attribute of "belonging file number" for recording the belonging file of the target data block.

For convenience of description, hereinafter, in the description applicable to both the first information table and the second information table, the first information table and the second information table are collectively referred to as a target information table.

Optionally, the following two ways of establishing the association relationship between the target data block and the corresponding checksum in the target information table are provided:

1. The offset of each row record in the target information table is made equal to the offset of the corresponding row record in the data block table (the data block table records the association between the offset of the target data block in the data block table and the corresponding data content). The method comprises the following steps:

A one-to-one correspondence is created between the row records of the target information table and the row records of the data block table. Specifically, the data block table and the target information table are initialized before any target data block in the target disk partition is allocated for storing data. Initializing the data block table and the target information table includes: setting the offset values of the data block table and of the target information table to increase from 0 in steps of 1. For example, the target disk partition includes 100 data blocks, and the second information table is used to record the checksums of the 100 data blocks, so that the row records of the initialized data block table and second information table are as shown in fig. 3E. The offset corresponding to the first row of data in the data block table is 0, the offset corresponding to the second row of data is 1, the offset corresponding to the third row of data is 2, and so on; likewise, the offset corresponding to the first row of data in the target information table is 0, the offset corresponding to the second row of data is 1, the offset corresponding to the third row of data is 2, and so on.

It can be understood that, to make the target data block in the target information table generate an association relationship with the corresponding checksum, the first target row of the target information table needs to be used to record information related to the checksum of the target data block corresponding to the second target row in the data block table, and the offset of the first target row is the same as that of the second target row. Illustratively, the target disk partition includes a total of 100 data blocks, and the data block table includes 100 records of the data blocks. Correspondingly, the target information table also includes 100 rows of checksum records, and the 100 rows of checksum records correspond to the 100 data block records in a one-to-one manner. That is, the row with the offset 0 in the target information table is used to record the checksum-related information of the target data block corresponding to the row with the offset 0 in the data block table.

It can be understood that, in the first information table, the information related to checksum of the target data block includes: the checksum of the target data block; in the second information table, the information related to checksum of the target data block includes: the checksum of the target data block and the number of times the target data block is referenced; in some other embodiments, in the second information table, the information related to checksum of the target data block includes: the checksum of the target data block, the number of times the target data block is referenced, and the file number to which the target data block belongs.

2. Optionally, the target information table explicitly records the association between each checksum and the address information of the corresponding target data block. Specifically, no row records need to be created when the target information table is initialized, before the data blocks in the target disk partition are allocated for storing data; instead, when a target data block in the target disk partition is allocated for storing data, a row record corresponding to that target data block is newly added to the target information table.

Illustratively, when the data block with addr3 as the address information of the data block in the target disk partition is allocated to store the data content with string str1, a row record about the association relationship between the checksum of the str1 and the addr3 is newly added in the target information table. When the data block with addr25 as the address information of the data block in the target disk partition is allocated to store the data content with string str2, a row record of the association relationship between the checksum of the str2 and the addr25 is newly added to the target information table.
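The two ways of associating a data block with its checksum can be sketched in Python as follows. This is a minimal illustration only: the names `InfoRow`, `make_aligned_table` and `record_allocation`, the use of `None` as the initial checksum value, and the numeric checksum values are assumptions made for the example and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InfoRow:
    """One row of a (hypothetical) target information table."""
    checksum: Optional[int] = None  # None stands in for the initial value
    ref_count: int = 0              # only meaningful for the second information table


def make_aligned_table(num_blocks: int) -> List[InfoRow]:
    """Way 1: row i of the information table describes row i of the data block table."""
    return [InfoRow() for _ in range(num_blocks)]


def record_allocation(table: Dict[str, InfoRow], addr: str, checksum: int) -> None:
    """Way 2: a row is added only when a block is allocated, keyed by block address."""
    table[addr] = InfoRow(checksum=checksum, ref_count=1)


# Way 1: 100 blocks, so 100 pre-created rows with offsets 0..99.
aligned = make_aligned_table(100)

# Way 2: rows appear as blocks addr3 and addr25 are allocated.
explicit: Dict[str, InfoRow] = {}
record_allocation(explicit, "addr3", 0x1234)   # checksum of str1 (made-up value)
record_allocation(explicit, "addr25", 0x5678)  # checksum of str2 (made-up value)
```

In the first layout the block is found from the row offset alone, while the second layout trades the pre-allocated rows for an explicit address column.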

Referring to fig. 4A, fig. 4A is a schematic flowchart illustrating a data deduplication method according to an embodiment of the present application. As shown in fig. 4A, the data deduplication method includes the following steps:

401, upon receiving an instruction to store file a, an inode for file a is created according to the size of file a.

Specifically, after the index structure of the inode is determined according to the size of the file a, the inode of the file a is created.

In this embodiment of the present application, the file a may be a document, a picture, an audio, a video, or an audio/video, and the like, which is not limited in this embodiment of the present application.

In an embodiment of the application, the index structure includes a direct index chunk, a primary indirect index chunk, a secondary indirect index chunk, and a tertiary indirect index chunk. It is understood that the embodiment of the present application can also be applied to other types of index blocks, which is not limited by the embodiment of the present application.

Illustratively, as shown in fig. 2A and 2B, the memo record of the newly added memo data content is denoted as file a. The user clicks the saving control 301, and in response to the second user operation, i.e., the click operation applied to the saving control 301, the electronic device receives an instruction to store file a and then starts data storage. Specifically, the structure of the address entries of the electronic device is shown in fig. 1A, and the size of an index block and the size of a data block are both 1k. After receiving the instruction to store file a, the electronic device reads the data content included in file a into the memory in units of the data block size (that is, the data content of file a is read into the memory 1k at a time). Illustratively, referring to fig. 1B, the data content of file a includes s1, s3, s5, s1 and s4, where each of s1-s5 has a data content size of 1k, so the size of file a is 5k. Since 4k < 5k < 512k, the electronic device determines from the size of file a that the index structure of its inode should use a level-one indirect index. As shown in fig. 4B, the address information of s1, s3, s5 and s1 is recorded by using the direct index block, and the address information of s4 is recorded by using the level-one indirect index block.
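The size-based choice in this example can be sketched as follows. The sketch is hedged: the 4k and 512k limits and the four direct entries are read off the example above, the treatment of sizes exactly at a limit is an assumption, and the limits of the secondary and tertiary indirect index blocks are not stated in this passage, so larger files are only reported as needing a higher level. The function names are hypothetical.

```python
KIB = 1024
DIRECT_LIMIT = 4 * KIB        # example: up to 4k is covered by direct index entries
LEVEL_ONE_LIMIT = 512 * KIB   # example: upper bound quoted for the level-one indirect index


def select_index_structure(file_size: int) -> str:
    """Pick an index structure from the file size (boundary handling is assumed)."""
    if file_size <= DIRECT_LIMIT:
        return "direct"
    if file_size <= LEVEL_ONE_LIMIT:
        return "level-one indirect"
    return "higher-level indirect"  # secondary/tertiary limits are not given here


def place_blocks(blocks, direct_entries=4):
    """Split block labels between the direct entries and the level-one indirect block."""
    return list(blocks[:direct_entries]), list(blocks[direct_entries:])


structure = select_index_structure(5 * KIB)
direct, indirect = place_blocks(["s1", "s3", "s5", "s1", "s4"])
```

With the 5k file of the example, `structure` is the level-one indirect index, the first four blocks land in the direct entries, and s4 lands in the level-one indirect index block.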

It is understood that the 1k data s1, s3, or s5 may be 1024 all-english characters, or 512 all-chinese characters, and the like, which is not limited herein.

402, calculating the checksum of the first data in the file a.

Understandably, the checksum of the first data is the checksum of the data content of the first data.
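As a hedged illustration of step 402, the following sketch computes a simple additive checksum over the bytes of the first data. The application does not specify the checksum algorithm in this passage, so the byte-sum modulo 2^32 used here, and the name `sum_checksum`, are assumptions.

```python
def sum_checksum(data: bytes, modulus: int = 1 << 32) -> int:
    """Additive checksum: the sum of all byte values, reduced modulo `modulus`."""
    return sum(data) % modulus


# Different contents can share a checksum, which is why a matching checksum
# alone is not treated as proof that two data contents are identical.
same = sum_checksum(b"ab") == sum_checksum(b"ba")
```

A checksum of this kind is cheap to compute but collides easily, which is consistent with the method comparing the data contents only after the checksums match.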

In this embodiment of the present application, when storing the file a, the first data in the file a is sequentially read into the memory according to the capacity of a data block in the file system in which the file a is located, and then a data block is requested to be allocated to the target disk partition for the first data in the memory, where the data block is used for storing the data content of the first data.

Understandably, the first data is data that is not stored in the memory to the corresponding data block in the target disk partition. Illustratively, the file a includes 16k of data, and the capacity of the data block in the file system in which the file a is located is 4k, the data in the file a is read into the memory four times, if the first 4k of data and the second 4k of data have been stored in the corresponding data blocks of the target disk partition, and the third 4k of data and the fourth 4k of data in the memory have not been allocated with a data block. The first data is the third 4k data. Correspondingly, the checksum of the first data is the checksum of the data content of the third 4k data. After the third 4k data is stored in the data block corresponding to the target disk partition, the first data is the fourth 4k data that has not been stored in the data block in the memory.
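The block-by-block reading described above can be sketched as follows, assuming the file content is already available as a byte string. The helper name `read_in_blocks` is hypothetical.

```python
from typing import Iterator


def read_in_blocks(data: bytes, block_size: int) -> Iterator[bytes]:
    """Yield the file content in pieces of one data block each."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


# A 16k file with 4k data blocks is read in four pieces; each piece in turn
# plays the role of the "first data" until it has been stored.
chunks = list(read_in_blocks(bytes(16 * 1024), 4 * 1024))
```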

403, determining, according to a second information table, whether a second data block exists, wherein the checksum of the second data block is the same as the checksum of the first data. The second information table is used for recording the checksum of the second data block.

It is understood that the checksum of the first data refers to the checksum of the data content of the first data.

It can be understood that the second data block is a data block storing second data, and the data content of the second data is the data content of the second data block.

It is understood that the second data stored in the second data block may be data content of file a, and may also be data content of other files (for example, file B, file C, or file D). The second data block may be any target data block stored in the target disk partition whose checksum is the same as the checksum of the first data, and the target data block may be a data block storing data belonging to file a, or a data block storing data belonging to another file (e.g., file B). In this case, the data deduplication method provided in the embodiment of the present application deduplicates data both within the same file and across different files in the target disk partition. Understandably, in this case, the second information table records the association between the second data block and the checksum of the second data block, reusing the layout of fig. 3C, and does not need to record the file number of the second data block.

Illustratively, file a includes 16k of data, and the capacity of a data block in the file system in which file a is located is 4k, so the data in file a is read into the memory in four passes. Suppose the first 4k of data and the second 4k of data have already been stored in the corresponding data blocks of the target disk partition, while the third 4k of data and the fourth 4k of data in the memory have not yet been allocated a data block. The second data block is then the data block in the target disk partition storing the first 4k of data (second data), or the data block in the target disk partition storing the second 4k of data (second data), or a data block in the target disk partition storing target data of another file (e.g., file B). Correspondingly, the checksum of the second data block is the checksum of the data content of the first 4k of data, or the checksum of the data content of the second 4k of data, or the checksum of the data content of the target data.

Optionally, the second data block may also be constrained to be a data block in the target disk partition that stores second data belonging to file a. In this case, the data deduplication method provided by the application deduplicates data within the same file only. Understandably, in this case, the second information table records the association between the second data block and the checksum of the second data block, reusing the layout of fig. 3D, and needs to record the file number of the second data block.

Illustratively, the file a includes 16k of data, and the capacity of the data block in the file system in which the file a is located is 4k, the data in the file a is read into the memory four times, if the first 4k of data and the second 4k of data have been stored in the corresponding data blocks of the target disk partition, and the third 4k of data and the fourth 4k of data in the memory have not been allocated with a data block. The second data block is the data block in the target disk partition storing the first 4k data, or the second data block is the data block in the target disk partition storing the second 4k data.

In the embodiment of the present application, whether the second data block exists is determined by setting a loop length and searching the second information table (see above for the detailed description of the second information table). Illustratively, the loop length is the total number of records in the second information table, the loop start point is 0 (corresponding to a base address of 0), and the loop increment is 1. Specifically, if there are 20 records in total in the second information table, the corresponding loop statement is: for (i = 0; i < 20; i++), where the value of i is equal to the offset of the (i+1)-th record in the second information table. Searching for the second data block according to the second information table comprises the following steps: looking up the target checksum corresponding to a target offset (whose value is i) in the second information table; judging whether the target checksum is the same as the checksum of the first data; and, in the case that the target checksum is determined to be the same as the checksum of the first data, determining the data block corresponding to the target checksum as the second data block.

Optionally, after the target checksum is obtained, before determining whether the target checksum is the same as the checksum of the first data, it is first determined whether the target checksum is an initial value, and if the target checksum is the initial value, it indicates that the data block corresponding to the checksum is not used for storing the data content. At this time, it is not necessary to judge whether the target checksum is the same as the checksum of the first data, but add 1 to the value of i, and continue to cyclically obtain the next target checksum.

Alternatively, the loop start point may be another reference value. Illustratively, the loop start point is 2 (corresponding to a base address of 2), and the corresponding loop statement is: for (i = 2; i < 22; i++), where the value of i minus 2 is the offset, which is equal to the offset of the (i-1)-th record in the second information table.
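The search loop of step 403, including the optional skip of rows that still hold the initial value, can be sketched as follows. The sketch assumes the second information table is held as a list of checksums indexed by offset, with `None` as the initial value; the name `find_candidate` and its `start` parameter are hypothetical.

```python
from typing import List, Optional


def find_candidate(checksums: List[Optional[int]], target: int,
                   start: int = 0) -> Optional[int]:
    """Return the first offset >= start whose checksum equals `target`, else None."""
    for i in range(start, len(checksums)):
        if checksums[i] is None:
            continue  # block not yet used for storing data: skip the comparison
        if checksums[i] == target:
            return i  # the block at this offset is a candidate second data block
    return None


table = [None, 7, 9, 7]
first_hit = find_candidate(table, 7)
next_hit = find_candidate(table, 7, start=first_hit + 1)
```

Passing the previous hit plus one as `start` resumes the search where it left off, which is what the later steps rely on when a candidate is rejected.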

In the event that it is determined that the second data block exists, it is determined 404 whether the data content of the second data block is the same as the data content of the first data.

Optionally, when the offset of each row record of the second information table is equal to the offset of the corresponding row record of the data block table, the data content of the second data block may be obtained as follows: the data block corresponding to the target checksum (that is, the second data block) is the data block corresponding to the target offset in the data block table, and the data content of the second data block is the data content of that data block.

Optionally, when the second information table explicitly records the association between each checksum and the address information of the corresponding second data block, the data content of the second data block may be obtained as follows: the address information of the second data block is the address information recorded in the row corresponding to the target offset in the second information table, and the data content of the second data block is the data content of the data block corresponding to that address information in the data block table.

It can be understood that the embodiments of the present application may perform data deduplication for a readable and writable file system, and may also perform data deduplication for a read-only file system. Illustratively, the file system in which the file a is located is a readable and writable file system, the data block table is a data block table corresponding to the readable and writable file system, and information of data blocks included in a target disk partition corresponding to the readable and writable file system is recorded in the data block table. Or the file system in which the file a is located is a read-only file system, the data block table is a data block table corresponding to the read-only file system, and the data block table records information of data blocks included in a target disk partition corresponding to the read-only file system.

Optionally, a compare function may be used to determine whether the data content of the second data block is the same as the data content of the first data. Specifically, the data content of the first data is str1 and the data content of the second data block is str2; the compare function is called through str1.compare(str2) or str2.compare(str1), and its return value is obtained. When the return value is 0, str1 and str2 are the same; when the return value is not 0, str1 and str2 are not the same.

Optionally, an equals method may also be used to determine whether the data content of the second data block is the same as the data content of the first data. Specifically, the data content of the first data is str1 and the data content of the second data block is str2; the equals method is called through str1.equals(str2) or str2.equals(str1), and its return value is obtained. When the return value is true, str1 and str2 are the same; when the return value is false, str1 and str2 are not the same.

Optionally, whether the data content of the second data block is the same as the data content of the first data may also be determined by calculating a unique identifier for the data content of the first data and for the data content of the second data block. It will be appreciated that a hash algorithm or a fingerprint algorithm may be used to calculate the unique identifier of a data content. Specifically, a hash algorithm or a fingerprint algorithm is used to calculate a first identifier for the data content of the first data and a second identifier for the data content of the second data block; in the case that the first identifier is the same as the second identifier, the data content of the first data is the same as the data content of the second data block.

It is understood that whether the data content of the second data block is the same as the data content of the first data may be determined by a compare function, an equals method, or a method of calculating a unique identifier, and the determination method used is not limited herein.
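The content comparison of step 404 can be sketched in two of the ways mentioned above: a direct comparison of the contents and a comparison of computed identifiers. The use of SHA-256 from Python's `hashlib` as the hash algorithm is an assumption for illustration; the application does not name a specific hash or fingerprint algorithm.

```python
import hashlib


def same_by_content(first_data: bytes, block_content: bytes) -> bool:
    """Direct comparison, the counterpart of the compare/equals approach."""
    return first_data == block_content


def same_by_identifier(first_data: bytes, block_content: bytes) -> bool:
    """Compare identifiers computed with a hash algorithm (SHA-256 assumed here)."""
    first_id = hashlib.sha256(first_data).digest()
    second_id = hashlib.sha256(block_content).digest()
    return first_id == second_id


# b"ab" and b"ba" have the same byte sum but different contents.
differs = not same_by_content(b"ab", b"ba") and not same_by_identifier(b"ab", b"ba")
```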

405, the second information table is further used for recording the number of times the second data block is referred to; in a case where it is determined that the data content of the second data block is the same as the data content of the first data, it is determined whether the number of times the second data block is referred to in the second information table is less than a first threshold.

Illustratively, the first threshold may take any value from 500 to 5000, inclusive. It is understood that this range is merely an example, and the value of the first threshold is not limited herein.

It is understood that the number of times the second data block is referred to may also be referred to as the first number of times referred to as described in other embodiments herein.

It is understood that step 404 may be executed first, and step 405 executed when the data content of the second data block is determined to be the same as the data content of the first data; alternatively, step 405 may be executed first, and step 404 executed when it is determined that the number of times the second data block is referenced in the second information table is smaller than the first threshold. The order of execution of steps 404 and 405 is not limited herein.

In the embodiment of the present application, the first threshold is a data deduplication upper limit. On one hand, it can be understood that, if the deduplication upper limit is not set, the number of times of reference of the target data block corresponding to the second data is gradually increased, which results in that the frequency of access of the target data block is also gradually increased, and repeatedly reading the data of the same target data block in the target disk partition for a long time will accelerate the damage of the storage medium (target disk partition) storing the target data block, and shorten the service life of the storage medium. Therefore, the problem of accelerating the damage speed of the storage medium caused by frequent access of the target data block can be avoided by setting the deduplication upper limit. On the other hand, the more times the target data block is referred to, the higher the repetition degree, which may cause certain difficulty for the index of the target direct index table, and may result in a slow modification operation, which may reduce the efficiency of the modification operation. Therefore, the duplication elimination upper limit is set, so that the problems of index difficulty caused by overhigh duplication degree, reduction of the efficiency of modification operation caused by overhigh duplication degree and the like can be avoided.

406, in a case where it is determined that the number of times the second data block is referenced is less than the first threshold, storing the address information of the second data block, as the data block address information of the first data, into an address entry of the inode of file a; and adding 1 to the number of times the second data block is referenced as recorded in the second information table.

It can be understood that, in the case that it is determined that there exists a second data block that is identical to the checksum of the first data, and the data content of the second data block is identical to the data content of the first data, and the number of times that the data block corresponding to the second data is referred to is less than the first threshold, the loop statement in step 404 is not executed any more. That is to say, when the second data block that is duplicated with the data content of the first data is found in the target disk partition and the second data block does not reach the deduplication upper limit, it is stopped to find whether the second information table has a next second data block (the checksum of the next second data block is the same as the checksum of the first data).
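Steps 405 and 406 can be sketched as follows, assuming the inode address entries are a plain list and the reference counts are held in a list indexed by block offset. The threshold value 500 is taken from the example range above; the name `try_reference` is hypothetical.

```python
from typing import List

FIRST_THRESHOLD = 500  # example value; the text only gives an illustrative range


def try_reference(inode_addrs: List[int], ref_counts: List[int],
                  block_index: int, threshold: int = FIRST_THRESHOLD) -> bool:
    """Reuse an existing block for the first data if its count is below the threshold."""
    if ref_counts[block_index] >= threshold:
        return False                    # deduplication upper limit reached
    inode_addrs.append(block_index)     # record the block address in the inode
    ref_counts[block_index] += 1        # one more reference to this block
    return True


inode_addrs: List[int] = []
ref_counts = [499, 500]
reused_first = try_reference(inode_addrs, ref_counts, 0)
reused_second = try_reference(inode_addrs, ref_counts, 1)
```

Here the first block is reused and its count rises to 500, while the second block is refused because it has already reached the upper limit.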

407, in any of the following cases, storing the first data into a reference data block, recording the checksum corresponding to the reference data block in the second information table, and recording the number of times the reference data block is referenced as 1: it is determined according to the second information table that no second data block exists whose checksum is the same as the checksum of the first data; or, it is determined according to the second information table that no second data block exists whose checksum is the same as the checksum of the first data and whose data content is the same as the data content of the first data; or, it is determined according to the second information table that no second data block exists whose checksum is the same as the checksum of the first data, whose data content is the same as the data content of the first data, and whose number of times referenced is smaller than the first threshold.

For example, in a case that it is determined according to the second information table that there are 0 second data blocks (that is, no data block whose checksum is the same as the checksum of the first data), the first data is stored in a reference data block (i.e., a new data block is allocated to the first data, the new data block is denoted as the reference data block, and the reference data block is used for storing the first data), the checksum corresponding to the reference data block is recorded in the second information table, and the number of times the reference data block is referenced is recorded as 1.

Illustratively, in a case where it is determined that m second data blocks exist in the second information table (m is a positive integer; the checksums of the m second data blocks are the same as the checksum of the first data, but the data contents of the m second data blocks are all different from the data content of the first data), the first data is stored into a reference data block, the checksum corresponding to the reference data block is recorded in the second information table, and the number of times the reference data block is referenced is recorded as 1.

In this embodiment, when it is determined according to the second information table that the number of times a second data block is referenced is greater than or equal to the first threshold (the checksum of the second data block is the same as the checksum of the first data, and the data content of the second data block is the same as the data content of the first data), step 404 continues to be performed with i in the loop statement incremented by 1, and steps 405, 406 and 407 continue to be performed, so as to keep searching the second information table for another second data block whose checksum is the same as the checksum of the first data and whose number of times referenced is smaller than the first threshold. If i in the loop statement reaches its maximum value and no such other second data block has been found, it is determined that no second data block exists whose checksum is the same as the checksum of the first data, whose data content is the same as the data content of the first data, and whose number of times referenced is less than the first threshold.

Exemplarily, in a case where it is determined from the second information table that there are 3 second data blocks (the checksums of the 3 second data blocks are all the same as the checksum of the first data, and the data contents of the 3 second data blocks are all the same as the data content of the first data, but the numbers of times of reference of the 3 second data blocks are all greater than or equal to the first threshold), the first data is stored in a reference data block, the checksum corresponding to the reference data block is recorded in the second information table, and the number of times of reference of the reference data block is recorded as 1.

Illustratively, suppose there are 2 second data blocks in the second information table. For one of them, the checksum is the same as the checksum of the first data and the data content is the same as the data content of the first data, but the number of times of reference is greater than or equal to the first threshold. For the other, the checksum is the same as the checksum of the first data, the data content is the same as the data content of the first data, and the number of times of reference is less than the first threshold. If the second data block whose number of times of reference is greater than or equal to the first threshold is found first, steps 405, 406 and 407 are still executed in a loop until the other second data block is found. If the other second data block is found first, steps 405, 406 and 407 need not be executed repeatedly.

It can be understood that, in some embodiments, the file number of the second data block is recorded in the second information table, and in a case that the data deduplication method of the embodiment of the present application is constrained to data deduplication based on the same file, after the step 403 is performed to determine that the second data block exists and before the step 406, the data deduplication method further includes:

determining whether the file to which the second data block belongs is a file A or not according to the second information table; if yes, further judging whether the first data and the second data block meet a data deduplication condition; if not, determining that the first data and the second data block do not meet the data deduplication condition. Understandably, the further determination of whether the first data and the second data block satisfy the data deduplication condition is obtained by continuing the steps 404 to 407; after determining that the first data and the second data block do not satisfy the data deduplication condition, the step 407 is executed to store the first data.

In the embodiment of the application, whether the data content of the first data and the data content of the second data block may be the same is first judged through the checksum; in the case where they may be the same, whether the data content of the first data and the data content of the second data block are actually the same is further confirmed; finally, whether the number of times of reference of the second data block has reached the deduplication upper limit is checked. In this way, data deduplication is achieved and the utilization rate of the storage space is improved, while the performance consumption caused by comparing data contents is reduced and accelerated wear of the storage medium caused by frequent access to the second data block is avoided.
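The three checks summarized above (checksum, content comparison, reference-count limit) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the additive `checksum` function, the `Row` layout, the `FIRST_THRESHOLD` value, and the `store` helper are all assumptions made for the example, and the row's `content` field stands in for reading the data block at the same offset.

```python
from dataclasses import dataclass

FIRST_THRESHOLD = 3  # hypothetical deduplication upper limit


def checksum(data: bytes) -> int:
    """Additive checksum, an assumed stand-in for the checksum in the text."""
    return sum(data) & 0xFFFF


@dataclass
class Row:
    """One row of a hypothetical second information table."""
    checksum: int
    ref_count: int
    content: bytes  # stands in for reading the data block at this offset


def store(table: list, first_data: bytes) -> int:
    """Return the offset of the block that holds first_data after storing."""
    target = checksum(first_data)
    for offset, row in enumerate(table):
        if row.checksum != target:
            continue  # different checksum: contents certainly differ
        if row.content != first_data:
            continue  # same checksum but different content
        if row.ref_count >= FIRST_THRESHOLD:
            continue  # upper limit reached: keep looking for another block
        row.ref_count += 1  # deduplicate: reference the existing block
        return offset
    # No usable second data block: allocate a reference data block.
    table.append(Row(target, 1, first_data))
    return len(table) - 1
```

The cheap checksum comparison runs first so that the byte-by-byte content comparison is only reached for candidate rows, which is the performance argument made in the paragraph above.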

The embodiment in fig. 4A-4B describes a data deduplication method for storing the file A. A data deduplication method for modifying third data in the file A is described in detail below.

Referring to fig. 5, fig. 5 is a schematic flow chart of a data deduplication method for modifying third data according to the present application. As shown in fig. 5, the method comprises the steps of:

501, after receiving an instruction to modify the third data in the file A into the fourth data, acquiring, from the second information table, the number of times of reference of the data block corresponding to the third data, and determining whether the number of times of reference is equal to 1.

It is understood that the number of times of reference of the data block corresponding to the third data may also be referred to as a second number of times of reference described in other embodiments of the present application.

In this embodiment of the application, the instruction includes a target start byte and a target end byte of the data content of the third data in the file A, and the address information of the data block corresponding to the third data is looked up according to the logical block numbers occupied by the target start byte and the target end byte of the third data and the address entries of the inode of the file A. For convenience of description, the data block corresponding to the third data is referred to as a third data block, and the address information of the data block corresponding to the third data is referred to as the address information of the third data block.

For how to find the address information of the data block corresponding to the third data according to the logical block numbers occupied by the target start byte and the target end byte of the third data and the address entries of the inode of the file A, please refer to other embodiments herein.

Illustratively, as shown in fig. 2E, the memo record corresponding to "meeting record 8" is denoted as file A. The user clicks the save control 401, and in response to the third user operation, i.e., the click operation applied to the save control 401, the electronic device receives an instruction to modify "meeting record 8" into "meeting record 8 (important meeting)" and starts to update the stored data. The data content of the third data is "meeting record 8", and the data content of the fourth data is "meeting record 8 (important meeting)". Specifically, the address entry structure of the electronic device is shown in fig. 2A, and the size of the index block and the size of the data block are both 1 KB. The instruction received by the electronic device indicates that the target start byte of "meeting record 8" in the file A is byte 0 and the target end byte is byte 9. According to the target start byte and the target end byte, the logical block number occupied by "meeting record 8" can be determined to be the first logical block number in the inode of the file A, and the address information of the data block corresponding to "meeting record 8" can be found according to that logical block number.
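The mapping from a byte range to logical block numbers in the example above can be sketched as follows. The zero-based numbering, the `inode_entries` dictionary, and the block addresses are assumptions made for illustration; the actual address entry structure is the one shown in fig. 2A.

```python
BLOCK_SIZE = 1024  # 1 KB data blocks, as in the example above


def logical_block_numbers(start_byte: int, end_byte: int) -> range:
    """Zero-based logical block numbers covered by [start_byte, end_byte]."""
    return range(start_byte // BLOCK_SIZE, end_byte // BLOCK_SIZE + 1)


# Hypothetical address entries of file A's inode: logical block -> block address.
inode_entries = {0: 0x2000, 1: 0x2400}

# Bytes 0 to 9 fall entirely inside the first logical block.
blocks = list(logical_block_numbers(0, 9))
addresses = [inode_entries[n] for n in blocks]
```

A range that crosses a 1 KB boundary, such as bytes 1000 to 1030, would cover two logical blocks and therefore two data blocks.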

Optionally, where the second information table associates a target data block with its checksum implicitly, that is, a row record of the second information table corresponds to the row record of the data block table at the same offset, the number of times of reference may be obtained as follows: the offset of the third data block in the data block table is obtained from the address information of the third data block (the address information of the third data block is the offset), and the number of times of reference recorded in the row record of the second information table at that offset is then read.

Optionally, where the second information table associates a target data block with its checksum explicitly, that is, the association relationship between the checksum and the address information of the corresponding target data block is recorded in the second information table, the number of times of reference corresponding to the address information of the third data block is looked up in the second information table.
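The two lookup modes can be contrasted with a short sketch. The table layouts, field names, and values below are hypothetical and only illustrate the difference between an offset-aligned table and a table that records block addresses explicitly.

```python
# Mode 1: row i of the second information table describes the data block at
# offset i of the data block table, so the association is implicit.
aligned_table = [
    {"checksum": 195, "ref_count": 2},
    {"checksum": 300, "ref_count": 1},
]


def ref_count_by_offset(block_offset: int) -> int:
    """The block address is the offset, so it indexes the table directly."""
    return aligned_table[block_offset]["ref_count"]


# Mode 2: each row explicitly records the address of its data block, so rows
# may appear in any order and must be searched by address.
explicit_table = [
    {"checksum": 300, "address": 0x2400, "ref_count": 1},
    {"checksum": 195, "address": 0x2000, "ref_count": 2},
]


def ref_count_by_address(address: int) -> int:
    for row in explicit_table:
        if row["address"] == address:
            return row["ref_count"]
    raise KeyError(f"no row for address {address:#x}")
```

The aligned layout gives a constant-time lookup at the cost of keeping both tables in lockstep; the explicit layout is more flexible but needs a search or an auxiliary index.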

Understandably, the third data and the fourth data are each the smallest unit of data processed by the file system. Specifically, the data content of the third data is the data content of the third data block, and the data content of the fourth data is the data content of the third data block after modification. It is understood that modifying the third data into the fourth data means modifying the data content of the third data block into the data content of the fourth data.

Specifically, if the data content of the third data is str1, the data content of the third data block is also str1, and the data content of the fourth data is str2, then modifying the third data into the fourth data means modifying the data content "str1" of the third data block into "str2". Illustratively, str1 is "adcddddabc" and str2 is "adceeeeabc", so modifying the third data into the fourth data means modifying "adcddddabc" into "adceeeeabc". Understandably, the third data includes data content to be modified, and the data content to be modified is part or all of the data content of the third data. For example, the data content "dddd" that needs to be modified in str1 is part of the data content of the third data.

502, in a case where it is determined that the number of times of reference is equal to 1, storing the fourth data into the file A, and deleting the information related to the third data.

Understandably, storing the fourth data into the file A includes: taking the fourth data as the first data in the above step 402, and performing steps 402 to 407 (references below to "storing the fourth data into the file A" have the same meaning).

It can be understood that the number of times that the data block corresponding to the third data is referred to being equal to 1 indicates that the data block corresponding to the third data is referred to only by the third data, so the third data can be directly modified or deleted when the third data is modified, without affecting other data of the file A.

It can be understood that, by using the fourth data as the first data in step 402 and performing steps 402 to 407, it can be determined whether the file A has stored therein the second data block (the checksum of the second data block is the same as the checksum of the fourth data, the data content of the second data block is the same as the data content of the fourth data, and the number of times that the second data block is referred to is less than the first threshold).

Understandably, if the second data block exists, the address information of the second data block is used as the data block address information of the fourth data; and if the second data block does not exist, allocating a new data block to the fourth data, wherein the data block address information of the fourth data is the address information of the new data block.

Optionally, the deleting the information related to the third data may be: replacing (modifying) the address information of the data block of the third data corresponding to the logical block number recorded in an address entry of the inode of the file A with the address information of the data block of the fourth data; deleting the data block corresponding to the third data; and deleting the record related to the third data in the second information table.

Optionally, the deleting the information related to the third data may alternatively be: deleting, from the address entries of the inode of the file A, the data block information corresponding to the third data under the logical block number (the data block information includes the association relationship between the logical block number corresponding to the third data in the address entries of the inode of the file A and the address information of the data block corresponding to the third data; references below to deleting this data block information have the same meaning); recording the association relationship between the logical block number and the data block address information of the fourth data in the address entry following the last address entry that is currently used, among the address entries of the inode of the file A, for recording data block address information; deleting the data block corresponding to the third data; and deleting the record related to the third data in the second information table.

Optionally, the deleting the data block corresponding to the third data may be setting the data block corresponding to the third data as unavailable; specifically, the data block of the third data is set as dirty (the file system in which the file A is located periodically clears the content of data blocks set as dirty and then resets them as usable). Optionally, the deleting the data block corresponding to the third data may alternatively include clearing the data content of the data block corresponding to the third data (specifically, setting the data content of the data block of the third data recorded in the data block table to null), and setting to 0 the bit in the bitmap that records the usage of the data block of the third data. Each bit of the bitmap records the usage of the data block corresponding to that bit: if the bit is 1, the data block has been allocated for storing data; if the bit is 0, the data block has not been allocated for storing data.
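The two deletion options described above (deferred cleanup through a dirty mark, or immediate clearing plus a bitmap reset) can be sketched with a toy allocator. The class name, its fields, and the `clean` pass are illustrative assumptions rather than the file system's actual structures.

```python
class BlockAllocator:
    """Toy allocator; the layout and names are illustrative assumptions."""

    def __init__(self, block_count: int) -> None:
        self.bitmap = [0] * block_count    # 1 = allocated, 0 = not allocated
        self.blocks = [b""] * block_count  # stands in for the data block table
        self.dirty = set()                 # blocks awaiting periodic cleanup

    def allocate(self, content: bytes) -> int:
        index = self.bitmap.index(0)  # raises ValueError when no block is free
        self.bitmap[index] = 1
        self.blocks[index] = content
        return index

    def mark_dirty(self, index: int) -> None:
        """Option 1: flag the block as unavailable and defer the cleanup."""
        self.dirty.add(index)

    def clean(self) -> None:
        """Periodic pass: clear dirty blocks and make them usable again."""
        for index in self.dirty:
            self.release(index)
        self.dirty.clear()

    def release(self, index: int) -> None:
        """Option 2: clear the content and reset the bitmap bit at once."""
        self.blocks[index] = b""
        self.bitmap[index] = 0
```

A dirty block keeps its bitmap bit at 1 until `clean` runs, so `allocate` cannot hand it out in the meantime.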

Optionally, if the second data block does not exist, the deleting the information related to the third data may further be to replace (modify) the content of the data block of the third data with the data content of the fourth data, and replace (modify) the checksum corresponding to the third data in the second information table with the checksum corresponding to the fourth data.

Illustratively, in the above modification of "meeting record 8" to "meeting record 8 (important meeting)", the third data is "meeting record 8", and the fourth data is "meeting record 8 (important meeting)". It is necessary to determine whether the number of times of reference of the data block corresponding to "meeting record 8" is equal to 1; if so, the data block corresponding to "meeting record 8" is referred to only by "meeting record 8", and "meeting record 8" may be directly modified or deleted. When "meeting record 8" is modified into "meeting record 8 (important meeting)" and "meeting record 8 (important meeting)" is stored, it needs to be further determined whether a data block storing "meeting record 8 (important meeting)" exists in the target disk partition and satisfies the deduplication condition; if so, the address information of that data block is used as the address information of the data block of the fourth data.

503, in a case where it is determined that the number of times of reference is not equal to 1, determining whether the number of times of reference is greater than 1.

504, in a case where it is determined that the number of times of reference is greater than 1, subtracting 1 from the number of times of reference of the data block corresponding to the third data in the second information table, and storing the fourth data in the file A.

Optionally, the storing the fourth data in the file A may be replacing (modifying) the address information of the data block of the third data corresponding to the logical block number recorded in an address entry of the inode of the file A with the address information of the data block of the fourth data.

Optionally, the storing the fourth data in the file A may alternatively include deleting the data block information corresponding to the third data under the logical block number recorded in the address entries of the inode of the file A, and recording the association relationship between the logical block number and the address information of the data block of the fourth data in the address entry following the last address entry that is currently used, among the address entries of the inode of the file A, for recording data block address information.

It is understood that a number of times of reference greater than 1 indicates that the data block corresponding to the third data is referred to in multiple places. Illustratively, the address information of the data block of the first data is the same as the address information of the data block of the third data; only one copy of the data is stored in the target disk partition, and the first data and the third data share that one data block. To ensure that the address information of the data block of the first data remains valid while the third data is modified, the third data cannot be modified in place. Instead, either a new data block is allocated for the fourth data, or a second data block is searched for (the checksum of the second data block is the same as the checksum of the fourth data, the data content of the second data block is the same as the data content of the fourth data, and the number of times of reference corresponding to the second data block is smaller than the first threshold) and the address information of the second data block is used as the address information of the data block of the fourth data.

For example, in the modification of "meeting record 8" (third data) to "meeting record 8 (important meeting)" (fourth data), it is necessary to determine whether the number of times of reference of the data block corresponding to "meeting record 8" is greater than 1, and if so, it indicates that the data block corresponding to "meeting record 8" is referred to by another data besides the "meeting record 8" (for example, the data block is referred to by the first data besides the "meeting record 8"), and in order to ensure that the use of the address information of the data block of the first data is not erroneous while the third data is modified, the third data cannot be modified directly.

It can be understood that, in the embodiment of the present application, the data deduplication method is applied to a readable and writable file system, and for a read-only file system, since the read-only file system has read-only right control and does not allow modification of file data content, the data deduplication method shown in fig. 5 in the present application is not applicable to the read-only file system or to a read-only partition in a target disk partition.

In the embodiment of the present application, data deduplication is also performed on the modified fourth data by checking whether a matching second data block is already stored in the target disk partition, so that data deduplication is performed on the file A in all scenarios where data needs to be stored (including the new-storage process shown in fig. 4A and the modification process shown in fig. 5), and the utilization rate of the storage space is improved.
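Steps 501 to 504 can be sketched on a toy model in which an inode is a dictionary from logical block number to block address. The `Store` class, the `FIRST_THRESHOLD` value, and the equality guard in `modify` are assumptions added for this illustration; checksums are omitted so that the reference-count logic stays visible.

```python
FIRST_THRESHOLD = 3  # hypothetical deduplication upper limit


class Store:
    """Toy target storage space; checksums are omitted for brevity."""

    def __init__(self) -> None:
        self.blocks = {}       # address -> {"content": bytes, "refs": int}
        self.next_address = 0

    def store(self, content: bytes) -> int:
        """Steps 402 to 407 in miniature: reuse a matching block or allocate."""
        for address, block in self.blocks.items():
            if block["content"] == content and block["refs"] < FIRST_THRESHOLD:
                block["refs"] += 1
                return address
        address = self.next_address
        self.next_address += 1
        self.blocks[address] = {"content": content, "refs": 1}
        return address

    def modify(self, inode: dict, lbn: int, fourth_data: bytes) -> None:
        """Replace the data at logical block lbn with fourth_data."""
        third_address = inode[lbn]
        third_block = self.blocks[third_address]
        if third_block["content"] == fourth_data:
            return  # guard added for this sketch: nothing to change
        if third_block["refs"] == 1:
            # Step 502: store the fourth data, then delete the third data block.
            inode[lbn] = self.store(fourth_data)
            del self.blocks[third_address]
        elif third_block["refs"] > 1:
            # Step 504: other data still uses the block, so drop one reference.
            third_block["refs"] -= 1
            inode[lbn] = self.store(fourth_data)
```

When two inodes share one block and one of them is modified, the shared block survives with its count reduced by 1, and the modified inode points at a new or deduplicated block.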

The embodiment in fig. 4A-4B describes a data deduplication method for storing the file A, and the embodiment in fig. 5 describes a data deduplication method for modifying the third data in the file A. A data deduplication method for deleting fifth data in the file A is described in detail below.

Referring to fig. 6, fig. 6 is a schematic flowchart of a data deduplication method for deleting fifth data according to the present application. As shown in fig. 6, the method comprises the steps of:

601, after receiving an instruction to delete the fifth data in the file A, obtaining, from the second information table, the number of times of reference of the data block corresponding to the fifth data, and determining whether the number of times of reference is equal to 1.

It is understood that the number of times of reference of the data block corresponding to the fifth data may also be referred to as a third number of times of reference described in other embodiments of the present application.

In this embodiment of the present application, the instruction includes a target start byte and a target end byte of the data content of the fifth data in the file A, and the address information of the data block corresponding to the fifth data is looked up according to the logical block numbers occupied by the target start byte and the target end byte and the target index table. For convenience of description, the data block corresponding to the fifth data is referred to as a fifth data block.

It is understood that the method for searching the number of times of reference of the fifth data block in the second information table is the same as the method for searching the number of times of reference of the third data block in the second information table in step 501, and will not be described in detail here.

For example, as shown in fig. 2F and fig. 2G, the memo record corresponding to "meeting record 3" is denoted as file A. After the user deletes the data content "meeting record 3" in the file A and clicks the save control 501, the electronic device receives an instruction to delete "meeting record 3" (fifth data).

602, in a case where it is determined that the number of times of reference is equal to 1, deleting the information related to the fifth data.

Deleting the information related to the fifth data includes: deleting the fifth data block corresponding to the fifth data, deleting the data block information corresponding to the fifth data under the logical block number recorded in the address entries of the inode of the file A, and deleting the record corresponding to the fifth data in the second information table. It is understood that the fifth data block is deleted in the same way as the data block corresponding to the third data, which will not be described in detail here.

It can be understood that the number of times that the data block corresponding to the fifth data is referred to being equal to 1 indicates that the data block corresponding to the fifth data is referred to only by the fifth data, so the fifth data can be deleted directly without affecting the use of other data of the file A.

For example, in the deleting of the "meeting record 3" (fifth data), it needs to be determined whether the number of times that the data block corresponding to the fifth data is referred to is equal to 1, if so, it indicates that the data block corresponding to the fifth data is referred to only by the fifth data, and the deleting operation may be directly performed on the fifth data.

603, in a case where it is determined that the number of times of being referred is not equal to 1, it is determined whether the number of times of being referred is greater than 1.

604, in a case where it is determined that the number of times of reference is greater than 1, deleting the data block information corresponding to the fifth data under the logical block number recorded in the address entries of the inode of the file A, and subtracting 1 from the number of times of reference corresponding to the fifth data block.

Understandably, the data block information corresponding to the fifth data includes the association relationship between the logical block number corresponding to the fifth data in the address entries of the inode of the file A and the address information of the data block corresponding to the fifth data.

It can be understood that the data content of the fifth data here is the entire data content of the fifth data block; the case where the data to be deleted is only part of the data content of the fifth data block belongs to the case of modifying the third data shown in fig. 5.

For example, in the deleting of "meeting record 3" (fifth data), it is required to determine whether the number of times that the data block corresponding to the fifth data is referred to is greater than 1. If so (for example, the memo stores two pieces of data whose content is "meeting record 3"), the data block corresponding to the fifth data is referred to by other data besides the fifth data (for example, the first data and the fifth data share one data block). Then, to ensure that the address information of the data block of the first data remains valid while the fifth data is deleted, the data block corresponding to the fifth data cannot be deleted directly.

It can be understood that, in the embodiment of the present application, data deduplication is performed on a readable and writable file system, and for a read-only file system, since the read-only file system has read-only permission control and is not allowed to delete the file data content, the data deduplication method shown in fig. 6 in the present application is not applicable to the read-only file system or the read-only partition in the target disk partition.

In the embodiment of the present application, when the data block corresponding to the fifth data is referred to in multiple places and the fifth data needs to be deleted, the data block corresponding to the fifth data is retained and only the record, in the target direct index table, of the data block address information under the logical block number corresponding to the fifth data is deleted. This ensures that other data referring to the same data block is not affected and guarantees the data integrity of the file system under the deduplication mechanism.
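Steps 601 to 604 reduce to a small reference-counting routine. In the sketch below the inode is modeled as a dictionary from logical block number to block address and the second information table as a dictionary keyed by block address; both layouts and the function name are assumptions made for illustration.

```python
def delete_fifth_data(inode: dict, lbn: int, table: dict) -> None:
    """Delete the data at logical block lbn of a file.

    inode maps logical block numbers to block addresses; table maps a block
    address to {"checksum": int, "refs": int, "content": bytes}.
    """
    address = inode.pop(lbn)  # remove the address entry in both branches
    row = table[address]
    if row["refs"] == 1:
        del table[address]    # step 602: delete the block and its record
    elif row["refs"] > 1:
        row["refs"] -= 1      # step 604: keep the block for its other users
```

Only the last remaining reference actually frees the block, which is what keeps data shared with other files intact.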

It can be understood that the application can also provide permission control for the data deduplication function: a user can choose to turn the data deduplication function of the device on or off, and the function is turned on by default. When the data deduplication function of the device is turned on, the device executes the data deduplication method provided by the application.

Referring to fig. 7, another data deduplication method is provided. As shown in fig. 7, the data deduplication method includes the following steps:

701, after receiving an instruction of storing a file A, creating an index of the file A, wherein the index is used for recording an association relationship between data and data blocks for storing the data;

Optionally, the creating the index of the file A may be creating an index node (inode) of the file A according to the size of the file A; for how to create the inode of the file A according to the size of the file A, please refer to the description of other embodiments of the present application.

Optionally, creating the index of the file a may also be creating a target index structure of the file a, where the target index structure is an index structure other than an inode. For example, the target index structure may be a hash index, a b + tree index, and the like, which is not limited in this embodiment of the present application.

702, calculating the feature identifier of the first data in the file A.

Optionally, the feature identifier may be a first feature identifier. When the first feature identifiers of two pieces of data are different, the data contents of the two pieces of data are necessarily different; when the first feature identifiers of two pieces of data are identical, the data contents of the two pieces of data may be the same. Illustratively, the first feature identifier may be a checksum. For the definition of the checksum and the calculation method thereof, please refer to the above description.

Optionally, the feature identifier may be a second feature identifier, and the second feature identifier uniquely identifies one piece of data. That is, when the second feature identifiers of two pieces of data are identical, the data contents of the two pieces of data are the same; when the second feature identifiers of two pieces of data are different, the data contents of the two pieces of data are different. For example, the second feature identifier may be calculated by an algorithm such as a hash algorithm or a fingerprint algorithm, or by another algorithm with a smaller amount of calculation and a smaller performance loss than the hash algorithm or the fingerprint algorithm.
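The difference between the two kinds of feature identifier can be illustrated as follows. The additive checksum and the use of SHA-256 are assumptions chosen for the example; the text only names "a checksum" and "a hash algorithm or a fingerprint algorithm". A cryptographic digest is treated here as unique for practical purposes, although collisions are theoretically possible.

```python
import hashlib


def first_feature_identifier(data: bytes) -> int:
    """Additive checksum: cheap, but different contents can share a value."""
    return sum(data) & 0xFFFF


def second_feature_identifier(data: bytes) -> str:
    """SHA-256 digest, treated as unique for practical purposes."""
    return hashlib.sha256(data).hexdigest()


a, b = b"ab", b"ba"  # same bytes in a different order

# Equal checksums only mean the contents MAY be equal, so a content
# comparison is still required before deduplicating.
same_checksum = first_feature_identifier(a) == first_feature_identifier(b)

# Different digests mean the contents differ; equal digests are taken as
# equal contents without a further comparison.
same_digest = second_feature_identifier(a) == second_feature_identifier(b)
```

This is why a method based on the first feature identifier needs the extra content comparison, while a method based on the second feature identifier trades that comparison for a more expensive identifier computation.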

For a detailed description of the first data, please refer to other embodiments of the present application.

703, determining whether a second data block exists in the target storage space, wherein the characteristic identifier of the second data block is the same as the characteristic identifier of the first data.

Optionally, when the above-mentioned data deduplication upper limit is not considered, determining whether the second data block exists in the target storage space may specifically be: determining, according to the first information table, whether a second data block whose feature identifier is the same as that of the first data exists in the target storage space. Specifically, the loop length is set and the first information table is searched for the second data block. The specific method is similar to that of step 403, in which the loop length is set and the second information table is searched for a second data block (the checksum of the second data block is the same as the checksum of the first data), and will not be described in detail here.

Optionally, when the data deduplication upper limit is considered, determining whether the second data block exists in the target storage space may specifically be: determining whether the second data block exists in the target storage space according to the repetition count table. It is understood that the repetition count table may be the second information table shown in fig. 3C or the second information table shown in fig. 3D; for the description of the second information table, please refer to other embodiments of the present application.

For a description of the target storage space, please refer to other embodiments of the present application.

704, in a case where it is determined that the second data block exists, storing the address information of the second data block into the index as the data block address information of the first data.

Optionally, when the feature identifier of the first data is the first feature identifier, after determining that the second data block exists (the first feature identifier of the second data block is the same as the first feature identifier of the first data), it is further required to determine whether the data content of the second data block is the same as the data content of the first data. For how to determine whether the data content of the second data block is the same as the data content of the first data, refer to other embodiments of the present application (e.g., step 404 shown in fig. 4A).

Optionally, before storing the address information of the second data block in the index as the data block address information of the first data, the method further includes: determining whether the second data block satisfies a data deduplication ceiling; specifically, it is determined whether the referenced number of times of the second data chunk is less than a first threshold according to the repetition count table. The storing the address information of the second data block as the address information of the data block of the first data into the index when it is determined that the second data block exists includes: in a case where it is determined that the second data block exists and the number of times the second data block is referred to is less than a first threshold, storing address information of the second data block as data block address information of the first data in an index. It can be understood that the step 703 of determining whether the second data block exists in the target storage space and the step of determining whether the second data block meets the data deduplication upper limit may be executed together or sequentially, and the order of the steps is not limited.
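Steps 701 to 704, combined with the optional upper-limit check, can be sketched as follows. The sketch assumes the second feature identifier is a SHA-256 digest, so no content comparison is performed; the `store_file` function, the list-based index, the layout of the repetition count table, and `FIRST_THRESHOLD` are all hypothetical.

```python
import hashlib

FIRST_THRESHOLD = 3  # hypothetical data deduplication upper limit


def store_file(chunks, index, count_table, blocks):
    """Store a file's data piece by piece (steps 702 to 704).

    index:       list receiving the block address of each piece (step 701)
    count_table: feature identifier -> list of [address, ref_count] entries
    blocks:      list standing in for the target storage space
    """
    for chunk in chunks:
        feature = hashlib.sha256(chunk).hexdigest()        # step 702
        for entry in count_table.setdefault(feature, []):  # step 703
            if entry[1] < FIRST_THRESHOLD:                 # upper-limit check
                entry[1] += 1
                index.append(entry[0])                     # step 704
                break
        else:
            # No usable second data block: allocate a new data block.
            blocks.append(chunk)
            count_table[feature].append([len(blocks) - 1, 1])
            index.append(len(blocks) - 1)
```

Once a block reaches the threshold, the next identical piece gets a fresh block, so several physical copies of hot data can coexist and share the access load.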

For the descriptions of the number of times of reference of the second data block, the first threshold, and the address information of the data block, please refer to other embodiments of the present application.

It can be understood that the embodiments of the present application can also incorporate the data deduplication method as shown in fig. 5; illustratively, the fourth data of step 502 in fig. 5 is used as the first data in step 702, that is, the modified fourth data obtained after the method steps of 501 to 504 shown in fig. 5 are executed is used as the first data in the method shown in fig. 7, so as to determine how to store the fourth data (determining how to store the fourth data can be specifically realized by executing steps 701 to 704).

It can be understood that the embodiments of the present application can also be combined with the data deduplication method shown in fig. 6; illustratively, the fifth data in step 601 in fig. 6 is the data content in file a in the data deduplication method shown in fig. 7, and the embodiment of the present application may also include the data deduplication method shown in fig. 6 in addition to providing the data deduplication method shown in fig. 7.

It is understood that the method provided by the above embodiments of the present application can be executed by any electronic device that stores data by using data blocks in a target disk partition. Exemplary electronic devices include mobile terminals, tablet computers, desktop computers, laptop computers, handheld computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, and cellular phones, among others.

For example, refer to fig. 8, which is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application. The following description takes a mobile terminal as an example of the electronic device.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.

It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.

The controller may be the nerve center and command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal, so as to control instruction fetching and instruction execution.

A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby increasing the efficiency of the system.

In some embodiments, processor 110 may include one or more interfaces. The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.

The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K via an I2C interface, such that the processor 110 and the touch sensor 180K communicate via an I2C bus interface to implement the touch functionality of the electronic device 100.

The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.

The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.

The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a bluetooth headset.

MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.

The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.

The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device. It may also be used to connect an earphone and play audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.

It should be understood that the interface connection relationship between the modules illustrated in the embodiment of the present application is only an example and does not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt an interface connection manner different from those in the above embodiments, or a combination of multiple interface connection manners.

The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.

The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.

The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.

The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 150 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.

The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional modules, independent of the processor 110.

The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 for radiation.

In some embodiments, antenna 1 of electronic device 100 is coupled to mobile communication module 150 and antenna 2 is coupled to wireless communication module 160 so that electronic device 100 can communicate with networks and other devices through wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).

The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, with N being a positive integer greater than 1.

The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.

The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.

The camera 193 is used to capture still images or video. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the ISP, where it is converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.

The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform fourier transform or the like on the frequency bin energy.

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.

The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like. The NPU can also realize the decision model provided by the embodiment of the application.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.

The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like. The data storage area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.

The electronic device 100 may implement audio functions, such as music playing and recording, via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor.

The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.

The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. A user can listen to music or to a hands-free call through the speaker 170A of the electronic device 100.

The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic device 100 receives a call or voice information, the user can listen to the voice by placing the receiver 170B close to the ear.

The microphone 170C, also referred to as a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can input a voice signal to the microphone 170C by speaking with the mouth close to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and so on.

The headphone interface 170D is used to connect a wired headphone. The headphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association (CTIA) standard interface.

The pressure sensor 180A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensor 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic device 100 may also calculate the touched position from the detection signal of the pressure sensor 180A. In some embodiments, touch operations that are applied to the same touch position but have different touch operation intensities may correspond to different operation instructions. For example: when a touch operation with a touch operation intensity smaller than the first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.
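The intensity-dependent dispatch in the short message example can be sketched as below. The threshold value, the normalized intensity scale, and the instruction names are hypothetical placeholders chosen only for illustration.

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # assumed value on a normalized 0..1 scale


def sms_icon_instruction(touch_intensity: float) -> str:
    """Map the touch intensity on the short message icon to an instruction."""
    if touch_intensity < FIRST_PRESSURE_THRESHOLD:
        # Light press: view the short message.
        return "view_message"
    # Intensity greater than or equal to the threshold: create a new message.
    return "create_message"
```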

The gyroscope sensor 180B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 180B. The gyroscope sensor 180B may be used for anti-shake during photographing. For example, when the shutter is pressed, the gyroscope sensor 180B detects a shake angle of the electronic device 100, calculates a distance to be compensated for by the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 through a reverse movement, thereby achieving anti-shake. The gyroscope sensor 180B may also be used for navigation and somatosensory gaming scenarios.

The air pressure sensor 180C is used to measure air pressure. In some embodiments, the electronic device 100 calculates altitude from the air pressure value measured by the air pressure sensor 180C, to assist positioning and navigation.
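The present application does not state how altitude is derived from air pressure. One commonly used approximation is the international barometric formula, sketched below; the formula choice, the sea-level reference pressure, and the function name are assumptions and not part of the described embodiment.

```python
SEA_LEVEL_HPA = 1013.25  # assumed standard sea-level reference pressure


def altitude_m(pressure_hpa: float, sea_level_hpa: float = SEA_LEVEL_HPA) -> float:
    """Approximate altitude in meters from air pressure in hPa.

    Uses h = 44330 * (1 - (p / p0) ** (1 / 5.255)), which is a reasonable
    approximation only in the lower atmosphere.
    """
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))
```

In practice the reference pressure varies with the weather, so an absolute altitude computed this way is typically only an aid to other positioning sources.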

The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip leather case using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip cover according to the magnetic sensor 180D. Features such as automatic unlocking upon flipping open can then be set according to the detected opening and closing state of the leather case or of the flip cover.

The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The sensor can also be used for recognizing the posture of the electronic device, and is applied to horizontal and vertical screen switching, pedometers and other applications.

The distance sensor 180F is used for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, in a shooting scenario, the electronic device 100 may use the distance sensor 180F to measure distance for fast focusing.

The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light to the outside through the light emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there are no objects near the electronic device 100. The electronic device 100 can utilize the proximity light sensor 180G to detect that the user holds the electronic device 100 close to the ear for talking, so as to automatically turn off the screen to achieve the purpose of saving power. The proximity light sensor 180G may also be used in a leather case mode and a pocket mode to automatically unlock and lock the screen.

The ambient light sensor 180L is used to sense the ambient light level. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches.

The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint-triggered photographing, fingerprint-based answering of incoming calls, and so on.

The temperature sensor 180J is used to detect temperature. In some embodiments, electronic device 100 implements a temperature processing strategy using the temperature detected by temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 reduces the performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, the electronic device 100 heats the battery 142 when the temperature is below another threshold, to prevent the low temperature from causing the electronic device 100 to shut down abnormally. In other embodiments, when the temperature is lower than a further threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown due to low temperature.
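A temperature processing strategy of this kind can be sketched as a simple threshold dispatch. The text presents the three behaviors as separate embodiments; combining them in one function, the ordering of the low-temperature branches, the threshold values, and the action names are all assumptions made for illustration.

```python
def thermal_action(temp_c: float,
                   high_threshold: float = 45.0,
                   heat_threshold: float = 0.0,
                   boost_threshold: float = -10.0) -> str:
    """Choose a hypothetical action from the reported temperature in Celsius."""
    if temp_c > high_threshold:
        # Too hot: lower the performance of the nearby processor.
        return "reduce_processor_performance"
    if temp_c < boost_threshold:
        # Very cold: boost the battery output voltage.
        return "boost_battery_output_voltage"
    if temp_c < heat_threshold:
        # Cold: heat the battery to avoid abnormal shutdown.
        return "heat_battery"
    return "no_action"
```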

The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touch screen, also called a "touch-controlled screen". The touch sensor 180K is used to detect a touch operation applied on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, at a position different from that of the display screen 194.

The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire a vibration signal of the bone mass vibrated by the human vocal part. The bone conduction sensor 180M may also contact the human pulse to receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset to form a bone conduction headset. The audio module 170 may parse out a voice signal based on the vibration signal of the vocal-part bone mass acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor can parse heart rate information based on the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to realize the heart rate detection function.

The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive a key input, and generate a key signal input related to user settings and function control of the electronic device 100.

The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. Touch operations applied to different areas of the display screen 194 may also correspond to different vibration feedback effects of the motor 191. Different application scenarios (such as time reminding, receiving information, alarm clock, game and the like) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.

The indicator 192 may be an indicator light, which may be used to indicate a charging state or a change in battery level, or to indicate a message, a missed call, a notification, and the like.

The SIM card interface 195 is used to connect a SIM card. The SIM card can be inserted into or removed from the SIM card interface 195 to bring it into or out of contact with the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a SIM card, etc. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards. The SIM card interface 195 may also be compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.

As used in the above embodiments, the term "when …" may be interpreted to mean "if …" or "after …" or "in response to a determination of …" or "in response to a detection of …", depending on the context. Similarly, depending on the context, the phrase "at the time of determination …" or "if (a stated condition or event) is detected" may be interpreted to mean "if the determination …" or "in response to the determination …" or "upon detection (a stated condition or event)" or "in response to detection (a stated condition or event)".

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others.

One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
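The deduplicating write path of the method can be illustrated with a minimal in-memory sketch. It is not the implementation of the embodiments: the names `DedupStore`, `write`, and `REF_LIMIT` are hypothetical, the index node and the repeat count table are modeled as plain dictionaries, CRC32 stands in for the checksum, and a direct byte comparison stands in for the content check. Each data name is assumed to be written only once.

```python
import zlib
from dataclasses import dataclass, field

# Hypothetical value for the "first threshold" on how often one block may be referenced.
REF_LIMIT = 3


@dataclass
class DedupStore:
    blocks: dict = field(default_factory=dict)       # block address -> stored bytes
    by_checksum: dict = field(default_factory=dict)  # CRC32 -> list of block addresses
    ref_count: dict = field(default_factory=dict)    # repeat count table: address -> references
    inode: dict = field(default_factory=dict)        # index node: data name -> block address
    next_addr: int = 0

    def _allocate(self, name, data, crc):
        """Store `data` in a newly allocated block and record it for `name`."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        self.by_checksum.setdefault(crc, []).append(addr)
        self.ref_count[addr] = 1
        self.inode[name] = addr
        return addr

    def write(self, name, data):
        """Write `data` under `name`, reusing an existing block when possible."""
        crc = zlib.crc32(data)
        # Only blocks whose checksum matches are candidates, so full content
        # comparisons are limited to likely duplicates.
        for addr in self.by_checksum.get(crc, []):
            same_content = self.blocks[addr] == data
            if same_content and self.ref_count[addr] < REF_LIMIT:
                self.ref_count[addr] += 1
                self.inode[name] = addr  # reference the existing block
                return addr
        # No usable duplicate: checksum differs, content differs, or every
        # matching block has already reached the reference limit.
        return self._allocate(name, data, crc)
```

With `REF_LIMIT = 3`, writing the same content four times under different names yields two blocks: the first three writes share one block, and the fourth allocates a new one because the shared block has reached the limit.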

Claims

1. A method for data deduplication, comprising:

acquiring first data to be stored in a target storage space, wherein M data blocks have been stored in the target storage space, and M is a positive integer;

calculating a checksum of the data content of the first data;

in a case where it is determined that the checksum of the data content of the first data is different from the checksum of the data content of each of the M data blocks, allocating a first data block to the first data, and storing the first data into the first data block;

and in a case where it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of a second data block of the M data blocks, and that the data content of the second data block is the same as the data content of the first data, recording address information of the second data block into an index node as address information of the data block storing the first data, wherein the index node is used for recording an association between data and the address information of the data block storing the data.

2. The method of claim 1, wherein the recording address information of a second data block of the M data blocks into an index node as address information of a data block storing the first data in a case where it is determined that the checksum of the data content of the first data is identical to the checksum of the data content of the second data block and the data content of the second data block is identical to the data content of the first data comprises:

in a case where it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of a second data block of the M data blocks, that the data content of the second data block is the same as the data content of the first data, and that a first referenced count of the second data block is less than a first threshold, recording the address information of the second data block into an index node as the address information of the data block storing the first data, and adding 1 to the first referenced count; wherein the first referenced count is the number of times the address information of the second data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording an association between a data block and the number of times the address information of the data block has been referenced.

3. The method of claim 1 or 2, wherein the method further comprises:

in a case where it is determined that the checksums of the data content of N second data blocks of the M data blocks are the same as the checksum of the data content of the first data, and that the first referenced counts of the N second data blocks are all greater than or equal to a first threshold, allocating a first data block to the first data, and storing the first data into the first data block; wherein N is an integer less than or equal to M, the first referenced count is the number of times the address information of a second data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording an association between a data block and the number of times the address information of the data block has been referenced.

4. The method of any of claims 1-3, wherein prior to said obtaining the first data to be stored in the target storage space, the method further comprises:

after an instruction to modify third data into fourth data is received, acquiring a second referenced count of a third data block corresponding to the third data; wherein the second referenced count is the number of times the address information of the third data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording an association between a data block and the number of times the address information of the data block has been referenced;

in a case where it is determined that the second referenced count is equal to 1, taking the fourth data as the first data, and deleting information related to the third data;

in a case where it is determined that the second referenced count is greater than 1, taking the fourth data as the first data, and subtracting 1 from the second referenced count in the repeat count table.

5. The method of any one of claims 1-4, further comprising:

after an instruction to delete fifth data is received, acquiring a third referenced count of a fifth data block corresponding to the fifth data; wherein the third referenced count is the number of times the address information of the fifth data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording an association between a data block and the number of times the address information of the data block has been referenced;

in a case where it is determined that the third referenced count is equal to 1, deleting information related to the fifth data;

in a case where it is determined that the third referenced count is greater than 1, subtracting 1 from the third referenced count in the repeat count table.

6. The method of claim 1, wherein the recording address information of a second data block of the M data blocks into an index node as address information of a data block storing the first data in a case where it is determined that the checksum of the data content of the first data is identical to the checksum of the data content of the second data block and the data content of the second data block is identical to the data content of the first data comprises:

calculating a first identifier of the data content of the first data and a second identifier of the data content of a second data block of the M data blocks in the case that it is determined that the checksum of the data content of the first data is the same as the checksum of the data content of the second data block;

determining that the data content of the second data block is the same as the data content of the first data in the case that the first identifier is determined to be the same as the second identifier;

and in the case that the data content of the second data block is determined to be the same as the data content of the first data, recording the address information of the second data block into an index node as the address information of the data block storing the first data.

7. The method of any of claims 1-6, wherein the checksum comprises a cyclic redundancy check code.

8. An electronic device, characterized in that the electronic device comprises: one or more processors, memory, and a display screen;

the memory coupled with the one or more processors, the memory to store computer program code, the computer program code including computer instructions, the one or more processors to invoke the computer instructions to cause the electronic device to perform:

calculating a checksum of the data content of the first data;

9. The electronic device of claim 8, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:

10. The electronic device of claim 8 or 9, wherein the one or more processors are further to invoke the computer instructions to cause the electronic device to perform:

11. The electronic device of any of claims 8-10, wherein the one or more processors are further to invoke the computer instructions to cause the electronic device to perform:

before the first data to be stored in the target storage space is acquired, and after an instruction to modify third data into fourth data is received, acquiring a second referenced count of a third data block corresponding to the third data; wherein the second referenced count is the number of times the address information of the third data block has been referenced, as recorded in a repeat count table, and the repeat count table is used for recording an association between a data block and the number of times the address information of the data block has been referenced;

12. The electronic device of any of claims 8-11, wherein the one or more processors are further to invoke the computer instructions to cause the electronic device to perform:

13. The electronic device of claim 8, wherein the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform:

14. The electronic device of any of claims 8-13, wherein the checksum comprises a cyclic redundancy check code.

15. A chip system for application to an electronic device, the chip system comprising one or more processors for invoking computer instructions to cause the electronic device to perform the method of any of claims 1-7.

16. A computer program product comprising instructions for causing an electronic device to perform the method according to any of claims 1-7 when the computer program product is run on the electronic device.

17. A computer-readable storage medium comprising instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-7.
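The reference-count handling recited for deletion and modification (claims 4 and 5) can be illustrated with a short sketch. It is an illustration under stated assumptions, not the claimed implementation: `release` and `modify` are hypothetical helper names, the index node, repeat count table, and block storage are modeled as plain dictionaries, and the `write` argument is any caller-supplied function that stores new data through the deduplicating write path.

```python
def release(inode, ref_count, blocks, name):
    """Drop the reference that `name` holds on its data block.

    If the block is referenced exactly once, its information is deleted;
    otherwise its entry in the repeat count table is decremented by 1.
    """
    addr = inode.pop(name)
    if ref_count[addr] > 1:
        ref_count[addr] -= 1   # other data still reference this block
    else:
        del ref_count[addr]    # last reference: free the block
        del blocks[addr]
    return addr


def modify(inode, ref_count, blocks, name, new_data, write):
    """Replace the data stored under `name` with `new_data`.

    The old reference is released first; the new data is then handled as
    fresh data to be stored. `write` is a hypothetical callable
    `write(name, data) -> block address` supplied by the caller.
    """
    release(inode, ref_count, blocks, name)
    return write(name, new_data)
```

A complete store would also remove a freed block from any checksum index it maintains; that bookkeeping is omitted here to keep the sketch short.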