CN112463077B

CN112463077B - Data block processing method, device, equipment and storage medium

Info

Publication number: CN112463077B
Application number: CN202011487919.XA
Authority: CN
Inventors: 高华龙
Original assignee: Beijing Yunkuanzhiye Network Technology Co ltd
Current assignee: Beijing Yunkuanzhiye Network Technology Co ltd
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2021-11-12
Anticipated expiration: 2040-12-16
Also published as: CN112463077A

Abstract

The application provides a data block processing method, a device, equipment and a storage medium, wherein a target storage unit corresponding to a target logical address is determined through the target logical address of a target data block to be processed, the target storage unit comprises bitmap information, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block. Further, according to a flag bit corresponding to the target logical address in the bitmap information, a target physical address corresponding to the target logical address is determined, where the flag bit is used to indicate whether the target data block is a duplicate data block, and the target data block is processed according to the target physical address. Under the condition of mass data storage, the index efficiency of the target data block can be effectively improved, so that the application of the deduplication technology in a mass storage system is promoted.

Description

Data block processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data block processing method, apparatus, device, and storage medium.

Background

With the development of computer technology, people generate large amounts of data in work and daily life, and this data is generally stored in a storage system. However, since redundant data may exist in the storage system, the duplicate data needs to be removed, i.e., deduplicated.

In the related art, deduplication generally divides a data file into a plurality of data blocks, calculates fingerprint information for each of the data blocks, and performs a hash lookup using the fingerprint information of each data block as its key, so as to determine whether the data block is a duplicate data block.

However, in the case of mass data storage, the indexing efficiency for the data blocks is low, which limits the application of deduplication technology in mass storage systems.

Disclosure of Invention

The embodiment of the application provides a data block processing method, a device, equipment and a storage medium, which are used for solving the problems in the related technology, and the technical scheme is as follows:

in a first aspect, an embodiment of the present application provides a data block processing method, including:

determining a target storage unit corresponding to a target logical address according to the target logical address of a target data block to be processed, wherein the target storage unit comprises bitmap information, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block;

determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information, wherein the flag bit is used for indicating whether the target data block is a repeated data block;

and processing the target data block according to the target physical address.

In one embodiment, determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information includes:

setting a flag bit corresponding to the target logical address in the bitmap information as a first identifier corresponding to the non-repeated data block under the condition that the target data block is the non-repeated data block;

and determining a target physical address corresponding to the target logical address according to the starting position of the physical address corresponding to the non-duplicated data block corresponding to the target storage unit and the number of the first identifiers in the bitmap information.

and under the condition that the target storage unit is in the memory, determining a target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.
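The embodiment above that derives the target physical address from a physical start position and the number of first identifiers can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the application: the unit size of 64 logical addresses, the `StorageUnit` fields, the convention that a set bit is the first identifier (non-duplicate), the use of the count of first identifiers below the target position, and the premise that non-duplicate blocks of a unit arrive in ascending logical order.

```python
from dataclasses import dataclass

UNIT_SIZE = 64  # assumed number of logical addresses covered by one storage unit


@dataclass
class StorageUnit:
    logical_start: int   # first logical address covered by this unit (assumed field)
    physical_start: int  # start of the physical range for non-duplicate blocks
    bitmap: int = 0      # bit i set: block at logical_start + i is non-duplicate


def unit_index(logical_addr: int) -> int:
    """Map a logical address to the index of the storage unit that covers it."""
    return logical_addr // UNIT_SIZE


def mark_non_duplicate(unit: StorageUnit, logical_addr: int) -> int:
    """Set the first identifier for logical_addr and return its physical address."""
    offset = logical_addr - unit.logical_start
    if not 0 <= offset < UNIT_SIZE:
        raise ValueError("logical address is outside this storage unit")
    unit.bitmap |= 1 << offset
    # Count the first identifiers that precede the target position.
    preceding = bin(unit.bitmap & ((1 << offset) - 1)).count("1")
    return unit.physical_start + preceding
```

Because only one integer bitmap and two start positions are kept per unit, the lookup needs no per-block table for non-duplicate blocks, which is the space saving the application attributes to bitmap information.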

In one embodiment, the method further comprises:

determining whether the number of the existing storage units in the memory is greater than or equal to a preset threshold value or not under the condition that the target storage unit is not in the memory;

and deleting, from the memory, the storage units that have not been accessed within a preset time, under the condition that the number of existing storage units in the memory is greater than or equal to the preset threshold value.

In one embodiment, the method further comprises:

under the condition that the number of the existing storage units in the memory is smaller than a preset threshold value, acquiring the target storage unit from a first preset storage area;

and reading the target storage unit into the memory.
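The two embodiments above (threshold check, deletion of idle storage units, and loading from the first preset storage area) can be sketched as a small in-memory cache. This is an illustrative sketch only: the class `UnitCache`, its parameters, and the use of a dict as a stand-in for the first preset storage area are hypothetical. The application does not state what happens when the threshold is reached but no unit is idle; this sketch simply loads the unit anyway.

```python
import time


class UnitCache:
    """Hypothetical in-memory cache of storage units."""

    def __init__(self, max_units, idle_seconds, backing_store, clock=time.monotonic):
        self.max_units = max_units          # the preset threshold
        self.idle_seconds = idle_seconds    # the preset time
        self.backing_store = backing_store  # stand-in for the first preset storage area
        self.clock = clock
        self.units = {}                     # unit id -> storage unit held in memory
        self.last_access = {}               # unit id -> time of last access

    def get(self, unit_id):
        now = self.clock()
        if unit_id not in self.units:
            # Threshold reached: delete units not accessed within the preset time.
            if len(self.units) >= self.max_units:
                self._evict_idle(now)
            # Read the target unit from the backing store into memory.
            self.units[unit_id] = self.backing_store[unit_id]
        self.last_access[unit_id] = now
        return self.units[unit_id]

    def _evict_idle(self, now):
        idle = [u for u, t in self.last_access.items()
                if now - t >= self.idle_seconds]
        for u in idle:
            del self.units[u]
            del self.last_access[u]
```

Passing the clock as a parameter is a design choice of the sketch so that the idle-time rule can be exercised deterministically.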

In one embodiment, after the target storage unit is read into the memory, the method further includes:

determining whether the target logical address is a first logical address in a plurality of logical addresses corresponding to the target storage unit;

under the condition that the target logical address is the first logical address in a plurality of logical addresses corresponding to the target storage unit, updating the used physical address to obtain an updated used physical address;

the used physical address is taken as the starting position of the first physical address included in the target storage unit.

determining whether a flag bit corresponding to the target logical address in the bitmap information is a second identifier corresponding to the repeated data block when the target data block is the repeated data block;

and under the condition that the flag bit corresponding to the target logical address in the bitmap information is the second identifier, querying the target physical address corresponding to the target logical address from a second preset storage area.

In one embodiment, the target storage unit further comprises: a first physical address starting position and a physical address number; the method further comprises the following steps:

under the condition that the flag bit corresponding to the target logical address in the bitmap information is not the second identifier, determining a plurality of triples according to the bitmap information, the starting position of the first logical address corresponding to the target storage unit, the starting position of the first physical address and the number of the physical addresses;

determining a target triple corresponding to the target logical address from the triples, wherein the target triple comprises a second logical address starting position and a second physical address starting position;

and determining the target physical address corresponding to the target logical address according to the target logical address, the starting position of the second logical address and the starting position of the second physical address.
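One possible reading of the triple-based lookup above is sketched below. The application does not spell out how the triples are derived, so the following are assumptions: each triple is `(logical_start, physical_start, length)`, one triple is produced per run of consecutive set bits in the bitmap, the runs are packed contiguously from the first physical address start position, and the number of physical addresses is used only as a consistency check. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[int, int, int]  # (logical_start, physical_start, length), assumed layout


def build_triples(bitmap: int, unit_size: int, logical_start: int,
                  physical_start: int, physical_count: int) -> List[Triple]:
    """Turn each run of set bits into one (logical, physical, length) triple."""
    triples = []
    next_physical = physical_start
    i = 0
    while i < unit_size:
        if (bitmap >> i) & 1:
            run_start = i
            while i < unit_size and (bitmap >> i) & 1:
                i += 1
            length = i - run_start
            triples.append((logical_start + run_start, next_physical, length))
            next_physical += length
        else:
            i += 1
    if next_physical - physical_start != physical_count:
        raise ValueError("bitmap and physical address count disagree")
    return triples


def resolve(triples: List[Triple], logical_addr: int) -> Optional[int]:
    """Find the triple covering logical_addr and translate it to a physical address."""
    for l_start, p_start, length in triples:
        if l_start <= logical_addr < l_start + length:
            # Offset within the run is the same on the logical and physical side.
            return p_start + (logical_addr - l_start)
    return None
```

Under these assumptions the target physical address is the second physical address start position plus the distance of the target logical address from the second logical address start position.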

In one embodiment, after determining the target physical address corresponding to the target logical address according to the target logical address, the starting position of the second logical address, and the starting position of the second physical address, the method further includes:

setting a flag bit corresponding to the target logical address in the bitmap information as the second identifier;

and storing the target logical address and the target physical address corresponding to the target logical address into a second preset storage area.

In a second aspect, an embodiment of the present application provides a data block processing apparatus, including:

the determining module is used for determining a target storage unit corresponding to a target logical address according to the target logical address of a target data block to be processed, wherein the target storage unit comprises bitmap information, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block; determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information, wherein the flag bit is used for indicating whether the target data block is a repeated data block;

and the processing module is used for processing the target data block according to the target physical address.

In one embodiment, the determining module is specifically configured to:

In one embodiment, the determining module is specifically configured to: and under the condition that the target storage unit is in the memory, determining a target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

In one embodiment, the determining module is further configured to: determining whether the number of the existing storage units in the memory is greater than or equal to a preset threshold value or not under the condition that the target storage unit is not in the memory;

the apparatus further includes: a deleting module, configured to delete, from the memory, the storage units that have not been accessed within a preset time, under the condition that the number of existing storage units in the memory is greater than or equal to the preset threshold value.

In one embodiment, the apparatus further comprises: the device comprises an acquisition module and a read-in module;

the acquisition module is used for acquiring the target storage unit from a first preset storage area under the condition that the number of the existing storage units in the memory is smaller than a preset threshold value;

the read-in module is used for reading the target storage unit into the memory.

In one embodiment, the determining module is further configured to: determining whether the target logical address is a first logical address in a plurality of logical addresses corresponding to the target storage unit;

the device also includes: the updating module is used for updating the used physical address under the condition that the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit to obtain the updated used physical address;

the processing module is further configured to: the used physical address is taken as the starting position of the first physical address included in the target storage unit.

In one embodiment, the determining module is specifically configured to:

the device also includes: and the query module is used for querying the target physical address corresponding to the target logical address from a second preset storage area under the condition that the flag bit corresponding to the target logical address in the bitmap information is the second identifier.

In one embodiment, the target storage unit further comprises: a first physical address starting position and a physical address number;

the determination module is further configured to: under the condition that the flag bit corresponding to the target logical address in the bitmap information is not the second identifier, determining a plurality of triples according to the bitmap information, the starting position of the first logical address corresponding to the target storage unit, the starting position of the first physical address and the number of the physical addresses;

In one embodiment, the apparatus further comprises: the device comprises a setting module and a storage module;

the setting module is used for setting a flag bit corresponding to the target logical address in the bitmap information as the second identifier;

the storage module is used for storing the target logical address and the target physical address corresponding to the target logical address into a second preset storage area.

In a third aspect, an embodiment of the present application provides a data block processing device, where the device includes: a memory and a processor. Wherein the memory and the processor are in communication with each other via an internal connection path, the memory is configured to store instructions, the processor is configured to execute the instructions stored by the memory, and the processor is configured to perform the method of any of the above aspects when the processor executes the instructions stored by the memory.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.

The advantages or beneficial effects in the above technical solution at least include: determining a target storage unit corresponding to a target logical address through the target logical address of a target data block to be processed, wherein the target storage unit comprises bitmap information, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block. Further, according to a flag bit corresponding to the target logical address in the bitmap information, a target physical address corresponding to the target logical address is determined, where the flag bit is used to indicate whether the target data block is a duplicate data block, and the target data block is processed according to the target physical address. Because the storage space occupied by the bitmap information is small, the target physical address of the target data block can be quickly determined according to the bitmap information. Therefore, under the condition of storing a large amount of data, the index efficiency of the target data block can be effectively improved, and the application of the deduplication technology in a large-capacity storage system is promoted.

The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.

Drawings

In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.

FIG. 1 is a schematic structural diagram of a data block retrieving apparatus according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a logical space according to an embodiment of the present application;

FIG. 3 is a flowchart of a data block processing method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an application scenario according to another embodiment of the present application;

FIG. 5 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 6 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 7 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 8 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 9 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 10 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 11 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 12 is a flow chart of a data block processing method according to another embodiment of the present application;

FIG. 13 is a diagram illustrating a data block processing apparatus according to an embodiment of the present application;

FIG. 14 is a schematic diagram of a data block processing device according to an embodiment of the present application.

Detailed Description

In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

Write Once, Read Many (WORM) technology involves the concept of "regulatory compliance". In some countries, new government regulations, coupled with ever-increasing data storage requirements, have made businesses more eager to ensure that their business information and records are archived accurately and efficiently. Such regulations also affect and constrain company behavior, and electronic documents have become a focus of regulation, which gave rise to the notion of regulatory compliance. The most prominent feature of WORM technology is that it addresses regulatory compliance for data storage.

Since data stored on a medium by WORM technology cannot be lost or modified as a result of accidents, the enterprise requirement for long-term storage of important business data is met. In addition, because data stored with WORM technology is mostly used for archiving, WORM storage systems are often used together with deduplication technology.

Data deduplication (de-duplication), or dedupe for short, is a mainstream and very popular storage technology that can effectively optimize storage capacity. Specifically, redundant data is eliminated by deleting duplicate data in a data set and retaining only one copy. Deduplication can greatly reduce the need for physical storage space, thereby meeting ever-growing data storage needs. Deduplication thus brings many practical benefits, including, for example, the following:

1) Return On Investment (ROI) or Total Cost of Ownership (TCO) requirements are met.

2) The sharp increase of data can be effectively controlled.

3) The effective storage space is increased, and the storage efficiency is improved.

4) The total storage cost and the management cost are reduced.

5) Network bandwidth for data transmission is saved.

6) The operation and maintenance costs such as space, power supply, cooling and the like are saved.

Deduplication (dedupe) technology is currently widely used in data backup and archiving systems. Because a large amount of duplicate data exists after data has been backed up multiple times, such data is well suited to deduplication. In fact, deduplication can be used in many settings, for example, in storage systems for online, near-line, and offline data. It may also be implemented in a file system, a volume manager, Network Attached Storage (NAS), or a Storage Area Network (SAN). Deduplication can further be used for data disaster recovery, data transmission and synchronization, and, as a data compression technique, for data packaging. It can help many applications reduce stored data, save network bandwidth, improve storage efficiency, shorten backup windows, and save cost.

The measures of a deduplication technique include the deduplication ratio and performance. Performance depends on the specific implementation, while the deduplication ratio is determined by the characteristics and usage pattern of the data itself; the influencing factors are shown in Table 1 below. Currently, the deduplication ratios published by various storage vendors vary from 20:1 to 500:1.

TABLE 1

High data deduplication rate	Low data deduplication rate
Data created by users	Data acquired from the natural world
Low data change rate	High data change rate
Reference data, inactive data	Active data
Applications with a low data change rate	Applications with a high data change rate
Full data backup	Incremental data backup
Long-term data preservation	Short-term data preservation
Wide-range data applications	Small-scale data applications
Persistent data traffic processing	Generic data traffic processing
Small data chunks	Large data chunks
Variable-length data chunking	Fixed-length data chunking
Data content aware	Data content agnostic
Temporal data deduplication	Spatial data deduplication

Specifically, the implementation points of the deduplication technology may include the following aspects:

1) Which data to deduplicate.

For example, temporal or spatial data may be deduplicated, and global or local data may be deduplicated. The choice of which data to deduplicate directly determines the implementation technique and the data deduplication rate. Data that changes over time, such as periodic backup and archive data, has a higher repetition rate than spatial data, which is why deduplication is widely applied in the backup and archiving field. In addition, global data has a higher repetition rate than local data, so deduplicating global data can achieve a higher deduplication rate.

2) When to perform deduplication.

Deduplication can be performed at one of two times: online or offline. In online deduplication mode, data is deduplicated while it is being written into the storage system. The amount of data actually transmitted or written is therefore smaller, which makes the online mode suitable for storage systems that process data over Local Area Networks (LANs) or Wide Area Networks (WANs), such as network backup and archiving systems and remote disaster recovery systems. However, the online mode needs to perform file segmentation, data fingerprint calculation, and hash lookup in real time, so it consumes a large amount of system resources. In offline deduplication mode, data is first written into the storage system and deduplicated later at an appropriate time. The offline mode is the exact opposite of the online mode: it consumes fewer system resources, but the data written contains duplicates, so more additional storage space is required to hold the data before deduplication. The offline mode is therefore suitable for Direct-Attached Storage (DAS) and Storage Area Network (SAN) storage architectures, in which data transmission does not occupy network bandwidth. In addition, the offline mode needs to ensure that there is a sufficient time window for the deduplication operations. When to perform deduplication is thus determined according to the actual storage application scenario.

3) Where to perform deduplication.

Specifically, data deduplication can be performed at the source end (Source) or the target end (Target). When deduplication is performed at the source end, the data transmitted by the source end has already been deduplicated, so network bandwidth is saved, but a large amount of source-end system resources is occupied. When deduplication is performed at the target end, data is deduplicated only after it has been transmitted to the target end, so source-end system resources are not occupied, but a large amount of network bandwidth is. The advantages of deduplication at the target end are that it is transparent to applications and has good interoperability: no special Application Program Interface (API) is needed, and existing application software can be used directly without any modification.

4) How to perform deduplication.

Deduplication involves many implementation details, including, for example, how files are segmented, how data block fingerprints are computed, how data blocks are retrieved, whether similar-data detection and difference encoding are used, whether the data content is perceived, and whether the content needs to be parsed. These details depend on the specific deduplication implementation. The present application is mainly directed at identical-data detection and performs deduplication on binary files, which gives it wider applicability.

In the deduplication process of a storage system, a data file or physical file is generally divided into a plurality of data blocks, fingerprint information is calculated for each data block, and a hash lookup is performed using the fingerprint information of each data block as its key. If matching fingerprint information is found, the data block is a duplicate data block, and only the index number of the data block needs to be stored. If no matching fingerprint information is found, the data block is not a duplicate, i.e., it is a new, unique data block; in this case the data block is stored and meta information (metadata) related to it is created, for example, its fingerprint information and storage location. In this way, a data file or physical file corresponds to a logical representation in the storage system, composed of the metadata of the data blocks in the file. To read the file, the logical representation is read first, the corresponding data blocks are then read from the storage system according to the metadata in the logical representation, and these data blocks are assembled into a copy of the file. As this process shows, the key techniques of deduplication are file data block segmentation, data block fingerprint calculation, and data block retrieval, each of which is described below.
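The general process described above can be sketched as a toy in-memory store. This is a minimal illustration, not the implementation of any particular storage system: the class `DedupStore`, the fixed block size, the choice of SHA-256 as the fingerprint, and the use of Python lists and dicts in place of real storage are all assumptions.

```python
import hashlib


class DedupStore:
    """Toy store: unique blocks are kept once; files are lists of block indexes."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = []   # unique block payloads; list position = storage location
        self.index = {}    # fingerprint -> position in self.blocks
        self.files = {}    # file name -> list of positions (the logical representation)

    def write(self, name, data):
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            fingerprint = hashlib.sha256(block).digest()
            position = self.index.get(fingerprint)
            if position is None:
                # New, unique block: store it and record its metadata.
                position = len(self.blocks)
                self.blocks.append(block)
                self.index[fingerprint] = position
            # Duplicate or not, the file only records the block's index number.
            recipe.append(position)
        self.files[name] = recipe

    def read(self, name):
        # Follow the logical representation to reassemble a copy of the file.
        return b"".join(self.blocks[p] for p in self.files[name])
```

For example, writing `b"AAAABBBBAAAA"` with a block size of 4 stores two unique blocks and records the file as the index list `[0, 1, 0]`.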

1) File data block segmentation

Specifically, deduplication can be divided into file-level deduplication and data block-level deduplication according to its granularity. File-level deduplication is also called Single Instance Store (SIS); data block-level deduplication has a smaller granularity, which can reach 4-24 KB. Obviously, data block-level deduplication can provide a higher data deduplication rate, and it is therefore the currently predominant deduplication technique. The main data block segmentation methods include fixed-size partitioning, content-defined chunking (CDC), and sliding block segmentation. The fixed-length blocking algorithm segments a file using a predefined block size and calculates, for each resulting data block, a weak check value and a strong check value based on Message Digest Algorithm 5 (MD5). The weak check value mainly serves to improve the performance of differential coding: the weak check value is calculated first and a hash lookup is performed, and only if the same weak check value is found is the MD5 strong check value calculated and a further hash lookup performed. Because the weak check value is much cheaper to compute than the MD5 strong check value, coding performance can be effectively improved. The fixed-length blocking algorithm has the advantages of simplicity and high performance. However, it is very sensitive to data insertion and deletion, handles them inefficiently, and cannot adjust or optimize according to content changes.
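The two-level check used by the fixed-length blocking algorithm can be sketched as follows. The text does not name a specific weak check function; Adler-32 from Python's `zlib` module is used here purely as an assumed stand-in, and the function names are hypothetical. The index of a reference data set is built with both check values, and incoming blocks compute MD5 only when the weak check value already matches.

```python
import hashlib
import zlib


def build_index(reference, block_size):
    """Index reference data: weak check value -> set of MD5 strong check values."""
    index = {}
    for off in range(0, len(reference), block_size):
        block = reference[off:off + block_size]
        index.setdefault(zlib.adler32(block), set()).add(hashlib.md5(block).digest())
    return index


def find_duplicates(data, block_size, index):
    """Return one flag per fixed-size block of data: True if it is a duplicate."""
    flags = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        candidates = index.get(zlib.adler32(block))
        # The strong check value is computed only after a weak match.
        is_duplicate = (candidates is not None
                        and hashlib.md5(block).digest() in candidates)
        flags.append(is_duplicate)
    return flags
```

With reference data `b"aaaabbbbcccc"` and a block size of 4, the input `b"aaaaXbbbbccc"` matches only on its first block: the single inserted byte shifts every later block boundary, which illustrates the sensitivity to insertion noted above.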

The CDC algorithm is a variable-length chunking algorithm that uses data fingerprints, such as Rabin fingerprints, to segment a file into chunks of unequal length. Unlike the fixed-length blocking algorithm, the CDC algorithm segments data blocks based on file content, so the data block size varies. In operation, the CDC algorithm slides a fixed-size window, e.g., 48 bytes, over the file data and computes a data fingerprint for the window. If the fingerprint satisfies a certain condition, for example, if the fingerprint modulo a specific integer equals a preset number, the window position is taken as a block boundary. However, the CDC algorithm can encounter pathological cases: if the fingerprint condition is never satisfied, no block boundary can be determined and the data block becomes too large. In practice, this is solved by limiting the data block size, for example, by setting upper and lower limits. The CDC algorithm is insensitive to changes in file content: inserting or deleting data affects only a few data blocks and leaves the rest unaffected. The CDC algorithm also has a drawback: the data block size is difficult to choose, because too fine a granularity causes excessive overhead while too coarse a granularity gives a poor deduplication effect. How to balance the two is a difficulty.
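The boundary rule described above can be sketched in a few lines. Note the substitution: instead of a true Rabin fingerprint, this sketch uses a simple polynomial rolling hash over the 48-byte window, and the divisor, target value, and size limits are illustrative values chosen for the example rather than values from the text.

```python
WINDOW = 48            # sliding window size in bytes
BASE = 257             # assumed polynomial base for the rolling hash
MOD = (1 << 61) - 1    # assumed modulus for the rolling hash


def cdc_chunks(data, divisor=64, target=0, min_size=16, max_size=1024):
    """Split data where the window hash modulo divisor equals target."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the byte that slides out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        window_full = i + 1 >= WINDOW
        at_boundary = window_full and h % divisor == target
        # Lower limit suppresses tiny chunks; upper limit forces a cut.
        if (size >= min_size and at_boundary) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the hash depends only on the bytes inside the window, and not on where the current chunk started, boundaries are tied to content rather than to offsets, which is what makes the approach tolerant of insertions and deletions.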

The sliding block algorithm combines the advantages of fixed-length segmentation and CDC segmentation and uses a fixed block size. The algorithm first computes a weak check value for a fixed-length data block and, if it matches, then computes the MD5 strong check value; if both match, a data block boundary is declared. The data fragment preceding that data block is also treated as a data block, and it is of variable length. If the sliding window has moved a distance of one block size and still no match has been found, a data block boundary is also declared. The sliding block algorithm handles insertions and deletions efficiently and can detect more redundant data than CDC; its disadvantage is that it is prone to producing data fragments.

2) Data block fingerprint computation

A data chunk fingerprint is an essential feature of a data chunk. Ideally, each data chunk has a unique data chunk fingerprint, i.e., different data chunks have different data chunk fingerprints. Data chunks themselves tend to be large, and therefore, the goal of a data chunk fingerprint is to distinguish between different data chunks with a small representation of the data, e.g., 16, 32, 64, or 128 bytes. The data block fingerprint is usually obtained by performing a relevant mathematical operation on the content of the data block, for example, by calculating it with a hash function such as MD5, Secure Hash Algorithm (SHA) 1, SHA-256, SHA-512, a one-way hash function, or a Rabin hash. In addition, there are many string hash functions that can be used to compute a block fingerprint. However, hash functions may suffer from collisions, i.e., different data chunks may produce the same data chunk fingerprint. Relatively speaking, hash functions such as MD5 and the SHA series have a lower collision probability and can therefore generally be used for fingerprint calculation. Of these, MD5 produces a 128-bit digest and SHA-1 produces a 160-bit digest; SHA-X (X represents the number of bits) has a lower collision probability, but the amount of calculation increases greatly. In practical applications, a trade-off between performance and data security is required. In addition, multiple hash algorithms may be used simultaneously to compute the data chunk fingerprints.
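As a brief illustration, the following sketch computes fingerprints with Python's standard `hashlib` module and prints the digest length of several of the hash functions named above. The helper name `fingerprint` and the default choice of SHA-256 are assumptions made for this example only.

```python
import hashlib

def fingerprint(block: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal fingerprint of a data block."""
    return hashlib.new(algorithm, block).hexdigest()

# Digest length in bits of several common hash functions.
for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, hashlib.new(name).digest_size * 8, "bits")
# md5 128 bits
# sha1 160 bits
# sha256 256 bits
# sha512 512 bits
```

A longer digest lowers the collision probability at the cost of more computation and a larger fingerprint library.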

3) Data block retrieval

For a deduplication system with large storage capacity, the number of data blocks is very large, especially when the granularity of the data blocks is fine. Thus, retrieval in such a large fingerprint library can become a performance bottleneck. Various search structures may be used, for example, a dynamic array, a database, a red-black (RB) tree, a B-tree, a B+ tree, a hash table, etc. Among them, hash lookup or a hash index implemented on a hash table is widely adopted because of its O(1) lookup performance. Therefore, hash lookup or hash indexing may also be employed in the deduplication technique. Because the hash table resides in memory, hash lookup or hash indexing consumes a large amount of memory, and the memory requirement needs to be planned before deduplication is performed. For example, the memory requirement can be estimated from the fingerprint length of the data blocks, the number of data blocks, and the like, where the number of data blocks can be estimated from the storage capacity and the average data block size.
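The estimate described above amounts to simple arithmetic. The following sketch is illustrative only; the function name and the optional `per_entry_overhead_bytes` parameter are assumptions, and real hash tables add overhead that depends on the implementation.

```python
def estimate_index_memory(capacity_bytes: int, avg_block_bytes: int,
                          fingerprint_bytes: int,
                          per_entry_overhead_bytes: int = 0) -> int:
    """Estimate the memory needed to hold one fingerprint per data block."""
    # Number of data blocks = storage capacity / average data block size.
    num_blocks = capacity_bytes // avg_block_bytes
    return num_blocks * (fingerprint_bytes + per_entry_overhead_bytes)

# 1 TiB of storage, 4 KiB average blocks, 16-byte fingerprints.
needed = estimate_index_memory(2**40, 4 * 2**10, 16)
print(needed // 2**30, "GiB")   # 4 GiB
```

In this example there are 2^28 data blocks, so the fingerprints alone occupy 4 GiB, which shows why finer block granularity quickly raises the memory requirement.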

A hash table is a data structure that is accessed directly according to a key value. A mapping function maps the key value to a position in the table to access the record, which speeds up the search. The mapping function is also called a hash function, and the array storing the records is called a hash table. The search process of a hash table is basically the same as the process of constructing it: some keys can be found directly at the addresses obtained from the hash function, while the addresses obtained for other keys may conflict, and those keys need to be searched for according to the conflict handling method.
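A minimal sketch of such a table follows. Separate chaining is used here as the conflict handling method purely as an example; the text above does not prescribe one, and the class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table that resolves conflicts by chaining entries in a bucket."""

    def __init__(self, buckets: int = 8):
        self._buckets = [[] for _ in range(buckets)]

    def _slot(self, key) -> int:
        # The hash function maps the key to a position in the table.
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        bucket = self._buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing record
                return
        bucket.append((key, value))        # a conflicting key joins the chain

    def get(self, key, default=None):
        # Lookup follows the same path as insertion: hash, then scan the chain.
        for existing_key, value in self._buckets[self._slot(key)]:
            if existing_key == key:
                return value
        return default
```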

However, in the case of mass data storage, the indexing efficiency for data blocks is low, which limits the application of deduplication technology in mass storage systems. To solve this problem, an embodiment of the present application provides a data block processing method, which is described in detail below with reference to specific embodiments. The method may be applied to the data block retrieval apparatus shown in fig. 1. As shown in fig. 1, the data block retrieval apparatus includes: a non-repeated index module, a repeated attribute management module, a repeated index module, a memory exchange unit reader-writer, and a logical space module. The logical space module is used for isolating the data block retrieval apparatus from the physical storage medium, and can also provide a continuous readable and writable logical space for the data block retrieval apparatus. The logical space may be a contiguous segment of storage space used to store a hash table or hash index. The storage space may be a storage space in a physical storage medium. The physical storage medium includes, but is not limited to, a Solid State Disk (SSD), a Hard Disk Drive (HDD), a magnetic disk medium, an external storage, and the like.

The schematic diagram of the logical space is shown in fig. 2, and specifically, the logical space includes a super block, one or more logical mapping areas, and one or more repeated block index areas. The logical space shown in fig. 2 is schematically illustrated by taking a super block, a logical mapping area, and a repeated block index area as an example. Each noun in fig. 2 is explained below.

The memory exchange unit may be the smallest unit of a single interaction between the memory and the physical storage medium; it may be 1 sector or larger. For example, the logical space described above resides in a physical storage medium, from which a memory exchange unit can be read into the memory.

The super block is a memory exchange unit for describing the composition of the whole logical space. The super block includes four important components: the physical space size, the used physical space size, the logical mapping area array, and the repeated block index area array. The physical space size describes the number of physical addresses that can be used by the data block retrieval apparatus shown in fig. 1. The used physical space size describes the number of physical addresses that the data block retrieval apparatus has already allocated, and it is a variable. The logical mapping area array describes the starting positions and lengths of the logical mapping areas contained in the logical space. The repeated block index area array describes the starting positions and lengths of the repeated block index areas contained in the logical space.

The logical mapping area is used for recording the index relation between the logical addresses and the physical addresses of all the non-duplicated data blocks, and for recording whether the data block corresponding to each logical address is a duplicated data block. The logical mapping area is composed of one or more non-repeated index blocks, each of which covers a fixed number of logical addresses. Therefore, the non-repeated index block corresponding to a logical address can be calculated directly from the logical address.

The logical address of the data block may specifically be the serial number of the data block that a user of the data block retrieval apparatus uses when sending the data block to the data block retrieval apparatus. The logical address of the data block is the primary key used by the data block retrieval apparatus when providing the index function.

The physical address of the data block may specifically be the sequence number of the data block that the data block retrieval apparatus uses when transmitting the data block to the storage system in which the data block is stored. The physical address of the data block is the content looked up by the data block retrieval apparatus when providing the index function. Additionally, the physical address of the data block may be a storage address of the data block in the storage system.

The repeated block index area is used for storing the index relation between the logical addresses and the physical addresses of all data blocks with repeated contents, and the organization structure it uses is a B+ Tree. The constituent unit of the repeated block index area may also be a memory exchange unit.

The non-repeated index block represents a section of logical addresses. It describes whether the data block corresponding to each logical address is a repeated data block, and it can describe the correspondence between the logical addresses and the physical addresses of the non-repeated data blocks. The non-repeated index block consists of three parts: the physical address starting location, the number of physical addresses, and the repeated block bitmap.

The repeated block bitmap is used for describing whether the data blocks corresponding to all the logical addresses in one non-repeated index block are repeated data blocks. The duplicate block bitmap includes a plurality of bits (bits). One bit corresponds to one logical address and one logical address corresponds to one data block. When a bit is 0, it indicates that the data block corresponding to the bit is a duplicate data block. When the bit is 1, the data block corresponding to the bit is a non-duplicated data block.

The starting position of the physical address represents the starting position of the physical address corresponding to all the non-repeated data blocks in one non-repeated index block.

The number of physical addresses represents the total number of physical addresses corresponding to all non-repeated data blocks of a non-repeated index block. The number of physical addresses and the physical address starting position together describe a continuous physical space, and this physical space corresponds to the logical addresses whose bits in the repeated block bitmap of the non-repeated index block are 1.
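The structures described above can be summarized in a short sketch. The field names, the use of Python data classes, and the bitmap length of 10 bits are assumptions made for illustration; the on-disk layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class NonRepeatedIndexBlock:
    """One memory exchange unit of the logical mapping area."""
    physical_start: int = 0    # physical address starting position
    physical_count: int = 0    # number of physical addresses
    # Repeated block bitmap: 1 = non-repeated data block, 0 = repeated data block.
    bitmap: List[int] = field(default_factory=lambda: [0] * 10)

@dataclass
class SuperBlock:
    """Describes the composition of the whole logical space."""
    physical_space_size: int                 # usable physical addresses
    used_physical_space: int = 0             # allocated physical addresses
    # Each entry is a (starting position, length) pair.
    logical_mapping_areas: List[Tuple[int, int]] = field(default_factory=list)
    repeated_block_index_areas: List[Tuple[int, int]] = field(default_factory=list)
```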

Several of the modules described in fig. 1 are described below in conjunction with fig. 2.

The non-duplicate index module is used for processing the non-duplicate index blocks in fig. 2, and the main function is to modify the duplicate block bitmap.

The repeated attribute management module is used for determining whether the data block corresponding to a logical address is a repeated data block, which it does by directly operating on the repeated block bitmap.

The repeated index module is configured to index the repeated block index area in fig. 2 by using a B+ Tree.

The memory exchange unit reader-writer comprises an addressing method of each region and is responsible for providing the functions of reading and writing the memory exchange unit for the upper three modules.

The logical space module is used for isolating the data block retrieval device and the physical storage medium, and can also provide a continuously readable and writable logical space for the data block retrieval device.

Fig. 3 shows a flow chart of a data block processing method according to an embodiment of the present application. As shown in fig. 3, the method may include:

S301, according to a target logical address of a target data block to be processed, determining a target storage unit corresponding to the target logical address, where the target storage unit includes bitmap information, and the bitmap information is used to indicate whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a duplicate data block.

For example, the present embodiment may be applied to the application scenario shown in fig. 4. The application scenario includes a preset device 41, a data block retrieving device 42 and a storage system 43. The preset device 41 may specifically be a user device. The structure of the data block retrieving device 42 is specifically as described above, and is not described herein again. The storage system 43 is used to store data, e.g., blocks of data.

In the case where the preset device 41 processes the target data block, it may send a processing instruction for the target data block, which may be a write instruction or a read instruction, to the data block retrieving device 42. Specifically, the preset device 41 may send the target logical address of the target data block to the data block retrieving device 42. The data block retrieving device 42 may determine the memory exchange unit corresponding to the target logical address according to the target logical address of the target data block. The memory exchange unit corresponding to the target logical address may be denoted as a target storage unit. For example, the logical mapping area shown in fig. 2 includes 10 memory exchange units, and each memory exchange unit corresponds to 10 logical addresses. In the case that logical addresses are counted from 1, if the target logical address is 99, it may be determined that the memory exchange unit corresponding to the target logical address is the 10th of the 10 memory exchange units, and that the target logical address corresponds to the 9th bit of the repeated block bitmap in the 10th memory exchange unit. The 10th memory exchange unit is the target storage unit corresponding to the target logical address. The target storage unit includes bitmap information, e.g., the repeated block bitmap in the target storage unit. The repeated block bitmap includes 10 bits, each of which corresponds to one logical address and indicates whether the data block corresponding to that logical address is a duplicate data block.
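The calculation in this example can be written out as follows. The function name `locate` is hypothetical; both the unit number and the bit number are counted from 1, as in the example above.

```python
def locate(logical_address: int, addresses_per_unit: int = 10):
    """Return (unit number, bit number) for a logical address, both counted from 1."""
    unit, bit = divmod(logical_address - 1, addresses_per_unit)
    return unit + 1, bit + 1

print(locate(99))   # (10, 9): 10th memory exchange unit, 9th bit
```

Because every memory exchange unit covers a fixed number of logical addresses, the target storage unit is found by arithmetic alone, without any search.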

S302, according to the flag bit corresponding to the target logical address in the bitmap information, determining the target physical address corresponding to the target logical address, where the flag bit is used to indicate whether the target data block is a duplicate data block.

For example, the target logical address corresponds to the 9 th bit of the repeated block bitmap in the 10 th memory switch unit, and the 9 th bit is a flag bit corresponding to the target logical address in the bitmap information. Further, the data block retrieving device 42 determines the target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

S303, processing the target data block according to the target physical address.

For example, when the data block retrieving device 42 has determined the target physical address corresponding to the target logical address, the target data block may be processed according to the target physical address. For example, the target physical address may be a storage address of the target data block in the storage system. The data block retrieving device 42 may read the target data block from the target physical address of the storage system. Alternatively, the data block retrieving device 42 may write the target data block to the target physical address of the storage system. Alternatively, the data block retrieving device 42 may also send the target physical address to the preset device 41, and the preset device 41 reads the target data block from the target physical address of the storage system, or writes the target data block to the target physical address of the storage system.

According to the embodiment of the application, a target storage unit corresponding to a target logical address is determined through the target logical address of a target data block to be processed, the target storage unit comprises bitmap information, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block. Further, according to a flag bit corresponding to the target logical address in the bitmap information, a target physical address corresponding to the target logical address is determined, where the flag bit is used to indicate whether the target data block is a duplicate data block, and the target data block is processed according to the target physical address. Because the storage space occupied by the bitmap information is small, the target physical address of the target data block can be quickly determined according to the bitmap information. Therefore, under the condition of storing a large amount of data, the index efficiency of the target data block can be effectively improved, and the application of the deduplication technology in a large-capacity storage system is promoted.

On the basis of the foregoing embodiment, determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information includes the following steps as shown in fig. 5:

S501, under the condition that the target data block is a non-duplicated data block, setting a flag bit corresponding to the target logical address in the bitmap information as a first identifier corresponding to the non-duplicated data block.

For example, the target data block corresponds to a 10 th memory switch unit in the logical mapping area, the bitmap information in the 10 th memory switch unit, i.e., the duplicate block bitmap, includes 10 bits, the target logical address of the target data block corresponds to a 9 th bit of the 10 bits, i.e., the 9 th bit of the 10 bits is a flag bit corresponding to the target logical address in the bitmap information. In the case that the target data block is a non-duplicate data block, the data block retrieving device 42 may set the flag bit corresponding to the target logical address in the bitmap information to the first identifier corresponding to the non-duplicate data block, for example, 1.

S502, determining a target physical address corresponding to the target logical address according to the initial position of the physical address corresponding to the non-duplicated data block corresponding to the target storage unit and the number of the first identifiers in the bitmap information.

The target storage unit, i.e., the 10th memory exchange unit in the logical mapping region, includes the physical address starting location, the number of physical addresses, and the repeated block bitmap as shown in fig. 2, and the physical address starting location indicates the starting location of the physical addresses corresponding to the non-duplicate data blocks of the 10th memory exchange unit. In addition, the number of bits, i.e., flag bits, that are 1 in the repeated block bitmap of the 10th memory exchange unit is the number of non-duplicate data blocks corresponding to the 10th memory exchange unit, and the physical addresses of these non-duplicate data blocks are consecutive. Therefore, the target physical address corresponding to the target logical address can be determined according to the physical address starting location in the 10th memory exchange unit and the number of bits that are 1 in the repeated block bitmap of the 10th memory exchange unit.
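One possible reading of S501 and S502 is sketched below. It assumes that the non-duplicate data blocks of a unit are written in ascending logical address order, so that a newly written block takes the next address of the unit's continuous physical space. The function name and this ordering assumption are not stated in the text above.

```python
def assign_physical_address(physical_start: int, bitmap: list, bit_index: int) -> int:
    """Mark a newly written non-duplicate block and derive its physical address.

    bit_index is counted from 0 and is assumed not to be set yet.
    """
    bitmap[bit_index] = 1                # S501: set the first identifier (1)
    ones = sum(bitmap)                   # number of first identifiers in the bitmap
    return physical_start + ones - 1     # S502: next address of the continuous space

bitmap = [0] * 10
print(assign_physical_address(500, bitmap, 0))   # 500
print(assign_physical_address(500, bitmap, 1))   # 501
```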

Optionally, determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information includes: and under the condition that the target storage unit is in the memory, determining a target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

For example, in the case that the target data block corresponds to the 10 th memory exchange unit in the logical mapping region, it may be further determined whether the 10 th memory exchange unit is in the memory, which may be located in the data block retrieval device shown in fig. 1. In the case that the 10 th memory switch unit is in the memory, the data block retrieving device 42 may determine the target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

Optionally, the method further includes: determining whether the number of the existing storage units in the memory is greater than or equal to a preset threshold value or not under the condition that the target storage unit is not in the memory; and deleting the memory units which are not accessed within the preset time in the memory under the condition that the number of the existing memory units in the memory is greater than or equal to a preset threshold value.

For example, in the case that the 10th memory exchange unit is not present in the memory, the data block retrieving device 42 may determine whether the number of existing storage units in the memory is greater than or equal to a preset threshold, i.e., determine whether the number of existing storage units in the memory has reached the maximum limit. In case the maximum limit has been reached, the data block retrieving device 42 may further delete the storage units in the memory that have not been accessed within the preset time, where a storage unit may be a memory exchange unit. For example, the data block retrieving device 42 may eliminate the memory exchange unit that has gone unaccessed for the longest time through a Least Recently Used (LRU) algorithm. Further, the data block retrieving device 42 may obtain the 10th memory exchange unit from the logical mapping area and read it into the memory.

Optionally, the method further includes: under the condition that the number of the existing storage units in the memory is smaller than a preset threshold value, acquiring the target storage unit from a first preset storage area; and reading the target storage unit into the memory.

For example, in a case that the number of the existing storage units in the memory is smaller than the preset threshold, that is, the number of the existing storage units in the memory does not exceed the maximum limit, the data block retrieving device 42 may obtain the 10th memory exchange unit from the logical mapping area and read it into the memory. The logical mapping area shown in fig. 2 may be denoted as a first preset storage area.
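The interaction between the threshold check, the elimination step, and the read from the first preset storage area can be sketched with an ordered dictionary. The class name `UnitCache` and the `load_unit` callback, which stands in for reading a unit from the logical mapping area, are hypothetical; writing a modified unit back before it is eliminated is omitted for brevity.

```python
from collections import OrderedDict

class UnitCache:
    """Keeps at most max_units memory exchange units in memory, LRU order."""

    def __init__(self, max_units: int, load_unit):
        self.max_units = max_units
        self.load_unit = load_unit      # hypothetical reader for the mapping area
        self.units = OrderedDict()      # unit index -> unit, oldest access first

    def get(self, index: int):
        if index in self.units:
            self.units.move_to_end(index)        # mark as most recently used
            return self.units[index]
        if len(self.units) >= self.max_units:    # threshold reached
            self.units.popitem(last=False)       # eliminate least recently used
        unit = self.load_unit(index)             # read the unit into memory
        self.units[index] = unit
        return unit

cache = UnitCache(2, lambda i: {"index": i})
for i in (1, 2, 1, 3):
    cache.get(i)
print(list(cache.units))   # [1, 3]
```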

Optionally, after reading the target storage unit into the memory, the method further includes: determining whether the target logical address is a first logical address in a plurality of logical addresses corresponding to the target storage unit; under the condition that the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit, updating the used physical address to obtain an updated used physical address; and taking the updated used physical address as the starting position of the first physical address included in the target storage unit.

For example, when the data block retrieving device 42 obtains the 10th memory exchange unit from the logical mapping area and reads it into the memory, it may further determine whether the target logical address corresponding to the target data block is the first logical address of the plurality of logical addresses corresponding to the 10th memory exchange unit. If the target logical address corresponding to the target data block is the first logical address of the plurality of logical addresses, the data block retrieving device 42 may update the used physical address, for example, add 1 to the used physical address to obtain the updated used physical address. Further, the data block retrieving device 42 may use the updated used physical address as the starting position of the physical address included in the target storage unit. The starting position of the physical address included in the target storage unit may be referred to herein as the starting position of the first physical address.

In the embodiment of the application, when the target data block is a non-duplicate data block, the flag bit corresponding to the target logical address in the bitmap information is set as the first identifier corresponding to the non-duplicate data block, and the target physical address corresponding to the target logical address is determined according to the starting position of the physical address corresponding to the non-duplicate data block corresponding to the target storage unit and the number of the first identifiers in the bitmap information. Because the storage space occupied by the bitmap information is small, the target physical address of the target data block can be quickly determined according to the bitmap information. Therefore, under the condition of storing a large amount of data, the index efficiency of the target data block can be effectively improved, and the application of the deduplication technology in a large-capacity storage system is promoted. In addition, the non-duplicate data blocks and the duplicate data blocks can be partitioned and indexed through the logic mapping area and the duplicate block index area, and the indexing efficiency of the data blocks is further improved.

The process of determining the physical address of a non-duplicate data block is described below in conjunction with a specific embodiment. As shown in fig. 6, the process includes the following steps:

S601, starting.

S602, judging whether the size of the used physical space is larger than or equal to the physical space size. If so, perform S611, otherwise perform S603.

S603, positioning the target storage unit according to the target logical address.

For example, the target storage unit may be the memory exchange unit corresponding to the target logical address. The number of logical addresses contained in each memory exchange unit is fixed, and the number of bits of the repeated block bitmap in each memory exchange unit is fixed. The memory exchange units are stored in the physical storage medium in the order of the logical addresses they contain. Therefore, which memory exchange unit stores the target logical address can be deduced from the target logical address.

S604, judging whether the target storage unit exists in the memory. If so, go to S605, otherwise go to S606.

S605, setting the bit corresponding to the target logical address in the bitmap information to 1, adding 1 to the size of the used physical space, adding 1 to the number of physical addresses in the target storage unit, storing the target storage unit, and returning the size of the used physical space before the addition (namely, the target physical address).

For example, subtracting the starting position of the logical addresses corresponding to the target storage unit from the target logical address gives the bit corresponding to the target logical address in the bitmap information, i.e., which bit of the bitmap information the target logical address corresponds to.

S606, judging whether the number of the existing memory exchange units in the memory exceeds the maximum limit. If so, perform S607, otherwise perform S608.

S607, eliminating the earliest unaccessed memory exchange unit in the memory according to the LRU algorithm.

The Least Recently Used (LRU) algorithm may order the existing memory exchange units in the memory by access time, maintain a certain queue depth, and each time eliminate the memory exchange unit that has gone unaccessed for the longest time.

S608, reading the target storage unit into the memory.

S609, judging whether the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit. If so, perform S610, otherwise perform S605.

S610, adding 1 to the used physical address, and using the incremented used physical address as the starting position of the physical address included in the target storage unit.

S611, error reporting.

S612, ending.

Specifically, the implementation process and the specific principle of S601-S612 may refer to the above embodiments, and are not described herein again.

In addition, the execution process of S601-S612 may be referred to as the new data block allocation flow. The new data block allocation flow is invoked when the content of a data block is determined to be written for the first time, i.e., S601-S612 are invoked when the data block is a non-duplicate data block. The input of the new data block allocation flow is the logical address of the data block to be written to the storage system. In the case where the physical space is sufficient, the output of the flow is the physical address of the data block. In the case where the physical space is insufficient, the flow reports an error.
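The flow can be walked through with the following sketch, which follows the step order S602-S611 as written. Several points are assumptions: logical addresses are counted from 1, S610 is taken to continue to S605 (fig. 6 is not reproduced here), a dictionary named `disk` stands in for the logical mapping area, and the class and field names are hypothetical.

```python
from collections import OrderedDict

UNIT_ADDRESSES = 10        # logical addresses per memory exchange unit (assumed)
MAX_UNITS_IN_MEMORY = 2    # maximum limit of units kept in memory (assumed)

class Allocator:
    def __init__(self, physical_space_size: int):
        self.physical_space_size = physical_space_size
        self.used = 0                    # size of the used physical space
        self.memory = OrderedDict()      # unit index -> unit, oldest access first
        self.disk = {}                   # stand-in for the logical mapping area

    def allocate(self, logical_address: int) -> int:
        if self.used >= self.physical_space_size:               # S602
            raise RuntimeError("physical space exhausted")       # S611
        unit_index, bit = divmod(logical_address - 1, UNIT_ADDRESSES)  # S603
        if unit_index in self.memory:                            # S604
            self.memory.move_to_end(unit_index)
        else:
            if len(self.memory) >= MAX_UNITS_IN_MEMORY:          # S606
                old_index, old_unit = self.memory.popitem(last=False)  # S607
                self.disk[old_index] = old_unit
            unit = self.disk.get(unit_index) or {
                "start": 0, "count": 0, "bitmap": [0] * UNIT_ADDRESSES}
            self.memory[unit_index] = unit                       # S608
            if bit == 0:                                         # S609
                self.used += 1                                   # S610
                unit["start"] = self.used
        unit = self.memory[unit_index]
        unit["bitmap"][bit] = 1                                  # S605
        physical_address = self.used     # value before the addition is returned
        self.used += 1
        unit["count"] += 1
        self.disk[unit_index] = unit     # store the target storage unit
        return physical_address

alloc = Allocator(physical_space_size=100)
print(alloc.allocate(1), alloc.allocate(2))   # 1 2
```

Under this literal reading, S610 advances the used physical address before S605 returns it, so the first block of a unit receives the unit's starting position as its physical address.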

On the basis of the foregoing embodiment, determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information includes the following steps as shown in fig. 7:

S701, determining whether the flag bit corresponding to the target logical address in the bitmap information is a second identifier corresponding to the duplicate data block when the target data block is the duplicate data block.

For example, the target storage unit corresponding to the target logical address of the target data block is the 10 th memory exchange unit in the logical mapping area. The duplicate block bitmap, i.e., bitmap information, in the 10 th memory switch unit includes 10 bits. The target logical address corresponds to the 9 th bit of the 10 bits, i.e. the 9 th bit is the flag bit corresponding to the target logical address in the bitmap information. In the case that the target data block is a duplicate data block, the data block retrieving device 42 may determine whether the corresponding flag bit of the target logical address in the bitmap information is a second identifier, such as 0, corresponding to the duplicate data block.

S702, under the condition that the flag bit corresponding to the target logical address in the bitmap information is the second identifier, querying the target physical address corresponding to the target logical address from a second preset storage area.

For example, in the case that the flag bit corresponding to the target logical address in the bitmap information is 0, the data block retrieving device 42 may query the target physical address corresponding to the target logical address from the repeated block index area as shown in fig. 2. The repeated block index area shown in fig. 2 may be recorded as a second preset storage area.

S703, determining a plurality of triples according to the bitmap information, the starting position of the first logical address corresponding to the target storage unit, the starting position of the first physical address, and the number of physical addresses, if the flag bit corresponding to the target logical address in the bitmap information is not the second identifier.

For example, the 10th memory exchange unit in the logical mapping area is the target storage unit corresponding to the target data block. The target storage unit includes, in addition to the repeated block bitmap: a first physical address starting location and a number of physical addresses. The first physical address starting location may be the starting location of the physical addresses corresponding to the non-duplicated data blocks of the target storage unit, for example, of the 10th memory exchange unit in the logical mapping area. The number of physical addresses is the total number of physical addresses corresponding to all non-duplicated data blocks corresponding to the 10th memory exchange unit.

In a case that the flag bit corresponding to the target logical address in the bitmap information is not 0, the data block retrieving device 42 may determine a plurality of triples according to the bitmap information in the target storage unit, the start position of the first logical address corresponding to the target storage unit, the start position of the first physical address, and the number of physical addresses. The starting position of the first logical address corresponding to the target storage unit may be the starting positions of a plurality of logical addresses corresponding to the target storage unit. Each triplet includes three variables representing a mapping of a contiguous segment of logical addresses to a contiguous segment of physical addresses. Wherein, the first variable of the three variables is used for representing the starting address of the continuous logical address of the segment, the second variable is used for representing the starting address of the continuous physical address of the segment, and the third variable represents the length of the mapping interval of the segment.
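How the triples might be derived is sketched below under one reading of the paragraph above: every run of consecutive 1 bits in the bitmap becomes one triple, and the runs occupy the unit's continuous physical space in ascending logical address order. The function name and this reading are assumptions; the number of physical addresses is used only as a consistency check.

```python
def build_triples(bitmap, logical_start: int, physical_start: int, physical_count: int):
    """Return (logical start, physical start, length) for each run of 1 bits."""
    triples = []
    next_physical = physical_start
    run_start = None
    # A trailing 0 acts as a sentinel that closes a run ending at the last bit.
    for offset, bit in enumerate(list(bitmap) + [0]):
        if bit == 1 and run_start is None:
            run_start = offset                       # a run of 1 bits begins
        elif bit != 1 and run_start is not None:
            length = offset - run_start              # the run just ended
            triples.append((logical_start + run_start, next_physical, length))
            next_physical += length
            run_start = None
    if next_physical - physical_start != physical_count:
        raise ValueError("bitmap and number of physical addresses disagree")
    return triples

print(build_triples([1, 1, 0, 1, 1, 1, 0, 0, 1, 0], 91, 500, 6))
# [(91, 500, 2), (94, 502, 3), (99, 505, 1)]
```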

S704, determining a target triple corresponding to the target logical address from the multiple triples, where the target triple includes a second logical address start position and a second physical address start position.

For example, a target triplet corresponding to the target logical address is determined from the plurality of triplets. The target triplet includes three variables, where the first variable of the target triplet may be denoted as the second logical address starting location. The second variable of the target triplet is recorded as the second physical address starting location.

S705, determining the target physical address corresponding to the target logical address according to the target logical address, the starting position of the second logical address, and the starting position of the second physical address.

For example, the target logical address may be denoted as laddress, the second logical address starting position may be denoted as lstart, and the second physical address starting position may be denoted as pstart. The target physical address corresponding to the target logical address is laddress - lstart + pstart.

In the embodiment of the application, when the target data block is a duplicate data block, it is determined whether a flag bit corresponding to the target logical address in the bitmap information is a second identifier corresponding to the duplicate data block. And under the condition that the flag bit corresponding to the target logical address in the bitmap information is the second identifier, querying the target physical address corresponding to the target logical address from a second preset storage area. And under the condition that the flag bit corresponding to the target logical address in the bitmap information is not the second identifier, determining a target triple corresponding to the target logical address, and determining the target physical address corresponding to the target logical address according to the target logical address, the second logical address starting position and the second physical address starting position in the target triple. Therefore, under the condition of storing a large amount of data, the index efficiency of the target data block can be effectively improved, and the application of the deduplication technology in a large-capacity storage system is promoted. In addition, the property that the mapping of logical and physical addresses of data blocks in WORM-type storage, once determined, requires no or little modification to the mapping is exploited. In addition, the index of duplicate data chunks and the index of non-duplicate data chunks in the mass storage may be stored in segments.

The following describes a process of searching a data block of a logical address in conjunction with a specific embodiment. The process includes the following steps as shown in fig. 8:

S801, start.

S802, positioning the target storage unit according to the target logical address.

S803, judging whether the target storage unit exists in the memory. If so, perform S807, otherwise perform S804.

S804, whether the number of the existing memory exchange units in the memory exceeds the maximum limit is judged. If so, perform S805, otherwise perform S806.

S805, eliminating the earliest unaccessed memory exchange unit in the memory according to the LRU algorithm.

S806, reading the target storage unit into the memory.

S807, judging whether the corresponding flag bit of the target logical address in the bitmap information is 0. If so, perform S809, otherwise perform S808.

S808, determining a plurality of triples, determining a target triplet corresponding to the target logical address from the triples, and determining a target physical address corresponding to the target logical address according to the target triplet.

S809, querying the target physical address corresponding to the target logical address from the repeated block index area.

S810, end.

The implementation process and specific principles of S801 to S810 may refer to the above embodiments and are not described herein again. The input of the process of searching the data block of the logical address is the target logical address of the target data block. The output of the process is the target physical address of the target data block.
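The cache-and-dispatch portion of the flow in fig. 8 can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the names `UnitCache`, `lookup`, `UNIT_SIZE`, the dictionary layout of a storage unit, and the use of a plain dictionary for the repeated block index area are all assumptions. For brevity, S808 is replaced here by counting the 1 bits that precede the target bit, which should give the same result as the triple computation under the assumption that the non-duplicate blocks of a unit occupy consecutive physical addresses in logical order.

```python
from collections import OrderedDict

UNIT_SIZE = 10  # hypothetical number of logical addresses per storage unit


class UnitCache:
    """In-memory LRU cache of storage units (S803 to S806)."""

    def __init__(self, backing, max_units):
        self.backing = backing      # stand-in for the logical mapping area
        self.max_units = max_units  # maximum number of units kept in memory
        self.units = OrderedDict()

    def get(self, unit_id):
        if unit_id in self.units:                # S803: unit already in memory
            self.units.move_to_end(unit_id)      # mark as most recently used
            return self.units[unit_id]
        if len(self.units) >= self.max_units:    # S804: limit reached
            self.units.popitem(last=False)       # S805: evict least recently used
        self.units[unit_id] = self.backing[unit_id]  # S806: read into memory
        return self.units[unit_id]


def lookup(laddress, cache, dup_index):
    """Return the physical address for laddress (S802, S807 to S809)."""
    unit = cache.get(laddress // UNIT_SIZE)      # S802: locate the target unit
    bit = laddress % UNIT_SIZE
    if unit["bitmap"][bit] == 0:                 # S807: flag 0 means duplicate
        return dup_index[laddress]               # S809: repeated block index area
    # S808 (assumed shortcut): physical offset = number of 1 bits before this bit
    rank = sum(unit["bitmap"][:bit])
    return unit["pstart1"] + rank
```

With a unit whose bitmap is 0101100000 and whose first physical address starting position is 100, logical addresses 1, 3, and 4 resolve to 100, 101, and 102, while logical address 0 is looked up in the repeated block index.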

The following describes, with reference to a specific embodiment, the two parts of S808. The process of determining the plurality of triples is denoted as internal flow 1. The process of determining the target triple corresponding to the target logical address from the plurality of triples, and determining the target physical address corresponding to the target logical address according to the target triple, is denoted as internal flow 2. Internal flow 1 and internal flow 2 are described in turn. Specifically, internal flow 1 includes the following steps, as shown in fig. 9:

S901, start.

S902, setting the traversal counter i to 0 and the physical address offset off to 0, denoting the first logical address starting position as lstart1, the first physical address starting position as pstart1, and the number of physical addresses as len, and denoting the ternary array of the interval mapping relationship as res[][3].

The first logical address starting position lstart1 may be the starting position of the plurality of logical addresses corresponding to the target storage unit as described above. The first physical address starting position pstart1 may be the physical address starting position included in the target storage unit as described above. The number of physical addresses len may be the number of physical addresses included in the target storage unit as described above.

The ternary array of interval mappings may comprise a plurality of triples. Each triplet may represent a mapping relationship between a continuous segment of logical addresses and a continuous segment of physical addresses by three variables, where a first variable represents a start address of the continuous segment of logical addresses, a second variable represents a start address of the continuous segment of physical addresses, and a third variable represents a length of the continuous segment of mapping intervals. For example, a triplet [0,10,10] indicates that an interval of logical addresses 0 to 9 corresponds to an interval of physical addresses 10 to 19. The triplet [10,0,10] indicates that the interval of logical addresses 10 to 19 corresponds to the interval of physical addresses 0 to 9. Multiple triples may constitute a ternary array, for example, the triplet [0,10,10] and the triplet [10,0,10] may constitute a ternary array [ [0,10,10], [10,0,10] ].

S903, under the condition that the number of consecutive 1s in the bitmap information does not exceed len, converting the bitmap information into a line segment array, recorded as a two-dimensional array a[][2].

For example, the bitmap information may be the repeated block bitmap in the target storage unit as described above. Suppose the repeated block bitmap includes 10 bits, 0101100000. Runs of consecutive 1s in the repeated block bitmap may be represented by a plurality of ordered line segments, each line segment including a starting position and a length. In 0101100000, bit 0 is 0, bit 1 is a single 1, and two consecutive 1s start at bit 3. Accordingly, 0101100000 can be expressed as [[1,1], [3,2]], which can be written as the two-dimensional array a[][2].
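The conversion in S903 can be sketched as a single pass over the bitmap. The function name `bitmap_to_segments` and the choice to pass the bitmap as a string of "0" and "1" characters are assumptions made for illustration only.

```python
def bitmap_to_segments(bits):
    """Return [[start, length], ...], one entry per run of consecutive 1s."""
    segments = []
    run_start = None  # position where the current run of 1s began, if any
    for pos, bit in enumerate(bits):
        if bit == "1" and run_start is None:
            run_start = pos                                 # a new run begins
        elif bit != "1" and run_start is not None:
            segments.append([run_start, pos - run_start])   # the run ends
            run_start = None
    if run_start is not None:                               # run reaches the last bit
        segments.append([run_start, len(bits) - run_start])
    return segments
```

Called on the bitmap 0101100000 from the example above, the function returns the line segment array [[1, 1], [3, 2]].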

S904, judging whether off is smaller than len. If so, perform S905, otherwise perform S906.

S905, setting res[i][0] = a[i][0] + lstart1, res[i][1] = pstart1 + off, res[i][2] = a[i][1], off = off + a[i][1], and i = i + 1.

S906, returning res[][3].

S907, end.

Specifically, fig. 9 describes a process of obtaining a ternary array of interval mapping relationships according to bitmap information in a target storage unit, a first logical address starting position corresponding to the target storage unit, a first physical address starting position in the target storage unit, and a number of physical addresses in the target storage unit.
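Internal flow 1 can be sketched as the loop below. It is an illustration under stated assumptions, not the definitive procedure: the function name `build_triples` and the parameter name `count` (standing for len) are hypothetical, the loop is assumed to continue while off is smaller than the number of physical addresses, and an extra bounds check on the line segment array is added so that the sketch cannot index past its end.

```python
def build_triples(segments, lstart1, pstart1, count):
    """Build [[logical_start, physical_start, length], ...] (S902 to S906).

    segments: line segment array a[][2] of [start, length] runs of 1s
    lstart1:  first logical address starting position of the storage unit
    pstart1:  first physical address starting position of the storage unit
    count:    number of physical addresses of the storage unit (len)
    """
    res = []
    i = 0    # traversal counter
    off = 0  # physical addresses consumed so far
    # Assumed loop condition: off < count, plus a guard on the array bounds.
    while i < len(segments) and off < count:
        start, length = segments[i]
        res.append([start + lstart1, pstart1 + off, length])  # S905
        off += length
        i += 1
    return res                                                 # S906
```

For the line segment array [[1, 1], [3, 2]], a first logical address starting position of 100, a first physical address starting position of 500, and three physical addresses, the sketch yields [[101, 500, 1], [103, 501, 2]]: logical address 101 maps to physical address 500, and logical addresses 103 to 104 map to physical addresses 501 to 502.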

The internal flow 2 will be described with reference to specific examples. Specifically, the internal process 2 includes the following steps as shown in fig. 10:

S1001, start.

S1002, determining a target triple corresponding to the target logical address laddress from the plurality of triples.

The triples in the ternary array of the interval mapping relationship are sorted by the starting address of their logical addresses. Therefore, the triple corresponding to the target logical address laddress can be determined according to laddress, and that triple is recorded as the target triple. Specifically, a binary search may be used to find the target triple in the ternary array of the interval mapping relationship.

S1003, judging whether the target triple exists. If so, perform S1004, otherwise perform S1005.

S1004, recording the second logical address starting position in the target triple as lstart2, and recording the second physical address starting position in the target triple as pstart2.

The second logical address starting position lstart2 may be the first variable in the target triple. The second physical address starting position pstart2 may be the second variable in the target triple.

S1005, reporting an error.

S1006, returning laddress - lstart2 + pstart2.

Specifically, laddress - lstart2 + pstart2 is the target physical address corresponding to the target logical address laddress.

S1007, end.

Specifically, fig. 10 describes a process of obtaining the target physical address corresponding to the target logical address according to the target logical address and the ternary array of the interval mapping relationship.
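Internal flow 2 can be sketched with Python's standard `bisect` module, which performs the binary search over the sorted logical starting addresses. The function name `translate` and the use of `LookupError` for the error branch (S1005) are assumptions made for illustration.

```python
from bisect import bisect_right


def translate(laddress, triples):
    """Map laddress to a physical address using triples sorted by logical start."""
    starts = [t[0] for t in triples]
    # S1002: binary search for the last triple whose logical start <= laddress.
    idx = bisect_right(starts, laddress) - 1
    if idx < 0:
        raise LookupError(f"no triple covers logical address {laddress}")  # S1005
    lstart2, pstart2, length = triples[idx]                # S1004
    if laddress >= lstart2 + length:                       # falls in a gap
        raise LookupError(f"no triple covers logical address {laddress}")  # S1005
    return laddress - lstart2 + pstart2                    # S1006
```

Using the ternary array [[0, 10, 10], [10, 0, 10]] from the earlier example, logical address 3 translates to physical address 13 and logical address 13 translates to physical address 3. Rebuilding the list of starting addresses on every call keeps the sketch short; a real lookup path would more likely keep that list alongside the cached storage unit.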

Optionally, after determining the target physical address corresponding to the target logical address according to the target logical address, the starting position of the second logical address, and the starting position of the second physical address, the method further includes the following steps as shown in fig. 11:

S1101, setting the flag bit corresponding to the target logical address in the bitmap information as the second identifier.

For example, after the target physical address laddress - lstart2 + pstart2 corresponding to the target logical address laddress is determined according to laddress, the second logical address starting position lstart2, and the second physical address starting position pstart2, the flag bit corresponding to laddress in the repeated block bitmap included in the target storage unit may be set to the second identifier, for example, 0, thereby indicating that the target data block corresponding to the target logical address is a duplicate data block.

S1102, storing the target logical address and the target physical address corresponding to the target logical address into a second preset storage area.

Further, the target logical address laddress and its corresponding target physical address laddress - lstart2 + pstart2 may be stored in the second preset storage area, for example, the repeated block index area shown in fig. 2. That is, the correspondence between laddress and laddress - lstart2 + pstart2 is stored in the repeated block index area.

In the embodiment of the application, the flag bit corresponding to the target logical address in the bitmap information is set as the second identifier, and the target logical address and the target physical address corresponding to the target logical address are stored in the second preset storage area, so that the bitmap information directly distinguishes non-duplicate data blocks from duplicate data blocks. In addition, because the target logical address and the target physical address of the duplicate data block are stored in the second preset storage area, the index of duplicate data blocks and the index of non-duplicate data blocks in the large-capacity storage can be stored in separate segments, so that the lookup performance is better in most cases, for example, the time complexity can reach O(1), and is not limited by the overall storage capacity. Distinguishing the index of duplicate data blocks from the index of non-duplicate data blocks exploits the write-once characteristic of WORM: without deduplication, the correspondence between logical addresses and physical addresses is monotonically increasing and one-to-one, so the physical address can be located directly by linear calculation from the logical address. In addition, when duplicate data blocks are far fewer than non-duplicate data blocks, even if the repeated block index area of a large-capacity storage system uses a conventional structure, such as a red-black (RB) tree, a balanced binary tree, a B-tree, or a B+ tree, the overall lookup performance is not affected.

The process of assigning a logical address to a duplicate block is described below with reference to a specific embodiment. This flow may be invoked when the data block to be written is determined to duplicate existing content. The inputs to the flow may be the logical address and the physical address of the data block to be written. For any one logical address, either this flow or the new data block allocation flow described above is selected. As shown in fig. 12, the process of assigning a logical address to a duplicate block includes the following steps:

S1201, start.

S1202, positioning the target storage unit according to the target logical address.

S1203, judging whether the target storage unit exists in the memory. If so, execute S1204, otherwise execute S1205.

S1204, setting the bit corresponding to the target logical address in the bitmap information to 0, storing the target storage unit, and storing the target logical address and the target physical address in the repeated block index area.

Specifically, the target logical address and the target physical address may be stored in the repeated block index area in the form of a B+ tree.

S1205, whether the number of the existing memory exchange units in the memory exceeds the maximum limit or not is judged. If so, go to S1206, otherwise go to S1207.

S1206, eliminating the earliest unaccessed memory exchange unit in the memory according to the LRU algorithm.

S1207, reading the target storage unit into the memory.

S1208, judging whether the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit. If so, perform S1209, otherwise perform S1204.

S1209, add 1 to the used physical address, and use it as the starting position of the physical address included in the target storage unit.

S1210, end.

Specifically, the flow shown in fig. 6 may be referred to as a "new data block allocation flow", the flow shown in fig. 8 may be referred to as a "data block flow for searching a logical address", and the flow shown in fig. 12 may be referred to as a "logical address designating to repeated block flow". Under the condition of writing a data block into a logical address, if the data block is a non-duplicate data block, namely the content of the data block is a new value, calling a new data block allocation flow to acquire a physical address of the data block, and writing the data block into the physical address. If the data block is a repeated data block, calling a data block flow for searching the logical address to acquire the physical address of the data block, and further calling a specified logical address to a repeated block flow to record the logical address and the physical address of the data block in a repeated block index area. In the case of reading a data block from a logical address, a "data block flow for searching logical addresses" may be called to obtain a physical address of the data block, and read the data block from the physical address.

Fig. 13 is a block diagram illustrating a data block processing apparatus according to an embodiment of the present application. As shown in fig. 13, the apparatus 130 may include:

a determining module 131, configured to determine, according to a target logical address of a target data block to be processed, a target storage unit corresponding to the target logical address, where the target storage unit includes bitmap information, and the bitmap information is used to indicate whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a duplicate data block; determining a target physical address corresponding to the target logical address according to a flag bit corresponding to the target logical address in the bitmap information, wherein the flag bit is used for indicating whether the target data block is a repeated data block;

the processing module 132 is configured to process the target data block according to the target physical address.

Optionally, the determining module 131 is specifically configured to: under the condition that the target storage unit is in the memory, determine the target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

Optionally, the determining module 131 is further configured to: determining whether the number of the existing storage units in the memory is greater than or equal to a preset threshold value or not under the condition that the target storage unit is not in the memory; the apparatus 130 further comprises: the deleting module 133 is configured to delete a storage unit in the memory that has not been accessed within a preset time when the number of existing storage units in the memory is greater than or equal to a preset threshold.

Optionally, the apparatus 130 further includes: an obtaining module 134 and a read-in module 135; the obtaining module 134 is configured to obtain the target storage unit from a first preset storage area when the number of existing storage units in the memory is smaller than a preset threshold; the read-in module 135 is configured to read the target storage unit into the memory.

Optionally, the determining module 131 is further configured to: determine whether the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit; the apparatus 130 further comprises: an updating module 136, configured to update the used physical address to obtain an updated used physical address when the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit; the processing module 132 is further configured to: take the updated used physical address as the first physical address starting position included in the target storage unit.

Optionally, the determining module 131 is specifically configured to: determining whether a flag bit corresponding to the target logical address in the bitmap information is a second identifier corresponding to the repeated data block when the target data block is the repeated data block; the apparatus 130 further comprises: the querying module 137 is configured to query the target physical address corresponding to the target logical address from a second preset storage area when the flag bit corresponding to the target logical address in the bitmap information is the second identifier.

Optionally, the target storage unit further includes: a first physical address starting position and a physical address number; the determining module 131 is further configured to: under the condition that the flag bit corresponding to the target logical address in the bitmap information is not the second identifier, determining a plurality of triples according to the bitmap information, the starting position of the first logical address corresponding to the target storage unit, the starting position of the first physical address and the number of the physical addresses; determining a target triple corresponding to the target logical address from the triples, wherein the target triple comprises a second logical address starting position and a second physical address starting position; and determining the target physical address corresponding to the target logical address according to the target logical address, the starting position of the second logical address and the starting position of the second physical address.

Optionally, the apparatus 130 further includes: a setting module 138 and a storage module 139, where the setting module 138 is configured to set a flag bit corresponding to the target logical address in the bitmap information as the second identifier; the storage module 139 is configured to store the target logical address and the target physical address corresponding to the target logical address in a second preset storage area.

The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.

Fig. 14 shows a block diagram of a data block processing device according to an embodiment of the present application. As shown in fig. 14, the apparatus includes: a memory 1410 and a processor 1420, the memory 1410 having stored therein computer programs that are executable on the processor 1420. The processor 1420, when executing the computer program, implements the method of data block processing in the above-described embodiments. The number of the memory 1410 and the processor 1420 may be one or more.

The data block processing device further includes:

a communication interface 1430, configured to communicate with an external device for interactive data transmission.

If the memory 1410, the processor 1420, and the communication interface 1430 are implemented independently, the memory 1410, the processor 1420, and the communication interface 1430 may be connected to each other by a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 14, but this is not intended to represent only one bus or type of bus.

Alternatively, in an implementation, if the memory 1410, the processor 1420 and the communication interface 1430 are integrated into a chip, the memory 1410, the processor 1420 and the communication interface 1430 may communicate with each other through an internal interface.

Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.

The embodiment of the present application further provides a chip, where the chip includes a processor configured to call and execute instructions stored in a memory, so that a communication device in which the chip is installed executes the method provided in the embodiment of the present application.

An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.

It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.

Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may include a random access memory (RAM), which acts as external cache memory. By way of example and not limitation, many forms of RAM are available, for example, static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by hardware that is configured to be instructed to perform the relevant steps by a program, which may be stored in a computer-readable storage medium, and which, when executed, includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for processing a data block, the method comprising:

determining a target storage unit corresponding to a target logical address according to the target logical address of a target data block to be processed, wherein the target storage unit comprises bitmap information, a first physical address starting position and a physical address number, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block;

when the target data block is a non-duplicate data block, setting a flag bit corresponding to the target logical address in the bitmap information as a first identifier corresponding to the non-duplicate data block, and determining a target physical address corresponding to the target logical address according to an initial position of a physical address corresponding to the non-duplicate data block corresponding to the target storage unit and the number of the first identifiers in the bitmap information;

determining whether a flag bit corresponding to the target logical address in the bitmap information is a second identifier corresponding to a repeated data block when the target data block is the repeated data block;

under the condition that a flag bit corresponding to the target logical address in the bitmap information is the second identifier, querying the target physical address corresponding to the target logical address from a second preset storage area;

under the condition that a flag bit corresponding to the target logical address in the bitmap information is not the second identifier, determining a plurality of triples according to the bitmap information, a first logical address starting position corresponding to the target storage unit, the first physical address starting position and the physical address number; determining a target triple corresponding to the target logical address from the triples, and determining the target physical address corresponding to the target logical address according to the target logical address, a second logical address starting position and a second physical address starting position included in the target triple;

and processing the target data block according to the target physical address.
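
As an illustration only, the address resolution recited in claim 1 can be sketched in Python. The sketch is not part of the claims and rests on several assumptions: the first identifier is encoded as 0 and the second identifier as 1; non-duplicate blocks of a storage unit occupy consecutive physical addresses beginning at the first physical address starting position; each triple is a (logical start, physical start, length) run of consecutive non-duplicate flags; and the second preset storage area is modelled as a dictionary. The names `StorageUnit`, `triples`, `lookup` and `dup_table` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

NON_DUP = 0  # assumed encoding of the "first identifier" (non-duplicate block)
DUP = 1      # assumed encoding of the "second identifier" (duplicate block)


@dataclass
class StorageUnit:
    logical_start: int    # first logical address starting position
    physical_start: int   # first physical address starting position
    physical_count: int   # physical address number (non-duplicate blocks held)
    flags: List[int]      # bitmap information, one flag per logical address


def triples(unit: StorageUnit) -> List[Tuple[int, int, int]]:
    """Return (logical_start, physical_start, length) for each run of
    consecutive non-duplicate flags in the unit's bitmap."""
    runs = []
    phys = unit.physical_start
    i, n = 0, len(unit.flags)
    while i < n:
        if unit.flags[i] != NON_DUP:
            i += 1
            continue
        j = i
        while j < n and unit.flags[j] == NON_DUP:
            j += 1
        runs.append((unit.logical_start + i, phys, j - i))
        phys += j - i  # non-duplicate blocks are assumed physically contiguous
        i = j
    return runs


def lookup(unit: StorageUnit, logical: int, dup_table: Dict[int, int]) -> int:
    """Resolve a logical address to a physical address."""
    offset = logical - unit.logical_start
    if unit.flags[offset] == DUP:
        # Duplicate block: query the (assumed) second preset storage area.
        return dup_table[logical]
    for l_start, p_start, length in triples(unit):
        if l_start <= logical < l_start + length:
            # target physical = target logical - run logical start + run physical start
            return logical - l_start + p_start
    raise KeyError(logical)


unit = StorageUnit(logical_start=100, physical_start=5000,
                   physical_count=4, flags=[0, 0, 1, 0, 1, 0])
dup_table = {102: 77, 104: 88}
print(triples(unit))
print(lookup(unit, 103, dup_table), lookup(unit, 102, dup_table))
```

Under these assumptions the bitmap alone is enough to recover the physical address of every non-duplicate block, so only duplicate blocks need an explicit entry in the separate storage area.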

2. The method of claim 1, wherein determining the target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information comprises:

and under the condition that the target storage unit is in the memory, determining the target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

3. The method of claim 2, further comprising:

4. The method of claim 3, further comprising:

and reading the target storage unit into the memory.

5. The method of claim 4, wherein after reading the target storage unit into the memory, the method further comprises:

and taking the used physical address as the starting position of the first physical address included in the target storage unit.

6. The method according to claim 1, wherein after determining the target physical address corresponding to the target logical address according to the target logical address, the second logical address starting location, and the second physical address starting location, the method further comprises:

7. A data block processing apparatus, characterized in that the apparatus comprises:

the determining module is used for determining a target storage unit corresponding to a target logical address according to the target logical address of a target data block to be processed, wherein the target storage unit comprises bitmap information, a first physical address starting position and a physical address number, and the bitmap information is used for indicating whether a data block corresponding to each logical address in a plurality of logical addresses corresponding to the target storage unit is a repeated data block;

the processing module is used for processing the target data block according to the target physical address;

wherein the determining module is specifically configured to:

setting a flag bit corresponding to the target logical address in the bitmap information as a first identifier corresponding to the non-duplicate data block when the target data block is a non-duplicate data block; determining a target physical address corresponding to the target logical address according to the starting position of the physical address corresponding to the non-duplicated data block corresponding to the target storage unit and the number of the first identifiers in the bitmap information;

determining whether a flag bit corresponding to the target logical address in the bitmap information is a second identifier corresponding to a repeated data block when the target data block is the repeated data block; the target storage unit further comprises: a first physical address starting position and a physical address number; determining a plurality of triples according to the bitmap information, a first logical address starting position corresponding to the target storage unit, the first physical address starting position and the physical address number under the condition that a flag bit corresponding to the target logical address in the bitmap information is not the second identifier; determining a target triple corresponding to the target logical address from the multiple triples, wherein the target triple comprises a second logical address starting position and a second physical address starting position; determining the target physical address corresponding to the target logical address according to the target logical address, the second logical address starting position and the second physical address starting position;

the device further comprises:

and the query module is used for querying the target physical address corresponding to the target logical address from a second preset storage area under the condition that the flag bit corresponding to the target logical address in the bitmap information is the second identifier.

8. The apparatus of claim 7, wherein the determining module is specifically configured to: under the condition that the target storage unit is in the memory, determine the target physical address corresponding to the target logical address according to the flag bit corresponding to the target logical address in the bitmap information.

9. The apparatus of claim 8, wherein the determining module is further configured to: determining whether the number of the existing storage units in the memory is greater than or equal to a preset threshold value or not under the condition that the target storage unit is not in the memory;

the device further comprises: a deleting module, used for deleting the storage units in the memory which have not been accessed within a preset time, under the condition that the number of the existing storage units in the memory is greater than or equal to the preset threshold value.
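
As an illustration only, the in-memory handling of storage units recited in claim 9 can be sketched in Python. The sketch is not part of the claims. The class `UnitCache`, its parameters `max_units` and `idle_seconds`, the injectable `clock`, and the `load` callback that reads a storage unit into memory are all hypothetical names. The claim does not state what happens when the threshold is reached but no storage unit is stale; this sketch simply lets the cache grow in that case.

```python
import time


class UnitCache:
    """Keeps storage units in memory and drops idle ones when the cache is full."""

    def __init__(self, max_units, idle_seconds, clock=time.monotonic):
        self.max_units = max_units        # assumed "preset threshold value"
        self.idle_seconds = idle_seconds  # assumed "preset time"
        self.clock = clock                # injectable so the sketch can be tested
        self.units = {}                   # unit id -> storage unit held in memory
        self.last_access = {}             # unit id -> time of last access

    def get(self, unit_id, load):
        now = self.clock()
        if unit_id not in self.units:
            # Target unit is not in memory: check the threshold first.
            if len(self.units) >= self.max_units:
                stale = [u for u, t in self.last_access.items()
                         if now - t > self.idle_seconds]
                for u in stale:
                    # Delete units not accessed within the preset time.
                    del self.units[u]
                    del self.last_access[u]
            # Read the target storage unit into memory.
            self.units[unit_id] = load(unit_id)
        self.last_access[unit_id] = now
        return self.units[unit_id]


# Usage with a fake clock; str.upper stands in for reading a unit from disk.
t = [0.0]
cache = UnitCache(max_units=2, idle_seconds=10.0, clock=lambda: t[0])
cache.get("a", str.upper)
t[0] = 5.0
cache.get("b", str.upper)
t[0] = 12.0
cache.get("c", str.upper)  # "a" has been idle for 12 s and is dropped
print(sorted(cache.units))
```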

10. The apparatus of claim 9, further comprising: the device comprises an acquisition module and a read-in module;

the read-in module is used for reading the target storage unit into the memory.

11. The apparatus of claim 10, wherein the determining module is further configured to: determining whether the target logical address is a first logical address in a plurality of logical addresses corresponding to the target storage unit;

the device further comprises: the updating module is used for updating the used physical address under the condition that the target logical address is the first logical address in the plurality of logical addresses corresponding to the target storage unit to obtain the updated used physical address;

the processing module is further configured to: and taking the used physical address as the starting position of the first physical address included in the target storage unit.

12. The apparatus of claim 7, further comprising: the device comprises a setting module and a storage module;

13. A data block processing apparatus, comprising: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 6.

14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.