CN111338569A - Object storage back-end optimization method based on direct mapping - Google Patents
Object storage back-end optimization method based on direct mapping
- Publication number
- CN111338569A CN111338569A CN202010094568.XA CN202010094568A CN111338569A CN 111338569 A CN111338569 A CN 111338569A CN 202010094568 A CN202010094568 A CN 202010094568A CN 111338569 A CN111338569 A CN 111338569A
- Authority
- CN
- China
- Prior art keywords
- data
- disk
- hash
- index
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0616—Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
- G06F3/0632—Configuration or reconfiguration of storage systems by initialisation or re-initialisation of storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0652—Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an object storage back-end optimization method based on direct mapping, which is implemented according to the following steps: step 1, dividing each disk at the back end of storage into a plurality of spaces named object buckets according to a fixed size, and numbering the object buckets; dividing the index into a plurality of spaces named index buckets according to a fixed size; and uniformly distributing the objects to be written, in proportion, across the disks of the node through a distributed hash table; step 2, realizing a small-range mapping from the object fingerprint to the physical storage position through the index; and step 3, realizing physical storage of data through the storage layer. The invention solves the problem that, in the prior art, the mapping relation between the data index and the physical position of the data is complex.
Description
[ technical field ]
The invention belongs to the technical field of computer storage, and particularly relates to a direct mapping-based object storage back-end optimization method.
[ background of the invention ]
With the advent of the big data era, business applications place ever larger demands on storage space and ever higher demands on performance. Users' I/O requests must be answered quickly and in a timely manner. How to simplify the data processing flow and provide storage products that are efficient, easy to use and easy to maintain is an urgent need for storage vendors. This need exists both for conventional multi-controller storage systems and for the increasingly common distributed storage.
The back-end storage schemes adopted by current object storage products in the industry are generally similar. When the system is initialized, all back-end disks are numbered and denoted Disk_Id (disk number); each disk is then logically divided into a number of uniformly sized extents, which are individually numbered and denoted Extent_Id (extent number). Different products may use different extent sizes. These extents are managed collectively by an Extent Manager. The collection of all extents on the disks together forms a virtualized storage pool within a node. At any time, the Extent Manager keeps one or more extents in the system in the active state.
When new data, whether user data or metadata, is to be written to the object storage system, it is cut into "objects" of a predetermined size, typically 4KB or 8KB. The system then generates a unique data fingerprint from the content of each object using a cryptographic digest algorithm. The data fingerprint and the data content are in a one-to-one relationship: identical fingerprints mean identical data content. The data is then appended to the end of one of the extents currently in the active state. Because writes are appended, the Extent Manager knows the relative position of the current data within the extent, denoted Extent_Offset (offset within the extent). In particular, to provide higher disk space utilization, the original data is compressed by a compression algorithm before being written, and a compressed flag bit is set. The position of a data block in the back-end virtualized storage pool, denoted PBN (physical block number), is therefore composed of Disk_Id, Extent_Id and Extent_Offset (disk number, extent number, offset within the extent). The physical storage location PBN of the data, the data fingerprint and some other flag bits required by the implementation are recorded in a data structure called ObjectRecord. These object records, which record the mapping from data fingerprints to physical storage locations, are stored in an Index.
An Index is an array-like structure that holds one or more ObjectRecords per location. By applying one or more hash calculations to the data fingerprint, one location, or several mutually backup locations, on the Index can be obtained for that fingerprint. The ObjectRecord associated with the data fingerprint is placed in one of these locations. When a read request arrives, the same hash algorithm is applied to the fingerprint to obtain the possible positions of the object record, which are traversed to match the data fingerprint. Once the object record is found, the PBN (physical storage location) stored in it is obtained, and the content of the object read from the virtual storage pool by the Extent Manager is returned to the client. FIG. 1 illustrates one possible index structure.
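To make the prior-art write and read path above concrete, the following Python sketch shows how a fingerprint might be hashed to a few mutually backup index slots and how a lookup recovers the PBN. The structure names, field widths and the choice of SHA-1 are illustrative assumptions, not the layout of any particular product.

```python
# Illustrative sketch of the prior-art lookup path; names and hashes are assumptions.
import hashlib
from dataclasses import dataclass

@dataclass
class PBN:                    # physical block number
    disk_id: int
    extent_id: int
    extent_offset: int        # offset of the block within the extent

@dataclass
class ObjectRecord:
    fingerprint: bytes        # cryptographic digest of the object content
    pbn: PBN
    ref_count: int = 1

class Index:
    def __init__(self, num_slots: int, probes: int = 4):
        self.slots = [[] for _ in range(num_slots)]   # each slot holds several records
        self.probes = probes

    def _candidate_slots(self, fp: bytes):
        # several independent hashes of the fingerprint give mutually backup slots
        for i in range(self.probes):
            digest = hashlib.sha1(fp + bytes([i])).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.slots)

    def insert(self, rec: ObjectRecord):
        first = next(self._candidate_slots(rec.fingerprint))
        self.slots[first].append(rec)

    def lookup(self, fp: bytes):
        # a read request rehashes the fingerprint and scans the candidate slots
        for s in self._candidate_slots(fp):
            for rec in self.slots[s]:
                if rec.fingerprint == fp:
                    return rec.pbn            # gives Disk_Id / Extent_Id / Extent_Offset
        return None
```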
The current implementation has the following disadvantages: 1. The physical location and the index location of a data block are loosely coupled. A data block can be stored at an arbitrary position in the back-end global disk space, and its index must record this arbitrary mapping, which is irregular and hard to manage because the granularity is too large. Reclaiming system space requires global garbage collection: every data block in every extent is traversed, and the index location it belongs to is computed, to check whether the data is still valid; if not, it can be deleted. This process is extremely time-consuming and complex, and causes a significant drop in system read and write performance of around 40%.
2. ObjectRecord occupies too much memory. A rough calculation of the size of an ObjectRecord: the data fingerprint occupies 20 bytes, the PBN occupies at least 8 bytes, the reference count occupies 4 bytes, and together with some other flag bits the size of an ObjectRecord is at least 32 bytes. Massive amounts of data correspond to massive numbers of ObjectRecords, which greatly increases the memory footprint of the system and also limits the maximum total disk capacity the system can support.
3. Data is not evenly distributed across the back-end disks. The extent placed in the active state is selected randomly, without optimization, which may concentrate too much data on a particular disk. The disks are then unevenly busy, which shortens disk life and increases the risk of losing system data due to disk damage.
[ summary of the invention ]
The invention aims to provide an object storage back-end optimization method based on direct mapping, to solve the problem that the mapping relation between the data index and the physical position of the data is complex in the prior art.
The invention adopts the following technical scheme: a direct mapping-based object storage back-end optimization method is implemented according to the following steps:
Step 1, dividing each disk at the back end of storage into a plurality of spaces named object buckets according to a fixed size, and numbering the object buckets; dividing the index into a plurality of spaces named index buckets according to a fixed size; and uniformly distributing the objects to be written, in proportion, across the disks of the node through a distributed hash table;
Step 2, realizing the small-range mapping from the object fingerprint to the physical storage position through the index;
Step 3, realizing the physical storage of data through the storage layer.
Further, in step 1, a specific method for writing an object through the distributed hash table is as follows:
S1.1, setting different weights for the disks, and allocating different numbers of ring points to each disk on the hash ring according to its weight;
S1.2, naming each ring point of each disk, and calculating the hash value of each ring point;
S1.3, combining the hash values calculated for the disk ring points into a hash array, and sorting the hash array by hash value, so that the hash ring points of the disks are uniformly distributed on the hash ring.
Further, the physical starting position of the object bucket in the disk is calculated by the formula:
Physical position of the ObjectBucket = corresponding IndexBucket number × ObjectBucket size + starting position of the ObjectBucket area.
Further, each time new data is written, a hash calculation is performed on the data to obtain its fingerprint; the hash ring is queried, starting from its initial position, until the first disk ring point whose value is larger than the data fingerprint value is found; the disk to which that ring point belongs is the disk on which the new data block should be saved.
Further, when disks are added or removed, the hash ring points are recalculated and sorted according to the new disk topology: the relative positions of the unchanged ring points do not change; only the data on the changed disks is moved out or in, and the moved data is evenly and proportionally distributed over the new disk combination.
Further, in step 3, when new data arrives, the storage position of the object in the index layer is calculated according to the fingerprint of the data; the specific steps are as follows:
S2.1, applying a group of hash functions to find a group of candidate buckets for the data fingerprint among the index buckets;
S2.2, traversing each bucket in the group of candidate buckets in turn, and storing the new object record in the current index bucket if it has free space and the corresponding object bucket has enough contiguous space to store the real data;
S2.3, otherwise, using the cuckoo algorithm, kicking an existing object record in a candidate index bucket out to one of its other candidate buckets, and then putting the new object record in the vacated position;
S2.4, after the object record and the data are stored in the found positions, updating the allocation map and the boundary map of the bucket to reflect the change.
The beneficial effects of the invention are as follows: the mapping logic of data from the index to the physical address is simplified; the memory space occupied by each ObjectRecord (object record) is reduced, and the memory usage of the whole system is greatly optimized, so that the maximum total disk capacity supported under the same memory configuration increases; high-speed linear addressing is achieved while data compression is still supported; because the mapping is direct, the system no longer needs to reclaim reusable space through a garbage collection mechanism, so it can stably maintain high performance; and the design and usage complexity of the storage system is greatly reduced, making the product easy to use and maintain and more convenient for users.
[ description of the drawings ]
FIG. 1 is an index manager architecture;
FIG. 2 is a general layout diagram of a direct mapping based object storage backend technology;
FIG. 3 is a DHT;
FIG. 4 is a diagram illustrating mapping between object records, index buckets, and object buckets;
FIG. 5 shows the process of rearranging data blocks within an object bucket.
[ detailed description ] embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The example shown in FIG. 1 is prior art, in which the Index is divided into 4 partitions, denoted Index 1-4. Different data types have different priorities, and the partitions serve as mutual backups. Each partition contains a number of buckets, and each bucket contains a number of ObjectRecords. Each ObjectRecord records the PBN (physical storage location) of a data block, its data fingerprint and other attributes, such as the reference count of the data object. The reference count of a data object represents the number of times the object is referenced by the outside world. When the reference count drops to 0, the data object is no longer referenced and can be reclaimed. The garbage collection mechanism of the storage system relies on the reference count to determine whether the physical storage space corresponding to an ObjectRecord can be reclaimed.
The present invention may be intuitively described as direct placement. In summary, each back-end disk is divided into a number of spaces named ObjectBuckets according to a fixed size, such as 64KB, and each is numbered. Similarly, the Index is divided into a number of spaces named IndexBuckets of a fixed size, e.g. 1KB.
Unlike the prior art, each disk has its own separate Index, and together these Indexes provide a uniform interface to the upper layers. Data is evenly distributed across the disks according to their characteristics, and the upper-layer structures do not perceive changes at the back end. In this way, a one-to-one small-range mapping between IndexBucket and ObjectBucket is established within the disk, and the real data corresponding to each ObjectRecord in the index is stored in the corresponding ObjectBucket. For example, the IndexBucket numbered #1 corresponds to the back-end ObjectBucket numbered #1. Knowing the number of an IndexBucket therefore gives the number of the ObjectBucket, and because every ObjectBucket has the same fixed size, the physical starting position of the ObjectBucket in the disk can be found by the following formula:
Physical location of the ObjectBucket = corresponding IndexBucket number × ObjectBucket size + starting location of the ObjectBucket area. The PBN (physical storage location) record of the prior art is therefore no longer needed; the ObjectRecord only has to save the relative offset of the data within the ObjectBucket to locate the data itself.
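As a minimal sketch of this direct-mapping address calculation (assuming a 64 KB bucket, 512 B sectors and an example start offset for the ObjectBucket area; these constants are illustrative, not fixed by the text beyond "a fixed size, such as 64KB"):

```python
OBJECT_BUCKET_SIZE = 64 * 1024              # bytes, fixed when the disk is formatted
OBJECT_BUCKET_AREA_START = 1 * 1024 * 1024  # assumed start of the data area on the disk
SECTOR = 512                                # bytes per sector

def object_bucket_position(index_bucket_no: int) -> int:
    """Physical start of the ObjectBucket that mirrors IndexBucket #index_bucket_no."""
    return index_bucket_no * OBJECT_BUCKET_SIZE + OBJECT_BUCKET_AREA_START

def data_position(index_bucket_no: int, in_bucket_offset: int) -> int:
    """Physical position of data whose ObjectRecord stores only a 7-bit sector offset."""
    return object_bucket_position(index_bucket_no) + in_bucket_offset * SECTOR
```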
Fig. 2 shows the overall design of the invention. Implementing this scheme requires the following three components: a Distributed Hash Table (DHT); an Index layer, in which each disk has its own Index containing a number of individually numbered IndexBuckets; and a Storage layer, in which each disk is divided into a number of individually numbered ObjectBuckets.
The back end of a storage node consists of a large number of disks. The system needs to organize these scattered disk spaces to present a uniform, contiguous available space to the client. However, the capacities and performance of the disks may differ, and the system needs to allocate data objects to appropriate disks according to each disk's characteristics, avoiding over-concentration. In current implementations, the disk on which data is stored is chosen randomly, or at least unscientifically; even a round-robin scheme cannot take the performance differences between disks into account. Moreover, a back-end disk may fail temporarily or permanently because of a fault or end of life, and new disks may be added at any time to expand capacity. Whenever the back-end disk combination changes, the rebalanced data still needs to be evenly distributed over the new storage locations while the amount of data moved is minimized. The present invention uses a DHT (distributed hash table) to decide on which disk of the node the data should be stored. An implementation of the DHT is shown in FIG. 3.
The work flow of the distributed hash table is as follows:
s1.1, setting different weights for the disk by using one or more attributes of the disk, and distributing different numbers of ring points for the disk on the Hash ring according to the weights. Thus, the higher the weight, the more hash ring points are assigned. Such as: we depend solely on the capacity of the disk. One point weight per megabyte of data. The weight of a disk with the capacity of 100GB is 102400, and 102400 ring points are distributed; the weight of a disk with the capacity of 50GB is 51200, and 51200 ring points are distributed;
s1.2, naming each ring point of each disk, and calculating a hash value. For example: the disk name is sda, then its ring points are named sda _0, sda1, …, sda _102399 in order. Thereby calculating the hash value of each ring point.
S1.3, combine the hash values calculated for the disk ring points into a hash array and sort it by hash value. Because the hash itself is random and uniform, after sorting by hash value the ring points of the disks are uniformly distributed on the DHT ring.
When new data is written, a hash calculation is performed on the data block to obtain the fingerprint of the data. The hash ring is then queried, starting from its initial position, until the first disk ring point whose value is larger than the data fingerprint value is found; the disk to which that ring point belongs is the disk on which the new data block should be saved. As long as the content of the data block does not change, its fingerprint does not change, so the ring point calculated at write time is the same as the one calculated for a read request, and a read request therefore goes to the disk where the data is located to find the data.
When disks are added or removed, the hash ring points are recalculated and sorted according to the new disk topology. The relative positions of the unchanged ring points do not change; all that changes are the ring points corresponding to newly added or deleted disks, and the data associated with them. Therefore only the data on the changed disks needs to be moved out or in, and the moved data is evenly and proportionally distributed over the new disk combination, so the whole system remains in a stable, balanced state.
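The following Python sketch illustrates the weighted hash ring of S1.1-S1.3 and the lookup rule above, under the assumptions of one ring point per megabyte of capacity and SHA-1 as the ring-point hash; the text does not mandate a specific digest or ring-point count.

```python
# Sketch of the weighted DHT ring; hash choice and weighting rule are assumptions.
import bisect, hashlib

def _h(s: bytes) -> int:
    return int.from_bytes(hashlib.sha1(s).digest(), "big")

class HashRing:
    def __init__(self, disks: dict):
        """disks maps a disk name (e.g. 'sda') to its capacity in MB (= weight)."""
        self.points = []                       # list of (ring-point hash, disk name)
        for name, weight in disks.items():
            for i in range(weight):            # one ring point per MB of capacity
                self.points.append((_h(f"{name}_{i}".encode()), name))
        self.points.sort()                     # sorted ring points spread uniformly
        self.keys = [p[0] for p in self.points]

    def locate(self, fingerprint: bytes) -> str:
        """First ring point whose hash exceeds the fingerprint value owns the data."""
        fp_val = int.from_bytes(fingerprint, "big")   # fingerprint assumed same width as SHA-1
        i = bisect.bisect_right(self.keys, fp_val) % len(self.points)
        return self.points[i][1]

# Example: ring = HashRing({"sda": 102400, "sdb": 51200}); ring.locate(fingerprint)
```

Because both writes and reads derive the same ring point from the unchanged fingerprint, a read request is routed to the disk that holds the data without any extra lookup table.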
"Index" - -implementing a small-scale mapping of object fingerprints to physical storage locations:
each disk will build its own index service itself. When the data finds its destination disk through the DHT, the data comes to the Index layer of the destination disk itself. The index layer is responsible for finding a relatively fixed position for the data based on the data fingerprint. Because Index has extremely high access capacity, the Index is resident in the memory after the system is started and is refreshed on the disk when appropriate, and the data security of the Index is ensured.
The Index is logically divided into a number of IndexBuckets of a fixed size, e.g. 1KB, which are numbered. Each index bucket uniquely corresponds to the identically numbered ObjectBucket on the back-end disk. Each index bucket stores a number of data structures named ObjectRecord, and the real data represented by these object records is stored in the object bucket corresponding to that index bucket. The size of the IndexBucket and the size of the ObjectRecord may vary between implementations, so the number of records stored in an IndexBucket may also vary. FIG. 4 shows the mapping for an ObjectRecord size of 24B, an IndexBucket size of 1KB and an ObjectBucket size of 64KB.
In the present invention, a 64KB object bucket contains 128 sectors of 512B, so the offsets within the bucket range from 0 to 127 and a 7-bit field is sufficient to hold the offset. Compared with the existing scheme, whose ObjectRecord occupies at least 32 bytes (256 bits) including the PBN (physical storage location) record, the total size of the ObjectRecord is reduced to 24 bytes, saving 25% of the space. Considering that every object corresponds to an ObjectRecord, a 25% memory saving is a huge optimization; at the same time, the maximum total disk capacity supported by the system increases by 25%.
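A possible packing of such a 24-byte ObjectRecord is sketched below: a 20-byte fingerprint plus one 32-bit word carrying the 7-bit in-bucket offset, a compressed flag and a reference count. The exact bit split of the final word is an assumption made for illustration.

```python
# Hypothetical 24-byte ObjectRecord layout consistent with the sizes given above.
import struct

REC_FMT = "<20sI"   # 20-byte fingerprint + 4-byte packed word = 24 bytes
assert struct.calcsize(REC_FMT) == 24

def pack_record(fp: bytes, offset: int, compressed: bool, ref_count: int) -> bytes:
    assert 0 <= offset < 128            # 7-bit sector offset (0..127) in a 64 KB bucket
    assert 0 <= ref_count < (1 << 24)   # assumed 24-bit reference count
    word = offset | (int(compressed) << 7) | (ref_count << 8)
    return struct.pack(REC_FMT, fp, word)

def unpack_record(raw: bytes):
    fp, word = struct.unpack(REC_FMT, raw)
    return fp, word & 0x7F, bool((word >> 7) & 1), word >> 8
```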
Each IndexBucket also contains an AllocationMap and a BoundaryMap. When the reference count of a piece of data drops to zero, indicating that the data is invalid, the corresponding flag bits in the AllocationMap and BoundaryMap are cleared. When the storage space in the back-end object bucket runs short and the space needs to be rearranged, invalid data is overwritten according to these maps to make room.
When new data arrives, the storage position of the object in the Index layer is calculated from the fingerprint of the data. The steps are as follows:
s2.1, a group of hash functions are applied, and a group of candidate buckets are found for the data fingerprint in a plurality of IndexBuckets (index buckets). The set of candidate buckets is the possible storage locations for this fingerprint. This set of candidate buckets will also be used to resolve conflicts when the amount of data in the system is drastically increased.
S2.2, each bucket in the group of candidate buckets is traversed in turn until a suitable location is found. To decide whether the current IndexBucket can store the new ObjectRecord, check whether the IndexBucket has free space and whether the ObjectBucket corresponding to the index bucket has enough contiguous space.
S2.3, if the group of candidate buckets has been traversed without finding a suitable position, the Cuckoo algorithm is used to kick one existing ObjectRecord in a candidate IndexBucket out to one of its other candidate buckets, and the new ObjectRecord is then put into the vacated position.
S2.4, after the ObjectRecord and the data are stored in the found positions, the AllocationMap and BoundaryMap of the bucket are updated to reflect the change.
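A compact sketch of this placement procedure (S2.1 to S2.4) is given below. The number of hash functions, the eviction limit and the bucket capacity of 42 records (roughly 1 KB / 24 B) are illustrative choices, and the check of free space in the corresponding object bucket is reduced to a comment.

```python
# Sketch of candidate-bucket placement with cuckoo-style eviction; parameters are assumptions.
import hashlib, random

NUM_HASHES = 4
MAX_KICKS = 64

def candidate_buckets(fp: bytes, num_buckets: int):
    return [int.from_bytes(hashlib.sha1(fp + bytes([i])).digest()[:4], "big") % num_buckets
            for i in range(NUM_HASHES)]

def place(index, fp, record, records_per_bucket=42):
    """index is a list of index buckets; each bucket is a dict fingerprint -> record."""
    for _ in range(MAX_KICKS):
        for b in candidate_buckets(fp, len(index)):
            if len(index[b]) < records_per_bucket:   # free slot in this index bucket
                index[b][fp] = record                # (object-bucket space check omitted)
                return True
        # all candidates full: evict a victim from one candidate and re-home it
        b = random.choice(candidate_buckets(fp, len(index)))
        victim_fp, victim_rec = index[b].popitem()
        index[b][fp] = record
        fp, record = victim_fp, victim_rec           # continue by placing the victim
    return False                                     # give up after MAX_KICKS evictions
```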
As described above, each disk is divided into a number of fixed-size ObjectBuckets, such as 64KB, for storing data. To remain compatible with data compression while achieving better disk utilization and reducing disk fragmentation, data that cannot be compressed is stored sequentially from the left side of the bucket toward the right, and data whose compressed size is smaller than its original size is stored sequentially from the right side of the bucket toward the left. The in-bucket offset value of the data is recorded in the ObjectRecord in the index. When data is to be written, the first free data space meeting the size requirement is found, from left to right or from right to left, according to the AllocationMap in the index bucket and the size of the data block. A rearrangement is initiated when there is no contiguous free space in the bucket that meets the size requirement, but it is calculated that such a space can be assembled by rearranging. The rearrangement consults the AllocationMap in the corresponding index bucket: the data blocks of different sizes that are still valid are moved toward the two sides, covering the free areas, and the offset values in the corresponding object records in the index bucket are updated synchronously, so that a contiguous free space forms in the middle of the object bucket, in which the data can then be stored. FIG. 5 illustrates this rearrangement process.
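The two-sided layout and the rearrangement pass can be sketched as follows. The lists used here stand in for the AllocationMap and BoundaryMap, sizes are counted in 512 B sectors of a 64 KB bucket, and blocks whose reference count has dropped to zero are assumed to have already been removed from the bucket's block list.

```python
# Sketch of two-sided in-bucket allocation and rearrangement; data structures are assumptions.
SECTORS = 128                                   # 64 KB bucket / 512 B sectors

class ObjectBucket:
    def __init__(self):
        # each entry: [fingerprint, start_sector, size_in_sectors, compressed]
        self.blocks = []

    def _free(self, start, size):
        taken = set()
        for _, s, n, _ in self.blocks:
            taken.update(range(s, s + n))
        return all(sec not in taken for sec in range(start, start + size))

    def allocate(self, fp, size, compressed):
        # uncompressed data scans from the left edge, compressed data from the right edge
        scan = range(SECTORS - size, -1, -1) if compressed else range(0, SECTORS - size + 1)
        for start in scan:
            if self._free(start, size):
                self.blocks.append([fp, start, size, compressed])
                return start                    # in-bucket offset stored in the ObjectRecord
        return None                             # caller may trigger a rearrangement

    def rearrange(self):
        """Repack the still-valid blocks toward both edges so free space merges in the
        middle; returns {fingerprint: new offset} so the corresponding ObjectRecords in
        the index bucket can be updated in step."""
        moved, left, right = {}, 0, SECTORS
        for blk in self.blocks:
            fp, _, size, compressed = blk
            if compressed:
                right -= size
                blk[1] = right
            else:
                blk[1] = left
                left += size
            moved[fp] = blk[1]
        return moved
```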
The invention provides a direct mapping-based object storage back-end optimization method that optimizes the implementation of the object storage back end. By narrowing the mapping range, the mapping between the data fingerprint and the physical position becomes simple and direct, while data compression and uniform distribution of data over the global capacity space are still supported. Through the DHT (distributed hash table), data is stored uniformly and evenly among the disks in the global capacity space of the node, based on one or more attributes of the disks. The one-to-one mapping between the index and the physical disk reduces the mapping granularity and simplifies the computation from the data fingerprint to the physical storage location of the data. The amount of memory required is greatly reduced, so the maximum physical capacity supported by the system increases. The system's need for global garbage collection is eliminated: because the reduced mapping granularity gives a one-to-one mapping between index and storage position, reclamation of free space can be handled within the object bucket, without global garbage collection, so the storage system can continue to provide high-performance output.
The invention uses the distributed hash table to distribute data uniformly across the disks within a storage node; the weighting can be determined comprehensively from factors such as performance and capacity. When the topology of the disks changes, only part of the data needs to be relocated. The one-to-one correspondence between index buckets and object buckets makes the software design and implementation simpler, greatly reduces the memory needed to store data virtual addresses, and noticeably improves performance. The small-range mapping between the index and the data area means the system needs no large-scale global garbage collection, so the read and write performance of the system is stable and does not drop sharply because garbage collection has started. Rearranging data blocks within an object bucket has the notable advantages of high performance and small impact. The maximum compression ratio of the system reaches 2.5:1.
Claims (6)
1. A direct mapping-based object storage back-end optimization method is characterized by comprising the following steps:
step 1, dividing each disk at the back end of storage into a plurality of spaces named object buckets according to a fixed size, and numbering the object buckets; dividing the index into a plurality of spaces named index buckets according to a fixed size; and uniformly distributing the objects to be written, in proportion, across the disks of the node through a distributed hash table;
step 2, realizing the small-range mapping from the object fingerprint to the physical storage position through indexing;
and step 3, realizing the physical storage of data through the storage layer.
2. The method for optimizing the object storage back end based on direct mapping according to claim 1, wherein in the step 1, the specific method for writing the object through the distributed hash table is as follows:
S1.1, setting different weights for the disks, and allocating different numbers of ring points to each disk on the hash ring according to its weight;
S1.2, naming each ring point of each disk, and calculating the hash value of each ring point;
S1.3, combining the hash values calculated for the disk ring points into a hash array, and sorting the hash array by hash value, so that the hash ring points of the disks are uniformly distributed on the hash ring.
3. The direct mapping-based object storage back-end optimization method of claim 2, wherein the physical starting position of the object bucket in the disk is calculated by the formula:
The physical position of the ObjectBucket = corresponding IndexBucket number × ObjectBucket size + starting position of the ObjectBucket area.
4. The object storage back-end optimization method based on direct mapping as claimed in any one of claims 1-3, wherein, every time there is new data to be written, a hash calculation is performed on the data to obtain the fingerprint of the data; the hash ring is queried from its initial position until the first disk ring point larger than the data fingerprint value is found; and the disk to which that ring point belongs is the disk on which the new data block should be saved.
5. The object storage back-end optimization method based on direct mapping as claimed in any one of claims 1-3, wherein when disks are added or removed, the hash ring points are recalculated and sorted according to the new disk topology result: the relative positions of the unchanged ring points do not change; the data on the changed disks is moved out or in, and the moved data is evenly and proportionally distributed over the new disk combination.
6. The method for optimizing the object storage back end based on direct mapping as claimed in claim 1 or 2, wherein in step 3, when new data arrives, the storage position of the object in the index layer is calculated according to the fingerprint of the data; the specific steps are as follows:
S2.1, applying a group of hash functions to find a group of candidate buckets for the data fingerprint among the index buckets;
S2.2, traversing each bucket in the group of candidate buckets in turn, and storing the new object record in the current index bucket if it has free space and the corresponding object bucket has enough contiguous space to store the real data;
S2.3, otherwise, using the cuckoo algorithm, kicking an existing object record in a candidate index bucket out to one of its other candidate buckets, and then putting the new object record in the vacated position;
S2.4, after the object record and the data are stored in the found positions, updating the allocation map and the boundary map of the bucket to reflect the change.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094568.XA CN111338569A (en) | 2020-02-16 | 2020-02-16 | Object storage back-end optimization method based on direct mapping |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094568.XA CN111338569A (en) | 2020-02-16 | 2020-02-16 | Object storage back-end optimization method based on direct mapping |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111338569A true CN111338569A (en) | 2020-06-26 |
Family
ID=71181495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010094568.XA Pending CN111338569A (en) | 2020-02-16 | 2020-02-16 | Object storage back-end optimization method based on direct mapping |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111338569A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114268501A (en) * | 2021-12-24 | 2022-04-01 | 深信服科技股份有限公司 | Data processing method, firewall generation method, computing device and storage medium |
CN114442927A (en) * | 2021-12-22 | 2022-05-06 | 天翼云科技有限公司 | Data storage space management method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101782922A (en) * | 2009-12-29 | 2010-07-21 | 山东山大鸥玛软件有限公司 | Multi-level bucket hashing index method for searching mass data |
CN103514250A (en) * | 2013-06-20 | 2014-01-15 | 易乐天 | Method and system for deleting global repeating data and storage device |
CN103577339A (en) * | 2012-07-27 | 2014-02-12 | 深圳市腾讯计算机系统有限公司 | Method and system for storing data |
CN107704202A (en) * | 2017-09-18 | 2018-02-16 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus of data fast reading and writing |
CN110058822A (en) * | 2019-04-26 | 2019-07-26 | 北京计算机技术及应用研究所 | A kind of disk array transverse direction expanding method |
- 2020-02-16: CN CN202010094568.XA patent/CN111338569A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101782922A (en) * | 2009-12-29 | 2010-07-21 | 山东山大鸥玛软件有限公司 | Multi-level bucket hashing index method for searching mass data |
CN103577339A (en) * | 2012-07-27 | 2014-02-12 | 深圳市腾讯计算机系统有限公司 | Method and system for storing data |
CN103514250A (en) * | 2013-06-20 | 2014-01-15 | 易乐天 | Method and system for deleting global repeating data and storage device |
CN107704202A (en) * | 2017-09-18 | 2018-02-16 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus of data fast reading and writing |
CN110058822A (en) * | 2019-04-26 | 2019-07-26 | 北京计算机技术及应用研究所 | A kind of disk array transverse direction expanding method |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114442927A (en) * | 2021-12-22 | 2022-05-06 | 天翼云科技有限公司 | Data storage space management method and device |
CN114442927B (en) * | 2021-12-22 | 2023-11-03 | 天翼云科技有限公司 | Management method and device for data storage space |
CN114268501A (en) * | 2021-12-24 | 2022-04-01 | 深信服科技股份有限公司 | Data processing method, firewall generation method, computing device and storage medium |
CN114268501B (en) * | 2021-12-24 | 2024-02-23 | 深信服科技股份有限公司 | Data processing method, firewall generating method, computing device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 710000 Room 1202, 12th Floor, Cultural Creation Building, Yaodian Street Office, Qinhan New Town, Xixian New District, Xi'an, Shaanxi. Applicant after: Xi'an Okayun Data Technology Co.,Ltd. Address before: 712000 12/F, cultural and creative building, Qinhan new town, Xixian New District, Xi'an City, Shaanxi Province. Applicant before: Xi'an Okayun Data Technology Co.,Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200626 |