CN115543194A - Distributed object storage method, device, equipment and computer readable storage medium


Info

Publication number: CN115543194A
Application number: CN202211152796.3A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 任鹏翔, 杨利锋
Assignee: Lenovo Netapp Technology Ltd
Prior art keywords: data, content data, queue, storage medium, requests

Classifications

    • G06F3/0643 Organizing or formatting or addressing of data — Management of files
    • G06F16/182 File systems; File servers — Distributed file systems
    • G06F3/0608 Interfaces specially adapted for storage systems — Saving storage space on storage systems
    • G06F3/061 Interfaces specially adapted for storage systems — Improving I/O performance
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays


Abstract

The present application provides a distributed object storage method, apparatus, device, and computer-readable storage medium. The method includes: receiving first requests that carry first data to be written and are respectively sent by different terminals; merging the first requests sent by the terminals to obtain at least one merge queue; processing the first data carried by all first requests in the same merge queue to obtain second data corresponding to each merge queue; writing the content data included in the second data into a storage unit of a first storage medium, and writing the index data included in the second data into an index unit of the first storage medium; and when a target storage unit that is full of content data exists, evicting the content data in the target storage unit to a storage unit of a second storage medium. By using the low-latency first storage medium, the method improves the data read and write speed, and by merging the first data before writing, it improves the space utilization of the second storage medium.

Description

Distributed object storage method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of distributed file system technology, and in particular, but not exclusively, to a distributed object storage method, apparatus, device, and computer-readable storage medium.
Background
Ceph is a unified distributed file system designed for excellent performance, reliability, and scalability. Most existing distributed file system storage servers are hybrid-flash: each server is equipped with several high-speed SSDs and dozens of low-speed, high-capacity HDD hard disks. The SSDs serve as a performance acceleration layer and the HDDs as the main storage medium, and the two are used together to improve storage performance.
rgw is the application through which the Ceph system provides an object storage service to the outside. User objects are processed by rgw and ultimately stored in the backend RADOS (Reliable Autonomic Distributed Object Store) system. The task of rgw is to convert user objects into backend RADOS objects; it has no cache function. Considering factors such as I/O efficiency, data migration, and balanced distribution across nodes, the size of a single RADOS object is generally limited in practice; for the rgw application scenario, the size of a single RADOS object is limited to 4 MB by default. In the related art, rgw data reads and writes go directly to the HDDs on the server, resulting in poor system performance.
Disclosure of Invention
In view of this, embodiments of the present application provide a distributed object storage method, apparatus, device, and computer-readable storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a distributed object storage method, which comprises the following steps:
receiving first requests respectively sent by different terminals, wherein each first request carries first data to be written;
merging the first requests sent by each terminal to obtain at least one merged queue, wherein each merged queue comprises at least one first request;
processing first data carried by all first requests in the same merge queue to obtain second data corresponding to each merge queue, wherein the second data comprises index data and content data;
writing the content data corresponding to each merging queue into a storage unit of a first storage medium, and writing the index data corresponding to each merging queue into an index unit of the first storage medium;
and when a target storage unit that is fully written with content data exists, evicting the content data in the target storage unit to a storage unit of a second storage medium, wherein the latency of the first storage medium is lower than the latency of the second storage medium.
In some embodiments, the merging the first requests sent by the terminals to obtain at least one merge queue includes:
acquiring a preset mapping function;
processing the first requests sent by the terminals by using the mapping function to obtain mapping values corresponding to the first requests;
and merging the first requests according to the mapping values corresponding to the first requests to obtain at least one merged queue, wherein the mapping values corresponding to the first requests in the same merged queue are equal.
In some embodiments, the processing first data carried by all the first requests located in the same merge queue to obtain second data corresponding to each merge queue includes:
acquiring the number of bytes of first data carried by all first requests in the same merge queue;
sequentially arranging all first data carried by the first requests in the same merge queue according to the number of bytes to obtain content data, wherein the content data is a sequence string;
acquiring the initial position and the length of each first data in the content data, and determining the initial position and the length as index data;
and determining the content data and the index data corresponding to each merging queue as second data.
In some embodiments, after the merging the first requests sent by the terminals to obtain at least one merge queue, the method further includes:
adding a locking identifier to a terminal corresponding to a first request included in each queue;
when a first request which is sent again by a terminal added with a locking identifier is received, adding the first request which is sent again into a list to be responded;
after the writing of the index data corresponding to each merge queue into the index unit of the first storage medium, the method further includes:
deleting the locking identification added to the terminal corresponding to the first request included in each queue;
and responding to each retransmitted first request in the list to be responded according to the receiving time sequence of each retransmitted first request in the list to be responded.
In some embodiments, the writing the content data corresponding to each merge queue into a storage unit of a first storage medium includes:
acquiring a weight of the content data corresponding to each merge queue, wherein the weight of the content data represents the probability that the content data will be read; the probability that the content data is read is positively correlated with the weight, and the length of time before the content data is evicted to the second storage medium is positively correlated with the weight;
processing the corresponding content data according to the weight of the content data corresponding to each merging queue to obtain the content data with the weight;
and writing the content data with the weight corresponding to each merging queue into a storage unit of a first storage medium.
In some embodiments, the method further comprises:
receiving a second request which is sent by a terminal and used for reading content data, wherein the second request carries a reading identifier;
reading a target index in an index unit of the first storage medium according to the reading identifier, wherein the target index is used for indicating the storage address and the length of the content data to be read;
determining a storage unit storing content data to be read according to the target index;
reading the content data to be read from the storage unit storing the content data to be read according to the target index to obtain a reading result;
and sending a first response to the terminal, wherein the first response carries the reading result.
In some embodiments, the determining, according to the target index, a storage unit storing content data to be read includes:
determining whether the content data to be read is stored in a first storage medium or not according to the target index;
when the content data to be read is determined to be stored in a first storage medium, determining a storage unit in which the content data to be read is stored in the first storage medium;
and when the content data to be read is determined not to be stored in the first storage medium, determining a storage unit in which the content data to be read is stored in the second storage medium.
An embodiment of the present application provides a distributed object storage apparatus, where the apparatus includes:
the first receiving module is used for receiving first requests respectively sent by different terminals, and each first request carries first data to be written;
the merging processing module is used for merging the first requests sent by the terminals to obtain at least one merging queue, and each merging queue comprises at least one first request;
the data processing module is used for processing first data carried by all the first requests in the same merge queue to obtain second data corresponding to each merge queue, and the second data comprises index data and content data;
a first writing module, configured to write the content data corresponding to each merge queue into a storage unit of a first storage medium;
a second writing module, configured to write the index data corresponding to each merge queue into an index unit of the first storage medium;
and the eviction processing module is used for evicting the content data in the target storage unit to a storage unit of a second storage medium when a target storage unit fully written with content data exists, wherein the latency of the first storage medium is lower than that of the second storage medium.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the distributed object storage method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to implement the distributed object storage method provided by the embodiment of the present application when executed.
The distributed object storage method provided by the embodiment of the present application includes: receiving first requests respectively sent by different terminals, each first request carrying first data to be written; merging the first requests sent by the terminals to obtain at least one merge queue, each merge queue including at least one first request; processing the first data carried by all first requests in the same merge queue to obtain second data corresponding to each merge queue, the second data including index data and content data; writing the content data corresponding to each merge queue into a storage unit of a first storage medium, and writing the index data corresponding to each merge queue into an index unit of the first storage medium; and when a target storage unit fully written with content data exists, evicting the content data in the target storage unit to a storage unit of a second storage medium, the latency of the first storage medium being lower than that of the second storage medium. The method provided by the embodiment of the present application uses the low-latency first storage medium to improve the data read and write speed; by aggregating the first data, which occupy little memory, as they are written into the first storage medium, merging them into a large object, and then flushing the large object down to the second storage medium, it greatly improves the space utilization of the second storage medium, so that rgw object storage achieves higher performance and product competitiveness.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.
Fig. 1 is a schematic flowchart of an implementation of a distributed object storage method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another implementation of a distributed object storage method according to an embodiment of the present application;
fig. 3 is a schematic flow chart illustrating an implementation process of rgw cache in a solution provided by the related art;
fig. 4 is a schematic flow chart illustrating an implementation of rgw hierarchical cache in the solution provided in the embodiment of the present application;
Fig. 5 is a schematic flow diagram of the online aggregation process in the method provided by an embodiment of the present application;
fig. 6 is a schematic flow chart illustrating cache eviction in the method according to the embodiment of the present application;
FIG. 7 is a diagram illustrating a hierarchical queue weight according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a write IO request according to an embodiment of the present application;
fig. 9 is a schematic flowchart of a read IO request according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a component structure of a distributed object storage apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where a description such as "first\second\third" appears in this application, the following note applies: the terms "first\second\third" merely distinguish similar objects and do not imply a specific ordering of those objects. It should be understood that, where permitted, "first\second\third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are explained in detail, terms and expressions referred to in the embodiments of the present application will be explained, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) A Solid State Disk (SSD), also called Solid State Drive, is a hard Disk made of an array of Solid State electronic memory chips.
2) A mechanical hard disk (HDD, Hard Disk Drive) is a conventional hard disk, consisting mainly of platters, magnetic heads, a spindle and spindle motor, a head controller, a data converter, an interface, and a cache.
3) Hybrid flash uses the SSD as a performance acceleration layer, for example through a second-level cache or automatic tiering techniques, and the HDD as the main storage medium; the two are used together to improve storage performance.
4) Erasure Codes (EC) are a forward error correction technique, originally used mainly in network transmission to avoid packet loss, and adopted by storage systems to improve storage reliability. Compared with multi-copy replication, erasure codes achieve higher data reliability with less data redundancy, but the encoding is complex and requires a large amount of computation. Erasure codes can only tolerate data erasure (loss) and cannot tolerate data tampering, hence the name.
5) Metadata (Metadata): the data (data about data) describing data, also called intermediate data and relay data, is mainly information describing data attributes, and is used to support functions such as indicating storage locations, history data, resource searching, file recording, and the like.
To address the defects of rgw-based distributed object storage in the related art, namely that the SSD cannot be effectively utilized, HDD storage space is wasted, and system performance is poor, an embodiment of the present application provides a distributed object storage method. Fig. 1 is a schematic flow chart of an implementation of a distributed object storage method according to an embodiment of the present application; as shown in fig. 1, the method includes the following steps:
step S101, receiving first requests respectively sent by different terminals.
The method provided by the embodiment of the present application may be executed by the server side of a distributed file system. The server receives first requests respectively sent by a plurality of terminals; the first requests are used to write data into a storage medium of the server. After receiving the first requests, the server parses each of them and extracts the first data to be written that it carries.
Step S102, merging the first requests sent by each terminal to obtain at least one merging queue.
Here, each merge queue includes at least one first request.
In one implementation, the merging process of the first requests sent by the terminals may be implemented according to the following steps: acquiring a preset mapping function; processing the first requests sent by each terminal by using a mapping function to obtain mapping values corresponding to each first request; and merging the first requests according to the mapping values corresponding to the first requests to obtain at least one merged queue, wherein the mapping values corresponding to the first requests in the same merged queue are equal.
Here, the preset mapping function may be a hash function. rgw object storage has at most 512 front-end threads; after receiving a write request from a terminal, a thread accepts no other write request, and only receives and processes the next write request once the data to be written carried by the current request has been completely written.
In the embodiment of the present application, 32 merge threads are added on top of the original 512 front-end threads. Each merge thread exclusively owns one queue, and each first request sent by the front end is hashed onto one of the queues, yielding the merge queues.
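A minimal sketch of this request-to-queue mapping, assuming a simple hash of the object name and 32 merge queues (the function and field names are illustrative, not taken from the rgw source):

```python
import hashlib

NUM_MERGE_QUEUES = 32                                  # 32 merge threads, each owning one queue
merge_queues = [[] for _ in range(NUM_MERGE_QUEUES)]

def enqueue_first_request(request):
    """Hash a front-end write request onto one of the merge queues."""
    digest = hashlib.md5(request["object_name"].encode()).hexdigest()
    queue_index = int(digest, 16) % NUM_MERGE_QUEUES   # equal mapping values share a queue
    merge_queues[queue_index].append(request)
    return queue_index
```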
Step S103, processing the first data carried by all the first requests in the same merge queue to obtain second data corresponding to each merge queue.
Wherein the second data comprises index data and content data.
In some embodiments, the second data may be obtained by: acquiring the number of bytes of first data carried by all first requests in the same merge queue; sequentially arranging all first data carried by the first requests in the same merge queue according to the number of bytes to obtain content data, wherein the content data is a sequence string; acquiring the initial position and the length of each first data in the content data, and determining the initial position and the length as index data; and determining the content data and the index data corresponding to each merging queue as second data.
After the merge thread takes a batch of first requests, it takes one not-yet-full lob (large object) out of the lob list pre-allocated in memory, and arranges the requests in sequence into a string according to the byte count of each request, obtaining the content data; at the same time, it records the start position offset and the length len of each requested object within the lob, forming the index data. In this way the second data is obtained.
In some embodiments, the index data may be composed into key-value pairs (kv) and persisted into the object information (omap, object map) of the lob in the SSD copy pool.
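A sketch of how a merge thread might pack the first data of one merge queue into content data while recording the per-object offset and length index entries; the data layout and names here are assumptions for illustration:

```python
def pack_merge_queue(requests):
    """Build the second data for one merge queue: content data plus (offset, len) index entries."""
    content = bytearray()    # the sequence string (content data)
    index = {}               # index data: kv pairs later persisted into the lob's omap
    for req in requests:
        data = req["data"]
        index[req["object_name"]] = {"offset": len(content), "len": len(data)}
        content.extend(data)
    return bytes(content), index
```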
Step S104, writing the content data corresponding to each merge queue into the storage unit of the first storage medium.
In the related art, after receiving a first request, the server writes the first data carried by the request directly into the HDD, which suffers from slow write speed and low space utilization. In the embodiment of the present application, the first data are merged to obtain content data, and the content data are written into an SSD cache pool with low latency and high IOPS (Input/Output Operations Per Second). Using the low-latency first storage medium (SSD) increases the data read and write speed, and merging the first data before writing increases the space utilization of the second storage medium (HDD) storage pool.
In some embodiments, this step may be implemented as: acquiring the weight of the content data corresponding to each merging queue, wherein the weight of the content data represents the probability of reading the content data; wherein, the probability of the content data being read is positively correlated with the weight, and the time length of the content data being eliminated to the second storage medium is positively correlated with the weight; processing the corresponding content data according to the weight of the content data corresponding to each merging queue to obtain the content data with the weight; and writing the content data with the weight corresponding to each merging queue into a storage unit of the first storage medium.
A hierarchical eviction mechanism exists at the back end during eviction: the back-end large object is not unique; instead, hierarchical queues exist in the cache. The 32 queues can be divided into 8 levels, each level with a different weight. When the 32 front-end merge threads write objects into the SSD cache, different weights are assigned according to the data characteristics of the front-end IO requests. When data has a high probability of being read within a short time, a larger weight is set; the larger the weight, the later the data is written to the HDD, which further improves the data read/write speed and the read performance. When data has a low probability of being read within a short time, a smaller weight is set; the smaller the weight, the earlier the data is written to the HDD, which improves the space utilization of the SSD.
In step S105, the index data corresponding to each merge queue is written into the index unit of the first storage medium.
In order to facilitate reading data, when the data are combined, the initial position and the length of each first data in the sequence string are recorded simultaneously, so that quick searching and reading are realized.
And in step S106, when a target storage unit fully written with content data exists, the content data in the target storage unit is evicted to a storage unit of the second storage medium.
Here, the latency of the first storage medium is lower than the latency of the second storage medium.
An eviction thread module evicts fully written lobs from the SSD copy pool to the HDD erasure pool. Eviction is a necessary process for keeping front-end and back-end writes balanced: it must continuously migrate data from the SSD to the HDD to make room for the next front-end write requests; if eviction is not timely, the SSD fills up and writes become blocked.
When content data is evicted, only the content data is moved to the HDD; the index data remains stored in the SSD, which speeds up lookups and therefore reads.
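The eviction step can be pictured with the following schematic sketch, in which plain dictionaries stand in for the SSD and HDD pools (an assumption made purely for illustration):

```python
def evict_full_lob(lob_id, ssd_copy_pool, hdd_erasure_pool):
    """Evict one fully written lob: content data moves to the HDD, index data stays on the SSD."""
    lob = ssd_copy_pool[lob_id]
    hdd_erasure_pool[lob_id] = {"data": lob["data"]}  # content data now lives in the erasure pool
    lob["data"] = b""                                 # data portion in the SSD copy pool is emptied
    lob["xattr"]["flag"] = "evicted"                  # records that reads must now go to the HDD
    # lob["omap"] (the index data) deliberately stays in the SSD so lookups remain fast
```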
The distributed object storage method provided by the embodiment of the present application receives first requests respectively sent by different terminals, each first request carrying first data to be written; merges the first requests sent by the terminals to obtain at least one merge queue, each merge queue including at least one first request; processes the first data carried by all first requests in the same merge queue to obtain second data corresponding to each merge queue, the second data including index data and content data; writes the content data corresponding to each merge queue into a storage unit of a first storage medium and the index data corresponding to each merge queue into an index unit of the first storage medium; and, when a target storage unit fully written with content data exists, evicts the content data in the target storage unit to a storage unit of a second storage medium, the latency of the first storage medium being lower than that of the second storage medium. By using the low-latency first storage medium, the method improves the data read and write speed; by aggregating the first data, which occupy little memory, as they are written into the first storage medium, merging them into a large object, and then flushing it down to the second storage medium, the space utilization of the second storage medium is greatly improved, giving rgw object storage higher performance and product competitiveness.
In some embodiments, after the first requests sent by the terminals are merged in step S102 to obtain at least one merge queue, the write thread is locked, and after the index data corresponding to each merge queue is written into the index unit of the first storage medium in step S105, the locked thread is unlocked. On the basis of the embodiment shown in fig. 1, the distributed object storage method provided by the embodiment of the present application may therefore further include the following steps:
and step S11, adding locking identifiers to the terminals corresponding to the first requests included in each queue.
And S12, when receiving the first request which is sent again by the terminal added with the locking identification, adding the first request which is sent again into the list to be responded.
And step S13, deleting the locking identification added to the terminal corresponding to the first request included in each queue.
And step S14, responding the first re-sent requests in the list to be responded according to the receiving time sequence of the first re-sent requests in the list to be responded.
Here, steps S11 and S12 are thread locking steps and are executed after step S102, and steps S13 and S14 are thread unlocking steps and are executed after step S105. Through locking and unlocking operations, the consistency of the stored data can be ensured.
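A schematic of the locking window around a merged write, assuming a per-terminal lock set and a pending list (all names are illustrative, not the actual implementation):

```python
def on_first_request(request, merge_queue, pending_list, locked_terminals):
    """Steps S11-S12: lock the terminal while its request is being merged and written."""
    if request["terminal_id"] in locked_terminals:
        pending_list.append(request)                 # resent request joins the to-be-responded list
    else:
        locked_terminals.add(request["terminal_id"]) # add the locking identifier
        merge_queue.append(request)

def after_index_written(requests, pending_list, locked_terminals, respond):
    """Steps S13-S14: unlock, then answer the resent requests in order of arrival."""
    for req in requests:
        locked_terminals.discard(req["terminal_id"]) # delete the locking identifier
    for req in sorted(pending_list, key=lambda r: r["received_at"]):
        respond(req)                                 # respond in receiving-time order
    pending_list.clear()
```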
On the basis of the embodiment shown in fig. 1, an embodiment of the present application further provides a distributed object storage method, and fig. 2 is a schematic flow chart of another implementation of the distributed object storage method provided in the embodiment of the present application, as shown in fig. 2, the method includes the following steps:
in step S201, a second request for reading the content data sent by a terminal is received.
In the embodiment shown in fig. 1, data is written into the storage medium of the server; in the embodiment shown in fig. 2, data is read from the storage medium. The server receives a second request sent by a terminal for reading data, parses the second request, and extracts a read identifier, which indicates to the server which data needs to be read. Here, the terminal sending the second request may be any one of the plurality of terminals sending the first requests in the embodiment shown in fig. 1, or may be another terminal.
In step S202, a target index is read in an index unit of the first storage medium according to the read identifier.
The target index is used to indicate the storage address and length of the content data to be read.
Step S203, determining a storage unit storing the content data to be read according to the target index.
In some embodiments, determining the storage unit storing the content data to be read according to the target index may be implemented by: determining whether content data to be read is stored in a first storage medium according to the target index; when the content data to be read is determined to be stored in the first storage medium, determining a storage unit in which the content data to be read is stored in the first storage medium; when it is determined that the content data to be read is not stored in the first storage medium, a storage unit in which the content data to be read is stored is determined in the second storage medium.
Step S204, the content data to be read is read from the storage unit storing the content data to be read according to the target index, and a reading result is obtained.
Step S205, a first response is sent to a terminal.
The first response carries the read result.
With the method provided by the embodiment of the present application, when data is read, it is located through the index and read from the low-latency SSD; compared with writing the data to the HDD and then reading it back from the HDD, the read speed is improved and the read time is greatly shortened.
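A condensed sketch of steps S201 to S205, assuming the target index records whether the content is still in the first storage medium (the dictionary layout is an assumption):

```python
def handle_second_request(read_id, index_unit, ssd_storage_units, hdd_storage_units):
    """Resolve the read identifier through the index unit, then read from SSD or HDD."""
    target_index = index_unit[read_id]                # storage address and length of the content
    units = ssd_storage_units if target_index["in_first_medium"] else hdd_storage_units
    unit = units[target_index["unit"]]
    start = target_index["offset"]
    reading_result = unit[start:start + target_index["len"]]
    return {"first_response": reading_result}         # carried back to the terminal
```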
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
At present, the mainstream distributed object storage product in the storage market is Ceph rgw. Most storage servers are hybrid-flash: each server is equipped with several high-speed SSDs and a number of low-speed, high-capacity HDD hard disks. Because rgw has no cache function, the SSDs on each node cannot be effectively utilized, and data reads and writes all go directly to the HDDs on the machine, which results in poor system performance.
The related art provides an offline aggregation scheme: small objects are all first written into an SSD copy storage pool, and a background thread then reads the written objects, aggregates them into large objects of tens of megabytes, and writes the large objects into the HDD erasure-coded pool. This scheme suffers from a second round of reads and writes: all the data must be listed and read again before being re-aggregated, so system performance is still poor.
In addition, object storage is usually used with an erasure-coded storage pool, and erasure codes have the concept of stripe-aligned writes. If the storage pool uses EC 4+2 (i.e. 4 data chunks and 2 parity chunks) and the stripe unit is 4 KB, then when a user writes a 1 KB object, on the underlying disks the object is first padded to 4 KB and then multiplied by 6 (4+2), occupying 24 KB of disk space and wasting storage capacity. In practical applications, object storage scenarios are typically erasure-coded pools holding massive numbers of small files, so storage space is seriously wasted.
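The waste in that example can be checked with a small calculation, using the stripe unit and EC profile stated above (the helper is a worked illustration, not part of the patent):

```python
def ec_disk_footprint(object_bytes, stripe_unit, k, m):
    """Disk bytes occupied when a small object is padded to the stripe unit and
    written as k data chunks plus m parity chunks, as described above."""
    padded = -(-object_bytes // stripe_unit) * stripe_unit   # round up to the stripe unit
    return padded * (k + m)

print(ec_disk_footprint(1024, 4096, 4, 2))   # 24576 bytes: a 1 KB object occupies 24 KB on disk
```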
In view of the above problems, an embodiment of the present application provides a tiering-plus-aggregation scheme. Based on this scheme, rgw storage supports a cache, and the high performance of the SSD can be used effectively to accelerate rgw reads and writes. The scheme uses online aggregation: while writing, the object storage system uses a storage pool of high-speed media to accelerate read/write IO and at the same time aggregates the small objects, merging them into large objects that are then flushed down from the cache; the rgw object metadata storage layout is also reworked. As a result, rgw object storage achieves higher performance and product competitiveness in erasure code (EC) scenarios with massive small-file reads and writes. Because the small files are aggregated and merged into large objects before being flushed down, the scheme can greatly improve disk space utilization.
The scheme provided by the embodiment of the present application improves performance over directly reading and writing the SSD; it also covers the IO path and data consistency guarantees involved in aggregating multiple small objects into one large object. It can effectively improve the utilization of HDD capacity under an erasure-coded storage pool, store more effective user data, reduce data reconstruction time, and greatly improve reconstruction and recovery efficiency; at the same time, it can effectively guarantee the data consistency of the cache.
The aggregation scheme provided by the embodiment of the present application adopts a hierarchical aggregation and eviction mechanism that can effectively distinguish front-end access IO patterns, so that eviction from the SSD better matches the user's IO pattern and read performance is improved.
Fig. 3 is a schematic diagram illustrating an implementation flow of rgw cache in a scheme provided by the related art, fig. 4 is a schematic diagram illustrating an implementation flow of rgw hierarchical cache in a scheme provided by an embodiment of the present application, and the hierarchical cache provided by the embodiment of the present application is explained with reference to fig. 3 and fig. 4.
As shown in fig. 4, Ceph has two mechanisms: the copy pool and the erasure-coded pool. A copy (replicated) storage pool protects data with replicas: each data block is written to the disks in multiple copies, and the contents of the copies are identical. An erasure-coded storage pool is similar to RAID, with an M + N relationship, i.e. M data blocks and N parity blocks. The copy pool has low storage space utilization but good performance; its space utilization is 1/N. The erasure-coded pool has high space utilization, M/(N + M), but poorer performance; in a distributed storage system its performance is especially poor, and it suffers from severe write amplification.
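As a concrete comparison, the two utilization formulas work out as follows for a 3-replica copy pool and the 4+2 erasure-coded profile mentioned earlier (the specific numbers are an illustrative example):

```python
def replica_utilization(n_copies):
    return 1 / n_copies                    # copy pool: 1/N

def erasure_utilization(m_data, n_parity):
    return m_data / (m_data + n_parity)    # erasure-coded pool: M/(M+N)

print(replica_utilization(3))       # ~0.33 for a 3-replica SSD copy pool
print(erasure_utilization(4, 2))    # ~0.67 for a 4+2 HDD erasure-coded pool
```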
As shown in fig. 4, in the embodiment of the present application, a copy pool of high-performance SSDs is used as the cache of an erasure-coded pool of large-capacity HDDs. Write requests from the rgw gateway are always written into the copy pool, and a background thread flushes data into the erasure-coded pool according to the capacity water level of the copy pool; this process is called flush. On a read, if the object has not yet been flushed, the copy pool is read directly; if it has been flushed, the erasure-coded pool is read.
Using the SSDs as a copy pool lets them deliver their full performance. The SSD copy pool is usually small, so data written into it must be continuously flushed to the large-capacity HDD storage pool. Because a front-end write returns as soon as the SSD copy pool has been written, from the client's point of view the system performance is the performance of writing the SSD copy pool rather than the HDD storage pool, which effectively hides the slow back end and improves overall performance.
In implementation, flushing and front-end writing form a dynamic balance; the better this balance, the smoother the system performance, with no performance jumps.
In the embodiment of the present application, after a front-end write request is written into the SSD pool, it is not flushed directly into the HDD pool; it is flushed only after aggregation, because an HDD has good sequential write performance for large files and poor random write performance for small files (the disk has to seek). Flushing small objects only after aggregation therefore improves flush performance.
Because the data the object system writes into the HDD pool is aggregated, a delete operation only writes a deletion mark for the object, and background garbage collection is needed to reclaim the HDD storage space.
Fig. 5 is a schematic flow diagram of online aggregation in the method provided by the embodiment of the present application. The rgw process receives requests from multiple clients. Since rgw object storage has only 512 front-end threads, each thread receives one client request, writes the data first, then writes the metadata and index information, and only after the request has been processed does it receive and process the next request.
In the embodiment of the present application, 32 merge threads are added on top of the original 512 front-end threads; each merge thread exclusively owns one queue, and a front-end client request is hashed onto one of the queues. After a merge thread takes a group of requests, it takes a not-yet-full lob out of the pre-allocated lob list in memory, arranges the requests in sequence into a string according to the byte count of each request, and at the same time records the start position offset and the length len of each requested object within the lob; these structures form kv pairs that are persisted into the omap of the lob in the copy pool.
The front-end threads req1 and req2 are locked after receiving their client requests and do not accept the next client request. The requests are locked once they are enqueued to queue1; after the merge thread has written the obj1 data of req1 and the obj2 data of req2, they are unlocked and the metadata and indexes continue to be written. The front-end threads req1 and req2 are then unlocked, and the enqueue process is repeated when the next request is received.
Fig. 6 is a schematic diagram of the cache eviction flow in the method provided by the embodiment of the present application. As shown in fig. 6, the scheme adds a lob allocation thread module and an eviction flow. A large object (lob) is the data actually written into the HDD pool; the front end writes small objects, which need to be aggregated into large objects in the SSD, so large object units must be allocated in the SSD. A large object unit is a resource allocated at system initialization but not yet used; when one is needed, the allocation module obtains it directly from the resource pool.
The allocation thread module monitors the list of remaining available lobs in memory; when the list drops below a preset threshold, the allocation thread is notified to allocate new lobs. As shown on the right side of fig. 6, a new lob occupies no actual space yet but has omap and xattr; the lob's oid is then added to the in-memory list so that the merge thread group can access it directly from memory when needed.
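A sketch of the allocation thread's monitoring loop; the threshold value and helper names are assumptions, since the patent only specifies "a preset threshold":

```python
import time

LOB_LIST_LOW_WATERMARK = 8          # assumed low-water mark for pre-allocated lobs

def lob_allocation_loop(available_lob_oids, allocate_lob, keep_running):
    """Keep the in-memory list of available lob oids above the threshold."""
    while keep_running():
        if len(available_lob_oids) < LOB_LIST_LOW_WATERMARK:
            lob = allocate_lob()                    # creates omap and xattr, no data space used yet
            available_lob_oids.append(lob["oid"])   # merge threads fetch oids directly from memory
        time.sleep(1)
```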
The eviction thread module evicts fully written lobs from the SSD copy pool to the HDD erasure-coded pool. Eviction is necessary to keep front-end and back-end writes balanced: it must continuously migrate data from the SSD to the HDD to make room for the next front-end write requests; if eviction is not timely, the SSD fills up and writes become blocked.
After eviction, lob_a in the copy pool retains its omap and xattr metadata while its data portion is emptied; this pool object still exists. All the data is written into the erasure-coded pool, where a lob_a with the same name is created; that pool object has no omap or xattr.
In the embodiment of the present application, a hierarchical eviction mechanism exists at the back end during eviction: the back-end large object is not unique; instead, hierarchical queues exist in the cache, divided into 8 levels in total, each level with a different weight. Each queue is continuously merged into large objects in a pipelined manner and flushed to the back end, after which the queue's storage space in the SSD pool is deleted.
In implementation, when the 32 front-end merge threads write small objects into the SSD cache, a weight is assigned according to characteristics of the front-end IO request. Fig. 7 is a diagram of the hierarchical queue weights provided by an embodiment of the present application. As shown in fig. 7, the queue weight represents the flush-down priority: Weight 0 may be flushed down very slowly, while Weight 7 may be flushed down quickly. The concept of weight is introduced to better adapt to the front-end IO pattern; for example, a small IO request may be re-read very soon, and if its flush/eviction is slow the data is very likely still in the SSD cache and can be read very quickly, improving read performance.
The weight is assigned roughly as follows, with the IO size taken in KB so that dividing by 4096/8 = 512 maps the size onto one of the eight 512 KB levels described below:

    If (read repeat time == 0)
        Weight = (IO Size) / (4096 / 8) + 1
    else
        Weight -= 1;
In the embodiment of the present application, data larger than 4 MB is defined to be stored directly on the HDD without passing through the SSD cache. The 4 MB range is divided into 8 levels: Weight 1 covers 0 to 512 KB, Weight 2 covers 512 KB to 1 MB, and so on, with Weight 0 left reserved.
Data cached in the SSD is tagged with information about recent reads; if the data is read again, its weight is adjusted according to the number of reads, so that data recently read in the SSD cache is flushed down to the HDD storage pool more slowly.
In one implementation, the front-end IO may also carry custom parameters: when the http front end dispatches a request to rgw, the IO may carry a priority parameter indicating how long the front end wants the data to be retained in the SSD cache. This enables finer-grained IO control.
Fig. 8 is a schematic flow diagram of a write IO request according to an embodiment of the present application, and as shown in fig. 8, in a write IO stream:
1) After being transmitted from the client, a small object is received by the front-end thread, and the front-end thread hashes the small object to a queue of one of the merging threads.
2) A merged thread queue will process multiple front-end thread requests at the same time, for example, a queue receives 10 small object requests, first extracts an unfilled lob from the list of available lobs in memory, and if there are not enough lobs, informs the allocation module to allocate a new lob.
3) The 10 requests are serialized into one string, the offset and len of each request are placed in the lob's omap, and 10 records are inserted into the lob's omap for the 10 small-object requests.
4) The xattr, flag, current_size, pool_id, etc. of this lob are modified.
5) And writing the merged data into the lob of the SSD cache pool by using a write request.
6) If the lob is not yet full over the 10 requests, the lob is placed in the available memory table.
7) The 10 requests are unlocked, indicating that they have been written, and the corresponding 10 front-end threads continue to write the head object metadata and the index of each small object. The metadata xattr of the head object records in which lob the object is aggregated.
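Putting the seven steps together, the write path can be sketched roughly as below; the lob structure and helper names are assumptions for illustration only:

```python
def write_small_objects(requests, available_lobs, ssd_cache_pool, allocate_lob):
    """Sketch of steps 1-7: merge a batch of small-object writes into one lob."""
    lob = available_lobs.pop() if available_lobs else allocate_lob()
    for req in requests:
        offset = len(lob["data"])
        lob["data"] += req["data"]                       # serialize the requests into one string
        # one omap record per small object: where it sits inside the lob
        lob["omap"][req["object_name"]] = {"offset": offset, "len": len(req["data"])}
    lob["xattr"]["current_size"] = len(lob["data"])      # update flag, current_size, pool_id, ...
    ssd_cache_pool[lob["oid"]] = lob                     # one merged write into the SSD cache pool
    if lob["xattr"]["current_size"] < lob["xattr"]["capacity"]:
        available_lobs.append(lob)                       # lob not yet full: keep it available
    return lob["oid"]  # after unlocking, front-end threads write head-object metadata and indexes
```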
Fig. 9 is a schematic flowchart of a read IO request according to an embodiment of the present application, and as shown in fig. 9, in a read IO stream:
1) The client reads a small object obj1 and the gateway first reads its metadata xattr, which records in which lob it is stored.
2) The omap of this lob in the cache pool is then read to find the entry for obj1, obtaining its offset and len within the lob.
3) The xattr of the lob in the cache pool is then read; the xattr records whether the lob is currently in the cache pool or has been evicted to the erasure-coded pool.
4) If the lob is still in the cache pool, the data is read directly using the offset and length; if the lob's flag indicates that it has been evicted, the lob is read from the pool indicated by the pool-id in its xattr.
If the lob's flag indicates that it has gone through Garbage Collection (GC), the new GC object is read from the pool specified by the new_oid and pool-id in the lob's xattr.
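Following the four steps above, the read path might look like this sketch (the dictionary layout and field names are assumptions, not the actual on-disk format):

```python
def read_small_object(obj_name, head_objects, cache_pool, erasure_pool):
    """Sketch of steps 1-4: locate a small object via its head-object xattr and the lob omap."""
    lob_id = head_objects[obj_name]["xattr"]["lob"]   # 1) which lob the object was aggregated into
    entry = cache_pool[lob_id]["omap"][obj_name]      # 2) offset and len of the object inside the lob
    lob_xattr = cache_pool[lob_id]["xattr"]           # 3) still cached, evicted, or garbage collected?
    if lob_xattr["flag"] == "cached":
        data = cache_pool[lob_id]["data"]             # 4) read directly from the cache pool
    elif lob_xattr["flag"] == "evicted":
        data = erasure_pool[lob_id]["data"]           #    read the lob from the pool named in pool_id
    else:                                             # flag indicates GC has happened
        data = erasure_pool[lob_xattr["new_oid"]]["data"]
    return data[entry["offset"]:entry["offset"] + entry["len"]]
```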
In the embodiment of the application, the consistency of data is ensured by the following method:
1) Each rgw gateway manages its own set of lobs; different gateways never use the same lob in the write IO stream. The persisted object is named rgw.alloc.rgw-id, where rgw-id is unique, and all available lobs are persisted in this object.
2) When a front-end thread passes a request to the merge thread group to write data, the front-end thread is locked; only after the merge thread has successfully written the data and unlocked the request can the front-end thread continue to write the object's metadata. This guarantees that data is successfully written before metadata, preventing the situation where a power failure between the two IOs leaves the object's metadata present but its data lost.
3) The persisted object holding the lobs that have already been fully written is named rgw.transfer.n. If multiple gateways encounter a full lob, the lob is persisted into this underlying object, so the IO operating on the object takes a persistent RADOS (Reliable Autonomic Distributed Object Store) lock to prevent concurrent access. The persisted object holding the lobs that need GC is named rgw.hgc.n.
4) When objects with the same name are uploaded concurrently, they may be enqueued into the same merge queue at the same time, and one of the key-value pairs saved in the lob's omap could be lost, leaving the metadata inconsistent. In the embodiment of the present application, a random string called an epoch is therefore generated; each object has an epoch, and same-name objects have different epochs. The epoch is assembled into the omap key string of the lob, so that the keys of same-name objects differ after enqueuing and the metadata is not overwritten (a minimal illustration of this key assembly follows this list).
5) Both IO deletion and GC operate on the persistent object rgw.hgc.n: entries are added when objects are deleted and removed when GC reclaims their content, and several RADOS global locks are added in the implementation to guarantee consistency.
6) Eviction and GC are both multi-threaded. In the embodiment of the present application, the consistency of concurrent eviction and concurrent GC is ensured through multiple consistency measures such as re-reading, global locking, and adjusting the IO read/write order.
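As referenced in item 4) above, a minimal illustration of the epoch mechanism; the exact key format is an assumption:

```python
import uuid

def make_lob_omap_key(object_name):
    """Assemble the lob omap key with a per-upload random epoch so that concurrently
    uploaded objects with the same name get distinct keys and do not overwrite each other."""
    epoch = uuid.uuid4().hex               # random string generated for this upload
    return f"{object_name}.{epoch}"

print(make_lob_omap_key("photo.jpg"))      # two concurrent uploads of the same name
print(make_lob_omap_key("photo.jpg"))      # yield two different omap keys
```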
The scheme provided by the embodiment of the present application uses online aggregation: as small objects are written, they are directly appended into aggregated large objects of a dozen or so megabytes in the SSD copy storage pool, and once full, the large objects are flushed into the HDD erasure-coded pool. This design and implementation of the rgw object storage IO path, an rgw with an SSD-accelerated read/write storage cache, can significantly improve performance over the original rgw, adapt better to hardware products, and improve product performance.
Based on the foregoing embodiments, embodiments of the present application provide a distributed object storage apparatus, where modules included in the apparatus and units included in the modules may be implemented by a processor in a computer device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 10 is a schematic structural diagram of a distributed object storage apparatus according to an embodiment of the present application, and as shown in fig. 10, the distributed object storage apparatus 1000 includes:
a first receiving module 1001, configured to receive first requests sent by different terminals, where each first request carries first data to be written;
a merge processing module 1002, configured to merge the first requests sent by the terminals to obtain at least one merge queue, where each merge queue includes at least one first request;
a data processing module 1003, configured to process first data carried by all first requests located in the same merge queue to obtain second data corresponding to each merge queue, where the second data includes index data and content data;
a first writing module 1004, configured to write the content data corresponding to each merge queue into a storage unit of a first storage medium;
a second writing module 1005, configured to write the index data corresponding to each merge queue into an index unit of the first storage medium;
an eviction module 1006, configured to, when there is a target storage unit fully written with content data, evict the content data in the target storage unit to a storage unit of a second storage medium, where the latency of the first storage medium is lower than that of the second storage medium.
In some embodiments, the merge processing module 1002 is further configured to:
acquiring a preset mapping function;
processing the first requests sent by the terminals by using the mapping function to obtain mapping values corresponding to the first requests;
and merging the first requests according to the mapping values corresponding to the first requests to obtain at least one merging queue, wherein the mapping values corresponding to the first requests in the same merging queue are equal.
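A minimal sketch of this merging step is given below, assuming the preset mapping function is a stable hash of each request's object key reduced modulo the number of merge queues; the actual mapping function is not specified by the embodiment, and the names and the request representation below are illustrative only.

```python
import hashlib
from collections import defaultdict

NUM_QUEUES = 4  # assumed number of merge queues

def mapping_value(request_key: str, num_queues: int = NUM_QUEUES) -> int:
    # Preset mapping function: a stable hash of the request's key, reduced to a
    # queue index. First requests with equal mapping values share a merge queue.
    digest = hashlib.sha256(request_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_queues

def merge_requests(requests: list[dict]) -> dict[int, list[dict]]:
    # Group the first requests by their mapping value into merge queues.
    queues: dict[int, list[dict]] = defaultdict(list)
    for req in requests:
        queues[mapping_value(req["key"])].append(req)
    return queues

# Example: three first requests sent by different terminals.
reqs = [{"key": "a.jpg", "data": b"x"},
        {"key": "b.jpg", "data": b"yy"},
        {"key": "a.jpg", "data": b"z"}]
print(merge_requests(reqs))
```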
In some embodiments, the data processing module 1003 is further configured to:
acquiring the number of bytes of first data carried by all first requests in the same merge queue;
sequentially arranging all first data carried by the first requests in the same merge queue according to the number of bytes to obtain content data, wherein the content data is a sequence string;
acquiring the initial position and the length of each first data in the content data, and determining the initial position and the length as index data;
and determining the content data and the index data corresponding to each merging queue as second data.
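This packing step can be sketched as follows: the first data of all requests in one merge queue are concatenated into a single byte sequence (the content data), and for each piece the start position and length within that sequence are recorded as index data. The sketch assumes that "arranging according to the number of bytes" means the pieces are ordered by size before concatenation; the function and field names are illustrative only.

```python
def pack_merge_queue(first_data_list: list[bytes]) -> tuple[bytes, list[dict]]:
    # Arrange the pieces by their number of bytes, concatenate them into the
    # content data (a sequential byte string), and record (start, length)
    # of each piece as the index data.
    ordered = sorted(first_data_list, key=len)
    content = bytearray()
    index = []
    for data in ordered:
        index.append({"start": len(content), "length": len(data)})
        content.extend(data)
    return bytes(content), index

content_data, index_data = pack_merge_queue([b"hello", b"world!", b"hi"])
# content_data == b"hihelloworld!"
# index_data == [{'start': 0, 'length': 2},
#                {'start': 2, 'length': 5},
#                {'start': 7, 'length': 6}]
```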
In some embodiments, the distributed object storage apparatus 1000 further includes:
a locking module, configured to, after the first requests sent by the terminals are merged into at least one merge queue, add a locking identifier to the terminal corresponding to each first request included in each queue;
an adding module, configured to, when a first request re-sent by a terminal carrying the locking identifier is received, add the re-sent first request to a to-be-responded list;
an unlocking module, configured to, after the index data corresponding to each merge queue is written into the index unit of the first storage medium, delete the locking identifier added to the terminal corresponding to each first request included in each queue;
and a response module, configured to respond to the re-sent first requests in the to-be-responded list according to the order in which they were received.
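A rough sketch of the behaviour described by these modules is given below: a terminal is marked locked while its first request is being merged and written, any first request it re-sends in the meantime is parked in a to-be-responded list, and after the index data has been written the lock is removed and the parked requests are answered in the order they were received. This is a sketch under those assumptions; the class and method names are hypothetical.

```python
import time

class RequestLocker:
    def __init__(self):
        self.locked_terminals = set()  # terminals carrying a locking identifier
        self.pending = []              # to-be-responded list: (receive_time, terminal, request)

    def lock(self, terminal_id: str) -> None:
        # Add the locking identifier after the terminal's request is enqueued.
        self.locked_terminals.add(terminal_id)

    def on_resent_request(self, terminal_id: str, request: dict) -> None:
        # A first request re-sent by a locked terminal is not processed yet;
        # it is appended to the to-be-responded list instead.
        if terminal_id in self.locked_terminals:
            self.pending.append((time.monotonic(), terminal_id, request))

    def unlock_and_respond(self, terminal_id: str, respond) -> None:
        # Called after the index data of the merge queue has been written:
        # delete the locking identifier, then respond to the parked requests
        # in the order in which they were received.
        self.locked_terminals.discard(terminal_id)
        for _, tid, request in sorted(self.pending, key=lambda item: item[0]):
            respond(tid, request)
        self.pending.clear()
```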
In some embodiments, the first writing module 1004 is further configured to:
acquiring the weight of the content data corresponding to each merge queue, where the weight of the content data represents the probability that the content data is read; the probability that the content data is read is positively correlated with the weight, and the length of time before the content data is evicted to the second storage medium is also positively correlated with the weight;
attaching the weight to the corresponding content data according to the weight of the content data corresponding to each merge queue to obtain weighted content data;
and writing the weighted content data corresponding to each merge queue into a storage unit of the first storage medium.
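One way to read this weighting step is sketched below: each merge queue's content data is tagged with a weight representing how likely it is to be read, and when space on the first storage medium must be reclaimed, the lowest-weight content data is evicted to the second storage medium first, so higher-weight data stays on the faster medium longer. The weight source (for example, historical read counts) and all names are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class WeightedContent:
    weight: float                  # estimated read probability; higher means hotter
    content: bytes = field(compare=False)

def evict_candidates(ssd_units: list[WeightedContent], n: int) -> list[WeightedContent]:
    # When storage units fill up, the n lowest-weight pieces of content data
    # are chosen for eviction to the second storage medium, so higher-weight
    # content remains on the first storage medium for a longer time.
    return sorted(ssd_units)[:n]
```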
In some embodiments, the distributed object storage apparatus 1000 further includes:
a second receiving module, configured to receive a second request for reading content data sent by a terminal, where the second request carries a reading identifier;
a first reading module, configured to read a target index in an index unit of the first storage medium according to the read identifier, where the target index is used to indicate a storage address and a length of content data to be read;
a determining module, configured to determine, according to the target index, the storage unit in which the content data to be read is stored;
a second reading module, configured to read the content data to be read from that storage unit according to the target index to obtain a reading result;
and a sending module, configured to send a first response to the terminal, where the first response carries the reading result.
In some embodiments, the determining module is further configured to:
determining whether the content data to be read is stored in a first storage medium or not according to the target index;
when the content data to be read is determined to be stored in a first storage medium, determining a storage unit in which the content data to be read is stored in the first storage medium;
and when the content data to be read is determined not to be stored in the first storage medium, determining a storage unit in which the content data to be read is stored in the second storage medium.
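The read path of these modules can be sketched as follows: the reading identifier selects a target index entry in the index unit of the first storage medium; that entry says whether the content data still resides on the first storage medium or has been evicted to the second one, and gives its storage address and length; the data is then read from the corresponding storage unit and returned in a first response. The structure and field names below are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class TargetIndex:
    on_first_medium: bool  # True if the content data is still on the first (SSD) medium
    unit_id: int           # storage unit holding the content data
    offset: int            # storage address of the content data within the unit
    length: int

def read_content(read_id: str, index_unit: dict, ssd_units: dict, hdd_units: dict) -> bytes:
    # 1) Read the target index from the index unit of the first storage medium.
    idx: TargetIndex = index_unit[read_id]
    # 2) Determine which medium holds the content data to be read.
    units = ssd_units if idx.on_first_medium else hdd_units
    # 3) Read the content data at the recorded address and length.
    unit = units[idx.unit_id]
    return unit[idx.offset: idx.offset + idx.length]

# Usage: the reading result is then carried back to the terminal in a first response.
index_unit = {"obj-1": TargetIndex(True, 0, 5, 6)}
ssd_units = {0: b"helloworld!hi"}
print(read_content("obj-1", index_unit, ssd_units, {}))  # b"world!"
```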
Here, it should be noted that: the above description of the distributed object storage apparatus embodiment is similar to the description of the method embodiments above and has the same beneficial effects as the method embodiments. For technical details not disclosed in the embodiment of the distributed object storage apparatus of the present application, reference is made to the description of the method embodiments of the present application.
It should be noted that, in the embodiment of the present application, if the distributed object storage method is implemented in the form of a software functional module and is sold or used as a standalone product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the distributed object storage method provided in the above embodiments.
An embodiment of the present application further provides an electronic device, fig. 11 is a schematic structural diagram of the electronic device provided in the embodiment of the present application, and as shown in fig. 11, the electronic device 1100 includes: a processor 1101, at least one communication bus 1102, a user interface 1103, at least one external communication interface 1104 and a memory 1105. Wherein the communication bus 1102 is configured to enable connective communication between these components. The user interface 1103 may include a display screen, and the external communication interface 1104 may include standard wired and wireless interfaces, among others. Wherein the processor 1101 is configured to execute the program of the distributed object storage method stored in the memory to realize the steps in the distributed object storage method provided in the above-mentioned embodiments.
The above description of the apparatus, electronic device and storage medium embodiments is similar to the description of the method embodiments above and has similar beneficial effects. For technical details not disclosed in the embodiments of the apparatus, the electronic device and the storage medium of the present application, reference is made to the description of the method embodiments of the present application.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program code, such as removable storage devices, read-only memories, magnetic or optical disks, etc.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a device to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media that can store program code, such as removable storage devices, ROMs, magnetic or optical disks, etc.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A distributed object storage method, the method comprising:
receiving first requests respectively sent by different terminals, wherein each first request carries first data to be written;
merging the first requests sent by each terminal to obtain at least one merged queue, wherein each merged queue comprises at least one first request;
processing first data carried by all first requests in the same merge queue to obtain second data corresponding to each merge queue, wherein the second data comprises index data and content data;
writing the content data corresponding to each merging queue into a storage unit of a first storage medium, and writing the index data corresponding to each merging queue into an index unit of the first storage medium;
and when there is a target storage unit fully written with content data, evicting the content data in the target storage unit to a storage unit of a second storage medium, wherein the latency of the first storage medium is lower than that of the second storage medium.
2. The method of claim 1, wherein the merging the first requests sent by the terminals to obtain at least one merge queue comprises:
acquiring a preset mapping function;
processing the first requests sent by the terminals by using the mapping function to obtain mapping values corresponding to the first requests;
and merging the first requests according to the mapping values corresponding to the first requests to obtain at least one merged queue, wherein the mapping values corresponding to the first requests in the same merged queue are equal.
3. The method according to claim 1, wherein the processing first data carried by all the first requests in the same merge queue to obtain second data corresponding to each merge queue includes:
acquiring the number of bytes of first data carried by all first requests in the same merge queue;
sequentially arranging all first data carried by the first requests in the same merge queue according to the number of bytes to obtain content data, wherein the content data is a sequence string;
acquiring the initial position and the length of each first data in the content data, and determining the initial position and the length as index data;
and determining the content data and the index data corresponding to each merging queue as second data.
4. The method according to claim 1, wherein after the merging of the first requests sent by the terminals to obtain at least one merge queue, the method further comprises:
adding a locking identifier to a terminal corresponding to a first request included in each queue;
when a first request which is sent again by the terminal added with the locking identification is received, adding the first request which is sent again into a list to be responded;
after the writing of the index data corresponding to each merge queue into the index unit of the first storage medium, the method further includes:
deleting the locking identification added to the terminal corresponding to the first request included in each queue;
and responding to each retransmitted first request in the list to be responded according to the receiving time sequence of each retransmitted first request in the list to be responded.
5. The method according to claim 1, wherein writing the content data corresponding to each merge queue into a storage unit of a first storage medium comprises:
acquiring the weight of the content data corresponding to each merge queue, wherein the weight of the content data represents the probability that the content data is read; the probability that the content data is read is positively correlated with the weight, and the length of time before the content data is evicted to the second storage medium is also positively correlated with the weight;
attaching the weight to the corresponding content data according to the weight of the content data corresponding to each merge queue to obtain weighted content data;
and writing the weighted content data corresponding to each merge queue into a storage unit of the first storage medium.
6. The method of claim 1, further comprising:
receiving a second request which is sent by a terminal and used for reading content data, wherein the second request carries a reading identifier;
reading a target index in an index unit of the first storage medium according to the reading identifier, wherein the target index is used for indicating the storage address and the length of the content data to be read;
determining a storage unit storing content data to be read according to the target index;
reading the content data to be read from the storage unit storing the content data to be read according to the target index to obtain a reading result;
and sending a first response to the terminal, wherein the first response carries the reading result.
7. The method according to claim 6, wherein the determining a storage unit storing content data to be read according to the target index comprises:
determining whether the content data to be read is stored in a first storage medium or not according to the target index;
when the content data to be read is determined to be stored in a first storage medium, determining a storage unit in which the content data to be read is stored in the first storage medium;
and when the content data to be read is determined not to be stored in the first storage medium, determining a storage unit in which the content data to be read is stored in the second storage medium.
8. A distributed object storage apparatus, the apparatus comprising:
the first receiving module is used for receiving first requests respectively sent by different terminals, and each first request carries first data to be written;
the merging processing module is used for merging the first requests sent by the terminals to obtain at least one merging queue, and each merging queue comprises at least one first request;
the data processing module is used for processing first data carried by all the first requests in the same merge queue to obtain second data corresponding to each merge queue, and the second data comprises index data and content data;
a first writing module, configured to write the content data corresponding to each merge queue into a storage unit of a first storage medium;
a second writing module, configured to write the index data corresponding to each merge queue into an index unit of the first storage medium;
and an eviction processing module, configured to, when there is a target storage unit fully written with content data, evict the content data in the target storage unit to a storage unit of a second storage medium, wherein the latency of the first storage medium is lower than that of the second storage medium.
9. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the distributed object storage method of any one of claims 1 to 7 when executing executable instructions stored in the memory.
10. A computer-readable storage medium storing executable instructions for implementing the distributed object storage method of any one of claims 1 to 7 when executed by a processor.
CN202211152796.3A 2022-09-21 2022-09-21 Distributed object storage method, device, equipment and computer readable storage medium Pending CN115543194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211152796.3A CN115543194A (en) 2022-09-21 2022-09-21 Distributed object storage method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211152796.3A CN115543194A (en) 2022-09-21 2022-09-21 Distributed object storage method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115543194A true CN115543194A (en) 2022-12-30

Family

ID=84728366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211152796.3A Pending CN115543194A (en) 2022-09-21 2022-09-21 Distributed object storage method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115543194A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116643698A (en) * 2023-05-26 2023-08-25 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN116643698B (en) * 2023-05-26 2024-03-29 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103186350B (en) The moving method of mixing storage system and hot spot data block
US9606870B1 (en) Data reduction techniques in a flash-based key/value cluster storage
US9830101B2 (en) Managing data storage in a set of storage systems using usage counters
US9424180B2 (en) System for increasing utilization of storage media
CN106547476B (en) Method and apparatus for data storage system
CN108647151A (en) It is a kind of to dodge system metadata rule method, apparatus, equipment and storage medium entirely
JP2012517644A (en) Storage system using high-speed storage device as cache
US11061788B2 (en) Storage management method, electronic device, and computer program product
US11200159B2 (en) System and method for facilitating efficient utilization of NAND flash memory
US11494115B2 (en) System method for facilitating memory media as file storage device based on real-time hashing by performing integrity check with a cyclical redundancy check (CRC)
US11314454B2 (en) Method and apparatus for managing storage device in storage system
US11169968B2 (en) Region-integrated data deduplication implementing a multi-lifetime duplicate finder
CN115543194A (en) Distributed object storage method, device, equipment and computer readable storage medium
KR102471966B1 (en) Data input and output method using storage node based key-value srotre
CN116893786B (en) Data processing method and device, electronic equipment and storage medium
EP3531266A1 (en) Method and apparatus for managing storage device in storage system
CN105915595A (en) Cluster storage system data accessing method and cluster storage system
CN112379841A (en) Data processing method and device and electronic equipment
EP4307129A1 (en) Method for writing data into solid-state hard disk
CN113778341A (en) Distributed storage method and device for remote sensing data and remote sensing data reading method
CN115793957A (en) Method and device for writing data and computer storage medium
CN113626380B (en) Directory structure adjustment method, device, equipment and storage medium
CN111352590A (en) File storage method and equipment
WO2014168603A1 (en) System for increasing utilization of storage media
CN114115735B (en) Method and device for writing data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination