WO2021077745A1 - Data reading and writing method of distributed storage system

Info

Publication number: WO2021077745A1
Application number: PCT/CN2020/092831
Authority: WO
Inventors: 王曙光; 孟祥瑞
Original assignee: 浪潮电子信息产业股份有限公司
Priority date: 2019-10-25
Filing date: 2020-05-28
Publication date: 2021-04-29
Also published as: CN110780819A

Abstract

A data reading and writing method and apparatus, storage server, and readable storage medium for a distributed storage system. The method comprises: obtaining an IO request, the IO request comprising identification information of a target object; determining a target placement group according to the identification information of the target object; and determining, in a pre-created cache file, a target OSD corresponding to the target placement group, and sending the IO request to the target OSD to implement data reading and writing. Because the method caches the correspondence between placement groups and OSDs in the cache file, the target OSD corresponding to the target placement group does not need to be determined through tedious calculations during data reading and writing, which reduces the waste of system CPU resources in the IO path and may greatly reduce IO delay.

Description

Data reading and writing method of distributed storage system

This application claims priority to Chinese patent application No. 201911025181.2, filed with the Chinese Patent Office on October 25, 2019 and entitled "a method for reading and writing data in a distributed storage system", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of computer technology, and in particular to a data reading and writing method, device, storage server, and readable storage medium of a distributed storage system.

Background technique

With the rapid development of cloud computing technology, the industry places ever higher requirements on the performance and reliability of distributed storage. In current distributed storage systems, when a client reads or writes an object, it must first calculate the placement group from the object name and then use the placement group to calculate the members of that placement group. Calculating the placement group membership, however, requires multiple loop iterations and recursive calls to a hash calculation, which wastes a great deal of system CPU resources and increases the delay of front-end IO.

Therefore, how to provide a data reading and writing method for a distributed storage system that avoids the impact of the data reading and writing process on front-end IO delay is an urgent problem for those skilled in the art to solve.

Summary of the invention

The purpose of this application is to provide a data reading and writing method, device, storage server, and readable storage medium for a distributed storage system, to solve the problem that traditional data reading and writing solutions occupy considerable system CPU resources during data reading and writing and thereby increase front-end IO delay. The specific solution is as follows:

In the first aspect, this application provides a method for reading and writing data in a distributed storage system, which is applied to a Ceph client, including:

Acquiring an IO request, where the IO request includes identification information of the target object;

Determine the target placement group according to the identification information of the target object;

Determine the target OSD corresponding to the target placement group in the pre-created cache file, and send the IO request to the target OSD to implement data reading and writing.

Preferably, the determining the target OSD corresponding to the target placement group in the pre-created cache file and sending the IO request to the target OSD includes:

Judging whether the target placement group exists in the pre-created cache file;

If it exists, determine the target OSD corresponding to the target placement group in the cache file, and send the IO request to the target OSD;

If it does not exist, calculate the target OSD corresponding to the target placement group according to the CRUSH algorithm, send the IO request to the target OSD, and record the correspondence between the target placement group and the target OSD in the cache file.

Preferably, the calculating the target OSD corresponding to the target placement group according to the crush algorithm includes:

Calculate the OSD group corresponding to the target placement group according to the crush algorithm, and use the main OSD in the OSD group as the target OSD.

Preferably, the determining the target placement group according to the identification information of the target object includes:

A modulo operation is performed on the hash value of the identification information of the target object to obtain the identification information of the target placement group.

Preferably, before the obtaining the IO request, the method further includes:

Obtain the file to be written, divide the file to be written into multiple target objects according to a preset object size, and determine the identification information of each target object; generate an IO request carrying the identification information of the target object.

Preferably, the determining the target OSD corresponding to the target placement group in the pre-created cache file includes:

Determine the target OSD corresponding to the target placement group in the pre-created cache list.

Preferably, before the determining the target OSD corresponding to the target placement group in the pre-created cache file, the method further includes:

If it detects that the OSD map has changed, initialize the pre-created cache file.

In the second aspect, this application provides a data reading and writing device of a distributed storage system, which is applied to a Ceph client, and includes:

Request acquisition module: used to acquire an IO request, where the IO request includes the identification information of the target object;

Placement group determination module: used to determine the target placement group according to the identification information of the target object;

Request sending module: used to determine the target OSD corresponding to the target placement group in the pre-created cache file, and send the IO request to the target OSD to realize data reading and writing.

In the third aspect, this application provides a storage server of a distributed storage system, including:

Memory: used to store computer programs;

Processor: used to execute the computer program to implement the steps of a method for reading and writing data in a distributed storage system as described above.

In a fourth aspect, the present application provides a readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the steps of the data reading and writing method of a distributed storage system as described above.

The data reading and writing method, device, storage server, and readable storage medium of a distributed storage system provided by the present application include: obtaining an IO request, the IO request including the identification information of the target object; determining the target placement group according to the identification information of the target object; and determining the target OSD corresponding to the target placement group in the pre-created cache file, and sending the IO request to the target OSD to achieve data reading and writing. It can be seen that the solution caches the correspondence between placement groups and OSDs in the cache file. Therefore, there is no need to determine the target OSD corresponding to the target placement group through tedious calculations during data reading and writing, which reduces the waste of system CPU resources in the IO path and can greatly reduce the IO delay.

Description of the drawings

In order to illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show merely some of the embodiments of the present application; those of ordinary skill in the art may obtain other drawings based on these drawings without creative work.

FIG. 1 is an implementation flowchart of Embodiment 1 of a data reading and writing method for a distributed storage system provided by this application;

FIG. 2 is an implementation flowchart of Embodiment 2 of a data reading and writing method for a distributed storage system provided by this application;

FIG. 3 is a functional block diagram of an embodiment of a data reading and writing device of a distributed storage system provided by this application;

FIG. 4 is a schematic structural diagram of an embodiment of a storage server of a distributed storage system provided by this application.

Detailed description

The core of this application is to provide a method, device, storage server, and readable storage medium for reading and writing data in a distributed storage system, which reduce the waste of system CPU resources during data reading and writing and can greatly reduce IO latency.

In order to enable those skilled in the art to better understand the solution of the application, the application will be further described in detail below with reference to the accompanying drawings and specific implementations. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The following describes the first embodiment of a method for reading and writing data of a distributed storage system provided by the present application. Referring to FIG. 1, the first embodiment includes:

S101. Obtain an IO request, where the IO request includes identification information of a target object.

S102: Determine a target placement group according to the identification information of the target object;

S103: Determine a target OSD corresponding to the target placement group in a pre-created cache file, and send the IO request to the target OSD, so as to implement data reading and writing.

This embodiment is applied to a Ceph client. The client here refers to a service process that accesses the cluster: in the NAS scenario, nfsserver and samba; in the block scenario, the tgt process; and in the object scenario, rgw.

In Ceph, whichever storage method is used (object, block, or file system), data is divided into several objects, and placement groups (PGs) are used to group these objects logically so that they can be managed uniformly and efficiently. An OSD (Object Storage Device) provides storage resources; each disk, SSD, RAID group, or partition can become an OSD. Specifically, when data is written, the file is divided into several objects, the objects are first mapped to placement groups, and the placement groups are then mapped to OSD groups. One placement group holds many objects (a "one-to-many" relationship between a placement group and its objects), and there is a "many-to-many" mapping relationship between placement groups and OSDs. It should be noted that the target OSD in this embodiment refers to the main OSD in the OSD group, that is, the OSD that can perform data read and write operations.

The IO request in this embodiment includes the identification information of the target object. Specifically, when the IO request is a read request, the main information included is the identification information of the target object, the offset of the read within the object, and the length to be read. When the IO request is a write request, the main information included is the identification information of the target object, the offset of the write within the object, the length of the data to be written, and the content of the data to be written. When the Ceph client receives the IO request, it first calculates the target placement group to which the target object is mapped according to the identification information of the target object carried in the IO request. As a specific implementation, the client calculates the hash value of the identification information of the target object and takes the modulus of that value to obtain the placement group corresponding to the target object.
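The hash-and-modulo step can be illustrated with a minimal sketch. The text does not name the hash function, so CRC32 is used here purely as a stand-in; the function name `placement_group_for`, the `pg_count` and `pool_id` parameters, and the "pool.index" identifier format (modelled on identifiers such as 1.0 and 1.1 used later in the description) are all assumptions made for illustration.

```python
import zlib


def placement_group_for(object_id: str, pg_count: int, pool_id: int = 1) -> str:
    """Map an object identifier to a placement-group identifier by hash-and-modulo."""
    # CRC32 is a stand-in hash; the description does not say which hash is used.
    hash_value = zlib.crc32(object_id.encode("utf-8"))
    # The modulus selects one of pg_count placement groups.
    pg_index = hash_value % pg_count
    # Hypothetical "pool.index" identifier format.
    return f"{pool_id}.{pg_index}"


pg_id = placement_group_for("file42.0", pg_count=6)
```

The same object identifier always yields the same placement group, which is what allows the result of the later placement-group-to-OSD lookup to be cached.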

Then, under normal circumstances, the CRUSH algorithm (Controlled Replication Under Scalable Hashing, a distributed selection algorithm for data storage) is used to calculate the target OSD to which the target placement group is mapped. In this embodiment, however, a cache file is preset, and the cache file records the mapping relationship between placement groups and OSDs. Therefore, after the target placement group is determined, there is no need to calculate the corresponding target OSD through the cumbersome CRUSH algorithm; the target OSD can be determined directly by querying the cache file, and the IO request is then forwarded to the target OSD, avoiding the tedious calculation process.

Specifically, after the target placement group is determined, the cache file is first queried to determine whether it records the mapping relationship between the target placement group and an OSD. If so, the target OSD corresponding to the target placement group is determined from the cache file; otherwise, the CRUSH algorithm is used to calculate the target OSD corresponding to the target placement group. Here, the CRUSH algorithm is used to calculate the location where the object should be written or read.

This embodiment provides a method for reading and writing data in a distributed storage system. The solution includes: obtaining an IO request, the IO request including the identification information of the target object; determining the target placement group according to the identification information of the target object; and determining the target OSD corresponding to the target placement group in the pre-created cache file, and sending the IO request to the target OSD to realize data reading and writing. It can be seen that the solution caches the correspondence between placement groups and OSDs in the cache file. Therefore, there is no need to determine the target OSD corresponding to the target placement group through tedious calculations during data reading and writing, which reduces the waste of system CPU resources in the IO path and can greatly reduce the IO delay.
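The query-then-fall-back logic might be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: `find_target_osd` and `fake_crush` are hypothetical names, the CRUSH calculation is replaced by a stub that returns a fixed OSD group, and the first OSD of the group is assumed to be the main OSD. Recording the result on a miss follows the preferred scheme in the summary.

```python
from typing import Callable, Dict, List


def find_target_osd(pg_id: str,
                    cache: Dict[str, List[int]],
                    crush_lookup: Callable[[str], List[int]]) -> int:
    """Return the target OSD for pg_id, consulting the cache before CRUSH."""
    osd_group = cache.get(pg_id)
    if osd_group is None:
        # Cache miss: fall back to the (expensive) CRUSH calculation
        # and record the correspondence for later requests.
        osd_group = crush_lookup(pg_id)
        cache[pg_id] = osd_group
    # Assumption: the first member of the OSD group is the main OSD.
    return osd_group[0]


crush_calls: List[str] = []


def fake_crush(pg_id: str) -> List[int]:
    """Stub standing in for the real CRUSH calculation (hypothetical)."""
    crush_calls.append(pg_id)
    return [3, 6, 8]


cache: Dict[str, List[int]] = {}
first = find_target_osd("1.2", cache, fake_crush)   # miss: CRUSH stub is called
second = find_target_osd("1.2", cache, fake_crush)  # hit: served from the cache
```

In this sketch the stub is invoked only for the first request; the second request for the same placement group is answered from the cache.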

The second embodiment of a data reading and writing method for a distributed storage system provided by the present application will be described in detail below. The second embodiment is implemented based on the foregoing embodiment 1, and is expanded to a certain extent on the basis of the first embodiment.

Specifically, in this embodiment, the mapping relationship between the placement group and the OSD is recorded in the form of a list, and this embodiment takes a data writing scenario as an example for description. Referring to Figure 2, the second embodiment is applied to the Ceph client and specifically includes:

S201. Obtain a file to be written, divide the file to be written into multiple target objects according to a preset object size, and determine the identification information of each target object; generate an IO request carrying the identification information of the target object;

The aforementioned preset object size is adjusted according to actual needs and is usually 2 MB or 4 MB. Each object obtained by segmentation has unique identification information, which is generally composed of the File ID of the file to be written and the number of the segment.
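The segmentation in S201 can be sketched as below. The function name `split_into_objects` and the "fileid.segment" identifier format are assumptions; the text only states that the identifier is generally composed of the File ID and the segment number.

```python
from typing import List, Tuple

DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB, one of the usual sizes mentioned above


def split_into_objects(file_id: str, data: bytes,
                       object_size: int = DEFAULT_OBJECT_SIZE) -> List[Tuple[str, bytes]]:
    """Split file content into fixed-size objects, each with a unique identifier."""
    objects = []
    for segment_no, offset in enumerate(range(0, len(data), object_size)):
        # Hypothetical identifier format: File ID plus segment number.
        object_id = f"{file_id}.{segment_no}"
        objects.append((object_id, data[offset:offset + object_size]))
    return objects


# A 10-byte "file" cut into 4-byte objects; the last object is shorter.
parts = split_into_objects("f1", b"0123456789", object_size=4)
```

Each resulting identifier would then be carried in its own IO request.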

S202. Obtain an IO request, where the IO request includes identification information of the target object.

S203: Perform a modulo operation on the hash value of the identification information of the target object to obtain the identification information of the target placement group;

S204. Determine whether the target placement group exists in the pre-created cache list; if so, skip to S205, otherwise skip to S206;

In this embodiment, the identification information of a PG takes a form such as 1.0, 1.1, or 1.2, and the above cache list can be implemented in code using a map structure. The specific content is as follows:

[1.0,[1,2,3]]

[1.1,[2,5,3]]

[1.2,[3,6,8]]

...

[1.5,[6,7,9]]

The list following the identification information of each placement group, for example [1,2,3], is the identification information of the OSDs corresponding to that placement group.
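As an illustration, the cache list above could be held in a Python dictionary keyed by placement-group identifier. The variable name `pg_cache`, the helper `target_osd_from_cache`, and the choice of the first list entry as the main OSD are assumptions made for this sketch.

```python
from typing import Dict, List, Optional

# The entries listed above, as a map from PG identifier to OSD identifiers.
pg_cache: Dict[str, List[int]] = {
    "1.0": [1, 2, 3],
    "1.1": [2, 5, 3],
    "1.2": [3, 6, 8],
    "1.5": [6, 7, 9],
}


def target_osd_from_cache(pg_id: str) -> Optional[int]:
    """Return the main OSD cached for pg_id, or None if the PG is not cached."""
    osd_group = pg_cache.get(pg_id)
    # Assumption: the first OSD in the list is the main OSD of the group.
    return osd_group[0] if osd_group else None
```

A lookup in such a map is a single hash-table access, in contrast to the repeated hash calculations that the background section attributes to computing placement group membership.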

S205. Determine a target OSD corresponding to the target placement group in the cache list.

S206. Calculate the OSD group corresponding to the target placement group according to the CRUSH algorithm, use the main OSD in the OSD group as the target OSD, and record the correspondence between the target placement group and the target OSD in the cache list;

S207. Send the IO request to the target OSD to implement data reading and writing;

S208: If it is detected that the OSD map has changed, initialize the pre-created cache list.

The OSD map records how many OSDs the cluster contains, which nodes these OSDs belong to, and the respective weights of these nodes and OSDs. These relationships are stored in a tree structure. OSDs and nodes going online or offline change the OSD map, and a change in the OSD map changes the OSDs corresponding to the PGs. The PG cache list should therefore be cleared and rebuilt.

It can be seen that the data reading and writing method of a distributed storage system provided in this embodiment caches the correspondence between placement groups and OSDs in a cache file. Therefore, there is no need to determine the target OSD corresponding to the target placement group through tedious calculations during data reading and writing, which greatly reduces the CPU consumption and delay on the IO path, thereby saving system CPU resources and greatly reducing the front-end IO delay.
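One possible way to realise the invalidation in S208 is sketched below. The text does not say how a change in the OSD map is detected; this sketch assumes that each version of the map carries an increasing epoch number that the client can compare. The class name `PgOsdCache` and its methods are hypothetical.

```python
from typing import Dict, List, Optional


class PgOsdCache:
    """PG-to-OSD cache that is emptied whenever the OSD map changes."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[int]] = {}
        self._epoch: Optional[int] = None  # epoch of the last OSD map seen

    def observe_osd_map(self, epoch: int) -> bool:
        """Clear the cache if the OSD map epoch differs; return True if cleared."""
        if epoch != self._epoch:
            self._entries.clear()
            self._epoch = epoch
            return True
        return False

    def get(self, pg_id: str) -> Optional[List[int]]:
        return self._entries.get(pg_id)

    def put(self, pg_id: str, osd_group: List[int]) -> None:
        self._entries[pg_id] = list(osd_group)


pg_osd_cache = PgOsdCache()
pg_osd_cache.observe_osd_map(epoch=10)
pg_osd_cache.put("1.0", [1, 2, 3])
unchanged = pg_osd_cache.observe_osd_map(epoch=10)  # same map: entries kept
kept = pg_osd_cache.get("1.0")
changed = pg_osd_cache.observe_osd_map(epoch=11)    # new map: entries dropped
dropped = pg_osd_cache.get("1.0")
```

Clearing the whole cache is the simplest policy and matches the text; entries are then re-recorded one by one as later requests miss and fall back to the CRUSH calculation.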

The following describes a data reading and writing device of a distributed storage system provided by an embodiment of the present application. The device described below and the data reading and writing method of a distributed storage system described above can be referred to in correspondence with each other.

Referring to FIG. 3, the device is applied to a Ceph client and includes:

Request obtaining module 301: used to obtain an IO request, where the IO request includes the identification information of the target object;

The placement group determining module 302: used to determine the target placement group according to the identification information of the target object;

The request sending module 303 is used to determine the target OSD corresponding to the target placement group in the pre-created cache file, and send the IO request to the target OSD to implement data reading and writing.

The data reading and writing device of the distributed storage system of this embodiment is used to implement the aforementioned data reading and writing method of the distributed storage system. Therefore, for the specific implementation of the device, refer to the embodiments of the data reading and writing method described above. For example, the request obtaining module 301, the placement group determining module 302, and the request sending module 303 are respectively used to implement steps S101, S102, and S103 of the above method. The specific implementation can therefore be found in the description of the corresponding embodiments and is not repeated here.

In addition, since the data reading and writing device of the distributed storage system of this embodiment is used to implement the aforementioned data reading and writing method of the distributed storage system, its function corresponds to the function of the above method, and will not be repeated here.
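The division of the device into modules 301, 302, and 303 might be pictured as three methods of one class, as in the following sketch. Everything named here is hypothetical: `IORequest` and its fields follow the request contents described for the method embodiments, CRC32 stands in for the unspecified hash, the "1.index" identifier format is assumed, and the `transport` callback stands in for the network send. The CRUSH fallback for cache misses described in the method embodiments is left out to keep the sketch short, so the cache is expected to contain every placement group.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class IORequest:
    object_id: str                 # identification information of the target object
    offset: int                    # offset within the object
    length: int                    # length to read or write
    data: Optional[bytes] = None   # content to write; None for a read request


class ReadWriteDevice:
    def __init__(self, pg_count: int, cache: Dict[str, List[int]],
                 transport: Callable[[int, IORequest], None]) -> None:
        self.pg_count = pg_count
        self.cache = cache
        self.transport = transport

    def acquire_request(self, request: IORequest) -> IORequest:
        """Request obtaining module 301: accept the incoming IO request."""
        return request

    def determine_pg(self, request: IORequest) -> str:
        """Placement group determining module 302: hash-and-modulo (stand-in hash)."""
        index = zlib.crc32(request.object_id.encode("utf-8")) % self.pg_count
        return f"1.{index}"

    def send_request(self, request: IORequest, pg_id: str) -> int:
        """Request sending module 303: look up the target OSD and send the request."""
        target_osd = self.cache[pg_id][0]  # assumes the first entry is the main OSD
        self.transport(target_osd, request)
        return target_osd

    def handle(self, request: IORequest) -> int:
        request = self.acquire_request(request)
        pg_id = self.determine_pg(request)
        return self.send_request(request, pg_id)


sent: List[tuple] = []
demo_cache = {f"1.{i}": [i + 10, i + 20, i + 30] for i in range(4)}
device = ReadWriteDevice(4, demo_cache, lambda osd, req: sent.append((osd, req.object_id)))
read_request = IORequest(object_id="file42.0", offset=0, length=4096)
chosen_osd = device.handle(read_request)
```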

In addition, this application also provides a storage server of a distributed storage system, as shown in Figure 4, including:

Memory 100: used to store computer programs;

The processor 200 is configured to execute the computer program to implement the steps of a method for reading and writing data in a distributed storage system as described above.

Finally, this application provides a readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the steps of the data reading and writing method of a distributed storage system as described above.

The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments can be referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method part.

The steps of the method or algorithm described in combination with the embodiments disclosed herein can be implemented directly by hardware, by a software module executed by a processor, or by a combination of the two. The software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.

The above provides a detailed introduction to the solution provided by the application. Specific examples are used herein to illustrate the principles and implementation of the application, and the description of the above examples is only intended to help in understanding the method and core ideas of the application. At the same time, for those of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the application. In summary, the content of this specification should not be construed as a limitation of the application.

Claims

A method for reading and writing data in a distributed storage system is characterized in that it is applied to a Ceph client and includes:

Acquiring an IO request, where the IO request includes identification information of the target object;

Determine the target placement group according to the identification information of the target object;

Determine the target OSD corresponding to the target placement group in the pre-created cache file, and send the IO request to the target OSD to implement data reading and writing.
The method according to claim 1, wherein the determining a target OSD corresponding to the target placement group in a pre-created cache file, and sending the IO request to the target OSD, comprises:

Judging whether the target placement group exists in the pre-created cache file;

If it exists, determine the target OSD corresponding to the target placement group in the cache file, and send the IO request to the target OSD;

If it does not exist, calculate the target OSD corresponding to the target placement group according to the CRUSH algorithm, send the IO request to the target OSD, and record the correspondence between the target placement group and the target OSD in the cache file.
The method of claim 2, wherein the calculating the target OSD corresponding to the target placement group according to the crush algorithm comprises:

Calculate the OSD group corresponding to the target placement group according to the crush algorithm, and use the main OSD in the OSD group as the target OSD.
The method of claim 1, wherein the determining a target placement group according to the identification information of the target object comprises:

A modulo operation is performed on the hash value of the identification information of the target object to obtain the identification information of the target placement group.
The method according to claim 4, characterized in that, before the obtaining the IO request, the method further comprises:

Obtain the file to be written, divide the file to be written into multiple target objects according to a preset object size, and determine the identification information of each target object; generate an IO request carrying the identification information of the target object.
The method of claim 5, wherein the determining the target OSD corresponding to the target placement group in a pre-created cache file comprises:

Determine the target OSD corresponding to the target placement group in the pre-created cache list.
The method according to any one of claims 1 to 6, characterized in that, before the determining the target OSD corresponding to the target placement group in the pre-created cache file, the method further comprises:

If it detects that the OSD map has changed, initialize the pre-created cache file.
A data reading and writing device of a distributed storage system is characterized in that it is applied to a Ceph client and includes:

Request acquisition module: used to acquire an IO request, where the IO request includes the identification information of the target object;

Placement group determination module: used to determine the target placement group according to the identification information of the target object;

Request sending module: used to determine the target OSD corresponding to the target placement group in the pre-created cache file, and send the IO request to the target OSD to realize data reading and writing.
A storage server of a distributed storage system is characterized in that it comprises:

Memory: used to store computer programs;

Processor: used to execute the computer program to implement the steps of a method for reading and writing data in a distributed storage system according to any one of claims 1-7.
A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, it is used to implement the steps of the data reading and writing method of a distributed storage system according to any one of claims 1-7.