CN111221773A - Data storage architecture method based on RDMA high-speed network and skip list - Google Patents

Data storage architecture method based on RDMA high-speed network and skip list

Info

Publication number
CN111221773A
CN111221773A
Authority
CN
China
Prior art keywords
skip
access
remote
data
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010041020.9A
Other languages
Chinese (zh)
Other versions
CN111221773B (en)
Inventor
胡卉芪
林志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202010041020.9A priority Critical patent/CN111221773B/en
Publication of CN111221773A publication Critical patent/CN111221773A/en
Application granted granted Critical
Publication of CN111221773B publication Critical patent/CN111221773B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17331Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/06Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0646Configuration or reconfiguration
    • G06F12/0653Configuration or reconfiguration with centralised address assignment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/547Remote procedure calls [RPC]; Web services
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/54Indexing scheme relating to G06F9/54
    • G06F2209/541Client-server
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/54Indexing scheme relating to G06F9/54
    • G06F2209/548Queue
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a data storage architecture method based on an RDMA high-speed network and a skip list. A task scheduler is adopted to allocate local and remote tasks over a skip-list data storage architecture, an access concurrency control strategy and a range-scan process are designed on the R-skiplist index, and the task scheduler distributes local and remote tasks over data stored on the server in R-skiplist form, thereby realizing high-bandwidth, low-latency data access over RDMA network communication. Compared with the prior art, the invention achieves lower latency, higher bandwidth and higher data-access throughput, reduces communication time in remote access, saves network resources, avoids occupying the server CPU for data transmission, and eliminates conflicts with concurrent writes.

Description

Data storage architecture method based on RDMA high-speed network and skip list
Technical Field
The invention relates to the technical field of database systems, in particular to a data storage architecture method based on a high-bandwidth, low-latency RDMA network and a skip list.
Background
Remote Direct Memory Access (RDMA) enables low-latency network access by reading and writing the memory of a remote computer directly. It bypasses the operating system and supports zero-copy networking, thereby achieving high bandwidth and low latency. RDMA offers two kinds of commands for remote memory access. The first is the two-sided Send/Recv pair: as in socket programming, the server must post an RDMA Recv, with an address indicating where the incoming message is to be stored, before the client issues an RDMA Send. The second is the one-sided Read/Write pair: these have memory semantics, and the address where the message is stored on the remote computer is supplied by the client, which eliminates CPU consumption on the remote computer (the server). One-sided operations provide higher bandwidth and lower latency than two-sided operations. However, existing concurrency control methods for traditional skip lists used over RDMA, whether lock-based or lock-free, consume significant CPU resources because they add steps to handle the various conflicts between operations. In addition, because nodes are stored at random locations, only one node can be obtained per RDMA round trip, so remotely traversing a traditional skip list through RDMA consumes a large amount of network resources. In the prior-art client-server design, the server is primarily responsible for processing operations, while the client only sends RPC requests and receives the returned messages. When the concurrency of local access operations is high, the CPU of the local node becomes a performance bottleneck for the application, and many tasks are blocked.
Disclosure of Invention
The invention aims to design a data storage architecture method based on an RDMA high-speed network and a skip list that overcomes the defects of the prior art. The technical scheme is as follows: data is stored on the server in R-skiplist form, local and remote tasks are distributed through a task scheduler, and an access concurrency control strategy and a scanning process are designed on the R-skiplist index. This realizes high-bandwidth, low-latency, high-throughput data access over RDMA network communication, greatly reduces communication time in remote access, saves network resources, avoids occupying the server CPU for data transmission, and eliminates conflicts with concurrent writes.
The purpose of the invention is realized as follows: a data storage architecture method based on an RDMA high-speed network and a skip list, characterized in that a task scheduler is adopted to allocate local and remote tasks over a skip-list data storage architecture, an access concurrency control strategy and a range-scan process are designed on the R-skiplist index, and the task scheduler distributes local and remote tasks over data stored on the server in R-skiplist form, thereby realizing high-bandwidth, low-latency data access over RDMA network communication.
The data storage architecture comprises:
a1: in an RDMA network, when the server recognizes, through the completion queue (CQ), a request sent by the client via RPC, it first determines the access mode by examining the current CPU utilization and related factors; this decision is the responsibility of the task-scheduling component;
a2: if local access is chosen, the request is executed by a thread in the local thread pool, and the result is then returned to the client;
a3: if remote access is chosen, the scheduling thread returns the request to the client, and when the returned message carries the remote-access mark, the client performs the access itself over RDMA.
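The a1-a3 dispatch above can be sketched as follows. This is a minimal illustration, not the patented implementation: the CPU-utilization threshold, the request/result shapes, and the `execute_locally` callback are all assumptions standing in for the server's real RPC handling and RDMA fallback path.

```python
import concurrent.futures

CPU_UTIL_THRESHOLD = 0.8  # assumed cutoff; the patent leaves the exact policy open


class TaskScheduler:
    """Sketch of the a1-a3 decision: serve the request locally while the CPU
    has headroom, otherwise return it marked for one-sided remote access."""

    def __init__(self, num_threads=4):
        self.pool = concurrent.futures.ThreadPoolExecutor(num_threads)

    def dispatch(self, request, cpu_util, execute_locally):
        if cpu_util < CPU_UTIL_THRESHOLD:
            # a2: local access -- a pool thread executes and the result goes back
            result = self.pool.submit(execute_locally, request).result()
            return {"mode": "local", "result": result}
        # a3: hand the request back with a remote-access mark; the client
        # then reads the data itself via RDMA, bypassing the server CPU
        return {"mode": "remote", "request": request}
```

For example, a lightly loaded server answers in place, while a loaded one bounces the request back to the client.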
The access concurrency control strategy designed on the R-skiplist index comprises the following:
b1: the R-skiplist divides the skip list into a plurality of blocks, each block contains multiple nodes, and each block is stored at contiguous addresses, so one RDMA communication can fetch a whole block instead of a single node;
b2: the R-skiplist supports two access modes, Local-Access and Remote-Access; the computer in the cluster where the R-skiplist resides is called the local node and the other computers are called remote nodes; Local-Access means accessing the R-skiplist on the local node, while Remote-Access means a remote node accessing the R-skiplist directly through RDMA, bypassing the CPU of the local node;
b3: concurrency control on the R-skiplist comprises resolution of top-level split conflicts, top-level read-write conflicts and boundary conflicts.
The range-scan process designed on the R-skiplist index comprises the following steps:
c1: the client sends a scan request to the server;
c2: the server traverses the R-skiplist to find the first target data node;
c3: the server sends the address of the data block containing the first target data back to the client;
c4: the client accesses the data block directly via RDMA using the returned address. This process not only saves the data copy but also avoids occupying the server CPU for data transmission. In addition, since only the first target data is located on the server side, there is no conflict with writes in other partitions.
Compared with the prior art, the invention achieves lower latency, higher bandwidth and higher data-access throughput, reduces communication time in remote access, saves network resources, avoids occupying the server CPU for data transmission, and eliminates conflicts with concurrent writes. The design of the R-skiplist divides the skip list into blocks containing multiple nodes and stores each block at contiguous addresses, so one RDMA communication can fetch a block instead of a node. In the cluster, the computer where the R-skiplist resides is called the local node and the other computers are called remote nodes; Local-Access means accessing the R-skiplist on the local node, while Remote-Access means a remote node accessing the R-skiplist directly through RDMA, bypassing the CPU of the local node. Because network resources are limited, the R-skiplist adjusts the traditional skip-list structure to reduce communication time in remote access and thereby save network resources. In addition, read operations dominate in practical applications; when a local search result is large, copying and sending data in the traditional skip-list manner consumes a large amount of CPU resources, and if the required data set spans several partitions, the range query conflicts with writes in other partitions. With the R-skiplist, the client can directly access the data block of the first target data via RDMA. This not only saves the data copy but also avoids occupying the server CPU for data transmission, and since only the first target data is located on the server side, there is no conflict with writes in other partitions.
Drawings
FIGS. 1-2 are schematic diagrams of data storage architectures according to the present invention;
FIG. 3 is a schematic diagram of the R-skiplist design;
FIGS. 4 to 5 are schematic diagrams of an R-skiplist construction process;
FIG. 6 is a schematic diagram of a prior art range search scan process;
FIG. 7 is a diagram illustrating the R-skiplist range search scanning process.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
Example 1
Referring to FIG. 1, when local access concurrency is high, the CPU of the local node becomes a performance bottleneck and many tasks are blocked. To improve performance, blocked tasks are allocated to remote nodes and executed as remote accesses using the remaining network bandwidth.
On the server, a background thread checks CPU utilization, which is reflected by the number of blocked tasks. If a high CPU load is detected, part of the read operations are distributed to clients until utilization drops. Task scheduling here is not blind: a priority-based scheduling strategy is designed to maximize performance. Since the number of partitions is greater than the number of threads, some partitions have no local thread serving them, so read operations fall into two categories: reads of partitions without local thread service and reads of partitions with local thread service. For reads of partitions without local thread service, the success rate of remote reading is higher, since no conflict between client and server can arise; such reads are therefore assigned to clients first. Reads of partitions served by a local thread have lower priority, because they may conflict with the server and force a remote-read retry.
For get, insert and delete operations, the key of the required data determines which range it belongs to; for scan operations, the start key is used to divide the required data set into tasks. Arriving task information is obtained by accessing the CQ queue on the server according to the self-identifying RPC method described in a1 above. After threads acquire tasks, the tasks are divided by partition range, which is determined by the partition count and the top-level nodes of the R-skiplist; the divided tasks are pushed into task queues, one queue per partition, and each thread then pops tasks from the different queues. Because tasks are divided by range, load skew may occur. To mitigate this, the number of partitions is set much larger than the number of threads, and threads are not bound to queues but visit each queue in a loop to pop tasks, so that multiple high-load partitions are not served by a single thread alone. Since looping over queues causes frequent context switches that reduce performance, each thread executes a batch of tasks per visit.
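The queueing scheme above can be sketched as follows: tasks are routed into one queue per partition by key range, and a worker drains a bounded batch from a queue per visit to limit context switching. The partition boundaries, task shape, and batch size here are illustrative assumptions, not the patent's data layout.

```python
import bisect
from collections import deque


class PartitionedQueues:
    """One task queue per partition; keys are routed by range, and workers
    pop at most batch_size tasks from a queue on each round-robin visit."""

    def __init__(self, boundaries, batch_size=4):
        self.boundaries = boundaries            # sorted upper bounds of the partitions
        self.queues = [deque() for _ in boundaries]
        self.batch_size = batch_size

    def push(self, task):
        # route the task to the partition whose key range contains it
        idx = bisect.bisect_left(self.boundaries, task["key"])
        self.queues[min(idx, len(self.queues) - 1)].append(task)

    def pop_batch(self, queue_idx):
        # a thread takes a whole batch per visit to amortize the queue switch
        q = self.queues[queue_idx]
        return [q.popleft() for _ in range(min(self.batch_size, len(q)))]
```

A worker thread would loop `for i in range(len(queues)): handle(queues.pop_batch(i))`, so no single thread is pinned to a hot partition.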
Referring to FIG. 2, to balance the load, the load is periodically checked using a background thread and if load skew occurs, the partition scope is readjusted as appropriate.
Referring to FIG. 3, each level is an ordered linked list of blocks, each containing an ordered linked list of nodes.
Referring to FIG. 4, which shows the definition of the data structures in the R-skiplist: compared with the traditional skip list, the representation of key, value and next in index nodes and data nodes is unchanged, but down in an index node records the address of the block in the next layer that contains the node with the same key, instead of the address of the next node with the same key. In the structure of data blocks and index blocks, besides the node linked list, next records the address of the next block, and max_key records the largest key value. There are also some flag fields in the data structures, such as is_delete, is_update and is_split, which are used in the concurrency control design.
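The node and block layout just described can be sketched as follows. This is a simplified in-memory model: field names (key, value, next, down, max_key, and the flags) follow the description, but the ordered-insert helper and the use of Python lists in place of contiguous RDMA-registered memory are illustrative assumptions.

```python
import bisect


class Node:
    """Index/data node: in an index node, 'down' points to the *block* in the
    layer below that holds the node with the same key, not to a single node."""

    def __init__(self, key, value=None, down=None):
        self.key, self.value = key, value
        self.next = None          # next node within the same block
        self.down = down          # block in the layer below (index nodes only)


class Block:
    """A block stores an ordered run of nodes contiguously, so one RDMA read
    can fetch the whole block. Flag names follow the description above."""

    def __init__(self):
        self.nodes = []           # ordered node list (stands in for the linked list)
        self.next = None          # next block on the same level
        self.max_key = None       # largest key currently in the block
        self.is_delete = self.is_update = self.is_split = False

    def insert(self, node):
        # keep nodes sorted by key and maintain max_key
        keys = [n.key for n in self.nodes]
        self.nodes.insert(bisect.bisect_left(keys, node.key), node)
        self.max_key = self.nodes[-1].key
```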
Referring to FIG. 5, which shows the local insertion code, the construction process of the R-skiplist is introduced through the insert operation as follows:
Step 1: each layer (line 1) finds the block into which the new data is to be inserted, as in the code of the Find operation. According to the key of the required data, the corresponding partition header is first fetched from the cache; then the function tlayer_get_block() obtains the top-level block containing the target node; if the obtained block contains the partition header, the node linked list of the block is traversed from the partition header to locate the target node.
Step 2: move down to the next level using down; the remaining levels are accessed in the same way as the top level. The R-skiplist assigns a random height h to the data (line 3) and then inserts the data into the target blocks from layer 1 to layer h (lines 4-31). In this process, when the data must also be inserted at the layer above, the block receiving the new data at layer i must be split at that key.
Step 3: enter layer i+1 (lines 8-9, 18-19). This splitting rule ensures that each node of layer i+1 corresponds to a unique block of layer i, meaning that a block of layer i contains only the data between two adjacent nodes of layer i+1; the target block of layer i then contains only data the user can actually access, avoiding the transmission of useless data during remote access. In theory, if the data in the R-skiplist is evenly distributed, each block holds about 1/p nodes, but data in a certain range may randomly reach height h, which can cause blocking and degrade communication performance. For this case, a maximum block capacity max_cap is set before constructing the R-skiplist; when a block contains max_cap nodes, it is split (lines 10-11, 20-29).
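The capacity split in step 3 can be sketched as follows. A plain dict with 'nodes', 'next' and 'max_key' fields stands in for a block of contiguous memory; the half-split policy is an assumption, since the patent only specifies that a block splits on reaching max_cap.

```python
def split_block(block, max_cap):
    """If the block has reached max_cap nodes, move the upper half into a
    new block linked after the original, keeping both small enough to be
    fetched in one RDMA read. Returns the new block, or None if no split."""
    if len(block["nodes"]) < max_cap:
        return None                      # under capacity: nothing to do
    mid = len(block["nodes"]) // 2
    new_block = {
        "nodes": block["nodes"][mid:],   # upper half of the keys
        "next": block["next"],
        "max_key": block["max_key"],
    }
    block["nodes"] = block["nodes"][:mid]
    block["max_key"] = block["nodes"][-1]
    block["next"] = new_block
    return new_block
```

Because the old block keeps its address and only its tail moves, a concurrent reader that fetched the old block still sees a valid prefix and can follow `next` for the rest.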
The structure of the R-skiplist of the invention avoids conflicts between most local operations and saves considerable CPU resources by reducing the conflict-handling steps within operations.
Referring to FIG. 3, the R-skiplist is partitioned according to its top-level nodes, and each partition is then treated as a small R-skiplist, so reading or writing data within one partition has no effect on the others. Thus, as long as each thread is guaranteed to be responsible for tasks of different partitions, many conflicts are avoided, but some conflicts remain:
(1) For a read or write operation, regardless of which partition its required data belongs to, the R-skiplist must be traversed from the first top-level node to find the target data or location, which is likely to conflict with a write operation at an earlier position in the list.
(2) There is a shared pointer between two adjacent partitions, and a conflict occurs when two adjacent threads modify the shared pointer at the same time.
(3) For a scan operation, a conflict may occur if its required data set spans multiple partitions.
(4) A top-level block may contain data belonging to several partitions; if one partition splits the block, it may conflict with the other partitions.
For eliminating top-level read-write conflicts: the first top-level node of each partition, called the partition header, is cached, and lookup operations can begin from the corresponding partition header. If a partition header were deleted, it could no longer be accessed directly but only found through the previous partition, so the conflict would reappear. The invention therefore does not delete the partition header from the R-skiplist; it merely sets the flag is_del to true, ensuring that the boundary of each partition always exists.
For eliminating boundary conflicts: when data is inserted or deleted at the top-level boundary of a partition, or blocks are split or merged at a partition boundary, the shared pointer between adjacent partitions is modified. If two adjacent partitions perform such operations simultaneously, the shared pointer may conflict; on analysis, only one combination (a split together with a merge) can actually conflict, and for merge operations, when the two merged blocks belong to different partitions, further conflicts can occur beyond the boundary conflict. Without merge operations, boundary conflicts disappear and other conflicts are reduced, which also benefits the concurrency-control design between remote and local operations; therefore blocks in the R-skiplist are never merged.
For top-level split conflict resolution: since splitting a top-level block conflicts with multiple partitions, the conflict is hard to eliminate; considering that top-level splits are infrequent, it is instead resolved during operation execution. So as not to affect the concurrency of non-split operations, a flag is_split and a reference count ref_count are added to each block. Before a thread splits a block, it first sets is_split to true and invalidates the associated partition-header cache. If another thread finds the partition header invalid or is_split set to true, it stops its access and retries; otherwise it only needs to increment ref_count. After is_split is set to true there may still be outstanding operations on the block, so the block cannot be split until all of them complete, which is determined via ref_count. When the split finishes, is_split is set back to false and the partition-header cache is updated.
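The is_split/ref_count protocol above can be sketched as follows. This is a minimal single-process model: a mutex stands in for whatever atomic operations the real implementation uses, and the cache invalidation is reduced to a comment.

```python
import threading


class SplitGuard:
    """Readers enter by bumping ref_count unless is_split is set; a splitter
    sets is_split first, then waits for outstanding operations to drain."""

    def __init__(self):
        self.lock = threading.Lock()
        self.is_split = False
        self.ref_count = 0

    def try_enter(self):
        with self.lock:
            if self.is_split:
                return False       # split in progress: caller must retry
            self.ref_count += 1    # register an outstanding operation
            return True

    def leave(self):
        with self.lock:
            self.ref_count -= 1

    def begin_split(self):
        with self.lock:
            self.is_split = True   # also invalidate the partition-header cache here
        while True:                # wait for outstanding operations to finish
            with self.lock:
                if self.ref_count == 0:
                    return         # safe to split the block now

    def end_split(self):
        with self.lock:
            self.is_split = False  # and refresh the partition-header cache
```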
For concurrency control of remote read operations, linearizability between remote reads and local writes can be ensured by guaranteeing the atomicity of single-block reads, consuming only a small amount of CPU resources on the server side. This is implemented as follows:
adding a flag to each block as update, before updating the block, the thread first sets it to true and performs the update, and then sets reset to false. When a remote read gets a block from the server, it first determines whether the block is being updated by checking the value of is _ update: if false, the block can be accessed directly; if true, the read requires retries and the write-induced block split is a bottom-up approach. Thus, after the target node is found on the client, the next level of blocks corresponding to the target node may have been split in the server. However, this phenomenon does not affect the correctness of reading because all the split data can still be read. In addition, since the blocks are not merged in the R-skip, there is no fear that a block of a next layer corresponding to the target node may not exist.
Referring to FIG. 6, the prior-art range-scan process is as follows: the client first sends a scan request to the server, then the server traverses the R-skiplist to find the first target data node, copies all the target data to a buffer, and returns it to the client.
This scanning process has two problems: (1) when the search result is large, copying and sending the data consumes a large amount of CPU resources; (2) if the required data set spans multiple partitions, the range search conflicts with writes in other partitions. In the R-skiplist, because the data nodes are stored in contiguous blocks of memory, they can be read directly by the client via RDMA. The invention therefore designs a new scanning process to solve both problems.
Referring to fig. 7, the R-skiplist range search scanning process is performed as follows:
step 1: the client sends a scanning request to the server;
step 2: the server traverses the R-skiplist to find a first target data node;
step 3: the server sends the address of the data block containing the first target data back to the client;
step 4: the client accesses the data block directly via RDMA using the returned address.
This scanning process not only saves the data copy but also avoids occupying the server CPU for data transmission, and because only the first target data is located at the server end, there is no conflict with writes in other partitions. For a remote range search, the execution flow is similar to that of a remote get, except that more data blocks may need to be accessed. After the first target data node is found via the start key, the block containing it is scanned for the remaining target nodes. If the last node is reached and its key is still less than end_key, the next data block is accessed. In fact, once the data block containing the first target node is obtained, max_key and end_key can be compared directly to decide whether the next data block is needed; if so, the next data block can be fetched ahead of time while the client accesses the current one, greatly reducing the latency of remote range queries.
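Steps 1-4 together with the max_key/end_key continuation rule can be sketched as follows. Dicts with 'nodes', 'max_key' and 'next' stand in for RDMA-readable data blocks, and the sequential walk stands in for the prefetched one-sided reads; only bare keys are stored, for brevity.

```python
def range_scan(first_block, start_key, end_key):
    """Client-side continuation after the server returns the block holding
    the first target node: collect keys in [start_key, end_key], following
    'next' only while max_key shows the range extends past this block."""
    results, block = [], first_block
    while block is not None:
        results.extend(k for k in block["nodes"] if start_key <= k <= end_key)
        if block["max_key"] >= end_key:
            break                    # the range ends inside this block
        block = block["next"]        # in practice: another one-sided RDMA read
    return results
```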
It is intended that all such modifications and variations be included herein within the scope of the present invention and protected by the following claims.

Claims (4)

1. A data storage architecture method based on an RDMA high-speed network and a skip list, characterized in that a task scheduler is adopted to allocate local and remote tasks over a skip-list data storage architecture, an access concurrency control strategy and a range-scan process are designed on the R-skiplist index, and the task scheduler distributes local and remote tasks over data stored on the server in R-skiplist form, thereby realizing high-bandwidth, low-latency data access over RDMA network communication.
2. The RDMA high-speed network and skip list based data storage architecture method of claim 1, wherein the skip-list data storage architecture comprises:
a1: in an RDMA network, when the server recognizes, through the completion queue (CQ), a request sent by the client via RPC, it first determines the access mode by examining the current CPU utilization;
a2: if local access is chosen, the request is executed by a thread in the local thread pool, and the result is then returned to the client;
a3: if remote access is chosen, the scheduling thread returns the request to the client, and when the returned message carries the remote-access mark, the client performs the remote access.
3. The RDMA high-speed network and skip list based data storage architecture method of claim 1, wherein the access concurrency control policy designed on the R-skiplist index comprises:
b1: the R-skiplist divides the skip list into a plurality of blocks, each block contains multiple nodes, and each block is stored at contiguous addresses;
b2: the R-skiplist supports two access modes, local and remote; the computer in the cluster where the R-skiplist resides is the local node and the other computers are called remote nodes; local access means accessing the R-skiplist on the local node, while remote access means a remote node accessing the R-skiplist directly through RDMA, bypassing the CPU of the local node;
b3: concurrency control on the R-skiplist includes resolution of top-level split conflicts, top-level read-write conflicts and boundary conflicts.
4. The RDMA high-speed network and skip list based data storage architecture method as claimed in claim 1, wherein the scanning process designed on the R-skiplist index comprises the steps of:
c1: the client sends a scanning request to the server;
c2: the server traverses the R-skiplist to find a first target data node;
c3: the server sends the address of the data block containing the first target data back to the client;
c4: the client accesses the data block directly using the returned address via RDMA.
CN202010041020.9A 2020-01-15 2020-01-15 Data storage architecture method based on RDMA high-speed network and skip list Active CN111221773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010041020.9A CN111221773B (en) 2020-01-15 2020-01-15 Data storage architecture method based on RDMA high-speed network and skip list

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010041020.9A CN111221773B (en) 2020-01-15 2020-01-15 Data storage architecture method based on RDMA high-speed network and skip list

Publications (2)

Publication Number Publication Date
CN111221773A true CN111221773A (en) 2020-06-02
CN111221773B CN111221773B (en) 2023-05-16

Family

ID=70827007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010041020.9A Active CN111221773B (en) 2020-01-15 2020-01-15 Data storage architecture method based on RDMA high-speed network and skip list

Country Status (1)

Country Link
CN (1) CN111221773B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817887A (en) * 2021-02-24 2021-05-18 上海交通大学 Far memory access optimization method and system under separated combined architecture
KR20220106622A (en) * 2021-01-22 2022-07-29 성균관대학교산학협력단 Zipper compaction method and apparatus for compacting the plural of skiplists

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150006478A1 (en) * 2013-06-28 2015-01-01 Silicon Graphics International Corp. Replicated database using one sided rdma
CN105471745A (en) * 2014-09-25 2016-04-06 英特尔公司 Technologies for bridging between coarse-grained and fine-grained load balancing
CN110177118A (en) * 2019-06-13 2019-08-27 上海海事大学 A kind of RPC communication method based on RDMA

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150006478A1 (en) * 2013-06-28 2015-01-01 Silicon Graphics International Corp. Replicated database using one sided rdma
CN105471745A (en) * 2014-09-25 2016-04-06 英特尔公司 Technologies for bridging between coarse-grained and fine-grained load balancing
CN110177118A (en) * 2019-06-13 2019-08-27 上海海事大学 A kind of RPC communication method based on RDMA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHENCHEN HUANG ET AL.: "Partition pruning for range query on distributed log-structured merge-tree", Frontiers of Computer Science *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220106622A (en) * 2021-01-22 2022-07-29 성균관대학교산학협력단 Zipper compaction method and apparatus for compacting the plural of skiplists
KR102568662B1 (en) 2021-01-22 2023-08-22 성균관대학교산학협력단 Zipper compaction method and apparatus for compacting the plural of skiplists
CN112817887A (en) * 2021-02-24 2021-05-18 上海交通大学 Far memory access optimization method and system under separated combined architecture
CN112817887B (en) * 2021-02-24 2021-09-17 上海交通大学 Far memory access optimization method and system under separated combined architecture

Also Published As

Publication number Publication date
CN111221773B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
US11086560B2 (en) Data storage access method, device and apparatus for persistent memory
US7249152B2 (en) Dynamic disk space management by multiple database server instances in a cluster configuration
CN107168657B (en) Virtual disk hierarchical cache design method based on distributed block storage
US20200218662A1 (en) Data caching device and control method therefor, data processing chip, and data processing system
US11269772B2 (en) Persistent memory storage engine device based on log structure and control method thereof
CN111400268B (en) Log management method of distributed persistent memory transaction system
CN110555001B (en) Data processing method, device, terminal and medium
CN110109868B (en) Method, apparatus and computer program product for indexing files
CN113377868B (en) Offline storage system based on distributed KV database
CN111221773B (en) Data storage architecture method based on RDMA high-speed network and skip list
CN110858162A (en) Memory management method and device and server
CN114490141A (en) High-concurrency IPC data interaction method based on shared memory
CN111124270A (en) Method, apparatus and computer program product for cache management
US9223799B1 (en) Lightweight metadata sharing protocol for location transparent file access
US6202134B1 (en) Paging processing system in virtual storage device and paging processing method thereof
CN116431080B (en) Data disc-dropping method, system, equipment and computer readable storage medium
US20140366036A1 (en) Class-Based Mutex
CN110413689B (en) Multi-node data synchronization method and device for memory database
Huang et al. RS-store: RDMA-enabled skiplist-based key-value store for efficient range query
CN116225693A (en) Metadata management method, device, computer equipment and storage medium
US9116814B1 (en) Use of cache to reduce memory bandwidth pressure with processing pipeline
CN115951844B (en) File lock management method, equipment and medium of distributed file system
EP3910484B1 (en) Systems and methods for managing cache replacement
CN111858418A (en) Memory communication method and device based on remote direct memory access RDMA
CN114063914B (en) Data management method for DRAM-HBM hybrid memory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant