CN114691382A - RDMA-based communication method, node, system and medium - Google Patents


Info

Publication number: CN114691382A
Authority: CN (China)
Prior art keywords: rdma, message queue, resource, queue, storage node
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202011630087.2A
Other languages: Chinese (zh)
Inventors: 刘坤, 刘永, 余鹏, 宋雨荷
Current and original assignee: ZTE Corp
Application filed by ZTE Corp
Priority: CN202011630087.2A (published as CN114691382A); PCT/CN2021/122334 (published as WO2022142562A1)


Classifications

    • G06F9/547 — Remote procedure calls [RPC]; Web services
    • G06F13/28 — Handling requests for access to an input/output bus using burst-mode transfer, e.g. direct memory access (DMA), cycle steal
    • G06F9/54 — Interprogram communication
    • G06F9/544 — Buffers; shared memory; pipes
    • G06F9/546 — Message passing systems or structures, e.g. queues
    • G06F2209/544 — Indexing scheme relating to G06F9/54: Remote
    • G06F2209/548 — Indexing scheme relating to G06F9/54: Queue


Abstract

Embodiments of the invention disclose a communication method, node, system and medium based on Remote Direct Memory Access (RDMA). The method applies to a scenario in which a first shared resource in a first storage node comprises a plurality of first-level caches that are isolated from one another at the granularity of a first processing core; the first service thread groups corresponding to the first processing cores are likewise isolated from one another, and each first service thread group comprises at least one first service thread. The steps executed are: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource at least from the first-level cache of its corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue. Contention for shared resources during RDMA data transmission is thereby avoided, allowing the RDMA communication network to deliver its maximum transmission performance as far as possible.

Description

RDMA-based communication method, node, system and medium
Technical Field
Embodiments of the invention relate to the technical field of Remote Direct Memory Access (RDMA) communication, and in particular to an RDMA-based communication method, node, system and medium.
Background
RDMA technology is widely used in the field of distributed storage: it bypasses the kernel and accesses the memory of a remote storage node directly, without involving the remote node's CPU, thereby achieving zero-copy data transmission. Storage nodes can be interconnected through RDMA network cards so that data is transferred between different storage nodes with low latency and high bandwidth, greatly improving the performance of a distributed storage system. At present, the data transmission performance of a storage node under RDMA is improved by optimizing RDMA memory registration and by exploiting RDMA semantics such as send/receive, RDMA read and RDMA write.
The problem is that during RDMA data transmission, the storage node acting as the data-requesting end runs threads and monitor programs across multiple cores, so pre-registered resources are accessed concurrently, which to some extent prevents the RDMA communication network from reaching its maximum transmission performance. In addition, although a Reliable Connection (RC) is established between the storage nodes and zero-copy data transfer is achieved through RDMA primitives, multiple threads sending messages with those primitives contend for the RDMA send/receive Queue Pair (QP) and Completion Queue (CQ) resources, which also increases latency during RDMA data transmission. How to avoid contention for shared resources during RDMA data transmission, and to deliver the maximum transmission performance of the RDMA communication network as far as possible, has therefore become an urgent problem to be solved.
Disclosure of Invention
One or more embodiments of the present disclosure provide an RDMA-based communication method, node, system, and medium, which can avoid contention for shared resources during RDMA data transmission and exert the maximum transmission performance of an RDMA communication network as much as possible.
To solve the above technical problems, one or more embodiments of the present specification are implemented as follows:
In a first aspect, an RDMA-based communication method is provided, applicable to a scenario in which a first shared resource in a first storage node includes multiple first-level caches isolated from one another at the granularity of a first processing core; the first service thread groups corresponding to the first processing cores are isolated from one another, each first service thread group includes at least one first service thread, and each first service thread has a corresponding first RDMA message queue. The steps executed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource at least from the first-level cache of its corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
In a second aspect, an RDMA-based communication method is provided, applicable to a scenario in which a second shared resource in a second storage node includes multiple first-level caches isolated from one another at the granularity of a second processing core, with second RDMA message queues corresponding one-to-one to the second processing cores. The steps executed on the second storage node side include: a second RDMA message queue sends a resource access request to a first RDMA message queue of a first storage node, the first RDMA message queues corresponding one-to-one to the second RDMA message queues; and after the first storage node processes the resource access request based on the required resource, the second RDMA message queue receives the processing result sent by the first RDMA message queue.
In a third aspect, a storage node is provided, including: the first processing core and the first shared resource comprise a plurality of first-level caches which are mutually isolated by taking the first processing core as granularity; the first service thread groups correspond to the first processing cores one by one, the first service thread groups are isolated from one another, and each first service thread group comprises at least one first service thread; the first RDMA message queue is arranged corresponding to the first service thread and used for receiving a resource access request sent by a second RDMA message queue of a second storage node; the first service thread is used for acquiring required resources from the first-level cache of the corresponding first processing core; and, for processing the resource access request based on the required resource; the first RDMA message queue also sends processing results to the second RDMA message queue.
In a fourth aspect, a storage node is provided, which includes: a second processing core; the second shared resource comprises a plurality of first-level caches, and the first-level caches are mutually isolated by taking the second processing core as granularity; a second RDMA message queue in one-to-one correspondence with the second processing core; the second RDMA message queue is used for sending resource access requests to a first RDMA message queue of a first storage node, and the first RDMA message queue and the second RDMA message queue are in one-to-one correspondence; and the second RDMA message queue is also used for receiving the processing result sent by the first RDMA message queue after the first storage node processes the resource access request based on the required resource.
In a fifth aspect, a distributed storage system is proposed, comprising the storage node as described above.
In a sixth aspect, a storage medium for computer readable storage is presented, the storage medium storing one or more programs which, when executed by one or more processors, implement the steps of the RDMA-based communication method as described above.
As can be seen from the technical solutions above, the RDMA-based communication method provided by embodiments of the invention divides part of the first shared resource in the first storage node into a plurality of first-level caches, isolated from one another at the granularity of the first processing cores participating in RDMA communication. The first service thread groups corresponding to each first processing core are likewise isolated from one another; each group includes at least one first service thread, and each first service thread has a corresponding first RDMA message queue. When a first RDMA message queue receives an access request, access is therefore isolated, and a first service thread performing RDMA communication can obtain resources only from the first-level cache on its own first processing core, reducing the communication-performance loss caused by resource contention during data interaction. The steps performed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource at least from the first-level cache of its corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
Therefore, when the first storage node must process multiple message queues at the same time, first service threads from different first service thread groups can handle them, each accessing in isolation the first-level cache on its own first processing core. Contention for shared resources during RDMA data transmission is avoided, and the transmission performance of RDMA communication can be fully exploited.
Drawings
To illustrate the technical solutions of one or more embodiments of this specification, or of the prior art, more clearly, the drawings needed for their description are briefly introduced below. The drawings described below show only some of the embodiments in this specification; other drawings may be derived from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of a storage node to which an RDMA-based communication method is applied according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of steps of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating steps of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating steps of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating steps of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating steps of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 7 is a schematic step diagram of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 8 is a schematic step diagram of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 9 is a schematic step diagram of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 10 is a schematic step diagram of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 11 is a schematic step diagram of another RDMA-based communication method according to an embodiment of the present invention.
Fig. 12 is a schematic structural diagram of a storage node according to an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of another storage node according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of another storage node according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of another storage node according to an embodiment of the present invention.
Detailed Description
To make the technical solutions in this specification better understood, they are described below clearly and completely with reference to the accompanying drawings in one or more embodiments. The described embodiments are only a part of the embodiments of this specification, not all of them; all other embodiments derived by those skilled in the art from them without inventive effort fall within the scope of protection of this document.
Fig. 1 shows the architecture of a storage node to which the RDMA-based communication method of an embodiment of the invention is applied, i.e., an internal deployment diagram of the storage node. In the working thread of the data forwarding subsystem DDF (Distributed Data Forward; see figs. 8 and 9), the improvements mainly cover the resource management function and the message communication function. The resource management function includes: initialization and deployment of user cache message queue resources; initialization and management of resources such as the page cache; initialization and management of working threads; initialization and management of mapping tables; establishment of RDMA connection requests; and deployment of RDMA message queue resources. The message communication function includes preparing the resources required by the user cache message queue and the RDMA message queue during RDMA communication.
For the RDMA-based communication method, the following structural deployment is carried out on the storage nodes (both the first storage node and the second storage node mentioned later use this deployment):
(1) create a DDF working thread on each processing core participating in RDMA communication, running the resource management and message communication function code;
(2) for each processing core participating in RDMA communication (including the first and second processing cores mentioned later), apply for a primary cache of a preset size in the shared resource; the number of pages in the primary cache is an integer multiple of 2048, and the page size can be set according to the characteristics of the data carried by the RDMA communication service, e.g. 8 KB or 16 KB;
(3) for all processing cores participating in data interaction, apply for a secondary cache in the shared resource; the number of pages in the secondary cache is 1024 times the number of processing cores participating in RDMA communication under the same first NUMA node, and the page size is the same as in the primary cache;
(4) establish a mapping table, which may be an RDMA Mkey mapping table;
(5) on each processing core of the first storage node participating in RDMA, establish an RDMA send/receive queue pair QP for data interaction with the second storage node; the QPs on the first and second storage nodes are paired one-to-one according to the number of processing cores participating in RDMA communication, so that each participating core on the first storage node has a unique QP corresponding to the QP of a participating core on the second storage node, and vice versa;
(6) configure the completion queue on each participating core of the first storage node, and on each participating core of the second storage node, in a 1:1 relation with its respective QP;
(7) prepare 256 DDF headers in the QP via the RDMA receive primitive (ib_post_recv), for receiving resources;
(8) create a user cache queue RQ on the participating cores of the first and second storage nodes, with RQ and QP in a 1:1 correspondence, the message queues sharing the isolated resources.
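Deployment steps (1)-(8) above can be sketched as a small model. The names here (`CoreResources`, `deploy`) and the Python form are illustrative assumptions, not part of the patent; the sketch only shows the 1:1 pairing of QP, CQ and RQ on each participating core.

```python
from dataclasses import dataclass

PAGE_COUNT_MULTIPLE = 2048  # primary-cache page count is a multiple of 2048, per step (2)

@dataclass
class CoreResources:
    core_id: int
    primary_cache_pages: int  # per-core primary cache quota
    qp_id: int                # send/receive queue pair, unique per core (step 5)
    cq_id: int                # completion queue, 1:1 with the QP (step 6)
    rq_id: int                # user cache queue, 1:1 with the QP (step 8)

def deploy(cores_in_rdma, pages_per_core=PAGE_COUNT_MULTIPLE):
    """Create one per-core context, pairing QP, CQ and RQ 1:1 as in steps (5)-(8)."""
    assert pages_per_core % PAGE_COUNT_MULTIPLE == 0
    return {c: CoreResources(core_id=c,
                             primary_cache_pages=pages_per_core,
                             qp_id=c, cq_id=c, rq_id=c)
            for c in cores_in_rdma}

nodes = deploy([0, 1, 2, 3])
# Each participating core owns exactly one QP/CQ/RQ triple.
assert all(r.qp_id == r.cq_id == r.rq_id == c for c, r in nodes.items())
```

Because the pairing is per participating core (not per thread pool), a message arriving on a given QP is always consumed on the same core, which is what makes the core-local caches described later lock-free.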
It should be noted that "processing cores" throughout this application does not mean all processing cores, but only those participating in RDMA communication — cpu cores predicted or specified by the user as able to carry RDMA communication services; RDMA-related data interaction does not occur on the other cpu cores. This can of course be configured dynamically according to service requirements. In addition, to describe the scheme better, the two storage nodes performing data interaction are distinguished as the first and second storage nodes, and their structural components are distinguished with the prefixes "first" and "second" respectively.
The communication method provided by this application is implemented on RDMA. The first and second storage nodes can each act as receiving end or sending end; the following embodiments describe one side at a time in order to illustrate the functions each storage node can implement. A final storage node may of course combine all the implementations of both the first and second storage nodes in method and structure.
Example one
Fig. 2 is a schematic diagram illustrating the steps of an RDMA-based communication method according to an embodiment of the present invention. The method is applicable to a scenario in which a first shared resource in a first storage node includes a plurality of first-level caches isolated from one another at the granularity of a first processing core; the first service thread groups corresponding to the first processing cores are isolated from one another, each first service thread group includes at least one first service thread, and each first service thread has a corresponding first RDMA message queue. The RDMA-based communication method comprises the following steps executed on the first storage node side:
step 10: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node;
as shown in fig. 1, the multiple first-level caches are isolated from one another at the granularity of the first processing core on which each resides. Each first processing core is provided with a first service thread group containing at least one first service thread. After the first RDMA message queue corresponding to a first service thread receives a resource access request sent by the second RDMA message queue of the second storage node, the first service thread accesses the first-level cache resource without locks.
Step 20: the first service thread obtains the required resource at least from the first-level cache of the corresponding first processing core;
therefore, the first service thread acquires the required resource from the first-level cache on the first processing core corresponding to its first service thread group. As can be seen from fig. 1, part of the first shared resource may be divided into a plurality of first-level caches. The shared resources involved in the communication method include pages, RDMA pre-registration information, message cache queues and the like used during RDMA data transmission.
Step 30: the first service thread processes the resource access request based on the required resource;
after acquiring the required resource, the first service thread processes a resource access request, such as a read request or a write request, based on the required resource, thereby completing a service initiated by the second storage node.
After the first service thread finishes processing the resource access request, it passes the processing result to a first RDMA message queue, which sends it to the second RDMA message queue to notify the second storage node. This first RDMA message queue may be one of several first RDMA message queues corresponding to the first service thread — possibly different from the one that received the resource access request — or it may be the only first RDMA message queue corresponding to the first service thread group, as described below.
Step 40: the first RDMA message queue sends the processing result to the second RDMA message queue.
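Steps 10-40 can be illustrated with a toy model. All names here are hypothetical and the real data path uses RDMA queues rather than Python lists; the sketch only shows why serving each request entirely from its own core's cache makes the access lock-free.

```python
# Illustrative model of steps 10-40: a request arriving on a core's first
# RDMA message queue is served entirely from that core's own primary cache.
def handle_request(core_caches, core_id, request):
    cache = core_caches[core_id]        # step 10: request lands on this core's queue
    resource = cache.pop()              # step 20: core-local (lock-free) acquisition
    result = f"processed {request} with {resource}"  # step 30: process using the resource
    cache.append(resource)              # release back to the same core's cache
    return result                       # step 40: sent back on the paired queue

caches = {0: ["page-a"], 1: ["page-b"]}
print(handle_request(caches, 1, "read-req"))  # → processed read-req with page-b
```

No core ever touches another core's cache, so no lock is needed — the isolation itself is the synchronization mechanism.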
The RDMA-based communication method of the embodiment of the invention uses the RDMA network card chip for high-bandwidth, low-latency data interaction between different storage nodes. Under a multi-core CPU architecture, first service threads on different first processing cores contend on shared resources when handling message cache queues, RDMA queues and RDMA pre-registered memory information, which increases latency and reduces bandwidth during RDMA data transmission. By isolating part of the shared resources at the granularity of the first processing core, the method reduces the loss caused by contention when multiple message queues must be processed by multiple first service threads simultaneously, improving RDMA communication performance in a distributed storage system.
In the embodiment of the invention, a working thread (for example, the DDF working thread shown in fig. 8) is started on each first processing core to run the message communication and resource management functions described above. The working thread may register a callback function on the RDMA send/receive queue pair QP; when an RDMA message queue has work pending, the working thread is woken by the callback to process it, and otherwise performs work related to the resource management function.
It can be seen that the RDMA-based communication method of the embodiment of the invention applies a lock-free design to memory resources, such as pages, during RDMA data transmission, reducing the latency caused by invalid waiting, lowering the latency of RDMA data transmission, and raising its bandwidth.
Referring to fig. 3, in some embodiments, in the communication method provided by the embodiments of the present invention, when the storage node adopts a NUMA architecture, the first shared resource further includes a plurality of secondary caches, and the plurality of secondary caches are isolated from each other by using the first NUMA node as a granularity.
The following description takes page resources as an example; it applies generally to other resources. Pages are an important resource carrying data during RDMA data transmission, and the efficiency of applying for, releasing, and loading data into pages strongly affects communication performance. On one hand, frequent application and release of pages causes frequent creation and teardown of virtual-to-physical address mappings, which reduces system throughput, increases communication latency and lowers transmission bandwidth. On the other hand, under the NUMA architecture, access to page resources has affinity with the position of the CPU socket, and accessing shared resources across a first NUMA node during RDMA data transmission increases RDMA communication latency.

The embodiment of the invention therefore manages resources such as pages with a secondary cache plus first service threads, using isolated access at the granularity of the first processing core. As shown in fig. 1, each first processing core pre-allocates a fixed quota of page resources into its primary cache, and each first NUMA node pre-allocates a larger quota of page resources into its secondary cache. When a first service thread needs page resources for RDMA communication, it may obtain them only from the primary cache on its own first processing core, and releases them back into that primary cache after use; it is not allowed to access the primary caches of other first processing cores. The primary caches are thus accessed in isolation between first service thread groups, enabling lock-free access to primary cache resources.
It can be seen that the communication method of the embodiment of the invention manages shared resources hierarchically and in isolation. The primary and secondary caches must be sized according to the system memory and the maximum size and length of data the user is expected to send.
Step 20: the first service thread obtains the required resource from at least a first-level cache of the corresponding first processing core, and specifically includes:
Step 200: when the remaining resource quota of the first-level cache falls below a first resource-quota watermark, the first service thread obtains the required resource from the second-level cache corresponding to the first processing core; or
To address resource shortfalls during use of first-level cache resources, the resource management function sets a first resource-quota watermark for each first-level cache. When the remaining quota falls below this watermark, the first service thread obtains page resources from the second-level cache corresponding to the first processing core and replenishes the first-level cache.
Step 210: when the remaining resource quota of the second-level cache falls below a second resource-quota watermark, the first service thread obtains the required resource from the operating system of the storage node.
Similarly, when both the second-level and first-level cache resources are insufficient, the first service thread may obtain page resources from the operating system of the storage node for RDMA data transmission.
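The watermark-driven refill path of steps 200 and 210 can be sketched as a small model (an illustrative assumption, not the patent's implementation; the class name, quotas, and watermark values are invented for the example, and real page resources would live in pre-registered RDMA memory rather than Python objects):

```python
class PageCache:
    """Minimal model of one cache level with a quota and a refill watermark."""
    def __init__(self, quota, watermark, backing=None):
        self.free = [object() for _ in range(quota)]  # pre-allocated "pages"
        self.watermark = watermark
        self.backing = backing  # the second-level cache for a first-level cache

    def acquire(self):
        # Refill from the next level once the remaining quota drops below
        # the watermark, mirroring steps 200/210.
        if len(self.free) < self.watermark and self.backing is not None:
            self.free.extend(self.backing.acquire_batch(self.watermark))
        if not self.free:
            return object()  # last resort: "ask the storage node's OS"
        return self.free.pop()

    def acquire_batch(self, n):
        out, self.free = self.free[:n], self.free[n:]
        return out

    def release(self, page):
        self.free.append(page)  # pages go back to the cache they came from

# One second-level cache per NUMA node, one first-level cache per core;
# a first service thread only touches the cache of its own core, so the
# fast path needs no lock.
l2 = PageCache(quota=64, watermark=16)
l1_per_core = {core: PageCache(quota=8, watermark=4, backing=l2)
               for core in range(4)}

page = l1_per_core[0].acquire()
l1_per_core[0].release(page)
```

Because each first-level cache is private to one core's service thread group, `acquire` and `release` never race; only the batch refill from the shared second-level cache would need synchronization in a real multi-threaded implementation.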
Referring to fig. 4, in some embodiments, in the communication method provided in the embodiments of the present invention, the resource access request includes a key value corresponding to the required resource, and step 20: the first service thread obtains the required resource from at least a first-level cache of the corresponding first processing core, and specifically includes:
step 220: the first service thread searches a mapping table based on the key value to obtain the address of the required resource;
During RDMA data transmission, the address of the required resource is obtained by looking up a mapping table based on the key value carried in the resource access request. The mapping table is established during initialization, before RDMA data transmission, so that the mapping information of the required resources does not have to be registered with the local RDMA network card for every transfer.
Step 230: the first business thread acquires the required resource based on the address.
During RDMA READ or RDMA WRITE transfers, the key values of required resources such as pages are carried in the resource access request. After the first RDMA message queue receives the request, the first service thread looks up the address corresponding to the key value in the mapping table and obtains the required resource promptly. The address of the required resource therefore does not need to be registered with the RDMA network card during each transfer, which avoids both the registration overhead and the contention among the first service threads of multiple processing cores for pre-registered resources.
Referring to FIG. 5, in some embodiments, step 10: before the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, the communication method provided by the embodiment of the present invention further includes:
Step 50: registering the address of the first shared resource with the RDMA network card;
As described above, the address of the required resource is obtained from the key value carried in the resource access request, so the address of the first shared resource is registered with the RDMA network card before RDMA data transmission. The registered resource need not be the entire first shared resource; no limitation is imposed here.
Step 51: creating a mapping table, wherein the mapping table comprises the corresponding relation between the address of the shared resource and the key value;
For example, the addresses of the first-level and second-level caches in the first shared resource may be registered with the RDMA network card, and a mapping table created as a correspondence table between shared-resource addresses and key values. A first working thread may later maintain the mapping table through the resource management function; for example, the key may be the start address of a page and the key value an RDMA Mkey. This reduces the waiting time otherwise spent registering resource mapping information with the RDMA network card.
Step 52: the mapping table is shared with the second storage node.
After the mapping table is established, it is shared with the second storage node to facilitate subsequent RDMA data transmission.
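Steps 50 to 52 amount to a register-once, look-up-many scheme. A hedged sketch follows (the function names and the integer Mkey are assumptions for illustration; actual registration would go through the verbs API, e.g. `ibv_reg_mr`, and the Mkey would be issued by the RDMA network card):

```python
import itertools

_next_mkey = itertools.count(0x1000)  # stands in for NIC-issued Mkeys

def register_region(mapping_table, addr, length):
    """Steps 50/51: register a shared-resource region once and record the
    key -> (address, length) entry in the mapping table."""
    mkey = next(_next_mkey)
    mapping_table[mkey] = (addr, length)
    return mkey

def lookup(mapping_table, mkey):
    """Per-transfer path (steps 220/230): a plain table read, with no
    re-registration against the RDMA network card."""
    return mapping_table[mkey]

table = {}
key = register_region(table, addr=0x7F000000, length=4096)

# Step 52: share the table with the peer so both sides resolve keys locally.
peer_table = dict(table)
assert lookup(peer_table, key) == (0x7F000000, 4096)
```

The design choice mirrored here is that registration cost is paid once at initialization, while the per-transfer cost is reduced to a dictionary lookup.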
Referring to fig. 6, in some embodiments, according to the communication method provided by the embodiments of the present invention, the first RDMA message queue corresponds to the first processing core one to one, and the first RDMA message queue corresponds to the second RDMA message queue one to one, step 20: the first service thread obtains the required resource from at least a first-level cache of the corresponding first processing core, and specifically includes:
step 240: a first service thread in a first service thread group corresponding to a first processing core at least acquires required resources from a first-level cache of the corresponding first processing core;
As can be understood from FIG. 1, the first RDMA message queues may correspond one-to-one to the first processing cores, so that the first service threads in the first service thread group on a first processing core may only send and receive data through that core's first RDMA message queue. The first RDMA message queues also correspond one-to-one to the second RDMA message queues, so that a first RDMA message queue may only receive data sent by its corresponding second RDMA message queue and may only send data to that same queue. With the first-level caches corresponding one-to-one to the first processing cores, and the first processing cores corresponding one-to-one to the first service thread groups, a first service thread in a group may only obtain required resources from its own first-level cache.
Correspondingly, step 40: the sending of the processing result to the second RDMA message queue by the first RDMA message queue specifically includes:
Step 400: the first RDMA message queue sends the processing result to the second RDMA message queue that corresponds to it one-to-one.
Because the first RDMA message queues correspond one-to-one to the first processing cores, and a first RDMA message queue may only send the processing result to its corresponding second RDMA message queue, contention among multiple first service threads for the transceiving queue (QP) resources of an RDMA message queue is avoided during RDMA data transmission. RDMA QP resources for communicating with every second processing core participating in RDMA communication on the second storage node are established on every first processing core participating in RDMA communication on the first storage node. As shown in FIG. 1, the RDMA message queue resource on a given first processing core may only be accessed by first service threads in that core's first service thread group; a first service thread running on a first processing core performs RDMA communication only through the RDMA message queue resource of that core, and first service threads on other first processing cores are not allowed access, thereby achieving lock-free access to RDMA message queue resources.
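The one-to-one pairing can be pictured as queue arrays indexed by core number (a simplified model under stated assumptions: a deque stands in for a QP, and delivery is modelled as a direct append):

```python
from collections import deque

class NodeQueues:
    """One RDMA message queue per processing core participating in RDMA
    communication; queue i on one node pairs only with queue i on the peer."""
    def __init__(self, n_cores):
        self.queues = [deque() for _ in range(n_cores)]

def send_result(core, receiver, result):
    # A first service thread pinned to `core` uses only its own queue pair;
    # threads on other cores never touch queues[core], so no lock is needed.
    receiver.queues[core].append(result)

first_node = NodeQueues(4)
second_node = NodeQueues(4)
send_result(2, second_node, "processing result")
```

Lock-freedom here comes purely from partitioning: no queue is ever shared between cores, so there is nothing to contend for.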
Referring to FIG. 7, in some embodiments, step 10: before the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, the communication method provided by the embodiment of the present invention further includes:
Step 60: establishing first user message buffer queues, where the first user message buffer queues correspond one-to-one to the first RDMA message queues, and each first user message buffer queue includes a first read-write dispatch queue DQ;
The RDMA-based communication method provided by the embodiment of the invention uses lock-free sequential access to the first shared resource during RDMA data transmission. The lock-free design of the first user message buffer queues reduces the latency caused by invalid waiting during RDMA data transmission, thereby lowering RDMA transmission latency and raising RDMA transmission bandwidth. The first user message buffer queues correspond one-to-one to the first RDMA message queues; on the first storage node acting as the receiving end, each established first user message buffer queue includes a first read-write dispatch queue DQ.
Correspondingly, step 10: after the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, step 20: before the first service thread obtains the required resource from at least the first-level cache of the corresponding first processing core, the communication method provided in the embodiment of the present invention further includes:
Step 70: the resource access request is placed into the first read-write dispatch queue DQ, from which the first service thread obtains it.
During RDMA data transmission, for example during RDMA READ or RDMA WRITE primitive operations, a start acknowledgement and an end acknowledgement of resource access (such as page access) are required so that the sending or receiving end knows the completion state of remote-read or remote-write processing. The message communication module completes these access-state acknowledgements by establishing user message buffer queues on each first processing core participating in RDMA communication, as follows:
Read-write request queue RQ (request queue): buffers resource access requests that have been sent and are awaiting a response;
Read-write dispatch queue DQ (dispatch queue): buffers received resource access requests;
Read-write completion queue CAQ (completion ack queue): buffers the results corresponding to completed resource access requests;
The resource relationship between the read-write request queue RQ, the RDMA transceiving queue QP, and the completion queue CQ is 1:1:1. A resource access request message of the second storage node is buffered in the second storage node's RQ; after the first storage node completes data processing such as an RDMA READ or RDMA WRITE operation, it replies with an acknowledgement that the resource access request is complete. The resource access request sent by the second storage node is buffered in the RQ and delivered in order, via RDMA data transmission, to the DQ of the first node; the first service thread dispatches the request from the DQ to the designated service for data processing; once processing is complete, the acknowledgement message is sent via RDMA data transmission to the CAQ on the second node, and the second service thread reports the acknowledgement to the service. The start and end of a resource access request are handled on the same second processing core, achieving lock-free access to the RQ resource.
Every resource access request in the RQ must have a timeout. If no acknowledgement for the request is received before the timeout expires, the request is considered to have failed. During the timeout window the resource has granted access rights to the accessing end through the RDMA network card, so the accessing end may still be accessing it; if the remote node released the message during that window, memory corruption could result. The invention therefore handles timed-out messages as follows: close the RDMA communication connection, discard all messages in the RDMA message queue and return them to the user as errors, then re-establish the connection to the remote node and continue sending the messages that have not timed out. The user can decide whether to retransmit the failed messages or handle the errors.
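The RQ/DQ/CAQ handshake and the timeout rule can be sketched end to end (an illustrative model under stated assumptions: the three queues are plain deques, RDMA transmission is a direct append, and the timeout policy shows only the expiry check, not the connection teardown and reconnect):

```python
import time
from collections import deque

class Endpoint:
    def __init__(self):
        self.rq = deque()    # requests sent, awaiting acknowledgement
        self.dq = deque()    # requests received, awaiting dispatch
        self.caq = deque()   # acknowledgements of completed requests

def send_request(src, dst, req, timeout_s=1.0):
    # Every request placed in the RQ carries a deadline, per the text above.
    src.rq.append({"req": req, "deadline": time.monotonic() + timeout_s})
    dst.dq.append(req)                   # the "RDMA send" into the peer's DQ

def process_and_ack(node, peer):
    req = node.dq.popleft()              # working thread dispatches from DQ
    peer.caq.append(req)                 # completion ack lands in peer's CAQ

def reap_ack(node):
    ack = node.caq.popleft()
    for entry in list(node.rq):
        if entry["req"] == ack:
            node.rq.remove(entry)        # matched: report success to the service
            return True
    return False

def expire_timeouts(node):
    """Timed-out requests are removed from the RQ and surfaced as errors;
    the real policy also closes and re-establishes the RDMA connection."""
    now = time.monotonic()
    expired = [e for e in node.rq if e["deadline"] < now]
    for e in expired:
        node.rq.remove(e)
    return [e["req"] for e in expired]
```

Because a given request starts and ends on the same processing core, each Endpoint here would be core-private, which is what makes the RQ access lock-free.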
The following is an exemplary illustration of data interaction of the RDMA-based communication method provided by the embodiment of the present invention:
FIG. 8 shows the write data flow, with the first storage node acting as the initiator. To reiterate, the terms first storage node and second storage node in the embodiments of the invention only distinguish the sending end from the receiving end; here they are respectively the initiator and the accessed end, and the two nodes implement the same functions and structure:
(1) As shown in FIG. 8, the first storage node finds the RDMA Mkey associated with the service data page in the RDMA Mkey mapping table, encapsulates it into a DDF write request, and sends the request to the QP on the second storage node through the RDMA send primitive (ib_post_send);
(2) the second storage node retrieves the write request from a posted receive via the CQ, places it into the DQ, and wakes the DDF working thread to process it;
(3) the DDF working thread takes the write request from the DQ and parses out the RDMA Mkey information and data size of the corresponding data page on the first storage node; the resource management module of the second storage node allocates a page of the corresponding size from the page cache (first-level or second-level cache) to hold the data to be written, and the data corresponding to the RDMA Mkey on the first node is read into the prepared page on the second node through the RDMA read primitive;
(4) after the RDMA read primitive on the second storage node has fetched the data, the write-request ack information is packed into a DDF message header and sent to the RDMA QP on the first storage node through the RDMA send primitive;
(5) the first storage node retrieves the write-request ack from a posted receive in the RDMA QP, places it into the CAQ, and wakes the DDF working thread;
(6) the working thread on the first storage node takes the write-request ack message from the CAQ, finds the corresponding write request in the RQ, and returns the write status to the service.
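The six write-flow steps reduce to the following sketch (assumptions: node memory is modelled as a dict keyed by page name, the shared Mkey table maps keys to those page names, and the RDMA read in step (3) becomes a direct copy):

```python
def write_flow(initiator_mem, target_mem, mkey_table, mkey):
    # (1) the initiator looks up the Mkey of its data page and encapsulates
    #     it into a write request sent to the target's QP.
    request = {"op": "write", "mkey": mkey}
    # (2)-(3) the target parses the request, allocates a page of the right
    #     size, and RDMA-reads the initiator's data into that page.
    src_page = mkey_table[request["mkey"]]
    target_mem["allocated_page"] = initiator_mem[src_page]
    # (4)-(6) the target acks; the initiator matches the ack against the
    #     pending write request in its RQ and reports the write status.
    ack = {"op": "write_ack", "mkey": mkey}
    return ack["mkey"] == request["mkey"]

mkey_table = {0x2001: "pageA"}
first_node = {"pageA": b"business data"}
second_node = {}
ok = write_flow(first_node, second_node, mkey_table, 0x2001)
```

Note the direction: in this write flow the data actually moves via an RDMA read issued by the accessed end, which is why the initiator never needs to know where the target allocated its page.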
FIG. 9 shows a read data flow (with the first storage node acting as the initiator of the read flow):
(1) As shown in FIG. 9, the first storage node finds the RDMA Mkey and data description information of the page in the service data in the RDMA Mkey mapping table, encapsulates them into a DDF read request, and sends the request to the QP on the second storage node through the RDMA send primitive (ib_post_send);
(2) the second storage node retrieves the read request message from a posted receive in the QP, places it into the DQ, and wakes the DDF working thread;
(3) the DDF working thread takes the read request message from the DQ, parses out the RDMA Mkey information and data description information of the page on the first storage node, obtains the page holding the requested data from the service on the second storage node, and writes that page data into the page corresponding to the RDMA Mkey on the first node through the RDMA write primitive;
(4) after the RDMA write primitive on the second storage node finishes writing the data, the read-request ack information is packed into a DDF message header and sent to the first storage node through the RDMA send primitive, and the DDF on the second storage node reports the read completion to the user;
(5) the first storage node retrieves the read-request ack from a posted receive in the RDMA QP, places it into the CAQ, and wakes the DDF working thread;
(6) the DDF working thread takes the read-request ack message from the CAQ, finds the corresponding read request in the RQ, and returns the read status to the service.
As can be seen from the above analysis, the RDMA-based communication method provided in the embodiment of the invention is suited to a scenario in which part of the first shared resource in the first storage node is divided into a plurality of first-level caches, the first-level caches are isolated from one another at the granularity of the first processing cores participating in RDMA communication, the first service thread groups corresponding to the first processing cores are isolated from one another, each first service thread group includes at least one first service thread, and a first RDMA message queue is provided for the first service threads. When a first RDMA message queue receives an access request, the access pattern achieves isolated access; when a first service thread performs RDMA communication, it can only obtain resources from the first-level cache on the first processing core where it runs. This mitigates the communication performance degradation caused by resource contention during data interaction. The steps performed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource from at least the first-level cache of the corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
Therefore, when the first storage node must process multiple message queues at the same time, first service threads from different first service thread groups handle them, each accessing in isolation the first-level cache on its own first processing core. Contention for shared resources is thus avoided during RDMA data transmission, improving RDMA communication performance.
Example two
Referring to FIG. 10, in an RDMA-based communication method provided in an embodiment of the invention, applicable where the second shared resource of a second storage node includes a plurality of first-level caches isolated from one another at the granularity of the second processing core, and where the second RDMA message queues correspond one-to-one to the second processing cores, the steps executed on the second storage node side include:
Step 10': the second RDMA message queue sends a resource access request to a first RDMA message queue of the first storage node, where the first RDMA message queues correspond one-to-one to the second RDMA message queues;
the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node;
As shown in FIG. 1, the plurality of first-level caches are isolated from one another at the granularity of the first processing core on which each resides. Each first processing core is provided with a corresponding first service thread group, and each service thread group includes at least one first service thread. After the first RDMA message queue provided for the first service thread receives the resource access request sent by the second RDMA message queue of the second storage node, the first service thread achieves lock-free access to the first-level cache resource.
Step 20': after the first storage node processes the resource access request based on the required resource, the second storage node receives the processing result sent by the first RDMA message queue.
Thus the first service thread obtains the required resource from the first-level cache on the first processing core corresponding to its first service thread group. As can be seen from FIG. 1, part of the first shared resource may be divided into a plurality of first-level caches. The shared resources involved in the communication method of this embodiment include pages, RDMA pre-registration information, message buffer queues, and the like used during RDMA data transmission.
After acquiring the required resource, the first service thread processes a resource access request, such as a read request or a write request, based on the required resource, thereby completing a service initiated by the second storage node.
After the first service thread has processed the resource access request, it sends the processing result to a first RDMA message queue, which sends the result on to the second RDMA message queue to notify the second storage node. The first RDMA message queue may be one of a plurality of first RDMA message queues corresponding to the first service thread, may be a different first RDMA message queue from the one that received the resource access request, or may be the only first RDMA message queue corresponding to the first service thread group of the first service thread, as described below.
The RDMA-based communication method provided by the embodiment of the invention uses the RDMA network card chip for high-bandwidth, low-latency data interaction between storage nodes. It addresses the communication performance degradation caused by contention for shared resources when multiple message queues must be processed by multiple first service threads simultaneously during RDMA data transmission, improving RDMA communication performance in a distributed storage system. Part of the shared resources is isolated at the granularity of the first processing core, reducing the loss caused by contention during RDMA data transmission.
In the embodiment of the present invention, a working thread (for example, the DDF working thread shown in FIG. 8) is started on each first processing core to run the message communication function and resource management function described above. The working thread may register a callback function on the RDMA transceiving queue QP; when a message arrives on the RDMA message queue, the callback function wakes the working thread to process that queue, and the working thread also performs the work related to the resource management function.
It can be seen that the RDMA-based communication method provided by the embodiment of the invention applies a lock-free design to memory resources such as pages during RDMA data transmission, reducing the latency caused by invalid waiting, lowering RDMA transmission latency, and raising RDMA transmission bandwidth.
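The callback-driven wakeup of the per-core working thread can be outlined as follows (a single-threaded stand-in: the `deliver` call plays the role of the completion event that would wake the thread, and all names are invented for the example):

```python
class QueuePair:
    """Simplified QP: invokes a registered callback whenever a message
    arrives, standing in for an RDMA completion event."""
    def __init__(self):
        self.callback = None
        self.pending = []

    def register_callback(self, fn):
        self.callback = fn          # the working thread registers itself here

    def deliver(self, msg):
        self.pending.append(msg)
        if self.callback is not None:
            self.callback(self)     # "wake" the per-core working thread

processed = []

def working_thread(qp):
    # Drain the queue, then go back to waiting for the next callback.
    while qp.pending:
        processed.append(qp.pending.pop(0))

qp = QueuePair()
qp.register_callback(working_thread)
qp.deliver("write request")
```

In a real implementation the callback would only signal the pinned worker, which then polls its core-local completion queue; the direct call here is a simplification.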
Referring to fig. 11, in some embodiments, the second service thread groups corresponding to the second processing cores are isolated from each other, and the second service thread groups include at least one second service thread, step 10': before the second RDMA message queue sends the resource access request to the first RDMA message queue of the first storage node, the communication method provided in the embodiment of the present invention further includes:
Step 30': establishing second user message buffer queues, where the second user message buffer queues correspond one-to-one to the RDMA message queues, and each includes a second read-write request queue RQ and a second read-write completion queue CAQ;
The RDMA-based communication method provided by the embodiment of the invention uses lock-free sequential access to the shared resources during RDMA data transmission; the lock-free design of the user message buffer queues reduces the latency caused by invalid waiting, lowering RDMA transmission latency and raising RDMA transmission bandwidth. On the second storage node acting as the sending end, each established second user message buffer queue includes a second read-write request queue RQ and a second read-write completion queue CAQ. Step 40': placing the resource access request into the second read-write request queue RQ for the second service thread to obtain;
During RDMA data transmission, for example during RDMA READ or RDMA WRITE primitive operations, a start acknowledgement and an end acknowledgement of resource access (such as page access) are required so that the sending or receiving end knows the completion state of the data processing, such as a remote read or remote write.
Step 50': the second business thread sends the resource access request to a second RDMA message queue.
Correspondingly: step 20': after receiving the processing result sent by the first RDMA message queue, the communication method provided in the embodiment of the present invention further includes:
Step 60': the processing result is placed into the second read-write completion queue CAQ, from which the second service thread obtains it;
The message communication module completes the access-state acknowledgement by establishing user message buffer queues on each processing core participating in RDMA communication, as follows:
Read-write request queue RQ (request queue): buffers resource access requests that have been sent and are awaiting a response;
Read-write dispatch queue DQ (dispatch queue): buffers received resource access requests;
Read-write completion queue CAQ (completion ack queue): buffers the results corresponding to completed resource access requests;
The resource relationship between the read-write request queue RQ, the RDMA transceiving queue QP, and the completion queue CQ is 1:1:1.
Step 70': based on the processing result, the second service thread checks the corresponding resource access request in the second read-write request queue RQ and then reports to the service.
A resource access request message of the second storage node is buffered in the second storage node's RQ; after the first storage node completes data processing such as an RDMA READ or RDMA WRITE operation, it replies with an acknowledgement that the resource access request is complete. The resource access request sent by the second storage node is buffered in the RQ and delivered in order, via RDMA data transmission, to the DQ of the first node; the first service thread dispatches the request from the DQ to the designated service for data processing; once processing is complete, the acknowledgement message is sent via RDMA data transmission to the CAQ on the second node, and the second service thread reports to the service according to the acknowledgement. The start and end of a resource access request are handled on the same second processing core, achieving lock-free access to the RQ resource.
EXAMPLE III
Referring to fig. 12, a storage node 1 according to an embodiment of the present invention includes:
a first processing core 10;
As shown in FIG. 1, the plurality of first-level caches are isolated from one another at the granularity of the first processing core on which each resides. Each first processing core is provided with a corresponding first service thread group, and each service thread group includes at least one first service thread. After the first RDMA message queue provided for the first service thread receives the resource access request sent by the second RDMA message queue of the second storage node, the first service thread achieves lock-free access to the first-level cache resource.
The first shared resource 20 includes a plurality of first-level caches, which are isolated from one another at the granularity of the first processing core;
Thus the first service thread obtains the required resource from the first-level cache on the first processing core corresponding to its first service thread group. As can be seen from FIG. 1, part of the first shared resource may be divided into a plurality of first-level caches. The shared resources involved in the communication method of this embodiment include pages, RDMA pre-registration information, message buffer queues, and the like used during RDMA data transmission.
The first service thread groups 30 correspond to the first processing cores one by one, the first service thread groups are isolated from one another, and each first service thread group comprises at least one first service thread;
after acquiring the required resource, the first service thread processes a resource access request, such as a read request or a write request, based on the required resource, thereby completing a service initiated by the second storage node.
After the first service thread processes the resource access request, it sends the processing result to a first RDMA message queue, and the first RDMA message queue sends the processing result to a second RDMA message queue so as to notify the second storage node. The first RDMA message queue here may be one of a plurality of first RDMA message queues corresponding to the first service thread, may be a first RDMA message queue different from the one that received the resource access request, or may be the only first RDMA message queue corresponding to the first service thread group to which the first service thread belongs, as described below.
The first RDMA message queue 40 is arranged corresponding to the first service thread and used for receiving resource access requests sent by a second RDMA message queue of a second storage node;
the RDMA-based communication method provided by the embodiment of the present invention uses an RDMA network card chip to perform high-bandwidth, low-delay data interaction between different storage nodes. During RDMA data transmission under a multi-core CPU architecture, first service threads on different first processing cores contend for shared resources when processing message buffer queues, RDMA queues and RDMA pre-registered memory information, which increases delay and reduces bandwidth during RDMA data transmission. The communication method provided by the embodiment of the present invention isolates part of the shared resources at the granularity of the first processing core, so as to achieve lock-free access to the shared resources and reduce the loss caused by contention during RDMA data transmission, thereby solving the problem of degraded communication performance when multiple message queues need to be processed by multiple first service threads simultaneously and improving RDMA communication performance in a distributed storage system.
A first service thread 300, configured to obtain a required resource from at least a first-level cache of a corresponding first processing core; and, for processing resource access requests based on the required resources;
in the embodiment of the present invention, a working thread (for example, the DDF working thread shown in fig. 8) is started on each first processing core to run the above-mentioned message communication function and resource management function. The working thread may register a callback function on the RDMA transceiving queue QP; when an RDMA message queue has pending work, the callback function wakes the working thread to process that RDMA message queue, and the working thread also performs the work related to the resource management function.
The first RDMA message queue 40 also sends the processing results to the second RDMA message queue.
It can be seen that the RDMA-based communication method provided by the embodiment of the present invention performs lock-free design on memory resources, such as pages, in the RDMA data transmission process, reduces the delay caused by invalid wait loss in the RDMA data transmission process, reduces the delay of RDMA data transmission, and improves the bandwidth of RDMA data transmission.
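The callback mechanism on the working thread can be sketched as follows (illustrative Python; `TransceiveQueue` is an invented stand-in for an RDMA transceiving queue QP, not a real API): the worker registers a callback on the queue, and a completion event wakes it to process the pending message.

```python
class TransceiveQueue:
    """Invented stand-in for an RDMA transceiving queue QP."""
    def __init__(self):
        self._callback = None

    def register_callback(self, fn):
        # The working thread registers its wakeup hook on the QP.
        self._callback = fn

    def completion_event(self, msg):
        # A completion event wakes the worker to process the message queue.
        self._callback(msg)


processed = []
qp = TransceiveQueue()
qp.register_callback(processed.append)   # the worker's processing routine
qp.completion_event("resource-access-request")
```

In a real implementation the callback would be driven by NIC completion events rather than a direct call; the sketch only shows the register-then-wake control flow.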
In some embodiments of the storage node provided in the embodiment of the present invention, when the storage node adopts a NUMA architecture, the first shared resource 20 further includes a plurality of secondary caches, the plurality of secondary caches are isolated from each other at the granularity of the first NUMA node, and the first service thread 300 is further configured to:
when the resource quota margin of the first-level cache is lower than a first resource quota watermark value, acquire the required resource from the secondary cache corresponding to the first processing core; or, when the resource quota margin of the secondary cache is lower than a second resource quota watermark value, acquire the required resource from the operating system of the storage node.
To address resource shortage during use of the first-level cache resources, the resource management function sets a first resource quota watermark value for each first-level cache; when the resource quota margin falls below this watermark, the first service thread acquires page resources from the secondary cache corresponding to the first processing core and replenishes the first-level cache with them. When the resource quota margin of the secondary cache falls below the second resource quota watermark value, the required resource is acquired from the operating system of the storage node, and the cache may likewise be replenished in this manner.
Similarly, when the secondary cache and the first-level cache resources are insufficient at the same time, the first service thread may acquire page resources from the operating system of the storage node for RDMA data transmission.
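The two-level watermark scheme can be illustrated with a small sketch (illustrative Python; the watermark and batch values are assumptions for the example): when the first-level cache falls below its watermark it is replenished from the secondary cache, and when the secondary cache itself falls below its watermark it is replenished from the operating system.

```python
def acquire_page(l1, l2, os_pages, l1_watermark=2, l2_watermark=2, batch=2):
    """Return one page, replenishing the caches along the way as needed."""
    if len(l1) < l1_watermark:             # L1 quota margin below first watermark
        if len(l2) < l2_watermark:         # L2 margin below second watermark:
            while len(l2) < l2_watermark + batch and os_pages:
                l2.append(os_pages.pop())  # top up L2 from the operating system
        while len(l1) < l1_watermark + batch and l2:
            l1.append(l2.pop())            # top up L1 from L2
    return l1.pop() if l1 else None


l1, l2, os_pages = ["a"], ["b"], ["c", "d", "e", "f"]
page = acquire_page(l1, l2, os_pages)
```

Refilling in batches rather than one page at a time keeps the slow path (crossing a NUMA node or calling into the OS) off most allocations, which is the intent of the watermark design.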
In some embodiments, in the storage node provided in the embodiment of the present invention, the resource access request includes a key value corresponding to a required resource, and the first service thread 300 is further configured to:
searching a mapping table based on a key value to obtain an address where the resource is located;
during RDMA data transmission, the address of the required resource is obtained by searching a mapping table based on the key value carried in the resource access request. The mapping table is established during initialization before RDMA data transmission, so that the mapping information of the required resource does not have to be registered with the local RDMA network card each time RDMA data is transmitted.
The required resources are acquired based on the address.
During RDMA READ or RDMA WRITE data transmission, the key values of required resources such as pages are carried in the resource access request. After the first RDMA message queue receives the resource access request, the first service thread finds the address of the required resource corresponding to the key value based on the mapping table, so that the required resource is acquired in time. The address of the required resource therefore does not need to be registered with the RDMA network card in each RDMA data transmission, which reduces the time consumed by RDMA address registration and by contention among the first service threads on multiple processing cores for the pre-registered resources.
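The lookup step can be sketched as follows (illustrative Python; the table contents, key values and helper names are invented for the example): the mapping table built at initialization maps each key value to a registered address, so the service thread resolves the address with a single lookup instead of registering with the network card per transfer.

```python
# Mapping table built at initialization: key value -> address of the resource.
mapping_table = {
    0x10: 0x7F000000,   # e.g. key 0x10 maps to a pre-registered page address
    0x11: 0x7F001000,
}

def handle_access_request(request):
    """Resolve the required resource's address from the key in the request."""
    addr = mapping_table[request["key"]]   # no per-transfer NIC registration
    return {"op": request["op"], "addr": addr}


result = handle_access_request({"op": "RDMA_READ", "key": 0x11})
```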
Referring to fig. 15, in some embodiments, the storage node according to an embodiment of the present invention further includes a mapping creation module 50, before the first RDMA message queue 40 receives a resource access request sent by the second RDMA message queue 90 of the second storage node 2, the mapping creation module 50 is configured to:
performing address registration on the first shared resource to the RDMA network card;
based on the above RDMA data transmission process, the address of the required resource is obtained from the key value carried in the resource access request, so the first shared resource is address-registered with the RDMA network card before RDMA data transmission; it is not necessarily the whole of the first shared resource that is registered, which is not limited herein. The mapping creation module is further configured to create a mapping table, where the mapping table includes the correspondence between the addresses of the shared resource and the key values.
for example, the addresses of the first-level caches and the secondary caches in the first shared resource may be registered with the RDMA network card, and a mapping table is created as a correspondence table between the addresses of the shared resource and the key values. The first working thread may later maintain the mapping table through the resource management function; for example, the key may be the first address of a page and the value may be the RDMA MKey. This reduces the waiting time incurred by registering resource mapping information with the RDMA network card.
The mapping table is shared with the second storage node.
And after the mapping table is established, the mapping table is shared with the second storage node, so that subsequent RDMA data transmission is facilitated.
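Initialization can be sketched as below (illustrative Python; `register_with_nic` is a placeholder for a real RDMA memory-registration call such as ibverbs' `ibv_reg_mr`, and the MKey values are fabricated): each page address is registered once, the resulting keys are collected into a mapping table keyed by the page's first address, and a copy of the table is handed to the peer node.

```python
import itertools

_mkeys = itertools.count(100)          # stand-in for MKeys returned by the NIC

def register_with_nic(page_addr):
    """Placeholder for real RDMA memory registration (e.g. ibv_reg_mr)."""
    return next(_mkeys)

def build_mapping_table(page_addrs):
    # key: first address of the page; value: MKey from registration.
    return {addr: register_with_nic(addr) for addr in page_addrs}

def share_with_peer(table):
    # The peer node keeps its own copy of the mapping table.
    return dict(table)


table = build_mapping_table([0x1000, 0x2000])
peer_table = share_with_peer(table)
```

Registering once at startup and sharing the table is what lets later transfers skip the per-operation registration the surrounding text describes.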
In some embodiments, in the storage node provided in this embodiment of the present invention, the first RDMA message queue 40 corresponds to the first processing core 10 one to one, the first RDMA message queue 40 corresponds to the second RDMA message queue 40', and the first business thread 300 in the first business thread group 30 corresponding to the first processing core 10 is further configured to:
acquiring required resources from at least a first-level cache of the corresponding first processing core 10;
as can be understood in conjunction with fig. 1, the first RDMA message queues may correspond one-to-one to the first processing cores, so that the first service threads in the first service thread group on a first processing core may only transmit and receive data through that first RDMA message queue. The first RDMA message queues also correspond one-to-one to the second RDMA message queues, so that a first RDMA message queue may only receive data sent by its corresponding second RDMA message queue and may only send data to that second RDMA message queue. When the first-level caches correspond one-to-one to the first processing cores and the first processing cores correspond one-to-one to the first service thread groups, the first service threads in a first service thread group can obtain the required resources only from their own first-level cache.
Correspondingly, the first RDMA message queue 40 correspondingly sends the processing result to the second RDMA message queue 40', which specifically includes:
the first RDMA message queue 40 sends the processing result to the second RDMA message queue 40' in one-to-one correspondence with the first RDMA message queue 40.
Since the first RDMA message queues correspond one-to-one to the first processing cores and a first RDMA message queue may only send the processing result to its corresponding second RDMA message queue, contention among multiple first service threads for the transceiving queue QP resources in the RDMA message queues is avoided during RDMA data transmission. RDMA QP resources for RDMA communication with all the second processing cores participating in RDMA communication on the second storage node are established on all the first processing cores participating in RDMA communication on the first storage node. As shown in fig. 1, the RDMA message queue resource on a given first processing core allows access only by the first service threads in the first service thread group on that first processing core; the RDMA communication of a first service thread running on a first processing core can only use the RDMA message queue resource on the first processing core corresponding to its first service thread group, and first service threads on other first processing cores are not allowed access, thereby achieving lock-free access to the RDMA message queue resources.
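The one-to-one queue pairing can be modeled with a small sketch (illustrative Python; the class and queue names are invented): each first RDMA message queue is bound to exactly one second RDMA message queue and may only deliver to that peer, so no two threads ever contend for the same queue endpoint.

```python
class RdmaMessageQueue:
    """Toy model of an RDMA message queue with a single fixed peer."""
    def __init__(self, name):
        self.name = name
        self.peer = None
        self.inbox = []

    def pair_with(self, other):
        # One-to-one correspondence: each queue has exactly one peer.
        self.peer, other.peer = other, self

    def send(self, msg):
        # A queue may only send to its paired peer queue.
        self.peer.inbox.append((self.name, msg))


first_q = RdmaMessageQueue("first-core0")
second_q = RdmaMessageQueue("second-core0")
first_q.pair_with(second_q)
first_q.send("processing-result")
```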
In some embodiments, the storage node provided by the embodiment of the present invention, before the first RDMA message queue 40 receives the resource access request sent by the second RDMA message queue 40' of the second storage node 2, the first service thread group 30 is configured to:
establish a first user message buffer queue 70, where the first user message buffer queues correspond one-to-one to the first RDMA message queues, and the first user message buffer queue includes a first read-write dispatch queue DQ;
the RDMA-based communication method provided by the embodiment of the present invention adopts lock-free sequential access to the first shared resource during RDMA data transmission. The lock-free design of the first user message buffer queue reduces the delay caused by invalid waiting during RDMA data transmission, thereby reducing the delay of RDMA data transmission and increasing its bandwidth. The first user message buffer queues correspond one-to-one to the first RDMA message queues; on the first storage node acting as the receiving end, the established first user message buffer queue includes a first read-write dispatch queue DQ.
Correspondingly, after the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, before the first service thread 300 obtains the required resource from at least the first-level cache of the corresponding first processing core, the first service thread 300 is further configured to:
and putting the resource access request into a first read-write dispatch queue DQ to obtain the resource access request.
In the process of RDMA data transmission, for example, in the process of RDMA READ or RDMA WRITE primitive operation, a start acknowledgement of resource access such as a page and an end acknowledgement of resource access need to be performed, so that a sending end or a receiving end can know a completion state of data processing such as remote READ or remote write, and a message communication module completes an access state acknowledgement by establishing a user message buffer queue on each first processing core participating in RDMA communication, which is specifically as follows:
Read-write request queue RQ (full name: request queue): caches the sent resource access requests while awaiting response results;
Read-write dispatch queue DQ (full name: dispatch queue): caches the received resource access requests;
Read-write completion queue CAQ (full name: completion ack queue): caches the response results of completed resource access requests.
The resource relationship among the read-write request queue RQ, the RDMA transceiving queue QP and the completion queue CQ is 1:1:1. A resource access request message of the second storage node is cached in the RQ queue of the second storage node, and an acknowledgement message of the completed resource access request is replied after the first storage node completes data processing such as an RDMA READ or RDMA WRITE operation. Specifically, the resource access request sent by the second storage node is cached in the RQ queue and delivered in order to the DQ queue of the first node through RDMA data transmission; the first service thread takes the resource access request from the DQ and delivers it to the designated service for data processing; after the resource access request is processed, an acknowledgement message of the completed request is sent through RDMA data transmission to the CAQ on the second node, and the second service thread reports it to the service. The start and the end of a resource access request are handled on the same second processing core, so as to achieve lock-free access to the RQ resource.
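The request/dispatch/completion cycle above can be traced with a small sketch (illustrative Python; plain queue operations only, no real RDMA transfer): the sender caches the request in its RQ, the transfer lands it in the receiver's DQ, and the acknowledgement travels back into the sender's CAQ, after which the request is retired from the RQ.

```python
from collections import deque

rq = deque()    # second node: sent requests awaiting acknowledgement
caq = deque()   # second node: acknowledgements of completed requests
dq = deque()    # first node: received requests awaiting dispatch

def send_request(req):
    rq.append(req)          # cached in RQ while the response is pending
    dq.append(req)          # RDMA transfer delivers it into the peer's DQ

def service_request():
    req = dq.popleft()      # first service thread dispatches from the DQ
    caq.append({"req": req, "status": "done"})  # ack travels back into the CAQ

def report_completion():
    ack = caq.popleft()
    rq.remove(ack["req"])   # completion confirmed; retire the entry from the RQ
    return ack


send_request("read-page-7")
service_request()
ack = report_completion()
```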
As can be seen from the above analysis, the RDMA-based communication method provided in the embodiment of the present invention divides part of the resources of the first shared resource in the first storage node into a plurality of first-level caches that are isolated from each other at the granularity of the first processing core, that is, the first-level caches are isolated according to the first processing cores participating in RDMA communication. The first service thread groups correspondingly disposed on the first processing cores are likewise isolated from each other, each first service thread group includes at least one first service thread, and each first service thread has a correspondingly disposed first RDMA message queue. Therefore, when a first RDMA message queue receives an access request, the access can be performed in isolation; when a first service thread performs RDMA communication, it can obtain resources only from the first-level cache on the first processing core where it runs, which reduces the communication performance degradation caused by resource contention during data interaction. The steps performed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource from at least the first-level cache of the corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
Therefore, when the first storage node needs to process multiple message queues at the same time, the first service threads of different first service thread groups can process them, each accessing in isolation the first-level cache on the first processing core corresponding to its own service thread group. Contention for shared resources is thus avoided during RDMA data transmission, and RDMA communication transmission performance is improved.
Example Four
Referring to fig. 14, which is a schematic structural diagram of a storage node 1 'according to an embodiment of the present invention, the storage node 1' includes:
a second processing core 10';
the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node;
as shown in fig. 1, the plurality of first-level caches are isolated from each other at the granularity of the first processing core on which each first-level cache resides. Each first processing core is correspondingly provided with a first service thread group, each first service thread group includes at least one first service thread, and after the first RDMA message queue correspondingly disposed for a first service thread receives a resource access request sent by the second RDMA message queue of the second storage node, the first service thread achieves lock-free access to the first-level cache resource.
The second shared resource 20' comprises a plurality of first-level caches, and the first-level caches are mutually isolated by taking the second processing core as granularity;
therefore, the first service thread acquires the required resource from the first-level cache on the first processing core corresponding to the first service thread group to which it belongs. As can be seen from fig. 1, part of the first shared resource may be divided into a plurality of first-level caches. The shared resources involved in the communication method provided by the embodiment of the present invention include pages, RDMA pre-registration information, message buffer queues and the like used during RDMA data transmission.
After acquiring the required resource, the first service thread processes a resource access request, such as a read request or a write request, based on the required resource, thereby completing a service initiated by the second storage node.
Second RDMA message queues 40 'corresponding one-to-one to the second processing cores, for sending resource access requests to the first RDMA message queues 40 of the first storage node, the first RDMA message queues 40 corresponding one-to-one to the second RDMA message queues 40'; and the number of the first and second groups,
after the first service thread processes the resource access request, it sends the processing result to a first RDMA message queue, and the first RDMA message queue sends the processing result to a second RDMA message queue so as to notify the second storage node. The first RDMA message queue here may be one of a plurality of first RDMA message queues corresponding to the first service thread, may be a first RDMA message queue different from the one that received the resource access request, or may be the only first RDMA message queue corresponding to the first service thread group to which the first service thread belongs, as described below.
The RDMA-based communication method provided by the embodiment of the present invention uses an RDMA network card chip to perform high-bandwidth, low-delay data interaction between different storage nodes. During RDMA data transmission under a multi-core CPU architecture, first service threads on different first processing cores contend for shared resources when processing message buffer queues, RDMA queues and RDMA pre-registered memory information, which increases delay and reduces bandwidth during RDMA data transmission. The communication method provided by the embodiment of the present invention isolates part of the shared resources at the granularity of the first processing core, so as to achieve lock-free access to the shared resources and reduce the loss caused by contention during RDMA data transmission, thereby solving the problem of degraded communication performance when multiple message queues need to be processed by multiple first service threads simultaneously and improving RDMA communication performance in a distributed storage system.
The second RDMA message queue 40' is further configured to receive the processing result sent by the first RDMA message queue 40 after the first storage node processes the resource access request based on the required resource.
In the embodiment of the present invention, a working thread (for example, the DDF working thread shown in fig. 8) is started on each first processing core to run the above-mentioned message communication function and resource management function. The working thread may register a callback function on the RDMA transceiving queue QP; when an RDMA message queue has pending work, the callback function wakes the working thread to process that RDMA message queue, and the working thread also performs the work related to the resource management function.
It can be seen that the RDMA-based communication method provided by the embodiment of the present invention performs lock-free design on memory resources, such as pages, in the RDMA data transmission process, reduces the delay caused by invalid wait loss in the RDMA data transmission process, reduces the delay of RDMA data transmission, and improves the bandwidth of RDMA data transmission.
Referring to fig. 15, in some embodiments, the storage node 1' provided by the embodiment of the present invention further includes second service thread groups 30', which correspond one-to-one to the second processing cores 10' and are isolated from each other, each second service thread group 30' including at least one second service thread 300'. Before the second RDMA message queue 40' sends a resource access request to the first RDMA message queue 40 of the first storage node 1, the second service thread group 30' is configured to:
establish a second user message buffer queue, where the second user message buffer queues correspond one-to-one to the second RDMA message queues, and the second user message buffer queue includes a second read-write request queue RQ and a second read-write completion queue CAQ; and,
the RDMA-based communication method provided by the embodiment of the present invention adopts lock-free sequential access to the first shared resource during RDMA data transmission. The lock-free design of the user message buffer queues reduces the delay caused by invalid waiting during RDMA data transmission, thereby reducing the delay of RDMA data transmission and increasing its bandwidth. The first user message buffer queues correspond one-to-one to the first RDMA message queues; on the first storage node acting as the receiving end, the established first user message buffer queue includes a first read-write dispatch queue DQ. The second service thread 300' is configured to: put the resource access request into the second read-write request queue RQ to obtain the resource access request;
in the process of RDMA data transmission, such as in the process of RDMA READ or RDMA WRITE primitive operation, a start confirmation of resource access such as a page and an end confirmation of resource access need to be performed, so that a sending end or a receiving end can know the completion status of data processing, such as remote READ or remote write.
send the resource access request to the second RDMA message queue; and,
the message communication module completes the access state confirmation by establishing a user message buffer queue on each first processing core participating in the RDMA communication, which comprises the following specific steps:
Read-write request queue RQ (full name: request queue): caches the sent resource access requests while awaiting response results;
Read-write dispatch queue DQ (full name: dispatch queue): caches the received resource access requests;
Read-write completion queue CAQ (full name: completion ack queue): caches the response results of completed resource access requests.
The resource relationship among the read-write request queue RQ, the RDMA transceiving queue QP and the completion queue CQ is 1:1:1. After the second RDMA message queue 40' receives the processing result sent by the first RDMA message queue 40, the second service thread 300' is further configured to:
putting the processing result into a second read-write completion queue CAQ to obtain a processing result;
A resource access request message of the second storage node is cached in the RQ queue of the second storage node, and an acknowledgement message of the completed resource access request is replied after the first storage node completes data processing such as an RDMA READ or RDMA WRITE operation. Specifically, the resource access request sent by the second storage node is cached in the RQ queue and delivered in order to the DQ queue of the first node through RDMA data transmission; the first service thread takes the resource access request from the DQ and delivers it to the designated service for data processing; after the resource access request is processed, an acknowledgement message of the completed request is sent through RDMA data transmission to the CAQ on the second node, and the second service thread reports to the service according to the acknowledgement message.
Check the resource access request in the second read-write request queue RQ based on the processing result and then report to the service. The start and the end of a resource access request are handled on the same second processing core, so as to achieve lock-free access to the RQ resource.
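The rule that a request's start and end are handled on the same second processing core can be sketched as follows (illustrative Python; the core count and token layout are invented): each core keeps its own RQ and CAQ, and the completion is routed back to the core that started the request, so that core's RQ never needs a lock.

```python
# Per-core queue sets on the second (sending) node.
per_core = {c: {"rq": [], "caq": []} for c in range(2)}

def start_request(core, req):
    per_core[core]["rq"].append(req)   # start: cached in this core's RQ
    return (core, req)                 # the token carries the originating core

def complete_request(token, result):
    core, req = token                  # end: routed back to the same core
    per_core[core]["caq"].append({"req": req, "result": result})
    per_core[core]["rq"].remove(req)   # retire from the same core's RQ


token = start_request(1, "write-page-3")
complete_request(token, "ok")
```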
As can be seen from the above analysis, the RDMA-based communication method provided in the embodiment of the present invention divides part of the resources of the first shared resource in the first storage node into a plurality of first-level caches that are isolated from each other at the granularity of the first processing core, that is, the first-level caches are isolated according to the first processing cores participating in RDMA communication. The first service thread groups correspondingly disposed on the first processing cores are likewise isolated from each other, each first service thread group includes at least one first service thread, and each first service thread has a correspondingly disposed first RDMA message queue. Therefore, when a first RDMA message queue receives an access request, the access can be performed in isolation; when a first service thread performs RDMA communication, it can obtain resources only from the first-level cache on the first processing core where it runs, which reduces the communication performance degradation caused by resource contention during data interaction. The steps performed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource from at least the first-level cache of the corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
Therefore, when the first storage node needs to process multiple message queues at the same time, the first service threads of different first service thread groups can process them, each accessing in isolation the first-level cache on the first processing core corresponding to its own service thread group. Contention for shared resources is thus avoided during RDMA data transmission, and RDMA communication transmission performance is improved.
Example Five
In addition, an embodiment of the present invention provides a distributed storage system, which includes the storage nodes shown in fig. 11 to fig. 15. As shown in fig. 12, the storage node 1 includes:
a first processing core 10;
as shown in fig. 1, the plurality of first-level caches are isolated from each other at the granularity of the first processing core on which each first-level cache resides. Each first processing core is correspondingly provided with a first service thread group, each first service thread group includes at least one first service thread, and after the first RDMA message queue correspondingly disposed for a first service thread receives a resource access request sent by the second RDMA message queue of the second storage node, the first service thread achieves lock-free access to the first-level cache resource.
The first shared resource 20 comprises a plurality of first-level caches, and the first-level caches are isolated from each other by taking the first processing core as granularity;
therefore, the first service thread acquires the required resource from the first-level cache on the first processing core corresponding to the first service thread group to which it belongs. As can be seen from fig. 1, part of the first shared resource may be divided into a plurality of first-level caches. The shared resources involved in the communication method provided by the embodiment of the present invention include pages, RDMA pre-registration information, message buffer queues and the like used during RDMA data transmission.
The first service thread groups 30 correspond to the first processing cores one by one, the first service thread groups are isolated from one another, and each first service thread group comprises at least one first service thread;
after acquiring the required resource, the first service thread processes a resource access request, such as a read request or a write request, based on the required resource, thereby completing a service initiated by the second storage node.
And after the first business thread processes the resource access request, sending the processing result to a first RDMA message queue, and sending the processing result to a second RDMA message queue by the first RDMA message queue so as to inform a second storage node. The first RDMA message queue may be one of a plurality of first RDMA message queues corresponding to the first business thread, may be a first RDMA message queue different from the foregoing receive resource access request, or may be only one first RDMA message queue corresponding to the first business thread group in which the first business thread is located, as described below.
The first RDMA message queue 40 is arranged corresponding to the first service thread and used for receiving resource access requests sent by a second RDMA message queue of a second storage node;
the RDMA-based communication method provided by the embodiment of the invention utilizes the RDMA network card chip to perform high-bandwidth low-delay data interaction between different storage nodes. The communication method provided by the embodiment of the invention can solve the problem of communication performance reduction caused by contention of shared resources when multiple message queues need to be processed by multiple first service threads simultaneously in the RDMA data transmission process, the RDMA communication performance in a distributed storage system is improved, and in the RDMA data transmission process under a multi-core CPU architecture, the first service threads on different first processing cores can contend on the shared resources when processing message cache queues, RDMA queues and RDMA pre-registered memory information, so that the problems of delay increase and bandwidth reduction in the RDMA data transmission process are caused. And isolating part of resources in the shared resources by taking the first processing core as granularity so as to realize the shared resources and reduce the loss caused by contention in the RDMA data transmission process.
A first service thread 300, configured to obtain a required resource from at least a first-level cache of a corresponding first processing core; and, for processing resource access requests based on the required resources;
in the embodiment of the present invention, a working thread (for example, a DDF working thread shown in fig. 8) is started on each first processing core to run the message communication function and the resource management function, the working thread may register a callback function on the RDMA transceiving queue QP, and when there is an RDMA message queue, the callback function is executed to wake up the RDMA message queue that is completed by the waiting first working thread and perform work related to the resource management function.
The first RDMA message queue 40 also sends the processing results to the second RDMA message queue.
It can be seen that the RDMA-based communication method provided by the embodiment of the present invention applies a lock-free design to memory resources, such as pages, used during RDMA data transmission, which avoids the delay caused by futile lock waiting, reduces the latency of RDMA data transmission, and improves the bandwidth of RDMA data transmission.
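The lock-free access described above can be sketched in miniature. This is a hedged illustration only, not the patent's implementation: names such as `CoreLocalCache` are hypothetical, and the point is simply that each first-level cache is touched only by the service threads of the thread group bound to its core, so no lock is needed on the hot path.

```python
class CoreLocalCache:
    """First-level cache private to one processing core (illustrative)."""
    def __init__(self, core_id, pages):
        self.core_id = core_id
        self.free_pages = list(pages)   # page pool pre-registered for RDMA

    def acquire_page(self):
        # Lock-free by construction: only threads of this core's
        # service thread group ever call into this cache.
        return self.free_pages.pop() if self.free_pages else None

    def release_page(self, page):
        self.free_pages.append(page)

# One isolated cache per core; threads never cross core boundaries.
caches = {core: CoreLocalCache(core, [f"page-{core}-{i}" for i in range(4)])
          for core in range(2)}

page = caches[0].acquire_page()   # served from core 0's private pool
```

Because no cache is ever shared between cores, the acquire/release path needs no mutex and no atomic contention, which is the "invalid wait" loss the text says the design avoids.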
As can be seen from the above analysis, the RDMA-based communication method provided in the embodiment of the present invention is suitable for a scenario in which part of a first shared resource in a first storage node is divided into a plurality of first-level caches that are isolated from each other at the granularity of the first processing cores participating in RDMA communication; the first service thread groups corresponding to the first processing cores are likewise isolated from each other, each first service thread group includes at least one first service thread, and each first service thread is correspondingly provided with a first RDMA message queue. Therefore, when a first RDMA message queue receives an access request, access is isolated, and a first service thread performing RDMA communication can obtain resources only from the first-level cache on the first processing core on which it runs, which mitigates the communication performance degradation caused by resource contention during data interaction. The steps performed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource at least from the first-level cache of the corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
Therefore, when the first storage node needs to process multiple message queues simultaneously, first service threads from different first service thread groups can handle them, each accessing in isolation the first-level cache on the first processing core corresponding to its own thread group. Contention for shared resources is thus avoided during RDMA data transmission, and RDMA communication transmission performance is improved.
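The four steps performed on the first storage node side (receive the request, fetch the required resource from the core-local cache, process it, reply) can be modeled as follows. This is a hedged sketch using plain Python deques in place of the real RDMA message queues; all names are illustrative, not the patent's implementation.

```python
from collections import deque

def service_thread_loop(rx_queue, tx_queue, core_cache):
    # rx_queue models the first RDMA message queue; tx_queue the reply path.
    while rx_queue:
        request = rx_queue.popleft()                # step 10: receive resource access request
        resource = core_cache.get(request["key"])   # step 20: required resource from level-one cache
        result = {"id": request["id"],              # step 30: process read/write based on resource
                  "data": resource}
        tx_queue.append(result)                     # step 40: send processing result back

rx = deque([{"id": 1, "key": "blk7"}])
tx = deque()
service_thread_loop(rx, tx, core_cache={"blk7": b"hello"})
```

Each service thread runs such a loop over its own queue and its own core's cache, which is why no two threads ever race on the same state.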
Example Six
A storage medium provided in an embodiment of the present invention is a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the RDMA-based communication method shown in fig. 1 to fig. 15, specifically the following steps:
step 10: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node;
As shown in fig. 1, the plurality of first-level caches are isolated from each other at the granularity of the first processing core on which each cache resides; each first processing core is provided with a corresponding first service thread group, and each service thread group includes at least one first service thread. After the first RDMA message queue arranged in correspondence with a first service thread receives a resource access request sent by the second RDMA message queue of the second storage node, that first service thread achieves lock-free access to the first-level cache resource.
Step 20: the first service thread at least obtains the required resource from the first-level cache of the corresponding first processing core;
Accordingly, the first service thread acquires the required resource from the first-level cache on the first processing core corresponding to the first service thread group to which it belongs. As can be seen from fig. 1, part of the first shared resource may be divided into the plurality of first-level caches. The shared resources involved in the communication method provided by the embodiment of the invention include the pages, RDMA pre-registration information, message buffer queues and the like used during RDMA data transmission.
Step 30: the first business thread processes the resource access request based on the required resource;
after acquiring the required resource, the first service thread processes a resource access request, such as a read request or a write request, based on the required resource, thereby completing a service initiated by the second storage node.
After the first business thread has processed the resource access request, it sends the processing result to a first RDMA message queue, and that first RDMA message queue sends the processing result to the second RDMA message queue so as to notify the second storage node. The first RDMA message queue here may be one of a plurality of first RDMA message queues corresponding to the first business thread, may be a first RDMA message queue different from the one that received the resource access request, or may be the only first RDMA message queue corresponding to the first business thread group to which the first business thread belongs, as described below.
Step 40: the first RDMA message queue sends the processing result to the second RDMA message queue.
The RDMA-based communication method provided by the embodiment of the invention uses an RDMA network card chip to perform high-bandwidth, low-delay data interaction between different storage nodes. During RDMA data transmission under a multi-core CPU architecture, first service threads on different first processing cores may contend for shared resources when processing message cache queues, RDMA queues and RDMA pre-registered memory information, which increases delay and reduces bandwidth during RDMA data transmission. The communication method provided by the embodiment of the invention isolates part of the shared resources at the granularity of the first processing core, so that when multiple message queues need to be processed by multiple first service threads simultaneously, contention for shared resources is avoided; this reduces the loss caused by contention during RDMA data transmission and improves RDMA communication performance in the distributed storage system.
In the embodiment of the present invention, a worker thread (for example, the DDF worker thread shown in fig. 8) is started on each first processing core to run the above-mentioned message communication function and resource management function. The worker thread may register a callback function on the RDMA send/receive queue pair QP; when an RDMA message queue completes, the callback function is executed to wake up the worker thread waiting on that RDMA message queue, and the worker thread then performs the work related to the resource management function.
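The worker-thread and callback mechanism just described can be sketched as follows. This models only the control flow (a completion callback registered on a queue pair wakes the worker waiting on the completed message queue); it does not touch a real RDMA stack such as libibverbs, and every name here is illustrative.

```python
class QueuePair:
    """Stand-in for an RDMA send/receive queue pair (QP)."""
    def __init__(self):
        self._on_completion = None

    def register_callback(self, fn):
        # The per-core worker thread installs its wake-up hook here.
        self._on_completion = fn

    def complete(self, msg_queue_id):
        # Fired when an RDMA message queue finishes a transfer;
        # executing the callback wakes the waiting worker.
        if self._on_completion:
            self._on_completion(msg_queue_id)

woken = []
qp = QueuePair()
qp.register_callback(lambda mq: woken.append(mq))  # worker's wake-up hook
qp.complete("mq-0")                                # completion event arrives
```

The design choice here mirrors the text: the worker sleeps instead of polling, and only the completion event (via the callback) makes it runnable again, so idle cores burn no cycles.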
It can be seen that the RDMA-based communication method provided by the embodiment of the present invention applies a lock-free design to memory resources, such as pages, used during RDMA data transmission, which avoids the delay caused by futile lock waiting, reduces the latency of RDMA data transmission, and increases the bandwidth of RDMA data transmission.
As can be seen from the above analysis, the RDMA-based communication method provided in the embodiment of the present invention is suitable for a scenario in which part of a first shared resource in a first storage node is divided into a plurality of first-level caches that are isolated from each other at the granularity of the first processing cores participating in RDMA communication; the first service thread groups corresponding to the first processing cores are likewise isolated from each other, each first service thread group includes at least one first service thread, and each first service thread is correspondingly provided with a first RDMA message queue. Therefore, when a first RDMA message queue receives an access request, access is isolated, and a first service thread performing RDMA communication can obtain resources only from the first-level cache on the first processing core on which it runs, which mitigates the communication performance degradation caused by resource contention during data interaction. The steps performed on the first storage node side include: the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node; the first service thread obtains the required resource at least from the first-level cache of the corresponding first processing core; the first service thread processes the resource access request based on the required resource; and the first RDMA message queue sends the processing result to the second RDMA message queue.
Therefore, when the first storage node needs to process multiple message queues simultaneously, first service threads from different first service thread groups can handle them, each accessing in isolation the first-level cache on the first processing core corresponding to its own thread group. Contention for shared resources is thus avoided during RDMA data transmission, and RDMA communication transmission performance is improved.
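Claim 2 below describes a watermark-driven fallback for obtaining the required resource: when the level-one cache's resource quota margin drops below a first watermark, the service thread refills from the per-NUMA-node secondary cache, and when that margin in turn drops below a second watermark, it falls back to the operating system. A minimal sketch of that decision logic, with illustrative watermark values (the patent does not specify any numbers):

```python
def acquire_resource(l1_margin, l2_margin, first_watermark=16, second_watermark=64):
    """Return the tier the required resource would be taken from."""
    if l1_margin >= first_watermark:
        return "level-one cache"       # normal lock-free fast path
    if l2_margin >= second_watermark:
        return "secondary cache"       # per-NUMA-node refill tier
    return "operating system"          # final fallback allocation
```

For example, `acquire_resource(32, 128)` stays on the fast path, `acquire_resource(4, 128)` spills to the secondary cache, and `acquire_resource(4, 8)` must go to the operating system.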
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The system, apparatus, module or unit illustrated in one or more of the above embodiments may be implemented by a computer chip or an entity, or by an article of manufacture with a certain functionality. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable storage media, including both volatile and non-volatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The foregoing description of specific embodiments has been presented for purposes of illustration and description. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (18)

1. A communication method based on Remote Direct Memory Access (RDMA), applicable to a scenario in which a first shared resource of a first storage node comprises a plurality of first-level caches, the first-level caches are isolated from each other at the granularity of a first processing core, first service thread groups corresponding to the first processing cores are isolated from each other, each first service thread group comprises at least one first service thread, and the first service thread is correspondingly provided with a first RDMA message queue, the steps executed on the first storage node side comprising:
the first RDMA message queue receives a resource access request sent by a second RDMA message queue of a second storage node;
the first service thread at least acquires required resources from the first-level cache of the corresponding first processing core;
the first business thread processes the resource access request based on the required resource;
the first RDMA message queue sends the processing result to the second RDMA message queue.
2. The communication method according to claim 1, wherein, when the storage node adopts a NUMA architecture, the first shared resource further includes a plurality of secondary caches, the secondary caches are isolated from each other at the granularity of a first NUMA node, and the acquiring, by the first service thread, the required resource from at least the primary cache of the corresponding first processing core specifically includes:
when the resource quota margin of the first-level cache is lower than a first resource quota watermark value, the first service thread acquires the required resource from the secondary cache corresponding to the first processing core; or
when the resource quota margin of the secondary cache is lower than a second resource quota watermark value, the first service thread acquires the required resource from an operating system of the storage node.
3. The communication method according to claim 1, wherein the resource access request includes a key value corresponding to the required resource, and the acquiring, by the first service thread, the required resource from the level one cache of the corresponding first processing core specifically includes:
the first service thread searches a mapping table based on the key value to obtain the address of the required resource;
and the first business thread acquires the required resource based on the address.
4. The communication method of claim 3, prior to the first RDMA message queue receiving a resource access request sent by a second RDMA message queue of a second storage node, the method further comprising:
performing address registration on the first shared resource to an RDMA network card;
creating the mapping table, wherein the mapping table comprises the corresponding relation between the address of the first shared resource and a key value;
sharing the mapping table with the second storage node.
5. The communication method of any of claims 1 to 4, wherein first RDMA message queues are in one-to-one correspondence with the first processing cores, the first RDMA message queues are in one-to-one correspondence with the second RDMA message queues, and the acquiring, by the first business thread, the required resource from at least the level one cache of the corresponding first processing core specifically includes:
the first business thread in the first business thread group corresponding to the first processing core at least obtains the required resource from the first-level cache of the corresponding first processing core;
correspondingly, the sending the processing result to the second RDMA message queue by the first RDMA message queue specifically includes:
and the first RDMA message queue sends the processing result to the second RDMA message queue corresponding one-to-one to the first RDMA message queue.
6. The communication method of claim 5, prior to the first RDMA message queue receiving a resource access request sent by a second RDMA message queue of a second storage node, the method further comprising:
establishing a first user message cache queue, wherein the first user message cache queue corresponds to the first RDMA message queue in a one-to-one correspondence manner, and the first user message cache queue comprises a first read-write dispatch queue DQ;
correspondingly, after the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, before the first service thread obtains the required resource from at least the level-one cache of the corresponding first processing core, the method further includes:
and the first service thread puts the resource access request into the first read-write dispatching queue DQ to obtain the resource access request.
7. An RDMA-based communication method, applicable to a scenario in which a second shared resource in a second storage node includes a plurality of first-level caches, the first-level caches are isolated from each other at the granularity of a second processing core, and a second RDMA message queue is arranged in correspondence with the second processing core, the steps executed on the second storage node side comprising:
a second RDMA message queue sends a resource access request to a first RDMA message queue of a first storage node, wherein the first RDMA message queue corresponds to the second RDMA message queue in a one-to-one mode;
after the first storage node processes the resource access request based on the required resource, a second RDMA message queue receives a processing result sent by the first RDMA message queue.
8. The communications method of claim 7, wherein second business thread groups corresponding to the second processing cores are isolated from each other, each second business thread group comprises at least one second business thread, and before the second RDMA message queue sends the resource access request to the first RDMA message queue of the first storage node, the method further comprises:
establishing a second user message buffer queue, wherein the second user message buffer queue corresponds to the second RDMA message queue one by one, and the second user message buffer queue comprises a second read-write request queue RQ and a second read-write completion queue CAQ;
the second business thread puts the resource access request into the second read-write request queue RQ so as to obtain the resource access request;
the second business thread sends the resource access request to the second RDMA message queue;
correspondingly: after the second RDMA message queue receives the processing result sent by the first RDMA message queue, the method further includes:
the second service thread puts the processing result into the second read-write completion queue CAQ to obtain the processing result;
and the second service thread reports the service after checking the resource access request in the second read-write request queue RQ based on the processing result.
9. A storage node, comprising:
a first processing core;
the first shared resource comprises a plurality of first-level caches, and the first-level caches are mutually isolated by taking the first processing core as granularity;
the first service thread groups correspond to the first processing cores one by one, the first service thread groups are isolated from one another, and each first service thread group comprises at least one first service thread;
the first RDMA message queue is arranged corresponding to the first service thread and used for receiving a resource access request sent by a second RDMA message queue of a second storage node;
the first service thread is used for acquiring required resources from the first-level cache of the corresponding first processing core; and, for processing the resource access request based on the required resource;
the first RDMA message queue is further configured to send the processing result to the second RDMA message queue.
10. The storage node of claim 9, wherein, when the storage node employs a NUMA architecture, the first shared resource further comprises a plurality of secondary caches, the secondary caches being isolated from each other at the granularity of a first NUMA node, and the first business thread is further configured to:
when the resource quota margin of the first-level cache is lower than a first resource quota watermark value, acquire the required resource from the secondary cache corresponding to the first processing core; or,
when the resource quota margin of the secondary cache is lower than a second resource quota watermark value, acquire the required resource from an operating system of the storage node.
11. The storage node of claim 9, wherein the resource access request includes a key value corresponding to the required resource, and wherein the first business thread is further configured to:
searching a mapping table based on the key value to obtain the address of the required resource;
and acquiring the required resource based on the address.
12. The storage node of claim 11, further comprising a mapping table creation module configured, before the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, to:
performing address registration on the first shared resource to an RDMA network card;
creating the mapping table, wherein the mapping table comprises the corresponding relation between the address of the first shared resource and a key value;
sharing the mapping table with the second storage node.
13. The storage node of any of claims 9 to 12, wherein first RDMA message queues are in one-to-one correspondence with the first processing cores, the first RDMA message queues are in one-to-one correspondence with the second RDMA message queues, and the first business thread in the first business thread group corresponding to the first processing core is further configured to:
obtaining the required resource at least from the first-level cache of the corresponding first processing core;
correspondingly, the sending the processing result to the second RDMA message queue by the first RDMA message queue correspondingly includes:
and the first RDMA message queue sends the processing result to the second RDMA message queue corresponding one-to-one to the first RDMA message queue.
14. The storage node of claim 13, prior to the first RDMA message queue receiving a resource access request sent by a second RDMA message queue of a second storage node, the first business thread further to:
establishing a first user message cache queue, wherein the first user message cache queue corresponds to the first RDMA message queue in a one-to-one correspondence manner, and the first user message cache queue comprises a first read-write dispatch queue DQ;
correspondingly, after the first RDMA message queue receives the resource access request sent by the second RDMA message queue of the second storage node, before the first service thread obtains the required resource from at least the level-one cache of the corresponding first processing core, the first service thread is further configured to:
and putting the resource access request into the first read-write dispatching queue DQ to obtain the resource access request.
15. A storage node, comprising:
a second processing core;
the second shared resource comprises a plurality of first-level caches, and the first-level caches are mutually isolated by taking the second processing core as granularity;
the second RDMA message queue corresponds to the second processing core one by one and is used for sending a resource access request to the first RDMA message queue of the first storage node, and the first RDMA message queue corresponds to the second RDMA message queue one by one; and the number of the first and second groups,
the second RDMA message queue is further configured to receive a processing result sent by the first RDMA message queue after the first storage node processes the resource access request based on a required resource.
16. The storage node of claim 15, further comprising second business thread groups in one-to-one correspondence with the second processing cores, the second business thread groups being isolated from each other, each second business thread group comprising at least one second business thread, the second business thread being configured, before the resource access request is sent to the first RDMA message queue of the first storage node, to:
establishing a second user message buffer queue, wherein the second user message buffer queue corresponds to the second RDMA message queue one by one, and the second user message buffer queue comprises a second read-write request queue RQ and a second read-write completion queue CAQ;
the second business thread puts the resource access request into the second read-write request queue RQ to obtain the resource access request;
the second business thread is further used for sending the resource access request to a second RDMA message queue; and after receiving the processing result sent by the first RDMA message queue, the second business thread is further configured to:
putting the processing result into the second read-write completion queue CAQ to obtain the processing result;
and checking the resource access request in the second read-write request queue RQ based on the processing result and then reporting the service.
17. A distributed storage system comprising a storage node according to any of claims 9 to 16.
18. A storage medium for computer readable storage, the storage medium storing one or more programs which, when executed by one or more processors, perform the steps of the RDMA-based communication method of any of claims 1 to 8.
CN202011630087.2A 2020-12-31 2020-12-31 RDMA-based communication method, node, system and medium Pending CN114691382A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011630087.2A CN114691382A (en) 2020-12-31 2020-12-31 RDMA-based communication method, node, system and medium
PCT/CN2021/122334 WO2022142562A1 (en) 2020-12-31 2021-09-30 Rdma-based communication method, node, system, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011630087.2A CN114691382A (en) 2020-12-31 2020-12-31 RDMA-based communication method, node, system and medium

Publications (1)

Publication Number Publication Date
CN114691382A true CN114691382A (en) 2022-07-01

Family

ID=82134855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011630087.2A Pending CN114691382A (en) 2020-12-31 2020-12-31 RDMA-based communication method, node, system and medium

Country Status (2)

Country Link
CN (1) CN114691382A (en)
WO (1) WO2022142562A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858160A (en) * 2022-12-07 2023-03-28 江苏为是科技有限公司 Remote direct memory access virtualization resource allocation method and device and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN106657365B (en) * 2016-12-30 2019-12-17 清华大学 RDMA (remote direct memory Access) -based high-concurrency data transmission method
CN109491809A (en) * 2018-11-12 2019-03-19 西安微电子技术研究所 A kind of communication means reducing high-speed bus delay
CN111277616B (en) * 2018-12-04 2023-11-03 中兴通讯股份有限公司 RDMA-based data transmission method and distributed shared memory system
CN111064680B (en) * 2019-11-22 2022-05-17 华为技术有限公司 Communication device and data processing method
US20200104275A1 (en) * 2019-12-02 2020-04-02 Intel Corporation Shared memory space among devices

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858160A (en) * 2022-12-07 2023-03-28 江苏为是科技有限公司 Remote direct memory access virtualization resource allocation method and device and storage medium
CN115858160B (en) * 2022-12-07 2023-12-05 江苏为是科技有限公司 Remote direct memory access virtualized resource allocation method and device and storage medium

Also Published As

Publication number Publication date
WO2022142562A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
CN111277616B (en) RDMA-based data transmission method and distributed shared memory system
CN110191194B (en) RDMA (remote direct memory Access) network-based distributed file system data transmission method and system
US9405574B2 (en) System and method for transmitting complex structures based on a shared memory queue
US20190079895A1 (en) System and method for maximizing bandwidth of pci express peer-to-peer (p2p) connection
US20050038941A1 (en) Method and apparatus for accessing a memory
CN111404931B (en) Remote data transmission method based on persistent memory
CN112948149A (en) Remote memory sharing method and device, electronic equipment and storage medium
US20240039995A1 (en) Data access system and method, device, and network adapter
CN113891396B (en) Data packet processing method and device, computer equipment and storage medium
CN114756388A (en) RDMA (remote direct memory Access) -based method for sharing memory among cluster system nodes as required
JP2017537404A (en) Memory access method, switch, and multiprocessor system
CN115374046B (en) Multiprocessor data interaction method, device, equipment and storage medium
CN105045729A (en) Method and system for conducting consistency processing on caches with catalogues of far-end agent
CN113596085A (en) Data processing method, system and device
CN114691382A (en) RDMA-based communication method, node, system and medium
CN113641604A (en) Data transmission method and system
CN111949422B (en) Data multi-level cache and high-speed transmission recording method based on MQ and asynchronous IO
WO2022199357A1 (en) Data processing method and apparatus, electronic device, and computer-readable storage medium
CN114157529B (en) Distributed data transmission system and method
KR20050080704A (en) Apparatus and method of inter processor communication
CN113778937A (en) System and method for performing transaction aggregation in a network on chip (NoC)
WO2016197607A1 (en) Method and apparatus for realizing route lookup
TWI775112B (en) System and method for accessing registers
CN114615208B (en) Back pressure information transmission and request sending method and device and network chip
CN115604291A (en) Data transmission method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination