CN116303173B - Method, device and system for reducing RDMA engine on-chip cache and chip - Google Patents

Method, device and system for reducing RDMA engine on-chip cache and chip

Info

Publication number
CN116303173B
CN116303173B
Authority
CN
China
Prior art keywords
queue entry
message
rdma
fpsn
cqe
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310564513.4A
Other languages
Chinese (zh)
Other versions
CN116303173A (en)
Inventor
萧启阳
黄勇平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yunbao Intelligent Co ltd
Original Assignee
Shenzhen Yunbao Intelligent Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yunbao Intelligent Co ltd filed Critical Shenzhen Yunbao Intelligent Co ltd
Priority to CN202310564513.4A priority Critical patent/CN116303173B/en
Publication of CN116303173A publication Critical patent/CN116303173A/en
Application granted granted Critical
Publication of CN116303173B publication Critical patent/CN116303173B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7839 Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • G06F 15/7842 Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
    • G06F 15/7846 On-chip cache and off-chip main memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for reducing RDMA engine on-chip caching, which comprises the following steps: reading the current send queue entry from host memory, the send queue entry containing at least an fPSN; generating an RDMA request message, sending it to the remote RDMA device, and deleting the send queue entry corresponding to the RDMA request message from the send queue entry cache; receiving and parsing the ACK message to obtain the QPN, MSN and PSN information it carries; looking up the QPC by the QPN and obtaining the expected consumer pointer from it; and reporting a completion queue entry carrying that pointer to the local host, so that the local host fetches the corresponding send queue entry and determines whether the completion queue entry needs to be reported to the upper-layer user. The invention also discloses a corresponding apparatus, system and chip. By implementing the invention, the cache resources of the RDMA engine can be saved.

Description

Method, device and system for reducing RDMA engine on-chip cache and chip
Technical Field
The present invention relates to the field of remote direct memory access (Remote Direct Memory Access, RDMA) technology, and in particular, to a method, apparatus, system and chip for reducing RDMA engine on-chip caching.
Background
When it receives an ACK message from the remote end, the RDMA engine judges from the MSN (message sequence number) carried in the ACK whether execution of the corresponding SQ WQE (send queue work queue entry) has finished; if so, it reports a CQE (Completion Queue Element) to the Host. When reporting the CQE, the RDMA engine needs to read the corresponding SQ WQE, whose cqe_ind field (CQE reporting indication) indicates whether that SQ WQE requires a CQE to be reported: the CQE is reported only if cqe_ind is 1, and need not be reported if cqe_ind is 0.
When the RDMA engine reads the SQ WQE from its own cache and the corresponding SQ WQE misses (is absent or has been evicted), the engine must re-read the SQ WQE from host memory over PCIE, which consumes PCIE bandwidth. If, instead, a larger on-chip cache is designed into the RDMA engine to hold the SQ WQEs of all CQEs not yet reported, the power consumption and area of the chip increase.
Therefore, how to avoid occupying PCIE bandwidth as far as possible on an SQ WQE cache miss, while also reducing chip area, is a problem to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method, an apparatus, a system and a chip for reducing RDMA engine on-chip caching, which can save the cache resources of the RDMA engine.
In order to solve the above technical problem, as one aspect of the present invention, a method for reducing RDMA engine on-chip caching is provided, comprising at least the following steps:
when a remote access operation is required, reading the current send queue entry SQ WQE from host memory and storing it into the SQ WQE cache;
generating an RDMA request message, sending it to the remote RDMA device, and deleting the SQ WQE corresponding to the RDMA request message from the SQ WQE cache;
receiving and parsing an ACK message from the remote RDMA device, and obtaining at least the queue number QPN carried in the ACK message;
acquiring the corresponding QPC from the queue pair context (QPC) cache according to the QPN, and obtaining the expected consumer pointer ECI from the QPC;
and reporting a completion queue entry CQE carrying the ECI to the host, so that the host fetches the corresponding SQ WQE from host memory according to the ECI and determines whether the CQE needs to be reported to the upper-layer user.
Wherein the local host acquiring the corresponding SQ WQE from host memory according to the ECI and determining whether the CQE needs to be reported to the upper-layer user further includes:
indexing the corresponding SQ WQE in host memory by the ECI, and deciding according to the cqe_ind field in that SQ WQE whether to report to the upper-layer user: if cqe_ind is 1, the CQE of that SQ WQE is reported to the upper-layer user; if cqe_ind is 0, the current CQE is discarded.
The method further includes:
after each CQE report is completed, taking the packet sequence number PSN carried by the ACK message plus 1 as the expected packet sequence number ePSN, and storing the ePSN into the QPC cache.
The method further includes:
each time a CQE is reported, incrementing the current ECI value by 1, and determining from the message sequence number MSN carried by the ACK message whether to continue reporting subsequent CQEs.
Wherein, further include: the transmission queue entry (SQ WQE) is provided with a first packet sequence number (fPSN) corresponding to the transmission queue entry;
in the process of receiving a message from remote RDMA equipment, if judging that breakpoint retransmission is required to be carried out on an RDMA request message, requesting to obtain a corresponding sending queue entry from a host memory, and determining the position of breakpoint retransmission at least according to a first packet sequence number (fPSN) stored in the sending queue entry.
Wherein, further include: the host generates a first packet sequence number fPSN of a corresponding message for each sending queue entry in the following manner, and stores the first packet sequence number fPSN into the sending queue entry in the host memory: :
if the SQ WQE is the first, the local end host negotiates to determine the fPSN when establishing a link with the remote RDMA equipment;
otherwise, calculating the fPSN corresponding to the current SQ WQE by adopting the following formula:
fpsn=fpsn+ceil of last SQ WQE (message/PMTU)
Wherein message length is the message length of the last SQ WQE, ceil is an upward rounding function, and PMTU is the length value of the path maximum transmission unit; when the message length is 0, ceil (message length/PMTU)) is assigned 1;
and storing the obtained fPSN into the SQWQE of the host memory.
Wherein determining the breakpoint retransmission position at least from the first packet sequence number (fPSN) stored in the send queue entry further comprises:
when breakpoint retransmission of an RDMA request message is required, requesting from host memory, according to the ECI stored in the QPC, the SQ WQE corresponding to that ECI;
and determining the first byte of the retransmission message from the fPSN in that SQ WQE and the current ePSN obtained from the QPC, by the following formula:
first byte of the retransmission message = first byte corresponding to the SQ WQE + (ePSN - fPSN) × PMTU.
Accordingly, in another aspect of the present invention, an apparatus for reducing RDMA engine on-chip caching is provided, applied to an RDMA engine and comprising at least:
a send queue entry reading unit, configured to read the current send queue entry SQ WQE from host memory and store it into the SQ WQE cache when a remote access operation is required;
a request sending processing unit, configured to generate an RDMA request message, send it to the remote RDMA device, and delete the SQ WQE corresponding to the RDMA request message from the SQ WQE cache;
a response message parsing unit, configured to receive and parse an ACK message from the remote RDMA device, and obtain at least the queue number QPN carried in the ACK message;
an ECI acquisition unit, configured to acquire the corresponding QPC from the queue pair context (QPC) cache according to the QPN, and obtain the expected consumer pointer ECI from the QPC;
and a CQE reporting unit, configured to report a completion queue entry CQE carrying the ECI to the host, so that the host fetches the corresponding SQ WQE from host memory according to the ECI and determines whether the CQE needs to be reported to the upper-layer user.
The apparatus further includes:
an ePSN processing unit, configured to take the PSN carried by the ACK message plus 1 as the expected packet sequence number ePSN after each CQE report is completed, and store the ePSN into the QPC in the QPC cache;
and a CQE continued-reporting judging unit, configured to increment the current ECI value by 1 after each CQE is reported, and determine from the message sequence number MSN carried by the ACK message whether to continue reporting subsequent CQEs.
The apparatus further includes:
a retransmission processing unit, configured to, when breakpoint retransmission of an RDMA request message is required, request from host memory, according to the ECI stored in the QPC, the SQ WQE corresponding to that ECI; and to calculate the first byte of the retransmission message from the fPSN in that SQ WQE and the current ePSN obtained from the QPC by the following formula: first byte of the retransmission message = first byte corresponding to the SQ WQE + (ePSN - fPSN) × PMTU.
Accordingly, in still another aspect of the present invention, there is provided a system for reducing on-chip buffering of an RDMA engine, at least including a local host, a host memory, and an RDMA engine, wherein:
the RDMA engine comprises a QPC buffer, a SQ WQE buffer and a device for reducing the on-chip buffer of the RDMA engine.
Wherein the local host further includes:
an fPSN calculation processing unit, configured to generate and store the first packet sequence number fPSN of the corresponding message as follows:
if the SQ WQE is the first one, its fPSN is the value negotiated by the local host when the connection with the remote RDMA device is established;
otherwise, the fPSN corresponding to the current SQ WQE is calculated by the following formula:
fPSN of current SQ WQE = fPSN of last SQ WQE + ceil(message length / PMTU)
where message length is the message length of the last SQ WQE, ceil is the round-up function, and PMTU is the path maximum transmission unit length; when the message length is 0, ceil(message length / PMTU) is taken as 1;
and a storage processing unit, configured to store the obtained fPSN into the SQ WQE in host memory.
Wherein the local host further includes:
a reporting judgment processing unit, configured to index the corresponding SQ WQE in host memory by the ECI carried in the completion queue entry CQE, and to determine from the cqe_ind in that SQ WQE whether to report to the upper-layer user: if cqe_ind is 1, the CQE of that SQ WQE is reported to the upper-layer user; if cqe_ind is 0, the current CQE is discarded.
In yet another aspect of the invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.
In yet another aspect of the invention, a chip is also provided that integrates a system for reducing RDMA engine on-chip caching as described above.
The embodiment of the invention has the following beneficial effects:
the invention provides a method, a device, a system and a chip for reducing RDMA engine on-chip caching. By deleting the sending queue entry after sending the RDMA request message to the remote RDMA device, and only carrying index corresponding to the SQ WQE when reporting the CQE, the local host obtains the opcode and CQE _ind corresponding to the SQ WQE according to the index, namely the RDMA engine does not need to read the SQ WQE when reporting the CQE, so that cache resources are not needed to be provided in the RDMA engine for caching the SQ WQE which is processed by the sending side but is not reported to the CQE, and the cache resources of the RDMA engine are saved.
In addition, by adding an fPSN field at the SQ WQE, the field is calculated when the SQ WQE is put into the SQ by the host. When the RDMA engine retransmits, the fPSN can be obtained only by reading the SQWQE of the current retransmission from the host memory, and then the fPSN can be combined with the ePSN in the QPC to obtain an accurate retransmission point, so that the breakpoint retransmission function is supported, and the occupation of PCIE bandwidth in the whole can be reduced.
Therefore, the invention can effectively reduce the use of the on-chip cache of the RDMA engine, thereby improving the link scale of the RDMA under the condition of limited on-chip cache, supporting the breakpoint retransmission function and reducing the occupation of PCIE bandwidth.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and other drawings may be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a schematic diagram illustrating the main flow of an embodiment of a method for reducing on-chip buffering of RDMA engines according to the present invention;
FIG. 2 is a schematic view of an application environment of the method according to the present invention;
FIG. 3 is a schematic diagram illustrating one embodiment of an apparatus for reducing on-chip buffering of RDMA engines according to the present invention;
fig. 4 is a schematic structural diagram of the host in fig. 3.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.
FIG. 1 is a schematic diagram illustrating the main flow of one embodiment of a method for reducing RDMA engine on-chip caching provided by the present invention; as also shown in fig. 2, in this embodiment the method runs in the RDMA engine and includes at least the following steps:
step S10, when a remote access operation is required, the RDMA engine reads the current send queue entry SQ WQE from host memory and stores it into the SQ WQE cache, wherein the SQ WQE includes at least the first packet sequence number fPSN of the corresponding message generated by the local host; since one message may consist of more than one packet, the first packet sequence number fPSN of the message is needed for positioning;
wherein the remote access operation includes communication primitives supported by RDMA, such as RDMA Write, RDMA Read and Send operations;
step S11, the RDMA engine generates an RDMA request message, sends it to the remote RDMA device, and deletes the SQ WQE corresponding to the RDMA request message from the SQ WQE cache;
step S12, the RDMA engine receives and parses a response message from the remote RDMA device, and when the response message is an ACK message, obtains the queue number QPN, the message sequence number MSN and the packet sequence number PSN carried in it;
step S13, the RDMA engine acquires the corresponding QPC from the queue pair context (QPC) cache according to the QPN, and obtains the expected consumer pointer ECI from the QPC;
step S14, the RDMA engine reports a completion queue entry CQE carrying the ECI to the host, so that the host fetches the corresponding SQ WQE from host memory according to the ECI and determines whether the CQE needs to be reported to the upper-layer user.
It will be appreciated that in a specific example of the invention, the RDMA engine also implements the following steps:
after each CQE report is completed, the RDMA engine takes the PSN carried by the ACK message plus 1 as the expected packet sequence number ePSN, and stores the ePSN into the QPC in the queue pair context cache.
Each time a CQE is reported, the RDMA engine increments the current ECI value by 1 and determines from the message sequence number MSN whether to continue reporting subsequent CQEs. It will be appreciated that the ECI is the expected consumer pointer stored in the QPC, an index identifying the SQ WQE of the next CQE to be reported. The MSN is the message sequence number, indicating that all SQ WQEs with index up to and including MSN - 1 have been executed. For example, when an ACK message is received and the MSN it carries is 10, the RDMA engine may report the CQEs of the SQ WQEs with index 0 to 9.
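The ECI/MSN bookkeeping described above can be illustrated with a short C sketch. The structure layout and the function names (engine_handle_ack, report_cqe) are hypothetical, and index/PSN wraparound is ignored; this is only a minimal model of the reporting loop and the ePSN update described in this embodiment, not the engine's actual implementation.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical, minimal QPC: only the fields this flow touches. */
struct qpc {
    uint32_t eci;   /* expected consumer pointer: index of the next SQ WQE to complete */
    uint32_t epsn;  /* expected packet sequence number */
};

/* Stand-in for reporting a CQE to the host; the CQE carries only the SQ WQE index. */
static void report_cqe(uint32_t qpn, uint32_t wqe_index)
{
    printf("CQE: qpn=%u sq_wqe_index=%u\n", (unsigned)qpn, (unsigned)wqe_index);
}

/* On an ACK carrying (qpn, msn, psn): every SQ WQE with index up to msn - 1 has
 * been executed, so CQEs are reported from the current ECI up to msn - 1, and
 * ePSN is updated to psn + 1. The engine never reads an SQ WQE back here.      */
static void engine_handle_ack(struct qpc *ctx, uint32_t qpn, uint32_t msn, uint32_t psn)
{
    while (ctx->eci < msn) {         /* report indexes ECI .. msn - 1        */
        report_cqe(qpn, ctx->eci);   /* the CQE carries the ECI as the index */
        ctx->eci++;
    }
    ctx->epsn = psn + 1;             /* expected PSN of the next packet      */
}

int main(void)
{
    struct qpc ctx = { .eci = 0, .epsn = 0 };
    engine_handle_ack(&ctx, 5, 10, 42);
    /* Reports CQEs for SQ WQE indexes 0..9; leaves eci == 10 and epsn == 43. */
    return 0;
}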
It can be understood that, in the specific example of the present invention, the RDMA engine further supports a breakpoint retransmission function: if, while receiving messages from the remote RDMA device, it is determined (according to a pre-defined error detection mechanism) that breakpoint retransmission of the RDMA request message is required, the corresponding send queue entry is requested from host memory, and the breakpoint retransmission position is determined at least from the first packet sequence number (fPSN) stored in that send queue entry. Specifically, this further includes:
when breakpoint retransmission of an RDMA request message is required, the RDMA engine requests from host memory, according to the ECI stored in the QPC, the SQ WQE corresponding to that ECI;
the RDMA engine calculates the first byte of the retransmission message from the fPSN in that SQ WQE and the current ePSN obtained from the QPC by the following formula:
first byte of the retransmission message = first byte corresponding to the SQ WQE + (ePSN - fPSN) × PMTU.
It will be appreciated that, in the specific example of the present invention, the local host also implements the corresponding functions.
Specifically, the host generates and stores the first packet sequence number fPSN of the corresponding message as follows:
if the SQ WQE is the first one, its fPSN is the value negotiated by the local host when the connection with the remote RDMA device is established;
otherwise, the fPSN corresponding to the current SQ WQE is calculated by the following formula:
fPSN of current SQ WQE = fPSN of last SQ WQE + ceil(message length / PMTU)
where message length is the message length of the last SQ WQE, ceil is the round-up function, and PMTU is the path maximum transmission unit length; when the message length is 0, ceil(message length / PMTU) is taken as 1;
and the obtained fPSN is stored into the SQ WQE in host memory.
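A minimal C sketch of this fPSN calculation follows. The function and parameter names are hypothetical and PSN wraparound (PSNs are finite-width counters in practice) is ignored; the sketch only restates the formula above.

#include <stdint.h>

/* ceil(msg_len / pmtu); a zero-length message still consumes one PSN,
 * as stated in this embodiment.                                        */
static uint32_t psn_count(uint32_t msg_len, uint32_t pmtu)
{
    if (msg_len == 0)
        return 1;
    return (msg_len + pmtu - 1) / pmtu;
}

/* fPSN of the current SQ WQE: the negotiated PSN for the first WQE,
 * otherwise the previous WQE's fPSN plus the number of packets the
 * previous message occupied.                                           */
static uint32_t next_fpsn(int is_first, uint32_t negotiated_psn,
                          uint32_t prev_fpsn, uint32_t prev_msg_len, uint32_t pmtu)
{
    if (is_first)
        return negotiated_psn;
    return prev_fpsn + psn_count(prev_msg_len, pmtu);
}

int main(void)
{
    uint32_t pmtu = 1024;
    uint32_t f0 = next_fpsn(1, 100, 0, 0, pmtu);    /* first WQE: negotiated 100    */
    uint32_t f1 = next_fpsn(0, 0, f0, 4096, pmtu);  /* 100 + ceil(4096/1024) = 104  */
    uint32_t f2 = next_fpsn(0, 0, f1, 0, pmtu);     /* zero-length message: 104 + 1 */
    return (f0 == 100 && f1 == 104 && f2 == 105) ? 0 : 1;
}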
It may be appreciated that, in a specific example of the present invention, the local host fetching the corresponding SQ WQE from host memory according to the ECI and determining whether the CQE needs to be reported to the upper-layer user further includes:
indexing the corresponding SQ WQE in host memory by the ECI, and deciding according to the cqe_ind in that SQ WQE whether to report to the upper-layer user: if cqe_ind is 1, the CQE of that SQ WQE is reported to the upper-layer user; if cqe_ind is 0, the current CQE is discarded.
A more detailed complete flow and the principle of the present invention are described below with reference to fig. 2:
Step 1: when the local host places a send queue entry SQ WQE into the send queue SQ, it calculates the first packet sequence number fPSN of the message corresponding to that SQ WQE and writes the fPSN into the SQ WQE stored in host memory.
The fPSN of the first SQ WQE is negotiated when the connection is established; the fPSN of a non-first SQ WQE is calculated from the message length and the PMTU of the previous SQ WQE: the fPSN of the current SQ WQE equals the fPSN of the previous SQ WQE plus ceil(message length / PMTU), where message length is the message length of the previous SQ WQE and ceil is the round-up function; when the message length is 0, ceil(message length / PMTU) is replaced by 1. In the SQ WQE provided in the embodiment of the present invention, the entry includes at least the fPSN, an operation code (opcode) and a CQE indication value (cqe_ind).
Step 2: the local host rings the send queue doorbell register (SQ doorbell).
Step 3: the RDMA engine reads the corresponding SQ WQE from host memory according to the information in the send queue doorbell register and stores it into the SQ WQE cache; however, the RDMA engine invalidates (deletes) the SQ WQE from the SQ WQE cache immediately after processing it, because the engine does not need to read the SQ WQE back from the cache when the CQE corresponding to that SQ WQE is later reported.
Step 4: the RDMA engine generates an RDMA request message and sends it to the remote end; this may be, for example, a SEND, RDMA READ or RDMA WRITE message.
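Steps 3 and 4 can be sketched in C as follows. The structure fields, the single-slot cache model and the helper names (dma_read_wqe, send_rdma_request) are hypothetical stand-ins rather than the engine's real interfaces; the point of the sketch is only the immediate invalidation of the cached SQ WQE once the request has been sent.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical SQ WQE with just the fields named in this embodiment. */
struct sq_wqe {
    uint32_t fpsn;     /* first packet sequence number of the message */
    uint8_t  opcode;   /* SEND / RDMA READ / RDMA WRITE, etc.         */
    uint8_t  cqe_ind;  /* 1: CQE must reach the upper-layer user      */
    uint32_t msg_len;
};

/* One slot of the on-chip SQ WQE cache. */
struct wqe_cache_slot {
    struct sq_wqe wqe;
    bool valid;
};

/* Placeholders for the PCIE read of the WQE and for emitting the request. */
static void dma_read_wqe(struct sq_wqe *dst, const struct sq_wqe *host_sq, uint32_t index)
{
    memcpy(dst, &host_sq[index], sizeof(*dst));
}

static void send_rdma_request(const struct sq_wqe *wqe) { (void)wqe; }

/* Doorbell handling per this embodiment: fetch the WQE, emit the request,
 * then invalidate the cached copy at once; it is never needed again,
 * because later CQE reporting carries only the ECI index.               */
static void engine_on_doorbell(struct wqe_cache_slot *slot,
                               const struct sq_wqe *host_sq, uint32_t index)
{
    dma_read_wqe(&slot->wqe, host_sq, index);  /* step 3: read over PCIE  */
    slot->valid = true;

    send_rdma_request(&slot->wqe);             /* step 4: request message */

    slot->valid = false;                       /* invalidate immediately  */
}

int main(void)
{
    struct sq_wqe host_sq[1] = { { .fpsn = 100, .opcode = 0, .cqe_ind = 1, .msg_len = 4096 } };
    struct wqe_cache_slot slot = { .valid = false };
    engine_on_doorbell(&slot, host_sq, 0);
    return slot.valid ? 1 : 0;   /* the slot is invalid again after sending */
}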
Step 5: after one round trip time (RTT), the RDMA engine receives an ACK or READ RESPONSE message from the network side; if it is an ACK message, step 6 is executed;
if it is not an ACK message and a breakpoint retransmission occurs, the RDMA engine needs to calculate the start address of the retransmission.
Step 6: if it is an ACK message, the engine parses it to obtain the QPN (queue number), MSN (message sequence number) and PSN (packet sequence number), where the MSN indicates that all SQ WQEs with index up to and including MSN - 1 have been executed. The RDMA engine first obtains the QPC (queue pair context) through the QPN, and then obtains the ECI (expected consumer pointer) from the QPC, which is the index of the SQ WQE whose CQE is to be reported next.
Step 7: the RDMA engine does not need to read the SQ WQE according to the ECI; it only needs to carry the ECI when reporting, that is, the ECI serves as the index of the SQ WQE whose CQE is being reported.
Step 8: after the CQE of the SQ WQE corresponding to the ECI has been reported, the RDMA engine increments the ECI by 1 and continues reporting CQEs until the CQE of the SQ WQE corresponding to index MSN - 1 has been reported.
It will be appreciated that the ECI indicates the index of the next SQ WQE to be completed, while the MSN indicates that all SQ WQEs with index up to and including MSN - 1 can be completed. For example, when an ACK message carrying MSN = 10 is received and the current ECI is 0, the RDMA engine may report the CQEs of the SQ WQEs with index 0 to 9.
Step 9: when the local host processes a CQE reported by the RDMA engine, it uses the ECI carried by the CQE (i.e. the index of the SQ WQE) to look up the corresponding opcode and cqe_ind in the SQ WQE. If cqe_ind is 1, the SQ WQE requires its CQE to be reported to the upper-layer user; if cqe_ind is 0, the SQ WQE does not require a CQE to be reported to the upper-layer user, and the CQE reported by the RDMA engine is discarded by the host.
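Step 9 can be illustrated with the following C sketch of the host-side CQE handler. The structure layouts and the function name host_handle_cqe are assumptions for illustration only; the sketch shows how the ECI carried in the CQE indexes the SQ WQE in host memory and how cqe_ind decides whether the completion reaches the upper-layer user.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layouts: only the fields this step uses. */
struct sq_wqe {
    uint32_t fpsn;
    uint8_t  opcode;
    uint8_t  cqe_ind;   /* 1: deliver the CQE to the upper layer, 0: drop it */
};

struct cqe {
    uint32_t qpn;
    uint32_t eci;       /* index of the SQ WQE this completion refers to */
};

/* Host-side handling of a CQE reported by the engine: index the SQ WQE in
 * host memory with the carried ECI, then either deliver the completion to
 * the upper-layer user or silently discard it.                            */
static void host_handle_cqe(const struct cqe *c, const struct sq_wqe *sq)
{
    const struct sq_wqe *wqe = &sq[c->eci];

    if (wqe->cqe_ind == 1)
        printf("deliver CQE to user: qpn=%u index=%u opcode=%u\n",
               (unsigned)c->qpn, (unsigned)c->eci, (unsigned)wqe->opcode);
    /* cqe_ind == 0: the CQE is dropped and nothing reaches the user. */
}

int main(void)
{
    struct sq_wqe sq[2] = {
        { .fpsn = 100, .opcode = 0, .cqe_ind = 0 },   /* no user-visible completion */
        { .fpsn = 104, .opcode = 1, .cqe_ind = 1 },   /* completion is delivered    */
    };
    struct cqe c0 = { .qpn = 5, .eci = 0 };
    struct cqe c1 = { .qpn = 5, .eci = 1 };
    host_handle_cqe(&c0, sq);
    host_handle_cqe(&c1, sq);
    return 0;
}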
Step 10: the RDMA engine does not need to store the fPSN of the SQ WQE into the QPC; it only needs to store the PSN carried by the ACK message plus 1 into the QPC as the ePSN.
When a breakpoint retransmission occurs, the RDMA engine reads the SQ WQE from host memory according to the ECI in the QPC to obtain the fPSN, then obtains the ePSN from the QPC, and calculates the first byte of the retransmission message as follows: first byte of the retransmission message = first byte corresponding to the SQ WQE + (ePSN - fPSN) × PMTU.
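The retransmission-point calculation of step 10 reduces to one line of arithmetic, sketched below in C. The function name and the 64-bit byte offset are illustrative assumptions, and PSN wraparound is ignored; the sketch only encodes the formula given above.

#include <stdint.h>

/* Packets [fPSN, ePSN) have already been acknowledged, so the retransmitted
 * data resumes (ePSN - fPSN) * PMTU bytes into the message described by the
 * SQ WQE; wqe_first_byte is the offset of that message's first byte.        */
static uint64_t retransmit_first_byte(uint64_t wqe_first_byte,
                                      uint32_t epsn, uint32_t fpsn, uint32_t pmtu)
{
    return wqe_first_byte + (uint64_t)(epsn - fpsn) * pmtu;
}

int main(void)
{
    /* ePSN = 107, fPSN = 104, PMTU = 1024: resume 3 * 1024 = 3072 bytes in. */
    return retransmit_first_byte(0, 107, 104, 1024) == 3072 ? 0 : 1;
}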
It can be understood that, in the method provided by the invention, the send queue entry is deleted from the cache after the RDMA request message has been sent to the remote RDMA device, and the CQE carries only the index corresponding to the SQ WQE when it is reported; the local host uses that index to look up the opcode and cqe_ind of the SQ WQE. In other words, the RDMA engine does not need to read the SQ WQE when reporting the CQE, so no cache resources need to be provided in the RDMA engine for holding SQ WQEs that the send side has already processed but whose CQEs have not yet been reported, thereby saving the cache resources of the RDMA engine.
In addition, an fPSN field is added to the SQ WQE and is calculated by the host when it places the SQ WQE into the SQ. When the RDMA engine retransmits, it only needs to read the SQ WQE being retransmitted from host memory to obtain the fPSN, and then combine it with the ePSN in the QPC to obtain the exact retransmission point, so the breakpoint retransmission function is supported. Because the probability of breakpoint retransmission is small, the scheme of the embodiment of the invention can support breakpoint retransmission while reducing the overall occupation of PCIE bandwidth.
FIG. 3 is a schematic diagram illustrating one embodiment of an apparatus for reducing RDMA engine on-chip buffering provided by the present invention. The apparatus 1 for reducing RDMA engine on-chip buffering is applied to an RDMA engine as shown in fig. 2, and at least includes:
a send queue entry reading unit 10, configured to read, when a remote access operation is required, the current send queue entry SQ WQE from host memory and store it into the SQ WQE cache, where the SQ WQE includes at least the first packet sequence number fPSN of the corresponding message generated by the local host;
a request sending processing unit 11, configured to generate an RDMA request message, send it to the remote RDMA device, and delete the SQ WQE corresponding to the RDMA request message from the SQ WQE cache;
a response message parsing unit 12, configured to receive and parse an ACK message from the remote RDMA device, and obtain the queue number QPN, the message sequence number MSN and the packet sequence number PSN carried in the ACK message;
an ECI obtaining unit 13, configured to obtain the corresponding QPC from the queue pair context (QPC) cache according to the QPN, and obtain the expected consumer pointer ECI from the QPC;
and a CQE reporting unit 14, configured to report a completion queue entry CQE carrying the ECI to the host, so that the host fetches the corresponding SQ WQE from host memory according to the ECI and determines whether the CQE needs to be reported to the upper-layer user.
In a specific example, the apparatus 1 further comprises:
an ePSN processing unit 15, configured to store, after each CQE report is completed, the PSN carried by the ACK message plus 1 into the QPC in the queue pair context cache as the expected packet sequence number ePSN;
a CQE continued-reporting judging unit 16, configured to increment the current ECI value by 1 after a CQE is reported, and determine from the message sequence number MSN whether to continue reporting subsequent CQEs; and
a retransmission processing unit 17, configured to, when breakpoint retransmission of an RDMA request message is required, request from host memory, according to the ECI stored in the QPC, the SQ WQE corresponding to that ECI; and to determine, from the fPSN in that SQ WQE and the current ePSN obtained from the QPC, the first byte of the retransmission message = first byte corresponding to the SQ WQE + (ePSN - fPSN) × PMTU.
For more details, reference is made to the foregoing description of fig. 1 and fig. 2, which is not repeated here.
Accordingly, in still another aspect of the present invention, there is further provided a system for reducing RDMA engine on-chip buffering, specifically referring to fig. 2, where the system for reducing RDMA engine on-chip buffering includes at least a local host 2, a host memory, and an RDMA engine, where:
the RDMA engine includes a QPC cache, a SQ WQE cache, and an apparatus 1 for reducing RDMA engine on-chip caching as shown in FIG. 3.
Wherein, the local host 2 further comprises:
an fPSN calculation processing unit 20, configured to generate and store the first packet sequence number fPSN of the corresponding message as follows:
if the SQ WQE is the first one, its fPSN is the value negotiated by the local host when the connection with the remote RDMA device is established;
otherwise, the fPSN corresponding to the current SQ WQE is calculated by the following formula:
fPSN of current SQ WQE = fPSN of last SQ WQE + ceil(message length / PMTU)
where message length is the message length of the last SQ WQE, ceil is the round-up function, and PMTU is the path maximum transmission unit length; when the message length is 0, ceil(message length / PMTU) is taken as 1;
and a storage processing unit 21, configured to store the obtained fPSN into the SQ WQE in host memory.
Wherein, the local host further includes:
a reporting judgment processing unit 22, configured to index the corresponding SQ WQE in host memory by the ECI carried in the completion queue entry CQE, and to determine from the cqe_ind in that SQ WQE whether to report to the upper-layer user: if cqe_ind is 1, the CQE of that SQ WQE is reported to the upper-layer user; if cqe_ind is 0, the current CQE is discarded.
For more details, reference is made to the description of fig. 3 and the description is omitted here.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described in the previous figures 1 to 2. For more details, reference is made to the foregoing descriptions of fig. 1 and 2, and no further description is given here.
In yet another aspect of the invention, a chip is also provided that incorporates the RDMA engine cache reduction system as described above in connection with FIGS. 3 and 4. For more details, reference may be made to the foregoing description of fig. 3 and fig. 4, and details are not repeated here.
The embodiment of the invention has the following beneficial effects:
the invention provides a method, a device, a system, a chip and a storage medium for reducing RDMA engine on-chip caching. The field segment is calculated by adding the fPSN field segment to the SQ WQE when the local host puts the SQ WQE into the SQ. When the RDMA engine reports the CQE, only index corresponding to the SQ WQE is carried, and the local host obtains opcodes and CQE _ind corresponding to the SQ WQE according to index de-indexing of the index, namely, the RDMA engine does not need to read the SQ WQE when reporting the CQE, so that buffer resources are not needed to be provided in the RDMA engine for buffering the SQ WQE which is processed by a sending side and is not reported by the CQE, and the buffer resources of the RDMA engine are saved.
Meanwhile, when the RDMA engine retransmits, the fPSN can be obtained only by reading the SQ WQE of the current retransmission from the memory of the host, and then the fPSN is combined with the ePSN in the QPC to obtain an accurate retransmission point, so that the breakpoint retransmission function is supported.
Therefore, the invention can effectively reduce the use of the RDMA engine on-chip cache, thereby improving the link scale of RDMA under the condition of limited on-chip cache.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot limit the scope of the invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope of the present invention.

Claims (15)

1. A method for reducing RDMA engine on-chip caching, comprising at least the steps of:
when remote access operation is needed, reading the current sending queue entry SQ WQE from the host memory and storing the current sending queue entry SQ WQE into a sending queue entry cache;
generating an RDMA request message and sending the RDMA request message to remote RDMA equipment, and deleting a sending queue entry corresponding to the RDMA request message in a sending queue entry cache;
receiving and analyzing a response message from remote RDMA equipment, and when the response message is determined to be an ACK message, at least obtaining a queue number QPN in the ACK message;
acquiring a corresponding queue management context from a queue management context QPC cache according to the queue number, and acquiring a desired consumer pointer ECI from the queue management context;
and reporting the completion queue entry CQE carrying the pointer to the local host, so that the local host acquires a corresponding transmission queue entry from a host memory according to the ECI, and determining whether the completion queue entry needs to be reported to an upper user.
2. The method of claim 1, wherein the home host obtains a corresponding transmit queue entry from a host memory according to the ECI and determines whether the completion queue entry needs to be reported to an upper user, further comprising:
acquiring a corresponding sending queue entry from a host memory according to the ECI index, and determining whether reporting to an upper user is required according to cqe_ind in the sending queue entry; if cqe_ind is 1, the completion queue entry of the sending queue entry needs to be reported to the upper layer user, and if cqe_ind is 0, the current completion queue entry is discarded.
3. The method as recited in claim 2, further comprising:
after each completion queue entry report is completed, taking the packet sequence number PSN carried by the ACK message plus 1 as the expected packet sequence number ePSN, and storing the ePSN in the queue management context.
4. The method as recited in claim 3, further comprising:
after the completion queue entry is reported once, the current ECI value is added with 1, and whether the subsequent completion queue entry is continuously reported is determined according to the message sequence number MSN carried by the ACK message.
5. The method of any one of claims 1 to 4, wherein:
the transmission queue entry SQ WQE has a first packet sequence number fPSN corresponding to the transmission queue entry;
after receiving a response message from remote RDMA equipment, if judging that breakpoint retransmission is required to be carried out on the RDMA request message, requesting to obtain a corresponding sending queue entry from a host memory, and determining the breakpoint retransmission position at least according to a first packet sequence number fPSN stored in the sending queue entry.
6. The method of claim 5, wherein: the host generates a first packet sequence number fPSN of a corresponding message for each sending queue entry in the following manner, and stores the first packet sequence number fPSN into the sending queue entry in the host memory:
if the transmission queue entry is the first, the local end host negotiates to determine its fPSN when a link is established with the remote RDMA device;
otherwise, the fPSN corresponding to the current transmit queue entry = fPSN of the last transmit queue entry + ceil(message length / PMTU),
where message length is the message length of the last transmit queue entry, ceil is an upward rounding function, and PMTU is the length value of the path maximum transmission unit; when the message length is 0, ceil(message length / PMTU) is taken as 1;
and storing the obtained fPSN into the transmission queue entry of the host memory.
7. The method of claim 6, wherein determining the location of the breakpoint retransmission based at least on the first packet sequence number fPSN stored in the transmit queue entry further comprises:
when breakpoint retransmission is required to be carried out on an RDMA request message, requesting, according to the ECI stored in the queue management context, to acquire the transmit queue entry corresponding to the ECI from the host memory;
and determining, according to the fPSN in the transmit queue entry and the current ePSN acquired from the queue management context, the first byte of the retransmission message = first byte corresponding to the transmit queue entry + (ePSN - fPSN) × PMTU.
8. An apparatus for reducing on-chip buffering of an RDMA engine, applied to the RDMA engine, comprising at least:
a sending queue entry reading unit, configured to read the sending queue entry SQ WQE from the host memory and store the sending queue entry SQ WQE in a sending queue entry cache when remote access operation is required;
the request sending processing unit is used for generating an RDMA request message and sending the RDMA request message to remote RDMA equipment, and deleting a sending queue entry corresponding to the RDMA request message in a sending queue entry cache;
a response message parsing unit, configured to receive and parse a response message from a remote RDMA device, and when determining that the response message is an ACK message, obtain at least a queue number QPN in the ACK message;
the ECI acquisition unit is used for acquiring a corresponding queue management context from the queue management context QPC cache according to the queue number, and acquiring a desired consumer pointer ECI from the queue management context;
and the CQE reporting unit is used for reporting the completion queue entry CQE carrying the pointer to the local host, so that the local host acquires the corresponding transmission queue entry from the host memory according to the ECI and determines whether the completion queue entry needs to be reported to an upper user.
9. The apparatus of claim 8, further comprising:
the ePSN processing unit is used for taking the packet sequence number PSN carried by the ACK message plus 1 as the expected packet sequence number ePSN after each CQE report is completed, and storing the ePSN into the QPC in the QPC cache;
and the CQE continuous reporting judging unit is used for adding 1 to the current ECI value after reporting the CQE once and determining whether to continuously report the subsequent CQE according to the message sequence number MSN carried by the ACK message.
10. The apparatus as recited in claim 9, further comprising:
the retransmission processing unit is used for requesting, when it is judged that breakpoint retransmission is required to be carried out on the RDMA request message, to acquire the transmit queue entry corresponding to the ECI from the host memory according to the ECI stored in the QPC; and for determining, according to the fPSN in the transmit queue entry and the current ePSN acquired from the QPC, the first byte of the retransmission message = first byte corresponding to the transmit queue entry + (ePSN - fPSN) × PMTU.
11. A system for reducing RDMA engine on-chip caching, comprising at least a local host, a host memory, and an RDMA engine, wherein:
the RDMA engine comprising a QPC cache, a transmit queue entry cache, and means for reducing RDMA engine on-chip caching as recited in any of claims 8-10.
12. The system of claim 11, wherein the home host further comprises:
an fPSN calculation processing unit, configured to generate and store a first packet sequence number fPSN of a corresponding message by:
if the transmission queue entry is the first, the local end host negotiates to determine its fPSN when a link is established with the remote RDMA device;
otherwise, determining the fPSN corresponding to the current transmit queue entry according to the following formula:
fPSN of current transmit queue entry = fPSN of last transmit queue entry + ceil(message length / PMTU)
where message length is the message length of the last transmit queue entry, ceil is an upward rounding function, and PMTU is the length value of the path maximum transmission unit; when the message length is 0, ceil(message length / PMTU) is taken as 1;
and the storage processing unit is used for storing the obtained fPSN into the transmission queue entry of the host memory.
13. The system of claim 12, the local end-host further comprising:
the report judging and processing unit is used for acquiring a corresponding transmit queue entry from the host memory according to the ECI index in the completion queue entry, and determining whether to report to an upper user according to cqe_ind in the transmit queue entry; if cqe_ind is 1, the completion queue entry of the transmit queue entry needs to be reported to the upper layer user, and if cqe_ind is 0, the current completion queue entry is discarded.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
15. A chip integrated with the RDMA engine on-chip cache reduction system of any of claims 11 to 13.
CN202310564513.4A 2023-05-19 2023-05-19 Method, device and system for reducing RDMA engine on-chip cache and chip Active CN116303173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310564513.4A CN116303173B (en) 2023-05-19 2023-05-19 Method, device and system for reducing RDMA engine on-chip cache and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310564513.4A CN116303173B (en) 2023-05-19 2023-05-19 Method, device and system for reducing RDMA engine on-chip cache and chip

Publications (2)

Publication Number Publication Date
CN116303173A (en) 2023-06-23
CN116303173B true CN116303173B (en) 2023-08-08

Family

ID=86827292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310564513.4A Active CN116303173B (en) 2023-05-19 2023-05-19 Method, device and system for reducing RDMA engine on-chip cache and chip

Country Status (1)

Country Link
CN (1) CN116303173B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117687795A (en) * 2024-01-25 2024-03-12 珠海星云智联科技有限公司 Hardware offloading method, device and medium for remote direct memory access


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060075067A1 (en) * 2004-08-30 2006-04-06 International Business Machines Corporation Remote direct memory access with striping over an unreliable datagram transport

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103647807A (en) * 2013-11-27 2014-03-19 华为技术有限公司 Information caching method, device and communication apparatus
US10860511B1 (en) * 2015-12-28 2020-12-08 Western Digital Technologies, Inc. Integrated network-attachable controller that interconnects a solid-state drive with a remote server computer
CN112463654A (en) * 2019-09-06 2021-03-09 华为技术有限公司 Cache implementation method with prediction mechanism
CN114090274A (en) * 2020-07-31 2022-02-25 华为技术有限公司 Network interface card, storage device, message receiving method and message sending method
CN112559436A (en) * 2020-12-16 2021-03-26 中国科学院计算技术研究所 Context access method and system of RDMA communication equipment
CN113986791A (en) * 2021-09-13 2022-01-28 西安电子科技大学 Intelligent network card rapid DMA design method, system, equipment and terminal
CN113849293A (en) * 2021-11-30 2021-12-28 湖北芯擎科技有限公司 Data processing method, device, system and computer readable storage medium
CN115002047A (en) * 2022-05-20 2022-09-02 北京百度网讯科技有限公司 Remote direct data access method, device, equipment and storage medium
CN115470156A (en) * 2022-09-13 2022-12-13 深圳云豹智能有限公司 RDMA-based memory use method, system, electronic device and storage medium
CN115633104A (en) * 2022-09-13 2023-01-20 江苏为是科技有限公司 Data sending method, data receiving method, device and data receiving and sending system
CN115629840A (en) * 2022-10-19 2023-01-20 深圳云豹智能有限公司 Method and system for hot migration of RDMA virtual machine and corresponding physical machine
CN116016570A (en) * 2022-12-29 2023-04-25 深圳云豹智能有限公司 Message processing method, device and system
CN115982091A (en) * 2023-03-21 2023-04-18 深圳云豹智能有限公司 Data processing method, system, medium and equipment based on RDMA engine

Also Published As

Publication number Publication date
CN116303173A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111327603B (en) Data transmission method, device and system
TWI416334B (en) Method, bus interface device and processor for transmitting data transfer requests from a plurality of clients as packets on a bus
US5903724A (en) Method of transferring packet data in a network by transmitting divided data packets
CN116303173B (en) Method, device and system for reducing RDMA engine on-chip cache and chip
CN113157625B (en) Data transmission method, device, terminal equipment and computer readable storage medium
CN113422793A (en) Data transmission method and device, electronic equipment and computer storage medium
CN115827506A (en) Data writing method, data reading method, device, processing core and processor
US10198378B2 (en) Faster data transfer with simultaneous alternative remote direct memory access communications
CN110557341A (en) Method and device for limiting data current
CN111404842B (en) Data transmission method, device and computer storage medium
CN116016570A (en) Message processing method, device and system
CN112230880B (en) Data transmission control method and device, FPGA and medium
CN112511522B (en) Method, device and equipment for reducing memory occupation in detection scanning
WO2021073413A1 (en) Method and apparatus for sending system performance parameters, management device, and storage medium
CN111669431B (en) Message transmission method and device, computer equipment and storage medium
CN114490459A (en) Data transmission method, device, equipment, receiver and storage medium
CN111913815A (en) Call request processing method and device, electronic equipment and readable storage medium
CN113157628A (en) Storage system, data processing method and device, storage system and electronic equipment
US20210377197A1 (en) A mail transmission method, server and system
CN110908886A (en) Data sending method and device, electronic equipment and storage medium
CN111447650B (en) Data forwarding method, equipment and storage medium
CN115442320B (en) Method, device and storage medium for realizing fragment reassembly of RDMA (remote direct memory Access) multi-queue messages
CN116909978B (en) Data framing method and device, electronic equipment and storage medium
CN115878351B (en) Message transmission method and device, storage medium and electronic device
CN116431558B (en) AXI protocol-based request response method, device, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant