WO2022259452A1 - Intermediate device, communication method, and program - Google Patents
Intermediate device, communication method, and program
- Publication number
- WO2022259452A1 (application PCT/JP2021/022074)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- request
- remote
- local
- data
- response
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
Definitions
- The present invention relates to an intermediate device, a communication method, and a program.
- RDMA (Remote Direct Memory Access) (Non-Patent Document 1) is a communication protocol that performs high-speed and highly reliable data transfer between distant communication terminals. Because RDMA accesses the memory area of the receiving terminal directly from the memory area of the transmitting terminal, high-speed communication is possible. RDMA has a credit-based flow control function and also performs Completion control, which confirms that a data transfer has completed before processing proceeds, so highly reliable communication is possible. RDMA is also used as a transport method for Host-to-Device and Device-to-Device data communication involving SSDs (Solid State Drives) and GPUs (Graphics Processing Units).
- RDMA is a communication model that configures a QP (Queue Pair) between a local server and a remote server and transfers data using the QP.
- A QP is a set of an SQ (Send Queue) and an RQ (Receive Queue).
- The communication unit of RDMA is a communication request called a WR (Work Request), which is loaded onto the SQ/RQ in units of WQEs (Work Queue Elements).
- WRs include the Send WR, which is a request to send, and the Receive WR, which is a request to receive.
- For a Send WR, the memory area of the data to be sent is specified in a WQE, which is loaded onto the SQ.
- For a Receive WR, the memory area in which data is to be received is specified in a WQE, which is loaded onto the RQ.
- WQEs can be stacked in the SQ/RQ in FIFO (First-In-First-Out) order, up to the queue size of the SQ/RQ.
- When processing of a WR ends, a CQE (Completion Queue Entry) is loaded onto the CQ (Completion Queue); if the processing ended with an error, the CQE indicates the error.
- The WQE is then deleted from the SQ/RQ, and the next WR can be accepted.
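The queue model described above can be illustrated with a minimal Python sketch. It is not an RDMA implementation: the class names `WorkQueue` and `QueuePair`, the dictionary-shaped WQEs, and the `status` field are hypothetical and chosen only to show bounded FIFO queues whose slots are freed when a completion is recorded.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkQueue:
    """Bounded FIFO of WQEs (models an SQ or an RQ); hypothetical helper."""
    size: int
    wqes: deque = field(default_factory=deque)

    def post(self, wqe: dict) -> bool:
        """Load a WQE; refuse it when the queue is already full."""
        if len(self.wqes) >= self.size:
            return False
        self.wqes.append(wqe)
        return True

    def release(self) -> dict:
        """Delete the oldest WQE (FIFO order), freeing one slot."""
        return self.wqes.popleft()


@dataclass
class QueuePair:
    """A QP is a set of one SQ and one RQ; completions are recorded in a CQ."""
    sq: WorkQueue
    rq: WorkQueue
    cq: list = field(default_factory=list)

    def complete_send(self, status: str = "ok") -> None:
        """Record a CQE for the oldest send WQE and release that WQE."""
        wqe = self.sq.release()
        self.cq.append({"wqe": wqe, "status": status})


qp = QueuePair(sq=WorkQueue(size=2), rq=WorkQueue(size=2))
accepted = [qp.sq.post({"buf": i}) for i in range(3)]  # third post exceeds the SQ size
qp.complete_send()                                      # completion frees one SQ slot
accepted_after = qp.sq.post({"buf": 3})                 # so the next WR is accepted
```

The point of the sketch is that a full SQ blocks further requests until a completion arrives, which is why a large RTT limits throughput.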
- Advances in communication technology may provide networks that offer high bandwidth over long distances.
- For example, a technique has been proposed in which a transponder of the kind installed in current optical transmission systems is installed in a client-server system, so that the signal reaches the server of the communication partner without undergoing electrical/optical conversion during transmission, and in which a high-speed transmission line is established with few network resources (frequency, etc.) by selecting the optimum transmission mode (modulation method, baud rate, number of carriers, etc.) based on network conditions (distance, signal quality, etc.). With such technology, long-distance, high-speed communication between communication terminals may be realized using fewer network resources and terminal resources.
- the present invention has been made in view of the above, and aims to realize high-bandwidth data transfer even on network services with a large RTT (Round Trip Time).
- An intermediate device of one aspect of the present invention is an intermediate device disposed between a first device and a second device that transfer data using remote direct memory access, and comprises: a transfer unit that transfers a request including data to be transmitted from the first device to the second device; a generation unit that generates a pseudo-response to the request and returns it to the first device; and a discarding unit that discards the response to the request from the second device.
- An intermediate device of another aspect of the present invention is an intermediate device disposed between a first device and a second device that transfer data using remote direct memory access, and comprises: a generation unit that generates a pseudo request based on the first request, sent from the first device to the second device, that requests transmission of data, and transmits the pseudo request to the second device; a transfer unit that transfers a response including the data to be transmitted from the second device to the first device; and a discarding unit that discards subsequent requests from the first device.
- A communication method of one aspect is a communication method performed by an intermediate device disposed between a first device and a second device that transfer data using remote direct memory access, in which the intermediate device transfers a request including data to be transmitted from the first device to the second device, generates a pseudo-response to the request and returns it to the first device, and discards the response to the request from the second device.
- A communication method of another aspect is a communication method performed by an intermediate device disposed between a first device and a second device that transfer data using remote direct memory access, in which the intermediate device generates a pseudo request based on the first request, sent from the first device to the second device, that requests transmission of data, transmits the pseudo request to the second device, transfers a response including the data to be transmitted from the second device to the first device, and discards subsequent requests from the first device.
- high-bandwidth data transfer can be realized even on network services with a large RTT.
- FIG. 1 is a diagram for explaining an RDMA communication model.
- FIG. 2 is a diagram for explaining RDMA SEND.
- FIG. 3 is a diagram for explaining RDMA WRITE of RDMA.
- FIG. 4 is a diagram for explaining RDMA WRITE with Immediate of RDMA.
- FIG. 5 is a diagram for explaining RDMA READ of RDMA.
- FIG. 6 is a diagram for explaining the ATOMIC Operations of RDMA.
- FIG. 7 is a diagram showing an example of the configuration of a communication system including the intermediate device of this embodiment.
- FIG. 8 is a sequence diagram showing an example of the flow of processing in the communication system of FIG. 7.
- FIG. 9 is a diagram showing an example of the configuration of a communication system having another intermediate device according to this embodiment.
- FIG. 10 is a sequence diagram showing an example of the flow of processing in the communication system of FIG. 9.
- FIG. 11 is a diagram for explaining an example of a method of creating a table and resolving the destination QPN of the response.
- FIG. 12 is a diagram for explaining an example of a method of notifying the source QPN and resolving the destination QPN of the response.
- FIG. 13 is a diagram showing an example in which the intermediate device is configured on the NIC.
- FIG. 14 is a diagram illustrating an example of a hardware configuration of an intermediate device.
- RDMA service types are roughly divided into four types: RC (Reliable Connection), RD (Reliable Datagram), UC (Unreliable Connection), and UD (Unreliable Datagram), according to Reliable/Unreliable and Connection/Datagram.
- RC and UD are commonly used.
- RC guarantees the order and reachability of messages by means of acknowledgment of communication success/abnormality and retransmission by ACK/NAK.
- RC is also connection-oriented, providing one-to-one communication between local-remote QPs.
- UD does not have a mechanism for acknowledgment and retransmission, but many-to-many communication, such as transmission to multiple QPs and reception from multiple QPs, is possible by specifying a destination for each communication.
- RDMA operations include SEND, RDMA WRITE (with or without Immediate), RDMA READ, and ATOMIC Operations. All of these are available in RC. Only SEND can be used in UD.
- Retransmission control in RDMA is classified into three patterns: when no ACK/NAK is returned, when an RNR (Receiver-Not-Ready) NAK is returned, and when an Out-Of-Sequence NAK is returned.
- If neither an ACK nor a NAK is returned from the remote side within a certain period of time, the local side times out and retransmits.
- The remote side returns an RNR NAK when no WQE is ready in the RQ. If an RNR NAK is returned from the remote side, the local side resends after a certain period of time.
- The remote side returns an Out-Of-Sequence NAK when the PSN (Packet Sequence Number) of the received packet is out of order. If an Out-Of-Sequence NAK is returned from the remote side, the local side resends without waiting.
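The three retransmission patterns can be summarized as a small decision function. This is only an illustrative sketch: the `Reply` enum, the function name `retransmit_delay`, and the two timing parameters are hypothetical, and real RDMA stacks configure these waits per QP.

```python
from enum import Enum
from typing import Optional


class Reply(Enum):
    ACK = "ack"
    NONE = "none"                    # neither ACK nor NAK arrived in time
    RNR_NAK = "rnr_nak"              # remote had no WQE ready in its RQ
    OUT_OF_SEQUENCE_NAK = "oos_nak"  # remote saw an unexpected PSN


def retransmit_delay(reply: Reply, timeout_s: float, rnr_wait_s: float) -> Optional[float]:
    """Return how long the local side waits before resending, or None for no resend.

    For Reply.NONE the wait is the timeout measured from the original send.
    """
    if reply is Reply.ACK:
        return None
    if reply is Reply.NONE:
        return timeout_s
    if reply is Reply.RNR_NAK:
        return rnr_wait_s
    return 0.0  # Out-Of-Sequence NAK: resend without waiting
```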
- SEND is the basic send/receive model of RDMA and sends data from the local to the remote.
- When communication is ready, the local sends data with SEND. When the remote successfully receives the data, it loads a CQE onto the CQ, releases the WQE in the RQ, and returns an ACK to the local. When the local receives the ACK, it loads a CQE onto its CQ and releases the WQE in the SQ.
- SEND has a special variant, SEND w/Imm (SEND with Immediate).
- In SEND w/Imm, a special field (imm_data) can be set in the WQE of the local SQ, and imm_data is sent together with the data from the local to the remote.
- When the remote successfully receives the data, it loads the CQ with a CQE containing imm_data.
- The remote can learn the contents of imm_data by referring to the CQE.
- RDMA WRITE is also a method of transmitting data from the local to the remote, but differs from SEND in that data is transferred directly to a remote memory area.
- The local prepares an SQ and loads a WQE onto it.
- In the WQE, the memory area of the data to be transmitted and the remote memory area to be written are set.
- Although the remote reserves a memory area for RDMA, there is no need to stack a WQE in the RQ.
- When communication is ready, the local transmits data with RDMA WRITE. Data is written directly to the remote memory area. The remote returns an ACK to the local upon successfully receiving the data. When the local receives the ACK, it loads a CQE onto the CQ and releases the WQE in the SQ.
- RDMA WRITE has the disadvantage that the remote cannot detect the change in its memory area when data is received.
- In RDMA WRITE w/Imm (RDMA WRITE with Immediate), the remote prepares an RQ, stores a WQE in the RQ, and loads a CQE onto the CQ when data reception succeeds, thereby detecting the change in the memory area.
- The WQE of the local SQ is set with the memory area of the data to be sent, the remote memory area to be written, and a special field (imm_data).
- When the remote successfully receives the data, it loads the CQ with a CQE including imm_data, releases the WQE of the RQ, and returns an ACK to the local.
- The remote can use this CQE to detect a change in any memory area.
- RDMA READ is a method of pulling data from the remote to the local.
- The local prepares an SQ and loads a WQE onto it.
- In the WQE, the local memory area in which data is to be received and the remote memory area from which data is to be read are set.
- Although the remote reserves a memory area for RDMA, there is no need to stack a WQE in the RQ.
- When communication is ready, the local requests data reading with an RDMA Read Request.
- When the remote receives the request, it sends the data in the remote memory area directly to the specified local memory area as an RDMA Read Response.
- The RDMA Read Response contains an ACK extension header.
- When the local receives this ACK, it loads the CQ with a CQE and releases the WQE of the SQ.
- ATOMIC Operations is a method of performing a memory operation such as FetchAdd (Fetch and Add) or CmpSwap (Compare and Swap) on a remote memory area and reading the memory contents from before the operation into the local.
- FetchAdd is an operation that adds the value of an arbitrary 64-bit field to the contents of an arbitrary remote memory address.
- CmpSwap is an operation that changes the contents of an arbitrary remote memory address to a new 64-bit field value when the contents equal a specified 64-bit field value.
- The WQE is set with the local memory area in which data is to be received, the remote memory area to be manipulated, and the operation details (FetchAdd or CmpSwap and their arguments). Although the remote reserves a memory area for RDMA, there is no need to stack a WQE in the RQ.
- When communication is ready, the local sends an ATOMIC Command (FetchAdd or CmpSwap).
- When the remote receives the command, it performs the ATOMIC operation on the specified remote memory area and returns the pre-operation data with an ATOMIC ACK.
- When the local receives this ACK, it loads the CQ with a CQE and releases the WQE of the SQ.
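The semantics of the two ATOMIC operations can be shown with a short sketch in which a Python dictionary stands in for the remote memory area. The function names, the dictionary model, and the wrap-around at 64 bits are assumptions made for illustration; in both cases the value returned is the pre-operation data that the remote would carry in the ATOMIC ACK.

```python
MASK64 = (1 << 64) - 1  # operands are 64-bit fields


def fetch_add(memory: dict, addr: int, add: int) -> int:
    """Add `add` to the 64-bit value at `addr`; return the value before the add."""
    old = memory[addr]
    memory[addr] = (old + add) & MASK64  # assumed wrap-around at 64 bits
    return old


def cmp_swap(memory: dict, addr: int, compare: int, swap: int) -> int:
    """Write `swap` only if the current value equals `compare`; return the old value."""
    old = memory[addr]
    if old == compare:
        memory[addr] = swap & MASK64
    return old


remote_memory = {0x1000: 5}
before_add = fetch_add(remote_memory, 0x1000, 3)       # memory becomes 8
before_swap = cmp_swap(remote_memory, 0x1000, 8, 42)   # matches, memory becomes 42
before_miss = cmp_swap(remote_memory, 0x1000, 8, 99)   # no match, memory unchanged
```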
- An example of the configuration of a communication system including the intermediate devices 10A and 10B of this embodiment will be described with reference to FIG. 7.
- The intermediate devices 10A and 10B are placed between the local 30 and the remote 50, which transfer data using RDMA. More specifically, the intermediate device 10A is placed in front of the long-distance network on the local 30 side, and the intermediate device 10B is placed in front of the long-distance network on the remote 50 side.
- The intermediate device 10A returns a pseudo-response to the local 30, and the intermediate device 10B discards the response from the remote 50.
- the intermediate device 10A includes a transfer unit 11 and a generation unit 12.
- the transfer unit 11 receives requests from the local 30 and transfers them to the remote 50 .
- This request is, for example, the aforementioned SEND, SEND w/Imm, RDMA WRITE, RDMA WRITE w/Imm, or ATOMIC Command.
- a request includes data or an operation on data to be sent from the local 30 to the remote 50 .
- The generation unit 12 picks up each request sent from the local 30 that is flagged as Only or Last, and generates a pseudo-response using the PSN included in the request.
- The generation unit 12 returns the generated pseudo-response to the local 30. Note that the ACK for a request uses the same PSN value as the Only or Last request.
- When the local 30 receives the pseudo-response, it recognizes it as a response from the remote 50, adds a CQE to the CQ, and completes normally. This allows the WQE of the local 30 SQ to be released without waiting for the true response.
- the intermediate device 10B includes a transfer unit 11 and a discarding unit 13.
- the transfer unit 11 transfers the request sent by the local 30 to the remote 50 in the same way as the transfer unit 11 of the intermediate device 10A.
- The discarding unit 13 discards the true response from the remote 50 to the request. This prevents duplicate reception of responses at the local 30. Furthermore, if an RNR NAK or an Out-Of-Sequence NAK transmitted from the remote 50 arrived at the local 30, it could cause a malfunction, so the discarding unit 13 also discards these NAKs.
- The intermediate device 10A may be provided with the discarding unit 13, in which case the intermediate device 10B may be omitted.
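The division of roles between the intermediate devices 10A and 10B can be sketched as two small functions. This is a simplified illustration, not the patented implementation: the `Packet` fields, the string labels for packet kinds, and the `local_qpn` argument (the pseudo-response destination, assumed here to be already resolved) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Packet:
    kind: str      # "First", "Middle", "Last", "Only", "ACK", "RNR_NAK" or "OOS_NAK"
    psn: int       # Packet Sequence Number
    dest_qpn: int  # destination QPN carried in the header


def device_10a_on_request(req: Packet, local_qpn: int) -> Tuple[Packet, Optional[Packet]]:
    """Forward the request; for Only/Last requests also build a pseudo-ACK.

    The pseudo-ACK reuses the request's PSN and is addressed to the local side.
    """
    pseudo = None
    if req.kind in ("Only", "Last"):
        pseudo = Packet(kind="ACK", psn=req.psn, dest_qpn=local_qpn)
    return req, pseudo


def device_10b_on_response(resp: Packet) -> Optional[Packet]:
    """Discard true ACKs and NAKs coming from the remote; None means discarded."""
    if resp.kind in ("ACK", "RNR_NAK", "OOS_NAK"):
        return None
    return resp


forwarded, pseudo_ack = device_10a_on_request(Packet("Only", 7, 0x20), local_qpn=0x10)
_, no_ack = device_10a_on_request(Packet("First", 8, 0x20), local_qpn=0x10)
dropped = device_10b_on_response(Packet("ACK", 7, 0x10))
```

Because the pseudo-ACK is produced close to the local side, the local can release its WQE after a short round trip instead of a full RTT to the remote.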
- The local 30 loads a WQE onto the SQ and transmits a request to the remote 50.
- The request is forwarded to the remote 50 via the intermediate devices 10A and 10B.
- The intermediate device 10A generates a pseudo-response using the PSN included in the request and returns the pseudo-response to the local 30.
- When the local 30 receives the pseudo-response, it loads the CQ with a CQE and releases the WQE of the SQ.
- Thereafter, the intermediate device 10A likewise forwards each request, generates a pseudo-response, and returns it to the local 30 (step S17).
- When the local 30 receives the pseudo-response, it loads a CQE onto the CQ, releases the WQE of the SQ, loads a new WQE onto the SQ, and transmits the next request to the remote 50 (step S18).
- When the remote 50 successfully receives the request (data), it transmits a response to the local 30 in step S14.
- The intermediate device 10B discards the received response.
- Likewise, each time the remote 50 receives a request it returns a response, and the intermediate device 10B discards that response.
- An example of the configuration of a communication system including the other intermediate devices 20A and 20B of this embodiment will be described with reference to FIG. 9.
- The intermediate devices 20A and 20B are placed between the local 30 and the remote 50, which transfer data using RDMA. More specifically, the intermediate device 20A is placed in front of the long-distance network on the local 30 side, and the intermediate device 20B is placed in front of the long-distance network on the remote 50 side. The intermediate device 20A discards requests from the local 30, and the intermediate device 20B sends pseudo requests to the remote 50.
- the intermediate device 20A includes a discarding unit 21 and a transfer unit 24.
- the discarding unit 21 transfers the first request (Only or First request) from the local 30 to the remote 50 and discards subsequent requests from the local 30 . This prevents duplicate reception of requests at the remote 50 .
- the transfer unit 24 transfers the response returned from the remote 50 to the local 30.
- the response contains data to be sent from the remote 50 to the local 30 .
- the intermediate device 20B includes a generation unit 22, a control unit 23, and a transfer unit 24.
- The generation unit 22 picks up the first request from the local 30 and generates a pseudo request using the RETH (RDMA Extended Transport Header) and the destination QPN (QP Number) included in the BTH (Base Transport Header).
- The generation unit 22 generates pseudo requests so that the number obtained by subtracting the number of responses returned from the remote 50 from the number of requests sent to the remote 50 does not exceed the SQ queue size of the local 30.
- The PSN of a pseudo request is obtained by calculating the number of requests for one WQE from the DMA length in the RETH and the PSN in the BTH of the request, and incrementing the PSN by that number.
- Upon receiving the pseudo request, the remote 50 recognizes it as a request from the local 30, extracts data from the memory area, and transmits the response to the local 30. This allows data to be sent from the remote 50 without waiting for a request from the local 30.
- The control unit 23 compares the pseudo requests sent to the remote 50 with the responses returned from the remote 50, checks whether responses of the expected length and number are returned, and controls when pseudo requests are generated.
- the transfer unit 24 transfers the response returned from the remote 50 to the local 30 in the same manner as the transfer unit 24 of the intermediate device 20A.
- The intermediate device 20B may include the discarding unit 21, in which case the intermediate device 20A may be omitted.
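The pacing rule and PSN update used when generating pseudo requests can be sketched as follows. This is a rough illustration under stated assumptions: the text only says the increment is derived from the DMA length and the PSN, so dividing the DMA length by a path MTU (the `mtu` parameter) is an assumption, as are the function names. The 24-bit wrap reflects the width of the PSN field.

```python
import math

PSN_MODULUS = 1 << 24  # the PSN is a 24-bit field, so it wraps around


def next_psn(psn: int, dma_length: int, mtu: int) -> int:
    """Advance the PSN by the number of packets assumed to cover one WQE.

    Assumption: one WQE of `dma_length` bytes spans ceil(dma_length / mtu)
    packets, with a minimum of one packet.
    """
    packets = max(1, math.ceil(dma_length / mtu))
    return (psn + packets) % PSN_MODULUS


def pseudo_requests_allowed(sent: int, returned: int, sq_size: int) -> int:
    """How many more pseudo requests may be issued right now.

    Keeps (requests sent - responses returned) at or below the local SQ size.
    """
    outstanding = sent - returned
    return max(0, sq_size - outstanding)
```

For example, with 5 requests sent, 2 responses returned, and an SQ size of 4, one more pseudo request may be issued.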
- The local 30 loads the SQ with a WQE and transmits the first request to the remote 50.
- The intermediate device 20A transfers the first request to the remote 50 side without discarding it.
- The intermediate device 20B acquires the QPN included in the first request.
- Upon receiving the request, the remote 50 returns a response to the local 30 in step S22.
- When the local 30 receives the response, it loads the CQ with a CQE and releases the WQE of the SQ.
- The local 30 loads the SQ with a WQE and transmits a subsequent request to the remote 50.
- The intermediate device 20A discards subsequent requests from the local 30.
- The intermediate device 20B generates a pseudo request in step S25 and transmits the pseudo request to the remote 50 in step S26.
- The intermediate device 20B controls the generation timing of the pseudo request so that the local 30 can correctly receive the response returned from the remote 50.
- Upon receiving the pseudo request, the remote 50 returns a response including data to the local 30 in step S27.
- When the local 30 receives the response, it loads a CQE onto the CQ, releases the WQE of the SQ, loads a new WQE onto the SQ, and transmits a subsequent request to the remote 50 (step S28).
- The intermediate device 20A discards subsequent requests from the local 30 (step S29).
- The intermediate device 20B generates a pseudo request at a predetermined timing (step S30) and transmits the pseudo request to the remote 50 (step S31).
- A QP has a different QPN at each endpoint.
- The SQ/RQ knows the QPN of the opposite side and includes the destination QPN in the header when generating an RDMA packet.
- The QPN of the source is not included in the header.
- Consequently, when the intermediate device 10A generates a pseudo-response, the destination of the pseudo-response is unknown, because the received request does not contain information indicating the QPN of the transmission source. Therefore, in this embodiment, the destination of the pseudo-response is resolved by one of the following two methods.
- the first method is to inspect the exchange of the original RDMA request and response and store the QPN combination in a table.
- the same PSN is used for RDMA packet Only or Last requests and ACKs. Therefore, the intermediate device 10A inspects the passing requests and responses, and adds the destination QPN of each header of the Only or Last request and ACK having the same PSN to the table as a combination.
- In the example of FIG. 11, the destination QPNs of the request header and the response header with the same PSN are 0x000020 and 0x000010, respectively, so the combination of 0x000010 and 0x000020 is added to the table.
- The local 30 configures a QP with each of the remote 50A and the remote 50B.
- When the intermediate device 10A generates a pseudo-response, it acquires from the table the combination of QPNs that includes the destination QPN of the request, and sets the other QPN of the combination as the destination QPN of the pseudo-response. For example, when receiving a request with a destination QPN of 0x000020, the intermediate device 10A acquires the combination of 0x000010 and 0x000020, which includes 0x000020, from the table, and sets the destination QPN of the pseudo-response to 0x000010.
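The table-based method can be sketched as a small class that pairs the destination QPNs of a request and an ACK carrying the same PSN. The class and method names are hypothetical, and keying pending requests by PSN alone is a simplification: a real device would have to cope with identical PSNs appearing on different QPs.

```python
from typing import Optional


class QpnTable:
    """Learns QPN pairs by inspecting Only/Last requests and their ACKs."""

    def __init__(self) -> None:
        self._pending = {}  # PSN -> destination QPN of an observed request
        self._peer = {}     # QPN -> QPN at the other end of the same QP

    def observe_request(self, psn: int, dest_qpn: int) -> None:
        """Remember the destination QPN of an Only or Last request."""
        self._pending[psn] = dest_qpn

    def observe_ack(self, psn: int, dest_qpn: int) -> None:
        """Pair the ACK's destination QPN with the request that has the same PSN."""
        request_qpn = self._pending.pop(psn, None)
        if request_qpn is not None:
            self._peer[request_qpn] = dest_qpn
            self._peer[dest_qpn] = request_qpn

    def pseudo_response_destination(self, request_dest_qpn: int) -> Optional[int]:
        """Return the other QPN of the pair, or None if no pair is known yet."""
        return self._peer.get(request_dest_qpn)


table = QpnTable()
table.observe_request(psn=5, dest_qpn=0x000020)  # request heading to the remote
table.observe_ack(psn=5, dest_qpn=0x000010)      # ACK heading back to the local
destination = table.pseudo_response_destination(0x000020)
```

Until one genuine request and ACK pair has passed through, the lookup returns `None`, so pseudo-responses can only be generated after the table has an entry.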
- the second method is to put the Source QPN on the RDMA packet.
- The WQE has a 32-bit immDt (Immediate Data) field, and arbitrary 32-bit information can be written in the immDt field, but only for SEND with Immediate or RDMA WRITE with Immediate.
- As shown in FIG. 12, the local 30 has an insertion unit 31, which writes the QPN of the local 30 side SQ into the immDt field of the WQE of the local 30 side SQ.
- When the intermediate device 10A generates a pseudo-response, it sets the QPN written in the immDt field of the received request as the destination QPN of the pseudo-response.
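Carrying the source QPN in the 32-bit immDt field can be illustrated with a pair of helper functions. The function names and the big-endian byte order are assumptions for illustration; a QPN is 24 bits wide, so it fits in the field with the upper 8 bits left as zero.

```python
import struct

QPN_MASK = 0xFFFFFF  # a QPN is 24 bits wide


def encode_immdt(source_qpn: int) -> bytes:
    """Pack the source QPN into a 32-bit immediate-data field (assumed big-endian)."""
    return struct.pack(">I", source_qpn & QPN_MASK)


def decode_immdt(immdt: bytes) -> int:
    """Recover the source QPN, which becomes the pseudo-response destination."""
    (value,) = struct.unpack(">I", immdt)
    return value & QPN_MASK


field_bytes = encode_immdt(0x000010)
recovered_qpn = decode_immdt(field_bytes)
```

A consequence of this method is that the immediate field is no longer available to the application for its own data.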
- The intermediate devices 10A and 10B of FIG. 7 can be used for SEND.
- The intermediate device 10A transfers SEND Only to the remote 50 side, creates a pseudo-response (ACK) from the SEND Only header, and returns it to the local 30.
- When the local 30 receives the pseudo-response, it loads the CQ with a CQE and releases the WQE of the SQ.
- When the remote 50 successfully receives the data, it returns an ACK to the local 30 side. The intermediate device 10B discards the ACK from the remote 50.
- The intermediate devices 10A and 10B in FIG. 7 can be used for RDMA WRITE.
- The intermediate device 10A transfers the data to the remote 50 side, creates a pseudo-response (ACK) from the RDMA WRITE header, and returns it to the local 30.
- When the local 30 receives the pseudo-response, it loads the CQ with a CQE and releases the WQE of the SQ.
- When the remote 50 successfully receives the data, it returns an ACK to the local 30. The intermediate device 10B discards the ACK from the remote 50.
- the RDMA WRITE w/Imm shown in FIG. 4 can also be applied in the same manner as the RDMA WRITE.
- The intermediate devices 20A and 20B of FIG. 9 can be used for RDMA READ.
- The intermediate device 20A transfers the first request to the remote 50 without discarding it.
- Upon receiving the request, the remote 50 returns a response to the local 30.
- The intermediate devices 20A and 20B forward the responses to the local 30.
- The intermediate device 20B classifies each request as completed or incomplete based on the status of the responses, and estimates the free space in the SQ of the local 30.
- The intermediate device 20B creates new pseudo requests (pseudo RDMA Read Requests) for the amount of free space in the SQ and transmits them to the remote 50.
- The intermediate device 20B repeats creation and transmission of pseudo requests until the request is classified as completed based on the status of the responses.
- When the local 30 successfully receives the data, it loads a CQE onto the CQ, releases the WQE of the SQ, loads a new WQE onto the SQ, and transmits a new request to the remote 50.
- The intermediate device 20A discards this request from the local 30.
- For ATOMIC Operations, the operation differs depending on whether the pre-operation data of the remote 50 may be discarded or the local 30 needs to receive the pre-operation data.
- If the pre-operation data can be discarded, the intermediate devices 10A and 10B in FIG. 7 can be used.
- The intermediate device 10A transfers the ATOMIC Command to the remote 50 side, creates a pseudo-response (ATOMIC ACK) from the header of the ATOMIC Command, and returns it to the local 30.
- When the local 30 receives the pseudo-response, it loads the CQ with a CQE and releases the WQE of the SQ.
- Upon receiving the ATOMIC Command, the remote 50 performs the ATOMIC operation and returns the pre-operation data with an ATOMIC ACK.
- The intermediate device 10B discards the ATOMIC ACK from the remote 50.
- If the local 30 needs to receive the pre-operation data, the intermediate devices 20A and 20B of FIG. 9 are used; they transfer the ATOMIC Command to the remote 50.
- Upon receiving the ATOMIC Command, the remote 50 performs the ATOMIC operation and returns the pre-operation data with an ATOMIC ACK. The intermediate devices 20A and 20B transfer the ATOMIC ACK to the local 30.
- The intermediate device 20B classifies each request as completed or incomplete based on the status of the responses, and estimates the free space in the SQ of the local 30.
- The intermediate device 20B creates new pseudo requests (pseudo ATOMIC Commands) for the amount of free space in the SQ and transmits them to the remote 50.
- The intermediate device 20B repeats creation and transmission of pseudo requests until the request is classified as completed based on the status of the responses.
- As described above, the intermediate device 10A of this embodiment includes the transfer unit 11, which transfers a request including data to be transmitted from the local 30 to the remote 50, and the generation unit 12, which generates a pseudo-response to the request and returns it to the local 30.
- The intermediate device 10B has a discarding unit 13 that discards the response to the request from the remote 50.
- The intermediate device 20B of this embodiment includes a generation unit 22, which generates a pseudo request based on the first request from the local 30 requesting transmission of data and transmits it to the remote 50, and a transfer unit 24, which transfers a response including the data to be transmitted from the remote 50 to the local 30.
- The intermediate device 20A includes a discarding unit 21 that discards subsequent requests from the local 30. Since the remote 50 transmits data in response to pseudo requests from the intermediate device 20B, high-bandwidth data transfer can be realized without waiting for a request from the local 30, even if the RTT between the local 30 and the remote 50 is large.
- In the embodiment above, the intermediate devices 10A, 10B, 20A, and 20B are installed between the local 30 and the remote 50. Alternatively, the intermediate devices 10A and 20A may be configured on the NIC of the local 30 device, and the intermediate devices 10B and 20B may be configured on the NIC of the remote 50 device.
- intermediate devices 10A, 10B, 20A, and 20B may be composed of physical servers or may be composed of virtual servers.
- Network devices such as switches or routers may provide the functionality of intermediate devices 10A, 10B, 20A, 20B.
- An intermediate device having the functions of the intermediate device 10A and the intermediate device 10B may be arranged on the local 30 side, and an intermediate device having the functions of the intermediate device 20A and the intermediate device 20B may be arranged on the remote 50 side.
- the intermediate device 10A having the discarding unit 13 may be arranged on the local 30 side, and the intermediate device 20B having the discarding unit 21 may be arranged on the remote 50 side.
- The intermediate devices 10A, 10B, 20A, and 20B described above can be implemented using, for example, a computer that includes a central processing unit (CPU) 901, a memory 902, a storage 903, a communication device 904, an input device 905, and an output device 906, as shown in the corresponding figure.
- The intermediate devices 10A, 10B, 20A, and 20B are realized by the CPU 901 executing a predetermined program loaded into the memory 902.
- This program can be recorded on a computer-readable recording medium such as a magnetic disk, optical disk, or semiconductor memory, or distributed via a network.
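The statement above that a large RTT need not limit bandwidth can be illustrated with a rough estimate. The sketch below assumes a sender that may keep only a fixed number of bytes outstanding and must wait one round trip for a response before sending more. The function name, the 100 Gbit/s link rate, the 1 MiB in-flight limit, and the RTT values are illustrative assumptions, not figures from this document.

```python
def effective_throughput_gbps(inflight_bytes: int, rtt_s: float, link_gbps: float) -> float:
    """Upper bound on throughput (Gbit/s) when at most `inflight_bytes`
    can be outstanding per round trip, capped by the link rate."""
    if rtt_s <= 0:
        return link_gbps
    window_limited = inflight_bytes * 8 / rtt_s / 1e9
    return min(window_limited, link_gbps)


# Hypothetical values chosen only for illustration.
INFLIGHT_BYTES = 1 << 20   # 1 MiB outstanding at a time
LINK_GBPS = 100.0          # link rate

short_rtt = effective_throughput_gbps(INFLIGHT_BYTES, 10e-6, LINK_GBPS)  # nearby peer
long_rtt = effective_throughput_gbps(INFLIGHT_BYTES, 10e-3, LINK_GBPS)   # distant peer
```

Under these assumed numbers the nearby case is limited by the link, while the 10 ms case drops to roughly 0.84 Gbit/s. An intermediate device that answers or issues requests close to the endpoint lets that endpoint see the short round trip instead of the long one.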
Abstract
Description
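RDMA service types are broadly divided into four classes according to the Reliable/Unreliable and Connection/Datagram distinctions: RC (Reliable Connection), RD (Reliable Datagram), UC (Unreliable Connection), and UD (Unreliable Datagram). RC and UD are the ones in common use.
Next, several operations of the RC service type are described.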
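Next, an example configuration of a communication system that includes the intermediate devices 10A and 10B of this embodiment is described with reference to FIG. 7. The intermediate devices 10A and 10B are arranged between the local 30 and the remote 50, which transfer data using RDMA. More specifically, the intermediate device 10A is placed in front of the long-distance network on the local 30 side, and the intermediate device 10B is placed in front of the long-distance network on the remote 50 side. The intermediate device 10A returns a pseudo response to the local 30, and the intermediate device 10B discards the response from the remote 50.
Next, an example of the processing flow of the communication system that includes the intermediate devices 10A and 10B is described with reference to the sequence diagram of FIG. 8.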
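The division of roles between the intermediate devices 10A and 10B can be sketched as follows. This is a minimal model under stated assumptions, not the flow of FIG. 8: the `Packet` type, its `psn` field used to match a response to its request, and the method names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Packet:
    kind: str              # "request" or "response"
    psn: int               # assumed identifier tying a response to its request
    payload: bytes = b""
    pseudo: bool = False   # True when generated by an intermediate device


class IntermediateDevice10A:
    """Local-side device: forwards requests and answers them itself."""

    def on_request_from_local(self, req: Packet) -> Tuple[Packet, Packet]:
        forwarded = req                                         # role of transfer unit 11
        pseudo_resp = Packet("response", req.psn, pseudo=True)  # role of generation unit 12
        return forwarded, pseudo_resp


class IntermediateDevice10B:
    """Remote-side device: drops the remote's real responses."""

    def __init__(self) -> None:
        self.discarded: List[Packet] = []   # kept only so the sketch can be inspected

    def on_response_from_remote(self, resp: Packet) -> Optional[Packet]:
        self.discarded.append(resp)         # role of discarding unit 13
        return None                         # nothing is forwarded toward the local


dev_a, dev_b = IntermediateDevice10A(), IntermediateDevice10B()
to_remote, to_local = dev_a.on_request_from_local(Packet("request", psn=7, payload=b"data"))
forwarded_back = dev_b.on_response_from_remote(Packet("response", psn=7))
```

In this model the local receives its response after only the short hop to 10A, while the real response from the remote never crosses the long-distance network.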
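Next, an example configuration of a communication system that includes the other intermediate devices 20A and 20B of this embodiment is described with reference to FIG. 9. The intermediate devices 20A and 20B are arranged between the local 30 and the remote 50, which transfer data using RDMA. More specifically, the intermediate device 20A is placed in front of the long-distance network on the local 30 side, and the intermediate device 20B is placed in front of the long-distance network on the remote 50 side. The intermediate device 20A discards requests from the local 30, and the intermediate device 20B sends pseudo requests to the remote 50.
Next, an example of the processing flow of the communication system that includes the intermediate devices 20A and 20B is described with reference to the sequence diagram of FIG. 10.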
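The roles of the intermediate devices 20A and 20B can be sketched in the same spirit. How the pseudo requests are derived from the first request is not specified in this section, so the fixed-stride scheme, the `total_length` parameter, and the `ReadRequest` fields below are assumptions made purely for illustration. Whether 20B also forwards the first request itself is left out of the model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReadRequest:
    offset: int            # start of the requested range (hypothetical field)
    length: int            # number of bytes requested (hypothetical field)
    pseudo: bool = False   # True when generated by the intermediate device


class IntermediateDevice20A:
    """Local-side device: passes the first request, discards later ones."""

    def __init__(self) -> None:
        self._first_seen = False

    def on_request_from_local(self, req: ReadRequest) -> Optional[ReadRequest]:
        if not self._first_seen:
            self._first_seen = True
            return req
        return None            # role of discarding unit 21


class IntermediateDevice20B:
    """Remote-side device: issues pseudo requests based on the first request."""

    def __init__(self, total_length: int) -> None:
        self.total_length = total_length   # assumed to be known to the device

    def pseudo_requests_for(self, first: ReadRequest) -> List[ReadRequest]:
        if first.length <= 0:
            raise ValueError("length must be positive")
        out: List[ReadRequest] = []
        offset = first.offset + first.length
        while offset < self.total_length:  # role of generation unit 22
            length = min(first.length, self.total_length - offset)
            out.append(ReadRequest(offset, length, pseudo=True))
            offset += length
        return out


dev_a = IntermediateDevice20A()
dev_b = IntermediateDevice20B(total_length=10)
first = dev_a.on_request_from_local(ReadRequest(0, 4))
dropped = dev_a.on_request_from_local(ReadRequest(4, 4))
pseudo = dev_b.pseudo_requests_for(ReadRequest(0, 4))
```

Because the pseudo requests originate next to the remote, the remote can keep sending data without waiting for each request to cross the long-distance network.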
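In the RDMA interface, a QP has a different QPN at each endpoint. The SQ/RQ knows the QPN of its peer and includes the destination QPN in the header when it generates an RDMA packet. The source QPN, however, is not included in the header. When the intermediate device 10A generates a pseudo response, the received request carries no information indicating the source QPN, so the destination of the pseudo response is unknown. This embodiment therefore identifies the destination of the pseudo response in one of the following two ways.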
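One of these approaches, the table-based one worded in claim 2, can be sketched as follows. The class and method names are hypothetical, and the `identifier` argument stands in for whatever field the request and its response share. A real table would likely also need to key on the endpoint address, since QPNs are only unique per endpoint.

```python
from typing import Dict, Optional


class QpnPairTable:
    """Pairs the destination QPN of a request with the destination QPN
    of the response that carries the same identifier."""

    def __init__(self) -> None:
        self._request_dest: Dict[int, int] = {}    # identifier -> dest QPN seen on a request
        self._response_dest: Dict[int, int] = {}   # identifier -> dest QPN seen on a response
        self._pairs: Dict[int, int] = {}           # request dest QPN -> response dest QPN

    def observe(self, kind: str, identifier: int, dest_qpn: int) -> None:
        """Record one observed packet flowing between the two devices."""
        if kind == "request":
            self._request_dest[identifier] = dest_qpn
        elif kind == "response":
            self._response_dest[identifier] = dest_qpn
        else:
            raise ValueError(f"unknown packet kind: {kind!r}")
        if identifier in self._request_dest and identifier in self._response_dest:
            req_dest = self._request_dest.pop(identifier)
            self._pairs[req_dest] = self._response_dest.pop(identifier)

    def pseudo_response_dest(self, request_dest_qpn: int) -> Optional[int]:
        """Destination QPN for a pseudo response, or None if no pair is known yet."""
        return self._pairs.get(request_dest_qpn)


table = QpnPairTable()
table.observe("request", identifier=100, dest_qpn=0x22)    # request addressed to the remote QP
table.observe("response", identifier=100, dest_qpn=0x11)   # response addressed to the local QP
dest = table.pseudo_response_dest(0x22)
unknown = table.pseudo_response_dest(0x33)
```

Once a request and its matching response have both been observed, later requests to the same destination QPN can be answered with a pseudo response addressed to the paired QPN.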
Next, examples in which the intermediate device of this embodiment is applied to each RDMA operation are described.
11 … Transfer unit
12 … Generation unit
13 … Discarding unit
21 … Discarding unit
22 … Generation unit
23 … Control unit
24 … Transfer unit
30 … Local
50 … Remote
Claims (7)
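- An intermediate device arranged between a first device and a second device that transfer data using remote direct memory access, the intermediate device comprising:
a transfer unit that transfers a request containing data to be transmitted from the first device to the second device;
a generation unit that generates a pseudo response to the request and returns it to the first device; and
a discarding unit that discards a response to the request from the second device.
- The intermediate device according to claim 1, wherein
a table is created in advance from the requests and responses exchanged between the first device and the second device, the table pairing the respective destinations of a request and a response that have the same identifier, and
the generation unit obtains from the table the pair that contains the destination of the request, and uses the destination paired with the destination of the request as the destination of the pseudo response.
- The intermediate device according to claim 1, wherein
the request includes the source of the request, and
the generation unit uses the source of the request included in the request as the destination of the pseudo response.
- An intermediate device arranged between a first device and a second device that transfer data using remote direct memory access, the intermediate device comprising:
a generation unit that, based on a first request from the first device to the second device requesting transmission of data, generates a pseudo request from the first device to the second device requesting transmission of data and sends it to the second device;
a transfer unit that transfers a response containing data transmitted from the second device to the first device; and
a discarding unit that discards subsequent requests from the first device.
- A communication method performed by an intermediate device arranged between a first device and a second device that transfer data using remote direct memory access, the method comprising:
transferring a request containing data to be transmitted from the first device to the second device;
generating a pseudo response to the request and returning it to the first device; and
discarding a response to the request from the second device.
- A communication method performed by an intermediate device arranged between a first device and a second device that transfer data using remote direct memory access, the method comprising:
generating, based on a first request from the first device to the second device requesting transmission of data, a pseudo request from the first device to the second device requesting transmission of data, and sending it to the second device;
transferring a response containing data transmitted from the second device to the first device; and
discarding subsequent requests from the first device.
- A program that causes a computer to operate as each unit of the intermediate device according to any one of claims 1 to 4.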
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023526747A JPWO2022259452A1 (ja) | 2021-06-10 | 2021-06-10 | |
US18/568,618 US20240146806A1 (en) | 2021-06-10 | 2021-06-10 | Intermediate apparatus, communication method, and program |
PCT/JP2021/022074 WO2022259452A1 (ja) | 2021-06-10 | 2021-06-10 | Intermediate apparatus, communication method, and program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/022074 WO2022259452A1 (ja) | 2021-06-10 | 2021-06-10 | Intermediate apparatus, communication method, and program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022259452A1 (ja) | 2022-12-15 |
Family
ID=84426018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/022074 WO2022259452A1 (ja) | 2021-06-10 | 2021-06-10 | Intermediate apparatus, communication method, and program |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240146806A1 (ja) |
JP (1) | JPWO2022259452A1 (ja) |
WO (1) | WO2022259452A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116566921A (zh) * | 2023-07-04 | 2023-08-08 | 珠海星云智联科技有限公司 | Congestion control method, system, and storage medium for remote direct memory access reads |
WO2024201804A1 (ja) * | 2023-03-29 | 2024-10-03 | 日本電信電話株式会社 | Relay device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011058640A1 (ja) * | 2009-11-12 | 2011-05-19 | 富士通株式会社 | Communication method for parallel computing, information processing apparatus, and program |
JP2013255185A (ja) * | 2012-06-08 | 2013-12-19 | Of Networks:Kk | OpenFlow switch, OpenFlow controller, and OpenFlow network system |
- 2021
- 2021-06-10 JP JP2023526747A patent/JPWO2022259452A1/ja active Pending
- 2021-06-10 US US18/568,618 patent/US20240146806A1/en active Pending
- 2021-06-10 WO PCT/JP2021/022074 patent/WO2022259452A1/ja active Application Filing
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011058640A1 (ja) * | 2009-11-12 | 2011-05-19 | 富士通株式会社 | Communication method for parallel computing, information processing apparatus, and program |
JP2013255185A (ja) * | 2012-06-08 | 2013-12-19 | Of Networks:Kk | OpenFlow switch, OpenFlow controller, and OpenFlow network system |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
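WO2024201804A1 (ja) * | 2023-03-29 | 2024-10-03 | 日本電信電話株式会社 | Relay device |
CN116566921A (zh) * | 2023-07-04 | 2023-08-08 | 珠海星云智联科技有限公司 | Congestion control method, system, and storage medium for remote direct memory access reads |
CN116566921B (zh) * | 2023-07-04 | 2024-03-22 | 珠海星云智联科技有限公司 | Congestion control method, system, and storage medium for remote direct memory access reads |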
Also Published As
Publication number | Publication date |
---|---|
US20240146806A1 (en) | 2024-05-02 |
JPWO2022259452A1 (ja) | 2022-12-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21945128 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023526747 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18568618 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21945128 Country of ref document: EP Kind code of ref document: A1 |