WO2023040683A1 - Method for transmitting data and input/output device - Google Patents

Method for transmitting data and input/output device

Info

Publication number
WO2023040683A1
Authority
WO
WIPO (PCT)
Prior art keywords
ssq
computing device
computing
request
pending
Prior art date
Application number
PCT/CN2022/116822
Other languages
English (en)
French (fr)
Inventor
曲会春
李君瑛
吉辛维克多
古列维奇·埃琳娜
陆钢
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023040683A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer

Definitions

  • the present application relates to the technical field of communication, and more specifically, to a data transmission method and an input/output (IO) device.
  • RDMA: remote direct memory access.
  • a computing cluster usually includes two or more computing devices, and each computing device runs multiple processes. Assume that the computing cluster includes N_node computing devices, and each computing device runs P processes (N_node is a positive integer greater than or equal to 2, and P is a positive integer greater than or equal to 1). To realize full interconnection in the computing cluster, (N_node − 1) × P × P queue pairs (QPs) need to be established in each computing device, and each queue pair has a send queue (SQ). This takes a lot of memory. For example, assuming that the cluster includes 10 computing devices in total, the memory occupied by the input/output (IO) device in each computing device can reach the terabyte (TB) level.
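To make the TB-level figure concrete, here is a back-of-the-envelope sketch in Python. The queue depth (1024 entries) and work-queue-element size (64 bytes) are illustrative assumptions, and a per-device process count of 1500 is likewise assumed, since the text gives no concrete values:

```python
# Back-of-the-envelope estimate of send-queue memory for full QP
# interconnection in the prior art. depth_queue (1024) and wqe_size
# (64 bytes) are assumed illustrative values, not taken from the text.
def qp_send_queue_memory(n_node: int, p: int,
                         depth_queue: int = 1024, wqe_size: int = 64) -> int:
    """(N_node - 1) * P * P queue pairs, one send queue each, in bytes."""
    return (n_node - 1) * p * p * depth_queue * wqe_size

# 10 computing devices, assuming 1500 processes per device:
mem = qp_send_queue_memory(10, 1500)
print(f"{mem / 2**40:.2f} TiB")  # on the order of a terabyte
```

With these assumed parameters the per-device send-queue memory already exceeds 1 TiB, which matches the order of magnitude claimed above.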
  • the application provides a data transmission method and an input and output device, which can support larger-scale computing clusters.
  • the present application provides a data transmission method, the method is applied to a computing cluster, the computing cluster includes multiple computing devices, the first computing device and the second computing device are any two of the multiple computing devices A computing device, the first computing device and the second computing device communicate through a first channel.
  • a first IO device is deployed in the first computing device, and a second IO device is deployed in the second computing device.
  • the first IO device acquires the pending request, stores the pending request according to the storage policy, and then sends the pending request to the second IO device through the first channel.
  • the storage policy is used to indicate the manner of storing the request to be processed in the first IO device.
  • the first IO device may determine how to store the pending request according to the storage policy, instead of directly storing the pending request in its own storage space. In this way, the storage space of the first IO device can be used more reasonably, so that less storage space can be used to store pending requests that need to be sent to other computing devices of the computing cluster. In this way, under the condition that the storage space of the IO device remains unchanged, the IO device using the above technical solution can communicate with more IO devices, thereby supporting a larger-scale computing cluster and improving the scalability of RDMA.
  • the first IO device storing the pending request according to a storage policy includes: the first IO device determines an identifier of the destination computing device of the pending request; according to the identifier of the destination computing device, the first IO device determines a first shared send queue (SSQ) for storing the pending request, where the first SSQ is used for storing pending requests associated with the destination computing device.
  • the IO device can allocate the storage space for storing the SSQ of the request to be processed according to the number of computing devices in the computing cluster.
  • the first IO device can allocate N − 1 storage spaces, the N − 1 storage spaces correspond one-to-one to N − 1 channels, and the N − 1 channels are the communication channels between the first computing device and the N − 1 computing devices in the N computing devices other than the first computing device.
  • Each of the N − 1 storage spaces can be used to store N_CoS SSQs.
  • N_CoS is the number of service levels supported by the first IO device, and N_CoS is a positive integer greater than or equal to 1.
  • Usually, the value of N_CoS will not be greater than 8. Therefore, the first IO device needs to establish at most N_CoS × (N − 1) SSQs. Normally, the value of N_CoS is smaller than the number of processes running on the computing device. Therefore, compared with the existing technology, when the storage space of the IO device remains unchanged, the IO device using the above technical solution can communicate with more IO devices, thereby supporting a larger-scale computing cluster and improving the scalability of RDMA.
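As a rough illustration of the queue-count reduction, here is a short sketch; the values N_CoS = 8 and P = 1500 are assumed for illustration and are not taken from the text:

```python
# Queue counts per device: shared send queues (this application) versus
# per-process send queues (prior art). N_CoS = 8 and P = 1500 are assumed.
def ssq_count(n_node: int, n_cos: int) -> int:
    return n_cos * (n_node - 1)      # at most N_CoS SSQs per peer device

def sq_count(n_node: int, p: int) -> int:
    return (n_node - 1) * p * p      # prior art: one SQ per queue pair

print(ssq_count(10, 8))    # 72 queues
print(sq_count(10, 1500))  # 20250000 queues
```

Under these assumptions the shared-queue scheme needs 72 queues per device where the per-process scheme needs over twenty million.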
  • the SSQ and the channel in the first IO device may be dynamically associated.
  • the method further includes: the first IO device creates a first SSQ set, the first SSQ set includes at least one SSQ, and the at least one SSQ includes the first SSQ; the first IO device binds the first SSQ set to the first channel, and the first SSQ set is used to store the pending requests sent through the first channel.
  • the first IO device may create an SSQ resource pool, and the storage space occupied by the SSQ resource pool may be smaller than the size of N-1 storage spaces.
  • the first IO device may determine the first SSQ set from the SSQ resource pool and bind the first SSQ set to the first channel.
  • the first IO device may unbind the first SSQ set from the first channel. In this way, the SSQs in the first SSQ set will be recycled to the SSQ resource pool, so as to be used by pending requests sent through other channels. In this way, the storage space occupied by SSQ can be further saved, thereby further improving the scale of supported computing clusters and the scalability of RDMA.
  • the plurality of SSQs correspond to a plurality of service levels (CoS), where the CoS corresponding to the first SSQ is the same as the CoS of the pending request.
  • the above technical solution supports CoS, so pending requests of different CoSs can be stored in corresponding SSQs.
  • if the processing time of the pending requests in the first SSQ exceeds a preset threshold, the pending requests in the first SSQ are stored in a pending shared send queue (PSSQ).
  • the above technical solution can move the pending requests blocked at the head of the SSQ to the PSSQ, allowing subsequent pending requests in the SSQ to be processed first. This can effectively reduce the probability of head-of-line blocking.
  • the acquisition of the pending request by the first input/output (IO) device includes: the first IO device acquires the pending request from a first submission queue, where the first submission queue is used to store the pending requests of the process associated with it, and the first submission queue is stored in the memory of the first computing device.
  • the first IO device includes at least one of a network interface controller, an intelligent network interface controller, a host bus adapter, a host channel adapter, an accelerator, a data processor, an image processor, an artificial intelligence device, or a software-defined infrastructure.
  • the present application further provides an IO device, where the IO device includes a unit for implementing the first aspect or any possible implementation manner of the first aspect.
  • the present application also provides a computer device, the computer device includes a processor, the processor is configured to be coupled with a memory, and to read and execute instructions and/or program codes in the memory, so as to implement the first aspect or any possible implementation manner of the first aspect.
  • the memory can also be used to store SSQ.
  • the present application also provides a chip system, the chip system includes a logic circuit, the logic circuit is used to couple with the input/output interface, and transmit data through the input/output interface, so as to implement the first aspect or the first aspect any possible implementation.
  • the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores program code, and when the program code is run on a computer, the computer executes the method according to the first aspect or any possible implementation manner of the first aspect.
  • the present application also provides a computer program product, the computer program product including computer program code; when the computer program code runs on a computer, the computer executes the method according to the first aspect or any possible implementation manner of the first aspect.
  • Figure 1 is a schematic diagram of a computing cluster.
  • Fig. 2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Fig. 3 is a schematic flowchart of a data transmission method provided according to an embodiment of the present application.
  • Fig. 4 is a schematic flowchart of a data transmission method provided according to an embodiment of the present application.
  • Fig. 5 is a schematic structural block diagram of an IO device provided according to an embodiment of the present application.
  • Fig. 6 is a structural block diagram of an IO device provided according to an embodiment of the present application.
  • the chip referred to in the embodiment of the present application may be a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), an application processor (AP), or another integrated chip.
  • a computing cluster (may be referred to simply as a cluster) is a computing system.
  • Computing clusters connect a group of computing devices to work closely together to complete computing tasks.
  • Individual computing devices in a computing cluster may be referred to as nodes.
  • Figure 1 is a schematic diagram of a computing cluster. As shown in FIG. 1 , the computing cluster 100 includes six computing devices, namely computing device 111 , computing device 112 , computing device 113 , computing device 114 , computing device 115 and computing device 116 .
  • Computing device 111 to computing device 116 are connected by network 120 .
  • the computing device shown in FIG. 1 will be introduced below with reference to FIG. 2 .
  • the computing device 200 shown in FIG. 2 may be any one of the computing devices 111 to 116 shown in FIG. 1 .
  • the computing device 200 shown in FIG. 2 includes a host 210 , an IO interconnect channel 220 and an IO device 230 .
  • the host 210 can be connected to the IO device 230 through the IO interconnection channel 220 .
  • the host 210 may be a computing core and a control core, and is the final execution unit for information processing and program execution.
  • the host 210 includes a processor 211 and a first memory 222. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the processor 211 may also be a system on chip (SoC) or an embedded processor.
  • the processor 211 has functions such as processing instructions, executing operations, and processing data.
  • the processor 211 can allocate independent memory resources for multiple processes, so as to run multiple processes.
  • the first memory 222 may be implemented by random access memory (RAM) or other storage media.
  • the first memory 222 can be used to store program codes of a plurality of processes.
  • the IO interconnection channel 220 is an interconnection mechanism between the host 210 and the IO device 230, for example, peripheral component interconnect express (PCIe), compute express link (CXL), cache coherent interconnect for accelerators (CCIX), unified bus (UB or Ubus), and so on.
  • the IO device 230 refers to hardware that can perform data transmission with the host 210 , and is used to receive and execute pending requests sent by the host 210 .
  • the IO device 230 may be at least one of a network interface controller (NIC), an intelligent NIC (smart-NIC), a host bus adapter (HBA), a host channel adapter (HCA), an accelerator, a data processing unit (DPU), a graphics processing unit (GPU), an artificial intelligence (AI) device, a software-defined infrastructure (SDI), etc.
  • the IO device 230 may include a second processor 231 and a second memory 232 .
  • the second memory 232 may be implemented by random access memory (RAM) or other storage media.
  • Assume that the computing device 200 shown in FIG. 2 is the computing device 116 in the computing cluster 100 shown in FIG. 1.
  • Assume that there are P processes running in the host 210, which are called process 1 to process P respectively.
  • the host 210 creates P submission queue (submission queue, SuQ or SQ) sets, which are called SuQ set 1 to SuQ set P respectively.
  • the P SuQ sets are respectively associated with P processes.
  • SuQ set 1 is associated with process 1
  • SuQ set 2 is associated with process 2
  • SuQ set 3 is associated with process 3, and so on.
  • Each SuQ set in the P SuQ sets is used to store the pending requests of the associated process. Pending requests stored in SuQ may be referred to as submission queue elements (SuQE).
  • process 1 creates work request (work request, WR) 1.
  • WR 1 is converted to SuQEs and saved in SuQ set 1.
  • SuQE is assigned a CoS when it is created.
  • the SuQE can be saved to the corresponding SuQ according to the CoS of the SuQE.
  • SuQE 1 to SuQE 3 can be saved to SuQ 1 in SuQ set 1.
  • process 2 creates work request (work request, WR) 2.
  • WR 2 is converted to SuQE 4 to SuQE 6; since the CoS levels of SuQE 4 to SuQE 6 are all 4, SuQE 4 to SuQE 6 are saved in SuQ 4 of SuQ set 2.
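The CoS-based dispatch of SuQEs into the SuQs of a set can be sketched as follows. This is a minimal illustration, assuming (as in the figure's example) 4 service levels and representing SuQEs as plain dictionaries; none of these names come from the source:

```python
from collections import deque

N_COS = 4  # number of service levels; the example above uses 4

def make_suq_set():
    """One SuQ set per process: N_COS submission queues, one per CoS."""
    return [deque() for _ in range(N_COS)]

def submit(suq_set, suqe):
    """Save a SuQE into the SuQ matching its CoS (CoS levels 1..N_COS)."""
    suq_set[suqe["cos"] - 1].append(suqe)

suq_set_2 = make_suq_set()          # SuQ set associated with process 2
for i in (4, 5, 6):
    submit(suq_set_2, {"id": f"SuQE {i}", "cos": 4})
print(len(suq_set_2[3]))  # all three SuQEs land in SuQ 4
```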
  • the SuQ set on the host side is created per process and supports QoS capability at the same time; the buffering of WRs from the host side to the IO device is realized through the SuQ.
  • After the host 210 stores the SuQE in the SuQ, it can notify the IO device 230 to process the SuQE in the SuQ by means of a doorbell.
  • the IO device 230 determines the destination computing device of the SuQE, and then stores the SuQE in a corresponding shared send queue (shared send queue, SSQ) set.
  • a shared send queue can also be called an active queue (AQ).
  • the SSQ set created by the IO device 230 is statically associated with the channel. In other words, the number of SSQ sets created by the IO device 230 is the same as the number of channels. Each SSQ set is associated with a channel.
  • the computing cluster shown in FIG. 1 includes 6 computing devices. Therefore, the computing device 200 (that is, the computing device 116 ) needs to have five channels to communicate with the computing device 111 to the computing device 115 respectively. These five channels may be referred to as channels 1 to 5, respectively. In other words, computing device 116 communicates with computing device 111 via channel 1, computing device 116 communicates with computing device 112 via channel 2, computing device 116 communicates with computing device 113 via channel 3, and so on. In the static association scenario, the IO device 230 can create five SSQ sets. These five SSQ sets are respectively associated with five channels.
  • IO device 230 may store the SuQE in the SSQ set associated with channel 1.
  • the IO device 230 may send the pending requests in the SSQ set associated with the channel 1 to the computing device 111 through the channel 1 .
  • Pending requests in an SSQ may be called shared send queue elements (SSQEs).
  • the SSQ set created by the IO device 230 is dynamically associated with the channel.
  • the IO device 230 can create an SSQ resource pool, which includes at least N_CoS SSQs, where N_CoS is the number of SuQs included in one SuQ set, and N_CoS is a positive integer greater than or equal to 1.
  • the IO device 230 can determine SSQ resources from the SSQ resource pool, create an SSQ set, and bind the SSQ set to the channel corresponding to the destination computing device of the pending requests.
  • the embodiment shown in FIG. 3 is a scenario in which an SSQ set is dynamically associated with a channel.
  • IO device 230 determines that there is a pending request to be sent to computing device 111 , a pending request to be sent to computing device 112 , and a pending request to be sent to computing device 115 .
  • IO device 230 creates SSQ set 1 and binds SSQ set 1 to channel 1, creates SSQ set 2 and binds SSQ set 2 to channel 2, creates SSQ set 3 and binds SSQ set 3 to channel 5 bound.
  • pending requests sent to computing device 111 can be saved in SSQ set 1
  • pending requests sent to computing device 112 can be saved in SSQ set 2
  • pending requests sent to computing device 115 can be saved in SSQ set 3 in.
  • the SSQ in the SSQ collection can be implemented in the following two ways:
  • SSQ is implemented through a ring buffer.
  • SSQEs are executed sequentially. For example, assume that the storage space allocated to the SSQ can store a total of 8 SSQEs, namely SSQE 1 to SSQE 8, where SSQE 1 is the first SSQE, SSQE 2 is the second SSQE, and so on. In the ring buffer implementation, SSQE 1 to SSQE 8 will be sent in order; after SSQE 8 is sent, sending continues with the SSQE stored in the buffer space originally occupied by SSQE 1.
  • In other words, once the storage space of the first SSQE is determined, the storage spaces of the subsequent SSQEs that need to be sent can be determined, and after the last SSQE is processed, the storage space of the first SSQE can again save a new SSQE.
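The ring-buffer behaviour described above can be sketched as follows. This is a minimal illustration under the 8-slot assumption from the example, not an implementation from the source:

```python
class RingSSQ:
    """Ring-buffer SSQ sketch: fixed capacity, SSQEs sent in slot order,
    and a freed slot is reused for a new SSQE once the ring wraps around."""
    def __init__(self, capacity: int = 8):
        self.slots = [None] * capacity
        self.head = 0   # next slot to send from
        self.tail = 0   # next free slot to fill
        self.count = 0

    def push(self, ssqe):
        if self.count == len(self.slots):
            raise OverflowError("SSQ full")
        self.slots[self.tail] = ssqe
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("SSQ empty")
        ssqe, self.slots[self.head] = self.slots[self.head], None
        self.head = (self.head + 1) % len(self.slots)  # wrap after last slot
        self.count -= 1
        return ssqe

ssq = RingSSQ(8)
for i in range(1, 9):
    ssq.push(f"SSQE {i}")
print(ssq.pop())  # SSQE 1 is sent first; its slot can now hold a new SSQE
```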
  • SSQ is implemented through a linked list: each SSQE records the information of the next SSQE to be processed (which may be called link information). Assume that the storage space allocated to the SSQ can store a total of 8 SSQEs, namely SSQE 1 to SSQE 8.
  • For example, the first SSQE to send may be SSQE 2; if the next SSQE to be sent is determined to be SSQE 6 according to the link information in SSQE 2, then SSQE 6 can be sent after SSQE 2; if the next SSQE to be sent is determined to be SSQE 1 according to the link information in SSQE 6, then SSQE 1 can be sent after SSQE 6.
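The linked-list variant can be sketched with the exact chain from the example (SSQE 2 → SSQE 6 → SSQE 1); the slot/dictionary representation is an assumption made for illustration:

```python
# Linked-list SSQ sketch: each slot stores (ssqe, next_index), where
# next_index is the "link information" pointing at the next SSQE to send.
slots = {
    1: ("SSQE 1", None),  # last in the chain
    2: ("SSQE 2", 6),
    6: ("SSQE 6", 1),
}

def send_order(slots, first):
    """Follow the link information from the first slot and collect the order."""
    order, idx = [], first
    while idx is not None:
        ssqe, idx = slots[idx]
        order.append(ssqe)
    return order

print(send_order(slots, 2))  # ['SSQE 2', 'SSQE 6', 'SSQE 1']
```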
  • The SuQ mentioned above and the PSSQ mentioned later can also be implemented through a ring buffer or a linked list.
  • the specific implementation method is the same as that of SSQ. For the sake of brevity, details will not be repeated here.
  • the total memory size occupied by the SSQ in the IO device 230 is determined by the number of CoSs, the queue depth of the SSQ, the number of computing devices in the computing cluster, and the size of the SSQE.
  • the total memory size occupied by SSQ can be determined according to the following formula:
  • N_SSQ = N_CoS × (N_node − 1) × depth_queue × SSQE_size  (Formula 1)
  • where N_SSQ represents the total memory size occupied by SSQs in the IO device 230, N_CoS represents the number of CoS levels (also the number of SSQs included in one SSQ set), N_node is the number of computing devices in the computing cluster, depth_queue represents the queue depth of the SSQ (that is, the number of SSQEs included in one SSQ), and SSQE_size represents the size of one SSQE.
  • the total memory size occupied by SSQs in the IO device 230 may be smaller than N SSQ .
  • In the existing technology, the total memory size occupied by send queues (SQs) in the IO device of each computing device is:
  • N_SQ = (N_node − 1) × P × P × depth_queue_SQ × WQE_size  (Formula 2)
  • where N_SQ represents the total memory size occupied by SQs in the IO device, N_node represents the number of computing devices in the computing cluster, P represents the number of processes running on each computing device in the cluster (assuming that all computing devices run the same number of processes and each process can communicate with each other), depth_queue_SQ represents the queue depth of the SQ (that is, the number of WQEs included in one SQ), and WQE_size represents the size of one WQE.
  • the size of the WQE is the same as the size of the SSQE or the difference is smaller than a preset value.
  • the preset value can be determined according to the size of the storage space where the WQE is located, or determined according to the statistical value of the size of the data to be stored in the WQE requested by the same type of business.
  • N_SSQ is smaller than N_SQ.
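Plugging assumed values into Formula 1 and Formula 2 illustrates the gap. The parameter values below (8 CoS levels, 1500 processes, queue depth 1024, 64-byte entries) are illustrative assumptions, not figures from the text:

```python
# Formula 1 vs Formula 2, with assumed values: 10 devices, 8 CoS levels,
# 1500 processes per device, queue depth 1024, 64-byte SSQE/WQE.
def n_ssq(n_cos, n_node, depth_queue, ssqe_size):   # Formula 1
    return n_cos * (n_node - 1) * depth_queue * ssqe_size

def n_sq(n_node, p, depth_queue_sq, wqe_size):      # Formula 2
    return (n_node - 1) * p * p * depth_queue_sq * wqe_size

print(f"N_SSQ = {n_ssq(8, 10, 1024, 64) / 2**20:.1f} MiB")
print(f"N_SQ  = {n_sq(10, 1500, 1024, 64) / 2**40:.2f} TiB")
```

Under these assumptions the SSQ scheme needs a few MiB where the per-process SQ scheme needs over a TiB, because Formula 1 replaces the P × P factor with N_CoS.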
  • the technical solution of the embodiment of the present application can support a larger-scale computing cluster. In this way, the scalability of RDMA can be improved.
  • each SSQ set in FIG. 3 also includes four SSQs, which are in one-to-one correspondence with the four CoSs.
  • the IO device 230 may determine to store the SuQE in the corresponding SSQ in the SSQ set according to the CoS in the SuQE.
  • If the processing time of an SSQE in an SSQ exceeds a preset time threshold (for example, no confirmation message from the destination computing device is received within the preset time threshold), the SSQE in the SSQ can be saved to the pending shared send queue (PSSQ). The pending shared send queue can also be called the pending queue (PQ).
  • the preset time threshold may be the same as the time of waiting for the kth retransmission, where k is a positive integer greater than or equal to 1 and less than the maximum number of retransmissions. For example, suppose the time to wait for a retransmission is determined according to the following formula:
  • T_RTNS = T_1 × 2^(Service Timeout)  (Formula 3)
  • where T_RTNS is the waiting time for retransmission, T_1 is a preset value, and Service Timeout is a positive integer greater than or equal to 1 and less than or equal to the preset maximum number of retransmissions. For example, if the maximum number of retransmissions is 10, the time for waiting for the first retransmission is T_1 × 2, the time for waiting for the second retransmission is T_1 × 2², and so on.
  • the value range of T_1 is usually at the microsecond level; for example, a typical value may be 4.096 microseconds. Certainly, T_1 may also take other values, such as 5 microseconds, 10 microseconds, and so on.
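The exponential backoff of Formula 3 can be tabulated directly; this sketch uses the typical T_1 value of 4.096 microseconds mentioned above:

```python
T1 = 4.096e-6  # seconds; the typical value mentioned in the text

def retransmit_wait(k: int) -> float:
    """Waiting time before the k-th retransmission, per Formula 3: T1 * 2**k."""
    return T1 * 2 ** k

for k in (1, 2, 3):
    print(f"retransmission {k}: {retransmit_wait(k) * 1e6:.3f} us")
```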
  • the preset time threshold can also be set according to experience. For example, it can be equal to 10 microseconds, 15 microseconds, 20 microseconds, etc.
  • the pending requests in the PSSQ (that is, the SSQEs saved in the PSSQ) may also be called pending shared send queue elements (PSSQEs).
  • the IO device 230 may bind the PSSQ to a channel corresponding to the destination computing device, and send the PSSQEs in the PSSQ through the channel. For example, suppose SSQ 1 in SSQ set 3 has two SSQEs, named SSQE 1 and SSQE 2. If the processing time of SSQE 1 and SSQE 2 exceeds the preset time threshold, then SSQE 1 and SSQE 2 can be saved to PSSQ 1 (the SSQE 1 and SSQE 2 saved to PSSQ 1 can be called PSSQE 1 and PSSQE 2 respectively). The IO device 230 can bind PSSQ 1 to channel 1, and send PSSQE 1 and PSSQE 2 through channel 1.
  • the IO device 230 may re-save the pending requests in the PSSQ to the SSQ if a preset condition is met. Continuing the above example, if PSSQE 1 is successfully sent to the destination computing device, the IO device 230 can move PSSQE 2 back into SSQ 1.
  • Both the SSQ and the PSSQ use a first-in-first-out sending mechanism.
  • Some or all SSQEs stuck in an SSQ can be relocated to the PSSQ, so that the remaining SSQEs in the SSQ can continue to be sent; the SSQ can also be bound to other channels. This can effectively reduce the probability of head-of-line blocking.
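The head-of-line relief described above can be sketched as follows. This is a minimal FIFO illustration with assumed names; the timeout predicate stands in for the preset-time-threshold check:

```python
from collections import deque

def relieve_head_of_line(ssq, pssq, timed_out):
    """Move SSQEs blocked at the head of the SSQ into the PSSQ so the
    remaining SSQEs can continue to be sent (FIFO preserved in both)."""
    while ssq and timed_out(ssq[0]):
        pssq.append(ssq.popleft())

ssq = deque(["SSQE 1", "SSQE 2", "SSQE 3"])
pssq = deque()
# Suppose SSQE 1 and SSQE 2 have exceeded the preset time threshold:
relieve_head_of_line(ssq, pssq, timed_out=lambda e: e in ("SSQE 1", "SSQE 2"))
print(list(pssq), list(ssq))  # SSQE 3 is no longer blocked
```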
  • Each pending request (i.e., SuQE, SSQE, WQE, PSSQE) in each queue (i.e., SuQ, SSQ, PSSQ) in the embodiment of the present application may use the same data structure or a different data structure; the embodiment of the present application does not limit this. If SuQE, SSQE, WQE, and PSSQE use the same data structure, pending requests carrying the same data in different queues can have the same size; if SuQE, SSQE, WQE, and PSSQE use different data structures, then pending requests carrying the same data in different queues may not be exactly the same size.
  • In the above embodiments, each SuQ set and each SSQ set includes 4 queues. If there is no need to implement CoS (the number of CoS levels can be considered as 1), then each SuQ set and each SSQ set can have only one queue.
  • both SSQ and PSSQ in the above embodiments are stored in the memory of the IO device.
  • the SSQ and/or PSSQ may also be stored in the memory of the host.
  • Fig. 4 is a schematic flowchart of a data transmission method provided according to an embodiment of the present application. The method shown in FIG. 4 can be applied to a computing cluster, where the computing cluster includes multiple computing devices.
  • a first IO device acquires a pending request, where the first IO device is an IO device deployed in a first computing device, and the first computing device is any computing device among the plurality of computing devices;
  • the first IO device stores the pending request according to a storage policy, where the storage policy is used to indicate a manner of storing the pending request in the first IO device.
  • the first IO device sends the pending request to a second IO device through a first channel, where the first channel is a transmission channel between the first computing device and a second computing device, and the second IO device is the The IO device deployed in the second computing device, where the second computing device is any computing device in the plurality of computing devices except the first computing device, and the first computing device and the second computing device use remote direct memory Access technology to communicate.
  • the first IO device may determine how to store the pending request according to the storage policy, instead of directly storing the pending request in its own storage space. In this way, the storage space of the first IO device can be used more reasonably, so that less storage space can be used to store pending requests that need to be sent to other computing devices of the computing cluster. In this way, under the condition that the storage space of the IO device remains unchanged, the IO device using the above technical solution can communicate with more IO devices, thereby supporting a larger-scale computing cluster and improving the scalability of RDMA.
  • the data transmission method provided according to the embodiment of the present application is described in detail above with reference to FIG. 1 to FIG. 4 .
  • the IO device provided according to the embodiment of the present application will be described below in conjunction with FIG. 5 to FIG. 6 .
  • Fig. 5 is a schematic structural block diagram of an IO device provided according to an embodiment of the present application.
  • the IO device 500 shown in FIG. 5 includes an acquisition unit 501 , a processing unit 502 , a storage unit 503 and a sending unit 504 .
  • the obtaining unit 501 is configured to obtain pending requests.
  • the processing unit 502 is configured to store the request to be processed in the storage unit 503 according to a storage policy, where the storage policy is used to indicate a manner of storing the request to be processed in the storage unit 503 .
  • the sending unit 504 is configured to send the pending request to another IO device through a first channel, where the first channel is a transmission channel between a first computing device and a second computing device, the IO device 500 is the IO device deployed in the first computing device, the other IO device is the IO device deployed in the second computing device, the first computing device and the second computing device are any two computing devices among the computing devices included in the computing cluster, and the first computing device and the second computing device communicate through remote direct memory access technology.
  • the IO device in the embodiment of the present application can be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD). The above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the IO device and its modules can also be software modules.
  • Fig. 6 is a structural block diagram of an IO device provided according to an embodiment of the present application.
  • the IO device 600 shown in FIG. 6 includes: a processor 601 , a memory 602 and a communication interface 603 , and the processor 601 , the memory 602 and the communication interface 603 communicate through a bus 604 .
  • the receiver 605 is used to receive pending requests from the host, and the sender 606 is used to send the pending requests stored in the memory 602 to another computing device in the computing cluster.
  • Processor 601 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 601 or by instructions in the form of software.
  • the steps of the methods disclosed in the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • Software modules may be located in memory 602, which may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • by way of example rather than limitation, many forms of RAM are available, such as:
  • static RAM (SRAM)
  • dynamic RAM (DRAM)
  • synchronous DRAM (SDRAM)
  • double data rate SDRAM (DDR SDRAM)
  • enhanced SDRAM (ESDRAM)
  • synchlink DRAM (SLDRAM)
  • direct rambus RAM (DR RAM)
  • the memory 602 may store instructions for executing the method performed by the IO device in the foregoing embodiments.
  • the processor 601 can execute the instructions stored in the memory 602 in conjunction with other hardware (such as the receiver 605 and the transmitter 606) to complete the steps of the IO device in the above embodiments; for the specific working process and beneficial effects, refer to the descriptions in the above embodiments.
  • Memory can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • by way of example rather than limitation, many forms of RAM are available, such as:
  • static RAM (SRAM)
  • dynamic RAM (DRAM)
  • synchronous DRAM (SDRAM)
  • double data rate SDRAM (DDR SDRAM)
  • enhanced SDRAM (ESDRAM)
  • synchlink DRAM (SLDRAM)
  • direct rambus RAM (DR RAM)
  • bus 604 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity of illustration, the various buses are labeled as bus 604 in the figure.
  • the embodiment of the present application also provides a chip system, the chip system includes a logic circuit, the logic circuit is used to couple with the input/output interface and transmit data through the input/output interface, so as to perform the steps performed by the IO device in the above embodiments.
  • each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in a processor or an instruction or program code in the form of software.
  • the steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware. To avoid repetition, no detailed description is given here.
  • the processor in the embodiment of the present application may be an integrated circuit chip, which has a signal processing capability.
  • each step of the above-mentioned method embodiments may be implemented by an integrated logic circuit of hardware in a processor or instructions or program codes in the form of software.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the present application further provides a computing device, which includes the aforementioned host and the IO device.
  • the present application also provides a computing cluster, including a plurality of the computing devices described above, each of which includes the aforementioned IO device and the aforementioned host.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (eg, coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for transmitting data and an input/output (IO) device. The method is applied to a computing cluster, the computing cluster includes a plurality of computing devices, a first computing device and a second computing device are any two of the plurality of computing devices, and the first computing device and the second computing device communicate over a first channel. A first IO device is deployed in the first computing device, and a second IO device is deployed in the second computing device. The first IO device obtains a pending request, stores the pending request according to a storage policy, and then sends the pending request to the second IO device over the first channel. The storage policy indicates how the pending request is stored in the first IO device. In this technical solution, the first IO device can determine how to store pending requests according to the storage policy, and can therefore use its storage space more efficiently.

Description

Method for transmitting data and input/output device
This application claims priority to Russian Federation patent application No. RU2021127325, filed with the Patent Office of the Russian Federation on September 17, 2021 and entitled "Method for transmitting data and input/output device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communication technologies, and more specifically, to a method for transmitting data and an input/output device.
Background
Remote Direct Memory Access (RDMA) is a technology developed to reduce the data-processing latency of computing devices in network transmission. With RDMA, data can be transferred directly from the memory of one computing device to another computing device without involving either operating system. This enables high-throughput, low-latency network communication and is therefore especially widely used in computing clusters.
A computing cluster usually includes two or more computing devices, and each computing device runs multiple processes. Assume the computing cluster includes N_node computing devices and each computing device runs P processes (N_node is a positive integer greater than or equal to 2, and P is a positive integer greater than or equal to 1). To achieve full interconnection in the cluster, each computing device needs to establish (N_node − 1) × P × P queue pairs (QPs), and each queue pair contains a send queue (SQ). This consumes a large amount of memory. For example, if the cluster includes 10 computing devices in total, the memory footprint of the input/output (IO) device in each computing device can reach the terabyte (TB) level.
Summary
This application provides a method for transmitting data and an input/output device, which can support larger-scale computing clusters.
According to a first aspect, this application provides a data transmission method applied to a computing cluster. The computing cluster includes a plurality of computing devices; a first computing device and a second computing device are any two of the plurality of computing devices, and the first computing device and the second computing device communicate over a first channel. A first IO device is deployed in the first computing device, and a second IO device is deployed in the second computing device. The first IO device obtains a pending request, stores the pending request according to a storage policy, and then sends the pending request to the second IO device over the first channel. The storage policy indicates how the pending request is stored in the first IO device.
In the above technical solution, the first IO device determines how to store a pending request according to the storage policy instead of simply saving the pending request in its own storage space. The storage space of the first IO device can thus be used more efficiently, so that less storage is needed to hold pending requests that must be sent to other computing devices in the cluster. With unchanged IO-device storage, an IO device using this technical solution can therefore communicate with more IO devices, supporting larger computing clusters and improving RDMA scalability.
In a possible implementation, the first IO device storing the pending request according to the storage policy includes: the first IO device determining an identifier of the destination computing device of the pending request; and the first IO device determining, according to the identifier of the destination computing device, a first shared send queue (SSQ) for storing the pending request, where the first SSQ is configured to store pending requests associated with the destination computing device.
With this technical solution, the IO device can allocate the SSQ storage space for pending requests according to the number of computing devices in the computing cluster.
For example, if the computing cluster includes N computing devices, the first IO device can allocate N − 1 storage areas in one-to-one correspondence with N − 1 channels, which are the communication channels between the first computing device and the other N − 1 computing devices in the cluster. Each of the N − 1 storage areas can store N_cos SSQs, where N_cos is the number of classes of service supported by the first IO device and is a positive integer greater than or equal to 1; in general, N_cos does not exceed 8. The first IO device therefore needs to establish at most N_cos × (N − 1) SSQs. Because N_cos is usually smaller than the number of processes run by a computing device, with unchanged IO-device storage an IO device using this technical solution can communicate with more IO devices than in the prior art, supporting larger computing clusters and improving RDMA scalability.
The case above, where the N − 1 storage areas correspond one-to-one to the N − 1 channels, may be called static association. In a possible implementation of the first aspect, the SSQs in the first IO device may instead be dynamically associated with channels. Before the first IO device determines, according to the identifier of the destination computing device, the first shared send queue (SSQ) for storing the pending request, the method further includes: the first IO device creating a first SSQ set, where the first SSQ set includes at least one SSQ and the at least one SSQ includes the first SSQ; and the first IO device binding the first SSQ set to the first channel, where the first SSQ set is configured to store pending requests sent over the first channel.
In the dynamic-management implementation, the first IO device can create an SSQ resource pool whose storage footprint can be smaller than that of the N − 1 storage areas. Before pending requests to be sent over the first channel need to be stored, the first IO device can select the first SSQ set from the SSQ resource pool and bind the first SSQ set to the first channel. When there is no pending request to be sent over the first channel, the first IO device can unbind the first SSQ set from the first channel, so that the SSQs in the first SSQ set are returned to the SSQ resource pool for use by pending requests sent over other channels. This further reduces the storage space occupied by SSQs, further increasing the supported cluster scale and RDMA scalability.
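The dynamic bind/unbind cycle described above can be sketched in a few lines. This is an illustrative sketch only; the class and method names (`SSQPool`, `bind`, `unbind`) and the pool size are invented for the example and are not prescribed by this application.

```python
# Illustrative sketch of dynamic SSQ-to-channel association.

class SSQPool:
    """A fixed pool of SSQs that channels borrow on demand and return when idle."""

    def __init__(self, total_ssqs):
        self.free = [f"ssq-{i}" for i in range(total_ssqs)]
        self.bound = {}  # channel id -> SSQ set currently bound to that channel

    def bind(self, channel, n_cos):
        # Borrow one SSQ per class of service to form an SSQ set for this channel.
        if len(self.free) < n_cos:
            raise RuntimeError("SSQ pool exhausted")
        ssq_set = [self.free.pop() for _ in range(n_cos)]
        self.bound[channel] = ssq_set
        return ssq_set

    def unbind(self, channel):
        # Return the channel's SSQs to the pool for reuse by other channels.
        self.free.extend(self.bound.pop(channel))

pool = SSQPool(total_ssqs=8)
set1 = pool.bind("channel-1", n_cos=4)   # requests pending for one peer device
set2 = pool.bind("channel-2", n_cos=4)   # requests pending for another peer
pool.unbind("channel-1")                 # channel 1 goes idle; its SSQs are recycled
set3 = pool.bind("channel-5", n_cos=4)   # the freed SSQs now serve channel 5
```

Note that the pool holds only 8 SSQs, yet over time it serves three different channels; a statically associated layout would have needed a separate SSQ set per channel.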
In another possible implementation, when the first SSQ set includes a plurality of SSQs, the plurality of SSQs correspond one-to-one to a plurality of classes of service (CoS), and the CoS corresponding to the first SSQ is the same as the CoS of the pending request.
This technical solution supports CoS, so pending requests with different CoS values can be stored into the corresponding SSQs.
In another possible implementation, when the processing time of a pending request in the first SSQ exceeds a preset threshold, the pending request in the first SSQ is stored into a pending shared send queue (PSSQ).
This technical solution moves pending requests blocked at the head of an SSQ into the PSSQ, so that subsequent entries of the SSQ can be processed first, effectively reducing the probability of head-of-line blocking.
In another possible implementation, the first IO device obtaining the pending request includes: the first IO device obtaining the pending request from a first submission queue, where the first submission queue is configured to store the pending requests of its associated process, and the first submission queue is stored in the memory of the first computing device.
In another possible implementation, the first IO device includes at least one of a network interface controller, a smart network interface controller, a host bus adapter, a host channel adapter, an accelerator, a data processing unit, a graphics processing unit, an artificial intelligence device, or software-defined infrastructure.
According to a second aspect, this application further provides an IO device that includes units for implementing the first aspect or any possible implementation of the first aspect.
According to a third aspect, this application further provides a computer device including a processor, the processor being configured to couple with a memory and to read and execute instructions and/or program code in the memory to perform the first aspect or any possible implementation of the first aspect.
Optionally, the memory may also be used to store SSQs.
According to a fourth aspect, this application further provides a chip system including a logic circuit, the logic circuit being configured to couple with an input/output interface and to transmit data through the input/output interface to perform the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, this application further provides a computer-readable storage medium storing program code that, when run on a computer, causes the computer to perform the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, this application further provides a computer program product including computer program code that, when run on a computer, causes the computer to perform the first aspect or any possible implementation of the first aspect.
On the basis of the implementations provided in the above aspects, further combinations can be made to provide more implementations.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a computing cluster.
Fig. 2 is a schematic diagram of a computing device according to an embodiment of this application.
Fig. 3 is a schematic flowchart of a data transmission method according to an embodiment of this application.
Fig. 4 is a schematic flowchart of a data transmission method according to an embodiment of this application.
Fig. 5 is a schematic structural block diagram of an IO device according to an embodiment of this application.
Fig. 6 is a structural block diagram of an IO device according to an embodiment of this application.
Detailed Description
The technical solutions in this application are described below with reference to the accompanying drawings.
A chip in the embodiments of this application may be a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processing circuit (DSP), an application processor (AP), or another integrated chip.
A computing cluster (or simply a cluster) is a computing system that connects a group of computing devices so that they cooperate closely to complete computing work. A single computing device in a computing cluster may be called a node. Fig. 1 is a schematic diagram of a computing cluster. As shown in Fig. 1, the computing cluster 100 includes six computing devices: computing device 111, computing device 112, computing device 113, computing device 114, computing device 115, and computing device 116. Computing devices 111 to 116 are connected through a network 120.
The computing devices in Fig. 1 are described below with reference to Fig. 2. The computing device 200 shown in Fig. 2 may be any one of computing devices 111 to 116 shown in Fig. 1.
The computing device 200 shown in Fig. 2 includes a host 210, an IO interconnect channel 220, and an IO device 230, where the host 210 can be connected to the IO device 230 through the IO interconnect channel 220.
The host 210 may be the computing core and control core, the final execution unit of information processing and program running. The host 210 includes a processor 211 and a first memory 222. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The processor 211 may also be a system on chip (SoC) or an embedded processor. The processor 211 has functions such as processing instructions, performing operations, and processing data, and can allocate independent memory resources to multiple processes so as to run them. The first memory 222 may be implemented by random access memory (RAM) or another storage medium and may be used to store the program code of multiple processes.
The IO interconnect channel 220 is the interconnection mechanism between the host 210 and the IO device 230, for example, Peripheral Component Interconnect Express (PCIe), Compute Express Link (CXL), Cache Coherent Interconnect for Accelerators (CCIX), or the unified bus (UB or Ubus).
The IO device 230 is hardware capable of data transfer with the host 210 and is configured to receive and execute pending requests sent by the host 210. The IO device 230 may be at least one of a network interface controller (NIC), a smart NIC, a host bus adapter (HBA), a host channel adapter (HCA), an accelerator, a data processing unit (DPU), a graphics processing unit (GPU), an artificial intelligence (AI) device, software-defined infrastructure (SDI), and so on. The IO device 230 may include a second processor 231 and a second memory 232. The second memory 232 may be implemented by random access memory (RAM) or another storage medium.
An embodiment of this application is described below with reference to Fig. 3. For ease of description, in describing the embodiment shown in Fig. 3, assume that the computing device 200 shown in Fig. 2 is the computing device 116 in the computing cluster 100 shown in Fig. 1. Also for ease of description, assume that P processes, called process 1 to process P, run in the host 210. As shown in Fig. 3, with P processes in the host 210, the host 210 creates P submission queue (SuQ or SQ) sets, called SuQ set 1 to SuQ set P, which are associated with the P processes respectively. In other words, SuQ set 1 is associated with process 1, SuQ set 2 with process 2, SuQ set 3 with process 3, and so on. Each of the P SuQ sets is configured to store the pending requests of its associated process. A pending request stored in a SuQ may be called a submission queue element (SuQE).
For example, process 1 creates work request (WR) 1. WR 1 is converted into SuQEs and stored in SuQ set 1.
In addition, as shown in Fig. 3, each SuQ set includes four SuQs, which correspond one-to-one to four classes of service (CoS). A SuQE is assigned a CoS when it is created. When a SuQE is stored into its associated SuQ set, it can be stored into the corresponding SuQ according to its CoS.
Taking WR 1 again as an example, assume WR 1 is converted into three SuQEs, namely SuQE 1, SuQE 2, and SuQE 3, each with CoS level 1; then SuQE 1 to SuQE 3 can be stored into SuQ 1 of SuQ set 1.
Assume process 2 creates work request (WR) 2. WR 2 is converted into SuQE 4 to SuQE 6, each with CoS level 4, so SuQE 4 to SuQE 6 are stored into SuQ 4 of SuQ set 2.
In the above technical solution, the host-side SuQ sets are created per process and support QoS, and the SuQs buffer host-side WRs on their way to the IO device.
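The per-process, per-CoS placement above can be sketched as follows. The four-CoS layout follows the Fig. 3 example; the data structures and the `submit` helper are invented purely for illustration.

```python
N_COS = 4  # the Fig. 3 example uses four classes of service per SuQ set

# One SuQ set per process; each set holds one SuQ (a simple list here) per CoS level.
suq_sets = {p: {cos: [] for cos in range(1, N_COS + 1)} for p in (1, 2)}

def submit(process, wr_name, n_suqes, cos):
    """Convert a work request into SuQEs and store them in the SuQ of the
    submitting process's SuQ set that matches the request's CoS."""
    for i in range(1, n_suqes + 1):
        suq_sets[process][cos].append(f"{wr_name}-SuQE{i}")

submit(process=1, wr_name="WR1", n_suqes=3, cos=1)  # lands in SuQ 1 of SuQ set 1
submit(process=2, wr_name="WR2", n_suqes=3, cos=4)  # lands in SuQ 4 of SuQ set 2
```

The point of the sketch is the two-level routing: the submitting process selects the SuQ set, and the request's CoS selects the SuQ within that set.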
After storing SuQEs into the SuQs, the host 210 can notify the IO device 230 through a doorbell to process the SuQEs in the SuQs.
The IO device 230 determines the destination computing device of a SuQE and then stores the SuQE into the corresponding shared send queue (SSQ) set. A shared send queue may also be called an active queue (AQ).
In some embodiments, the SSQ sets created by the IO device 230 are statically associated with channels. In other words, the number of SSQ sets created by the IO device 230 equals the number of channels, with each SSQ set associated with one channel.
For example, the computing cluster shown in Fig. 1 includes six computing devices in total. The computing device 200 (that is, computing device 116) therefore needs five channels, which may be called channel 1 to channel 5, to communicate with computing devices 111 to 115. In other words, computing device 116 communicates with computing device 111 over channel 1, with computing device 112 over channel 2, with computing device 113 over channel 3, and so on. In the static-association scenario, the IO device 230 can create five SSQ sets associated with the five channels respectively. If the IO device 230 determines that the destination computing device of a SuQE is computing device 111, the IO device 230 can store that SuQE into the SSQ set associated with channel 1, and can send the pending requests in that SSQ set to computing device 111 over channel 1. A pending request in an SSQ may be called a shared send queue element (SSQE).
In other embodiments, the SSQ sets created by the IO device 230 are dynamically associated with channels. In this case, the IO device 230 can create an SSQ resource pool containing at least N_CoS SSQs, where N_CoS is the number of SuQs included in one SuQ set and is a positive integer greater than or equal to 1. After receiving a doorbell notification from the host 210, the IO device 230 can take SSQ resources from the SSQ resource pool, create an SSQ set, and bind the SSQ set to the channel corresponding to the destination computing device of the pending request.
For example, the embodiment shown in Fig. 3 is a scenario in which SSQ sets are dynamically associated with channels. As shown in Fig. 3, the IO device 230 determines that there are pending requests to be sent to computing device 111, pending requests to be sent to computing device 112, and pending requests to be sent to computing device 115. In this case, the IO device 230 creates SSQ set 1 and binds it to channel 1, creates SSQ set 2 and binds it to channel 2, and creates SSQ set 3 and binds it to channel 5. In this way, pending requests destined for computing device 111 can be stored in SSQ set 1, those destined for computing device 112 in SSQ set 2, and those destined for computing device 115 in SSQ set 3.
An SSQ in an SSQ set can be implemented in either of the following two ways:
Way 1: the SSQ is implemented as a ring buffer.
If the SSQ is implemented as a ring buffer, the SSQEs are executed in order. For example, assume the storage space allocated to the SSQ can hold eight SSQEs in total, namely SSQE 1 to SSQE 8, where SSQE 1 is the first SSQE, SSQE 2 is the second, and so on. With a ring-buffer implementation, SSQE 1 to SSQE 8 are sent in sequence; after SSQE 8 has been sent, sending continues with the new SSQE stored in the buffer space that originally held SSQE 1. In other words, in a ring-buffer implementation, once the storage location of the first SSQE is determined, the storage locations of the subsequent SSQEs to send are determined, and after the last SSQE has been processed, the new SSQE stored in the first SSQE's location can be processed next.
Way 2: the SSQ is implemented as a linked list.
If the SSQ is implemented as a linked list, each SSQE records information about the next SSQE to process (which may be called link information). Assume the storage space allocated to the SSQ can hold eight SSQEs in total, namely SSQE 1 to SSQE 8. With a linked-list implementation, the first SSQE sent might be SSQE 2; if the link information in SSQE 2 indicates that the next SSQE to send is SSQE 6, SSQE 6 can be sent after SSQE 2; and if the link information in SSQE 6 indicates that the next one is SSQE 1, SSQE 1 can be sent after SSQE 6.
The SuQ mentioned above and the PSSQ mentioned below can likewise be implemented as a ring buffer or a linked list, in the same way as the SSQ; for brevity, details are not repeated here.
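The two traversal orders can be contrasted with a short sketch. The helper names and element layout below are assumptions made for illustration; the application itself leaves the concrete data structures open.

```python
# Way 1: ring buffer -- elements are consumed strictly in slot order,
# wrapping back to slot 0 after the last slot has been sent.
def ring_order(n_slots, n_sends):
    return [send % n_slots for send in range(n_sends)]

# Way 2: linked list -- each element carries the index of the next one,
# so the send order is whatever the link information dictates.
def linked_order(links, head):
    order, cur = [], head
    while cur is not None:
        order.append(cur)
        cur = links[cur]
    return order

print(ring_order(8, 10))               # slots 0..7, then back to 0, 1
links = {2: 6, 6: 1, 1: None}          # SSQE 2 -> SSQE 6 -> SSQE 1, as in the text
print(linked_order(links, head=2))     # [2, 6, 1]
```

The ring buffer needs no per-element metadata but fixes the order; the linked list spends space on link information in exchange for arbitrary ordering.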
In the scenario where SSQ sets are statically associated with channels, the total memory occupied by SSQs in the IO device 230 is determined by the number of CoS levels, the SSQ queue depth, the number of computing devices in the computing cluster, and the SSQE size. It can be determined by the following formula:
N_SSQ = N_CoS × (N_node − 1) × depth_queue × SSQE_size,   (Formula 1)
where N_SSQ is the total memory occupied by SSQs in the IO device 230, N_CoS is the number of CoS levels (equivalently, the number of SSQs included in one SSQ set), N_node is the number of computing devices in the computing cluster, depth_queue is the SSQ queue depth (that is, the number of SSQEs included in one SSQ), and SSQE_size is the size of one SSQE. In the scenario where SSQ sets are dynamically managed with channels, the total memory occupied by SSQs in the IO device 230 can be smaller than N_SSQ.
For a fully interconnected computing cluster, by contrast, the total memory occupied by send queues (SQs) in the IO device of each computing device is:
N_SQ = (N_node − 1) × P × P × depth_queue_SQ × WQE_size,
where N_SQ is the total memory occupied by SQs in the IO device, N_node is the number of computing devices in the cluster, P is the number of processes run by each computing device in the cluster (assuming that any two computing devices run the same number of processes and that all processes can communicate with one another), depth_queue_SQ is the SQ queue depth (that is, the number of WQEs included in one SQ), and WQE_size is the size of one WQE. The size of a WQE is the same as that of an SSQE, or their difference is smaller than a preset value; the preset value can be determined from the size of the storage space holding the WQE, or from statistics on the amount of data that service requests of the same type need to store in a WQE. The number of processes run by a computing device is usually far greater than the number of CoS levels, so N_SSQ is smaller than N_SQ. With the same IO-device memory size, the technical solutions of the embodiments of this application can therefore support larger computing clusters than existing solutions, improving RDMA scalability (Scalability).
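Formula 1 and the fully interconnected SQ formula can be compared numerically. The parameter values below (10 nodes, 4 CoS levels, 1000 processes per node, queue depth 1024, 512-byte elements) are made up for illustration and are not taken from this application.

```python
def ssq_memory(n_cos, n_node, depth, elem_size):
    # Formula 1: one SSQ per CoS level per peer channel, shared by all processes.
    return n_cos * (n_node - 1) * depth * elem_size

def sq_memory(p, n_node, depth, elem_size):
    # Fully interconnected per-process QPs: P x P SQs toward each of the
    # N_node - 1 peer devices.
    return (n_node - 1) * p * p * depth * elem_size

# Assumed example values.
n_ssq = ssq_memory(n_cos=4, n_node=10, depth=1024, elem_size=512)
n_sq = sq_memory(p=1000, n_node=10, depth=1024, elem_size=512)
print(f"SSQ total: {n_ssq / 2**20:.0f} MiB")  # 18 MiB
print(f"SQ total:  {n_sq / 2**40:.1f} TiB")   # 4.3 TiB
```

With these assumed numbers the shared-queue layout needs megabytes where the per-process layout needs terabytes, which is consistent with the TB-level footprint mentioned in the Background for a 10-device cluster.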
Similar to the SuQ sets, each SSQ set in Fig. 3 also includes four SSQs, which correspond one-to-one to the four CoS levels. In this case, the IO device 230 can store a SuQE into the corresponding SSQ of an SSQ set according to the CoS carried in the SuQE.
In some embodiments, if the processing time of an SSQE in an SSQ exceeds a preset time threshold (for example, no acknowledgment from the destination computing device is received within the preset time threshold), the SSQEs in that SSQ can be moved to a pending shared send queue (PSSQ). The pending shared send queue may also be called a pending queue (PQ).
In some embodiments, the preset time threshold can be equal to the k-th retransmission wait time, where k is a positive integer greater than or equal to 1 and smaller than the maximum retransmission count. For example, assume the retransmission wait time is determined by the following formula:
T_RTNS = T_1 × 2^k,   (Formula 3)
where T_RTNS is the wait time before the k-th retransmission, T_1 is a preset value, and k runs from 1 to Service Timeout, where Service Timeout is the preset maximum retransmission count and is a positive integer greater than or equal to 1. For example, if the value of Service Timeout is 10, the first retransmission wait time is T_1 × 2, the second is T_1 × 2^2, and so on. T_1 usually takes a value on the microsecond scale; a typical value is 4.096 microseconds, although other values such as 5 microseconds or 10 microseconds are also possible.
If the preset time threshold equals the first retransmission wait time and T_1 is 4.096 microseconds, the preset time threshold is 4.096 × 2 = 8.192 microseconds.
In other embodiments, the preset time threshold can also be set empirically, for example to 10 microseconds, 15 microseconds, or 20 microseconds.
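The exponential retransmission waits of Formula 3, and the threshold choice discussed above, can be worked through in a few lines. T_1 = 4.096 microseconds follows the typical value given in the text; the function name is invented for the sketch.

```python
T1_US = 4.096  # typical base value from the text, in microseconds

def retransmit_wait_us(k):
    """Wait time before the k-th retransmission per Formula 3: T_1 * 2**k."""
    return T1_US * (2 ** k)

print(retransmit_wait_us(1))  # first wait:  8.192 us
print(retransmit_wait_us(2))  # second wait: 16.384 us

# If the preset time threshold is chosen equal to the first retransmission wait:
preset_threshold_us = retransmit_wait_us(1)  # 8.192 us, matching the example above
```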
A pending request in a PSSQ (that is, an SSQE moved into the PSSQ) may also be called a pending shared send queue element (PSSQE).
In some embodiments, the IO device 230 can bind a PSSQ to the channel corresponding to the destination computing device and send the PSSQEs in the PSSQ over that channel. For example, assume SSQ 1 in SSQ set 3 holds two SSQEs, namely SSQE 1 and SSQE 2. If the processing time of SSQE 1 and SSQE 2 exceeds the preset time threshold, SSQE 1 and SSQE 2 can be moved to PSSQ 1 (where they may be called PSSQE 1 and PSSQE 2 respectively). The IO device 230 can bind PSSQ 1 to channel 1 and send PSSQE 1 and PSSQE 2 over channel 1.
In other embodiments, the IO device 230 can move the pending requests in the PSSQ back into the SSQ when a preset condition is met. For example, assume SSQ 1 in SSQ set 3 holds two SSQEs, namely SSQE 1 and SSQE 2, whose processing time exceeds the preset time threshold, so SSQE 1 and SSQE 2 are moved to PSSQ 1 as PSSQE 1 and PSSQE 2. The IO device 230 can bind PSSQ 1 to channel 1 and send PSSQE 1 and PSSQE 2 over channel 1. If PSSQE 1 is successfully delivered to the destination computing device, the IO device 230 can move PSSQE 2 back into SSQ 1.
Because an SSQ uses a first-in, first-out sending mechanism, if an SSQE in the SSQ remains uncompleted for a long time, the SSQEs behind it cannot be sent promptly either. With a PSSQ, some or all of the SSQEs stuck in an SSQ can be moved into the PSSQ so that the remaining SSQEs can continue to be sent; alternatively, the SSQ can be bound to another channel. This effectively reduces the probability of head-of-line blocking.
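A minimal sketch of the PSSQ relief mechanism described above. The timeout bookkeeping is simplified to an enqueue timestamp per element, and the threshold value reuses the 8.192-microsecond example; the function and field names are invented.

```python
from collections import deque

PRESET_THRESHOLD_US = 8.192  # example threshold from the embodiments above

def drain_stuck_entries(ssq, now_us):
    """Move SSQEs whose processing time exceeds the threshold into a PSSQ,
    so the SSQEs behind them are no longer blocked at the head of line."""
    pssq = deque()
    while ssq and now_us - ssq[0]["enqueued_us"] > PRESET_THRESHOLD_US:
        pssq.append(ssq.popleft())  # a stuck SSQE becomes a PSSQE
    return pssq

ssq = deque([
    {"name": "SSQE1", "enqueued_us": 0.0},   # stuck: waiting far past the threshold
    {"name": "SSQE2", "enqueued_us": 0.0},   # stuck as well
    {"name": "SSQE3", "enqueued_us": 95.0},  # recent; keeps its place in the SSQ
])
pssq = drain_stuck_entries(ssq, now_us=100.0)
print([e["name"] for e in pssq])  # ['SSQE1', 'SSQE2']
print([e["name"] for e in ssq])   # ['SSQE3']
```

After the drain, the remaining SSQ entries can continue to be sent in FIFO order while the PSSQ retries the stuck ones, which is the head-of-line-blocking relief the text describes.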
The pending requests (that is, SuQE, SSQE, WQE, and PSSQE) in the various queues (that is, SuQ, SSQ, and PSSQ) in the embodiments of this application may use the same data structure or different data structures; this application does not limit this. If SuQE, SSQE, WQE, and PSSQE use the same data structure, pending requests carrying the same data in different queues can have the same size; if they use different data structures, pending requests carrying the same data in different queues may not have exactly the same size.
It can be understood that, in the embodiment shown in Fig. 3, each SuQ set and each SSQ set includes four queues in order to implement CoS. If CoS does not need to be implemented (equivalently, the CoS count is 1), each SuQ set and each SSQ set can have only one queue.
In addition, the SSQs and PSSQs in the above embodiments are all stored in the memory of the IO device. In other embodiments, the SSQs and/or PSSQs can also be stored in the memory of the host.
Fig. 4 is a schematic flowchart of a data transmission method according to an embodiment of this application. The method shown in Fig. 4 can be applied to a computing cluster that includes a plurality of computing devices.
401: A first IO device obtains a pending request, where the first IO device is an IO device deployed in a first computing device, and the first computing device is any one of the plurality of computing devices.
402: The first IO device stores the pending request according to a storage policy, where the storage policy indicates how the pending request is stored in the first IO device.
403: The first IO device sends the pending request to a second IO device over a first channel, where the first channel is the transmission channel between the first computing device and a second computing device, the second IO device is an IO device deployed in the second computing device, the second computing device is any computing device in the plurality of computing devices other than the first computing device, and the first computing device and the second computing device communicate using remote direct memory access technology.
In the above technical solution, the first IO device determines how to store a pending request according to the storage policy instead of simply saving the pending request in its own storage space. The storage space of the first IO device can thus be used more efficiently, so that less storage is needed to hold pending requests that must be sent to other computing devices in the cluster. With unchanged IO-device storage, an IO device using this technical solution can therefore communicate with more IO devices, supporting larger computing clusters and improving RDMA scalability.
The data transmission method provided in the embodiments of this application has been described above in detail with reference to Fig. 1 to Fig. 4; the IO device provided in the embodiments of this application is described below with reference to Fig. 5 and Fig. 6.
Fig. 5 is a schematic structural block diagram of an IO device according to an embodiment of this application. The IO device 500 shown in Fig. 5 includes an obtaining unit 501, a processing unit 502, a storage unit 503, and a sending unit 504.
The obtaining unit 501 is configured to obtain a pending request.
The processing unit 502 is configured to store the pending request into the storage unit 503 according to a storage policy, where the storage policy indicates how the pending request is stored in the storage unit 503.
The sending unit 504 is configured to send the pending request to another IO device over a first channel, where the first channel is the transmission channel between a first computing device and a second computing device, the IO device is an IO device deployed in the first computing device, the other IO device is an IO device deployed in the second computing device, the first computing device and the second computing device are any two of the plurality of computing devices included in a computing cluster, and the first computing device and the second computing device communicate using remote direct memory access technology.
It should be understood that the IO device in the embodiments of this application can be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data transmission methods shown in Fig. 3 and Fig. 4 are implemented in software, the IO device and its modules can also be software modules.
For the specific functions and beneficial effects of the obtaining unit 501, the processing unit 502, the storage unit 503, and the sending unit 504, refer to the descriptions of the methods in Fig. 3 to Fig. 5; for brevity, details are not repeated here. Fig. 6 is a structural block diagram of an IO device according to an embodiment of this application. The IO device 600 shown in Fig. 6 includes a processor 601, a memory 602, and a communication interface 603, which communicate through a bus 604. A receiver 605 is configured to receive pending requests from the host, and a transmitter 606 is configured to send the pending requests stored in the memory 602 to another computing device in the computing cluster.
The methods disclosed in the above embodiments of this application can be applied to, or implemented by, the processor 601. The processor 601 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. In the implementation process, the steps of the above methods can be completed by integrated logic circuits of hardware in the processor 601 or by instructions in the form of software, and the methods, steps, and logical block diagrams disclosed in the embodiments of this application can be implemented or performed accordingly. The steps of the methods disclosed in the embodiments of this application can be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. A software module can be located in the memory 602, which may be volatile memory or non-volatile memory, or may include both. The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). The processor 601 reads the instructions in the memory 602 and completes the steps of the above methods in combination with its hardware.
The memory 602 can store instructions for performing the methods performed by the IO device in the above embodiments. The processor 601 can execute the instructions stored in the memory 602 in combination with other hardware (for example, the receiver 605 and the transmitter 606) to complete the steps of the IO device in the above embodiments; for the specific working process and beneficial effects, refer to the descriptions in the above embodiments.
The memory may be volatile memory or non-volatile memory, or may include both. The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memories of the systems and methods described herein are intended to include, without limitation, these and any other suitable types of memory.
In addition to a data bus, the bus 604 can include a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are all labeled as bus 604 in the figure.
An embodiment of this application further provides a chip system. The chip system includes a logic circuit configured to couple with an input/output interface and to transmit data through the input/output interface, so as to perform the steps performed by the IO device in the above embodiments.
In the implementation process, the steps of the above methods can be completed by integrated logic circuits of hardware in a processor or by instructions or program code in the form of software. The steps of the methods disclosed in the embodiments of this application can be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor. A software module can be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware. To avoid repetition, details are not described here again.
It should be noted that the processor in the embodiments of this application may be an integrated circuit chip with signal-processing capability. In the implementation process, the steps of the above method embodiments can be completed by integrated logic circuits of hardware in a processor or by instructions or program code in the form of software. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. A software module can be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
According to the methods provided in the embodiments of this application, this application further provides a computing device, which includes the aforementioned host and the aforementioned IO device.
This application further provides a computing cluster including a plurality of the aforementioned computing devices, each of which includes the aforementioned IO device and the aforementioned host.
The above embodiments can be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented in software, the above embodiments can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center containing one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (16)

  1. A data transmission method, wherein the method is applied to a computing cluster, the computing cluster comprises a plurality of computing devices, and the method comprises:
    a first input/output (IO) device obtaining a pending request, wherein the first IO device is an IO device deployed in a first computing device, and the first computing device is any one of the plurality of computing devices;
    the first IO device storing the pending request according to a storage policy, wherein the storage policy indicates a manner of storing the pending request in the first IO device; and
    the first IO device sending the pending request to a second IO device over a first channel, wherein the first channel is a transmission channel between the first computing device and a second computing device, the second IO device is an IO device deployed in the second computing device, the second computing device is any computing device in the plurality of computing devices other than the first computing device, and the first computing device and the second computing device communicate using remote direct memory access technology.
  2. The method according to claim 1, wherein
    the first IO device storing the pending request according to the storage policy comprises:
    the first IO device determining an identifier of a destination computing device of the pending request; and
    the first IO device determining, according to the identifier of the destination computing device, a first shared send queue (SSQ) for storing the pending request, wherein the first SSQ is configured to store pending requests associated with the destination computing device.
  3. The method according to claim 2, wherein before the first IO device determines, according to the identifier of the destination computing device, the first shared send queue (SSQ) for storing the pending request, the method further comprises:
    the first IO device creating a first SSQ set, wherein the first SSQ set comprises at least one SSQ, and the at least one SSQ comprises the first SSQ; and
    the first IO device binding the first SSQ set to the first channel, wherein the first SSQ set is configured to store pending requests sent over the first channel.
  4. The method according to claim 3, wherein when the first SSQ set comprises a plurality of SSQs, the plurality of SSQs correspond one-to-one to a plurality of classes of service (CoS), and the CoS corresponding to the first SSQ is the same as the CoS of the pending request.
  5. The method according to any one of claims 2 to 4, wherein when a processing time of a pending request in the first SSQ exceeds a preset threshold, the pending request in the first SSQ is stored into a pending shared send queue (PSSQ).
  6. The method according to any one of claims 1 to 5, wherein the first input/output (IO) device obtaining the pending request comprises:
    the first IO device obtaining the pending request from a first submission queue, wherein the first submission queue is configured to store pending requests of a process associated with it, and the first submission queue is stored in a memory of the first computing device.
  7. The method according to any one of claims 1 to 6, wherein the IO device comprises at least one of a network interface controller, a smart network interface controller, a host bus adapter, a host channel adapter, an accelerator, a data processing unit, a graphics processing unit, an artificial intelligence device, or software-defined infrastructure.
  8. An input/output (IO) device, wherein the IO device comprises an obtaining unit, a processing unit, a storage unit, and a sending unit:
    the obtaining unit is configured to obtain a pending request;
    the processing unit is configured to store the pending request into the storage unit according to a storage policy, wherein the storage policy indicates a manner of storing the pending request in the storage unit; and
    the sending unit is configured to send the pending request to another IO device over a first channel, wherein the first channel is a transmission channel between a first computing device and a second computing device, the IO device is an IO device deployed in the first computing device, the other IO device is an IO device deployed in the second computing device, the first computing device and the second computing device are any two of a plurality of computing devices comprised in a computing cluster, and the first computing device and the second computing device communicate using remote direct memory access technology.
  9. The IO device according to claim 8, wherein the processing unit is specifically configured to: determine an identifier of a destination computing device of the pending request; and determine, according to the identifier of the destination computing device, a first shared send queue (SSQ) for storing the pending request, wherein the first SSQ is configured to store pending requests associated with the destination computing device.
  10. The IO device according to claim 9, wherein the processing unit is further configured to: before determining the first SSQ according to the identifier of the destination computing device, create a first SSQ set, wherein the first SSQ set comprises at least one SSQ and the at least one SSQ comprises the first SSQ; and bind the first SSQ set to the first channel, wherein the first SSQ set is configured to store pending requests sent over the first channel.
  11. The IO device according to claim 10, wherein when the first SSQ set comprises a plurality of SSQs, the plurality of SSQs correspond one-to-one to a plurality of classes of service (CoS), and the CoS corresponding to the first SSQ is the same as the CoS of the pending request.
  12. The IO device according to any one of claims 9 to 11, wherein the processing unit is further configured to: when a processing time of a pending request in the first SSQ exceeds a preset threshold, store the pending request in the first SSQ into a pending shared send queue (PSSQ) of the storage unit.
  13. The IO device according to any one of claims 8 to 12, wherein the obtaining unit is specifically configured to obtain the pending request from a first submission queue, wherein the first submission queue is configured to store pending requests of a process associated with it, and the first submission queue is stored in a memory of the first computing device.
  14. The IO device according to any one of claims 8 to 13, wherein the IO device comprises at least one of a network interface controller, a smart network interface controller, a host bus adapter, a host channel adapter, an accelerator, a data processing unit, a graphics processing unit, an artificial intelligence device, or software-defined infrastructure.
  15. An input/output (IO) device, wherein the IO device comprises a processor, the processor being configured to couple with a memory and to read and execute instructions and/or program code in the memory to perform the method according to any one of claims 1 to 7.
  16. A computer-readable medium, wherein the computer-readable medium stores program code that, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 7.
PCT/CN2022/116822 2021-09-17 2022-09-02 Method for transmitting data and input/output device WO2023040683A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2021127325 2021-09-17
RU2021127325 2021-09-17

Publications (1)

Publication Number Publication Date
WO2023040683A1 true WO2023040683A1 (zh) 2023-03-23

Family

ID=85602413

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116822 WO2023040683A1 (zh) 2021-09-17 2022-09-02 传输数据的方法和输入输出设备

Country Status (1)

Country Link
WO (1) WO2023040683A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040049580A1 (en) * 2002-09-05 2004-03-11 International Business Machines Corporation Receive queue device with efficient queue flow control, segment placement and virtualization mechanisms
CN102831018A (zh) * 2011-06-15 2012-12-19 塔塔咨询服务有限公司 低延迟先进先出消息交换系统
CN103999068A (zh) * 2011-12-23 2014-08-20 英特尔公司 共享的发送队列
CN111277616A (zh) * 2018-12-04 2020-06-12 中兴通讯股份有限公司 一种基于rdma的数据传输方法和分布式共享内存系统
CN111865831A (zh) * 2019-04-30 2020-10-30 华为技术有限公司 数据处理的方法、网络设备、计算节点和系统
CN112764893A (zh) * 2019-11-04 2021-05-07 华为技术有限公司 数据处理方法和数据处理系统
US20210271536A1 (en) * 2020-12-23 2021-09-02 Intel Corporation Algorithms for optimizing small message collectives with hardware supported triggered operations
CN113360077A (zh) * 2020-03-04 2021-09-07 华为技术有限公司 数据存储方法及计算节点



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22869054

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022869054

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022869054

Country of ref document: EP

Effective date: 20240326

NENP Non-entry into the national phase

Ref country code: DE