WO2022110805A1 - Method, computer device and communication system for implementing collective communication - Google Patents

Method, computer device and communication system for implementing collective communication

Info

Publication number
WO2022110805A1
Authority
WO
WIPO (PCT)
Prior art keywords
communication
work request
computer device
work
request
Prior art date
Application number
PCT/CN2021/103616
Other languages
English (en)
French (fr)
Inventor
陈强
李思聪
潘孝刚
陈一都
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP21896313.0A (EP4236125A4)
Publication of WO2022110805A1
Priority to US18/324,742 (US20230300080A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/50 Queue scheduling
    • H04L47/62 Queue scheduling characterised by scheduling criteria
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/40 Bus networks
    • H04L12/40006 Architecture of a communication node
    • H04L12/40032 Details regarding a bus interface enhancer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/90 Buffering arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • the present application relates to the field of information technology, and in particular, to a method, computer device and communication system for realizing collective communication.
  • each process may receive data from several other different processes, process it accordingly, and send the processed data to other processes.
  • Embodiments of the present application provide a method, a computer device, and a communication system for implementing collective communication, so as to solve the problems of high communication delay and resource consumption in the prior art.
  • an embodiment of the present application provides a computer device, including a processor, a memory, and a host channel adapter;
  • a computer-executable program is stored in the memory
  • the processor is configured to execute the computer-executable program to achieve the following operations:
  • the host channel adapter is used for judging whether the received work request is a work request without communication dependence; a work request identified as having no communication dependence is forwarded directly, and a work request not identified as having no communication dependence is forwarded based on queue management and control.
  • by identifying work requests without communication dependence and forwarding them directly, the above-mentioned computer device avoids the communication delay caused when such work requests are controlled by the queue, and can also reduce the delay and the resulting resource consumption caused by executing the related management and control, which improves the communication performance of collective communication as a whole.
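The forwarding rule described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class `HostChannelAdapterSketch`, the field `no_comm_dependency`, and the simple FIFO `drain` step are hypothetical names and simplifications that do not appear in the application, and a real host channel adapter is hardware, not a Python object.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int
    no_comm_dependency: bool  # hypothetical flag set by the processor


class HostChannelAdapterSketch:
    """Toy model: direct forwarding versus queue-managed forwarding."""

    def __init__(self):
        self.forwarded = []           # order in which requests leave the adapter
        self.managed_queue = deque()  # requests held under queue management

    def submit(self, wr):
        if wr.no_comm_dependency:
            # Identified as having no communication dependence: forward directly.
            self.forwarded.append(wr.wr_id)
        else:
            # Otherwise the request goes through queue management and control.
            self.managed_queue.append(wr)

    def drain(self):
        # Stand-in for queue management: release queued requests in FIFO order.
        while self.managed_queue:
            self.forwarded.append(self.managed_queue.popleft().wr_id)


hca = HostChannelAdapterSketch()
for wr in [WorkRequest(1, False), WorkRequest(2, True), WorkRequest(3, True)]:
    hca.submit(wr)
hca.drain()
print(hca.forwarded)  # [2, 3, 1]
```

In this toy run, requests 2 and 3 bypass the queue and leave before request 1, which is the latency benefit the passage describes.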
  • the operation request of the collective communication is a cross-node operation request.
  • Communication between different nodes is achieved through the network.
  • the nodes include, but are not limited to, a computer-capable device, such as a computing-capable computer device or a storage-capable computer device.
  • the operation request of the cross-node collective communication may be an operation request that needs to be quickly forwarded among multiple nodes and/or needs to be precisely synchronized among multiple nodes.
  • the operation request of the collective communication is an operation request within a node, that is, a collective operation implemented between different processes or different threads in the same node.
  • before the processor converts the operation request of the collective communication into a work request, it is further configured to perform grid segmentation according to the number of communication sub-modules of the collective communication and the tasks that each communication sub-module needs to perform.
  • through grid segmentation, the task to be performed by the operation request for realizing the collective communication is allocated to run on a resource in the idle state, which can improve resource utilization.
  • the processor sending the work request to the host channel adapter includes:
  • the processor converts the work request into a format that the host channel adapter can recognize, and sends the work request to the host channel adapter through an interface between the processor and the host channel adapter.
  • the identification of the work request without communication dependency includes:
  • a second identifier is added to the communication-dependent work request, where the second identifier is used to indicate that the work request is a communication-dependent work request.
  • the host channel adapter determines whether the received work request includes the first identifier; when the received work request includes the first identifier, it determines that the work request is a work request without communication dependence; when the received work request does not include the first identifier or includes the second identifier, it determines that the work request is a work request with communication dependence.
  • the host channel adapter determines whether the received work request includes the second identifier; when the received work request includes the second identifier, it determines that the work request is a work request with communication dependence; when the received work request does not include the second identifier, it determines that the work request is a work request without communication dependence.
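The two identifier checks above can be compared side by side in a short Python sketch. The string encodings `FIRST_ID` and `SECOND_ID` and the use of a set of tags are assumptions made for illustration; the application does not specify how the identifiers are encoded, and it does not address a request carrying both identifiers.

```python
FIRST_ID = "no_dep"  # hypothetical encoding of the first identifier
SECOND_ID = "dep"    # hypothetical encoding of the second identifier


def has_dependency_checking_first_id(tags):
    """Check based on the first identifier.

    Present -> no communication dependence; absent (or second identifier
    present) -> communication dependence.
    """
    return FIRST_ID not in tags


def has_dependency_checking_second_id(tags):
    """Check based on the second identifier.

    Present -> communication dependence; absent -> no communication dependence.
    """
    return SECOND_ID in tags


for tags in (set(), {FIRST_ID}, {SECOND_ID}):
    print(sorted(tags),
          has_dependency_checking_first_id(tags),
          has_dependency_checking_second_id(tags))
```

The sketch makes one practical difference visible: an untagged work request is treated as communication-dependent under the first check and as independent under the second, so the processor and the host channel adapter would need to agree on which check is in use.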
  • the collective communication is any of the following communications: communication between one first communication sub-module and a plurality of second communication sub-modules, communication between a plurality of first communication sub-modules and one second communication sub-module, or communication between a plurality of first communication sub-modules and a plurality of second communication sub-modules.
  • the work request is a communication request between one of the first communication sub-modules and one of the second communication sub-modules.
  • the non-communication dependency is that the communication between one of the first communication sub-modules and one of the second communication sub-modules does not need to depend on other communication sub-modules.
  • the processor is further configured to execute the computer-executable program to achieve the following operations:
  • the communication mode includes the mode of data transmission between two communication sub-modules in a work request, and whether the data transmission between the two communication sub-modules depends on data sent by other communication sub-modules.
  • the mode of data transmission between two communication sub-modules in a work request is the mode of data transmission between the communication sub-module acting as the data sender and the communication sub-module acting as the data receiver in the work request.
  • whether the data transmission between the two communication sub-modules depends on the data sent by other communication sub-modules includes: whether the sending of data by the communication sub-module acting as the data sender to the communication sub-module acting as the data receiver depends on data sent by other communication sub-modules.
  • the communication manner is determined according to the communication interface between the different communication sub-modules.
  • the first communication sub-module runs through the computer device
  • the second communication sub-module runs through another computer device
  • the computer device communicates with the other computer device through a network
  • the host channel adapter is further configured to directly forward the work requests identified as having no communication dependence through the network, and forward the work requests that are not identified as having no communication dependence through the network based on queue management and control.
  • the network is an Infiniband-based network.
  • the host channel adapter for the work request that is not identified as having no communication dependency, forwards the work request based on queue management and control, including:
  • the host channel adapter loads the work requests that are not identified as having no communication dependencies into the queue, and determines whether the conditions recorded in the queue that trigger the work requests that are not identified as having no communication dependencies are satisfied;
  • the condition includes whether data sent to the communication sub-module receiving data in the work request has been received.
  • the condition for triggering the work request that does not carry the first identifier is satisfied.
  • the condition includes whether other work requests that triggered the work request are already in the queue. The condition is satisfied when other work requests that trigger the work request are already in the queue. When other work requests that trigger the work request are not in the queue, the condition is not satisfied, and the processor waits for the condition to be satisfied before triggering the work request.
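A minimal Python sketch of condition-triggered forwarding follows. It models only the first kind of condition mentioned above (whether the data the request depends on has been received); the names `QueuedRequest`, `ManagedQueue`, `waits_for_data_from` and `on_data_received` are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class QueuedRequest:
    wr_id: int
    # Hypothetical field: senders whose data must arrive before forwarding.
    waits_for_data_from: set = field(default_factory=set)


class ManagedQueue:
    """Toy queue that releases a request once its recorded condition holds."""

    def __init__(self):
        self.pending = []
        self.received_from = set()
        self.forwarded = []

    def load(self, req):
        self.pending.append(req)
        self._trigger_ready()

    def on_data_received(self, sender):
        self.received_from.add(sender)
        self._trigger_ready()

    def _trigger_ready(self):
        still_waiting = []
        for req in self.pending:
            # Condition satisfied when every awaited sender has delivered data.
            if req.waits_for_data_from <= self.received_from:
                self.forwarded.append(req.wr_id)
            else:
                still_waiting.append(req)
        self.pending = still_waiting


q = ManagedQueue()
q.load(QueuedRequest(10, {"p1", "p2"}))
q.on_data_received("p1")
print(q.forwarded)  # []
q.on_data_received("p2")
print(q.forwarded)  # [10]
```

The request stays in the queue while its condition is unsatisfied and is forwarded as soon as the last awaited payload arrives.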
  • the work request includes multiple work requests
  • the plurality of work requests include one or more work requests that are identified as having no communication dependencies, and one or more work requests that are not identified as having no communication dependencies.
  • the processor is further configured to execute the computer-executable program to achieve the following operations:
  • the number of communication sub-modules of the collective communication, the tasks to be performed by each communication sub-module, and the information of the data and transmission modes transmitted between different communication sub-modules are acquired.
  • the application may be a high performance computing (HPC) industry application, an HPC-artificial intelligence (AI) industry application, or a big data industry application.
  • the application may initiate an operation request for the collective communication through a command for initiating a collective operation.
  • the processor converting the collective communication operation request into a work request includes:
  • the operation request of collective communication is converted into a work request and a control command for executing the work request; wherein, the control command is used to control the work request to realize the collective operation.
  • the collective communication is a collective communication based on a message-passing interface (MPI).
  • the processor converts the operation request of the collective communication into a work request, which is implemented according to the MPI library stored in the memory and in combination with the information obtained from the operation request of the collective communication.
  • the processor selects an MPI collective communication interface for communication between the communication sub-modules from the MPI library, and determines different communication sub-modules in the work request according to the selected MPI collective communication interface the means of communication between them. Different MPI collective communication interfaces will select different algorithms based on factors such as network topology, the number of communication sub-modules, and the size of data transmitted.
  • the processor determines a communication mode between different communication sub-modules according to the algorithms corresponding to different MPI collective communication interfaces.
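To illustrate how a collective algorithm can determine which sends depend on other communication, the Python sketch below enumerates the sends of a binomial-tree reduction to rank 0. The binomial tree is chosen here only as a familiar example; the application does not state which algorithm an MPI collective communication interface selects, and the function name `binomial_reduce_sends` is hypothetical.

```python
def binomial_reduce_sends(n):
    """Return (sender, receiver, depends_on) for a binomial-tree reduce to rank 0.

    depends_on lists the ranks whose data the sender must receive first.
    """
    sends = []
    for rank in range(1, n):
        low_bit = rank & -rank     # lowest set bit fixes the round in which rank sends
        receiver = rank - low_bit
        children = []              # ranks that sent to `rank` in earlier rounds
        step = 1
        while step < low_bit:
            if rank + step < n:
                children.append(rank + step)
            step <<= 1
        sends.append((rank, receiver, children))
    return sends


for sender, receiver, deps in binomial_reduce_sends(8):
    label = "no communication dependency" if not deps else f"depends on {deps}"
    print(f"{sender} -> {receiver}: {label}")
```

With 8 ranks, the sends from ranks 1, 3, 5 and 7 depend on no other sub-module, while the sends from ranks 2, 6 and 4 must wait for data from their children. In the terms used above, only the latter group would need queue-managed forwarding.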
  • the host channel adapter is implemented by a network interface card (NIC), a separate chip or a chipset.
  • the communication submodule is a process or a thread.
  • an embodiment of the present application provides a communication system, where the communication system includes at least one second computer device, and the at least one second computer device communicates with the computer device in any one of the first aspects through a network.
  • an embodiment of the present application provides a method for implementing collective communication, the method comprising:
  • the above method avoids the communication delay caused by queue management and control of work requests without communication dependencies, and can reduce the delay caused by the execution of related management and control. It can improve the communication performance of collective communication as a whole.
  • the operation request of the collective communication is a cross-node operation request.
  • Communication between different nodes is achieved through the network.
  • the nodes include, but are not limited to, a computer-capable device, such as a computing-capable computer device or a storage-capable computer device.
  • the operation request of the cross-node collective communication may be an operation request that needs to be quickly forwarded among multiple nodes and/or needs to be precisely synchronized among multiple nodes.
  • the operation request of the collective communication is an operation request within a node, that is, a collective operation implemented between different processes or different threads in the same node.
  • the method before converting the operation request of the collective communication into a work request, the method further includes:
  • Grid segmentation is performed according to the number of communication sub-modules of the collective communication and the tasks that each communication sub-module needs to perform. Through grid segmentation, the task to be performed by the operation request for realizing the collective communication is allocated to run on the resource in the idle state, which can improve the utilization rate of the resource.
  • the identification of the work request without communication dependency includes:
  • a second identifier is added to the communication-dependent work request, where the second identifier is used to indicate that the work request is a communication-dependent work request.
  • the collective communication is any of the following communications: communication between one first communication sub-module and a plurality of second communication sub-modules, communication between a plurality of first communication sub-modules and one second communication sub-module, or communication between a plurality of first communication sub-modules and a plurality of second communication sub-modules.
  • the work request is a communication request between one of the first communication sub-modules and one of the second communication sub-modules.
  • the non-communication dependency is that the communication between one of the first communication sub-modules and one of the second communication sub-modules does not need to depend on other communication sub-modules.
  • the method further includes:
  • the communication mode includes the mode of data transmission between two communication sub-modules in a work request, and whether the data transmission between the two communication sub-modules depends on data sent by other communication sub-modules.
  • the mode of data transmission between two communication sub-modules in a work request is the mode of data transmission between the communication sub-module acting as the data sender and the communication sub-module acting as the data receiver in the work request.
  • whether the data transmission between the two communication sub-modules depends on the data sent by other communication sub-modules includes: whether the sending of data by the communication sub-module acting as the data sender to the communication sub-module acting as the data receiver depends on data sent by other communication sub-modules.
  • the communication manner is determined according to the communication interface between the different communication sub-modules.
  • the first communication sub-module runs through a first computer device
  • the second communication sub-module runs through a second computer device
  • the first computer device and the second computer device communicate through a network
  • the method further includes: directly forwarding the work request identified as having no communication dependence through the network, and forwarding the work request not identified as having no communication dependence through the network after queue management and control.
  • the network is an Infiniband-based network.
  • the forwarding based on queue management and control for the work request that is not identified as having no communication dependency includes:
  • the condition includes whether data sent to the communication sub-module receiving data in the work request has been received.
  • the condition for triggering the work request that does not carry the first identifier is satisfied.
  • the condition includes whether other work requests that triggered the work request are already in the queue. The condition is satisfied when other work requests that trigger the work request are already in the queue. When other work requests that trigger the work request are not in the queue, the condition is not satisfied, and the processor waits for the condition to be satisfied before triggering the work request.
  • the work request includes multiple work requests
  • the plurality of work requests include one or more work requests that are identified as having no communication dependencies, and one or more work requests that are not identified as having no communication dependencies.
  • the method further includes:
  • the number of communication sub-modules of the collective communication, the tasks to be performed by each communication sub-module, and the information of the data and transmission modes transmitted between different communication sub-modules are acquired.
  • the converting the collectively communicated operation request into a work request includes:
  • the collective operation is converted into a work request and a control command for executing the work request; wherein the control command is used to control the work request to implement the collective operation.
  • the collective communication is MPI-based collective communication.
  • the host channel adapter is implemented by a NIC, a separate chip or chipset.
  • the communication submodule is a process or a thread.
  • an embodiment of the present application provides a computer program product containing instructions, which, when the computer program product runs on a computer device, causes the computer device to execute the method described in any one of the third aspects above.
  • an embodiment of the present application provides a computer-readable storage medium, wherein an instruction is stored in the computer-readable storage medium, and the instruction instructs a computer device to execute the method described in any one of the third aspects above.
  • FIG. 1 is a schematic structural diagram of implementing collective communication according to an embodiment of the present application
  • FIG. 2 is a logical schematic diagram of a data flow during 8 inter-process communications provided by an embodiment of the present application
  • FIG. 3A is a schematic structural diagram of a computer device 300 according to an embodiment of the present application.
  • FIG. 3B is a schematic structural diagram of a system provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a logical structure of a program or an instruction to be executed by the processor 301 in an embodiment of the present application;
  • FIG. 5A is a schematic flowchart of a method for identifying a work request without communication dependency provided by an embodiment of the present application
  • FIG. 5B is a schematic flowchart of a process of identifying a work request without communication dependency by the control module 3021 according to an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a specific structure of a host channel adapter 303 provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a communication system provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a method for implementing collective communication according to an embodiment of the present application.
  • first, second and the like in the description and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that data so used may be interchanged under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein.
  • first and second are only used for descriptive purposes, and should not be construed as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as “first” or “second” may expressly or implicitly include one or more of that feature.
  • the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to the process, method, product or device.
  • the naming or numbering of the steps in this application does not mean that the steps in the method flow must be executed in the time/logical sequence indicated by the naming or numbering; the execution order of the named or numbered process steps can be changed according to the technical purpose to be achieved, as long as the same or similar technical effects can be achieved.
  • the division of units in this application is a logical division.
  • determining B according to A does not mean that B is only determined according to A, and B may also be determined according to A and/or other information.
  • the term “if” may be interpreted to mean “when” or “upon” or “in response to determining” or “in response to detecting.”
  • the phrases "if it is determined..." or "if a [stated condition or event] is detected" can be interpreted to mean "when determining..." or "in response to determining..." or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]".
  • references throughout the specification to "one embodiment", "an embodiment" and "one possible implementation" mean that a particular feature, structure, or characteristic related to the embodiment or implementation is included in at least one embodiment of the present application.
  • appearances of "in one embodiment" or "in an embodiment" or "one possible implementation" in various places throughout this specification are not necessarily referring to the same embodiment.
  • the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
  • Parallel computing is based on the idea that large problems can be divided into smaller problems, and these smaller problems can be solved simultaneously (parallel) using existing resource capabilities. The solution eventually leads to the solution of the big problem.
  • Parallel computing is relative to serial computing.
  • the characteristic of serial computing is that the processor runs the computing algorithm in sequence according to the instruction sequence.
  • Parallel computing can be divided into two types: parallel in time and parallel in space.
  • Temporal parallelism refers to the pipeline technology used in the central processing unit of the computer, which divides each instruction into multiple steps to complete, and these steps can be overlapped in time.
  • Spatial parallelism refers to the concurrent execution of computer instructions by multiple processors to speed up problem solving.
  • the advantage of parallel computing is that it can break through the limitation of the computing power of serial computers, improve the computing speed, complete computing tasks in a shorter time, better utilize the computing power of hardware, and save computing costs.
  • HPC refers to a complete set of computer systems with a certain level of computing power. Because it is difficult for a single processor to achieve such powerful computing power, HPC requires multiple central processing units (CPUs) or multiple hosts (such as multiple computer devices) working together. The main purpose of building a high-performance computing system is to increase computing speed. To reach a computing speed of trillions of floating-point operations per second (teraflops), very high requirements are placed on the system's processors, memory bandwidth, computing method, input/output (I/O), storage and other aspects, each of which directly affects the operating speed of the system. HPC is mainly used to quickly complete data-intensive, computing-intensive and I/O-intensive calculations in the fields of scientific research, engineering design, finance, industry, and social management.
  • Typical applications include: bioengineering, new drug development, petroleum geophysical exploration, vehicle design (aerospace, ships, automobiles), material engineering, nuclear explosion simulation, cutting-edge weapon manufacturing, cryptographic research, and various large-scale information processing.
  • the goals of high-performance computing are: to minimize the time needed to complete a particular computational problem, to maximize the size of the problem that can be completed within a specified time, to handle large numbers of complex problems that were previously infeasible, to improve cost-effectiveness, and to scale to fit medium-sized problems and budgets.
  • MPI is a message passing interface standard for developing parallel programs based on message passing, and its purpose is to provide users with a practical, portable, efficient and flexible message passing interface.
  • MPI can be applied to a variety of system architectures, such as distributed/shared memory multi-core processors, high-performance networking, and combinations of these architectures.
  • MPI is also a parallel programming function library, and its compilation and operation need to be combined with a specific programming language.
  • MPI is implemented on all major operating systems, including Windows and Linux.
  • MPI can be a process-level parallel software middleware.
  • the MPI framework manages all computing processes to form a system, and then provides rich inter-process communication functions. A process is a running instance of a program.
  • MPI can support a variety of different communication protocols, such as InfiniBand or the transmission control protocol (TCP). MPI encapsulates these protocols, provides a set of unified communication interfaces, and shields the underlying communication details.
  • the MPI management framework assigns a process identification number (rank) to each process, and ranks are numbered consecutively starting from 0. Which part of the work each process of the MPI program completes is determined by its process identification number.
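As an illustration of how a rank can determine which part of the work a process completes, the Python sketch below splits a range of items into contiguous blocks by rank. The function `block_range` and the block-partitioning scheme are assumptions chosen for illustration; in a real MPI program the rank and the number of processes would come from the MPI library (for example `MPI_Comm_rank` and `MPI_Comm_size`) rather than from a loop.

```python
def block_range(rank, size, n_items):
    """Return the half-open range [start, stop) of items owned by `rank`.

    Items are split into `size` contiguous blocks; the first
    `n_items % size` ranks receive one extra item.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Simulate 4 processes sharing 10 items.
for rank in range(4):
    print(rank, block_range(rank, 4, 10))
```

Each simulated rank obtains a disjoint slice, and together the slices cover all items exactly once.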
  • the MPI process needs to communicate in the communication domain.
  • the communication domain is the communication environment between processes, including process group, context, virtual topology, etc. When MPI is started, the system will establish a global communication domain, and each process is in this global communication domain. Inter-process communication needs to specify the parameters of the communication domain.
  • Collective communication is also called group communication.
  • An important difference between it and point-to-point communication is that multiple processes participate in the communication at the same time, which is different from point-to-point communication that only involves two processes: the sender and the receiver.
  • Which processes participate in collective communication and the context of collective communication are defined by the communication domain of the collective communication call.
  • Collective communication generally includes three functions: communication, synchronization and computation. Among them, the communication function mainly completes the transmission of data within the set, the synchronization function realizes the consistency of the execution progress of all processes in the set at a specific point, and the calculation function is the operation on specific data.
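The three functions named above can be seen together in a small simulation. The Python sketch below uses threads in one process as stand-ins for the participating processes and a shared list as a stand-in for message passing; these are simplifications for illustration, not how MPI itself is implemented, and the name `simulated_allreduce` is hypothetical.

```python
import threading


def simulated_allreduce(values):
    """Every participant contributes one value and obtains the sum of all."""
    n = len(values)
    mailbox = [None] * n
    results = [None] * n
    barrier = threading.Barrier(n)

    def worker(rank):
        mailbox[rank] = values[rank]   # communication: publish this rank's data
        barrier.wait()                 # synchronization: wait until all have published
        results[rank] = sum(mailbox)   # computation: operate on the collected data

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(simulated_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

The barrier is what guarantees that no participant starts the computation before every contribution is available.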
  • MPI collective communication is a common collective communication.
  • InfiniBand (IB) is a computer network communication standard for high-performance computing, with extremely high throughput and extremely low latency, used for data interconnection between computers. InfiniBand is also used as a direct or switched interconnect between servers and storage systems, and as an interconnect between storage systems.
  • A grid is composed of multiple independent computers that provide online computing and storage capabilities; these computer resources are distributed over a relatively wide range. By utilizing the idle computing resources in the grid, a virtual, powerful computing platform can be created. This high-performance computer makes it possible to deal with large-scale computing problems in the fields of biology, mathematics, and chemistry.
  • the grid organizes interconnected computers, and integrates all kinds of resources and services connected in the network into a virtual computer with huge capacity. For users, the grid provides infrastructure including various services and resources, and users are faced with a resource far exceeding the capacity of any single supercomputer. Not only can it use its powerful computing power to solve difficult problems, but also use the services provided by any node in the grid, no matter where the node's physical location is.
  • FIG. 1 is a schematic structural diagram for realizing collective communication.
  • Four processes (e.g., process 1, process 2, process 3, and process 4, not shown in the figure) run in the computer device 1 and send data to the computer device 2.
  • the computer device 2 receives data sent by the four processes in the computer device 1 through the queue, and writes the payload sent by each process into the receiving queue.
  • After a process (such as process 0, not shown in the figure) in the computer device 2 receives the payloads of the four processes, it performs the related processing (such as summing, or taking the maximum or minimum value) and sends the processed data.
  • The data sent by the processes running in the computer device 1 arrives at the computer device 2 out of order, and the arrival times are uncertain.
  • The computer device 2 cannot predict when each interrupt will be generated. An interrupt is generated whenever any one of the four processes running in the computer device 1 transmits data to the computer device 2 through the network.
  • Each time the computer device 2 receives a payload sent by a process, it triggers an interrupt task for the operating system.
  • The interrupt causes the operating system to stop the computing tasks running on some cores, process the interrupt, and resume the computing tasks after the interrupt has been processed.
  • The operating system of the computer device 2 must save the context of the current computing task when it processes the interrupt, so the overall overhead is relatively large. The operating system scheduler of the computer device 2 is disturbed, which produces "operating system noise", i.e., the interrupt overhead incurred when receiving messages. This noise causes a large number of processes to wait and wastes a large number of processor cycles (e.g., CPU cycles). In most cases the processing of the data is simple, yet the time overhead of interrupts and context switches exceeds the time the computer device 2 spends processing the data. The computer device 2 therefore performs such operations very inefficiently.
  • One way to solve the above problem is to introduce a queue for control, and manage the data sent by different processes uniformly through the queue. For example, when the data of the four different processes in Figure 1 are managed and controlled by the queue, the interrupt will be triggered uniformly after the completion information of the four receiving queues arrives, so as to avoid the generation of "operating system noise".
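  • The effect of queue-based control described above can be illustrated with a minimal Python sketch. The class `CoalescingReceiver`, its fields, and the function `interrupts_without_queue` are hypothetical names introduced only for illustration; they are not part of the described embodiment. The sketch assumes that one interrupt is raised only after completions from all four expected senders have arrived, whereas the uncontrolled case raises one interrupt per arrival.

```python
from dataclasses import dataclass, field


@dataclass
class CoalescingReceiver:
    """Hypothetical model: raise one interrupt once all expected senders completed."""
    expected: frozenset
    completed: set = field(default_factory=set)
    interrupts: int = 0

    def on_completion(self, sender: int) -> bool:
        """Record one completion; return True if an interrupt is raised now."""
        if sender not in self.expected:
            raise ValueError(f"unexpected sender {sender}")
        self.completed.add(sender)
        if self.completed == self.expected:
            self.interrupts += 1      # single, unified interrupt
            self.completed.clear()    # ready for the next round
            return True
        return False


def interrupts_without_queue(arrivals: list) -> int:
    """Without queue control, every arriving payload triggers an interrupt."""
    return len(arrivals)


# Payloads from processes 1 to 4 arrive out of order.
arrivals = [3, 1, 4, 2]
receiver = CoalescingReceiver(expected=frozenset({1, 2, 3, 4}))
fired = [receiver.on_completion(sender) for sender in arrivals]
```

  • In this sketch the receiver is interrupted once for the whole group instead of four times, which is the reduction in "operating system noise" that the queue-based approach aims for.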
  • FIG. 2 is a logical schematic diagram of the data flow when 8 processes communicate with one another.
  • The number in each circle is the label of a process, and a line between two circles indicates that a communication relationship exists between the two processes.
  • QP denotes a queue pair. Taking process 0 sending data to process 4 as an example, "Send QP 4" means that process 0 sends data to the QP of process 4, and "Send disable" means canceling the send enable.
  • the embodiment of the present application provides a method for implementing collective communication.
  • When the collective communication operation is performed, communication between processes that has no communication dependency need not be managed and controlled through a queue. In this way, inter-process communication without communication dependency is accelerated, the overall delay of collective communication is reduced, and the resource occupation and consumption caused by managing such communication are avoided.
  • Taking collective communication through MPI as an example, when the communication component of an upper-layer application (an HPC application, a big data application, etc.) is MPI, the overall performance of the upper-layer application can be accelerated end to end.
  • The time spent in MPI collective communication accounts for 40% of the entire end-to-end running time. If MPI collective communication is accelerated, the overall end-to-end performance of the application is accelerated as well.
  • FIG. 3A is a schematic structural diagram of a computer device 300 according to an embodiment of the present application.
  • the computer device 300 includes a processor 301 , a memory 302 , a host channel adapter (HCA) 303 and a bus 304 . Communication between the processor 301 , the memory 302 and the host channel adapter 303 is through the bus 304 .
  • the bus 304 may be a peripheral component interconnect (PCI) bus, a peripheral component interconnect express (PCIe) or an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, etc. For convenience, only one thick line is used in FIG. 3A, but it does not mean that there is only one bus or one type of bus.
  • the host channel adapter 303 is connected to other computer devices through a network.
  • The processor 301 may be a CPU, a graphics processing unit (GPU), a general-purpose GPU (GPGPU), a tensor processing unit (TPU), a data processing unit (DPU), a microprocessor (MP), a digital signal processor (DSP), or another chip with computing capability.
  • The memory 302 may include volatile memory, such as random access memory (RAM), and may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • The memory 302 stores programs or instructions, such as program code for at least one process or program code for at least one thread.
  • the memory 302 may also store data, including but not limited to the data that the computer device 300 needs to store.
  • the host channel adapter 303 includes a control unit 3031 , a first interface 3032 and a second interface 3033 .
  • The first interface 3032 is the interface through which the host channel adapter 303 communicates with the processor 301. It receives requests sent by the processor 301 through the bus 304 and converts them into a format that the host channel adapter 303 can recognize; it also converts data or messages sent by the host channel adapter 303 to the processor 301 into a format that the processor 301 can recognize.
  • the control unit 3031 receives the request sent by the processor 301 through the first interface 3032, and performs corresponding processing on the received request.
  • the second interface 3033 is the interface connecting the computer device 300 to the network.
  • the host channel adapter 303 receives requests or data sent by other computer devices through the network through the second interface 3033, or sends requests or data to other computer devices through the network.
  • the second interface 3033 may include multiple ports and be connected to the network through multiple ports, so as to realize synchronous transmission of multiple paths.
  • the host channel adapter 303 may be implemented through a NIC. In another implementation manner, the host channel adapter 303 can also be implemented by a chip set or an independent chip.
  • FIG. 3A is only for the convenience of describing the embodiments of the present application, and shows some hardware components and software components.
  • the computer device 300 may also include other hardware components, such as a hard disk, etc.; and may also include other software, such as an application program, an operating system, and the like.
  • the structural composition shown in FIG. 3A should not be taken as a limitation on the embodiments of the present application.
  • FIG. 3B is a schematic structural diagram of a system provided by an embodiment of the present application.
  • the system includes a computer device 300 and a computer device 400 , and the computer device 300 and the computer device 400 are connected through a network N100 .
  • The network N100 between the computer device 300 and the computer device 400 may be an InfiniBand-based network.
  • The InfiniBand architecture is used as an interconnection solution between the computer device 300 and the computer device 400.
  • The network N100 between the computer device 300 and the computer device 400 may also be an Ethernet network, a Remote Direct Memory Access over Converged Ethernet (RoCE) network, or the like.
  • the composition of computer device 400 is similar to that of computer device 300 , including processor 401 , memory 402 , host channel adapter 403 , and bus 404 . Communication between the processor 401 , the memory 402 and the host channel adapter 403 is through the bus 404 .
  • Stored in memory 402 are programs or instructions, such as program code comprising at least one process.
  • the host channel adapter 403 includes a control unit 4031 , a first interface 4032 and a second interface 4033 .
  • a request initiated by a process running in the computer device 300 for example, a request to send data to a process running in the computer device 400, is transmitted to the computer device 400 through the network N100.
  • the relevant processes running in the computer device 400 perform corresponding processing according to the received request.
  • one or more processes in the computer device 300 implement MPI collective communication with one or more processes in the computer device 400 through the network N100.
  • The collective communication method provided by the embodiments of the present application is further described below, taking as an example an application running in the computer device 300 that initiates an MPI collective operation to communicate with the computer device 400.
  • FIG. 4 is a schematic diagram of a logical structure of a program or instruction to be executed by the processor 301 .
  • the memory 302 includes an application program 3025 , an application interface module 3023 , a control module 3021 , a transmission module 3022 and a forwarding module 3024 .
  • the application interface module 3023, the control module 3021 and the transmission module 3022 constitute the MPI layer
  • the MPI layer is a unified communication framework implemented based on the MPI standard.
  • the application 3025 can be any application that implements collective communication, including but not limited to HPC industry applications, HPC-AI industry applications, and big data industry applications. These applications usually require a large number of computing tasks, and the execution of these large number of computing tasks usually starts multiple processes or threads, and it is necessary to call the collective communication interface of MPI for data calculation and information exchange between processes.
  • The application 3025 may be, for example, WRF (Weather Research and Forecasting) in the field of meteorology, OpenFOAM in the field of computational fluid dynamics, or VASP (Vienna Ab initio Simulation Package) in the field of molecular dynamics.
  • the application interface module 3023 is an interface between the MPI application layer and the application program 3025 , and is used to receive tasks to be executed from the application program 3025 .
  • the application interface module 3023 may receive an operation request for collective communication triggered by the application program 3025 .
  • the control module 3021 is configured to convert the operation request into a work request (WR) based on the operation request issued by the application program 3025.
  • The conversion performed by the control module 3021 includes, but is not limited to: performing grid segmentation according to the computation case to be solved, and determining the processes to be executed, the task to be executed by each process, the communication mode between the processes, and the like.
  • the control module 3021 may be a unified communication group (UCG).
  • The transmission module 3022 abstracts the architectural differences between different hardware (for example, different network cards) and provides a low-level application programming interface (API), which is used to implement collective communication.
  • the transport module 3022 may be a unified communication transport (UCT).
  • the forwarding module 3024 is used to realize the forwarding of messages or data between the API layer and the host channel adapter 303 .
  • The forwarding module 3024 may be an OpenFabrics Enterprise Distribution (OFED).
  • The control module 3021 calls the interface of the transmission module 3022 to notify the transmission module 3022 of the WR to be executed. The transmission module 3022 then calls the externally exposed interface of the OFED and rings the hardware doorbell to notify the host channel adapter 303, and sends the WR to the host channel adapter 303.
  • OFED may be an open source implementation of remote direct memory access (RDMA) and kernel bypass.
  • FIG. 5A is a schematic flowchart of a method for identifying a work request without communication dependency provided by an embodiment of the present application. As shown in Figure 5A, in conjunction with the software modules shown in Figure 4, the method includes:
  • Step 500 The application 3025 initiates an operation request for collective communication.
  • the operation request for collective communication is a collective operation request across nodes, and different nodes communicate through a network to implement collective communication.
  • the operation request of the collective communication across the nodes may be an operation request that needs to be quickly forwarded among multiple nodes and/or needs to be precisely synchronized among multiple nodes.
  • the node may be a computer device, including but not limited to a computer device that implements a computing function or a computer device that implements a storage function.
  • the operation request of the collective communication is an operation request within a node, that is, a collective operation implemented between different processes or different threads in the same node.
  • the application program 3025 may initiate the operation request of the collective communication by initiating the collective operation command.
  • the collective operation command may be an MPI command or a shared memory command.
  • the MPI command includes but is not limited to: MPI_max, MPI_min, MPI_sum, MPI_scatter, or MPI_reduce, etc.
  • Step 501 the application interface module 3023 receives the operation request initiated by the application program 3025 and forwards it to the control module 3021 .
  • the application interface module 3023 can generally process the received operation request, and send the acquired information to the control module 3021 .
  • the processing of the operation request by the application interface module 3023 includes, but is not limited to: obtaining information such as the number of collectively communicating processes, the tasks each process needs to perform, the size of the data transferred between the processes, or the communication domain.
  • After acquiring the information, the application interface module 3023 sends or transfers the information to the control module 3021. It can be understood that other software modules may also exist between the application interface module 3023 and the control module 3021, for example, software modules that implement MPI communication based on the MPI communication protocol; the application interface module 3023 can send or transfer the acquired information to the control module 3021 through these software modules. For brevity, the embodiments of this application only describe the application interface module 3023 sending or transferring the acquired information to the control module 3021.
  • Step 502 The control module 3021 converts the operation request of the collective communication into a work request according to the information obtained from the application interface module 3023.
  • The control module 3021 can also convert the collective operation into a work request and a control command for executing the work request according to the information obtained from the application interface module 3023.
  • The control command is used to control the work request so as to realize the collective operation.
  • The control module 3021 can convert the collective operation into a work request, or into a work request and a control command for executing the work request, based on the MPI library stored in the memory 302 and the information obtained from the application interface module 3023.
  • Converting the collective operation into a work request and a control command for executing the work request includes the following steps:
  • Step S1 The control module 3021 performs grid segmentation according to the number of processes in collective communication and the tasks that each process needs to perform, and allocates computer devices that run the processes of collective communication.
  • For example, a collective communication is a communication between root process 0 and 3 child processes (child process 1, child process 2, and child process 3).
  • Root process 0 needs to receive data from child process 1, child process 2, and child process 3, perform a reduction operation, and then send the reduced data to child process 1, child process 2, and child process 3.
  • The control module 3021 can assign the computer device 300 to run root process 0 and assign the computer device 400 to run child process 1, child process 2, and child process 3.
  • Root process 0 implements collective communication with child process 1, child process 2, and child process 3 through the network N100.
  • Alternatively, only the computer device 300 may be allocated to perform the task of the collective communication. Still taking the communication between root process 0 and the 3 child processes (child process 1, child process 2, and child process 3) as an example, root process 0 and child processes 1, 2, and 3 all run on the computer device 300, and root process 0 can communicate with child process 1, child process 2, and child process 3 through the host channel adapter 303.
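  • The reduce-and-return pattern in the example above can be sketched in a few lines of Python. The function `root_reduce_and_return` and its parameters are hypothetical and only model the data flow inside one program; they do not perform any network or MPI communication. The sketch assumes each child contributes a single integer.

```python
from functools import reduce
import operator


def root_reduce_and_return(child_data: dict, op=operator.add) -> dict:
    """Model root process 0: reduce the children's data, return the result to each child.

    child_data maps a child process label to the value that child sends.
    """
    reduced = reduce(op, child_data.values())
    # The root sends the same reduced value back to every child.
    return {child: reduced for child in child_data}


# Child processes 1, 2 and 3 send 10, 20 and 30 to root process 0.
summed = root_reduce_and_return({1: 10, 2: 20, 3: 30})
largest = root_reduce_and_return({1: 10, 2: 20, 3: 30}, op=max)
```

  • Note that the values the root sends back can only be produced after all three children's data has arrived, while each child can send its own value immediately; this difference is what the later identification of communication dependency relies on.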
  • Step S2 The control module 3021 determines the method for implementing collective communication between processes.
  • The control module 3021 may first select, from the MPI library, an MPI collective communication interface for inter-process communication.
  • The MPI collective communication interface includes but is not limited to: MPI_Bcast, MPI_Allreduce, or MPI_Alltoall, and the like.
  • Different MPI collective communication interfaces will select different algorithms according to factors such as network topology, the number of processes, and the size of transmitted data. Among them, commonly used algorithms include but are not limited to Binomial Tree, K-nomial Tree or Recursive doubling, etc. The algorithm is specifically used to determine the way of inter-process communication.
  • the control module 3021 selects different MPI collective communication interfaces, and determines the way of implementing collective communication between processes based on the algorithm applicable to each communication interface.
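  • As an illustration of how an algorithm determines the way of inter-process communication, the following Python sketch generates the pairwise exchange schedule of recursive doubling and uses it to simulate a sum over all processes. It is a simplified sketch that assumes a power-of-two number of processes; the function names are hypothetical, and real MPI libraries use more elaborate variants.

```python
def recursive_doubling_schedule(num_procs: int) -> list:
    """Return, per round, the list of (rank, partner) pairs that exchange data."""
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    rounds = []
    distance = 1
    while distance < num_procs:
        # In each round, rank r exchanges with rank r XOR distance.
        pairs = [(rank, rank ^ distance)
                 for rank in range(num_procs)
                 if rank < (rank ^ distance)]
        rounds.append(pairs)
        distance *= 2
    return rounds


def simulate_allreduce_sum(values: list) -> list:
    """Simulate an all-reduce (sum) in one program using the schedule above."""
    data = list(values)
    for pairs in recursive_doubling_schedule(len(data)):
        for a, b in pairs:
            total = data[a] + data[b]
            data[a] = data[b] = total  # both partners keep the combined value
    return data


schedule = recursive_doubling_schedule(8)
result = simulate_allreduce_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

  • With 8 processes the schedule has three rounds, and the last round pairs process 0 with process 4, which matches the kind of communication relationship drawn between processes in FIG. 2.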
  • Step 503 The control module 3021 identifies the work request without communication dependency, and adds an identifier for the work request without communication dependency.
  • For the process by which the control module 3021 identifies a work request without communication dependency, see the flow shown in FIG. 5B.
  • After the control module 3021 identifies a work request without communication dependency, it can add identification information to that work request to mark it as a work request without communication dependency.
  • The added identification information can take any form and can be located anywhere in the work request.
  • For example, an extended attribute may be added to the Opcode of the work request to carry the identification information.
  • A work request may be structured as shown in Table 3:
  • the control module 3021 may add an identifier: SEND_DIRECTLY to the Opcode in the work request shown in Table 3, to identify the work request as a work request without communication dependency.
  • Step 504 Send the work request converted by the control module 3021 to the host channel adapter 303 .
  • the transmission module 3022 transmits the work request converted by the control module 3021 to the forwarding module 3024 , and the forwarding module 3024 sends the work request to the host channel adapter 303 .
  • The processor 301 transmits the work request converted by the control module 3021 to the forwarding module 3024 by executing the program of the transmission module 3022, and sends the work request to the host channel adapter 303 by executing the program of the forwarding module 3024.
  • The processor 301 can convert the work request into a format recognizable by the host channel adapter 303 by executing the program of the forwarding module 3024, and send the work request in the converted format to the host channel adapter 303 through the first interface 3032.
  • After the control module 3021 converts the operation request of the collective communication, it obtains multiple work requests. Some of them are work requests with communication dependencies, and others are work requests without communication dependencies.
  • the work requests sent by the processor 301 to the host channel adapter include the work requests with no communication dependency added with the identifier, and also the work requests with communication dependencies. In this way, the host channel adapter can identify the work request without communication dependency according to the identifier, and then directly forward this part of the work request.
  • FIG. 5B is a schematic flowchart of a method for identifying whether a work request is a work request without communication dependency according to an embodiment of the present application. As shown in Figure 5B, the method includes:
  • Step 5031 The control module 3021 determines that there is a work request for inter-process communication.
  • The control module 3021 converts the collective operation into one or more work requests.
  • the control module 3021 needs to determine a work request first, and then determine whether the work request is a work request without communication dependency. Determining whether a work request is a work request without communication dependency is to determine whether the communication between processes that have a communication relationship in the work request needs to depend on other processes.
  • the following description takes as an example that the local process is process A and the peer process is process B in a work request.
  • Step 5032 The control module 3021 determines whether the local process (process A) needs to send data to the peer process (process B).
  • If the local process (process A) needs to send data to the peer process (process B), go to step 5033. If the local process (process A) does not need to send data to the peer process (process B), the local process (process A) needs to receive data sent by the peer process (process B), and step 5034 is executed.
  • Step 5033 The control module 3021 determines whether the data sent by the local process (process A) to the peer process (process B) is obtained from another process (for example, process C; process C is used below to represent other processes). If the data is not obtained from another process (process C), go to step 5035; if it is obtained from another process (process C), go to step 5036.
  • Step 5034 The control module 3021 determines whether the data sent by the peer process (process B) to the local process (process A) is obtained from another process (process C). If the data is not obtained from another process (process C), go to step 5035; if it is obtained from another process (process C), go to step 5036.
  • Step 5035 The control module 3021 identifies the work request as a work request without communication dependency.
  • the work request that the local process (process A) sends data to the opposite process (process B) is a work request without communication dependence.
  • the control module 3021 identifies the work request without communication dependency.
  • the work request without communication dependence may be identified by adding a first identification.
  • the first identifier is used to indicate that the work request is a work request without communication dependency.
  • a first identifier can be added to the Opcode field of the work request, for example, the added first identifier is IBV_SEND_DIRECTLY.
  • Step 5036 The control module 3021 identifies the work request as a communication-dependent work request or does not perform identification processing.
  • the work request that the local process (process A) sends data to the opposite process (process B) is a work request that has communication dependencies.
  • The control module 3021 may mark the communication-dependent work request, or may perform no identification processing. If no identification processing is performed, the absence of an identifier distinguishes the work request from a work request without communication dependency.
  • A communication-dependent work request may be marked by adding a second identifier to it.
  • the second identifier is used to indicate that the work request is a communication-dependent work request.
  • a second identifier may be added to the Opcode field of the work request, for example, the added second identifier may be IBV_SEND.
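  • The decision flow of steps 5032 to 5036 can be sketched in Python as follows. The `WorkRequest` fields and the function `identify` are hypothetical structures introduced for illustration; the two opcode names are the identifiers quoted in the text and are not claimed to be constants of any existing verbs library. The sketch assumes the control module already knows, for each work request, which side sends and whether the transferred data was obtained from another process.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Opcode(Enum):
    # Identifier names quoted from the text, used here only as labels.
    IBV_SEND = "IBV_SEND"                    # second identifier: has dependency
    IBV_SEND_DIRECTLY = "IBV_SEND_DIRECTLY"  # first identifier: no dependency


@dataclass
class WorkRequest:
    local: str              # local process, e.g. "A"
    peer: str               # peer process, e.g. "B"
    local_sends: bool       # step 5032: does the local process send to the peer?
    data_from_other: bool   # is the transferred data obtained from another process?
    opcode: Optional[Opcode] = None


def identify(wr: WorkRequest) -> WorkRequest:
    """Apply steps 5032 to 5036 to one work request and tag its opcode."""
    # Step 5032 selects the branch: step 5033 examines data sent by the local
    # process, step 5034 examines data sent by the peer process. Both branches
    # ask the same question about where the data comes from.
    if wr.data_from_other:
        wr.opcode = Opcode.IBV_SEND           # step 5036: communication dependency
    else:
        wr.opcode = Opcode.IBV_SEND_DIRECTLY  # step 5035: no communication dependency
    return wr


# Child process 1 sends its own data to root process 0: no dependency.
child_to_root = identify(WorkRequest("1", "0", local_sends=True, data_from_other=False))
# Root process 0 sends the reduced result, built from other processes' data.
root_to_child = identify(WorkRequest("0", "1", local_sends=True, data_from_other=True))
# Local process A receives data that peer process B produced itself.
receive_own = identify(WorkRequest("A", "B", local_sends=False, data_from_other=False))
```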
  • Although FIG. 5A and FIG. 5B are described in terms of software modules executing the corresponding steps, the processor 301 implements the corresponding functions by executing the programs of these software modules. That is, the processor 301 implements the method flows shown in FIG. 5A and FIG. 5B by executing the corresponding programs stored in the memory 302.
  • FIG. 6 is a schematic diagram of a specific structure of the host channel adapter 303 provided by an embodiment of the present application. As shown in FIG. 6, the host channel adapter 303 also includes a queue 3035 and a storage unit 3034.
  • the storage unit 3034 is connected to the control unit 3031, and is used for storing programs or codes for realizing the corresponding functions of the control unit 3031, and storing data to be processed by the control unit 3031.
  • The queue 3035 includes a plurality of work queues (WQs). Each work queue includes a plurality of work queue entries (WQEs), and each WQE contains information related to network events, for example, information for sending messages to other nodes through the network or receiving messages from other nodes through the network.
  • Work queues are implemented by at least one QP.
  • Each QP includes a receive queue (RQ) and a send queue (SQ).
  • a QP usually corresponds to a QP in a peer node, which enables point-to-point transmission.
  • the receive queue is mainly used to receive WQEs
  • the send queue is mainly used to send related WQEs.
  • the queue 3035 also includes a completion queue (CQ), and the completion queue records the completion status of the WQE.
  • Each entry in the completion queue corresponds to a WQE.
  • the completion queue may be associated with a preset group of receive queues, where the group of receive queues is used to receive messages waiting to be received.
  • a producer index (PI) is used to indicate the most recently completed entry in the completion queue.
  • the PI may also be used to indicate the latest WQE processed in the work queue.
  • The control unit 3031 determines that the awaited messages have been received by checking the PI in the completion queue.
  • For brevity, FIG. 6 shows only one SQ, one RQ, and one CQ, but this does not mean that the queue 3035 includes only these queues. The queue 3035 may include multiple SQs, multiple RQs, or multiple CQs, which are not described again here.
  • The control unit 3031 may also determine whether a work request loaded into the queue 3035 is a data work request or a management work request. If a data work request is received, the control unit 3031 determines whether the queue 3035 already holds a management work request that triggers the data work request. If so, the data work request is triggered based on the management work request. If not, the data work request is stored in the receive queue to await the management work request that triggers it. If a management work request is received, the control unit 3031 determines whether the queue 3035 already holds a data work request to be triggered by the management work request.
  • If the queue 3035 already holds a data work request to be triggered by the management work request, the data work request is triggered based on the management work request.
  • If the queue 3035 holds no such data work request, the management work request is stored in the receive queue to await the data work request that it is to trigger.
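  • The matching of data work requests and management work requests described above can be sketched as follows. The class `TriggerQueue` and the use of a shared `key` to link a management work request to the data work request it triggers are assumptions made for illustration; the text does not specify how the two are associated.

```python
class TriggerQueue:
    """Hypothetical model of pairing data and management work requests."""

    def __init__(self):
        self.pending_data = {}        # data work requests awaiting their trigger
        self.pending_management = set()  # management requests awaiting their data
        self.triggered = []           # data work requests that have been triggered

    def on_data(self, key: str, payload: str) -> None:
        """Handle an arriving data work request."""
        if key in self.pending_management:
            # The triggering management work request is already queued.
            self.pending_management.discard(key)
            self.triggered.append(payload)
        else:
            # Store the data work request and wait for its trigger.
            self.pending_data[key] = payload

    def on_management(self, key: str) -> None:
        """Handle an arriving management work request."""
        if key in self.pending_data:
            # The data work request to be triggered is already queued.
            self.triggered.append(self.pending_data.pop(key))
        else:
            # Store the management work request and wait for the data.
            self.pending_management.add(key)


queue = TriggerQueue()
queue.on_data("wr-1", "payload-1")      # data first, waits
queue.on_management("wr-1")             # trigger arrives, data is released
queue.on_management("wr-2")             # trigger first, waits
queue.on_data("wr-2", "payload-2")      # data arrives, released at once
```

  • Either arrival order leads to the same result: a data work request is released only when both it and its triggering management work request are present.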
  • The queue 3035 and the storage unit 3034 can be implemented by RAM, for example, static random access memory (SRAM) or dynamic random access memory (DRAM).
  • the queue 3035 and storage unit 3034 may be embedded in the control unit 3031 or independent of the host channel adapter 303.
  • control unit 3031 may further include a calculation sub-unit (not shown in FIG. 6 ), which is used for executing the relevant calculation tasks in the work request.
  • The calculation subunit may be an arithmetic logic unit (ALU).
  • the computing subunit may be embedded in the control unit 3031 , or may be a subunit independent of the control unit 3031 in the host channel adapter 303 and controlled by the control unit 3031 .
  • control unit 3031 and/or the calculation subunit may be implemented by a field programmable gate array (FPGA) and/or an application-specific integrated circuit (ASIC).
  • As shown in FIG. 6, after the first interface 3032 receives a work request sent by the processor 301, it passes the work request to the control unit 3031.
  • In this embodiment, the control unit 3031 determines whether the received work request is identified as having no communication dependency.
  • In one implementation, the control unit 3031 determines whether the received work request has no communication dependency by checking whether it contains the first identifier. If the work request contains the first identifier, it is confirmed to be a work request without communication dependency; if it does not, it is confirmed to be a work request with a communication dependency. For example, after receiving a work request forwarded by the first interface 3032, the control unit 3031 first checks whether the Opcode field of the work request contains IBV_SEND_DIRECTLY. If it does, the work request is confirmed to have no communication dependency. If it does not, the work request is confirmed to have a communication dependency.
  • In another implementation, the control unit 3031 determines whether the received work request has no communication dependency by checking whether it contains the second identifier. If the work request does not contain the second identifier, it is confirmed to be a work request without communication dependency; if it does, it is confirmed to be a work request with a communication dependency.
  • For a work request without communication dependency, the control unit 3031 passes it directly to the second interface 3033 so that it is sent over the network. For a work request with a communication dependency, the control unit 3031 loads it into the queue 3035, manages its sending through the queue, and, once the condition for executing it is satisfied, sends it over the network via the second interface 3033.
  • For example, the control unit 3031 may manage the sending of the work request through the queue as follows:
  • The control unit 3031 first determines whether the condition for triggering the work request, which depends on other communication, is satisfied. If the condition is not satisfied, the work request is stored in the receive queue to wait. If the condition is satisfied, the work request is triggered and sent over the network via the second interface 3033. In the example above, before process A can send data to process B, it must wait for process C to send data to process A.
  • When the control unit 3031 receives the work request for process A to send data to process B, it loads the work request into the receive queue of the queue 3035 and determines whether process A has received the data sent by process C, that is, whether the completion queue of the queue 3035 holds a completion record for process C sending data to process A. If process A has not yet received the data sent by process C, the control unit 3031 keeps the work request for process A to send data to process B in the receive queue. When the work request for process C to send data to process A completes, the completed WQE is recorded in the completion queue. When the entry indicated by the PI shows that process C has sent data to process A, the condition of the work request for process A to send data to process B is satisfied. The control unit 3031 then takes that work request out of the receive queue and sends it over the network via the second interface 3033.
  • With the implementation described above, the processor 301 identifies the work requests without communication dependency. After the host channel adapter 303 receives the work requests of a collective operation from the processor 301, it sends the work requests without communication dependency directly over the network. This avoids the communication delay incurred when such work requests pass through queue management and control, and reduces the resources that the host channel adapter 303 spends on that management and control.
  • Sending work requests without communication dependency directly over the network does trigger some additional interrupts because queue management is skipped, but the delay and resource consumption caused by these interrupts are far smaller than those caused by queue management. The implementations provided by the embodiments of this application can therefore improve the overall performance of collective communication.
  • FIG. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of the present application.
  • computer device 700 includes processor 701 , memory 702 and host channel adapter 703 .
  • the processor 701, the memory 702 and the host channel adapter 703 are connected to each other by a bus.
  • a computer-executable program is stored in the memory 702, and the processor 701 is configured to execute the computer-executable program to perform the following operations: converting an operation request for collective communication into work requests, identifying the work requests that have no communication dependency, and sending the work requests to the host channel adapter 703.
  • the host channel adapter 703 is configured to determine whether a received work request is a work request without communication dependency, to forward directly a work request identified as having no communication dependency, and to forward a work request not so identified only after queue-based management and control.
  • The computer device 700 shown in FIG. 7 can be implemented with reference to the computer device 300 shown in FIG. 3A and to the implementations shown in FIG. 5A and FIG. 5B. For example, a communication sub-module may be a process as described for FIG. 5A and FIG. 5B. Details are not repeated here.
  • The processor 701 identifies the work requests without communication dependency. After the host channel adapter 703 receives the work requests of a collective operation from the processor 701, it sends the work requests without communication dependency directly over the network. This avoids the communication delay incurred when such work requests pass through queue management and control, reduces the resources that the host channel adapter 703 spends on that management and control, and improves the overall performance of collective communication.
  • FIG. 8 is a schematic structural diagram of a communication system according to an embodiment of the present application. As shown in FIG. 8 , the communication system includes at least one second computer device 800 that communicates with the computer device 700 in FIG. 7 through a network 708.
  • The communication system shown in FIG. 8 can be implemented with reference to the system shown in FIG. 3B above.
  • The second computer device 800 may be implemented with reference to the computer device 400 in FIG. 3B.
  • There may be one or more second computer devices 800, and the computer device 700 may communicate with them through the network 708.
  • The communication between the computer device 700 and the second computer device 800 in FIG. 8 can be implemented with reference to the implementations shown in FIG. 3B, FIG. 5A, and FIG. 5B. Details are not repeated here.
  • FIG. 9 is a schematic flowchart of a method for implementing collective communication according to an embodiment of the present application. As shown in Figure 9, the method includes:
  • Step 900: Obtain an operation request for collective communication;
  • Step 901: Convert the operation request for collective communication into work requests, and identify the work requests without communication dependency;
  • Step 902: Directly forward the work requests identified as having no communication dependency, and forward the work requests not so identified after queue-based management and control.
  • The method shown in FIG. 9 can be implemented by a computer device, for example the computer device 300 shown in FIG. 3A. It can also be implemented with reference to the implementations shown in FIG. 5A and FIG. 5B above. Details are not repeated here.
  • With the method shown in FIG. 9, work requests without communication dependency are identified and sent directly over the network. This avoids the communication delay incurred when such work requests pass through queue management and control, reduces the resource consumption of performing that management and control, and improves the overall performance of collective communication.
  • The integrated modules, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • The technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention.
  • The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

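This application provides a method, a computer device, and a communication system for implementing collective communication, to address the high communication latency and resource consumption of the prior art. The method includes identifying work requests that have no communication dependency and, when forwarding work requests, directly forwarding those identified as having no communication dependency while forwarding those not so identified only after queue-based management and control. This avoids the communication latency incurred when work requests without communication dependency pass through queue management and control, reduces the resource consumption of performing that management and control, and improves the overall performance of collective communication.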

Description

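Method for implementing collective communication, computer device, and communication system
This application claims priority to Chinese Patent Application No. 202011600044.X, filed with the China Patent Office on December 29, 2020 and entitled "Method for implementing collective communication, computer device, and communication system", and to Chinese Patent Application No. 202011345108.6, filed with the China Patent Office on November 26, 2020 and entitled "Communication method and computer device", both of which are incorporated herein by reference in their entireties.
Technical Field
This application relates to the field of information technology, and in particular to a method, a computer device, and a communication system for implementing collective communication.
Background
In collective communication, handling a collective operation usually requires the participation of multiple processes. Each process may receive data from several different other processes, process that data accordingly, and send the processed data to other processes.
Communication between processes running on different computer devices relies on the network interface cards of those devices to transfer data between the processes. Because of network effects, the data packets exchanged between processes reach the receiving computer device out of order and at unpredictable times. A process running on the receiving computer device triggers an interrupt whenever it receives data. If the operating system of the receiving computer device is using all of its cores for computing tasks, receiving inter-process data forces it to suspend the computing tasks on some cores and handle the interrupt instead. The time spent on interrupts and context switches adds to the time the operating system of the receiving computer device needs to process its tasks and degrades the performance of that device.
One way to address the overall performance loss caused by these interrupts is to have the network interface card control inter-process data transfers through queues when sending data. However, this approach suffers from high collective-communication latency and resource consumption.
Summary
Embodiments of this application provide a method, a computer device, and a communication system for implementing collective communication, to address the high communication latency and resource consumption of the prior art.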
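According to a first aspect, an embodiment of this application provides a computer device including a processor, a memory, and a host channel adapter;
the memory stores a computer-executable program;
the processor is configured to execute the computer-executable program to perform the following operations:
converting an operation request for collective communication into work requests, and identifying the work requests that have no communication dependency; and sending the work requests to the host channel adapter;
the host channel adapter is configured to determine whether a received work request is a work request without communication dependency, to forward directly a work request identified as having no communication dependency, and to forward a work request not so identified only after queue-based management and control.
By identifying the work requests without communication dependency and forwarding them directly, the computer device avoids the communication latency incurred when such work requests pass through queue management and control, reduces the resource consumption of performing that management and control, and improves the overall performance of collective communication.
Optionally, the operation request for collective communication is a cross-node operation request. Different nodes communicate with one another over a network. A node includes, but is not limited to, a device with computer functions, for example a computer device with computing capability or a computer device with storage capability.
Optionally, the cross-node operation request for collective communication may be an operation request that needs to be forwarded quickly among multiple nodes and/or synchronized precisely among multiple nodes.
Optionally, the operation request for collective communication is an intra-node operation request, that is, a collective operation carried out among different processes or different threads within the same node.
Optionally, before converting the operation request for collective communication into work requests, the processor is further configured to perform grid partitioning according to the number of communication sub-modules of the collective communication and the task that each communication sub-module needs to execute. Grid partitioning assigns the tasks needed to carry out the operation request to idle resources, which improves resource utilization.
Optionally, the processor sending the work requests to the host channel adapter includes:
the processor converting the work requests into a format that the host channel adapter can recognize and sending them to the host channel adapter through the interface between the processor and the host channel adapter.
Optionally, identifying the work requests that have no communication dependency includes:
adding a first identifier to a work request without communication dependency, the first identifier indicating that the work request has no communication dependency; and/or
adding a second identifier to a work request with a communication dependency, the second identifier indicating that the work request has a communication dependency.
When a first identifier is added to work requests without communication dependency, the host channel adapter determines whether a received work request includes the first identifier. If it does, the work request is determined to have no communication dependency; if it does not include the first identifier, or includes the second identifier, the work request is determined to have a communication dependency.
When a second identifier is added to work requests with a communication dependency, the host channel adapter determines whether a received work request includes the second identifier. If it does, the work request is determined to have a communication dependency; if it does not, the work request is determined to have no communication dependency.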
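In some possible implementations, the collective communication is any one of the following: communication between one first communication sub-module and multiple second communication sub-modules, communication between multiple first communication sub-modules and one second communication sub-module, or communication between multiple first communication sub-modules and multiple second communication sub-modules.
In some possible implementations, a work request is a communication request between one first communication sub-module and one second communication sub-module.
In some possible implementations, no communication dependency means that the communication between one first communication sub-module and one second communication sub-module does not depend on any other communication sub-module.
In some possible implementations, the processor is further configured to execute the computer-executable program to perform the following operation:
identifying whether a work request has no communication dependency based on the communication mode between the different communication sub-modules in the work request.
In some possible implementations, the communication mode includes the manner in which data is transferred between the two communication sub-modules in a work request, and whether that transfer depends on data sent by another communication sub-module.
Optionally, the manner in which data is transferred between the two communication sub-modules in a work request is the manner of transfer between the communication sub-module acting as data sender and the communication sub-module acting as data receiver. Correspondingly, whether the transfer depends on data sent by another communication sub-module means whether the sender's transmission to the receiver depends on data sent by another communication sub-module.
In some possible implementations, the communication mode is determined according to the interface used for communication between the different communication sub-modules.
In some possible implementations, the first communication sub-module runs on the computer device, the second communication sub-module runs on another computer device, and the computer device communicates with the other computer device over a network;
the host channel adapter is further configured to forward a work request identified as having no communication dependency directly over the network, and to forward a work request not so identified over the network after queue-based management and control.
Optionally, the network is an InfiniBand-based network.
In some possible implementations, the host channel adapter forwarding a work request not identified as having no communication dependency after queue-based management and control includes:
the host channel adapter loading the work request not identified as having no communication dependency into a queue, and determining whether the condition recorded in the queue for triggering that work request has been satisfied;
when the condition has been satisfied, sending the work request not identified as having no communication dependency.
Optionally, the condition includes whether the data sent to the communication sub-module that receives data in the work request has been received. Once that data has been received, the condition for triggering the work request that does not carry the first identifier is satisfied.
Alternatively, the condition includes whether the other work request that triggers the work request is already in the queue. If it is, the condition is satisfied. If it is not, the condition is not satisfied, and the processor waits until the condition is satisfied before triggering the work request.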
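In some possible implementations, the work requests include multiple work requests;
the multiple work requests include one or more work requests identified as having no communication dependency and one or more work requests not so identified.
In some possible implementations, the processor is further configured to execute the computer-executable program to perform the following operations:
receiving the operation request for collective communication initiated by an application running on the computer device;
obtaining the number of communication sub-modules of the collective communication, the task that each communication sub-module needs to execute, and information about the data transferred between different communication sub-modules and the manner of transfer.
Optionally, the application may be a high performance computing (HPC) industry application, an HPC-artificial intelligence (AI) industry application, or a big data industry application.
Optionally, the application may initiate the operation request for collective communication by issuing a collective operation command.
In some possible implementations, the processor converting the operation request for collective communication into work requests includes:
converting the operation request for collective communication into work requests and control commands for executing the work requests, where the control commands are used to control the work requests so as to carry out the collective operation.
In some possible implementations, the collective communication is collective communication based on the message-passing interface (MPI) standard.
Optionally, the processor converts the operation request for collective communication into work requests according to an MPI library stored in the memory, combined with the information obtained from the operation request. In one implementation, the processor selects from the MPI library the MPI collective communication interface to be used for communication between the communication sub-modules, and determines the communication mode between the different communication sub-modules in the work requests according to the selected interface. Different MPI collective communication interfaces select different algorithms based on factors such as the network topology, the number of communication sub-modules, and the size of the data to be transferred. The processor determines how the different communication sub-modules communicate according to the algorithm corresponding to each MPI collective communication interface. In some possible implementations, the host channel adapter is implemented by a network interface card (NIC), a standalone chip, or a chipset.
In some possible implementations, a communication sub-module is a process or a thread.
According to a second aspect, an embodiment of this application provides a communication system including at least one second computer device that communicates over a network with the computer device of any implementation of the first aspect.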
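According to a third aspect, an embodiment of this application provides a method for implementing collective communication, the method including:
obtaining an operation request for collective communication;
converting the operation request for collective communication into work requests, and identifying the work requests that have no communication dependency;
directly forwarding the work requests identified as having no communication dependency; and
forwarding the work requests not identified as having no communication dependency after queue-based management and control.
By identifying the work requests without communication dependency and forwarding them directly, the method avoids the communication latency incurred when such work requests pass through queue management and control, reduces the resource consumption of performing that management and control, and improves the overall performance of collective communication.
Optionally, the operation request for collective communication is a cross-node operation request. Different nodes communicate with one another over a network. A node includes, but is not limited to, a device with computer functions, for example a computer device with computing capability or a computer device with storage capability.
Optionally, the cross-node operation request for collective communication may be an operation request that needs to be forwarded quickly among multiple nodes and/or synchronized precisely among multiple nodes.
Optionally, the operation request for collective communication is an intra-node operation request, that is, a collective operation carried out among different processes or different threads within the same node.
Optionally, before the operation request for collective communication is converted into work requests, the method further includes:
performing grid partitioning according to the number of communication sub-modules of the collective communication and the task that each communication sub-module needs to execute. Grid partitioning assigns the tasks needed to carry out the operation request to idle resources, which improves resource utilization.
Optionally, identifying the work requests that have no communication dependency includes:
adding a first identifier to a work request without communication dependency, the first identifier indicating that the work request has no communication dependency; and/or
adding a second identifier to a work request with a communication dependency, the second identifier indicating that the work request has a communication dependency.
When a first identifier is added to work requests without communication dependency, it is determined whether a received work request includes the first identifier. If it does, the work request is determined to have no communication dependency; if it does not include the first identifier, or includes the second identifier, the work request is determined to have a communication dependency.
When a second identifier is added to work requests with a communication dependency, it is determined whether a received work request includes the second identifier. If it does, the work request is determined to have a communication dependency; if it does not, the work request is determined to have no communication dependency.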
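In some possible implementations, the collective communication is any one of the following: communication between one first communication sub-module and multiple second communication sub-modules, communication between multiple first communication sub-modules and one second communication sub-module, or communication between multiple first communication sub-modules and multiple second communication sub-modules.
In some possible implementations, a work request is a communication request between one first communication sub-module and one second communication sub-module.
In some possible implementations, no communication dependency means that the communication between one first communication sub-module and one second communication sub-module does not depend on any other communication sub-module.
In some possible implementations, the method further includes:
identifying whether a work request has no communication dependency based on the communication mode between the different communication sub-modules in the work request.
In some possible implementations, the communication mode includes the manner in which data is transferred between the two communication sub-modules in a work request, and whether that transfer depends on data sent by another communication sub-module.
Optionally, the manner in which data is transferred between the two communication sub-modules in a work request is the manner of transfer between the communication sub-module acting as data sender and the communication sub-module acting as data receiver. Correspondingly, whether the transfer depends on data sent by another communication sub-module means whether the sender's transmission to the receiver depends on data sent by another communication sub-module.
In some possible implementations, the communication mode is determined according to the interface used for communication between the different communication sub-modules.
In some possible implementations, the first communication sub-module runs on a first computer device, the second communication sub-module runs on a second computer device, and the first computer device communicates with the second computer device over a network;
the method further includes: forwarding a work request identified as having no communication dependency directly over the network, and forwarding a work request not so identified over the network after queue-based management and control.
Optionally, the network is an InfiniBand-based network.
In some possible implementations, forwarding a work request not identified as having no communication dependency after queue-based management and control includes:
loading the work request not identified as having no communication dependency into a queue, and determining whether the condition recorded in the queue for triggering that work request has been satisfied;
when the condition has been satisfied, sending the work request not identified as having no communication dependency.
Optionally, the condition includes whether the data sent to the communication sub-module that receives data in the work request has been received. Once that data has been received, the condition for triggering the work request that does not carry the first identifier is satisfied.
Alternatively, the condition includes whether the other work request that triggers the work request is already in the queue. If it is, the condition is satisfied. If it is not, the condition is not satisfied, and the processor waits until the condition is satisfied before triggering the work request.
In some possible implementations, the work requests include multiple work requests;
the multiple work requests include one or more work requests identified as having no communication dependency and one or more work requests not so identified.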
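In some possible implementations, the method further includes:
obtaining, according to the obtained operation request for collective communication, the number of communication sub-modules of the collective communication, the task that each communication sub-module needs to execute, and information about the data transferred between different communication sub-modules and the manner of transfer.
In some possible implementations, converting the operation request for collective communication into work requests includes:
converting the collective operation into work requests and control commands for executing the work requests, where the control commands are used to control the work requests so as to carry out the collective operation.
In some possible implementations, the collective communication is MPI-based collective communication.
In some possible implementations, the host channel adapter is implemented by a NIC, a standalone chip, or a chipset.
In some possible implementations, a communication sub-module is a process or a thread.
According to a fourth aspect, an embodiment of this application provides a computer program product containing instructions that, when run on a computer device, cause the computer device to perform the method of any implementation of the third aspect.
According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium storing instructions that instruct a computer device to perform the method of any implementation of the third aspect.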
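Brief Description of the Drawings
The accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of an implementation of collective communication according to an embodiment of this application;
FIG. 2 is a logical schematic diagram of a data flow for communication among 8 processes according to an embodiment of this application;
FIG. 3A is a schematic structural diagram of a computer device 300 according to an embodiment of this application;
FIG. 3B is a schematic structural diagram of a system according to an embodiment of this application;
FIG. 4 is a schematic diagram of the logical structure of the programs or instructions that the processor 301 needs to execute in an embodiment of this application;
FIG. 5A is a schematic flowchart of a method for identifying work requests without communication dependency according to an embodiment of this application;
FIG. 5B is a schematic flowchart of the process by which the control module 3021 recognizes a work request without communication dependency according to an embodiment of this application;
FIG. 6 is a schematic diagram of a specific structure of the host channel adapter 303 according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a communication system according to an embodiment of this application;
FIG. 9 is a schematic flowchart of a method for implementing collective communication according to an embodiment of this application.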
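Detailed Description
Embodiments of the present invention are described below with reference to the accompanying drawings.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Data so used may be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features.
In the specification and claims of this application, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to the steps or modules expressly listed, and may include other steps or modules that are not expressly listed or that are inherent to the process, method, product, or device. The naming or numbering of steps in this application does not mean that the steps of a method flow must be executed in the temporal or logical order indicated by the naming or numbering; the execution order of named or numbered steps may be changed according to the technical purpose to be achieved, provided that the same or a similar technical effect is obtained. The division into units in this application is a logical division, and other divisions are possible in an actual implementation: for example, multiple units may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be made through interfaces, and indirect couplings or communication connections between units may be electrical or take other similar forms, none of which is limited in this application. Furthermore, units or sub-units described as separate components may or may not be physically separate, may or may not be physical units, or may be distributed over multiple circuit units; some or all of them may be selected according to actual needs to achieve the purpose of the solutions of this application.
It should be understood that the terminology used in describing the various examples in the specification and claims of this application is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various examples and in the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in the specification and claims of this application refers to and covers any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean A alone, both A and B, or B alone. In addition, the character "/" in this application generally indicates an "or" relationship between the associated objects before and after it.
It should be understood that determining B according to A does not mean determining B according to A alone; B may also be determined according to A and/or other information.
It should also be understood that the term "include" (also "includes", "including", "comprises", and/or "comprising"), when used in this specification, specifies the presence of the stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "if" may be interpreted to mean "when" or "upon", or "in response to determining", or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined ..." or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining ...", "in response to determining ...", "upon detecting [the stated condition or event]", or "in response to detecting [the stated condition or event]".
It should be understood that references throughout the specification to "one embodiment", "an embodiment", or "a possible implementation" mean that a particular feature, structure, or characteristic related to the embodiment or implementation is included in at least one embodiment of this application. Therefore, occurrences of "in one embodiment", "in an embodiment", or "a possible implementation" in various places throughout the specification do not necessarily refer to the same embodiment. In addition, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.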
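First, some of the terms and related technologies involved in this application are explained to aid understanding:
Parallel computing: Parallel computing rests on the idea that a large problem can be divided into smaller problems that can be solved simultaneously (in parallel) with the available resources, and that solving these smaller problems ultimately solves the large one. Parallel computing is defined in contrast to serial computing, in which a processor runs the computation step by step in instruction order. Parallelism comes in two forms: temporal and spatial. Temporal parallelism refers to the pipelining used in a computer's central processing unit, which splits each instruction into several stages whose execution can overlap in time. Spatial parallelism refers to multiple processors executing computer instructions concurrently so as to solve the problem faster. The advantage of parallel computing is that it overcomes the limits of a serial computer's computing power, raises computing speed, completes computing tasks in less time, makes better use of the hardware's computing power, and reduces computing cost.
HPC: refers to a complete computer system whose computing capability reaches a certain level. Because a single processor can hardly deliver such computing capability, HPC requires multiple central processing units (CPUs) or multiple hosts (for example, multiple computer devices) working together. The main purpose of building a high performance computing system is to raise computing speed. Reaching trillions of operations per second places very high demands on the system's processors, memory bandwidth, computation methods, input/output (I/O), storage, and so on, and each of these directly affects the system's computing speed. HPC is mainly used to complete data-intensive, compute-intensive, and I/O-intensive computation quickly in fields such as scientific research, engineering design, finance, industry, and social management. Typical applications include bioengineering, new drug development, petroleum geophysical exploration, vehicle design (aerospace, ships, automobiles), materials engineering, nuclear explosion simulation, advanced weapons manufacturing, cryptography research, and various kinds of large-scale information processing. The goals of high performance computing are to minimize the time needed to solve a particular computing problem, to maximize the size of problem that can be solved within a given time, to handle large numbers of complex problems that could not be handled before, to improve cost-effectiveness, and to extend to medium-scale problems and budgets.
MPI: a standard for a message-passing interface, used to develop parallel programs based on message passing. Its purpose is to give users a practical, portable, efficient, and flexible message-passing interface. MPI can be applied to many system architectures, such as distributed-memory or shared-memory multi-core processors, high-performance networks, and combinations of these. MPI is also a parallel programming function library whose compilation and execution are tied to a specific programming language. MPI has been implemented on mainstream operating systems, including Windows and Linux. MPI can act as software middleware for process-level parallelism: the MPI framework manages all computing processes as one system and provides a rich set of functions for inter-process communication. A process is a running instance of a program; besides the program code it includes its execution environment (memory, registers, program counter, and so on), and it is the basic independently executable program unit in an operating system. MPI can support several different communication protocols, for example InfiniBand or the transmission control protocol (TCP). MPI encapsulates these protocols, provides a unified communication interface, and hides the underlying communication details. The MPI management framework assigns each process a process identifier (rank), with ranks numbered consecutively from 0. Which part of the work each process of an MPI program performs is decided by its rank. MPI processes communicate within a communicator, which is the communication environment of the processes and contains the process group, context, virtual topology, and so on. When MPI starts, the system creates a global communicator that contains every process, and inter-process communication must specify the communicator as a parameter.
Collective communication: also called group communication. An important difference from point-to-point communication is that multiple processes take part in the communication at the same time, whereas point-to-point communication involves only two processes, a sender and a receiver. Which processes take part in a collective communication, and the context of that communication, are defined by the communicator that the collective call uses. Collective communication generally provides three functions: communication, synchronization, and computation. The communication function transfers data within the group, the synchronization function brings all processes in the group to the same point of progress, and the computation function operates on specific data. MPI collective communication is a common form of collective communication.
InfiniBand (abbreviated IB): a computer network communication standard for high performance computing, with very high throughput and very low latency, used for data interconnection between computers. InfiniBand is also used as a direct or switched interconnect between servers and storage systems, and as an interconnect between storage systems.
Grid: a set of independent computers that together provide online computing and storage capacity, with the computer resources distributed over a relatively wide area. Using the idle computing resources in a grid makes it possible to create a powerful virtual computing platform, and this high-performance computer makes it feasible to handle large-scale computing problems in fields such as biology, mathematics, and chemistry. A grid organizes interconnected computers and integrates the various resources and services connected to the network into a single virtual computer of very large capacity. To users, a grid offers an infrastructure of services and resources, and the resources they face far exceed the capacity of any single supercomputer. Users can not only draw on its computing power to solve hard problems, but also use the services offered by any node in the grid, regardless of where that node is physically located.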
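FIG. 1 is a schematic structural diagram of an implementation of collective communication. As shown in FIG. 1, four processes running on computer device 1 (for example, process 1, process 2, process 3, and process 4, not shown in the figure) each send data to computer device 2 over a network. Computer device 2 receives the data from the four processes through queues and writes the payload sent by each process into a receive queue. After a process on computer device 2 (for example, process 0, not shown in the figure) has received the payloads of the four processes, it processes them (for example, by summing or by taking the maximum or minimum) and sends the processed data.
During this operation, because of network effects, the data sent by each process running on computer device 1 reaches computer device 2 out of order and at unpredictable times. Computer device 2 cannot predict when each interrupt will be generated, an interrupt being raised whenever data sent over the network by any of the four processes on computer device 1 reaches computer device 2. Each time computer device 2 receives the payload sent by one process, an interrupt task is raised to the operating system. If the operating system of computer device 2 is using all of its cores for computing tasks, the interrupt makes it suspend the computing tasks on some cores, handle the interrupt, and return to the computing tasks only after the interrupt has been handled. While handling the interrupt, the operating system of computer device 2 has to save the context of the current computing task, so the overall overhead is considerable. The operating system scheduler of computer device 2 is thereby disturbed, producing "operating system noise", that is, the interrupt overhead incurred when messages are received. This system noise makes many processes wait and wastes many processor cycles (for example, CPU cycles). In most cases the data processing itself is simple, yet the time spent on interrupts and context switches exceeds the time computer device 2 spends processing the data. Computer device 2 therefore executes this kind of operation very inefficiently.
One way to solve this problem is to introduce a queue for control and to manage the data sent by different processes uniformly through queues. For example, when the data of the four processes in FIG. 1 is managed through queues, a single interrupt is triggered only after the completion information of all four receive queues has arrived, which avoids "operating system noise".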
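The coalescing idea in the paragraph above can be illustrated with a small sketch. It assumes a hypothetical `CoalescingReceiver` class (not part of the application) that counts payloads and raises a single interrupt only once every expected sender has delivered, instead of one interrupt per payload.

```python
class CoalescingReceiver:
    """Hypothetical receiver that raises one interrupt per collective, not per payload."""

    def __init__(self, expected_senders: int):
        self.expected = expected_senders
        self.payloads = []      # payloads written to the receive queues so far
        self.interrupts = 0     # interrupts raised to the operating system

    def on_payload(self, payload: int) -> bool:
        """Record one payload; return True if this arrival raises the interrupt."""
        self.payloads.append(payload)
        if len(self.payloads) == self.expected:
            self.interrupts += 1
            return True
        return False


receiver = CoalescingReceiver(expected_senders=4)
# Payloads from processes 1 to 4 may arrive in any order.
fired = [receiver.on_payload(p) for p in (30, 10, 40, 20)]
total = sum(receiver.payloads)  # the reduction (here a sum) runs once, after the interrupt
```

Only the last arrival raises an interrupt in this sketch; without coalescing, each of the four arrivals would raise one.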
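However, queue-based management does not distinguish between processes. Processes that need no management are also placed under queue management, which lowers communication efficiency and ties up resources.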
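Taking broadcast in collective communication as an example, the data flow among 8 processes is shown in FIG. 2. FIG. 2 is a logical schematic diagram of a data flow for communication among 8 processes. In FIG. 2, the number in each circle is the label of a process, and a line between circles indicates that the two processes communicate with each other.
When communication is not managed through queues, the structure of the queue pairs for communication among the 8 processes in FIG. 2 is as shown in Table 1:
Figure PCTCN2021103616-appb-000001
Table 1
In Table 1, P0 represents process 0 in FIG. 2, and so on, up to P7, which represents process 7 in FIG. 2. A queue pair (QP) entry gives the label of a queue pair. Taking process 0 sending data to process 4 as an example, "Send QP 4" means that process 0 sends data to the QP of process 4, and "Send disable" means that send-enable is cancelled.
When the inter-process communication shown in FIG. 2 is managed through queues, the structure of the queue pairs for communication among the 8 processes is as shown in Table 2:
Figure PCTCN2021103616-appb-000002
Table 2
Table 2 shows that when inter-process communication is managed through queues, communication between processes that have no communication dependency is also brought under management, even though it need not be, which lengthens communication latency and wastes resources. For example, the data that process 0 sends to process 4, process 2, and process 1 in FIG. 2 can be sent directly, without the send enable of queue management having to control the send disable of Table 1 in order for the data to be sent. Managing through queues requires loading the communication between process 0 and processes 4, 2, and 1 into a queue and enabling the send function. This not only increases the latency of communication between process 0 and processes 4, 2, and 1, but also consumes the processing resources of the network interface card that performs this management.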
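To make the distinction concrete, the sketch below enumerates the sends of an 8-process broadcast and marks which of them depend on an earlier receive. It assumes that the broadcast of FIG. 2 follows a binomial tree rooted at process 0; this matches the sends from process 0 to processes 4, 2, and 1 named in the text, but the rest of the tree shape is an assumption, and `binomial_broadcast` is a hypothetical helper.

```python
def binomial_broadcast(n: int):
    """List the sends of a binomial-tree broadcast rooted at rank 0.

    Returns (sender, receiver, depends_on) tuples, where depends_on is the rank
    the sender must first receive from, or None when the sender is the root.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two in this sketch")
    parent = {0: None}   # rank -> rank it received the data from
    holders = [0]        # ranks that already hold the data
    sends = []
    step = n // 2
    while step >= 1:
        for sender in list(holders):
            receiver = sender + step
            sends.append((sender, receiver, parent[sender]))
            parent[receiver] = sender
            holders.append(receiver)
        step //= 2
    return sends


sends = binomial_broadcast(8)
# Sends whose data originates at the sender itself need no queue management.
dependency_free = [(s, r) for s, r, dep in sends if dep is None]
```

Under this assumption, three of the seven sends are dependency-free (those issued by process 0), while a send such as 4 to 6 can only go out after process 4 has received from process 0.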
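In practice, some of the processes taking part in a collective communication communicate with each other without any dependency. Managing the communication between such processes through queues not only increases communication latency, but also occupies processor resources in the network adapter for this management and so costs processor performance.
Embodiments of this application provide a method for implementing collective communication in which, during a collective operation, processes without communication dependency need not be managed through queues. This speeds up communication between processes without communication dependency, lowers the overall latency of the collective communication, and avoids the resource occupation and consumption that managing such communication would entail. Taking MPI collective communication as an example, when the communication component of an upper-layer application (an HPC application, a big data application, and so on) uses MPI, the performance of the whole application can be accelerated end to end. For example, in a typical molecular dynamics application in the HPC industry, MPI collective communication accounts for 40% of the end-to-end run time, so accelerating MPI collective communication accelerates the end-to-end performance of the application as a whole.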
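First, the devices that implement the method for collective communication provided by the embodiments of this application are described.
FIG. 3A is a schematic structural diagram of a computer device 300 according to an embodiment of this application. As shown in FIG. 3A, the computer device 300 includes a processor 301, a memory 302, a host channel adapter (HCA) 303, and a bus 304. The processor 301, the memory 302, and the host channel adapter 303 communicate over the bus 304. The bus 304 may be a peripheral component interconnect (PCI) bus, a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be classified as address buses, data buses, control buses, and so on. For ease of illustration, only one thick line is drawn in FIG. 3A, but this does not mean that there is only one bus or one type of bus. The host channel adapter 303 connects to other computer devices over a network.
In FIG. 3A, the processor 301 may be a chip with computing capability, such as a CPU, a graphics processing unit (GPU), a general-purpose GPU (GPGPU), a tensor processing unit (TPU), a data processing unit (DPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 302 may include volatile memory, for example random access memory (RAM); it may also include non-volatile memory, for example read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 302 stores programs or instructions, for example program code containing at least one process or at least one thread. The memory 302 may also store data, including but not limited to the data that the computer device 300 needs to store.
The host channel adapter 303 includes a control unit 3031, a first interface 3032, and a second interface 3033. The first interface 3032 is the interface through which the host channel adapter 303 communicates with the processor 301. It receives requests sent by the processor 301 over the bus 304 and converts them into a format that the host channel adapter 303 can recognize, or converts data or messages that the host channel adapter 303 sends to the processor 301 into a format that the processor 301 can recognize. The control unit 3031 receives the requests sent by the processor 301 through the first interface 3032 and processes them accordingly. The second interface 3033 is the interface that connects the computer device 300 to the network. Through the second interface 3033, the host channel adapter 303 receives requests or data sent by other computer devices over the network, or sends requests or data to other computer devices over the network. Optionally, the second interface 3033 may contain multiple ports and connect to the network through them, so that multiple paths can transmit simultaneously.
In one implementation, the host channel adapter 303 may be implemented by a NIC. In another implementation, the host channel adapter 303 may also be implemented by a chipset or a standalone chip.
Note that FIG. 3A shows only some hardware and software components, for convenience in describing the embodiments of this application. In a specific implementation, the computer device 300 may also include other hardware components, such as a hard disk, and other software, such as applications and an operating system. The structure shown in FIG. 3A should not be taken as a limitation on the embodiments of this application.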
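FIG. 3B is a schematic structural diagram of a system according to an embodiment of this application. As shown in FIG. 3B, the system includes a computer device 300 and a computer device 400, which are connected by a network N100. The network N100 between the computer device 300 and the computer device 400 may be an InfiniBand-based network. For example, the computer device 300 and the computer device 400 use the InfiniBand architecture as their interconnect solution. Optionally, the network N100 between the computer device 300 and the computer device 400 may also be an Ethernet network, a Remote Direct Memory Access over Converged Ethernet (RoCE) network, or the like.
The computer device 400 is composed similarly to the computer device 300 and includes a processor 401, a memory 402, a host channel adapter 403, and a bus 404. The processor 401, the memory 402, and the host channel adapter 403 communicate over the bus 404. The memory 402 stores programs or instructions, for example program code containing at least one process. The host channel adapter 403 includes a control unit 4031, a first interface 4032, and a second interface 4033.
A request initiated by a process running on the computer device 300, for example a request to send data to a process running on the computer device 400, is transmitted to the computer device 400 over the network N100. The relevant process running on the computer device 400 handles the received request accordingly. In one implementation, one or more processes on the computer device 300 carry out MPI collective communication with one or more processes on the computer device 400 over the network N100.
The collective communication method provided by the embodiments of this application is described further below, taking as an example an application running on the computer device 300 that initiates an MPI collective operation and communicates with the computer device 400.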
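The processor 301 in the computer device 300 reads the programs or instructions in the memory 302 to carry out the functions of MPI collective communication. FIG. 4 is a schematic diagram of the logical structure of the programs or instructions that the processor 301 needs to execute. As shown in FIG. 4, the memory 302 contains an application 3025, an application interface module 3023, a control module 3021, a transport module 3022, and a forwarding module 3024. The application interface module 3023, the control module 3021, and the transport module 3022 form the MPI layer, which is a unified communication framework implemented on the basis of the MPI standard.
The application 3025 may be any application that performs collective communication, including but not limited to HPC industry applications, HPC-AI industry applications, and big data industry applications. These applications usually involve a large number of computing tasks, and executing them usually starts multiple processes or threads, which then call the MPI collective communication interfaces for data computation and for exchanging information between processes. For example, the application 3025 may be WRF (Weather Research and Forecasting) in meteorology, OpenFoam in computational fluid dynamics, or VASP (Vienna Ab initio Simulation Package) in molecular dynamics.
The application interface module 3023 is the interface between the MPI application layer and the application 3025, and receives from the application 3025 the tasks to be executed. For example, the application interface module 3023 can receive an operation request for collective communication triggered by the application 3025.
The control module 3021 converts an operation request issued by the application 3025 into work requests (WRs). For example, the conversion performed by the control module 3021 includes but is not limited to: performing grid partitioning according to the case to be computed, and determining the processes to be run, the task each process needs to execute, and the communication mode between the processes. For example, the control module 3021 may be a unified communication group (UCG).
The transport module 3022 abstracts the differences between hardware architectures (for example, different network interface cards) and provides a low-level application programming interface (API) that is used to carry out the collective communication. For example, the transport module 3022 may be a unified communication transport (UCT).
The forwarding module 3024 forwards messages or data between the API layer and the host channel adapter 303. For example, the forwarding module 3024 may be the OpenFabrics Enterprise Distribution (OFED). After the control module 3021 calls the interface of the transport module 3022 to notify it of the WR to be executed, the transport module 3022 calls the interface exposed by OFED to ring the hardware doorbell, which notifies the host channel adapter 303, and the WR is sent to the host channel adapter 303 to be parsed and processed. In one implementation, OFED may be an open-source implementation of remote direct memory access (RDMA) and kernel bypass.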
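FIG. 5A is a schematic flowchart of a method for identifying work requests without communication dependency according to an embodiment of this application. As shown in FIG. 5A, and with reference to the software modules shown in FIG. 4, the method includes:
Step 500: The application 3025 initiates an operation request for collective communication.
In one implementation, the operation request for collective communication is a cross-node collective operation request, and the different nodes communicate over a network to carry out the collective communication. For example, the cross-node operation request may be one that needs to be forwarded quickly among multiple nodes and/or synchronized precisely among multiple nodes. A node may be a computer device, including but not limited to a computer device that provides computing or a computer device that provides storage.
In another implementation, the operation request for collective communication is an intra-node operation request, that is, a collective operation carried out among different processes or different threads within the same node.
Specifically, the application 3025 may initiate the operation request for collective communication by issuing a collective operation command. Optionally, the collective operation command may be an MPI command or a shared-memory command. Taking MPI commands as an example, they include but are not limited to MPI_max, MPI_min, MPI_sum, MPI_scatter, or MPI_reduce.
Step 501: The application interface module 3023 receives the operation request initiated by the application 3025 and forwards it to the control module 3021.
After receiving the operation request initiated by the application 3025, the application interface module 3023 may perform general processing on it and send the information obtained to the control module 3021. This processing includes but is not limited to obtaining the number of processes in the collective communication, the task each process needs to execute, the size of the data transferred between processes, and the communicator.
After obtaining this information, the application interface module 3023 sends or passes it to the control module 3021. Other software modules may exist between the application interface module 3023 and the control module 3021, for example software modules that implement MPI communication based on the MPI communication protocol, and the application interface module 3023 may send or pass the information to the control module 3021 through them. For brevity, the embodiments of this application describe only the application interface module 3023 sending or passing the information to the control module 3021.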
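Step 502: The control module 3021 converts the operation request for collective communication into work requests according to the information obtained from the application interface module 3023.
In one implementation, the control module 3021 converts the operation request for collective communication into work requests according to the information obtained from the application interface module 3023.
In another implementation, the control module 3021 may, according to the information obtained from the application interface module 3023, convert the collective operation into work requests and control commands for executing the work requests. The control commands are used to control the work requests so as to carry out the collective operation.
In a specific implementation, the control module 3021 may use the MPI library stored in the memory 302, together with the information obtained from the application interface module 3023, to convert the collective operation into work requests, or into work requests and control commands for executing them.
For example, converting the collective operation into work requests and control commands for executing them, according to the information obtained from the application interface module 3023, includes the following steps:
Step S1: The control module 3021 performs grid partitioning according to the number of processes in the collective communication and the task each process needs to execute, and assigns the computer devices that will run the processes of the collective communication.
If the processes of the collective communication need to communicate over a network, different computer devices that communicate over the network must be assigned to execute the tasks of this collective communication. Taking the system shown in FIG. 3B as an example, the computer device 300 and the computer device 400 each run some of the processes, and the collective communication is carried out over the network. For example, one collective communication is the communication between a root process 0 and three child processes (child process 1, child process 2, and child process 3). The root process 0 needs to receive data from child process 1, child process 2, and child process 3, perform a reduction, and send the reduced data back to child process 1, child process 2, and child process 3. The control module 3021 may assign the computer device 300 to run the root process 0 and the computer device 400 to run child process 1, child process 2, and child process 3. The root process 0 carries out the collective communication with child process 1, child process 2, and child process 3 over the network N100.
If the processes of the collective communication do not need to communicate over a network, only the computer device 300 may be assigned to execute the tasks of this collective communication. Taking the same communication between the root process 0 and the three child processes as an example, the root process 0 and child process 1, child process 2, and child process 3 all run on the computer device 300, and the root process 0 can communicate with child process 1, child process 2, and child process 3 through the host channel adapter 303.
Step S2: The control module 3021 determines how the processes carry out the collective communication.
Specifically, the control module 3021 may first select from the MPI library the MPI collective communication interface to be used for inter-process communication. The MPI collective communication interfaces include but are not limited to MPI_Bcast, MPI_Allreduce, or MPI_Alltoall. Different MPI collective communication interfaces select different algorithms according to factors such as the network topology, the number of processes, and the size of the data to be transferred. Commonly used algorithms include but are not limited to Binomial Tree, K-nomial Tree, or Recursive doubling. The algorithm determines how the processes communicate with one another.
By selecting among the MPI collective communication interfaces, and based on the algorithm that applies to each interface, the control module 3021 determines how the processes carry out the collective communication.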
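As an illustration of how an algorithm fixes the communication pattern between processes, the sketch below lists the partners of one rank under recursive doubling. The application does not spell out the algorithm; the pairing rule used here (in round k a rank exchanges with the rank whose number differs in bit k) is the commonly described one and is an assumption, as is the helper name `recursive_doubling_peers`.

```python
def recursive_doubling_peers(rank: int, n: int):
    """Return the partner of `rank` in each round of recursive doubling.

    Assumes n is a power of two; in round k the partner is rank XOR 2**k.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two in this sketch")
    if not 0 <= rank < n:
        raise ValueError("rank out of range")
    peers = []
    distance = 1
    while distance < n:
        peers.append(rank ^ distance)
        distance *= 2
    return peers


peers_of_0 = recursive_doubling_peers(0, 8)
peers_of_5 = recursive_doubling_peers(5, 8)
```

With 8 processes each rank takes part in three rounds. Presumably only the first-round exchange sends data that did not come from another process, so the later rounds would be the ones carrying a communication dependency.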
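Step 503: The control module 3021 recognizes the work requests without communication dependency and adds an identifier to them.
The process by which the control module 3021 recognizes a work request without communication dependency is shown in the flow of FIG. 5B.
After recognizing a work request without communication dependency, the control module 3021 may add identification information to it to mark it as a work request without communication dependency. The identification information may take any form and may be placed anywhere in the work request.
For example, the identification information may be added as an extended attribute of the Opcode of the work request. The work request may, for example, be composed as shown in Table 3:
wr_id | *next | Opcode | Send_flags
Table 3
The control module 3021 may add the identifier SEND_DIRECTLY to the Opcode of the work request shown in Table 3 to mark the work request as one without communication dependency.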
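A minimal sketch of the layout in Table 3 follows. The field names come from the table; modelling the Opcode as a set of strings, and the names `WorkRequest` and `mark_dependency_free`, are assumptions made for illustration and do not reflect any real verbs API.

```python
from dataclasses import dataclass, field
from typing import Optional

SEND_DIRECTLY = "SEND_DIRECTLY"  # identifier named in the text; a plain string here


@dataclass
class WorkRequest:
    wr_id: int
    opcode: set = field(default_factory=set)   # Opcode, with room for extended attributes
    send_flags: int = 0
    next: Optional["WorkRequest"] = None       # link to the next work request in a chain


def mark_dependency_free(wr: WorkRequest) -> WorkRequest:
    """Add the no-dependency identifier to the Opcode of a work request."""
    wr.opcode.add(SEND_DIRECTLY)
    return wr


wr = mark_dependency_free(WorkRequest(wr_id=1, opcode={"SEND"}))
```

Keeping the identifier inside the Opcode means the rest of the work request stays unchanged, which is consistent with the text's remark that the identifier may take any form and sit anywhere in the request.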
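Step 504: The work requests converted by the control module 3021 are sent to the host channel adapter 303.
Specifically, the transport module 3022 passes the work requests converted by the control module 3021 to the forwarding module 3024, and the forwarding module 3024 sends the work requests to the host channel adapter 303. That is, the processor 301 executes the program of the transport module 3022 to pass the work requests converted by the control module 3021 to the forwarding module 3024, and executes the program of the forwarding module 3024 to send the work requests to the host channel adapter 303. In one implementation, the processor 301 may execute the program of the forwarding module 3024 to convert the work requests into a format that the host channel adapter 303 can recognize, and send the converted work requests to the host channel adapter 303 through the first interface 3032.
The control module 3021 obtains multiple work requests when it converts an operation request for collective communication. Some of these work requests have a communication dependency and some do not. The work requests that the processor 301 sends to the host channel adapter include the identified work requests without communication dependency as well as the work requests with a communication dependency. The host channel adapter can thus use the identifier to recognize the work requests without communication dependency and forward them directly.
For any given work request, the control module 3021 can decide whether it has no communication dependency based on the communication mode between the processes in that work request. FIG. 5B is a schematic flowchart of a method for recognizing whether a work request has no communication dependency according to an embodiment of this application. As shown in FIG. 5B, the method includes: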
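Step 5031: The control module 3021 determines a work request that involves inter-process communication.
In step 502 above, the control module 3021 converts the collective operation into one or more work requests. The control module 3021 first selects one work request and then decides whether that work request has no communication dependency. Deciding this means deciding whether the communication between the communicating processes in the work request depends on any other process.
The following description takes as an example a work request in which the local process is process A and the peer process is process B.
Step 5032: The control module 3021 determines whether the local process (process A) needs to send data to the peer process (process B).
If the local process (process A) needs to send data to the peer process (process B), step 5033 is performed. If it does not, the local process (process A) needs to receive data sent by the peer process (process B), and step 5034 is performed.
Step 5033: The control module 3021 determines whether the data that the local process (process A) sends to the peer process (process B) is obtained from another process (for example process C, which stands for other processes in the following description). If it is not obtained from another process (process C), step 5035 is performed; if it is, step 5036 is performed.
Step 5034: The control module 3021 determines whether the data that the peer process (process B) sends to the local process (process A) is obtained from another process (process C). If it is not obtained from another process (process C), step 5035 is performed; if it is, step 5036 is performed.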
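Step 5035: The control module 3021 identifies the work request as a work request without communication dependency.
If the data that the local process (process A) sends to the peer process (process B) is not obtained from another process (process C), the communication between the local process (process A) and the peer process (process B) does not depend on any other process, and the work request for the local process (process A) to send data to the peer process (process B) is a work request without communication dependency.
After recognizing a work request without communication dependency, the control module 3021 identifies it. In one implementation, the work request can be identified by adding a first identifier, which indicates that the work request has no communication dependency.
Optionally, the first identifier can be added to the Opcode field of the work request; for example, the first identifier added is IBV_SEND_DIRECTLY.
Step 5036: The control module 3021 identifies the work request as a work request with a communication dependency, or leaves it unidentified.
If the data that the local process (process A) sends to the peer process (process B) is obtained from another process (process C), the communication between the local process (process A) and the peer process (process B) depends on communication with another process, and the work request for the local process (process A) to send data to the peer process (process B) is a work request with a communication dependency.
After recognizing a work request with a communication dependency, the control module 3021 may identify it or leave it unidentified. Leaving it unidentified itself shows that the work request differs from a work request without communication dependency.
In one implementation, the work request with a communication dependency can be identified by adding a second identifier, which indicates that the work request has a communication dependency.
Optionally, the second identifier can be added to the Opcode field of the work request; for example, the second identifier added may be IBV_SEND.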
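The decision flow of steps 5032 to 5036 can be sketched as a single function. The identifier strings are the ones named in the text and are treated here as plain strings; the function name `classify` and its boolean parameters are hypothetical.

```python
FIRST_ID = "IBV_SEND_DIRECTLY"   # first identifier: no communication dependency
SECOND_ID = "IBV_SEND"           # second identifier: communication dependency


def classify(a_sends_to_b: bool,
             a_payload_from_c: bool = False,
             b_payload_from_c: bool = False) -> str:
    """Return the identifier for a work request between processes A and B."""
    if a_sends_to_b:
        # Step 5033: does the data A sends to B come from another process C?
        depends = a_payload_from_c
    else:
        # Step 5034: does the data B sends to A come from another process C?
        depends = b_payload_from_c
    # Step 5036 when a dependency exists, step 5035 otherwise.
    return SECOND_ID if depends else FIRST_ID


direct = classify(a_sends_to_b=True)
queued = classify(a_sends_to_b=True, a_payload_from_c=True)
```

This sketch always attaches an identifier; the text also allows leaving a dependent work request unidentified, in which case the function would return nothing for that branch.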
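Note that although FIG. 5A and FIG. 5B describe the individual software modules as performing the respective steps, in a specific implementation it is the processor 301 that performs the corresponding functions by executing the programs of these software modules. That is, the processor carries out the method flows shown in FIG. 5A and FIG. 5B by executing the corresponding programs stored in the memory 302.
FIG. 6 is a schematic diagram of a specific structure of the host channel adapter 303 according to an embodiment of this application. As shown in FIG. 6, the host channel adapter 303 further includes a queue 3035 and a storage unit 3034.
The storage unit 3034 is connected to the control unit 3031. It stores the programs or code that implement the functions of the control unit 3031 and the data that the control unit 3031 needs to process.
The queue 3035 includes multiple work queues (WQs). Each work queue contains multiple work queue entries (WQEs), and each WQE contains information about a network event, for example information about sending a message to another node over the network or receiving a message from another node over the network. A work queue is implemented by at least one QP. Each QP includes one receive queue (RQ) and one send queue (SQ). A QP usually corresponds to a QP in a peer node, which enables point-to-point transfer. The receive queue is mainly used for received WQEs, and the send queue is mainly used for WQEs related to sending. The queue 3035 also includes a completion queue (CQ), which records the completion status of WQEs. Each entry in the completion queue corresponds to one WQE. For example, the completion queue may be associated with a preset group of receive queues that receive the messages being waited for. A producer index (PI) indicates the most recently completed entry in the completion queue. Optionally, the PI may also indicate the most recently processed WQE in a work queue. For example, when the completion queue is associated with a preset group of receive queues that receive the messages being waited for, the control unit 3031 marks the PI in the completion queue to indicate that the awaited message has been received. For simplicity, FIG. 6 shows only one SQ, one RQ, and one CQ, but this does not mean that the queue 3035 includes only these queues. In a specific implementation, the queue 3035 may include multiple SQs, multiple RQs, or multiple CQs. Details are not repeated here.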
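The relationship between a queue pair, a completion queue, and the producer index can be sketched with two small data classes. This is a simplified model for illustration only; the class and method names are hypothetical, and WQEs are reduced to strings.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)     # SQ: WQEs to be sent
    receive_queue: deque = field(default_factory=deque)  # RQ: WQEs for incoming messages


@dataclass
class CompletionQueue:
    entries: list = field(default_factory=list)  # one entry per completed WQE
    pi: int = -1                                 # producer index: latest completed entry

    def complete(self, wqe: str) -> None:
        """Record a completed WQE and advance the producer index."""
        self.entries.append(wqe)
        self.pi = len(self.entries) - 1

    def latest(self):
        """Return the entry the producer index points at, if any."""
        return self.entries[self.pi] if self.pi >= 0 else None


qp = QueuePair()
cq = CompletionQueue()
qp.send_queue.append("send message to node B")
cq.complete(qp.send_queue.popleft())
```

After the single send completes, the producer index points at entry 0 and `cq.latest()` returns the WQE that was just sent.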
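In another implementation, the control unit 3031 may also determine whether a work request loaded into the queue 3035 is a data work request or a management work request. If a data work request is received, the control unit determines whether the queue 3035 already holds a management work request that triggers it. If so, the data work request is triggered based on that management work request; if not, the data work request is stored in the receive queue to wait for the management work request that triggers it. If a management work request is received, the control unit determines whether the queue 3035 already holds the data work request that it is to trigger. If so, the data work request is triggered based on the management work request; if not, the management work request is stored in the receive queue to wait for the data work request that it is to trigger.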
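The pairing of data and management work requests described above can be sketched as follows. The text does not say how a management work request is matched to the data work request it triggers, so the `pair_id` key used here is an assumption, as are the class and method names.

```python
from dataclasses import dataclass, field


@dataclass
class PairingQueue:
    """Holds whichever half of a (data, management) pair arrives first."""
    waiting_data: set = field(default_factory=set)
    waiting_mgmt: set = field(default_factory=set)

    def on_request(self, kind: str, pair_id: int) -> bool:
        """Return True if the data work request for pair_id is triggered now."""
        if kind == "data":
            if pair_id in self.waiting_mgmt:
                self.waiting_mgmt.discard(pair_id)
                return True
            self.waiting_data.add(pair_id)   # wait for the management work request
            return False
        if kind == "management":
            if pair_id in self.waiting_data:
                self.waiting_data.discard(pair_id)
                return True
            self.waiting_mgmt.add(pair_id)   # wait for the data work request
            return False
        raise ValueError("kind must be 'data' or 'management'")


q = PairingQueue()
first = q.on_request("data", 1)          # nothing to trigger it yet
second = q.on_request("management", 1)   # the pair is complete, so the data request fires
```

The data work request fires as soon as both halves are present, whichever of the two arrives first.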
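In this embodiment of this application, the queue 3035 and the storage unit 3034 may be implemented by RAM, for example, by static random access memory (static random access memory, SRAM) or dynamic random access memory (dynamic random access memory, DRAM). The queue 3035 and the storage unit 3034 may be embedded in the control unit 3031 or may be independent of the host channel adapter 303.
Optionally, the control unit 3031 may further include a computing subunit (not shown in FIG. 6) configured to perform computing tasks related to work requests. For example, the computing subunit may be an arithmetic logic unit (arithmetic logic unit, ALU). The computing subunit may be embedded in the control unit 3031, or may be a subunit of the host channel adapter 303 that is independent of the control unit 3031 and controlled by the control unit 3031.
In one implementation, the control unit 3031 and/or the computing subunit may be implemented by a field programmable gate array (field programmable gate array, FPGA) and/or an application-specific integrated circuit (application-specific integrated circuit, ASIC).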
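As shown in FIG. 6, after receiving a work request sent by the processor 301, the first interface 3032 sends the received work request to the control unit 3031. In this embodiment of this application, the control unit 3031 determines whether the received work request is a work request with no communication dependency.
In one implementation, the control unit 3031 may determine whether a received work request is a work request with no communication dependency by checking whether it contains the first identifier. When the received work request contains the first identifier, it is determined to be a work request with no communication dependency; when it does not contain the first identifier, it is determined to be a work request with a communication dependency. For example, after receiving a work request forwarded by the first interface 3032, the control unit 3031 first parses whether the Opcode field of the received work request contains IBV_SEND_DIRECTLY. If it does, the work request is determined to be a work request with no communication dependency. If it does not, the work request is determined to be a work request with a communication dependency.
In another implementation, the control unit 3031 may determine whether a received work request is a work request with no communication dependency by checking whether it contains the second identifier. When the received work request does not contain the second identifier, it is determined to be a work request with no communication dependency; when it contains the second identifier, it is determined to be a work request with a communication dependency.
For a work request with no communication dependency, the control unit 3031 sends it directly to the second interface 3033 so that it is sent over the network. For a work request with a communication dependency, the control unit 3031 loads it into the queue 3035, manages its sending through the queue, and, when the condition for executing the work request is satisfied, sends it over the network through the second interface 3033.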
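The dispatch decision made by the control unit 3031 can be sketched as below, using the first-identifier check. The `ControlUnit` class, the dictionary layout of a work request, and the `sent` list that stands in for the second interface 3033 are assumptions made for illustration only.

```python
from collections import deque
from typing import Deque, List

IBV_SEND_DIRECTLY = "IBV_SEND_DIRECTLY"  # first identifier from the text, as a plain string


class ControlUnit:
    """Toy model of the dispatch step: direct send versus queue management."""

    def __init__(self) -> None:
        self.sent: List[str] = []          # stands in for the second interface / network
        self.queue: Deque[dict] = deque()  # stands in for the queue 3035

    def on_work_request(self, wr: dict) -> str:
        if wr.get("opcode") == IBV_SEND_DIRECTLY:
            # No communication dependency: bypass the queue entirely.
            self.sent.append(wr["name"])
            return "direct"
        # Marked as dependent, or not marked at all: hand over to queue management.
        self.queue.append(wr)
        return "queued"


cu = ControlUnit()
r1 = cu.on_work_request({"name": "A->B", "opcode": IBV_SEND_DIRECTLY})
r2 = cu.on_work_request({"name": "A->D", "opcode": "IBV_SEND"})
r3 = cu.on_work_request({"name": "A->E"})  # unmarked request
print(r1, r2, r3, cu.sent, len(cu.queue))
```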
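For example, the control unit 3031 may manage the sending of the work request through the queue in the following manner:
The control unit 3031 first determines whether the condition for triggering the work request that has a communication dependency is satisfied. If the condition is not satisfied, the work request is stored in the receive queue to wait. If the condition is satisfied, sending of the work request is triggered, and the work request is sent over the network through the second interface 3033. In the foregoing example, before process A can send data to process B, it must wait for process C to send data to process A. After receiving the work request by which process A sends data to process B, the control unit 3031 loads the work request into the receive queue in the queue 3035 and determines whether process A has received the data sent by process C, that is, whether the completion queue in the queue 3035 contains a completion record for process C sending data to process A. If process A has not yet received the data sent by process C, the control unit 3031 keeps the work request by which process A sends data to process B in the receive queue. After the work request by which process C sends data to process A has been executed, the completed WQE is recorded in the completion queue. When the entry indicated by the PI shows that process C has sent the data to process A, the condition of the work request by which process A sends data to process B is satisfied, and the control unit 3031 takes that work request out of the receive queue and sends it over the network through the second interface 3033.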
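The process A, B, C example can be traced with a small simulation. It is a sketch under assumptions, not the adapter's logic: completion records are modeled as `(sender, receiver)` tuples held in a set instead of CQ entries located through the PI, and the names `QueueManagedSender`, `submit`, and `on_completion` are hypothetical.

```python
from typing import List, Set, Tuple

Completion = Tuple[str, str]  # (sender, receiver) of a finished transfer


class QueueManagedSender:
    """Holds dependent work requests until the transfer they depend on has completed."""

    def __init__(self) -> None:
        self.completed: Set[Completion] = set()           # simplified completion queue
        self.pending: List[Tuple[str, Completion]] = []   # waiting work requests
        self.sent: List[str] = []                         # requests sent over the network

    def submit(self, name: str, depends_on: Completion) -> None:
        if depends_on in self.completed:
            self.sent.append(name)                    # condition already satisfied
        else:
            self.pending.append((name, depends_on))   # store and wait

    def on_completion(self, record: Completion) -> None:
        self.completed.add(record)
        still_waiting = []
        for name, dep in self.pending:
            if dep in self.completed:
                self.sent.append(name)                # condition now satisfied: send
            else:
                still_waiting.append((name, dep))
        self.pending = still_waiting


s = QueueManagedSender()
s.submit("A->B", depends_on=("C", "A"))   # C has not delivered to A yet
waiting_before = list(s.pending)
s.on_completion(("C", "A"))               # C's data arrives at A
print(waiting_before, s.sent)
```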
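With the foregoing implementation provided in this embodiment of this application, the processor 301 marks work requests with no communication dependency. After receiving the work requests of a collective operation sent by the processor 301, the host channel adapter 303 sends work requests with no communication dependency directly over the network. This avoids the communication latency that such requests would incur under queue management, and reduces the resources that the host channel adapter 303 consumes in performing that management. Although sending work requests with no communication dependency directly over the network triggers some additional interrupts because no queue management is applied, the latency and resource consumption caused by these interrupts are far smaller than those caused by queue management. Therefore, the implementation provided in this embodiment of this application can improve the overall performance of collective communication.
The foregoing embodiments describe the solution using collective operations implemented through MPI as an example, but the embodiments of this application are not limited thereto. Collective operations implemented in other ways may also be implemented with reference to the foregoing implementations. Details are not described again.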
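FIG. 7 is a schematic structural diagram of a computer device 700 according to an embodiment of this application. As shown in FIG. 7, the computer device 700 includes a processor 701, a memory 702, and a host channel adapter 703. The processor 701, the memory 702, and the host channel adapter 703 are connected to one another by a bus.
The memory 702 stores a computer-executable program, and the processor 701 is configured to execute the computer-executable program to implement the following operations:
converting an operation request of collective communication into work requests, and marking the work requests that have no communication dependency; and sending the work requests to the host channel adapter 703.
The host channel adapter 703 is configured to determine whether a received work request is a work request with no communication dependency, to forward directly a work request marked as having no communication dependency, and to forward a work request not marked as having no communication dependency after queue-based management.
For a specific implementation of the computer device 700 shown in FIG. 7, refer to the implementation of the computer device 300 shown in FIG. 3A and the implementations shown in FIG. 5A and FIG. 5B. For example, the foregoing communication submodule may be a process described in FIG. 5A and FIG. 5B. Details are not described again.
With the computer device 700 shown in FIG. 7, the processor 701 marks work requests with no communication dependency. After receiving the work requests of a collective operation sent by the processor 701, the host channel adapter 703 sends work requests with no communication dependency directly over the network. This avoids the communication latency that such requests would incur under queue management, reduces the resources that the host channel adapter 703 consumes in performing that management, and improves the overall performance of collective communication.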
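FIG. 8 is a schematic structural diagram of a communication system according to an embodiment of this application. As shown in FIG. 8, the communication system includes at least one second computer device 800, and the at least one second computer device 800 communicates with the computer device 700 in FIG. 7 over a network 708.
The embodiment shown in FIG. 8 may be implemented with reference to the implementation of the system shown in FIG. 3B. Specifically, the second computer device 800 may be implemented with reference to the implementation of the computer device 400 in FIG. 3B. In FIG. 8, there may be one or more second computer devices 800, and the computer device 700 may communicate with the one or more second computer devices 800 over the network 708.
For the implementation of communication between the computer device 700 and the second computer device 800 in FIG. 8, refer to the implementations shown in FIG. 3B, FIG. 5A, and FIG. 5B. Details are not described again.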
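FIG. 9 is a schematic flowchart of a method for implementing collective communication according to an embodiment of this application. As shown in FIG. 9, the method includes:
Step 900: Obtain an operation request of collective communication.
Step 901: Convert the operation request of the collective communication into work requests, and mark the work requests that have no communication dependency.
Step 902: Forward directly the work requests marked as having no communication dependency; forward the work requests not marked as having no communication dependency after queue-based management.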
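Step 901 can be illustrated by converting one collective operation into point-to-point work requests. The sketch below assumes a binomial-tree broadcast rooted at rank 0, which is a common way to realize a broadcast but is not stated in this application; the function name `broadcast_to_work_requests` and the dictionary layout are hypothetical. Sends issued by the root carry data that originates locally, so they are marked with the first identifier; sends issued by any other rank forward data received from its parent, so they carry a dependency.

```python
from typing import Dict, List, Optional, Tuple


def broadcast_to_work_requests(n: int) -> List[dict]:
    """Express a binomial-tree broadcast from rank 0 over n ranks as point-to-point sends."""
    parent: Dict[int, int] = {}
    work_requests: List[dict] = []
    step = 1
    while step < n:
        # In this round, every rank below `step` already holds the data.
        for src in range(step):
            dst = src + step
            if dst >= n:
                continue
            parent[dst] = src
            # A non-root sender forwards data it first had to receive from its parent.
            needs: Optional[Tuple[int, int]] = (parent[src], src) if src in parent else None
            work_requests.append({
                "src": src,
                "dst": dst,
                "needs": needs,
                "opcode": "IBV_SEND_DIRECTLY" if needs is None else "IBV_SEND",
            })
        step *= 2
    return work_requests


wrs = broadcast_to_work_requests(4)
for wr in wrs:
    print(wr["src"], "->", wr["dst"], wr["opcode"], wr["needs"])
```

With four ranks, the two sends from rank 0 are eligible for direct forwarding in step 902, while the send from rank 1 to rank 3 must wait under queue management until rank 1 has received the data from rank 0.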
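The method shown in FIG. 9 may be implemented by a computer device, for example, the computer device 300 in FIG. 3A. The method shown in FIG. 9 may also be implemented with reference to the implementations shown in FIG. 5A and FIG. 5B. Details are not described again.
In the method shown in FIG. 9, work requests with no communication dependency are marked and sent directly over the network. This avoids the communication latency that such requests would incur under queue management, reduces the resources consumed in performing that management, and improves the overall performance of collective communication.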
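A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of function. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other manners. For example, the division into modules is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units.
If the integrated module is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily figure out various equivalent modifications or replacements within the technical scope disclosed in the present invention, and such modifications or replacements shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.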

Claims (31)

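  1. A computer device, comprising a processor, a memory, and a host channel adapter, wherein:
    the memory stores a computer-executable program;
    the processor is configured to execute the computer-executable program to implement the following operations:
    converting an operation request of collective communication into work requests, and marking the work requests that have no communication dependency; and sending the work requests to the host channel adapter; and
    the host channel adapter is configured to determine whether a received work request is a work request with no communication dependency, to forward directly a work request marked as having no communication dependency, and to forward a work request not marked as having no communication dependency after queue-based management.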
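  2. The computer device according to claim 1, wherein the collective communication is any one of the following: communication between one first communication submodule and multiple second communication submodules, communication between multiple first communication submodules and one second communication submodule, or communication between multiple first communication submodules and multiple second communication submodules.
  3. The computer device according to claim 1 or 2, wherein the work request is a communication request between one first communication submodule and one second communication submodule.
  4. The computer device according to claim 3, wherein no communication dependency means that the communication between the one first communication submodule and the one second communication submodule does not depend on any other communication submodule.
  5. The computer device according to any one of claims 1 to 4, wherein the processor is further configured to execute the computer-executable program to implement the following operation:
    identifying, based on the manner of communication between different communication submodules in the work request, whether a work request is a work request with no communication dependency.
  6. The computer device according to claim 5, wherein the manner of communication comprises the manner in which data is transmitted between the two communication submodules in a work request, and whether the data transmission between the two communication submodules depends on data sent by another communication submodule.
  7. The computer device according to claim 5 or 6, wherein the manner of communication is determined according to the interface used for communication between the different communication submodules.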
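  8. The computer device according to any one of claims 3 to 7, wherein the first communication submodule runs on the computer device, the second communication submodule runs on another computer device, and the computer device communicates with the other computer device over a network; and
    the host channel adapter is further configured to forward directly over the network a work request marked as having no communication dependency, and to forward over the network, after queue-based management, a work request not marked as having no communication dependency.
  9. The computer device according to any one of claims 1 to 8, wherein the host channel adapter forwarding, after queue-based management, a work request not marked as having no communication dependency comprises:
    the host channel adapter loading the work request not marked as having no communication dependency into a queue, and determining whether the condition recorded in the queue for triggering that work request is satisfied; and
    when the condition is satisfied, sending the work request not marked as having no communication dependency.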
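  10. The computer device according to any one of claims 1 to 9, wherein the work requests comprise multiple work requests; and
    the multiple work requests comprise one or more work requests marked as having no communication dependency and one or more work requests not marked as having no communication dependency.
  11. The computer device according to any one of claims 1 to 10, wherein the processor is further configured to execute the computer-executable program to implement the following operations:
    receiving the operation request of the collective communication initiated by an application running on the computer device; and
    obtaining the number of communication submodules of the collective communication, the task to be performed by each communication submodule, and information about the data transferred between different communication submodules and the manner of transfer.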
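  12. The computer device according to any one of claims 1 to 11, wherein the processor converting an operation request of collective communication into work requests comprises:
    converting the operation request of collective communication into work requests and control commands for executing the work requests, wherein the control commands are used to control the work requests so as to implement the collective operation.
  13. The computer device according to any one of claims 1 to 12, wherein the collective communication is collective communication based on the Message Passing Interface (MPI) standard.
  14. The computer device according to any one of claims 1 to 13, wherein the host channel adapter is implemented by a network interface card (NIC), an independent chip, or a chipset.
  15. The computer device according to any one of claims 2 to 14, wherein the communication submodule is a process or a thread.
  16. A communication system, comprising at least one second computer device, wherein the at least one second computer device communicates over a network with the computer device according to any one of claims 1 to 15.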
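  17. A method for implementing collective communication, comprising:
    obtaining an operation request of collective communication;
    converting the operation request of the collective communication into work requests, and marking the work requests that have no communication dependency;
    forwarding directly the work requests marked as having no communication dependency; and
    forwarding, after queue-based management, the work requests not marked as having no communication dependency.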
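  18. The method according to claim 17, wherein the collective communication is any one of the following: communication between one first communication submodule and multiple second communication submodules, communication between multiple first communication submodules and one second communication submodule, or communication between multiple first communication submodules and multiple second communication submodules.
  19. The method according to claim 17 or 18, wherein the work request is a communication request between one first communication submodule and one second communication submodule.
  20. The method according to claim 19, wherein no communication dependency means that the communication between the one first communication submodule and the one second communication submodule does not depend on any other communication submodule.
  21. The method according to any one of claims 17 to 20, further comprising:
    identifying, based on the manner of communication between different communication submodules in the work request, whether a work request is a work request with no communication dependency.
  22. The method according to claim 21, wherein the manner of communication comprises the manner in which data is transmitted between the two communication submodules in a work request, and whether the data transmission between the two communication submodules depends on data sent by another communication submodule.
  23. The method according to claim 21 or 22, wherein the manner of communication is determined according to the interface used for communication between the different communication submodules.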
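  24. The method according to any one of claims 19 to 23, wherein the first communication submodule runs on a first computer device, the second communication submodule runs on a second computer device, and the first computer device communicates with the second computer device over a network; and
    the method further comprises: forwarding directly over the network a work request marked as having no communication dependency, and forwarding over the network, after queue-based management, a work request not marked as having no communication dependency.
  25. The method according to any one of claims 17 to 24, wherein forwarding, after queue-based management, a work request not marked as having no communication dependency comprises:
    loading the work request not marked as having no communication dependency into a queue, and determining whether the condition recorded in the queue for triggering that work request is satisfied; and
    when the condition is satisfied, sending the work request not marked as having no communication dependency.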
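  26. The method according to any one of claims 17 to 25, wherein the work requests comprise multiple work requests; and
    the multiple work requests comprise one or more work requests marked as having no communication dependency and one or more work requests not marked as having no communication dependency.
  27. The method according to any one of claims 17 to 26, further comprising:
    obtaining, according to the obtained operation request of the collective communication, the number of communication submodules of the collective communication, the task to be performed by each communication submodule, and information about the data transferred between different communication submodules and the manner of transfer.
  28. The method according to any one of claims 17 to 27, wherein converting the operation request of the collective communication into work requests comprises:
    converting the collective operation into work requests and control commands for executing the work requests, wherein the control commands are used to control the work requests so as to implement the collective operation.
  29. The method according to any one of claims 17 to 28, wherein the collective communication is collective communication based on the Message Passing Interface (MPI) standard.
  30. The method according to any one of claims 17 to 29, wherein the host channel adapter is implemented by a network interface card (NIC), an independent chip, or a chipset.
  31. The method according to any one of claims 18 to 30, wherein the communication submodule is a process or a thread.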
PCT/CN2021/103616 2020-11-26 2021-06-30 Method for implementing collective communication, computer device, and communication system WO2022110805A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21896313.0A EP4236125A4 (en) 2020-11-26 2021-06-30 METHOD FOR IMPLEMENTING COLLECTIVE COMMUNICATION, COMPUTER DEVICE AND COMMUNICATION SYSTEM
US18/324,742 US20230300080A1 (en) 2020-11-26 2023-05-26 Method for implementing collective communication, computer device, and communication system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011345108 2020-11-26
CN202011345108.6 2020-11-26
CN202011600044.X 2020-12-29
CN202011600044.XA CN114567520B (zh) 2020-11-26 2020-12-29 Method for implementing collective communication, computer device, and communication system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/324,742 Continuation US20230300080A1 (en) 2020-11-26 2023-05-26 Method for implementing collective communication, computer device, and communication system

Publications (1)

Publication Number Publication Date
WO2022110805A1 true WO2022110805A1 (zh) 2022-06-02

Family

ID=81712659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103616 WO2022110805A1 (zh) 2020-11-26 2021-06-30 Method for implementing collective communication, computer device, and communication system

Country Status (4)

Country Link
US (1) US20230300080A1 (zh)
EP (1) EP4236125A4 (zh)
CN (1) CN114567520B (zh)
WO (1) WO2022110805A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106375329A (zh) * 2016-09-20 2017-02-01 腾讯科技(深圳)有限公司 Data push method, timing controller, and data push system
CN107819855A (zh) * 2017-11-14 2018-03-20 成都路行通信息技术有限公司 Message distribution method and apparatus
WO2018107331A1 (zh) * 2016-12-12 2018-06-21 华为技术有限公司 Computer system and memory access technology
EP3385842A1 (en) * 2017-04-09 2018-10-10 INTEL Corporation Efficient thread group scheduling
CN109716311A (zh) * 2016-09-29 2019-05-03 英特尔公司 System, apparatus, and method for performing distributed arbitration

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7086062B1 (en) * 1999-10-11 2006-08-01 I2 Technologies Us, Inc. System and method for handling a unit of work
US7895601B2 (en) * 2007-01-10 2011-02-22 International Business Machines Corporation Collective send operations on a system area network
US10158702B2 (en) * 2009-11-15 2018-12-18 Mellanox Technologies, Ltd. Network operation offloading for collective operations
US8811417B2 (en) * 2009-11-15 2014-08-19 Mellanox Technologies Ltd. Cross-channel network operation offloading for collective operations
CN103455380A (zh) * 2012-06-05 2013-12-18 上海斐讯数据通信技术有限公司 Multi-process communication system and method for establishing it and communicating through it
US9529643B2 (en) * 2015-01-26 2016-12-27 Qualcomm Incorporated Method and system for accelerating task control flow
US9916178B2 (en) * 2015-09-25 2018-03-13 Intel Corporation Technologies for integrated thread scheduling
EP3877863B1 (en) * 2018-12-13 2024-04-24 Huawei Technologies Co., Ltd. Apparatus, method and computer program product for performing a collective communication operation in a data communications network
CN111694675B (zh) * 2019-03-15 2022-03-08 上海商汤智能科技有限公司 Task scheduling method and apparatus, and storage medium
CN112631802B (zh) * 2019-04-29 2024-04-12 杭州涂鸦信息技术有限公司 Inter-thread communication method and related apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4236125A4

Also Published As

Publication number Publication date
US20230300080A1 (en) 2023-09-21
CN114567520B (zh) 2023-06-02
EP4236125A1 (en) 2023-08-30
CN114567520A (zh) 2022-05-31
EP4236125A4 (en) 2024-03-06

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021896313

Country of ref document: EP

Effective date: 20230525

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896313

Country of ref document: EP

Kind code of ref document: A1