WO2024077999A1 - Collective communication method and computing cluster

Collective communication method and computing cluster

Info

Publication number: WO2024077999A1
Authority: WO (WIPO, PCT)
Prior art keywords: communication, communication group, address, data, processes
Application number: PCT/CN2023/101329
Other languages: English (en), French (fr)
Inventors: 祝佳, 勾文进
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024077999A1 (published in Chinese)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H04L 67/104: Peer-to-peer [P2P] networks
    • H04L 67/1044: Group management mechanisms
    • H04L 67/1042: Peer-to-peer [P2P] networks using topology management mechanisms
    • H04L 67/14: Session management
    • H04L 67/141: Setup of application sessions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present application relates to the field of computers, and in particular to a collective communication method and a computing cluster.
  • Collective communication, also known as group communication or aggregate communication, is a communication behavior in which multiple processes running on multiple computing resources in a computing cluster form a process group and participate in communication to perform computing tasks.
  • the above process group may include at least one communication group with a communication relationship, and each communication group includes at least two processes with a communication relationship.
  • multiple data transmissions are performed between processes in the communication group to achieve computing tasks with large amounts of data.
  • each process in the collective communication is pre-allocated with memory resources.
  • Each process in the communication group needs to obtain the memory address corresponding to the memory resources of the remaining processes in the communication group before each data transmission, resulting in a large number of repeated memory address acquisition operations in the collective communication, which increases the communication delay of the collective communication.
  • the embodiment of the present application provides a collective communication method and a computing cluster.
  • the first process obtains and records the memory address corresponding to the memory resource of the second process in the communication group, so that the multiple data transmissions can be performed according to the recorded memory address, avoiding repeated acquisition of the memory address used for the transmission data in the multiple data transmissions, thereby reducing the latency of the computing cluster.
  • an embodiment of the present application provides a collective communication method, which is applied to a computing cluster, wherein the computing cluster includes multiple computing resources, N processes are running on the multiple computing resources, the N processes form a process group for executing computing tasks, each of the N processes is allocated memory resources, the process group includes M communication groups, each communication group includes at least two processes with a communication relationship, wherein M is an integer greater than or equal to 1, and N is an integer greater than or equal to 2; the method includes: the first process in the first communication group obtains a first memory address, and records the first memory address; wherein the first memory address is a memory address corresponding to the memory resource of the second process in the first communication group; the first communication group is any communication group among the M communication groups; the first process in the first communication group transmits data with the second process in the first communication group multiple times according to the first memory address.
  • FIG6a is a schematic diagram of a collective communication process provided by an embodiment of the present application.
  • the first process may be process 0, the second process may be process 1, the first memory address may be address A0, and the data may be data D0.
  • the first process is process 1
  • the second process is process 2.
  • Before the first process in the first communication group performs multiple data transmissions with the second process in the first communication group, the first process obtains the memory address of the memory resource corresponding to the second process. This avoids repeatedly obtaining the target address, that is, the first memory address, in multiple communication steps (such as communication step 0 to communication step 2 in Figure 6a), that is, in multiple data transmissions, thereby reducing the latency of the computing cluster.
  • the first process in the first communication group transmits data with the second process in the first communication group multiple times according to the first memory address, including: the first process in the first communication group writes data multiple times into the memory resources of the second process in the first communication group corresponding to the first memory address.
  • the first process in the first communication group can realize data transmission with the second process in the first communication group by writing data multiple times into the memory resource of the second process in the first communication group corresponding to the first memory address.
  • the first process in the first communication group transmits data with the second process multiple times according to the first memory address, including: the first process in the first communication group reads data multiple times from the memory resources of the second process in the first communication group corresponding to the first memory address.
  • the first process in the first communication group can realize data transmission with the second process by reading data multiple times from the memory resources of the second process in the first communication group corresponding to the first memory address.
  • In a possible implementation, each time the first process in the first communication group transmits data with the second process in the first communication group, it records the address space of the first memory address occupied by the transmitted data; before each data transmission, the first process determines the memory address corresponding to the data to be transmitted according to the recorded address-space occupancy of the first memory address.
  • For example, the address space of the first memory address may be address space AS1, and the first memory address may be memory address A11.
  • Before the first process in the first communication group transmits data for the first time, it may record the occupancy of address space AS1 as unoccupied, and determine that the memory address corresponding to the data to be transmitted is memory address A11. Before it transmits data for the second time, it may record the address space of the first memory address as occupied up to memory address A11, and determine that the memory address corresponding to the data of the second transmission is the memory address of the next address space in AS1 adjacent to the address space corresponding to memory address A11, for example, memory address A12.
  • Similarly, before the third transmission, the first process in the first communication group may record the address space as occupied up to memory address A12, and determine that the memory address corresponding to the data of the third transmission is the memory address of the next address space in AS1 adjacent to the address space corresponding to memory address A12, for example, memory address A13.
  • In this way, the first process in the first communication group records, for each data transmission, the address-space occupancy of the first memory address, so that before each data transmission the memory address corresponding to the data to be transmitted is determined from the recorded occupancy. This realizes management of the memory resource corresponding to the first memory address, and the memory address for each transmission can be accurately determined from the first memory address obtained in advance.
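  • The bookkeeping described above can be pictured with a short sketch in C. This is a minimal illustration; the names (peer_window_t, remote_base, used) are hypothetical, not from the patent: the remote address is fetched once, and a recorded occupancy cursor yields the target address of each subsequent transmission.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one peer: the remote address (the "first
 * memory address") is obtained once; `used` records how much of its
 * address space earlier transmissions already occupy. */
typedef struct {
    uint64_t remote_base;   /* first memory address, obtained once */
    size_t   used;          /* recorded address-space occupancy */
} peer_window_t;

/* Before each transmission, derive the target address from the recorded
 * occupancy instead of asking the peer again; afterwards, record the
 * span the transmitted data occupies. */
static uint64_t next_target_address(peer_window_t *w, size_t len) {
    uint64_t addr = w->remote_base + w->used;   /* e.g. A11, then A12, A13 */
    w->used += len;                             /* mark this span occupied */
    return addr;
}
```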
  • the number of first processes in the first communication group is one or more, and the number of second processes in the first communication group is one or more.
  • the number of first processes in the first communication group is one or more, and the number of second processes in the first communication group is one or more, which can be applicable to a variety of inter-process communication scenarios, such as broadcast, reduction, global collection and other communication scenarios in collective communication.
  • the number of first processes in the first communication group is multiple, and the number of second processes in the first communication group is multiple; the first process in the first communication group obtains the first memory address, including: a main process among the multiple first processes in the first communication group obtains the first memory address corresponding to each second process in the first communication group to obtain an address set; the main process sends the address set to other first processes among the multiple first processes except the main process.
  • Figure 6d is a schematic diagram of another collective communication process provided by an embodiment of the present application.
  • For example, the main process may be process 0, and process 0 may obtain the first memory address corresponding to each second process in the communication group to which process 0 belongs, such as process 1 to process 3, to obtain an address set; it then sends the addresses in the address set to the other first processes in the communication group except the main process itself.
  • an embodiment of the present application provides a computing cluster, comprising multiple computing resources, N processes running on the multiple computing resources, the N processes forming a process group for executing computing tasks, each of the N processes being allocated memory resources, the process group comprising M communication groups, each communication group comprising at least two processes having a communication relationship, wherein M is an integer greater than or equal to 1, and N is an integer greater than or equal to 2; the first process in the first communication group is used to obtain a first memory address and record the first memory address; wherein the first memory address is a memory address corresponding to the memory resource of the second process in the first communication group; the first communication group is any communication group among the M communication groups; the first process in the first communication group is used to perform data transmission with the second process multiple times according to the first memory address.
  • the first process in the first communication group is specifically used to: write data multiple times into the memory resource of the second process corresponding to the first memory address.
  • the first process in the first communication group is specifically used to: read data multiple times from the memory resources of the second process corresponding to the first memory address.
  • the first process in the first communication group is specifically used to: record the address space of the first memory address occupied by the transmitted data each time data is transmitted with the second process in the first communication group; before each data transmission, determine the memory address corresponding to the transmitted data based on the address space occupancy of the recorded first memory address.
  • the number of first processes in the first communication group is one or more, the number of the second processes in the first communication group is one or more.
  • the number of first processes in the first communication group is multiple, and the number of second processes in the first communication group is multiple; the main process among the multiple first processes in the first communication group is specifically used to: obtain the first memory address corresponding to each second process in the first communication group to obtain an address set; send the address set to other first processes among the multiple first processes except the main process.
  • the second aspect and any implementation of the second aspect correspond to the first aspect and any implementation of the first aspect respectively.
  • the technical effects corresponding to the second aspect and any implementation of the second aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • an embodiment of the present application provides a computer-readable medium for storing a computer program, the computer program including instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • When the computer program is run on a computing cluster, the computing cluster executes the method in the first aspect or any possible implementation of the first aspect.
  • the third aspect and any implementation of the third aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the third aspect and any implementation of the third aspect can refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, which will not be repeated here.
  • an embodiment of the present application provides a computer program, the computer program including instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • When the computer program is executed by a computing cluster, the computing cluster executes the method in the first aspect or any possible implementation of the first aspect.
  • the fourth aspect and any implementation of the fourth aspect correspond to the first aspect and any implementation of the first aspect, respectively.
  • the technical effects corresponding to the fourth aspect and any implementation of the fourth aspect can refer to the technical effects corresponding to the above-mentioned first aspect and any implementation of the first aspect, which will not be repeated here.
  • FIG. 1a is a schematic diagram of an exemplary bilateral communication process;
  • FIG. 1b is a schematic diagram of an exemplary unilateral communication process;
  • FIG. 1c is a schematic diagram of another exemplary unilateral communication process;
  • FIG. 2a is a schematic diagram of another exemplary point-to-point communication process;
  • FIG. 2b is a schematic diagram of another exemplary point-to-point communication process;
  • FIG. 3a is a schematic diagram of an exemplary collective communication process;
  • FIG. 3b is a schematic diagram of an exemplary address transmission process in collective communication;
  • FIG. 3c is a schematic diagram of another exemplary collective communication process;
  • FIG. 4 is a schematic diagram of the structure of a computing cluster 400 provided in an embodiment of the present application;
  • FIG. 5 is a flow chart of a collective communication method provided in an embodiment of the present application;
  • FIG. 6a is a schematic diagram of a collective communication process provided by an embodiment of the present application;
  • FIG. 6b is a schematic diagram of an address transmission process in collective communication provided by an embodiment of the present application;
  • FIG. 6c-1 is a schematic diagram of an exemplary point-to-point communication process;
  • FIG. 6c-2 is a schematic diagram of an address transmission process in collective communication provided by an embodiment of the present application;
  • FIG. 6d is a schematic diagram of another collective communication process provided in an embodiment of the present application;
  • FIG. 7a is a schematic diagram of a correspondence between a process and a data storage address provided by an embodiment of the present application;
  • FIG. 7b is a schematic diagram of another correspondence between a process and a data storage address provided in an embodiment of the present application;
  • FIG. 8 is a schematic diagram of another collective communication process provided in an embodiment of the present application.
  • "At least one (item)" means one or more, and "plurality" means two or more.
  • "And/or" describes an association relationship between associated objects and indicates that three relationships may exist.
  • For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
  • "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single items or plural items.
  • For example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can each be single or multiple.
  • MPI (Message Passing Interface) is a message-passing standard with bindings for the C, C++ and Fortran programming languages.
  • Fortran is an abbreviation of Formula Translation, meaning "formula translation". It is applied to problems in science, engineering or enterprise management that can be expressed by mathematical formulas, and has strong numerical computation capabilities.
  • MPI can be applied to a variety of system architectures, such as distributed/shared memory multi-core processors, high-performance networks, and combinations of these architectures.
  • MPI is also a parallel programming function library, and its compilation and operation need to be combined with a specific programming language.
  • MPI has been implemented on mainstream operating systems, including Windows and Linux systems.
  • MPI can be a process-level parallel software middleware.
  • the MPI framework manages all computing processes to form a system, and then provides a rich set of inter-process communication functions. Among them, a process is a running instance of a program. In addition to program code, it also contains the execution environment of the program code (memory, registers, program counters, etc.), and is an independent and executable basic program unit in the operating system.
  • The MPI framework assigns a process identification number (rank number) to each process; ranks start from 0 and increase sequentially. Which part of the work each process of an MPI program completes is determined by its process identification number.
  • MPI processes need to communicate within a communication domain.
  • the communication domain is the communication environment between processes, including process groups, contexts, virtual topologies, etc.
  • MPI can support a variety of different communication protocols, such as InfiniBand (IB, a computer network communication standard for high-performance computing) or Transmission Control Protocol (TCP). MPI encapsulates these protocols, provides a unified communication interface, and shields the underlying communication details.
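  • As a concrete illustration of ranks and communication domains, the following is a minimal, generic MPI program in C (illustrative only, not code from the patent): each process queries its rank in MPI_COMM_WORLD and uses it to select its share of the work.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank number, starting at 0 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* processes in the domain */
    /* rank-dependent work division, as described above */
    printf("process %d of %d handles chunk %d\n", rank, size, rank);
    MPI_Finalize();
    return 0;
}
```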
  • Collective communication is also called group communication. Unlike point-to-point communication, which involves only two processes (a sender and a receiver), collective communication involves multiple processes. Which processes participate in a collective communication, and its context, are defined by the communication domain invoked in the collective communication.
  • Collective communication generally includes three functions: communication, synchronization and calculation. Among them, the communication function mainly completes the transmission of data within the collection, the synchronization function realizes the consistency of the execution progress of all processes in the collection at a specific point, and the calculation function is the operation of specific data. Exemplarily, according to the number of peers participating in collective communication, collective communication can be divided into three types of communication: one-to-many, many-to-one, and many-to-many.
  • Broadcast (MPI_Bcast) is used by the root process (the current calling process in the communication group) to send a message to every other process in the communication group.
  • Reduction (MPI_Reduce) aggregates the corresponding values of a variable across all processes into a single value and returns it to the current calling process; that is, it performs an operation over data distributed across different processes. Common operations include summation and taking the maximum or minimum value.
  • Scatter (MPI_Scatter) is used to send different information to each other process. Gather (MPI_Gather) is the reverse of the scatter operation: the root process receives information from every other process.
  • One-to-many collective communication can use the broadcast (MPI_Bcast) and scatter (MPI_Scatter) methods; many-to-one collective communication can use the gather (MPI_Gather) and reduce (MPI_Reduce) methods; many-to-many collective communication can use the global reduce (Allreduce) and global gather (Allgather) methods.
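  • To make the many-to-many case concrete, the sketch below performs a global gather in C; it is a generic MPI illustration, not the patent's code, and the 64-slot buffer is an assumption for brevity.

```c
#include <mpi.h>
#include <stdio.h>

/* Many-to-many collective: every process contributes one integer and
 * receives the contribution of every process (global gather). */
int main(int argc, char **argv) {
    int rank, size, mine, all[64];  /* assumes at most 64 processes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank * 10;  /* this process's block, like D0..D3 above */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("block from rank %d: %d\n", i, all[i]);
    MPI_Finalize();
    return 0;
}
```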
  • High-performance computing (HPC) refers to computer systems whose computing power reaches a certain level. Because it is difficult for a single processor to deliver such computing power, HPC requires multiple central processing units (CPUs) or multiple hosts (such as multiple computer devices) working together. The main purpose of building a high-performance computing system is to increase computing speed; to reach a speed of trillions of operations per second, the requirements on the system's processor, memory bandwidth, computing method, input/output (I/O), storage and other aspects are very high, and each of these links directly affects the system's computing speed. HPC is mainly used to quickly complete data-intensive, computing-intensive and I/O-intensive calculations in scientific research, engineering design, finance, industry and social management.
  • Typical applications include: bioengineering, new drug development, petroleum geophysical exploration, vehicle design (aerospace, ships, automobiles), materials engineering, cutting-edge equipment manufacturing, cryptographic research and various large-scale information processing.
  • the goals of high-performance computing are to: minimize the computational time required to complete special computing problems, maximize the size of problems that can be completed within a specified time, handle a large number of complex problems that were previously impossible, improve cost-effectiveness, and expand solutions to medium-sized problems and budgets.
  • Parallel computing is based on the idea that a large problem can be divided into several smaller problems, and these smaller problems can be solved simultaneously (in parallel) using existing resource capabilities. The solution of these small problems eventually leads to the solution of the large problem.
  • Parallel computing is relative to serial computing.
  • the characteristic of serial computing is that the processor runs the calculation algorithm in sequence according to the order of instructions.
  • Temporal parallelism refers to the pipeline technology used in the central processing unit of the computer, which divides each instruction into multiple steps to be completed, and these steps can be overlapped in time.
  • Spatial parallelism refers to the use of multiple processors to execute computer instructions concurrently, thereby speeding up the problem solving process.
  • The advantages of parallel computing are that it can break through the computing-power limits of serial computers, increase computing speed, complete computing tasks in less time, make better use of the hardware's computing power, and save computing costs.
  • Point-to-point (peer-to-peer) communication is communication between two processes.
  • FIG. 1a is a schematic diagram of an exemplary bilateral communication process.
  • both parties in bilateral communication perform operations.
  • process 0 performs a sending operation
  • process 1 performs a receiving operation.
  • FIG. 1b is a schematic diagram of an exemplary unilateral communication process.
  • one party in unilateral communication participates in the communication operation.
  • writing data in unilateral communication includes: process 0 requests a data write address from process 1; after process 0 receives the data write address fed back by process 1, it performs a data sending operation, that is, writes the data to the data write address.
  • FIG. 1c is a schematic diagram of another exemplary unilateral communication process. As shown in FIG. 1c, reading data in unilateral communication includes: process 0 requests a data storage address from process 1; after process 0 receives the data read address fed back by process 1, it performs a read operation, that is, reads the data from the data storage address. In this way, no sending code needs to be written for process 1. There is a delay in each of the communication processes of FIG. 1a to FIG. 1c.
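  • Standard MPI exposes unilateral communication of this kind through memory windows. The following is a generic one-sided read sketch in C (illustrative, not the patent's implementation): rank 0 fetches a value from rank 1's exposed memory, and rank 1 issues no matching send.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf = 0, exposed;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    exposed = (rank == 1) ? 42 : 0;  /* rank 1 exposes one integer */
    MPI_Win_create(&exposed, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)  /* one-sided read from rank 1, displacement 0 */
        MPI_Get(&buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 0) printf("read %d from rank 1\n", buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```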
  • the kernel allocates memory to each process in advance, and the process manages and uses the allocated memory.
  • process 0 can divide the storage space for the read data from the allocated memory, and then store the data in the storage space.
  • the storage space division of the data by the process can be carried out according to the application requirements, and the embodiment of the present application does not limit this.
  • Point-to-point communication can be specifically carried out using the Rendezvous protocol.
  • FIG2a is a schematic diagram of another point-to-point communication process exemplarily shown.
  • In data writing (Put) under the Rendezvous protocol, the sender first sends a message header (an inquiry message, that is, an RTS (request to send) message) to the receiver and waits for the receiver to prepare a receiving buffer. After the receiving buffer is ready, the receiver sends a response message (a CTS (clear to send) message) to the sender.
  • the CTS message contains the address of the receiving buffer where the receiver stores the data DR.
  • the sender writes (Put operation) the data DR into the receiving buffer of the receiver, and sends a data transmission end notification (such as a FIN packet) after completing the writing.
  • FIN indicates that the communication connection is closed normally, with no data loss, and that all data packets in the sender's buffer have been sent.
  • From the point at which the sender begins sending the data DR, only the sender participates in the communication operation and the receiver does not, that is, the transfer is unilateral.
  • FIG. 2b is a schematic diagram of another point-to-point communication process.
  • In data reading (Get) under the Rendezvous protocol, the sender first sends a message header (an inquiry RTS (request to send) message) to the receiver to inform the receiver of the storage address of the data DR.
  • the receiver reads (Get operation) the data DR from the sender according to the address, and sends a data transmission end notification (such as a FIN packet) after completing the reading.
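  • The Put and Get handshakes can be summarized in a conceptual sketch; the message layout below is hypothetical, since each MPI implementation defines its own wire format for RTS, CTS and FIN.

```c
#include <stdint.h>

/* Conceptual Rendezvous messages (illustrative layout only). */
typedef enum { MSG_RTS, MSG_CTS, MSG_FIN } msg_kind_t;

typedef struct {
    msg_kind_t kind;
    uint64_t   addr;  /* Put: receiver buffer address carried by CTS;
                         Get: sender data address carried by RTS */
    uint64_t   len;   /* payload length in bytes */
} rendezvous_msg_t;

/* Put: sender RTS -> receiver prepares buffer, replies CTS(addr)
 *      -> sender writes data DR to addr -> sender FIN.
 * Get: sender RTS(addr) -> receiver reads data DR from addr
 *      -> receiver FIN. */
```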
  • Collective communication is implemented in a parallel computing manner and is widely used in scenarios such as high-performance computing and distributed training of neural network models.
  • collective communication can include multiple communication steps, each of which is implemented based on the above-mentioned point-to-point communication.
  • a collective communication is implemented based on the point-to-point communication of the Rendezvous protocol, by splitting a complete large packet into multiple sub-packets, and transmitting multiple sub-packets through multiple communication steps to cope with the transmission of large packets.
  • Data (such as data DR) whose volume exceeds the data-volume threshold (such as 512 bytes) is regarded as a large packet.
  • FIG3a is a schematic diagram of a collective communication process exemplarily shown.
  • the collective communication based on the Rendezvous protocol and the MPI_Allgather Ring algorithm can be applied to the collection of large packages such as weather forecast data.
  • Each process participating in the weather forecast obtains the data of each process participating in the collective communication through three communication steps: communication step 0, communication step 1 and communication step 2 according to the MPI_Allgather Ring algorithm to obtain complete weather forecast data.
  • the communication steps of the collective communication follow the logic of the MPI_Allgather Ring algorithm: each process participating in the collective communication obtains data from the left neighbor of the process, and each process participating in the collective communication forms a ring-shaped neighbor relationship.
  • For example, process 0 communicates with process 1, process 1 communicates with process 2, process 2 communicates with process 3, and process 3 communicates with process 0, forming a ring-shaped neighbor relationship. That is to say, under the MPI_Allgather Ring algorithm logic, the collective communication including process 0 to process 3 is realized through communication step 0 to communication step 2, and the algorithm logic determines which specific processes participate in the communication in each communication step.
  • FIG. 3b is a schematic diagram of an address transmission process in a collective communication exemplarily shown.
  • When reading is completed, each reading process (such as process 1) sends a FIN message to the other end (such as process 0) to indicate that the data read is finished.
  • In communication step 0, process 1 reads target data D0 from process 0, process 2 reads target data D1 from process 1, process 3 reads target data D2 from process 2, and process 0 reads target data D3 from process 3.
  • In communication step 1, process 1 reads target data D3 from process 0, process 2 reads target data D0 from process 1, process 3 reads target data D1 from process 2, and process 0 reads target data D2 from process 3.
  • In communication step 2, process 1 reads target data D2 from process 0, process 2 reads target data D3 from process 1, process 3 reads target data D0 from process 2, and process 0 reads target data D1 from process 3.
  • process 0 to process 3 collect the data of each process of the weather forecast through collective communication, that is, collect global data. Under the application logic of the weather forecast, each process can respectively apply the collected global data to realize the function of the process in the weather forecast application.
  • the MPI_Allgather Ring algorithm logic may include: each process participating in the collective communication obtains data from the left neighbor of the process and sends data to the right neighbor.
  • each process (such as process 0) as the sender requests the address of the receiving buffer of the receiving end through the message header (such as RTS message); each process (such as process 1) as the receiving end sends the address (such as A1) of the receiving buffer for caching data (such as D0) to the left neighbor process (such as process 0) through the CTS message, so that the left neighbor process (such as process 0) writes its corresponding data (such as D0) into the address (such as A1) of the receiving buffer of the receiving end (such as process 1).
  • Each process (such as process 0) as the sender sends a FIN message to the other end (process 1) when the writing is completed to inform the completion of the data writing.
  • The buffer address of the data to be read by a process is shifted by a certain offset.
  • For example, in the collective communication scenario shown in FIG. 3a, in communication steps 0 to 2, process 0 sends the data read address, that is, the starting address, to process 1, and process 1 offsets the starting address by the offset corresponding to the communication step to obtain the read address of the data for that step.
  • Specifically, in communication step 0, process 1 receives the starting address A0 sent by process 0 and reads target data D0 from starting address A0; in communication step 1, process 1 receives the starting address A0 sent by process 0, offsets it by a first offset to obtain a first offset address, and reads target data D3 from the first offset address; in communication step 2, process 1 receives the starting address A0 sent by process 0, offsets it by a second offset to obtain a second offset address, and reads target data D2 from the second offset address.
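  • The per-step offset arithmetic can be written down directly. This is a minimal sketch assuming equal-sized blocks stored contiguously from the starting address; the function name and parameters are illustrative, not from the patent.

```c
#include <stdint.h>

/* In a ring allgather over `nprocs` equal blocks of `block_len` bytes,
 * the block read from the left neighbor at step s is the neighbor's own
 * block index shifted back s positions (mod nprocs). */
static uint64_t step_read_address(uint64_t base,   /* starting address, e.g. A0 */
                                  int left_rank,   /* rank of the left neighbor */
                                  int step, int nprocs, uint64_t block_len) {
    int block = ((left_rank - step) % nprocs + nprocs) % nprocs;
    return base + (uint64_t)block * block_len;
}
/* For process 1 reading from process 0 (4 processes): step 0 -> block 0
 * (D0 at A0), step 1 -> block 3 (D3), step 2 -> block 2 (D2). */
```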
  • the communication steps of the collective communication can follow the algorithm logic of MPI, such as the MPI_Allgather Ring algorithm logic, the Neighbor exchange algorithm logic of MPI_Allgather, the MPI_Reduce algorithm logic, etc. shown in Figure 3a above.
  • This embodiment does not limit this and can be adaptively selected according to application requirements.
  • the different algorithm logics followed by the collective communication determine the communication steps of the collective communication and the processes involved in the communication in each communication step. In each communication step, the data transmission between each two processes can be carried out according to the Rendezvous protocol.
  • FIG3c is a schematic diagram of another collective communication process.
  • the collective communication shown in FIG3c may include communication step 0 to communication step 2.
  • each process performs data communication according to the read (Get) method in the Rendezvous protocol.
  • process 0 sends the storage address A0 of target data D0 to process 1, and process 1 reads target data D0 from process 0 according to storage address A0; process 1 sends the storage address A1 of target data D1 to process 0, and process 0 reads target data D1 from process 1 according to storage address A1; process 0 sends a FIN message to inform process 1 that data reading is complete.
  • Process 2 sends the storage address A2 of target data D2 to process 3, and process 3 reads target data D2 from process 2 according to storage address A2; process 3 sends the storage address A3 of target data D3 to process 2, and process 2 reads target data D3 from process 3 according to storage address A3; process 2 sends a FIN message to inform process 3 that data reading is complete.
  • Process 4 sends the storage address A4 of target data D4 to process 5, and process 5 reads target data D4 from process 4 according to storage address A4; process 5 sends the storage address A5 of target data D5 to process 4, and process 4 reads target data D5 from process 5 according to storage address A5; process 4 sends a FIN message to inform process 5 that data reading is complete.
  • each subsequent communication step in FIG3c transmits data to each other in a manner similar to communication step 0.
  • the same parts will not be described here.
  • the neighbor process of each process is replaced in communication step 1 and communication step 2 of FIG3c, and the storage address and data transmitted are also adaptively changed.
  • In communication step 1, process 2 reads target data D1 and D0 from process 1 according to storage address A1, and process 1 reads target data D2 and D3 from process 2 according to storage address A2;
  • process 4 reads target data D3 and D2 from process 3 according to storage address A3, and process 3 reads target data D4 and D5 from process 4 according to storage address A4;
  • process 0 reads target data D5 and D4 from process 5 according to storage address A5, and process 5 reads target data D0 and D1 from process 0 according to storage address A0.
  • In communication step 2, process 1 reads target data D4 and D5 from process 0 according to storage address A0, and process 0 reads target data D2 and D3 from process 1 according to storage address A1;
  • process 3 reads target data D1 and D0 from process 2 according to storage address A2, and process 2 reads target data D4 and D5 from process 3 according to storage address A3;
  • process 5 reads target data D3 and D2 from process 4 according to storage address A4, and process 4 reads target data D0 and D1 from process 5 according to storage address A5.
  • processes 0 to 5 participating in the collective communication respectively obtain the data of each process in the collective communication: target data D0 to target data D5.
  • This embodiment does not limit the order of data reading between processes. For example, in communication step 1, process 1 may first read target data D2 and D3 from process 2 according to storage address A2, and then process 2 may read target data D1 and D0 from process 1 according to storage address A1.
  • the two processes in the embodiment of Figure 3c can perform data transmission according to the write method in the Rendezvous protocol.
  • the specific data transmission process can refer to the process shown in the embodiment of Figure 2a, which will not be repeated here.
  • the specific data transmitted in the above embodiments depends on the application scenario of collective communication.
  • collective communication is applied to weather forecast, and the transmitted data may specifically be parameters of weather forecast, such as historical temperature data, predicted temperature data, etc.
  • Collective communication is applied to distributed training of neural networks, and the transmitted data may specifically be model parameters of the model trained by the distributed training, etc. This embodiment does not limit the application scenario of collective communication and the transmitted data.
  • the processes participating in the collective communication implement the data transmission in each communication step of the collective communication based on the Rendezvous protocol.
  • That is, each data transmission is preceded by an address synchronization operation: the reading end obtains the storage address of the data from the storing end.
  • the embodiment of the present application provides a collective communication method to solve the above problems.
  • When the collective communication method provided by the embodiment of the present application is applied, the first process (such as process 1) obtains the target addresses (such as storage addresses A0, A2 and A3 in FIG. 3a; or storage address A0 and storage addresses A2 to A5 in FIG. 3c) from the second processes (such as process 0, process 2 and process 3 in FIG. 3a; or process 0 and process 2 to process 5 in FIG. 3c); the first process then saves the correspondence between each target address and the corresponding second process (for example, process 0 corresponds to storage address A0).
  • the address synchronization operation of collective communication is realized to ensure that the first process participating in the collective communication saves the target address corresponding to the second process in the collective communication. Furthermore, the first process can directly transmit target data with the second process according to the corresponding relationship and the communication rules of collective communication (such as the algorithm logic of MPI_Allgather), thereby reducing a large number of repeated address synchronization operations in each communication step and reducing the delay of collective communication.
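  • The core idea can be condensed into a short sketch in C with MPI: each process's buffer address is synchronized once (here with a global gather of small packets), cached per rank, and looked up directly in every later communication step. All names (addr_cache, sync_addresses) are hypothetical, and this is not the patent's implementation; it assumes at most MAX_PEERS processes.

```c
#include <mpi.h>
#include <stdint.h>

#define MAX_PEERS 64
static uint64_t addr_cache[MAX_PEERS];  /* rank -> cached target address */

/* One-time address synchronization before the collective starts
 * (assumes the communicator has at most MAX_PEERS processes). */
void sync_addresses(uint64_t my_addr) {
    MPI_Allgather(&my_addr, 1, MPI_UINT64_T,
                  addr_cache, 1, MPI_UINT64_T, MPI_COMM_WORLD);
}

/* Every later communication step resolves its target address locally,
 * with no per-step RTS/CTS address exchange. */
uint64_t target_address(int peer_rank) {
    return addr_cache[peer_rank];
}
```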
  • FIG. 4 is a schematic diagram of the structure of a computing cluster 400 provided in the embodiment of the present application.
  • As shown in FIG. 4, the computing cluster 400 may include multiple servers communicating through a local area network 403, such as server 401, server 402, ..., server 40m, where m is the number of servers.
  • That is, the computing cluster 400 is a system comprising multiple nodes, each of which may be a server.
  • the local area network 403 may include communication devices such as switches and network cards.
  • the server 401 includes a processor 4011 and a memory 4013 that communicate through a bus 4012.
  • the processor 4011 includes multiple cores, such as core 4011-1, core 4011-2, ..., and core 4011-n, where n is the number of cores. Different cores communicate through bus 4012. Multiple cores may belong to the same or different central processing units.
  • server 402 includes processor 4021 and memory 4023 that communicate via bus 4022.
  • Processor 4021 includes multiple cores, such as core 4021-1, core 4021-2, ..., and core 4021-n. Different cores communicate via bus 4022.
  • Server 40m includes processor 40m1 and memory 40m3 that communicate via bus 40m2.
  • Processor 40m1 includes multiple cores, such as core 40m1-1, core 40m1-2, ..., and core 40m1-n. Different cores communicate via bus 40m2.
  • The different processes in FIG. 3a and FIG. 3c, and the different processes in the embodiments of the present application, run on different cores.
  • The different cores may belong to the same server.
  • For example, process 0 runs on core 4011-1, and process 1 runs on core 4011-2.
  • Alternatively, the different cores may belong to different servers.
  • For example, process 0 runs on any core of server 401, and process 1 runs on any core of server 402.
  • a server in the above-mentioned computing cluster starts a task (such as a task for obtaining weather forecast data) and assigns a process group to the task.
  • the process group includes multiple processes that perform data communication to implement the task, for example, process 0 to process 3 in Figure 3a, or process 0 to process 5 in Figure 3c.
  • The computing cluster shown in FIG. 4 is only an example; a computing cluster may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration.
  • the various components shown in Figure 4 may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • Figure 5 is a flow chart of a collective communication method provided in the embodiment of the present application.
  • The method is applied to a computing cluster that includes multiple computing resources, with N processes running on the multiple computing resources; the N processes form a process group for executing computing tasks.
  • Each of the N processes is allocated memory resources, and the process group includes M communication groups, each communication group including at least two processes with a communication relationship, where M is an integer greater than or equal to 1 and N is an integer greater than or equal to 2. The method includes, but is not limited to, the following steps:
  • a first process in a first communication group obtains a first memory address and records the first memory address
  • the second process is a process among multiple processes that transmits target data with the first process, that is, the first process and the second process have a communication relationship and form a first communication group; the first communication group is any communication group among the M communication groups.
  • The first process obtains the first memory address, that is, obtains target information, where the target information includes the unit identifier of the second process and the target address.
  • the target address is used to indicate the storage space of the second process for data, and the target address is also the first memory address, which is the memory address of the memory resource corresponding to the second process.
  • In one-to-many collective communication, the first process in the first communication group may include a root process.
  • The second process in the first communication group may include a relay process, and a relay process may in turn act as the root process of the next communication group.
  • A relay process is a process that forwards the data transmitted by the root process. For example, if the root process is process 0, the target process is process 3, and the relay processes are process 1 and process 2, then process 1 is the root process of process 2.
  • process 0 can transmit data 0 to process 1, and there is a communication relationship between process 0 and process 1, and they belong to communication group 1; process 1 transmits data 0 to process 2, and there is a communication relationship between process 1 and process 2, and they belong to communication group 2; process 2 transmits data 0 to process 3, and there is a communication relationship between process 2 and process 3, and they belong to communication group 3.
  • one data sender corresponds to one data receiver.
  • the first process in the first communication group may include any process except the root process.
  • the second process in the first communication group may include the upper-level process of the first process.
  • the second process in the first communication group may include the lower-level process of the first process.
  • Many-to-one collective communication is the reverse transmission process of the one-to-many collective communication process. That is to say, in many-to-one collective communication, one data sending end also corresponds to one data receiving end.
  • the first process includes each process participating in the collective communication, and the second process includes processes in the collective communication except the first process. That is, in many-to-many collective communication, multiple data transmitters may correspond to multiple data receivers.
  • the first process and the second process may be different processes, and the unit identifier of the second process may be the process identifier (rank number) of the process.
  • For the process identifier, reference may be made to the descriptions in the terminology and related technical explanations above.
  • The target address is described below in the specific collective communication method.
  • FIG. 6a is a schematic diagram of a collective communication process provided by an embodiment of the present application.
  • the first process may include each of processes 0 to 3.
  • For example, when the first process is process 0, the second process includes process 1 to process 3; when the first process is process 3, the second process includes process 0 to process 2.
  • the target data may include data to be written, that is, data written by the first process to another process in a subsequent communication step.
  • the above-mentioned target address may specifically include a buffer address of the data to be written, that is, the address of the buffer of the second process where the target data is written by the first process.
  • the target data may include data to be read, that is, data read from another process by the first process in a subsequent communication step.
  • the above-mentioned target address may specifically include the current storage address of the data to be read, that is, the storage address when the target data to be read by the first process is stored by the second process.
  • the first process obtains the target address from the second process, including: the first process performs address synchronization before performing collective communication (such as transmitting target data D0 to target data D3).
  • The address synchronization may include: according to the MPI_Allgather Ring algorithm logic, each process obtains addresses from its left-neighbor process, the processes forming a ring neighbor relationship: process 0 obtains target addresses A1 to A3 from process 3; process 1 obtains target addresses A0, A2 and A3 from process 0; process 2 obtains target addresses A0, A1 and A3 from process 1; process 3 obtains target addresses A0 to A2 from process 2.
  • FIG. 6b is a schematic diagram of an address transmission process in a collective communication provided by an embodiment of the present application.
  • address transmission can be performed based on the Rendezvous protocol in the address synchronization of FIG. 6a.
  • The target address can be transmitted between each two interacting processes in accordance with the reading method of the Rendezvous protocol: process 0 sends a message header RTS to process 1 to inform process 1 of the storage location of the address information, so that process 1 reads the target addresses A0, A2 and A3 from process 0 according to that storage location.
  • Alternatively, the target address can be transmitted between each two interacting processes in accordance with the writing method of the Rendezvous protocol: process 0 sends a message header RTS to process 1 to request the buffer address of process 1, and process 1 then feeds back the buffer address to process 0 through a CTS message, so that process 0 writes the target addresses A0, A2 and A3 into the buffer address.
  • In this way, the two interacting processes transmit the target address in the read or write mode of the Rendezvous protocol. A message header (such as an RTS message) needs to be sent only once before the collective communication, or the message header and the buffer-address feedback (such as a CTS message) need to be exchanged only once, so that each process participating in the collective communication can obtain the global target addresses of the collective communication: the storage addresses of the data of all processes in the collective communication other than itself.
  • each process participating in the collective communication saves the corresponding relationship between the second process and the target address, and each process can directly perform collective communication according to the corresponding relationship, so that there is no need to repeat the message sending and feedback operations in each communication step of the collective communication, thereby reducing the delay of the collective communication.
  • the target address is usually not a large packet, that is, the data volume of the target address is less than the data volume threshold (such as 512 bytes).
  • this embodiment can transmit the target address in accordance with the Eager protocol, that is, the small packet communication method, so that each process participating in the collective communication obtains the global target address of the collective communication: the storage address of the data of other processes in the collective communication except the process.
  • Figure 6c-1 is a schematic diagram of an exemplary point-to-point communication process, showing data transmission under the Eager protocol.
  • As shown in Figure 6c-1, the transmitting end packages the payload (such as data DR) and the message header (such as a sending notification) into a data packet and sends it directly to the receiving end, and the receiving end performs the receiving operation.
  • both communicating parties can participate in the communication operation, that is, bilateral communication is performed.
  • The receiving end copies the received data from the receiving buffer. This is suitable for scenarios where the amount of data transmitted (such as data DR) is less than or equal to the data-volume threshold (such as 512 bytes), that is, where the transmitted data is a small packet (such as the target address).
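  • Protocol selection of this kind often reduces to a threshold test. The sketch below is illustrative, with the 512-byte threshold taken as an assumption from the example above.

```c
#include <stddef.h>
#include <stdbool.h>

#define EAGER_THRESHOLD 512  /* assumed data-volume threshold, bytes */

/* Small packets (such as a target address) are sent eagerly, packed
 * with their header in one message; larger payloads fall back to the
 * Rendezvous RTS/CTS handshake described earlier. */
static bool use_eager_protocol(size_t payload_len) {
    return payload_len <= EAGER_THRESHOLD;
}
```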
  • Figure 6c-2 is a schematic diagram of the address transmission process in a collective communication provided by an embodiment of the present application.
  • The target address can be transmitted according to the Eager protocol between each two interacting processes, as shown in Figure 6c-2: process 0 directly sends a packet containing the message header (notification information) together with the target addresses A0, A2 and A3 to process 1, and process 1 receives the target addresses sent by process 0.
  • the two interacting processes transmit the target address according to the Eager protocol, and the target address can be directly transmitted before the collective communication, without the need for the message header (such as RTS message) and the feedback of the buffer address (such as CTS message).
  • each process participating in the collective communication saves the corresponding relationship between the second process and the target address, and each process can directly perform collective communication based on the corresponding relationship, thereby eliminating the need to repeat the message sending and feedback operations in each communication step of the collective communication, that is, data transmission, thereby further reducing the delay of the collective communication.
  • the target address A3 and the target address A2 saved by process 0 are obtained by process 0 from process 3. Every two interacting processes in process 0 to process 3 can transmit the target address in a manner similar to the embodiments of Figures 6b and 6c-2, except that the interacting processes and the transmitted target addresses are adaptively adjusted.
FIG. 6d is a schematic diagram of another collective communication process provided by an embodiment of the present application. As shown in FIG. 6d, communication steps 0 to 2 of the collective communication are still implemented according to the MPI_Allgather Ring algorithm logic, similar to the way target data D0 to target data D3 are transmitted in the embodiment of FIG. 6a; the same parts are not repeated here, and the embodiment of FIG. 6a can be referred to. The difference is that the address synchronization of FIG. 6d includes: a main process (such as process 0) obtains the target addresses from the second processes, and sends a second target address to each second process. Here, the second processes include all processes participating in the collective communication except the main process (such as process 1 to process 3), and the second target address includes, among the target addresses corresponding to the processes participating in the collective communication, the target addresses other than the one corresponding to that second process (for example, when the second process is process 3, the second target address includes target addresses A0 to A2).
In this embodiment of the present application, a main process among the multiple first processes in the first communication group centrally obtains the first memory address corresponding to each second process in the first communication group. In a scenario where the first communication group includes multiple first processes and multiple second processes, this avoids having each first process interact with each second process to obtain the memory address corresponding to that second process, so that memory addresses are acquired more conveniently.
For example, the address synchronization of FIG. 6d is similar to the way target addresses are transmitted in the embodiments of FIG. 6b or FIG. 6c-2, except that in the address synchronization of FIG. 6d it is the main process that interacts with the second processes. For example, process 0 can read the target addresses A1 to A3 from process 1 to process 3 respectively in the read manner of the Rendezvous protocol; or process 1 to process 3 can write the target addresses A1 to A3 to process 0 respectively in the write manner of the Rendezvous protocol; or process 1 to process 3 can send the target addresses A1 to A3 directly to process 0 respectively in accordance with the Eager protocol. On this basis, process 0 can write target addresses A0 to A2 to process 3, target addresses A0, A1 and A3 to process 2, and target addresses A0, A2 and A3 to process 1 in the write manner of the Rendezvous protocol; or process 0 can directly send target addresses A0 to A2 to process 3, target addresses A0, A1 and A3 to process 2, and target addresses A0, A2 and A3 to process 1 in accordance with the Eager protocol. For the same parts, refer to the description of target-address transmission between two processes in the embodiments of FIG. 6b and FIG. 6c-2, which is not repeated here. A minimal sketch of this centralized synchronization follows.
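As a minimal sketch of the centralized address synchronization, assuming standard MPI collectives are available to the processes, the main process could gather the target addresses and redistribute them as shown below; broadcasting the full address table instead of per-process subsets, and the helper name sync_addresses(), are simplifications assumed here.

```c
#include <mpi.h>
#include <stdint.h>

/* Main process (rank 0) centrally collects each process's target address,
 * then distributes the address set to the other processes. */
void sync_addresses(uint64_t my_target_addr, uint64_t *addr_table, int nprocs)
{
    MPI_Gather(&my_target_addr, 1, MPI_UINT64_T,
               addr_table, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD);
    MPI_Bcast(addr_table, nprocs, MPI_UINT64_T, 0, MPI_COMM_WORLD);
}
```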
It can be understood that, in this embodiment of the present application, in addition to obtaining the target address from the second process, the first process can also obtain the unit identifier of the second process. For example, besides obtaining the target addresses A0, A2 and A3 from process 0, process 1 can also obtain the identifier P0 of process 0, the identifier P2 of process 2, and the identifier P3 of process 3. In an optional implementation, the target address and the unit identifier of the second process can be obtained at the same time: the first process can obtain target information from the second process, where the target information may include the target address and the unit identifier of the second process, to further reduce latency. For example, process 1 obtains target information 1 from process 0, and target information 1 may include the target address A0 and the identifier P0 of process 0. A sketch of such a combined message is given below.
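A minimal sketch of carrying such target information in one message, with field names and widths chosen only for illustration:

```c
#include <stdint.h>

/* Target information exchanged once before the collective communication. */
struct target_info {
    uint32_t unit_id;      /* unit identifier of the second process, e.g. P0 */
    uint64_t target_addr;  /* storage address of that process's data, e.g. A0 */
};
```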
The address synchronization in FIG. 6a and FIG. 6d above, that is, the specific manner in which the first process obtains the target address from the second process, is only illustrative; this embodiment does not limit the specific address synchronization manner.
In an optional implementation, the first process in the first communication group records the first memory address, which may include: the first process saves the correspondence between the target address and the second process based on the unit identifier of the second process. After the first process obtains the target address from the second process, it can extract the unit identifier of the second process contained in the target address, and then save the correspondence between the target address and the second process based on that unit identifier. Exemplarily, when this embodiment of the present application is applied to the computing cluster shown in FIG. 4 above and different processes (such as process 0 and process 1) belong to different cores of the same server, the first process saving the correspondence based on the unit identifier of the second process may include: the first process creates a correspondence between the unit identifier of the second process and the target address containing that identifier, thereby obtaining the correspondence between the target address and the second process.
For example, FIG. 7a is a schematic diagram of a correspondence between processes and data storage addresses provided by an embodiment of the present application. As shown in FIG. 7a, the data structure of the correspondence between target addresses and second processes may be, for example, the correspondence table between processes and data storage addresses shown in FIG. 7a, in which each table entry stores a process identifier and the storage address of the corresponding data. For example, process identifier P0 corresponds to data storage address A0, process identifier P1 corresponds to data storage address A1, and so on. The storage address of the data is the above-mentioned target address. In one case, the correspondence between a target address and a second process may be a key-value pair, with one of the target address and the second process as the key and the other as the value. A sketch of such a table and its lookup is shown below.
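As an illustration only, the FIG. 7a style correspondence could be held in a flat table keyed by process identifier; the entry layout and lookup below are assumptions of the sketch, not the required data structure.

```c
#include <stdint.h>
#include <stdlib.h>

struct addr_entry {
    uint32_t proc_id;      /* process identifier, e.g. P0, P1, ... */
    uint64_t data_addr;    /* storage address of the data, e.g. A0, A1, ... */
};

/* Look up the target address saved for a given second process. */
static uint64_t lookup_addr(const struct addr_entry *table, int n, uint32_t proc_id)
{
    for (int i = 0; i < n; i++)
        if (table[i].proc_id == proc_id)
            return table[i].data_addr;
    abort();               /* unknown process: no correspondence was saved */
}
```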
In an optional implementation, when this embodiment of the present application is applied to the computing cluster shown in FIG. 4 above, different processes (such as process 0 and process 1) may belong to different servers. In that case, the first process saving the correspondence between the target address and the second process based on the unit identifier of the second process may include: the first process also extracts, from the target address, the device identifier of the server to which the second process belongs, and creates a correspondence between the target address and the unit identifier and device identifier extracted from it, thereby obtaining the correspondence between the target address and the second process. For example, FIG. 7b is a schematic diagram of another correspondence between processes and data storage addresses provided by an embodiment of the present application. As shown in FIG. 7b, the data structure of this correspondence may be, for example, the correspondence table shown in FIG. 7b, in which each table entry stores a device identifier, a process identifier and the storage address of the corresponding data: device identifier N0, process identifier P0 and data storage address A0 correspond to one another; device identifier N1, process identifier P1 and data storage address A1 correspond to one another; and so on. This embodiment does not limit the specific data structure of the above correspondence.
S502: The first process in the first communication group performs data transmission with the second process in the first communication group multiple times according to the first memory address. After the first process saves the correspondence between the target address and the second process, it can transmit the target data with the second process according to that correspondence and the communication rules of the collective communication performed by the computing cluster. The communication rules of the collective communication performed by the computing cluster are the algorithm logic of the MPI algorithm adopted by the computing cluster. This embodiment of the present application does not limit the specific MPI algorithm adopted by the computing cluster, which can be set adaptively according to application requirements. It can be understood that the above communication rules determine the specific communication steps, such as which processes in the computing cluster the first process interacts with and how it interacts. The above correspondence between target addresses and second processes ensures that the first process can accurately transmit data with its peer, that is, the second process in the first communication group to which the first process belongs, avoiding data transmission anomalies.
Exemplarily, the address space of the first memory address may be, for example, address space AS1, and the first memory address may be, for example, memory address A11. Before the first process in the first communication group transmits data for the first time, it may record the occupancy of address space AS1 as unoccupied, and accordingly determine that the memory address corresponding to the transmitted data is memory address A11. Before the second data transmission, it may record the address space occupancy of the first memory address as memory address A11 occupied, and accordingly determine that the memory address for the data transmitted in the second transmission is the memory address of the next address-space segment in address space AS1 adjacent to the segment corresponding to memory address A11, for example memory address A12. By analogy, before the third data transmission, the first process in the first communication group may record the address space occupancy of the first memory address as memory address A12 occupied, and accordingly determine that the memory address for the data transmitted in the third transmission is the memory address of the next address-space segment in address space AS1 adjacent to the segment corresponding to memory address A12, for example memory address A13. In this embodiment of the present application, the first process in the first communication group can thus record, against the first memory address, the address-space occupancy produced by each data transmission, and before each transmission determine the memory address corresponding to the data to be transmitted from the recorded occupancy. In other words, it manages the memory resource corresponding to the first memory address, so that the memory address corresponding to the transmitted data can be accurately determined in each data transmission from the first memory address obtained in advance. A sketch of this bookkeeping follows.
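A minimal sketch of this occupancy bookkeeping, assuming for illustration that every transmission occupies one fixed-size segment of the peer's address space:

```c
#include <stdint.h>

struct remote_mem {
    uint64_t base;         /* first memory address, e.g. A11 */
    uint64_t seg_size;     /* assumed size of one transmission's segment */
    uint64_t used_segs;    /* recorded occupancy: segments already used */
};

/* Called before each transmission; returns the address for this transfer. */
static uint64_t next_xfer_addr(struct remote_mem *rm)
{
    /* derive A11, A12, A13, ... from the recorded occupancy */
    uint64_t addr = rm->base + rm->used_segs * rm->seg_size;
    rm->used_segs++;       /* record the newly occupied segment */
    return addr;
}
```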
Exemplarily, referring to FIG. 6a and FIG. 6d, when the computing cluster adopts the MPI_Allgather Ring algorithm, the communication rules may include: each process sends data to its right neighbor and receives data from its left neighbor, with all processes forming a ring-shaped neighbor relationship; on this basis, each process offsets the buffer for the data it receives and sends, or each process offsets the buffer for the data it reads. Communication steps 0 to 2 of the collective communication shown in FIG. 6a and FIG. 6d are implemented according to the MPI_Allgather Ring algorithm logic, similar to the way target data D0 to target data D3 are transmitted in the embodiment of FIG. 3a; the same parts are not repeated here, and reference can be made to the description of that transmission in the embodiment of FIG. 3a. The difference is that, in this embodiment, address synchronization is no longer performed in each communication step. The ring rule is sketched below.
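For reference, the ring rule can be sketched with plain MPI point-to-point calls as below (MPI_Sendrecv is used only for brevity; in this embodiment the per-step transfers would instead use the pre-synchronized addresses):

```c
#include <mpi.h>

/* Ring allgather: in step s, each process sends one block to its right
 * neighbor and receives one block from its left neighbor, offsetting the
 * buffer per step; after nprocs-1 steps, every process holds all blocks. */
void allgather_ring(char *buf, int blk, int rank, int nprocs)
{
    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;
    for (int s = 0; s < nprocs - 1; s++) {
        int send_blk = (rank - s + nprocs) % nprocs;      /* block forwarded this step */
        int recv_blk = (rank - s - 1 + nprocs) % nprocs;  /* block arriving this step */
        MPI_Sendrecv(buf + send_blk * blk, blk, MPI_BYTE, right, 0,
                     buf + recv_blk * blk, blk, MPI_BYTE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```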
It can be understood that the specific data transmitted in the above embodiments depends on the application scenario of the collective communication. For example, when collective communication is applied to weather forecasting, the transmitted data may specifically be parameters of the weather forecast, such as historical temperature data and predicted temperature data. When collective communication is applied to distributed training of neural networks, the transmitted data may specifically be model parameters of the model trained by the distributed training, and so on. This embodiment does not limit the application scenario of the collective communication or the data transmitted.
In an optional implementation, FIG. 8 is a schematic diagram of another collective communication process provided by an embodiment of the present application. As shown in FIG. 8, the communication rule of the collective communication performed by the computing cluster is the Neighbor exchange algorithm of MPI_Allgather. FIG. 8 is similar to the way target data D0 to target data D3 are transmitted in the embodiment of FIG. 3c; the same parts are not repeated here, and reference can be made to the description of that transmission in the embodiment of FIG. 3c. The difference is that, in the embodiment of FIG. 8, address synchronization is performed before the collective communication, that is, before target data D0 to target data D3 are transmitted: the first process obtains the target addresses from the second processes. In one example, following MPI_Allgather's Neighbor exchange algorithm, the address synchronization in FIG. 8 may include: process 0 obtains target addresses A1, A2 and A3 from process 1, and target addresses A4 and A5 from process 5; process 1 obtains target addresses A0, A4 and A5 from process 0, and target addresses A2 and A3 from process 2; process 2 obtains target addresses A0 and A1 from process 1, and target addresses A3, A4 and A5 from process 3; process 3 obtains target addresses A0 to A2 from process 2, and target addresses A4 and A5 from process 4; process 4 obtains target addresses A2 and A3 from process 3, and target addresses A5, A0 and A1 from process 5; process 5 obtains target addresses A4, A2 and A3 from process 4, and target addresses A0 and A1 from process 0.
In this way, each process obtains the storage addresses of the data of the second processes in the computing cluster, that is, it obtains the global addresses of the processes in the computing cluster. In another example, the address synchronization in FIG. 8 may be similar to the address synchronization shown in FIG. 6d, except that the interacting processes and the transmitted target addresses are adjusted accordingly; the same parts are not repeated here, and details can be found in the relevant description of the embodiment of FIG. 6d above. The partner selection of the Neighbor exchange steps is sketched below.
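As a sketch of the partner selection under the Neighbor exchange steps, assuming an even number of processes as in FIG. 8:

```c
/* Even steps pair (0,1), (2,3), (4,5); odd steps pair (1,2), (3,4), (5,0). */
static int neighbor_partner(int rank, int step, int nprocs)
{
    int even_rank = (rank % 2 == 0);
    if (step % 2 == 0)
        return even_rank ? (rank + 1) % nprocs : (rank - 1 + nprocs) % nprocs;
    return even_rank ? (rank - 1 + nprocs) % nprocs : (rank + 1) % nprocs;
}
```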
It can be understood that this embodiment does not limit the specific methods adopted in address transmission and data transmission. That is, in address transmission, the first process can adopt at least one of the methods of the embodiments of Figure 6b and Figure 6c-2, namely read, write and/or small-packet transmission; in data transmission, at least one of the methods shown in Figure 2a and Figure 2b can be adopted, namely read and/or write. Small-packet transmission means performing the address transmission according to the Eager protocol: the sending end actively sends the payload to the receiving end without considering whether the receiving end is able to receive it, which requires the receiving process to prepare sufficient buffer space in advance (for example, a space size satisfying the space threshold: the buffer's space size is greater than or equal to the space threshold) to receive the sent payload. In an optional implementation, when at least one of the methods shown in Figure 2a and Figure 2b (read and/or write) is adopted in data transmission, the first process has already sent at least one of the RTS message and the CTS message during the global address transmission; therefore, when reading and writing data, the first process does not need to send these two messages again and can directly perform the unilateral operations, that is, at least one of write (Put) and read (Get). The unilateral write is sketched below.
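A minimal sketch of such a unilateral write with standard MPI RMA calls, where the window stands for the memory registered by the peer and the displacement stands for the saved target address; the fence-based epoch is one of several possible synchronization choices assumed here:

```c
#include <mpi.h>

/* One-sided write: no per-step RTS/CTS exchange, only the pre-known address. */
void onesided_put(MPI_Win win, const char *src, int blk, int peer, MPI_Aint disp)
{
    MPI_Win_fence(0, win);   /* open an access epoch */
    MPI_Put(src, blk, MPI_BYTE, peer, disp, blk, MPI_BYTE, win);
    MPI_Win_fence(0, win);   /* complete the epoch; barrier-like, cf. the FIN/fence effect */
}
```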
Referring to FIG. 6a, FIG. 6d and FIG. 8, when the first process completes the address synchronization, or completes the data transmission of a communication step of the collective communication, it can send an end notification (such as FIN) to the peer to inform the peer that this interaction is complete. In this way, by completing multiple point-to-point Rendezvous FIN operations through the collective communication, the effect of one barrier (fence) operation across all processes is achieved at the end of each step, thereby ensuring the start of the next interaction (for example, communication step 0 ends and communication step 1 starts), or allowing the processes to execute algorithm logic other than the collective communication. Such algorithm logic may be, for example, in the weather forecasting scenario: after communication step 2 above ends, each process can use the data obtained through the collective communication for other algorithm logic such as plotting and permanent storage. It can be understood that applying this embodiment of the present application to one-to-many and many-to-one collective communication scenarios is similar to applying it to the many-to-many scenario, the difference lying in the specific first and second processes, which can be adjusted adaptively for the corresponding scenario; for the same parts, refer to the above description of the many-to-many collective communication scenario, which is not repeated here.
To facilitate understanding of the advantages of the collective communication method provided by the embodiments of the present application, the latency comparison of collective communication shown in Table 1 is described below. As shown in Table 1, the method is applied to a computing cluster that includes 6 nodes, where each node can be a different core of the same server as shown in the embodiment of FIG. 4, or each node can be a different server as shown in the embodiment of FIG. 5. The servers in the computing cluster use ARM processors, which are processors of the reduced instruction set computer (RISC) type, and perform collective communication according to the MPI Allgather algorithm. In Table 1, for the average latency of the interaction between adjacent processes in the MPI Allgather collective communication: when the specific communication steps adopt the approach of the embodiments of FIG. 3a and FIG. 3c of this application, the average latency is the one shown in the "two-sided" column of Table 1; when the specific communication steps adopt the approach of the embodiments of FIG. 5 to FIG. 8 of this application, the average latency is the one shown in the "one-sided" column of Table 1. With the same transmitted packet size, the same communication system and the same number of test runs, the approach of the embodiments of FIG. 5 to FIG. 8 greatly reduces the average latency of the collective communication. For example, in the collective communication of this embodiment, data of 2048 bytes is transmitted between two adjacent processes, and a total of 1000 runs are performed to measure the average latency: the average latency without the solution of the embodiment of the present application is 10.82 microseconds (μs), and the average latency with the solution of the embodiment of the present application is 8.78 microseconds (μs). That is to say, the latency of MPI Allgather collective communication implemented on the basis of address synchronization combined with unilateral operations, as provided by the embodiment of the present application, is much shorter than the latency of MPI Allgather collective communication implemented directly on the basis of the Rendezvous protocol. The measurement loop is sketched below.
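Purely as an illustration, the averaged-latency measurement described above could be driven by a loop like the following; the step() callback standing for one communication step is an assumption of the sketch:

```c
#include <mpi.h>

/* Run the step 1000 times and return the average latency in microseconds. */
double measure_avg_latency_us(void (*step)(void))
{
    const int iters = 1000;
    MPI_Barrier(MPI_COMM_WORLD);     /* align start times across processes */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        step();                      /* one collective-communication step */
    double t1 = MPI_Wtime();
    return (t1 - t0) / iters * 1e6;
}
```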
In this embodiment of the present application, before the communication steps of the computing cluster are executed, that is, before data is transmitted according to the communication rules of the collective communication performed by the computing cluster, the address synchronization operation of the collective communication is completed, ensuring that each process participating in the computing cluster saves the target addresses corresponding to the second processes in the computing cluster. The first process, that is, each process participating in the computing cluster, can then transmit data directly according to the correspondence and the communication rules of the collective communication performed by the computing cluster (such as the algorithm logic of MPI_Allgather), reducing the large number of address synchronization operations repeated in each communication step and reducing the latency of the collective communication.
It can be understood that, to implement the above functions, the server includes corresponding hardware and/or software modules for executing each function. In combination with the algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application in combination with the embodiments, but such implementations should not be considered beyond the scope of the present application.
In addition, in one application scenario, the collective communication method provided in the embodiments of the present application can be encapsulated as an interface in the form PMPI_XXX, to be called by the entity applying the collective communication. For example, the collective communication method provided in the embodiments of the present application can be encapsulated as interfaces such as PMPI_Allgather_Neighbor exchange and PMPI_Allreduce, depending on the MPI algorithm followed by the collective communication. Exemplarily, after a weather forecasting application calls the PMPI_XXX interface, running the weather forecasting application executes the collective communication method provided in the embodiments of the present application; that is, the underlying logic performs the point-to-point operations described in the embodiments of the present application. A sketch of this interception pattern is given below.
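A minimal sketch of the interception idea, using the standard MPI profiling interface in which an MPI_ entry point forwards to its PMPI_ counterpart; the helper sync_addresses_once() is hypothetical, not part of MPI:

```c
#include <mpi.h>

void sync_addresses_once(MPI_Comm comm);   /* hypothetical one-time address sync */

/* The application keeps calling MPI_Allgather; this wrapper performs the
 * address synchronization once, then forwards to the PMPI entry point. */
int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm)
{
    sync_addresses_once(comm);
    return PMPI_Allgather(sendbuf, sendcount, sendtype,
                          recvbuf, recvcount, recvtype, comm);
}
```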
Correspondingly, for the messages sent by the weather forecasting application, the network traffic characteristics generated (such as the data transmitted in the switch) and the register events in the server, a server on which the weather forecasting application is installed will first perform the address synchronization of the embodiments of the present application, and then perform the subsequent communication steps, such as communication steps 0 to 2 above. All related content of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, and is not repeated here.
This embodiment also provides a computer storage medium in which computer instructions are stored; when the computer instructions run on a server, the server executes the above related method steps to implement the collective communication method in the above embodiments. This embodiment also provides a computer program product; when the computer program product runs on a computer, the computer is caused to execute the above related steps to implement the collective communication method in the above embodiments.
In addition, an embodiment of the present application also provides an apparatus, which may specifically be a chip, a component or a module, and which may include a processor and a memory that are connected, where the memory is used to store computer-executable instructions; when the apparatus runs, the processor can execute the computer-executable instructions stored in the memory, so that the chip performs the collective communication method in the above method embodiments. The server, computer storage medium, computer program product and chip provided in this embodiment are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, reference can be made to the beneficial effects of the corresponding methods provided above, which are not repeated here.
Through the description of the above implementations, those skilled in the art can understand that, for convenience and brevity of description, the division into the above functional modules is only used as an example; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In the several embodiments provided in this application, it should be understood that the disclosed apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for instance, the division into modules or units is only a division by logical function, and there may be other division methods in actual implementation, for example multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical or in other forms. The units described as separate components may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units, that is, it may be located in one place or distributed over multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or the units may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit. Any content of the various embodiments of the present application, and any content of the same embodiment, can be freely combined; any combination of the above content falls within the scope of the present application. If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of the present application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to a processor, so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an application-specific integrated circuit (ASIC). In addition, the ASIC may be located in a server. Of course, the processor and the storage medium may also exist in the server as discrete components.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of the present application may be implemented by hardware, software, firmware or any combination thereof; when implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer. The embodiments of the present application have been described above with reference to the accompanying drawings; however, the present application is not limited to the above specific implementations, which are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art can derive many further forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall also be covered by the protection scope of the present application; the protection scope of the present application shall therefore be subject to the protection scope of the claims.

Abstract

Embodiments of the present application provide a collective communication method and a computing cluster, relating to the field of computers. The method is applied to a computing cluster whose process group includes M communication groups, each communication group including at least two processes having a communication relationship, where M is an integer greater than or equal to 1. The method includes: a first process in any one of the M communication groups, such as a first communication group, obtains the memory address corresponding to the memory resource of a second process in the first communication group, that is, a first memory address; and the first process in the first communication group performs data transmission with the second process in the first communication group multiple times according to the first memory address. The solution provided by the embodiments of the present application can avoid repeatedly obtaining the memory addresses required for data transmission during the multiple inter-process data transmissions of the process group in the computing cluster, thereby reducing the latency of the computing cluster.

Description

Collective communication method and computing cluster
This application claims priority to Chinese Patent Application No. 202211245236.2, filed with the Chinese Patent Office on October 12, 2022 and entitled "Collective communication method and computing cluster", which is incorporated herein by reference in its entirety.

Claims (14)

  1. A collective communication method, applied to a computing cluster, wherein the computing cluster comprises a plurality of computing resources, N processes run on the plurality of computing resources, the N processes form a process group for executing a computing task, each of the N processes is allocated a memory resource, the process group comprises M communication groups, and each communication group comprises at least two processes having a communication relationship, M being an integer greater than or equal to 1 and N being an integer greater than or equal to 2; the method comprising:
    obtaining, by a first process in a first communication group, a first memory address, and recording the first memory address, wherein the first memory address is a memory address corresponding to a memory resource of a second process in the first communication group, and the first communication group is any one of the M communication groups; and
    performing, by the first process in the first communication group, data transmission with the second process in the first communication group multiple times according to the first memory address.
  2. The method according to claim 1, wherein the performing, by the first process in the first communication group, data transmission with the second process in the first communication group multiple times according to the first memory address comprises:
    writing, by the first process in the first communication group, data multiple times into the memory resource of the second process in the first communication group corresponding to the first memory address.
  3. The method according to claim 1, wherein the performing, by the first process in the first communication group, data transmission with the second process multiple times according to the first memory address comprises:
    reading, by the first process in the first communication group, data multiple times from the memory resource of the second process in the first communication group corresponding to the first memory address.
  4. The method according to any one of claims 1 to 3, wherein the performing, by the first process in the first communication group, data transmission with the second process in the first communication group multiple times according to the first memory address comprises:
    each time data is transmitted between the first process in the first communication group and the second process in the first communication group, recording the address space of the first memory address occupied by the transmitted data; and
    before each data transmission, determining, by the first process in the first communication group, the memory address corresponding to the data to be transmitted according to the recorded address space occupancy of the first memory address.
  5. The method according to any one of claims 1 to 4, wherein the number of first processes in the first communication group is one or more, and the number of second processes in the first communication group is one or more.
  6. The method according to any one of claims 1 to 5, wherein the number of first processes in the first communication group is multiple, and the number of second processes in the first communication group is multiple;
    the obtaining, by the first process in the first communication group, the first memory address comprises:
    obtaining, by a main process among the multiple first processes in the first communication group, the first memory address corresponding to each second process in the first communication group, to obtain an address set; and
    sending, by the main process, the address set to the first processes other than the main process among the multiple first processes.
  7. A computing cluster, comprising a plurality of computing resources, wherein N processes run on the plurality of computing resources, the N processes form a process group for executing a computing task, each of the N processes is allocated a memory resource, the process group comprises M communication groups, and each communication group comprises at least two processes having a communication relationship, M being an integer greater than or equal to 1 and N being an integer greater than or equal to 2;
    a first process in a first communication group is configured to obtain a first memory address and record the first memory address, wherein the first memory address is a memory address corresponding to a memory resource of a second process in the first communication group, and the first communication group is any one of the M communication groups; and
    the first process in the first communication group is configured to perform data transmission with the second process multiple times according to the first memory address.
  8. The computing cluster according to claim 7, wherein the first process in the first communication group is specifically configured to:
    write data multiple times into the memory resource of the second process corresponding to the first memory address.
  9. The computing cluster according to claim 7, wherein the first process in the first communication group is specifically configured to:
    read data multiple times from the memory resource of the second process corresponding to the first memory address.
  10. The computing cluster according to any one of claims 7 to 9, wherein the first process in the first communication group is specifically configured to:
    each time data is transmitted with the second process in the first communication group, record the address space of the first memory address occupied by the transmitted data; and
    before each data transmission, determine the memory address corresponding to the data to be transmitted according to the recorded address space occupancy of the first memory address.
  11. The computing cluster according to any one of claims 7 to 10, wherein the number of first processes in the first communication group is one or more, and the number of second processes in the first communication group is one or more.
  12. The computing cluster according to any one of claims 7 to 11, wherein the number of first processes in the first communication group is multiple, and the number of second processes in the first communication group is multiple; and a main process among the multiple first processes in the first communication group is specifically configured to:
    obtain the first memory address corresponding to each second process in the first communication group, to obtain an address set; and
    send the address set to the first processes other than the main process among the multiple first processes.
  13. A computer-readable storage medium, comprising a computer program, wherein, when the computer program runs on a computing cluster, the computing cluster is caused to perform the method according to any one of claims 1 to 6.
  14. A computer program product, comprising a computer program, wherein, when the computer program is executed by a computing cluster, the computing cluster is caused to perform the method according to any one of claims 1 to 6.
PCT/CN2023/101329 2022-10-12 2023-06-20 Collective communication method and computing cluster WO2024077999A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211245236.2 2022-10-12
CN202211245236.2A CN117914860A (zh) 2022-10-12 2022-10-12 集合通信方法及计算集群

Publications (1)

Publication Number Publication Date
WO2024077999A1 true WO2024077999A1 (zh) 2024-04-18

Family

ID=90668636

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101329 WO2024077999A1 (zh) 2022-10-12 2023-06-20 集合通信方法及计算集群

Country Status (2)

Country Link
CN (1) CN117914860A (zh)
WO (1) WO2024077999A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975407A (zh) * 2016-03-22 2016-09-28 华为技术有限公司 Memory address mapping method and device
CN110795206A (zh) * 2018-08-02 2020-02-14 阿里巴巴集团控股有限公司 System and method for facilitating cluster-level cache and memory space
CN111316244A (zh) * 2018-12-28 2020-06-19 深圳市大疆创新科技有限公司 Communication method and system between multiple processes
WO2022021896A1 (zh) * 2020-07-30 2022-02-03 华为技术有限公司 Inter-process communication method and apparatus

Also Published As

Publication number Publication date
CN117914860A (zh) 2024-04-19

Similar Documents

Publication Publication Date Title
US10255230B2 (en) Lock-free processing of stateless protocols over RDMA
US7103888B1 (en) Split model driver using a push-push messaging protocol over a channel based network
CN113485823A (zh) 数据传输方法、装置、网络设备、存储介质
US11922304B2 (en) Remote artificial intelligence (AI) acceleration system
CN112291293B (zh) 任务处理方法、相关设备及计算机存储介质
US20120066460A1 (en) System and method for providing scatter/gather data processing in a middleware environment
CN102521201A (zh) 多核数字信号处理器片上系统及数据传输方法
EP3482298A1 (en) Multicast apparatuses and methods for distributing data to multiple receivers in high-performance computing and cloud-based networks
US12034604B2 (en) MQTT protocol simulation method and simulation device
WO2023104194A1 (zh) 一种业务处理方法及装置
WO2023093418A1 (zh) 数据迁移方法、装置及电子设备
KR20240004315A (ko) Smartnic들 내의 네트워크 연결형 mpi 프로세싱 아키텍처
Qiu et al. Full-kv: Flexible and ultra-low-latency in-memory key-value store system design on cpu-fpga
Shim et al. Design and implementation of initial OpenSHMEM on PCIe NTB based cloud computing
Cardellini et al. Overlapping communication with computation in MPI applications
CN117370046A (zh) 进程间通信方法、系统、设备和存储介质
WO2024077999A1 (zh) 集合通信方法及计算集群
US8291033B2 (en) Combining multiple hardware networks to achieve low-latency high-bandwidth point-to-point communication
WO2001016742A2 (en) Network shared memory
US8572276B2 (en) Pipelining protocols in misaligned buffer cases
TW202008172A (zh) 儲存系統
CN111404842A (zh) 数据传输方法、装置及计算机存储介质
Huang et al. Accelerating NoC-based MPI primitives via communication architecture customization
CN113778937A (zh) 用于执行片上网络(NoC)中的事务聚合的系统和方法
WO2023093065A1 (zh) 数据传输方法、计算设备及计算系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876209

Country of ref document: EP

Kind code of ref document: A1