CN118069571B - PCIe (peripheral component interconnect express) switching chip with aggregate communication on-line computing function and PCIe switch - Google Patents

Info

Publication number
CN118069571B (application CN202410497224.1A)
Authority
CN (China)
Prior art keywords
data, pcie, fifo, input, control unit
Legal status
Active
Other languages
Chinese (zh)
Other versions
CN118069571A (en)
Inventor
Zhang Hongbo (张洪波)
Current Assignee
Beijing Shudu Information Technology Co., Ltd.
Application filed by Beijing Shudu Information Technology Co., Ltd.
Priority to CN202410497224.1A
Publication of CN118069571A
Application granted
Publication of CN118069571B

Landscapes

  • Communication Control (AREA)
Abstract

The invention relates to a PCIe switch chip with a collective-communication in-network computing function, and a PCIe switch. The chip comprises an internally integrated PCIe endpoint device, dedicated data transmission channels, a task control unit and a computing unit. The dedicated data transmission channels comprise DMA engines and FIFOs. The task control unit receives a user's collective communication command, controls data transmission and controls the computing unit; it operates the DMA engines to complete data input and output by reading and writing their control registers. Under the control of the task control unit, the computing unit reads data from the specified input FIFOs, completes the specified computation, and outputs the result to the specified output FIFOs. The invention greatly reduces network traffic, lowers communication latency and improves system efficiency: each compute node needs to exchange data with the PCIe switch chip only once, which greatly improves system operating efficiency, and the larger the number of compute nodes, the more significant the efficiency gain.

Description

PCIe (peripheral component interconnect express) switching chip with aggregate communication on-line computing function and PCIe switch
Technical Field
The invention relates to a PCIe switch chip with a collective-communication in-network computing function, and a PCIe switch, belonging to the technical fields of PCIe switch chips, collective communication and in-network computing.
Background
PCIe is a high-speed serial computer expansion bus standard widely used to connect peripherals to a computer host. The PCIe controller on the host side is called the RC (Root Complex); the device side is called the EP (Endpoint). Each EP device has a standard configuration space (Configuration Space), a set of registers in a canonical format that describes the basic properties of the device; its BAR registers identify the device's address-space resources, such as registers and memory, referred to as BAR space. After the PCIe physical layer establishes a link, the RC maps the device's address spaces into the host's address space, after which the host's driver can access and operate the device. Under a PCIe bus topology, each PCIe device has a unique BDF number.
Requests and data are transmitted over the PCIe physical link using TLPs (Transaction Layer Packets), of which there are the following 4 types:
A memory request TLP reads or writes a register or memory of the device; an IO TLP (IO: Input/Output) accesses IO space (a special address space, independent of memory space) on certain system architectures; a configuration TLP accesses the device's configuration space registers; a message TLP implements user-defined messages.
PCIe is a point-to-point connection, and PCIe switch chips are required when the host's RC controllers are too few in number to connect more devices. A PCIe switch chip typically has one upstream port (USP, for connecting to the RC on the host side) and multiple downstream ports (DSPs, for connecting EP devices). During PCIe enumeration, a lookup table is generated in the PCIe switch chip, recording the bus numbers, address ranges and other information of each port. After enumeration, a TLP arriving at any port is looked up in this table according to its type and routed by BDF number or address to the corresponding port; this lets one RC connect multiple EP devices and also enables communication between EP devices. A PCIe switch chip can also implement the NTB (Non-Transparent Bridge) function, i.e., have multiple USP ports, so that multiple hosts can be connected and access each other's resources.
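The per-port lookup table described above can be sketched as follows (an illustrative model, not from the patent; class and method names are assumptions). Memory TLPs route by target address, while configuration TLPs route by bus number:

```python
# Hypothetical model of a PCIe switch's enumeration-time routing table.
class SwitchRoutingTable:
    def __init__(self):
        # Each entry: (port_id, (bus_lo, bus_hi), (addr_lo, addr_hi))
        self.ports = []

    def add_port(self, port_id, bus_range, addr_range):
        self.ports.append((port_id, bus_range, addr_range))

    def route_by_address(self, addr):
        """Memory/IO TLPs are routed by target address."""
        for port_id, _, (lo, hi) in self.ports:
            if lo <= addr < hi:
                return port_id
        return None  # no match: the TLP cannot be forwarded downstream

    def route_by_bdf(self, bus):
        """Config TLPs (and ID-routed messages) are routed by bus number."""
        for port_id, (blo, bhi), _ in self.ports:
            if blo <= bus <= bhi:
                return port_id
        return None
```

This also shows why EP-to-EP communication works: a memory TLP from one DSP whose address falls in another DSP's range is forwarded directly, without involving the RC.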
The RC/EP ports of PCIe, or the USP/DSP ports on the switch chip, may support different numbers of physical-layer lanes; one port supports at most 16 lanes, denoted x16. These 16 lanes can support one x16 device, two x8 devices, or four x4 devices. In hardware implementations, each port internally contains 4 controllers: if an x16 device is attached, only one controller is enabled; for four x4 devices, all 4 controllers are enabled.
Collective communication (Collective Communication) refers to the communication and computation operations, such as synchronization and data transmission/reception, performed among distributed compute nodes (CPUs/GPUs, etc.) in parallel computing such as high-performance computing (HPC) and AI training. It mainly includes the following typical operations:
Receive/Send: data receive/send operations;
Barrier: a synchronization operation between compute nodes;
Broadcast: the data of one compute node is copied to several other designated compute nodes; every node obtains an identical copy of the original data;
Scatter: the data of one compute node is distributed across several other designated compute nodes; every node obtains a part of the original data;
Gather: the data of several compute nodes is merged and collected at one compute node, which obtains the data of all nodes;
AllGather: data from several nodes is collected at all nodes; every node obtains all the data;
Reduce: a reduction is an operation that decreases the data volume, such as sum, average or extremum; after the data of each compute node undergoes the reduction, the result is gathered at one node;
AllReduce: after the Reduce operation, the result is sent to all nodes.
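The relationship between the last two operations can be made concrete with reference semantics (an illustrative sketch, not from the patent): AllReduce is equivalent to a Reduce followed by a Broadcast of the result.

```python
# Reference semantics for element-wise Reduce / AllReduce (sum variant).
# node_data is a list with one list of values per compute node.
def reduce_sum(node_data):
    """Reduce: combine all nodes' data into a single result vector."""
    return [sum(vals) for vals in zip(*node_data)]

def allreduce_sum(node_data):
    """AllReduce: every node receives the reduced result (Reduce + Broadcast)."""
    result = reduce_sum(node_data)
    return [list(result) for _ in node_data]
```

For instance, with three nodes holding `[1, 2]`, `[3, 4]` and `[5, 6]`, every node ends up with `[9, 12]`.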
In collective communication, a group of processes that can send and receive data is called a Communicator, and the unique identifier of each communicating process is called its Rank.
Collective communication is widely used, so standard interfaces have been formulated and implemented: MPI is a classic set of interface definitions, and OpenMPI is an open-source implementation; Nvidia's NCCL is a collective communication implementation for GPUs.
HPC and AI workloads require interconnecting a large number of parallel compute nodes. Some GPU vendors have implemented their own proprietary interconnects. Among these, Nvidia's NVLink and InfiniBand offer good interconnect performance, but they are suitable only for Nvidia's own GPUs and are expensive. The proprietary interconnect protocols of other small and medium vendors have not reached scale. The biggest drawback of proprietary interconnect protocols is their ecological isolation, which hinders adoption.
PCIe interconnects are widely used, with a complete software and hardware ecosystem. Using PCIe switch chips to interconnect GPUs for AI workloads remains a widely supported standard option.
HPC and AI workloads require collective communication, and the communication interface is invoked frequently, so its efficiency is one of the key factors affecting overall system performance. Taking AllReduce as an example, every compute node needs to exchange data with all other nodes through the switching network. An unoptimized communication pattern initiates a large number of data communication requests, occupies network bandwidth, and causes large delays and congestion; every compute node wastes a great deal of time waiting for network communication.
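A back-of-envelope count illustrates the traffic difference (an assumption-laden sketch: real optimized algorithms such as ring or tree AllReduce have different constants, but all of them need multiple rounds of parallel communication, whereas the in-network scheme below matches the patent's one-exchange-per-node claim):

```python
# Message counts for AllReduce over n compute nodes (illustrative).
def naive_allreduce_msgs(n):
    # Unoptimized: every node sends its data to every other node.
    return n * (n - 1)

def in_network_allreduce_msgs(n):
    # In-network computing: one upload to the switch plus one result
    # download per node.
    return 2 * n
```

At 8 nodes the naive pattern needs 56 transfers versus 16; at 64 nodes, 4032 versus 128, which is why the benefit grows with node count.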
Some vendors' proprietary network switching equipment implements in-network computing (In-network Computing): for collective communication represented by AllReduce, each compute node sends its data to the network switch, and after the switch completes the computation, the result is returned to each compute node. This avoids communication among all node pairs, greatly reduces the number of network transfers, and significantly improves system efficiency.
Yet PCIe switch chips, one of the most widely used interconnects, still lack functions for optimizing and accelerating collective communication, which greatly limits the efficiency of HPC or AI parallel computation on the interconnected computing devices.
Based on this, the present invention has been proposed.
Disclosure of Invention
In the course of developing PCIe switch chips, the PCIe protocol and the composition and principles of PCIe switch chips were studied in detail. At the same time, for HPC and AI application scenarios, the network communication pattern, namely collective communication, was analyzed in detail, and it was found that the original, unoptimized communication pattern causes extremely large concurrent network traffic, resulting in delay and possible congestion.
When researching optimizations for collective communication, it was found that certain proprietary communication protocols and products, such as Nvidia's InfiniBand switches, can implement in-network computing; but these protocols are not open, only the vendor's matching network access equipment can be used, prices are high, and the application scenarios are inflexible.
PCIe switch interconnects, widely deployed and based on an open protocol, have become even more common and important for GPU connectivity with the explosion of large AI models, but they lack an in-network computing function to optimize collective communication.
To address these deficiencies of the prior art, the invention provides a PCIe switch chip and a PCIe switch with a collective-communication in-network computing function. The specific technical scheme is as follows:
In a first aspect, the invention provides a PCIe switch chip with a collective-communication in-network computing function, comprising: an internally integrated PCIe endpoint device, dedicated data transmission channels, a task control unit and a computing unit, wherein the internally integrated PCIe endpoint device is connected to the original switching network and can be discovered and assigned a BDF number during enumeration;
The dedicated data transmission channels comprise DMA engines and FIFOs. A DMA engine completes data transmission and reception by issuing memory read/write TLP requests; before input data enters the computing unit, and after the computing unit outputs data, a group of input and output buffers is implemented in FIFO (first-in, first-out) form;
The task control unit receives a user's collective communication command, controls data transmission and controls the computing unit; it operates the DMA engines to complete data input and output by reading and writing their control registers;
Under the control of the task control unit, the computing unit reads data from the specified input FIFOs, completes the specified computation, and outputs the result to the specified output FIFOs.
In a further development, the control registers of the FIFOs, the task control unit and the computing unit are mapped into the BAR space of the internally integrated PCIe endpoint device, so that the control software can access them after enumeration.
All data inputs and outputs have their own dedicated data channels; they neither use nor affect the original switching network. This makes all data inputs and outputs parallel, guarantees transmission without conflict or congestion, and allows computation and result distribution to be completed within the data input/output flow.
In a further development, the computing unit only needs to select certain input FIFOs for computation and deliver the result to one or more output FIFOs; the input/output selection is configured by the task control unit according to the user request. This architecture can simultaneously support one-to-many communication (e.g., Broadcast), many-to-one communication (e.g., Reduce) and many-to-many communication (e.g., AllReduce and AllGather) in collective communication; for different communication modes, only the corresponding input and output ports of the computing unit need to be selected.
In a further improvement, 4 USP/DSP controllers form a group, and each group of controllers is provided with a storage unit for the dynamic allocation of 4 FIFOs;
Each group's FIFO storage unit can be shared by the group's USP/DSP controllers, so that the storage unit is fully utilized without idling under different numbers and rates of external devices; the size of the storage allocated to each FIFO is proportional to the link rate of the device attached to the port.
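The proportional-allocation rule can be sketched as follows (a minimal model, assuming the group's storage splits exactly by lane width; function and parameter names are illustrative):

```python
# Split one controller group's storage unit among its enabled devices,
# in proportion to link width (x16 / x8 / x4).
def allocate_fifos(total_bytes, lane_widths):
    """lane_widths: widths of the enabled devices in one 4-controller group."""
    total_lanes = sum(lane_widths)
    return [total_bytes * w // total_lanes for w in lane_widths]
```

With a 64 KiB storage unit, one x16 device gets the whole unit, two x8 devices get 32 KiB each, and four x4 devices get 16 KiB each, matching the x16-gets-4x-an-x4 ratio stated below.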
In a still further improvement, the BAR space of each USP or DSP must be enabled so that its FIFO maps into the BAR space; each buffer FIFO then obtains a PCIe bus address after enumeration for the addressing of subsequent data transfers.
In a further development, the data processing procedure of the task control unit comprises the following steps:
when a designated output FIFO reaches the LOW threshold (i.e., its TH_LOW signal is active), an output DMA is started to move its data to the designated destination address;
when a designated input FIFO has not reached the HIGH threshold (i.e., its TH_HIGH signal is inactive), an input DMA is started to read data from the designated location into the input FIFO;
when none of the designated output FIFOs has reached the HIGH threshold (TH_HIGH inactive) and all designated input FIFOs have reached the LOW threshold (TH_LOW active), the computing unit is started to perform the data computation.
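The three threshold rules above can be sketched as a single decision function (an illustrative software model of hardware behavior, not from the patent; in the chip each DMA is started per FIFO rather than as one action):

```python
# th_low  = "enough data buffered to read at least one packet"
# th_high = "writing one more packet would overflow"
def control_step(in_fifos, out_fifos):
    """Each FIFO is a dict with boolean 'th_low' and 'th_high' flags.
    Returns the set of actions the task control unit starts this cycle."""
    actions = set()
    if any(f["th_low"] for f in out_fifos):
        actions.add("start_output_dma")      # drain results to destination
    if any(not f["th_high"] for f in in_fifos):
        actions.add("start_input_dma")       # fetch more source data
    if all(not f["th_high"] for f in out_fifos) and \
       all(f["th_low"] for f in in_fifos):
        actions.add("start_compute")         # room on output, data on input
    return actions
```

Note that input DMA, output DMA and computation can all be active in the same cycle, which is what makes the pipeline fully parallel.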
In summary, all parallel data input/output and computation control is completed by the task control unit's hardware; software only needs to complete the initialization configuration and does not participate in real-time flow control, so efficiency is extremely high.
In a second aspect, the invention provides a PCIe switch including at least one PCIe switch chip with the collective-communication in-network computing function.
Data transmission in this invention refers to how the data to be operated on by a collective communication request is moved into or out of the PCIe switch chip's data channels, and its control.
The task control unit is the module in the PCIe switch chip that receives the user's collective communication command, controls data transmission and controls the computing unit; it manages the topology and running state of each computing unit. Some collective communications, such as a synchronization operation, do not require starting a computing unit and can be completed directly by the task control unit.
The computing unit can read data from each designated input FIFO according to the control command, complete the specified computation, and write the result into each designated output FIFO. Some collective communications require no computation, such as the broadcast operation, but still read data from one input FIFO and write it directly to each output FIFO via the physical channels of the computing unit.
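A minimal sketch of that computation-free broadcast path (illustrative only; names are assumptions): packets pass through the computing unit's channels but are merely copied to every selected output FIFO.

```python
# Broadcast through the computing unit: one input FIFO fanned out to all
# selected output FIFOs, with no arithmetic performed on the data.
def broadcast_through_compute_unit(input_fifo, output_fifos):
    for packet in input_fifo:
        for out in output_fifos:
            out.append(packet)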
The beneficial effects of the invention are:
1. The invention offloads collective communication and its computation to the network switch chip, reducing network traffic and its latency, shortening the time each compute node waits for network communication, and greatly improving system efficiency.
2. The invention adds an integrated PCIe endpoint device, dedicated data transmission channels, a task control unit and a computing unit on top of an ordinary PCIe switch chip to realize the in-network computing function for collective communication, which existing PCIe switch chips lack. The implementation also applies when multiple PCIe switch chips are interconnected to form a larger switching network.
3. Compared with PCIe switch chips without the in-network computing function, the invention can greatly reduce network traffic, lower communication latency and improve system efficiency during collective communication. Each compute node needs to exchange data with the PCIe switch chip only once, greatly improving system operating efficiency; and the larger the number of compute nodes, the more significant the efficiency gain.
Drawings
FIG. 1 is a schematic diagram before the efficiency improvement of in-network computing (AllReduce, which has several different algorithm implementations with differing network traffic patterns, is taken as an example; all of them initiate parallel network communication multiple times);
FIG. 2 is a schematic diagram after the efficiency improvement of in-network computing;
FIG. 3 is a functional block diagram of a PCIe switch chip having aggregate communication on-network computing capabilities in accordance with the present invention;
FIG. 4 is a schematic diagram of FIFO memory cell allocation connecting a node at x16 rate;
FIG. 5 is a schematic diagram of FIFO memory cell allocation connecting two nodes at x8 rate;
FIG. 6 is a schematic diagram of FIFO memory cell allocation connecting four nodes at x4 rate;
FIG. 7 is a flow chart of the input/output and control method of the computing unit according to the present invention;
FIG. 8 is a flow chart of a method of signal and data flow control of a task control unit.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Abbreviation and key term definitions
Aggregate communication: collective Communication, synchronizing and data interaction among computing nodes in parallel computing;
And (3) calculating on the network: in-network Computing, transferring the computation from the traditional computing node to the internet to improve efficiency;
BAR: base ADDRESS REGISTER, a register that indicates the memory resources of the PCIe device;
BDF: bus Device Function, PCIe device unique ID identification;
TLP: a data packet transmission format specified by a Transaction LAYER PACKETS, PCIE protocol;
IO: input and Output, input and Output;
USP: upstream Port, PCIe exchange chip Upstream Port, used for connecting RC apparatus;
DSP: downstream ports of the PCIe switching chip are used for connecting the EP equipment;
NTB: non-TRANSPARENT BRIDGE, non-transparent bridge, PCIe exchanges modify their addresses as set when transferring TLPs;
Enumerating: a process of finding PCIe bus topology and emptying its devices to host address space by a host at PCIe RC end;
HPC: high Performance Computing, high-performance calculation;
AI: ARTIFICIAL INTELLIGENCE, artificial intelligence;
Communicator: a group of processes capable of receiving and transmitting data in the aggregate communication is called a Communicator;
Rank: the unique identifier of each communication process in the aggregate communication is called Rank;
MPI: MASSAGE PASSING INTERFACE, aggregating a set of standard software interface definitions for communication;
OpenMPI: open source implementation code of MPI;
NCCL: nvidia Collective Communication Library, an aggregate communication software library of the Injeida company;
computing node: the node running the aggregate communication process, namely the CPU or GPU where each Rank is located, is specified in the text;
A calculation unit: the invention refers to a module with a calculation function, which is added in a PCIe exchange chip;
FIFO: first-In-First-Out, first-In First-Out, a data buffer implementation method;
DMA: direct Memory Access, finishing data migration by a special hardware unit without passing through a CPU;
MSI: MESSAGE SIGNALED Interrupts, a mechanism specified by the PCIe standard for using messaging interrupts;
MSI-X: extension of message interrupt mechanism MSI specified by PCIe standard;
BFloat: a floating point data type is widely applied in the field of artificial intelligence;
IN: input and Input;
OUT: output is the same as Output;
On the basis of an ordinary PCIe switch chip, the invention creatively adds dedicated data transmission channels, a task control unit and a computing unit to realize in-network computation for collective communication, so that the specified operations such as computation and distribution can be completed directly while the data is in flight.
Fig. 1 shows that in an environment without in-network computing, the compute nodes must communicate bidirectionally, and each node must obtain the data of all other nodes, which causes congestion and large delays. In Fig. 2, after the task control and computing units are added inside the PCIe switch chip, each compute node needs to exchange data with the switch chip only once, greatly improving system operating efficiency. The "task control and calculation unit" in Fig. 2 refers to the "task control unit" and the "calculation unit".
Example 1
The general framework of the invention is shown in FIG. 3 (only one USP and two DSPs are illustrated), where the dashed box is a typical implementation of an original, conventional PCIe switch chip: data arriving at each port is passed to the bus-structured switching network and, after the switching logic module searches the routing table, routed to the destination port.
The invention adds functions on top of the original PCIe switch chip without greatly changing the original architecture. To ensure that parallel input/output data does not cause congestion, the data of each port, after being buffered by its dedicated FIFO, reaches the computing unit directly without passing through the original shared switching network, thereby satisfying the requirement of massive concurrent input/output. The computing unit is configured as required: after selecting certain input FIFOs for computation, the result is delivered to one or more output FIFOs. The task control unit's hardware automatically controls the work of the computing unit according to the user's collective communication request and the data condition of the input/output FIFOs; the entire control is completed automatically by hardware, ensuring efficient, pipelined data operation.
The specific functions of each module of the invention are as follows:
1. Internally integrated PCIe endpoint device (DSPiEP)
In the PCIe architecture, one functional unit or group of functional units needs a BDF number so that PCIe packets can be routed to it, i.e., a new EP is needed; and in a PCIe switching network, the EP must be connected under a DSP to meet the specification. In practice this is achieved by a DSP with an integrated EP, the DSPiEP.
The DSPiEP is connected to the original switching network and can be discovered and assigned resources such as a BDF number during enumeration. The control registers of the newly added FIFOs, task control unit and computing unit are mapped into the BAR space of the DSPiEP, so that they can be accessed after enumeration, enabling all collective-communication-related operations such as configuration, management and requests.
2. Special data transmission channel (including DMA and FIFO)
To realize efficient parallel data input and output, the invention designs dedicated data transmission channels independent of the original switching network, as shown in fig. 3.
Data input and output are performed efficiently by DMA. The DMA engine may be located (but is not limited to being located) within the USP/DSP, and completes data transmission and reception by issuing memory read/write TLP requests. The core control registers of the DMA configure the data's source address, destination address and start/stop control, and report the running state. The task control unit reads and writes the DMA's control registers to operate it and complete data input and output.
To minimize the effect of rate jitter on the transmission link, a group of input and output buffers, implemented as FIFOs, is placed on the link before data enters the computing unit and after it leaves the computing unit.
Typically 4 USP/DSP controllers are used as a group, which can be configured as one x16, two x8, or four x4 devices. To ensure that the storage units of the buffer FIFOs are fully and efficiently utilized, the invention adopts the design shown in fig. 3:
Each group of controllers is provided with one storage unit. When the group's port is connected to an x16 device, as shown in fig. 4, the entire storage unit is used as one FIFO; if the port is enabled for two x8 devices, as in fig. 5, the buffer is divided into two FIFOs; the same applies to four FIFOs, as shown in fig. 6. U(D)SP in figs. 4-6 denotes USP/DSP.
The storage unit is shared in proportion to rate, but each FIFO has its own independent control registers and control signals, which are disabled when the FIFO is not enabled. Configuring one storage unit per controller group for the dynamic allocation of 4 FIFOs, instead of a static fixed buffer per controller, has the advantage that the storage obtained by each FIFO is proportional to the device rate: for example, an x16 device's buffer is 4 times that of an x4 device. It also avoids the idle waste that a controller's static fixed buffer would cause when that controller is not enabled.
Besides the FULL and EMPTY signals, each FIFO outputs a TH_HIGH signal and a TH_LOW signal. An active TH_HIGH signal indicates that the FIFO has reached the HIGH threshold and would overflow after the next packet is written, so writing must stop; an active TH_LOW signal indicates that the amount of data in the FIFO has reached the LOW threshold, satisfying the minimum read packet length, so data can be read. The task control unit uses these two threshold signals to decide when to start and stop the DMA engines and the computing unit; the specific threshold values can be programmed through registers for performance tuning.
The BAR space of each USP or DSP must be enabled so that its FIFO maps into BAR space; each buffer FIFO thus obtains a PCIe bus address after enumeration. This address is the address needed for DMA transfers. Equally important, for external input data, once the switching logic module determines that the input data's address is the BAR space address of a controller, the data is distributed directly to that controller's FIFO and then reaches the computing unit without passing through the switching network, which is the fundamental guarantee for massive concurrent data transfer. Output data is likewise sent out by the computing unit and, after passing through the respective FIFOs, is output by DMA.
3. Calculation unit
The calculation unit reads data from the specified input FIFO under the control of the task control unit, completes the specified calculation, and outputs the data to the specified FIFO. In some cases, the data collection and distribution function may be completed without a calculation operation.
The main input/output signals of the computing unit are as shown in fig. 7 (only 4 rank examples), the input/output selection signals enable the corresponding input/output FIFOs, for example, rank 0-2 can be selected as input, and the result is output to rank3 after computation. The aggregate communication operation signal is an operation for notifying the computing unit of a specific aggregate communication command, such as Scatter, allGather, and if Reduce, allReduce is an operation, it is also necessary to specify the type of computation, such as and, or, sum, and the like. The calculation start/pause signal is used as a flow control signal, and the calculation unit is started when the output FIFO has space and all the input FIFOs have data, otherwise the operation is paused. The calculation state output signal mainly refers to the total calculation data quantity, and is used for determining whether the calculation is completed or not, and also includes some error state signals and the like.
The architecture can support one-to-many communication (e.g., broadcast), many-to-one communication (e.g., reduce), and many-to-many communication (e.g., allReduce and ALLGATHER) of collective communication simultaneously, and only the corresponding input and output ports of the computing unit need be selected for different communication modes.
The data types supported by the computing unit include: 64-bit double-precision floating point/long integer/unsigned long integer, 32-bit single-precision floating point/integer/unsigned integer, 16-bit half-precision floating point/short integer/unsigned short integer, 16-bit BFloat floating point, 8-bit floating point/integer/unsigned integer.
The operations supported by the computing unit include: maximum, minimum, sum, product, logical AND, bitwise AND, logical OR, bitwise OR, logical XOR, bitwise XOR, maximum and position, minimum and position.
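The listed reductions are all element-wise across ranks; the paired "maximum and position" operation (MAXLOC-style, in MPI terms) returns both the winning value and the rank it came from. The sketch below is illustrative only, and the function names are assumptions:

```python
def reduce_sum(vectors):
    """Element-wise sum across one vector per rank."""
    return [sum(col) for col in zip(*vectors)]

def reduce_maxloc(vectors):
    """Element-wise 'maximum and position': (max value, index of the rank that supplied it)."""
    out = []
    for col in zip(*vectors):
        val = max(col)
        out.append((val, col.index(val)))
    return out
```

The other listed operations (minimum, product, logical/bitwise AND, OR, XOR, minimum and position) follow the same per-element pattern with a different combining function.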
4. Task control unit
The task control unit performs the following functions: system resource management, state management, collective communication command reception and response, and data processing flow control.
In collective communication, a communicator group is formed by multiple ranks (computing processes); in the collective communication software interface, the communicator is passed as a parameter in addition to the data send/receive addresses and the operation type. Therefore, the PCIe switch chip must accept the user's initialization configuration during the initialization stage and record, for each communicator, the ranks it contains and the BDF number corresponding to each rank.
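The bookkeeping described above amounts to a per-communicator table mapping each rank to its PCIe BDF (bus/device/function) number. A minimal sketch, with all names illustrative:

```python
class CommunicatorTable:
    """Initialization-stage registry: communicator id -> {rank: BDF number}."""
    def __init__(self):
        self._comms = {}

    def register(self, comm_id, rank_to_bdf):
        # Recorded once from the user's initialization configuration.
        self._comms[comm_id] = dict(rank_to_bdf)

    def bdf_of(self, comm_id, rank):
        return self._comms[comm_id][rank]

    def ranks(self, comm_id):
        return sorted(self._comms[comm_id])
```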
The task control unit contains an internal state machine. After resource-management initialization completes, it enters a waiting state in which it can accept a collective operation request; on accepting one it enters a busy state and accepts no new requests. After processing of the collective communication request finishes, it returns to the waiting state and accepts requests again; if an error occurs during processing, it enters an error state that requires a response from user software. On completion of a collective communication request, the completion status may be indicated through a dedicated register, by writing a message to a memory address registered by the user, or by sending an MSI/MSI-X interrupt, with the specific mode selected by the user. (This covers state management and collective communication command reception and response.)
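The three-state behavior can be sketched as follows; this is an illustrative model, not the chip's register-level design, and the class and method names are assumptions:

```python
WAIT, BUSY, ERROR = "WAIT", "BUSY", "ERROR"

class TaskControlFsm:
    """WAIT accepts one collective request; BUSY refuses new requests;
    ERROR latches until user software responds."""
    def __init__(self):
        self.state = WAIT

    def submit(self):
        if self.state != WAIT:
            return False          # busy or errored: request refused
        self.state = BUSY
        return True

    def complete(self, ok=True):
        assert self.state == BUSY
        self.state = WAIT if ok else ERROR

    def clear_error(self):
        # Models the user-software response to the error state.
        if self.state == ERROR:
            self.state = WAIT
```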
Data processing flow control is shown in fig. 8. The DMA control signals indicate that the task control unit can configure the source and destination addresses of each DMA and control its start and stop; the FIFO status information comprises the FULL, EMPTY, TH_HIGH (high threshold reached), and TH_LOW (low threshold reached) signals described previously. When the TH_LOW signal of a selected output FIFO is valid, the task control unit configures the destination address of the output DMA and starts it (the source is the output FIFO). When a selected input FIFO has not reached the high threshold, i.e., TH_HIGH is invalid, it configures the source address of the input DMA and starts it (the destination is the input FIFO). When none of the selected output FIFOs has reached the high threshold and all of the selected input FIFOs have reached the low threshold, the computing unit is started to perform the data computation. The whole process is managed by the task control unit, and the data flow is completed automatically by hardware, so efficiency is very high.
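The three threshold decisions above can be written down as simple predicates over the four FIFO status flags. This is a behavioral sketch under assumed names (`FifoStatus`, `start_*`), not the chip's register interface:

```python
class FifoStatus:
    """Derives the four status flags from a FIFO's fill level."""
    def __init__(self, level, depth, th_low, th_high):
        self.full = level >= depth
        self.empty = level == 0
        self.th_low = level >= th_low    # low threshold reached
        self.th_high = level >= th_high  # high threshold reached

def start_output_dma(out_fifo):
    # Drain the output FIFO to the destination address once it holds
    # at least the low-threshold amount of data.
    return out_fifo.th_low

def start_input_dma(in_fifo):
    # Keep filling an input FIFO while it has not reached the high threshold.
    return not in_fifo.th_high

def start_compute(in_fifos, out_fifos):
    # Compute when no selected output FIFO is near full and every selected
    # input FIFO holds at least the low-threshold amount of data.
    return (all(not f.th_high for f in out_fifos)
            and all(f.th_low for f in in_fifos))
```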
Through the above functional modules, the invention implements collective communication functions inside the PCIe switch chip: the computing nodes where the ranks reside only need to send the collective communication request and wait for completion, and the computation is completed during network data transmission, i.e., in-network computing. Both the blocking and non-blocking modes of the software interface can be supported: in blocking mode, the software submits a request, polls a status register in the chip, and returns when execution completes; in non-blocking mode, the software returns immediately after submitting the request, and when the PCIe switch chip finishes execution it sends an MSI/MSI-X interrupt to signal that the collective communication is complete.
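The two completion modes can be contrasted with a small host-side sketch. Everything here is illustrative: `poll_status` stands in for reading the chip's status register, and the `events` list stands in for the MSI/MSI-X interrupt delivery path.

```python
def blocking_collective(submit, poll_status):
    """Blocking mode: submit, then spin on the chip's status register."""
    submit()
    while not poll_status():
        pass
    return "done"

def nonblocking_collective(submit, on_complete, events):
    """Non-blocking mode: submit and return at once; the completion callback
    is fired later, when the MSI/MSI-X interrupt arrives."""
    submit()
    events.append(on_complete)
    return "submitted"
```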
In summary, the following collective communication functions may be implemented on this PCIe switch chip:
Reduce/AllReduce (reduce/all-reduce): the task control unit of the PCIe switch chip receives the request, controls the computing unit to read data from the specified group of input channels, and after completing the specified computation outputs the result to one (Reduce) or all (AllReduce) computing nodes.
Receive/Send: the task control unit of the PCIe switch chip receives the request, controls the computing unit to read data from the specified input channel without starting the computation function, and sends the data to the specified output channel.
Broadcast: the task control unit of the PCIe switch chip receives the request, controls the computing unit to read data from the specified input channel without starting the computation function, and copies the data to all specified output channels.
Scatter: the task control unit of the PCIe switch chip receives the request, controls the computing unit to read data from the specified input channel without starting the computation function, and distributes the data across all specified output channels.
Gather/AllGather (gather/all-gather): the task control unit of the PCIe switch chip receives the request, controls the computing unit to read data from all specified input channels without starting the computation function, and sends all collected data to one (Gather) or all (AllGather) computing nodes.
Barrier (synchronization): after receiving a request from a computing node, the task control unit of the PCIe switch chip does not involve the computing unit; the state-machine register shows that the request has entered a busy-wait state, and once all requests have been received, the task control unit notifies all pending synchronization requests to return, i.e., all computing nodes have completed synchronization.
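The Barrier handling described above can be sketched as a counting release: each rank's request is held in a busy-wait set, and all are released only after every rank in the communicator has checked in. An illustrative model with assumed names:

```python
class Barrier:
    """Release all held synchronization requests once every rank has arrived."""
    def __init__(self, n_ranks):
        self.n_ranks = n_ranks
        self.arrived = set()

    def request(self, rank):
        """Returns True when this request releases the barrier (all ranks in);
        False means the caller remains in the busy-wait state."""
        self.arrived.add(rank)
        if len(self.arrived) == self.n_ranks:
            self.arrived.clear()   # reset for the next barrier round
            return True
        return False
```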
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (5)

1. PCIe switching chip having aggregate communication on-network computing function, comprising: an internally integrated PCIe endpoint device, a special data transmission channel, a task control unit, and a computing unit, wherein the internally integrated PCIe endpoint device is connected to the original switching network and can be discovered and assigned a BDF number during the enumeration process;
The special data transmission channel comprises a DMA and FIFOs, the DMA completing the sending and receiving of data by issuing memory read/write TLP requests; before input data enter the computing unit and after the computing unit outputs data, a group of input and output buffers is implemented in FIFO (first-in first-out) form;
The task control unit is used for receiving an aggregate communication command of a user, controlling data transmission and controlling the calculation unit, and the task control unit is used for operating the DMA to finish the input and output of data by reading and writing a control register of the DMA;
The computing unit reads data from the specified input FIFO under the control of the task control unit, completes the specified computation, and outputs the data to the specified FIFO;
The control registers of the FIFO, the task control unit and the computing unit are mapped into the BAR space of the PCIe endpoint device with internal integration, and can be accessed by the control software after enumeration;
the data processing flow of the task control unit comprises the following steps:
When the designated output FIFO reaches the low threshold, starting the output DMA to transfer data to the designated destination address;
When the designated input FIFO has not reached the high threshold, starting the input DMA to read data from the designated location into the input FIFO;
when all the designated output FIFOs do not reach the high threshold and all the designated input FIFOs reach the low threshold, the calculation unit is started to perform data calculation.
2. The PCIe switching chip with aggregate communication on-network computing function of claim 1, wherein: the computing unit selects certain input FIFOs for computation and delivers the result to one or more output FIFOs, the input/output selection being configured by the task control unit according to the user request.
3. The PCIe switching chip with aggregate communication on-network computing function of claim 1, wherein: every 4 USP/DSP controllers form a group, and each group of controllers is provided with a storage unit for dynamic allocation of 4 FIFOs;
each USP/DSP is configured with a FIFO storage unit that can be shared by the USP/DSP controllers.
4. The PCIe switching chip with aggregate communication on-network computing function as defined in claim 3 wherein: the BAR space of each USP or DSP needs to be enabled so that its FIFO maps to the BAR space, and each cache FIFO will obtain the PCIe bus address after enumeration.
5. PCIe switch, characterized in that: comprising at least one PCIe switching chip with aggregate communication on-network computing function as defined in any of claims 1-4.
CN202410497224.1A 2024-04-24 2024-04-24 PCIe (peripheral component interconnect express) switching chip with aggregate communication on-line computing function and PCIe switch Active CN118069571B (en)
Publications (2)

Publication Number Publication Date
CN118069571A CN118069571A (en) 2024-05-24
CN118069571B true CN118069571B (en) 2024-06-18