CN114066707A - General graphics processing system, computing device and distributed system


Info

Publication number
CN114066707A
CN114066707A
Authority
CN
China
Legal status
Pending
Application number
CN202010787539.1A
Other languages
Chinese (zh)
Inventor
陆叶
吴政原
韩亮
Current Assignee
Pingtouge Shanghai Semiconductor Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010787539.1A
Publication of CN114066707A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multi Processors (AREA)

Abstract

General purpose graphics processing systems, computing devices, and distributed systems are disclosed. The general graphics processing system includes: a computing unit; a cache; a storage controller coupled to the cache; a switching module comprising a plurality of interfaces, configured to receive an identifier of a target to be accessed and a source address of first data to be written, determine a first interface among the plurality of interfaces according to the identifier of the target to be accessed and pre-stored interconnection information, read the first data to be written from the cache according to the source address, and send the first data to be written through the first interface; and a connection unit for coupling the computing unit, the storage controller, the cache, and the switching module. According to the embodiments of the disclosure, the general graphics processing system with the integrated switching module is not a pure end device but has networking and switching capabilities, so that networking does not rely on external switches and routers.

Description

General graphics processing system, computing device and distributed system
Technical Field
The present disclosure relates to the field of deep learning, and in particular, to a general purpose graphics processing system, a computing device, and a distributed system.
Background
Deep learning is one of the most compelling technologies to re-emerge over the last decade, and it has produced many breakthroughs and applications in fields such as speech, images, big data, and biomedicine. To meet the requirements of increasingly complex application scenarios, deep learning models have grown ever larger. The growth in model parameters means that the computing resources of a single node can no longer satisfy the computing demands of model training, so complex models are usually trained on a distributed system comprising multiple computing nodes.
However, in a distributed system, the interconnection communication overhead between computing nodes and between the devices inside each computing node becomes a bottleneck to computing-power growth, so computing power does not scale linearly with the number of nodes. Traditional PCIe and TCP/IP interconnection technologies have proven to suffer from limitations in bandwidth, network latency, scalability, and so on, and are not well suited to such distributed systems. Interconnection communication has therefore become one of the core problems of distributed system design.
Disclosure of Invention
In view of the above, it is an object of the present disclosure to provide a general-purpose graphics processing system, a computing device and a distributed system to solve the problems of the prior art.
According to a first aspect of embodiments of the present disclosure, an embodiment of the present disclosure provides a general-purpose graphics processing system, including:
a computing unit;
a cache;
a storage controller coupled to the cache;
a switching module comprising a plurality of interfaces, the switching module being configured to receive an identifier of a target to be accessed and a source address of first data to be written, determine a first interface among the plurality of interfaces according to the identifier of the target to be accessed and pre-stored interconnection information, read the first data to be written from the cache according to the source address, and send the first data to be written through the first interface;
a connection unit for coupling the computing unit, the storage controller, the cache, and the switch module.
In some embodiments, the identifier of the target to be accessed and the source address come from a data operation request submitted by the computing unit or the storage controller.
In some embodiments, the switching module further comprises:
the transmission engine is used for encoding the first data to be written according to a specified transport layer/network layer communication protocol;
and the switching unit is used for determining the first interface according to the identifier of the target to be accessed and the interconnection information, continuing to encode the first data to be written according to an Ethernet communication protocol and a physical layer protocol, and sending the encoded data through the first interface.
In some embodiments, the transmission engine supports a plurality of transport layer/network layer communication protocols, and the transmission engine selects the specified transport layer/network layer communication protocol from the plurality of transport layer/network layer communication protocols to encode the first data to be written.
In some embodiments, the transmission engine comprises:
the RoCE protocol processing module is used for encoding the first data to be written based on a RoCEv2 communication protocol;
and the proprietary protocol processing module is used for encoding the first data to be written based on a proprietary protocol.
In some embodiments, the RoCE protocol processing module includes a TOE unit, and the TOE unit is configured to encode the first data to be written according to an IP/TCP/UDP protocol.
In some embodiments, the RoCE protocol processing module includes an IB unit to establish end-to-end data transmission based on an IB protocol.
In some embodiments, the RoCE protocol processing module includes a Verbs interface.
In some embodiments, the proprietary protocol processing module includes a proprietary protocol unit for encoding and decoding data in accordance with a proprietary protocol and a proprietary protocol driver interface for providing drivers and hardware interfaces.
In some embodiments, the identification of the target to be accessed includes an identification of a target graphics processing unit connected via the first interface and an identification of a target application.
In some embodiments, the switching unit is further configured to receive second data to be written via a second interface, determine a target address according to an identifier of the target to be accessed, and write the second data to be written to the target address.
In some embodiments, the source address and the destination address are the specific storage addresses of the source application program and the destination application program in their respective application memory spaces.
In some embodiments, the switching module is integrated as a network card processor.
In some embodiments, the target graphics processing unit and the general purpose graphics processing system are located in different compute nodes.
In some embodiments, the plurality of interfaces are Ethernet interfaces.
In a second aspect, the disclosed embodiments provide a computing apparatus comprising a plurality of computing nodes, the computing nodes comprising coupled memory, a general purpose processor, and the general purpose graphics processing system of any of the above, wherein the computing nodes are coupled with the general purpose graphics processing system of at least one other computing node through their own general purpose graphics processing system.
In some embodiments, each compute node is packaged in a single silicon die, and multiple compute nodes are integrated together by a printed circuit board.
In some embodiments, the computing device is configured to perform a training task for a deep learning model.
In a third aspect, an embodiment of the present disclosure provides a distributed system including a plurality of computing apparatuses according to any one of the above descriptions, where data transmission among the plurality of computing apparatuses is performed through an external bridge device.
In a fourth aspect, an embodiment of the present disclosure provides a distributed system including a plurality of computing nodes, where each computing node includes coupled memory, a general purpose processor, and the general purpose graphics processing system of any one of the above, and where each computing node is coupled, via its own general purpose graphics processing system, with the general purpose graphics processing system of at least one neighboring computing node, forming a 3D-Torus interconnection network.
In a fifth aspect, an embodiment of the present disclosure provides a cloud server including the computing apparatus according to any one of the above items.
In a sixth aspect, an embodiment of the present disclosure provides a method implemented in a general purpose graphics processing system, including:
receiving an identifier of a target to be accessed and a source address of data to be written;
determining a first interface in the plurality of interfaces according to the identifier of the target to be accessed and pre-stored interconnection information;
and reading the data to be written from the cache according to the source address, and sending the data to be written through the first interface.
According to the embodiments of the disclosure, the general graphics processing system with the integrated switching module is not a pure end device but has networking and data switching capabilities, so networking does not rely on external switches and routers. The general-purpose graphics processing system can read data from and write data to other devices without depending on an external switch or router, which improves the data exchange capability between the general-purpose graphics processing system and other devices and helps improve its computing performance.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which refers to the accompanying drawings in which:
FIG. 1 is a hierarchy of a data center;
FIG. 2 is a perspective block diagram of a data center;
fig. 3 is a schematic diagram of a cloud server of a general structure of a data center.
FIG. 4a is a block diagram of a cloud server for performing model training tasks in the prior art;
FIG. 4b shows a modification to the cloud server based on FIG. 4a;
FIG. 5 is a block diagram of a general purpose graphics processing system provided in accordance with one embodiment of the present disclosure;
FIG. 6 shows a more detailed functional block diagram of the switch module of FIG. 5;
FIG. 7 is a block diagram of a message of a proprietary protocol;
FIG. 8 is a schematic diagram of the interconnect structure of the general purpose graphics processing system shown in FIG. 5;
FIGS. 9 and 10 are schematic diagrams of an application of a compute node incorporating an embodiment of the present disclosure;
FIG. 11 is a flow diagram of a method for a general purpose graphics processing system, according to one embodiment of the present disclosure.
Detailed Description
The present disclosure is described below based on examples, but it is not limited to these examples. In the following detailed description, some specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods and procedures have not been described in detail so as not to obscure the present disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein:
The OSI model: the International Organization for Standardization (ISO) established the OSI (Open Systems Interconnection) model. The OSI model is an abstract architecture that divides network communication into seven layers: the physical layer, data link layer, network layer, transport layer, session layer, presentation layer, and application layer. The physical layer performs the actual final signal transmission, carrying the bit stream as electrical signals over the physical medium. The data link layer is responsible for network addressing, error detection, and error correction; in this layer, a header and trailer are added to the data to form a frame. The network layer determines the routing and forwarding of data; here a network header (NH) is added to the data to form a packet. The transport layer provides reliable end-to-end connections; in this layer a transport header (TH) is added to the data to form a data segment. The session layer provides mechanisms for establishing and maintaining communication between applications, including access authentication and session management. The presentation layer mainly addresses the syntactic representation of user information and provides formatted presentation and data conversion services. The application layer provides interface services between the network and user application software.
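As a purely illustrative aid (not part of the disclosure), the following Python sketch shows how a payload is wrapped layer by layer on the sending side; the header fields and their textual form are simplified assumptions rather than any real protocol.

```python
# Illustrative sketch of OSI-style encapsulation on the sending side.
# Layer names follow the OSI model described above; the header contents
# are simplified placeholders, not a real protocol implementation.

def encapsulate(payload: bytes, src_port: int, dst_port: int,
                src_ip: str, dst_ip: str, src_mac: str, dst_mac: str) -> bytes:
    # Transport layer: add a transport header (TH) to form a segment.
    segment = f"TH(sport={src_port},dport={dst_port})|".encode() + payload
    # Network layer: add a network header (NH) to form a packet.
    packet = f"NH(src={src_ip},dst={dst_ip})|".encode() + segment
    # Data link layer: add a frame header and trailer to form a frame.
    frame = f"FH(src={src_mac},dst={dst_mac})|".encode() + packet + b"|FCS"
    # Physical layer: the frame is finally transmitted as a bit stream.
    return frame

if __name__ == "__main__":
    print(encapsulate(b"hello", 4791, 4791,
                      "10.0.0.1", "10.0.0.2",
                      "aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02"))
```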
Ethernet technology: in essence, a medium access control technique at the data link layer. It can be combined with various physical layer technologies to form various Ethernet access systems; for example, combined with VDSL over telephone copper cable it forms EoVDSL, combined with a passive optical network it yields EPON, and in a wireless environment it gives rise to WLAN technology.
The RoCE protocol: an extension of Ethernet technology for dedicated (private) networks. Although Ethernet has always dominated the global Internet, it shows many shortcomings in high-bandwidth, low-latency dedicated networks. With the rise of network convergence, the DCB (Data Center Bridging) standards released by the IETF (Internet Engineering Task Force) addressed RDMA/InfiniBand-based lossless links, Ethernet finally gained its own standard in the dedicated-network field, and the concept of RoCE (RDMA over Converged Ethernet) was proposed. As the version has been upgraded (from RoCEv1 to RoCEv2), new NICs (network interface controllers) and switches at 10Gb and above generally integrate RoCE support. RoCEv1 (Layer 2) operates at the data link layer (the Ethernet link layer). RoCEv2 operates at the network/transport layer (UDP/IPv4 or UDP/IPv6).
Data center
Fig. 1 shows a hierarchical structure diagram of a data center as one scenario to which an embodiment of the present disclosure is applied.
A data center is a globally collaborative network of devices that is used to communicate, accelerate, present, compute, store data information over an internet network infrastructure. In future development, the data center will become an asset for enterprise competition. With the popularization of data center applications, artificial intelligence and the like are increasingly applied to data centers. The neural network is an important technology of artificial intelligence, and is widely applied to big data analysis and operation of a data center.
In a conventional large data center, the network structure is generally a three-layer structure shown in fig. 1, i.e., a hierarchical interconnection network model (hierarchical inter-networking model). This model contains the following three layers:
Access Layer 103: sometimes referred to as the edge layer, it includes access switches 130 and the servers 140 connected to them. Each server 140 is a processing and storage entity of the data center; the processing and storage of large amounts of data is performed by the servers 140. An access switch 130 is a switch used to connect these servers to the data center; one access switch 130 serves multiple servers 140. Access switches 130 are typically located at the top of the rack, so they are also called Top of Rack (ToR) switches, and they physically connect the servers.
Aggregation Layer (Aggregation Layer) 102: sometimes referred to as the distribution layer, includes aggregation switches 120. Each aggregation switch 120 connects multiple access switches while providing other services such as firewalls, intrusion detection, network analysis, and the like.
Core Layer (Core Layer) 101: including core switches 110. Core switches 110 provide high-speed forwarding of packets to and from the data center and connectivity for multiple aggregation layers. The entire data center network is divided into an L3 layer routing network and an L2 layer routing network, and the core switch 110 provides a flexible L3 layer routing network for the entire data center network.
Typically, the aggregation switch 120 is the demarcation point between the L2 and L3 routing networks, with L2 below and L3 above the aggregation switch 120. Each group of aggregation switches manages a Point of Delivery (POD), and each POD is a separate VLAN network. Server migration within a POD does not require modifying the IP address or default gateway, because one POD corresponds to one L2 broadcast domain.
The Spanning Tree Protocol (STP) is typically used between the aggregation switches 120 and the access switches 130. STP makes only one aggregation switch 120 available for a given VLAN network; the other aggregation switches 120 are used only in the event of a failure (dashed lines in FIG. 1). That is, there is no horizontal scaling at the aggregation layer, since only one aggregation switch 120 is working even if several are added.
FIG. 2 illustrates the physical connections of the components in the hierarchical data center of FIG. 1. As shown in fig. 2, one core switch 110 connects to multiple aggregation switches 120, one aggregation switch 120 connects to multiple access switches 130, and one access switch 130 accesses multiple servers 140.
Cloud server
The cloud server 140 is the actual device of the data center. Because the cloud server 140 needs to run various tasks at high speed, such as matrix calculation, image processing, machine learning, compression, and search ranking, it generally includes a Central Processing Unit (CPU) and various acceleration units, as shown in FIG. 3, so that these tasks can be performed efficiently. An acceleration unit is, for example, one of an acceleration unit dedicated to neural networks, a Data Transmission Unit (DTU), a Graphics Processing Unit (GPU), an Application Specific Integrated Circuit (ASIC), or a Field Programmable Gate Array (FPGA). Each acceleration unit in FIG. 3 is described below by way of example.
Acceleration unit 230 dedicated to neural networks: a processing unit that adopts a data-driven parallel computing architecture to handle the large number of operations (such as convolution and pooling) of each neural network node. Because the data and intermediate results of these operations are closely related throughout the computation and are used frequently, the existing CPU architecture, whose in-core memory capacity is small, must frequently access external memory, leading to low processing efficiency. With this acceleration unit, each core has an on-chip memory whose capacity suits neural network computation, so frequent accesses to memory outside the core are avoided, which can greatly improve processing efficiency and computing performance.
Data Transmission Unit (DTU) 260: a wireless terminal device specially used for converting serial-port data into IP data, or IP data into serial-port data, and transmitting it over a wireless communication network. The main function of the DTU is to transmit data from the remote device back to the back-office center wirelessly. At the front end, the DTU interfaces with the customer's equipment. After the DTU is powered on, it first registers to the mobile GPRS network and then establishes a socket connection to the back-office center configured in it. The back-office center acts as the server side of the socket connection, and the DTU is the client side. Therefore, the DTU is used together with back-office software; once the connection is established, the front-end device and the back-office center can exchange data wirelessly through the DTU.
Graphics Processing Unit (GPU) 240: a processor dedicated to image and graphics related computation. The GPU makes up for the shortcoming that the computing units inside a CPU are too few: it employs a large number of computing units dedicated to graphics computation, reducing the graphics card's dependence on the CPU and taking on some of the computation-intensive image processing work originally borne by the CPU.
Application Specific Integrated Circuit (ASIC): an integrated circuit designed and manufactured to meet the needs of a specific user and a specific electronic system. Because such integrated circuits are customized to the user's requirements, their structure is often tailored to the specific application.
Field Programmable Gate Array (FPGA): a product developed on the basis of programmable devices such as PAL and GAL. As a semi-custom circuit in the application-specific integrated circuit (ASIC) field, it overcomes the drawbacks of fully custom circuits while also overcoming the limited gate count of earlier programmable devices.
On the basis of the cloud server with this general structure, system operation and maintenance personnel can select different acceleration units in the cloud server according to the tasks at hand. FIG. 4a shows a block diagram of a cloud server performing model training tasks. As shown in the figure, the cloud server 140 is composed of a plurality of nodes 141 and a plurality of switches 144 located outside the nodes 141. Each node 141 is coupled to at least one external switch 144, and the switches 144 are coupled to one another to form a network structure in which data is exchanged between nodes 141 via the switches 144.
The structure of an exemplary node 141 is shown in the figure. The node 141 includes task units 143 and 142. The task unit 142 and the task unit 143 have the same structure. Here, the task unit 143 is described as an example. The task unit 143 includes a memory 1411, a CPU 1412, an interconnection switching unit 1413, a GPU 1414, a Network Interface Controller (NIC)1415, a GPU 1416, and a network interface controller 1417. Wherein the functions of CPU 1412, GPU 1414, and GPU 1416 may be as described above. It should be appreciated that the cloud server 140 will include an appropriate number of task units, and the CPU and GPU in each task unit will be in an appropriate ratio, depending on the model training task to be performed.
Taking the task unit 143 as an example: the memory 1411 is used for storing instructions and data, and may be a random access memory or a flash memory. The interconnect switch unit 1413 has two functions: it provides physical connections among GPU 1414, GPU 1416, network interface controllers 1415 and 1417, and CPU 1412, and it provides data forwarding among them. For example, via the interconnect switching unit 1413 and through the network interface controller 1415, the CPU 1412 can obtain instructions and data from the outside and provide them to GPU 1414 for use. The network interface controllers 1415 and 1417 are connected to the external switch 144.
The workflow of the task unit 143 may be as follows: the CPU 1412 acquires instructions and data from the outside via the network interface controllers 1415 and 1417, stores them in the memory 1411 via the interconnection switching unit 1413, then reads the instructions and data from the memory 1411 and distributes them to GPUs 1414 and 1416, so that GPUs 1414 and 1416 perform parallel computation on the image data.
It should be noted that node 141 may take multiple product forms. For example, node 141 may be implemented on a single piece of silicon (i.e., as a system on a chip), with the network of multiple nodes 141 and multiple switches 144 implemented as an integrated device; or node 141 may be implemented as a computer, with the network of multiple nodes 141 and multiple switches 144 implemented as a computer cluster.
In the above configuration, the interconnect communication overhead may become a bottleneck to computing-power growth. For example, due to the interconnect communication overhead and network delay introduced by interconnect switch unit 1413 and external switch 144, the computing power of task unit 143 is generally lower than the sum of the computing power of GPUs 1414 and 1416. Therefore, to increase overall system computing power, the interconnect communication overhead must be further reduced. FIG. 4b shows a modification of FIG. 4a. In FIG. 4b, the interconnection switching unit 1413 is integrated with the CPU 1412, which eliminates the data exchange requirement between the CPU 1412 and the network interface controllers 1415 and 1417; likewise, because the interconnection switching unit 1423 is integrated with the CPU 1422, the data exchange requirement between the CPU 1422 and the network interface controllers 1425 and 1427 is also eliminated. However, this solution still has the following limitation: the network interface controllers 1415 and 1417 may use the latest RDMA technology to achieve 800G of data bandwidth, whereas the switch 144 is typically an existing commercial Ethernet switch that mainly supports 25G-100G of data bandwidth, and this mismatch prevents the computing power from being fully exploited.
The embodiment of the disclosure provides a general graphics processing system
The disclosed embodiment provides a general-purpose graphics processing system (a general-purpose computing system built on graphics processing) 500, as shown in FIG. 5, which is used to replace the GPU in FIG. 3. Various aspects of the general purpose graphics processing system 500 are described in detail below.
As shown, the general purpose graphics processing system 500 is augmented with a switching module 501, where the switching module 501 comprises hardware and software whose function is similar to the interconnect switch units of FIGS. 4a and 4b, i.e., it implements data switching. In addition to the switching module 501, the general-purpose graphics processing system 500 includes a plurality of computing units 502, a connection unit 503, a memory controller 504, and a cache 505.
The connection unit 503 is configured to couple the switching module 501, the plurality of computing units 502, and the memory controller 504. The connection unit 503 mainly functions to transmit data from one component to another component and perform interface conversion, i.e., converting the format of the received data and outputting the converted data.
The computing unit 502 is used to perform computing tasks related to image data. Multiple computing units 502 may perform computing tasks in parallel, thereby improving the computational performance of the general purpose graphics processing system 500. The computing unit 502 may read instructions and data from the cache 505 via the memory controller 504 to perform computational tasks. The computing unit 502 further includes an instruction fetch unit, a decode unit, an arithmetic calculation unit, and the registers and buffers necessary for performing calculations, which are not shown in the drawing. The instruction fetch unit reads instructions and data from memory. The decode unit parses the instructions. The arithmetic calculation unit carries out the actual arithmetic operations. These components cooperate to accomplish the computational tasks of the computing unit 502.
As shown, the memory controller 504 is coupled to a cache 505. The memory controller 504 and cache 505 may be integrated as one memory device or as separate devices as shown. The memory controller 504 performs the necessary control of access to the cache 505. For example, when the computing unit 502 accesses the cache memory 505 via the memory controller 504, the memory controller 504 converts the read and write commands issued by the computing unit 502 into signals that can be recognized by the cache memory 505, and also completes address decoding and data format conversion between the computing unit 502 and the memory controller 504. Similarly, when the switch module 501 accesses the cache 505 via the storage controller 504, the storage controller 504 performs corresponding access control.
The switching module 501 includes a plurality of interfaces 5011, is coupled to other devices, such as other general purpose graphics processing systems, through the plurality of interfaces 5011, and exchanges data with those devices through them. Specifically, if the current general purpose graphics processing system is to write data to another general purpose graphics processing system coupled to it, the switching module 501 receives a write operation request, which typically contains an identifier of the target to be accessed and a source address of the data to be written, determines a first interface among the plurality of interfaces 5011 according to the identifier of the target to be accessed and pre-stored interconnection information, reads the data to be written from the cache 505 according to the source address, and sends the data to be written via the first interface. If the current general purpose graphics processing system is to read data from another general purpose graphics processing system coupled to it, the switching module 501 receives a read operation request, which typically contains an identifier of the target to be accessed and a source address indicating where the read data should be stored, determines a first interface among the plurality of interfaces 5011 according to the identifier of the target to be accessed and the pre-stored interconnection information, and receives, via the first interface, the data to be written to the source address. The interconnection information records the connections between the current general-purpose graphics processing system and the other general-purpose graphics processing systems.
The write operation request and the read operation request may come from an application executed by the computing unit 502 or from the memory controller 504. The target to be accessed may be a target application executed in another general purpose graphics processing system coupled to the current one, so the identifier of the target to be accessed typically includes an identifier of the graphics processing unit to which the target application belongs and an identifier of the target application itself.
Other general purpose graphics processing systems coupled to the current one also include a switching module 501. For both write and read operation requests, the switching module 501 needs to determine a target address in the memory space of the target application; the target address corresponds to the source address. The source address is the specific storage address, in the cache, of the application memory space of the source application, and the target address is the specific storage address, in the cache, of the application memory space of the target application. For a write operation request, the switching module 501 reads data directly from the source address and writes it to the target address; for a read operation request, the switching module 501 reads data from the target address and writes it to the source address.
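The behavior just described can be summarized in the following illustrative Python sketch. It is only a functional model: the Cache, Interface, and SwitchModule classes, the method names, and the form of the interconnection information are assumptions introduced for illustration, not the actual hardware implementation.

```python
# Functional sketch of how the switching module handles write and read
# operation requests, as described above. The cache and interface objects
# are minimal stand-ins; all names are illustrative, not the hardware design.

class Cache:
    def __init__(self, size=1024):
        self.mem = bytearray(size)
    def read(self, addr, length):
        return bytes(self.mem[addr:addr + length])
    def write(self, addr, data):
        self.mem[addr:addr + len(data)] = data

class Interface:
    """Stand-in for one of the plurality of interfaces 5011."""
    def __init__(self, name):
        self.name = name
    def send(self, target_id, data):
        print(f"{self.name}: send {data!r} to target {target_id}")
    def request_read(self, target_id, length):
        # In the real system the remote switching module resolves the
        # target address and returns the data; here we return dummy bytes.
        return b"\x00" * length

class SwitchModule:
    def __init__(self, interfaces, interconnect_info, cache):
        self.interfaces = interfaces                 # list of Interface
        self.interconnect_info = interconnect_info   # target id -> interface index
        self.cache = cache                           # local cache

    def _first_interface(self, target_id):
        # Determine the first interface from the identifier of the target
        # to be accessed and the pre-stored interconnection information.
        return self.interfaces[self.interconnect_info[target_id]]

    def handle_write(self, target_id, source_addr, length):
        intf = self._first_interface(target_id)
        data = self.cache.read(source_addr, length)  # first data to be written
        intf.send(target_id, data)                   # send via the first interface

    def handle_read(self, target_id, source_addr, length):
        intf = self._first_interface(target_id)
        data = intf.request_read(target_id, length)
        self.cache.write(source_addr, data)          # store at the source address

if __name__ == "__main__":
    sw = SwitchModule([Interface("if0"), Interface("if1")],
                      {"gpgpu1/app7": 1}, Cache())
    sw.cache.write(0x10, b"tensor-slice")
    sw.handle_write("gpgpu1/app7", 0x10, 12)
```

In the real system these steps are performed by the switching module itself, so the computing unit 502 is not involved in moving the data.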
In this embodiment, the general-purpose graphics processing system has a data exchange function, so multiple graphics processing units can form a network and complete data forwarding without external or internal switches, which improves the data transmission capability among them, reduces network delay, and improves the computing performance of the graphics processing units.
Furthermore, data reads and writes are completed independently by the switching module without occupying the resources of the computing unit, which reduces the computing load on the computing unit and improves the computing performance of the graphics processing unit.
With continued reference to the figure, as an alternative embodiment, the switching module 501 includes a plurality of interfaces 5011, a switching unit 5012, and a transmission engine 5013 that are coupled to one another internally.
The interface 5011 is, for example, an Ethernet interface conforming to the 802.3 specification. The interface 5011 may be implemented by an Ethernet interface chip, and its main function is to transmit bit-stream signals at the physical layer, which includes both sending and receiving. When sending, it receives frame data from the data link layer, converts the frame data into a bit-stream signal, and outputs it. When receiving, it takes a bit-stream signal from the transmission medium connected to the interface 5011 and converts it into frame data for the data link layer. The interface 5011 also typically includes a serial-to-parallel converter implemented with SerDes technology to enable high-speed transmission of serial signals. When a bit-stream signal is sent, the converter turns the parallel bit stream into a serial bit stream and transmits it through the transmission medium to the interface 5011 at the receiving end. Integrating serial-to-parallel converters into the interface to achieve high-speed serial transmission has become the mainstream choice in many commercial products.
The transmission engine 5013 is operable to encode data in accordance with a specified transport layer/network layer communication protocol and transmit the encoded data to the switching unit 5012, while receiving data to be decoded from the switching unit 5012 and decoding the data in accordance with the specified transport layer/network layer communication protocol.
The switching unit 5012 is configured to determine a first interface according to the identifier of the target to be accessed and the interconnection information, encode data according to the ethernet communication protocol and the physical layer protocol, and send the encoded data through the first interface, and meanwhile, the switching unit 5012 receives the data from the first interface, decodes the data according to the ethernet communication protocol and the physical layer protocol, and sends the decoded data to the transmission engine 5013.
Of course, it is also possible that the operation of determining the first interface based on the identification of the object to be accessed and the interconnection information is performed by the transmission engine 5013, and then the identification of the first interface is transferred to the switching unit 5012 by the transmission engine 5013.
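As a rough software analogy of this hand-off (not the hardware design itself), the following Python sketch shows the transmission engine encoding at the transport/network layer and the switching unit adding Ethernet framing before sending through the selected interface; the class names, header strings, and the target-to-interface mapping are illustrative assumptions.

```python
# Sketch of the transmit-path division of labor described above: the
# transmission engine encodes at the transport/network layer, then the
# switching unit determines the first interface, adds the Ethernet framing,
# and emits the frame. Header contents here are textual placeholders.

class EthInterface:
    def __init__(self, name):
        self.name = name
    def send(self, frame: bytes):
        print(f"{self.name} -> {frame[:40]!r}")   # physical layer not modeled

class TransmissionEngine:
    def encode(self, payload: bytes, protocol: str) -> bytes:
        # Encode according to the specified transport/network layer protocol.
        if protocol == "rocev2":
            return b"IB|UDP|IP|" + payload        # RoCEv2: IB transport over UDP/IP
        return b"PROP|" + payload                 # proprietary header only

class SwitchingUnit:
    def __init__(self, interfaces, interconnect_info):
        self.interfaces = interfaces
        self.interconnect_info = interconnect_info  # target id -> interface index

    def send(self, target_id: str, encoded: bytes):
        # Determine the first interface and continue encoding with the
        # Ethernet (MAC) framing before handing the frame to the interface.
        intf = self.interfaces[self.interconnect_info[target_id]]
        intf.send(b"MAC|" + encoded)

if __name__ == "__main__":
    engine = TransmissionEngine()
    unit = SwitchingUnit([EthInterface("if0"), EthInterface("if1")],
                         {"gpgpu1/app7": 1})
    unit.send("gpgpu1/app7", engine.encode(b"first data to be written", "rocev2"))
```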
It should be noted that the communication protocols referred to herein are generally protocol families. For example, as mentioned above, the OSI model includes multiple layers, each involving multiple communication protocols, e.g., ARP, RARP, IEEE 802.3, PPP, CSMA/CD, RoCE, and so on at the data link layer; IP, ICMP, RIP, IGMP, and so on at the network layer; and TCP, UDP, and so on at the transport layer. The data transmission implemented by the switching module 501 requires encoding and decoding according to the OSI model or another model specification.
In some embodiments, at least a portion of the switching module 501 may be integrated into a network card processor, and the interface 5011 may sometimes be integrated into the network card processor.
Fig. 6 shows a more detailed functional block diagram of the switching module in fig. 5. As shown in the figure, the switching module 501 includes a switching unit 5012, a transmission engine 5013, and a plurality of interfaces 5011. The transmission engine 5013 includes a RoCE protocol processing module 601, a proprietary protocol processing module 602, and a driver module 603. The switching unit 5012 includes an ethernet processing unit 6031 and a plurality of MAC controllers 6032.
The RoCE protocol processing module 601 adopts the RoCEv2 communication protocol to implement the transport layer function, while the proprietary protocol processing module 602 uses a customized proprietary protocol. The RoCE protocol processing module 601 further includes a TOE unit 6011, an IB unit 6012, and a Verbs interface 6013. The proprietary protocol processing module 602 further includes a proprietary protocol driver interface 6021 and a proprietary protocol unit 6022. The driver module 603 is the hardware driver; if the transmission engine 5013 performs the operation of determining the first interface according to the identifier of the target to be accessed and the interconnection information, the driver module 603 may determine the first interface accordingly and pass the identifier information of the first interface to the other functional modules. Of course, the present disclosure is not limited thereto.
The TOE unit 6011 implements data encoding and decoding for conventional network layer and transport layer communication protocols, which include one or more of IP, UDP, DHCP, ICMP, and ARP. When the TOE unit 6011 receives a data packet from the switching unit 5012, it strips header information, such as the IP header, UDP header, or ARP header, from the received data. It also checks for header type errors and packet errors according to the communication protocol type; only error-free packets are sent to the IB unit 6012. When it receives a packet from the IB unit 6012, it adds the headers appropriate to the communication protocol type and sends the packet, with headers added, to the Ethernet processing unit 6031.
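A greatly simplified software model of the receive-side behavior described above might look like the following sketch; the assumed header layout (IPv4 without options, UDP) and the checks shown are illustrative only and omit most of what a real TOE performs.

```python
# Simplified model of the TOE unit on the receive path: strip IP/UDP
# headers, check basic consistency, and forward only error-free packets
# to the IB unit. Real TOE hardware performs far more thorough checks.

import struct

def toe_receive(packet: bytes):
    # Minimal IPv4 header parse (20 bytes, no options assumed).
    if len(packet) < 28:
        return None                                   # too short: drop
    version_ihl, = struct.unpack_from("!B", packet, 0)
    if version_ihl >> 4 != 4:
        return None                                   # not IPv4: drop
    proto, = struct.unpack_from("!B", packet, 9)
    if proto != 17:
        return None                                   # not UDP: drop
    udp_len, = struct.unpack_from("!H", packet, 24)
    payload = packet[28:28 + udp_len - 8]
    if len(payload) != udp_len - 8:
        return None                                   # length mismatch: drop
    return payload                                    # forwarded to the IB unit

if __name__ == "__main__":
    ip = bytes([0x45]) + b"\x00" * 8 + bytes([17]) + b"\x00" * 10  # 20-byte IPv4 header
    udp = b"\x12\xb7\x12\xb7" + struct.pack("!H", 12) + b"\x00\x00" + b"PAYL"
    print(toe_receive(ip + udp))   # -> b'PAYL'
```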
The IB unit 6012 implements a data transfer function based on the IB protocol. The IB unit 6012 parses Work Queue Elements (WQEs) and then performs the send-message, receive-message, read/write, and atomic operations of the RoCEv2 protocol according to the work queue element, to achieve end-to-end data transmission. Work queue elements are written by other processing units, for example by the computing unit 502 or the memory controller 504. The IB unit 6012 also checks data integrity according to the Packet Sequence Number (PSN) of each packet.
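The consumption of work queue elements can be modeled roughly as below; the WQE fields, opcodes, and 24-bit PSN arithmetic are a simplified assumption rather than the actual InfiniBand/RoCEv2 descriptor format.

```python
# Rough model of the IB unit: consume work queue elements written by the
# computing unit or memory controller, dispatch them by opcode, and track
# packet sequence numbers (PSNs) for integrity checking. Illustrative only.

from collections import deque

OPCODES = {"SEND", "RECV", "RDMA_READ", "RDMA_WRITE", "ATOMIC"}

class IBUnit:
    def __init__(self):
        self.work_queue = deque()   # WQEs posted by other processing units
        self.next_tx_psn = 0        # PSN for the next outgoing packet
        self.expected_rx_psn = 0    # PSN expected on the next incoming packet

    def post_wqe(self, opcode, local_addr, remote_addr, length):
        assert opcode in OPCODES
        self.work_queue.append((opcode, local_addr, remote_addr, length))

    def process_one(self, transmit):
        # 'transmit' stands in for the lower layers (TOE unit / switching
        # unit) that actually encode and emit the packet.
        opcode, local_addr, remote_addr, length = self.work_queue.popleft()
        transmit(opcode, local_addr, remote_addr, length, psn=self.next_tx_psn)
        self.next_tx_psn = (self.next_tx_psn + 1) & 0xFFFFFF   # 24-bit PSN

    def check_incoming(self, psn):
        # Integrity check: incoming packets must carry consecutive PSNs.
        ok = psn == self.expected_rx_psn
        if ok:
            self.expected_rx_psn = (self.expected_rx_psn + 1) & 0xFFFFFF
        return ok

if __name__ == "__main__":
    ib = IBUnit()
    ib.post_wqe("RDMA_WRITE", 0x1000, 0x2000, 64)
    ib.process_one(lambda *a, **kw: print("tx", a, kw))
    print(ib.check_incoming(0), ib.check_incoming(5))
```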
The Verbs interface 6013 comprises the various interfaces between the driver and the hardware. It includes memory read/write and atomic operations based on the RoCEv2 protocol, which are performed without going through the processor.
The proprietary protocol driver interface 6021 provides the driver and hardware interfaces for the customized proprietary protocol, which operates above the data link layer and does not include IP/TCP/UDP processing, thereby reducing the resource overhead of handling heavy protocol headers. The proprietary protocol encodes the target address, security key, and Packet Sequence Number (PSN) of its operation directly after the MAC frame header.
As shown in the figure, the switching unit 5012 includes an Ethernet processing unit 6031 and a plurality of MAC controllers 6032. The MAC controller 6032 implements the data link layer function: it receives packets from the Ethernet processing unit 6031 and, after processing them according to the data link layer protocol specification, sends them to the interface 5011; likewise, it receives packets from the interface 5011, processes them, and sends them to the Ethernet processing unit 6031. The MAC controller 6032 also checks the frame data for errors and discards corrupted packets.
The Ethernet processing unit 6031 is the functional layer of the data link layer and is configured to encode and decode data according to the Ethernet protocol. When sending data, the Ethernet processing unit 6031 obtains the MAC address of the first interface, encodes the data according to that MAC address, and forwards the encoded data to one of the MAC controllers 6032 according to the MAC address. When receiving data, the Ethernet processing unit 6031 parses and verifies the received data packet; after verification, it sends the packet to the RoCE protocol processing module 601 or the proprietary protocol processing module 602 according to the indication information contained in the MAC header. The indication information in the MAC header identifies the communication protocol used by the data at the transport layer.
In this embodiment, the RoCEv2 protocol is used to transfer buffers directly between the applications at the two ends, without intervention by an operating system or a protocol stack, so that ultra-low-latency and ultra-high-throughput data transmission can be achieved while consuming essentially none of the computing unit's processing resources.
It should be noted that RoCEv2 is the direct memory access technique used in this embodiment; other direct memory access techniques may also be used. Direct memory access is generally applied to computer systems, but this embodiment applies it to a general-purpose graphics processing system, which can improve the computing performance of the general-purpose graphics processing system.
FIG. 7 is a diagram of an exemplary proprietary protocol message structure. As shown in the figure, the message of the proprietary protocol is divided into a MAC header and the data itself. The MAC header includes a receiver MAC address, a sender MAC address, and an ethertype. Taking the above embodiment as an example: when the computing unit 502 or the memory controller 504 writes data to another general-purpose graphics processing system, the sender MAC address is the MAC address of the first interface 5011 determined from the identifier of the target to be accessed, and the receiver MAC address is the MAC address of the second interface of the other general-purpose graphics processing system connected to the first interface 5011. The ethertype indicates the protocol type used by the layer above; since the switching unit 5012 processes the communication protocols of the layers in the top-down order of the OSI model, when writing data the protocol of the layer above, for example the network layer, is the IP protocol, the TCP protocol, or the custom proprietary protocol. The data itself includes a target address, a security key, and the data, divided into groups, to be written to the target address. The security key provides security protection as part of the data integrity function.
The encoding and decoding of the MAC header is performed by the Ethernet processing unit 6031; for example, after the Ethernet processing unit 6031 receives a packet from the proprietary protocol engine 6015, it encodes the MAC header in front of the packet, while the encoding and decoding of the packet body is performed by the proprietary protocol engine 6015.
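For illustration only, the byte layout of FIG. 7 might be packed as in the following sketch; the field widths (6-byte MAC addresses, 2-byte ethertype, 8-byte target address, 4-byte security key, 3-byte PSN) and the EtherType value are assumptions, since the disclosure does not fix these sizes.

```python
# Illustrative packing of a proprietary-protocol message as in FIG. 7:
# MAC header (receiver MAC, sender MAC, ethertype) followed by the data
# part (target address, security key, PSN, payload). Field widths assumed.

import struct

ETHERTYPE_PROPRIETARY = 0x88B5   # placeholder experimental EtherType

def pack_message(dst_mac: bytes, src_mac: bytes, target_addr: int,
                 security_key: int, psn: int, payload: bytes) -> bytes:
    assert len(dst_mac) == 6 and len(src_mac) == 6
    mac_header = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_PROPRIETARY)
    body = struct.pack("!QI", target_addr, security_key)
    body += psn.to_bytes(3, "big") + payload
    return mac_header + body

def unpack_message(frame: bytes):
    dst_mac, src_mac = frame[0:6], frame[6:12]
    ethertype, = struct.unpack_from("!H", frame, 12)
    target_addr, security_key = struct.unpack_from("!QI", frame, 14)
    psn = int.from_bytes(frame[26:29], "big")
    return dst_mac, src_mac, ethertype, target_addr, security_key, psn, frame[29:]

if __name__ == "__main__":
    frame = pack_message(b"\x02\x00\x00\x00\x00\x02", b"\x02\x00\x00\x00\x00\x01",
                         target_addr=0x1000, security_key=0xDEADBEEF,
                         psn=7, payload=b"grouped data")
    print(unpack_message(frame))
```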
FIG. 8 is a schematic diagram of the interconnect structure of the general purpose graphics processing system shown in FIG. 5. As shown in the figure, the interconnect structure includes compute nodes 800 and 900. The compute node 800 comprises two task units 801 and 802. The task unit 801 includes a memory 8011, a processor 8012, and general purpose graphics processing systems 8013 and 8014 implemented in accordance with embodiments of the disclosure, coupled by a bus. The task unit 802 includes a memory 8022, a processor 8021, and general purpose graphics processing systems 8023 and 8024 coupled by a bus. The processors 8012 and 8021 are coupled. The compute node 900 includes task units 901 and 902. The task unit 901 includes a memory 9011, a processor 9012, and general purpose graphics processing systems 9013 and 9014 coupled by a bus. The task unit 902 includes a memory 9021, a processor 9022, and general purpose graphics processing systems 9023 and 9024 coupled by a bus. Processors 9012 and 9022 are coupled by a bus.
In the figure, compute nodes 800 and 900 are coupled by means of interfaces internal to the respective internal general purpose graphics processing system. The computing node 800 is shown coupled to general purpose graphics processing systems 9013, 9014, 9023, and 9024 in the computing node 900, respectively, using an interface in the general purpose graphics processing system 8013.
Based on this interconnection structure, the computing nodes communicate with each other. For example, the processor 8012 may access caches in the general purpose graphics processing system 9013 via the general purpose graphics processing system 8013 or 8014. As another example, if both general purpose graphics processing systems 8013 and 9013 are deployed with RDMA-capable memory controllers, the memory controller of general purpose graphics processing system 8013 may write data directly to, or read data directly from, general purpose graphics processing system 9013.
Fig. 9 and 10 are schematic diagrams of applications of a computing node incorporating an embodiment of the present disclosure, respectively.
As shown in FIG. 9, the computing nodes 11 to 14 are integrated into one integrated device through a Printed Circuit Board (PCB) 1, and the computing nodes N1 to N4 are integrated into another integrated device through printed circuit board N. Inside each integrated device, coupled communication is achieved through the switching modules. For scale-up capacity, the integrated devices communicate through the printed circuit board, where board-level transmission can reach a rate of 800G. For scale-out capacity, each integrated device can provide 100G of outbound transmission capacity to the bridging equipment of the data center through the switching modules in its computing nodes.
Alternatively, each compute node may be implemented as a system on chip (SoC), and the interconnect structure formed by the compute nodes may be packaged as an integrated device.
FIG. 10 shows a 3D-Torus interconnection network. Within each computing node, a NOC and an NIU are employed to couple together the cache, the general purpose graphics processing system, and the switching module. A NOC (network on chip) may be defined as a multi-processing system based on network communications implemented on a single chip. Using a network instead of a bus gives the NOC the following advantages: 1) good address space scalability, with no theoretical limit on the number of resource nodes that can be integrated; 2) good parallel communication capability, which improves data throughput and overall performance; 3) packet switching as the basic communication technique, using a globally asynchronous, locally synchronous scheme. The NIU (NOC Interface Unit) is a NOC interface unit that provides transaction-layer interconnect services between IP cores.
In terms of architecture, each switching module is interconnected with adjacent computing nodes through its plurality of interfaces, and the 3D-Torus interconnection network is built entirely by the switching modules within the computing nodes, without relying on external switching equipment.
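The wrap-around neighbor relationship of such a 3D-Torus can be expressed as in the sketch below; the coordinate-based node naming and the grid dimensions are assumptions used only for illustration.

```python
# Sketch of 3D-Torus neighbor computation: each compute node at (x, y, z)
# connects, through its switching module's interfaces, to the two adjacent
# nodes along each of the three dimensions, with wrap-around.

def torus_neighbors(x, y, z, dim_x, dim_y, dim_z):
    return [
        ((x + 1) % dim_x, y, z), ((x - 1) % dim_x, y, z),
        (x, (y + 1) % dim_y, z), (x, (y - 1) % dim_y, z),
        (x, y, (z + 1) % dim_z), (x, y, (z - 1) % dim_z),
    ]

if __name__ == "__main__":
    # A node on the edge of a 4x4x4 torus still has six neighbors.
    print(torus_neighbors(0, 0, 3, 4, 4, 4))
```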
In summary, according to the embodiments of the present disclosure, the general graphics processing system integrated with the switching module is not a pure end device but has networking and data switching capabilities, so networking does not rely on external switches and routers, and the computing performance of the general graphics processing system is improved.
In a further embodiment, the general graphics processing system integrates a lightweight RDMA protocol and a custom proprietary protocol, which improves efficiency, provides strong scalability, reduces the overall resource overhead of the system, and increases the level of system integration.
It should be emphasized that the disclosed embodiments place no limitation on the manufacturing process of the general purpose graphics processing system. For example, the switching module and the other devices may be designed and manufactured separately as individual components and then packaged together in an integration process, or the switching module and the other devices may be formed integrally as a single component. Furthermore, at the manufacturing stage, the computing unit, the cache, the memory controller, the switching module, and the connection unit may be implemented on the same wafer, and further on one or more dies; for example, the computing unit, the cache, the memory controller, and the connection unit may be implemented on one die and the switching module on another die.
Data structure in embodiments of the present disclosure
Referring to Table 1, the interconnection information defines the interconnection relationships among a plurality of general-purpose graphics processing systems. The local application identification identifies the application program that initiated the data operation request, and the target application identification identifies the application program that receives the data operation request. Both the local application identification and the target application identification include an identification of the graphics processing unit to which the respective application program belongs. The interface is an interface of the graphics processing unit to which the application belongs.
Table 1 interconnection information table
[Table 1 is provided as an image in the original publication; its columns correspond to the local application identification, the target application identification, and the interface described above.]
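A software view of this interconnection information could resemble the following sketch; the tuple-keyed dictionary, the AppId type, and the sample values are assumptions that merely mirror the columns described above.

```python
# Sketch of the interconnection information of Table 1: each entry maps a
# (local application identification, target application identification)
# pair to the interface through which the target can be reached. Each
# identification includes the graphics processing unit id and the app id.

from typing import NamedTuple

class AppId(NamedTuple):
    gpu_id: int
    app_id: int

# (local app, target app) -> outgoing interface index; illustrative values.
INTERCONNECT_INFO = {
    (AppId(0, 1), AppId(1, 3)): 0,
    (AppId(0, 1), AppId(2, 5)): 1,
}

def lookup_interface(local_app: AppId, target_app: AppId) -> int:
    # The switching module consults this table to pick the first interface.
    return INTERCONNECT_INFO[(local_app, target_app)]

if __name__ == "__main__":
    print(lookup_interface(AppId(0, 1), AppId(2, 5)))  # -> 1
```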
Method implemented in the above general graphics processing system
FIG. 11 shows a flow diagram of a general purpose graphics processing system implementing communication, according to one embodiment of the present disclosure. The flowchart includes steps S100 to S300, which may be implemented in a switching module as shown in FIG. 5.
In step S100, an identifier of a target to be accessed and a source address of data to be written are received.
In step S200, a first interface of the plurality of interfaces is determined according to the identifier of the object to be accessed and the pre-stored interconnection information.
In step S300, the data to be written is read from the cache according to the source address, and is sent via the first interface.
In some implementations, the data to be written and the target address are received via a second interface of the plurality of interfaces of the general purpose graphics processing system and the data to be written is written to the target address.
In some implementations, prior to sending the data to be written over the first interface, a particular communication protocol is selected from a plurality of different communication protocols, and the data to be written is then encoded according to the particular communication protocol to send the encoded data over the first interface.
In some implementations, the plurality of communication protocols include an RDMA RoCEv2 communication protocol that enables end-to-end data transfer at the transport layer and a custom proprietary protocol that encodes the first data directly after the MAC header, i.e., omitting the header overhead of TCP/UDP/IP.
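The choice between the two protocols might be modeled as in the sketch below; the selection policy shown (a per-target configuration with a RoCEv2 default) is purely an assumption, since the disclosure does not specify how the protocol is chosen.

```python
# Sketch of selecting the transport/network layer protocol before sending.
# The policy shown (a hypothetical per-target configuration with a default)
# is an assumption; the disclosure only states that one protocol is selected
# from the several that the transmission engine supports.

SUPPORTED = ("rocev2", "proprietary")

def select_protocol(target_id: str, target_config: dict) -> str:
    protocol = target_config.get(target_id, "rocev2")
    if protocol not in SUPPORTED:
        raise ValueError(f"unknown protocol {protocol!r} for {target_id}")
    return protocol

if __name__ == "__main__":
    cfg = {"gpgpu1/app7": "proprietary"}            # hypothetical configuration
    print(select_protocol("gpgpu1/app7", cfg))      # -> proprietary
    print(select_protocol("gpgpu2/app3", cfg))      # -> rocev2
```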
In some implementations, the source address and the target address are, respectively, the specific storage addresses of the source application program and the target application program in their respective application memory spaces.
Commercial value of the disclosed embodiments
Conventional general purpose graphics processing systems do not have networking and switching capabilities. Embodiments of the present disclosure provide a general purpose graphics processing system with networking and switching capabilities, whereby a distributed system for model training can be constructed using the general purpose graphics processing system without the need for external switches and routers. Therefore, the embodiment of the disclosure has application prospect and commercial value.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as systems, methods and computer program products. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code), or in the form of a combination of software and hardware. Furthermore, in some embodiments, the present disclosure may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium is, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the foregoing. In this context, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a processing unit, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
Computer program code for carrying out embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The above description covers only preferred embodiments of the present disclosure and is not intended to limit the present disclosure; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (22)

1. A general purpose graphics processing system, comprising:
a computing unit;
a cache;
a storage controller coupled to the cache;
a switching module comprising a plurality of interfaces and configured to receive an identifier of a target to be accessed and a source address of first data to be written, determine a first interface among the plurality of interfaces according to the identifier of the target to be accessed and pre-stored interconnection information, read the first data to be written from the cache according to the source address, and send the first data to be written via the first interface; and
a connection unit configured to couple the computing unit, the storage controller, the cache, and the switching module.
2. The general purpose graphics processing system of claim 1, wherein the identifier of the target to be accessed and the source address come from a data operation request submitted by the computing unit or the storage controller.
3. The general purpose graphics processing system of claim 1, wherein the switching module further comprises:
a transport engine configured to encode the first data to be written according to a specified transport layer/network layer communication protocol; and
a switching unit configured to determine the first interface according to the identifier of the target to be accessed and the interconnection information, further encode the first data to be written according to an Ethernet communication protocol and a physical layer protocol, and send the encoded data via the first interface.
4. The general purpose graphics processing system of claim 3, wherein the transport engine supports a plurality of transport layer/network layer communication protocols, and wherein the transport engine selects the designated transport layer/network layer communication protocol from the plurality of transport layer/network layer communication protocols to encode the first data to be written.
5. The general purpose graphics processing system of claim 4, wherein the transport engine comprises:
a RoCE protocol processing module configured to encode the first data to be written based on the RoCEv2 communication protocol; and
a proprietary protocol processing module configured to encode the first data to be written based on a proprietary protocol.
6. The general purpose graphics processing system of claim 5, wherein the RoCE protocol processing module comprises a TOE unit configured to encode the first data to be written according to the IP/TCP/UDP protocols.
7. The general purpose graphics processing system of claim 5, wherein the RoCE protocol processing module comprises an IB unit configured to establish end-to-end data transmission based on the IB protocol.
8. The general purpose graphics processing system of claim 5, wherein the RoCE protocol processing module comprises a Verbs interface.
9. The general purpose graphics processing system of claim 5, wherein the proprietary protocol processing module comprises a proprietary protocol unit configured to encode and decode data according to a proprietary protocol, and a proprietary protocol driver interface configured to provide a driver and a hardware interface.
10. The general purpose graphics processing system of claim 1, wherein the identifier of the target to be accessed comprises an identifier of a target graphics processing system connected via the first interface and an identifier of a target application.
11. The general purpose graphics processing system of claim 1, wherein the switching module is further configured to receive second data to be written via a second interface, determine a target address according to the identifier of the target to be accessed, and write the second data to be written to the target address.
12. The general purpose graphics processing system of claim 11, wherein the source address and the target address are respectively the specific storage addresses of the source application and the target application in their respective application memory spaces.
13. The general purpose graphics processing system of claim 1, wherein the switching module is integrated as a network card processor.
14. The general purpose graphics processing system of claim 10, wherein the target graphics processing system and the general purpose graphics processing system are located in different computing nodes.
15. The general purpose graphics processing system of claim 1, wherein the plurality of interfaces are Ethernet interfaces.
16. A computing device comprising a plurality of computing nodes, each computing node comprising a memory, a general purpose processor, and the general purpose graphics processing system of any one of claims 1 to 15 that are coupled to one another, wherein each computing node is coupled to the general purpose graphics processing system of at least one other computing node through its own general purpose graphics processing system.
17. The computing device of claim 16, wherein the computing nodes are packaged in the same silicon die, and a plurality of the computing nodes are integrated together by a printed circuit board.
18. The computing device of claim 16, wherein the computing device is configured to perform a training task for a deep learning model.
19. A distributed system comprising a plurality of computing devices according to any one of claims 16 to 18, wherein data is transferred between the computing devices via an external bridging device.
20. A distributed system comprising a plurality of computing nodes, each computing node comprising a memory, a general purpose processor, and the general purpose graphics processing system of any one of claims 1 to 15 that are coupled to one another, wherein each computing node is coupled, through its own general purpose graphics processing system, to the general purpose graphics processing system of at least one neighboring computing node so as to form a 3D-Torus interconnection network.
21. A cloud server comprising the computing device of any one of claims 16 to 18.
22. A method implemented in a general purpose graphics processing system, comprising:
receiving an identifier of a target to be accessed and a source address of data to be written;
determining a first interface among a plurality of interfaces according to the identifier of the target to be accessed and pre-stored interconnection information; and
reading the data to be written from a cache according to the source address, and sending the data to be written via the first interface.
CN202010787539.1A 2020-08-07 2020-08-07 General graphic processing system, computing device and distributed system Pending CN114066707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010787539.1A CN114066707A (en) 2020-08-07 2020-08-07 General graphic processing system, computing device and distributed system


Publications (1)

Publication Number Publication Date
CN114066707A true CN114066707A (en) 2022-02-18

Family

ID=80232782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010787539.1A Pending CN114066707A (en) 2020-08-07 2020-08-07 General graphic processing system, computing device and distributed system

Country Status (1)

Country Link
CN (1) CN114066707A (en)

Similar Documents

Publication Publication Date Title
US20210051045A1 (en) Communication switching apparatus for switching data in multiple protocol data frame formats
US8369347B2 (en) Fiber channel over Ethernet and fiber channel switching based on Ethernet switch fabrics
US9680770B2 (en) System and method for using a multi-protocol fabric module across a distributed server interconnect fabric
US9401876B2 (en) Method of data delivery across a network fabric in a router or Ethernet bridge
US8649387B2 (en) Method and system for fibre channel and ethernet interworking
US9294569B2 (en) Cell fabric hardware acceleration
US7830875B2 (en) Autonegotiation over an interface for which no autonegotiation standard exists
US7804840B2 (en) Combined FCoE network device
US11368395B2 (en) System, method and apparatus for storage controller having multiple heterogeneous network interface ports
US7269661B2 (en) Method using receive and transmit protocol aware logic modules for confirming checksum values stored in network packet
US8589776B2 (en) Translation between a first communication protocol and a second communication protocol
US20110010522A1 (en) Multiprocessor communication protocol bridge between scalar and vector compute nodes
CN113746749A (en) Network connection device
US20220398207A1 (en) Multi-plane, multi-protocol memory switch fabric with configurable transport
US20090073970A1 (en) System and method for parsing frames
CN114124787A (en) Data sending method, device, equipment and hybrid network
CN116760911A (en) Heterogeneous protocol conversion system and method
CN114066707A (en) General graphic processing system, computing device and distributed system
CN113938443B (en) Wireless internet of things protocol switch
WO2011057447A1 (en) Router and cluster router
CN115001627B (en) InfiniBand network subnet management message processing method and system
US20230080535A1 (en) Network Path Testing via Independent Test Traffic
KR20220157322A (en) Methods and systems for service state replication using original data packets
CN115701063A (en) Message transmission method and communication device
CN117914808A (en) Data transmission system, method and switch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20240226
Address after: 5th Floor, No. 2, Lane 55, Chuanhe Road, No. 366 Shangke Road, Pudong New Area Free Trade Pilot Zone, Shanghai
Applicant after: Pingtouge (Shanghai) semiconductor technology Co.,Ltd.
Country or region after: China
Address before: 847, 4 / F, capital tower 1, Grand Cayman, British Cayman Islands
Applicant before: ALIBABA GROUP HOLDING Ltd.
Country or region before: United Kingdom