CN113946537A - Accelerating device and server - Google Patents

Info

Publication number
CN113946537A
CN113946537A (application number CN202111199059.4A)
Authority
CN
China
Prior art keywords
gpu
controller
cpu
bus
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111199059.4A
Other languages
Chinese (zh)
Inventor
白秀杨
叶丰华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Power Commercial Systems Co Ltd
Original Assignee
Inspur Power Commercial Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Power Commercial Systems Co Ltd filed Critical Inspur Power Commercial Systems Co Ltd
Priority to CN202111199059.4A priority Critical patent/CN113946537A/en
Publication of CN113946537A publication Critical patent/CN113946537A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/382Information transfer, e.g. on bus using universal interface adapter
    • G06F13/385Information transfer, e.g. on bus using universal interface adapter for adaptation of a particular data processing system to different peripheral devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/42Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026PCI express

Abstract

The application discloses an acceleration device in the computer field, applied to a CPU. The device includes a GPU controller, a GPU, and an IB switch interface. The GPU controller is connected to the GPU through an NVLink bus and to the IB switch interface through an IB bus. The CPU transmits data tasks to be processed to the GPU controller over the IB bus; the GPU controller forwards them to the GPU over the NVLink bus for processing, performing the conversion between the IB and NVLink protocols. This interconnect architecture removes the data-transmission bottleneck of the 64 GB/s bidirectional PCIE bus in the existing CPU + GPU architecture, so that data transmitted and exchanged between the CPU memory and the GPU memory achieves bandwidth balance.

Description

Accelerating device and server
Technical Field
The present application relates to the field of computers, and in particular, to an acceleration apparatus for a CPU of a server.
Background
The surge of information in modern society continuously raises the demand on server computing power: servers are expected to deliver strong floating-point, matrix, and large-scale parallel computation. The architecture of the central processing unit (hereinafter, CPU) makes it better suited to general-purpose computing, and its efficiency on such parallel workloads is low. The graphics processing unit (hereinafter, GPU) compensates for the CPU's weaknesses in floating-point, matrix, and large-scale parallel computation, so the heterogeneous CPU + GPU computing architecture is currently an efficient solution for data-intensive workloads. In the existing server CPU + GPU heterogeneous architecture, the CPU is interconnected with the GPU through a PCI Express (hereinafter, PCIE) bus. PCIE is a high-speed serial point-to-point bus that supports x1, x2, x4, x8, and x16 lane configurations depending on device bandwidth; typically a GPU is interconnected with the CPU through x16 lanes (16 PCIE lanes), or multiple GPUs are attached below one CPU through a PCIE Switch chip.
With the CPU and GPU interconnected over PCIE, the latest PCIE 4.0 runs at 16 GT/s per lane, so a bidirectional x16 link has a total bandwidth of 64 GB/s. Meanwhile, the data-transmission bandwidth between the CPU and the CPU memory is about 500 GB/s, and between the GPU and the GPU memory about 512 GB/s. The PCIE bandwidth between CPU and GPU is thus only about 1/8 of the CPU-to-memory bandwidth, so the bottleneck for CPU-GPU data transfer lies on the PCIE bus from the CPU to the GPU (or from the CPU to the PCIE switch unit), which greatly limits transfer bandwidth.
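The arithmetic behind these figures can be checked directly. A minimal sketch — the 128b/130b encoding factor is standard for PCIE 4.0 and is an addition here, not a figure from the patent text:

```python
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding -> ~2 GB/s effective per lane, per direction
per_lane_gb_s = 16 * (128 / 130) / 8    # GB/s, one lane, one direction
x16_one_way = per_lane_gb_s * 16        # ~31.5 GB/s
x16_bidirectional = x16_one_way * 2     # ~63 GB/s, commonly quoted as 64 GB/s

cpu_mem_bw = 500                        # GB/s, CPU <-> CPU memory (figure from the text)
ratio = x16_bidirectional / cpu_mem_bw  # roughly 1/8, as stated above

print(round(x16_bidirectional), round(ratio, 3))  # 63 0.126
```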
Therefore, how to solve the bandwidth bottleneck problem of data transmission from the CPU to the GPU is an urgent technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide an accelerating device for solving the bandwidth bottleneck problem of data transmission from a CPU to a GPU.
In order to solve the above technical problem, the present application provides an acceleration apparatus, which is applied to a CPU, and includes:
a GPU controller, a GPU, an IB switching interface,
the GPU controller is connected with the GPU through an NVLink bus;
the GPU controller is connected with the IB exchange interface through an IB bus;
the GPU controller is used for realizing the conversion between an IB protocol and an NVLink protocol;
and the CPU is connected with the GPU controller through the IB exchange interface and is used for sending data tasks to the GPU controller.
Preferably, the GPU controller is further configured to, after receiving the data task of the CPU through the IB exchange interface, allocate the data task to the GPU, and control the GPU to perform data processing on the data task.
Preferably, the method further comprises the following steps: and the GPU memory is connected with the GPU and used for caching the data of the GPU.
Preferably, the method further comprises the following steps:
a management controller controlling the interface;
the management controller is connected with the GPU controller and the GPU through a system management bus and is used for acquiring working voltage, temperature and power consumption state information of the GPU controller and the GPU;
the control interface is connected with the management controller through Ethernet;
the control interface is connected with the CPU and used for sending the working voltage, the temperature and the power consumption state information of the GPU controller and the GPU to the CPU.
Preferably, the method further comprises the following steps: and the power supply unit is used for supplying power to the GPU controller, the GPU memory and the management controller.
Preferably, the GPU controller is an ARM architecture based controller.
Preferably, the GPU controller internally integrates a HBM memory unit.
Preferably, the number of GPUs is plural.
The application also provides a server which comprises the accelerating device.
In the acceleration device provided by the application, InfiniBand (hereinafter abbreviated as IB) is a high-throughput, low-latency interconnect, and NVLink is a bus and communication protocol developed and introduced by NVIDIA. The GPU controller is connected to the GPU through an NVLink bus and to the IB switch interface through an IB bus; the NVLink bus bandwidth can reach 600 GB/s. The CPU transmits data tasks to be processed to the GPU controller over the IB bus via the IB switch interface, and the GPU controller forwards them to the GPU over the NVLink bus for data processing.
In addition, the application also provides a server which comprises the accelerating device and has the same effect as the accelerating device.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of an acceleration device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide an accelerating device, and the accelerating device is used for solving the problem of data transmission bandwidth bottleneck of a bidirectional PCIE bus bandwidth of 64GB/s in the existing CPU + GPU architecture.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of an acceleration device according to an embodiment of the present application, and as shown in fig. 1, the embodiment provides an acceleration device applied to a CPU, including:
GPU controller 1, GPU2, IB switch interface 3,
the GPU controller 1 is connected with the GPU2 through an NVLink bus;
the GPU controller 1 is connected with the IB exchange interface 3 through an IB bus;
the GPU controller 1 is used for realizing the conversion between an IB protocol and an NVLink protocol;
the CPU is connected to the GPU controller 1 through an IB switching interface 3, and is configured to send a data task to the GPU controller 1.
The CPU is one of the main components of an electronic computer and its core accessory: it interprets computer instructions and processes the data in computer software, being responsible for reading, decoding, and executing instructions. Its architecture suits general-purpose computation, and its efficiency on massively parallel workloads is low. The GPU, designed for high throughput, provides a many-core parallel computing infrastructure: its very large number of cores supports parallel computation over large volumes of data, with high memory-access speed and strong floating-point capability. The heterogeneous CPU + GPU computing architecture is therefore an efficient solution for workloads with heavy data processing.
NVLink is a bus and communication protocol developed and introduced by NVIDIA. It uses a point-to-point topology with serial transmission; it connects a central processing unit (CPU) to a graphics processing unit (GPU) and can also interconnect multiple graphics processing units. The IB bus offers extremely high throughput and extremely low latency for computer-to-computer data interconnects; it also serves as a direct or switched interconnect between servers and storage systems, and between storage systems.
It should be noted that the GPU controller 1 mentioned in this embodiment is a device for implementing conversion between the NVLink protocol and the IB protocol, and this embodiment does not limit the method for implementing protocol conversion by the GPU controller 1, nor the composition architecture of the GPU controller 1, and the GPU controller 1 may be designed according to specific needs.
In the accelerator apparatus provided in this embodiment, the GPU controller 1 and the GPU2 are connected by an NVLink bus, the GPU controller 1 and the IB exchange interface 3 are connected by an IB bus, and the CPU is connected to the accelerator apparatus by the IB exchange interface 3.
In addition, the present embodiment is not limited to the power supply manner of the acceleration device, and may be designed according to specific situations, for example, a built-in power supply is used for supplying power to the acceleration device, or an external power supply is connected to supply power to the acceleration device.
Specifically, the CPU transmits data tasks to be processed to the GPU controller 1 over the IB bus via the IB switch interface 3; the GPU controller 1 forwards the received tasks to the GPU2 and controls the GPU2 to perform the data processing; when the GPU2 produces a result, it is returned to the CPU through the GPU controller 1. Because of the high bandwidth of the NVLink and IB buses, data transfer between the CPU and the GPU2 is no longer limited by the 64 GB/s PCIE bus bandwidth, and data transmitted and exchanged between the CPU memory and the GPU memory 4 achieves bandwidth balance.
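The round trip just described — CPU → IB switch interface 3 → GPU controller 1 → NVLink → GPU2, with the result returned along the same path — can be sketched as a toy model. All class and method names below are illustrative assumptions; the patent does not specify how the protocol conversion is implemented:

```python
class GPU:
    """Stands in for GPU2: consumes a task delivered over NVLink."""
    def process(self, task):
        return f"result({task})"

class GPUController:
    """Stands in for GPU controller 1: bridges IB (CPU side) and NVLink (GPU side)."""
    def __init__(self, gpu):
        self.gpu = gpu
    def handle_ib_request(self, ib_payload):
        nvlink_task = self.ib_to_nvlink(ib_payload)   # IB -> NVLink conversion
        result = self.gpu.process(nvlink_task)        # dispatch over NVLink
        return self.nvlink_to_ib(result)              # NVLink -> IB for the reply
    def ib_to_nvlink(self, payload):
        return payload.removeprefix("IB:")            # toy framing, not a real protocol
    def nvlink_to_ib(self, result):
        return "IB:" + result

ctrl = GPUController(GPU())
print(ctrl.handle_ib_request("IB:matmul"))  # IB:result(matmul)
```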
According to the above embodiments, in order to save the computing resources of the CPU, the present embodiment provides a preferable solution, and the GPU controller 1 is further configured to, after receiving the data task of the CPU through the IB switch interface 3, allocate the data task to the GPU2, and control the GPU2 to perform data processing on the data task.
In the existing CPU + GPU heterogeneous computing architecture, before the CPU transmits a data task to be computed to the GPU2, corresponding allocation and scheduling control work needs to be completed in the CPU.
Specifically, in the preferred scheme provided in this embodiment, the CPU transmits all data tasks to the GPU controller 1, and the GPU controller 1 completes the allocation and scheduling control work that would otherwise be performed in the CPU, thereby saving the CPU's computational resources.
According to the foregoing embodiment, in order to improve the work efficiency of the GPU2, this embodiment provides a preferable solution, and the acceleration apparatus further includes: the GPU memory 4 is connected to the GPU2, and the GPU memory 4 is used for caching data of the GPU 2.
The GPU2 is more computationally powerful than the CPU but is equipped with far less memory: a CPU typically carries 64 GB to 256 GB of memory, while a GPU2 carries 4 GB to 12 GB. Because the GPU2 performs large volumes of data operations, it needs sufficient memory for data caching; transferring the data a computation task requires into the GPU memory 4 before the task executes on the GPU2 makes full use of the CPU-GPU2 transmission bandwidth, reduces time spent on data preparation, and improves the GPU2's computational efficiency.
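Staging data into GPU memory before the computation runs, as described above, is essentially a prefetch that overlaps transfer with compute. A hedged sketch, with a thread and a bounded queue standing in for the DMA engine and the limited-capacity GPU memory 4:

```python
import queue
import threading

def prefetch(chunks, buffer):
    """Copy each data chunk into the GPU memory buffer ahead of compute."""
    for chunk in chunks:
        buffer.put(chunk)        # stands in for a CPU-memory -> GPU-memory transfer
    buffer.put(None)             # sentinel: no more data

def compute(buffer, results):
    """Consume staged chunks; compute never waits for an unstaged transfer."""
    while (chunk := buffer.get()) is not None:
        results.append(sum(chunk))

buffer = queue.Queue(maxsize=4)  # bounded: models limited GPU memory capacity
results = []
t = threading.Thread(target=prefetch, args=([range(3), range(5)], buffer))
t.start()                        # transfers run while compute consumes
compute(buffer, results)
t.join()
print(results)  # [3, 10]
```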
According to the above embodiment, since the operating voltage, temperature and power consumption state information of the GPU controller 1 and the GPU2 may be affected due to long operating time or large amount of calculation when the GPU controller 1 and the GPU2 operate, and thus the service life of the GPU controller 1 and the GPU2 is affected, this embodiment provides a preferable solution, and the acceleration apparatus further includes:
a management controller 5, a control interface 6;
the management controller 5 is connected with the GPU controller 1 and the GPU2 through a system management bus and is used for acquiring working voltage, temperature and power consumption state information of the GPU controller 1 and the GPU 2;
the control interface 6 is connected with the management controller 5 through Ethernet;
the control interface 6 is connected to the CPU and is used to send the state information of the GPU controller 1 and the GPU2 to the CPU.
The management controller 5 mentioned in this embodiment is a device for acquiring the operating voltage, temperature, and power consumption state information of the GPU controller 1 and the GPU2. This embodiment limits neither the method by which the management controller 5 acquires that information nor its internal architecture; the management controller 5 may be designed according to specific needs.
In addition, the management controller 5 may acquire the operating voltage, temperature, and power consumption state information of the GPU controller 1 and the GPU2 by acquiring information of the voltage detection device, the temperature sensor, or the power consumption detection device provided in the GPU controller 1 and the GPU2, which is not limited to this manner, but merely provides a preferred embodiment.
A System Management Bus (hereinafter, SMBus) has only two signal lines: a bidirectional data line and a clock line. Although its data rate is only about 100 kbit/s, its structure is simple and its cost low, so it is commonly used to carry measurement results from various sensors. Ethernet is a local-area-network technology that specifies the physical-layer connection, electrical signaling, and media-access-layer protocol; it offers good compatibility, broad technical support, low cost, and high communication rates, and is currently the most widely used LAN technology.
Specifically, the management controller 5 acquires the operating voltage, temperature and power consumption state information of the GPU controller 1 and the GPU2 through the SMBus, and transmits the operating voltage, temperature and power consumption state information to the control interface 6 through the ethernet, the control interface 6 is connected to the CPU, and the CPU receives the state information of the GPU controller 1 and the GPU 2. After the CPU obtains the operating voltage, temperature, and power consumption state information of the GPU controller 1 and the GPU2, it may determine whether the operating states of the GPU controller 1 and the GPU2 are normal, and may control the GPU controller 1 and the GPU2 to stop working or control the cooling device to cool down in time.
The management controller 5 may acquire the operating voltage, temperature, and power consumption state information of the GPU controller 1 and the GPU2 either by real-time monitoring or at preset intervals; this embodiment imposes no specific requirement.
The management controller 5 monitors the state information of the GPU controller 1 and the GPU2, transmits the state information to the CPU, and the CPU determines whether the working states of the GPU controller 1 and the GPU2 are normal, so as to take measures in time to protect the GPU controller 1 and the GPU2 and prolong the service life.
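The monitoring loop above — the management controller 5 reports voltage, temperature, and power over SMBus, and the CPU decides whether to stop work or enable cooling — can be sketched as follows. The threshold values and field names are illustrative assumptions, not figures from the patent:

```python
def check_health(readings, limits):
    """Return the corrective actions the CPU might take for out-of-range readings."""
    actions = []
    for device, sample in readings.items():
        if sample["temp_c"] > limits["temp_c"]:
            actions.append((device, "enable_cooling"))   # overheating: cool in time
        if sample["power_w"] > limits["power_w"]:
            actions.append((device, "stop_work"))        # excessive draw: halt device
    return actions

# Illustrative samples as the management controller might report them over SMBus.
readings = {
    "gpu_controller": {"temp_c": 65, "power_w": 40},
    "gpu":            {"temp_c": 92, "power_w": 310},
}
limits = {"temp_c": 85, "power_w": 300}
print(check_health(readings, limits))
```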
According to the above embodiment, this embodiment provides a preferable scheme for supplying power to an acceleration device, and the acceleration device further includes: and the power supply unit 7 is used for supplying power to the GPU controller 1, the GPU2, the GPU memory 4 and the management controller 5.
The power supply unit 7 is responsible for supplying power to the acceleration device, and the GPU controller 1, the GPU2, the GPU memory 4 and the management controller 5 need power input when operating. The present embodiment does not limit the form of the power supply unit 7, and may be set as the case may be. The built-in power supply unit 7 makes the acceleration device more integrated.
According to the above embodiments, this embodiment provides a preferred solution of the architecture of the GPU controller 1, and the GPU controller 1 is a controller based on the ARM architecture.
A processor based on the ARM architecture is compact and well suited to embedded environments while retaining strong performance. The ARM architecture balances performance, power consumption, code density, and price, and executes instructions quickly, so a GPU controller 1 based on the ARM architecture is a preferred choice.
According to the above embodiments, this embodiment provides a preferred solution, and the GPU controller 1 internally integrates the HBM memory unit.
High Bandwidth Memory (HBM) is a high-performance DRAM based on a 3D stacking process, initiated by AMD and SK Hynix. It is a standardized stacked-memory technology that provides high-bandwidth channels within the stack and between memory and logic elements, making it well suited to applications with high memory-bandwidth demands. Integrating an HBM memory unit inside the GPU controller 1 as a data cache unit enables faster data processing.
According to the above embodiments, this embodiment provides a preferable scheme that the number of GPUs 2 is plural.
The GPU controller 1 may be connected with a plurality of GPUs 2, specifically, the CPU transmits data tasks to be calculated to the GPU controller 1 through the IB switch interface 3, the GPU controller 1 may allocate the data tasks to a plurality of GPUs 2 for calculation, and the GPUs 2 perform data processing simultaneously, thereby improving data processing efficiency.
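The fan-out just described — the GPU controller 1 splitting incoming data tasks across several GPU2 units that process simultaneously — can be sketched as round-robin dispatch. The patent does not fix an allocation policy, so round-robin here is an assumption:

```python
def dispatch(tasks, num_gpus):
    """Round-robin assignment of data tasks to GPU queues by the GPU controller."""
    queues = [[] for _ in range(num_gpus)]
    for i, task in enumerate(tasks):
        queues[i % num_gpus].append(task)  # spread load evenly across GPUs
    return queues

# Five tasks over two GPUs: each GPU gets an almost equal share to work in parallel.
print(dispatch(["t0", "t1", "t2", "t3", "t4"], 2))  # [['t0', 't2', 't4'], ['t1', 't3']]
```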
A server comprising the acceleration apparatus according to any of the above embodiments.
The servers mentioned in this embodiment include, but are not limited to, AI servers, graphics-processing servers, and other servers requiring large volumes of data processing. The acceleration device connects to the server's CPU, which transmits tasks needing data processing to the GPU2 through the GPU controller 1. Because of the high bandwidth of the NVLink and IB buses, data transfer between the CPU and the GPU2 is no longer limited by the 64 GB/s PCIE bus bandwidth, and data transmitted and exchanged between the CPU memory and the GPU memory 4 achieves bandwidth balance.
The acceleration device and the server provided by the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. An acceleration device applied to a CPU, comprising:
a GPU controller, a GPU, an IB switching interface,
the GPU controller is connected with the GPU through an NVLink bus;
the GPU controller is connected with the IB exchange interface through an IB bus;
the GPU controller is used for realizing the conversion between an IB protocol and an NVLink protocol;
and the CPU is connected with the GPU controller through the IB exchange interface and is used for sending data tasks to the GPU controller.
2. The acceleration apparatus of claim 1, wherein the GPU controller is further configured to, after receiving the data task of the CPU through the IB switch interface, allocate the data task to the GPU and control the GPU to perform data processing on the data task.
3. The accelerating device of claim 1, further comprising: and the GPU memory is connected with the GPU and used for caching the data of the GPU.
4. The accelerating device of claim 2, further comprising:
a management controller controlling the interface;
the management controller is connected with the GPU controller and the GPU through a system management bus and is used for acquiring working voltage, temperature and power consumption state information of the GPU controller and the GPU;
the control interface is connected with the management controller through Ethernet;
the control interface is connected with the CPU and used for sending the working voltage, the temperature and the power consumption state information of the GPU controller and the GPU to the CPU.
5. The accelerating device of claim 3, further comprising: and the power supply unit is used for supplying power to the GPU controller, the GPU memory and the management controller.
6. The acceleration apparatus of claim 1, wherein the GPU controller is an ARM architecture based controller.
7. The acceleration device of claim 1, wherein the GPU controller has an HBM memory unit integrated therein.
8. An acceleration device according to any one of claims 1 to 7, characterized in that the number of GPUs is plural.
9. A server, characterized by comprising an acceleration device according to any one of claims 1 to 8.
CN202111199059.4A 2021-10-14 2021-10-14 Accelerating device and server Pending CN113946537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111199059.4A CN113946537A (en) 2021-10-14 2021-10-14 Accelerating device and server


Publications (1)

Publication Number Publication Date
CN113946537A true CN113946537A (en) 2022-01-18

Family

ID=79329890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111199059.4A Pending CN113946537A (en) 2021-10-14 2021-10-14 Accelerating device and server

Country Status (1)

Country Link
CN (1) CN113946537A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149861A (en) * 2023-03-07 2023-05-23 中科计算技术西部研究院 High-speed communication method based on VPX structure
CN116149861B (en) * 2023-03-07 2023-10-20 中科计算技术西部研究院 High-speed communication method based on VPX structure

Similar Documents

Publication Publication Date Title
US20210112003A1 (en) Network interface for data transport in heterogeneous computing environments
TWI505080B (en) Method, system and computer program for dispatching task in computer system
US20140068134A1 (en) Data transmission apparatus, system, and method
CN106648896B (en) Method for dual-core sharing of output peripheral by Zynq chip under heterogeneous-name multiprocessing mode
US9632557B2 (en) Active state power management (ASPM) to reduce power consumption by PCI express components
CN110399034B (en) Power consumption optimization method of SoC system and terminal
US10318473B2 (en) Inter-device data-transport via memory channels
KR20200125389A (en) Method for status monitoring of acceleration kernels in a storage device and storage device employing the same
US20210333860A1 (en) System-wide low power management
WO2022271239A1 (en) Queue scaling based, at least, in part, on processing load
CN101498963B (en) Method for reducing CPU power consumption, CPU and digital chip
CN113297122A (en) Influencing processor throttling based on serial bus aggregation IO connection management
CN113946537A (en) Accelerating device and server
CN112134713A (en) Method and device for connecting intelligent network card and server
US20130151885A1 (en) Computer management apparatus, computer management system and computer system
US20080082708A1 (en) Token hold off for chipset communication
CN214098424U (en) Double-circuit server mainboard based on Tengyun S2500
CN115543862A (en) Memory management method and related device
CN101639814A (en) Input-output system facing to multi-core platform and networking operation system and method thereof
US10025748B2 (en) Lane division multiplexing of an I/O link
CN112698950A (en) Memory optimization method for industrial Internet of things edge equipment
CN107122268B (en) NUMA-based multi-physical-layer partition processing system
US20220158865A1 (en) Asynchronous communication in multiplexed topologies
US20230153121A1 (en) Accelerator usage prediction for improved accelerator readiness
CN116185508A (en) Baseboard management controller and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination