WO2021135283A1 - Heterogeneous computing system and computing method therefor - Google Patents

Heterogeneous computing system and computing method therefor

Info

Publication number
WO2021135283A1
Authority
WO
WIPO (PCT)
Prior art keywords
accelerator card
root
level
local server
card
Prior art date
Application number
PCT/CN2020/110980
Other languages
French (fr)
Chinese (zh)
Inventor
许溢允
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Publication of WO2021135283A1 publication Critical patent/WO2021135283A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38: Information transfer, e.g. on bus
    • G06F 13/42: Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F 13/4204: Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
    • G06F 13/4221: Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus

Definitions

  • This application relates to the field of computer technology, and in particular to a heterogeneous computing system and a computing method thereof.
  • the purpose of this application is to provide a heterogeneous computing system that solves the problems that, owing to the tight coupling between current PHY boards and the CPU, the number of PHYs is limited and communication between PHY boards consumes server computing resources.
  • to this end, this application provides a heterogeneous computing system including a local server, a first root-level accelerator card, and a secondary accelerator card; the first root-level accelerator card is a PHY heterogeneous accelerator card directly connected to the local server through a PCIE module, and the secondary accelerator card is a PHY heterogeneous accelerator card directly or indirectly connected to the first root-level accelerator card through a MAC module;
  • the local server is used to send source operands to the first root-level accelerator card, and the first root-level accelerator card is used to allocate the source operands to the secondary accelerator cards for calculation and to feed the calculation result of each secondary accelerator card back to the local server.
  • the secondary accelerator card is arranged in a PHY enclosure (disk cabinet).
  • the first root-level accelerator card is connected, through a MAC module and via an Ethernet switch, to the secondary accelerator card in the PHY enclosure.
  • when the local server sends the source operand to the first root-level accelerator card, the local server recognizes the first root-level accelerator card through a software driver and, by configuring registers, controls the first root-level accelerator card to read the source operand locally.
  • when reading the source operand from the local server, the first root-level accelerator card determines the data volume of the source operand; if the data volume exceeds a preset threshold, the source operand is packetized according to the configuration registers on the local server side, and the data packet is sent to the secondary accelerator card.
  • after receiving the data packet, the secondary accelerator card unpacks it; if the data volume of the source operand obtained by unpacking still exceeds the preset threshold, the card calls an RTL-based interface to re-encapsulate the unpacked source operand and sends the resulting data packet to the secondary accelerator card connected to it.
  • the system may further include a second root-level accelerator card, which is a PHY heterogeneous accelerator card directly connected to the local server through a MAC module.
  • when the local server sends the source operand to the second root-level accelerator card, it generates an Ethernet data frame according to the source operand and the target network layer protocol and sends the frame to the second root-level accelerator card; the second root-level accelerator card processes or forwards the Ethernet data frame according to the target network layer protocol.
  • the system may further include a remote server connected to the secondary accelerator card through a MAC module.
  • this application also provides a heterogeneous computing method, which is implemented based on the above-mentioned heterogeneous computing system, and the method includes:
  • the local server sends the source operands to the first root-level accelerator card through the PCIE module;
  • the first root-level accelerator card allocates the source operands to the secondary accelerator cards through the MAC module;
  • each secondary accelerator card calculates its source operands to obtain a calculation result;
  • the first root-level accelerator card obtains the calculation result of each secondary accelerator card through the MAC module and feeds it back to the local server through the PCIE module.
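The four steps above form a scatter/gather flow, sketched minimally below. All class and method names are invented for illustration (the patent defines no API), and simple summation stands in for the PHY card's accelerated computation.

```python
# Minimal sketch of the four-step flow: PCIe in, MAC fan-out, compute,
# MAC gather, PCIe out. Names and the sum() workload are illustrative.

class SecondaryCard:
    def compute(self, operands):
        # Stand-in for the PHY card's accelerated calculation.
        return sum(operands)

class RootCard:
    def __init__(self, secondaries):
        self.secondaries = secondaries  # reachable over MAC, not PCIe

    def run(self, source_operands):
        # Step 2: allocate the operands round-robin across secondary cards.
        n = len(self.secondaries)
        chunks = [source_operands[i::n] for i in range(n)]
        # Step 3: each secondary card computes its share.
        # Step 4: gather the results over MAC for return over PCIe.
        return [card.compute(chunk)
                for card, chunk in zip(self.secondaries, chunks)]

root = RootCard([SecondaryCard(), SecondaryCard()])
print(root.run([1, 2, 3, 4, 5, 6]))  # step 1: server sends operands over PCIe
```

With two secondary cards, the round-robin split assigns [1, 3, 5] and [2, 4, 6], so the gathered results are [9, 12].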
  • a heterogeneous computing system provided by this application includes a local server, a first root-level accelerator card, and a secondary accelerator card.
  • the first root-level accelerator card is a PHY heterogeneous accelerator card directly connected to the local server through a PCIE module.
  • the secondary accelerator card is a PHY heterogeneous accelerator card directly or indirectly connected to the first root-level accelerator card through the MAC module.
  • the local server is used to send source operands to the first root-level accelerator card; the first root-level accelerator card then allocates the source operands to the secondary accelerator cards for calculation and feeds the calculation result of each secondary accelerator card back to the local server.
  • the first root-level accelerator card in the heterogeneous computing system is connected not only to the local server through the PCIE module but also to the secondary accelerator card through the MAC module; therefore, on the one hand, the PHY heterogeneous accelerator cards are no longer tightly coupled with the CPU and their number is no longer limited, and on the other hand, communication between PHY heterogeneous accelerator cards no longer needs to pass through the CPU, which reduces the resource occupancy of the local server.
  • this application also provides a heterogeneous computing method whose technical effects correspond to those of the above system and are not repeated here.
  • FIG. 1 is a first schematic architecture diagram of an embodiment of a heterogeneous computing system provided by this application;
  • FIG. 2 is a second schematic architecture diagram of an embodiment of a heterogeneous computing system provided by this application;
  • FIG. 3 is a third schematic architecture diagram of an embodiment of a heterogeneous computing system provided by this application;
  • FIG. 4 is an implementation flowchart of an embodiment of a heterogeneous computing method provided by this application.
  • with the spread of PHY boards in cloud data centers, PHY boards have begun to be deployed at scale; the current deployment generally adopts machine-card binding, that is, each PHY board is plugged through a PCIE slot directly into the standard bus interface of the local server.
  • when a user applies for a PHY instance, the user is generally assigned a virtual machine environment and accesses and uses the board from within that virtual machine.
  • this machine-card binding architecture tightly couples the server with the PHY boards, so the number of PHY boards is limited, and adding PHY boards requires adding matching servers; moreover, because there is no direct communication link between PHY boards, the architecture cannot meet the need for flexible service deployment, let alone form an effective distributed acceleration architecture.
  • to address these problems, this application provides a heterogeneous computing system in which PHY heterogeneous accelerator cards are no longer tightly coupled with the CPU and their number is no longer limited, and in which communication between PHY heterogeneous accelerator cards no longer needs to pass through the CPU, reducing the resource occupancy of the local server.
  • This embodiment includes: a local server, a first root-level accelerator card, and a secondary accelerator card.
  • the first root-level accelerator card is a PHY heterogeneous accelerator card directly connected to the local server through a PCIE module, and the secondary accelerator card is a PHY heterogeneous accelerator card directly or indirectly connected to the first root-level accelerator card through a MAC module;
  • the local server is used to send source operands to the first root-level accelerator card, and the first root-level accelerator card is used to allocate the source operands to the secondary accelerator cards for calculation and to feed the calculation result of each secondary accelerator card back to the local server.
  • the PHY heterogeneous accelerator card can use the high-speed computing power of the PHY to accelerate the calculation of the source operands sent by the CPU and return the results to the CPU, thereby providing the higher computing capability required by workloads such as video encoding and decoding, deep learning, scientific computing, and graphics processing.
  • each PHY heterogeneous accelerator card is identical; "first root-level accelerator card" and "secondary accelerator card" merely distinguish the two connection relationships shown in Figure 1: the former is directly connected to the local server through the PCIE module, and the latter is directly or indirectly connected to the first root-level accelerator card through the MAC module. In addition, this embodiment does not limit the number of first root-level accelerator cards or secondary accelerator cards.
  • this embodiment retains the machine-card binding form on the one hand and introduces the BOX OF PHY (PHY enclosure) mode on the other, as shown in FIG. 2.
  • the secondary accelerator cards are housed in the PHY enclosure, and the secondary accelerator cards are connected to one another through their MAC modules.
  • the PHY enclosure may include various types of heterogeneous accelerator cards, such as Intel chips, PHY manufacturer chips, and so on.
  • the first root-level accelerator card is connected, through a MAC module and via an Ethernet switch, to the secondary accelerator card in the PHY enclosure, so as to decouple the PHY from the CPU.
  • when the local server sends the source operand to the first root-level accelerator card, it recognizes the first root-level accelerator card through a software driver and, by configuring registers, controls the card to read the source operand locally.
  • when the first root-level accelerator card reads the source operand from the local server, it first determines the data volume of the source operand; if the data volume exceeds the preset threshold, the source operand is packetized according to the configuration registers on the local server side and the data packet is sent to the secondary accelerator card, whereas if the data volume does not exceed the preset threshold, the first root-level accelerator card completes the calculation of the source operand itself.
  • after receiving the data packet, the secondary accelerator card unpacks it; if the data volume of the source operand obtained by unpacking still exceeds the preset threshold, the card calls an RTL-based interface to re-encapsulate the unpacked source operand and sends the resulting data packet to the secondary accelerator card connected to it, whereas if the data volume does not exceed the preset threshold, the card calculates the source operand itself.
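The threshold rule described above (compute locally when the operand volume is small, packetize and push one level down when it is large) can be sketched as a recursive cascade. The threshold value, class names, and the use of plain recursion in place of the RTL-based interface are all assumptions made for illustration.

```python
# Sketch of the threshold-based cascade: a card computes small workloads
# itself and fans large ones out to the cards connected below it.
PRESET_THRESHOLD = 4  # illustrative; the patent leaves the value to configuration

class Card:
    def __init__(self, children=()):
        self.children = list(children)  # downstream cards reachable over MAC

    def handle(self, operands):
        # Small workload, or no downstream cards: compute the operands itself.
        if len(operands) <= PRESET_THRESHOLD or not self.children:
            return sum(operands)
        # Otherwise "re-encapsulate" and distribute to the next level
        # (an RTL-based interface in the patent; recursion in this sketch).
        n = len(self.children)
        parts = [operands[i::n] for i in range(n)]
        return sum(child.handle(part)
                   for child, part in zip(self.children, parts))

root = Card(children=[Card(), Card()])
print(root.handle(list(range(10))))  # exceeds the threshold, so distributed; 45
```

Because each level re-applies the same test, a deep enough tree of secondary cards keeps splitting the work until every card's share falls under the threshold or the leaves are reached.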
  • this embodiment further includes a remote server. As shown in FIG. 3, the remote server is connected to the secondary accelerator card through a MAC module.
  • when the data transmission path is remote server → secondary accelerator card, the remote server uses the secondary accelerator card over the network, and the interface form is MAC to MAC; in this scenario, a software driver packetizes according to the agreed format, and the PHY unpacks according to that format and matches the packet type.
  • when the remote server uses the secondary accelerator card for accelerated data distribution over the network, the interface form is MAC to MAC to MAC; in this scenario, a software driver packetizes according to the format, and the secondary accelerator card re-encapsulates the MAC packet and forwards it.
  • when the data transmission path is secondary accelerator card → remote server, the secondary accelerator card sends the calculation result to the remote server, and the interface form is MAC to MAC; in this scenario, the secondary accelerator card packetizes according to the format, and the software driver unpacks according to the format.
  • when the data transmission path is local server → first root-level accelerator card, the first root-level accelerator card is used locally or through virtual machine passthrough, and the interface form is PCIE; in this scenario, a software driver identifies the root-level card, the board is used by way of configuration registers, and the data is transmitted directly.
  • when the first root-level accelerator card is used locally or through virtual machine passthrough for data forwarding, the interface form is PCIE to MAC to MAC; in this scenario, a software driver identifies the root-level card and controls the first root-level accelerator card to forward data by configuring registers, and the first root-level accelerator card performs packet forwarding in the above format.
  • when the data transmission path is first root-level accelerator card → local server, the first root-level accelerator card returns the calculation result to the local server, and the interface form is PCIE; in this scenario, the first root-level accelerator card returns the result directly to the local server, and the software driver receives the data directly.
  • when the data transmission path is secondary accelerator card → first root-level accelerator card, the secondary accelerator card sends the calculation result to the first root-level accelerator card, and the interface form is MAC to MAC; in this scenario, the secondary accelerator card packetizes in the above format.
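The path/interface pairings enumerated above can be condensed into a lookup table. This is only a restatement of the text; the string labels are chosen for illustration and carry no meaning beyond this sketch.

```python
# Interface form for each data-transmission path described above.
INTERFACE_FORM = {
    ("remote server", "secondary card"): "MAC to MAC",
    ("secondary card", "remote server"): "MAC to MAC",
    ("local server", "root card"): "PCIE",
    ("root card", "local server"): "PCIE",
    ("secondary card", "root card"): "MAC to MAC",
}

def interface_for(src, dst):
    """Return the interface form for a (source, destination) hop."""
    return INTERFACE_FORM[(src, dst)]

print(interface_for("local server", "root card"))    # PCIE
print(interface_for("secondary card", "root card"))  # MAC to MAC
```

The table makes the pattern explicit: every hop that touches the local server host bus is PCIE, and every hop between cards or to a remote server is MAC to MAC.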
  • this embodiment may further include a second root-level accelerator card, which is a PHY heterogeneous accelerator card directly connected to the local server through a MAC module.
  • when the local server sends a source operand to the second root-level accelerator card, an Ethernet data frame is generated according to the source operand and the target network layer protocol and sent to the second root-level accelerator card; the second root-level accelerator card processes or forwards the Ethernet data frame according to the target network layer protocol.
  • the heterogeneous computing system provided in this embodiment includes a local server, a first root-level accelerator card, and a secondary accelerator card.
  • the first root-level accelerator card in the heterogeneous computing system is connected not only to the local server through a PCIE module but also to the secondary accelerator card through the MAC module.
  • applications that need to be accelerated can transmit data to the accelerator card in two ways: PCIE or MAC.
  • the PHY resources that a user can be allocated are no longer restricted by the host, so PHY resources can be allocated and deployed more flexibly and can connect seamlessly to the existing server cloud ecosystem.
  • the local server sends the source operand to the first root-level accelerator card through the PCIE module;
  • the first root-level accelerator card allocates the source operand to the secondary accelerator card through the MAC module;
  • the secondary accelerator card calculates the source operand to obtain a calculation result
  • the first root-level accelerator card obtains the calculation result of each secondary accelerator card through the MAC module, and feeds it back to the local server through the PCIE module.
  • the remote server sends the source operands to the secondary accelerator card through the MAC module;
  • the secondary accelerator card calculates the source operand to obtain the calculation result
  • the secondary accelerator card sends the calculation result to the remote server through the MAC module.
  • the sending of the source operand by the local server to the first root-level accelerator card includes:
  • the local server recognizes the first root-level accelerator card through a software driver, and controls the first root-level accelerator card to read the source operand locally by means of configuration registers.
  • the method includes:
  • the secondary accelerator card unpacks the data packet; if the data volume of the source operand obtained by unpacking exceeds the preset threshold, an RTL-based interface is called to re-encapsulate the unpacked source operand, and the data packet obtained by re-encapsulation is sent to the secondary accelerator card connected to it.
  • the local server generates an Ethernet data frame according to the source operand and the target network layer protocol, and sends the Ethernet data frame to the second root-level accelerator card;
  • the second root-level accelerator card processes or forwards the Ethernet data frame according to the target network layer protocol.
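The two steps above (the server wrapping the source operand in an Ethernet frame for a target network-layer protocol, and the card deciding to process or forward it) can be sketched with a standard Ethernet II header. The EtherType values and the process/forward rule here are illustrative assumptions; the patent does not fix a frame layout.

```python
import struct

# Sketch: build an Ethernet II frame carrying a source operand, then decide
# process-vs-forward from the EtherType field. EtherType 0x0800 (IPv4) is
# used as the example "target network layer protocol".

def build_frame(dst_mac, src_mac, ethertype, payload):
    # Ethernet II header: 6-byte destination MAC, 6-byte source MAC,
    # 2-byte EtherType, followed by the payload.
    return struct.pack("!6s6sH", dst_mac, src_mac, ethertype) + payload

def handle_frame(frame, supported=(0x0800,)):
    # The card processes protocols it understands and forwards the rest.
    ethertype = struct.unpack("!H", frame[12:14])[0]
    return "process" if ethertype in supported else "forward"

frame = build_frame(b"\xff" * 6, b"\x02" * 6, 0x0800, b"\x01\x02")
print(handle_frame(frame))  # process
```

A frame carrying an unsupported EtherType (for example 0x86DD, IPv6, in this sketch) would take the "forward" branch instead.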
  • the steps of the method or algorithm described in the embodiments disclosed in this document can be directly implemented by hardware, a software module executed by a processor, or a combination of the two.
  • the software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Provided are a heterogeneous computing system and a computing method therefor. The heterogeneous computing system comprises a local server, a first root-level acceleration card, and a secondary acceleration card, wherein the first root-level acceleration card is a PHY heterogeneous acceleration card which is directly connected to the local server via a PCIE module; and the secondary acceleration card is a PHY heterogeneous acceleration card which is directly or indirectly connected to the first root-level acceleration card via a MAC module. It can be seen that the first root-level acceleration card in the heterogeneous computing system is not only connected to the local server via the PCIE module, but is also connected to the secondary acceleration card via the MAC module, such that, on one hand, the PHY heterogeneous acceleration card is no longer tightly coupled with a CPU, and the number of the PHY heterogeneous acceleration cards is no longer limited; in addition, the communication between the PHY heterogeneous acceleration cards does not need to be performed via the CPU, thereby reducing the resource occupation rate of the local server.

Description

A heterogeneous computing system and computing method therefor
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 29, 2019, with application number 201911386453.1 and the invention title "A heterogeneous computing system and computing method therefor", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a heterogeneous computing system and a computing method thereof.
Background
At present, domestic PHY cloud service vendors almost all adopt single-machine single-card or single-machine multi-card binding modes, that is, one card is inserted into a server, or multiple cards are inserted into one server. In this machine-card binding mode, the PHY is tightly coupled with the CPU: a user can access and use a PHY card only through the CPU on the host side; the PHY boards each user can be allocated are limited by the number of bound boards; and there is no direct data communication link between boards, so if boards need to communicate, the data must be forwarded by the CPU.
It can be seen that the tight coupling between current PHY boards and the CPU both limits the number of PHYs and makes communication between PHY boards consume server computing resources.
Summary of the Invention
The purpose of this application is to provide a heterogeneous computing system that solves the problems that, owing to the tight coupling between current PHY boards and the CPU, the number of PHYs is limited and communication between PHY boards consumes server computing resources.
To solve the above technical problems, this application provides a heterogeneous computing system including a local server, a first root-level accelerator card, and a secondary accelerator card, where the first root-level accelerator card is a PHY heterogeneous accelerator card directly connected to the local server through a PCIE module, and the secondary accelerator card is a PHY heterogeneous accelerator card directly or indirectly connected to the first root-level accelerator card through a MAC module;
wherein the local server is used to send source operands to the first root-level accelerator card, and the first root-level accelerator card is used to allocate the source operands to the secondary accelerator cards for calculation and to feed the calculation result of each secondary accelerator card back to the local server.
Preferably, the secondary accelerator card is arranged in a PHY enclosure.
Preferably, the first root-level accelerator card is connected to the secondary accelerator card in the PHY enclosure through a MAC module via an Ethernet switch.
Preferably, when the local server sends the source operand to the first root-level accelerator card, the local server recognizes the first root-level accelerator card through a software driver and, by configuring registers, controls the first root-level accelerator card to read the source operand locally.
Preferably, when reading the source operand from the local server, the first root-level accelerator card determines the data volume of the source operand; if the data volume exceeds a preset threshold, the source operand is packetized according to the configuration registers on the local server side, and the data packet is sent to the secondary accelerator card.
Preferably, after receiving the data packet, the secondary accelerator card unpacks it; if the data volume of the source operand obtained by unpacking still exceeds the preset threshold, the card calls an RTL-based interface to re-encapsulate the unpacked source operand and sends the resulting data packet to the secondary accelerator card connected to it.
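The packetize/unpack steps that carry operands between cards can be sketched with a toy length-prefixed payload format. The header layout here is entirely an assumption; the patent only requires that both ends agree on a configured format.

```python
import struct

# Toy encapsulation for operands carried in a MAC payload: a 4-byte
# big-endian count followed by 32-bit big-endian signed operands.
# The layout is illustrative, not the patent's configured format.

def packetize(operands):
    return struct.pack("!I", len(operands)) + b"".join(
        struct.pack("!i", x) for x in operands)

def unpack(payload):
    (count,) = struct.unpack("!I", payload[:4])
    return [struct.unpack("!i", payload[4 + 4 * i: 8 + 4 * i])[0]
            for i in range(count)]

pkt = packetize([10, -3, 7])
print(unpack(pkt))  # [10, -3, 7]
```

Re-encapsulation, as performed by a secondary card forwarding work downward, would simply be `packetize` applied again to a slice of the unpacked operands.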
Preferably, the system further includes a second root-level accelerator card, which is a PHY heterogeneous accelerator card directly connected to the local server through a MAC module.
Preferably, when the local server sends the source operand to the second root-level accelerator card, it generates an Ethernet data frame according to the source operand and the target network layer protocol and sends the Ethernet data frame to the second root-level accelerator card; the second root-level accelerator card processes or forwards the Ethernet data frame according to the target network layer protocol.
Preferably, the system further includes a remote server connected to the secondary accelerator card through a MAC module.
In addition, this application provides a heterogeneous computing method implemented on the heterogeneous computing system described above, the method including:
the local server sending source operands to the first root-level accelerator card through the PCIE module;
the first root-level accelerator card allocating the source operands to the secondary accelerator cards through the MAC module;
the secondary accelerator cards calculating the source operands to obtain calculation results;
the first root-level accelerator card obtaining the calculation result of each secondary accelerator card through the MAC module and feeding it back to the local server through the PCIE module.
The heterogeneous computing system provided by this application includes a local server, a first root-level accelerator card, and a secondary accelerator card, where the first root-level accelerator card is a PHY heterogeneous accelerator card directly connected to the local server through a PCIE module, and the secondary accelerator card is a PHY heterogeneous accelerator card directly or indirectly connected to the first root-level accelerator card through a MAC module. In the heterogeneous computing process, the local server sends source operands to the first root-level accelerator card; the first root-level accelerator card then allocates the source operands to the secondary accelerator cards for calculation and feeds the calculation result of each secondary accelerator card back to the local server. It can be seen that the first root-level accelerator card in this system is connected not only to the local server through the PCIE module but also to the secondary accelerator card through the MAC module. Therefore, on the one hand, the PHY heterogeneous accelerator cards are no longer tightly coupled with the CPU and their number is no longer limited; on the other hand, communication between PHY heterogeneous accelerator cards no longer needs to pass through the CPU, which reduces the resource occupancy of the local server.
In addition, this application also provides a heterogeneous computing method whose technical effects correspond to those of the above system and are not repeated here.
Description of the Drawings
In order to illustrate the technical solutions of the embodiments of this application or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 is a first schematic architecture diagram of an embodiment of a heterogeneous computing system provided by this application;
FIG. 2 is a second schematic architecture diagram of an embodiment of a heterogeneous computing system provided by this application;
FIG. 3 is a third schematic architecture diagram of an embodiment of a heterogeneous computing system provided by this application;
FIG. 4 is an implementation flowchart of an embodiment of a heterogeneous computing method provided by this application.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本申请方案,下面结合附图和具体实施方式对本申请作进一步的详细说明。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to enable those skilled in the art to better understand the solution of the application, the application will be further described in detail below with reference to the accompanying drawings and specific implementations. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
As PHY boards are applied in cloud data centers, they have begun to be deployed at large scale. The current deployment method is generally machine-card binding: each PHY board is plugged directly into a standard bus interface of the local server through a PCIE slot. When a user applies for a PHY instance, the user is generally assigned a virtual machine environment and accesses and uses the board from within the virtual machine. This machine-card-bound architecture tightly couples the server with the PHY boards and limits the number of PHY boards; adding PHY boards requires matching servers. Moreover, because there is no direct communication link between PHY boards, the architecture can neither meet the need for elastic service deployment nor form an effective distributed acceleration architecture.
To address the above problems, this application provides a heterogeneous computing system in which PHY heterogeneous accelerator cards are no longer tightly coupled with the CPU and their number is no longer limited, and in which communication between PHY heterogeneous accelerator cards no longer needs to pass through the CPU, reducing the resource occupancy of the local server.
An embodiment of the heterogeneous computing system provided by this application is introduced below. Referring to FIG. 1, this embodiment includes a local server, a first root-level accelerator card, and secondary accelerator cards, where the first root-level accelerator card is a PHY heterogeneous accelerator card connected directly to the local server through a PCIE module, and a secondary accelerator card is a PHY heterogeneous accelerator card connected directly or indirectly to the first root-level accelerator card through a MAC module;
wherein the local server is configured to send source operands to the first root-level accelerator card, and the first root-level accelerator card is configured to distribute the source operands to the secondary accelerator cards for computation and to feed the computation results of each secondary accelerator card back to the local server.
A PHY heterogeneous accelerator card can use the high-speed computing capability of the PHY to accelerate computation on the source operands sent by the CPU and return the results to the CPU, thereby providing high-performance computing for functions with demanding computational requirements such as video encoding/decoding, deep learning, scientific computing, and graphics processing.
It should be noted that in this embodiment the hardware structure and software framework of each PHY heterogeneous accelerator card are the same; "first root-level accelerator card" and "secondary accelerator card" merely distinguish two connection relationships, as shown in FIG. 1: connected directly to the local server through a PCIE module, or connected directly or indirectly to the first root-level accelerator card through a MAC module. In addition, this embodiment does not limit the numbers of first root-level accelerator cards and secondary accelerator cards.
As a specific implementation, this embodiment retains the machine-card-bound form on the one hand and introduces a BOX OF PHY (PHY enclosure) mode on the other, as shown in FIG. 2. The secondary accelerator cards are housed in the PHY enclosure and are connected to one another through their MAC modules. Specifically, the PHY enclosure may contain heterogeneous accelerator cards of various types, such as Intel chips and PHY-vendor chips. The first root-level accelerator card is connected through its MAC module, via an Ethernet switch, to the secondary accelerator cards in the PHY enclosure, thereby decoupling the tight coupling between the PHY and the CPU.
When the local server sends source operands to the first root-level accelerator card, the local server identifies the first root-level accelerator card through a software driver and controls it, by configuring registers, to read the source operands from the local server. When reading the source operands from the local server, the first root-level accelerator card first determines their data volume. If the data volume of the source operands exceeds a preset threshold, the card packetizes the source operands according to the configuration registers on the local-server side and sends the packets to the secondary accelerator cards; if the data volume does not exceed the preset threshold, the first root-level accelerator card computes on the source operands itself. After receiving a packet, a secondary accelerator card unpacks it. If the data volume of the unpacked source operands exceeds the preset threshold, the card calls an RTL-implemented interface to repacketize the unpacked source operands and sends the resulting packets to the secondary accelerator cards connected to it; if the data volume does not exceed the preset threshold, the card computes on the source operands.
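The threshold-based distribute-or-compute decision described above can be sketched as follows. This is a minimal illustration only: the function names and the threshold value are hypothetical, and in the actual system this logic runs in driver and RTL logic on the cards, not in software like this.

```python
THRESHOLD = 1024  # hypothetical preset threshold, in bytes

def handle_operands(operands, downstream_cards, compute):
    """Distribute operands downstream when they exceed the threshold,
    otherwise compute on the current card.

    `operands` is a list of byte counts standing in for source-operand
    chunks; `downstream_cards` lists the connected secondary cards.
    """
    total = sum(operands)
    if total <= THRESHOLD or not downstream_cards:
        # Small workload (or no downstream card): compute locally.
        return [("local", compute(operands))]
    # Large workload: packetize and fan out round-robin downstream.
    assignments = [(downstream_cards[i % len(downstream_cards)], op)
                   for i, op in enumerate(operands)]
    return [(card, compute([op])) for card, op in assignments]
```

A secondary card receiving a packet would apply the same rule recursively: unpack, test the threshold again, and either repacketize for its own downstream cards or compute.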
As a specific implementation, this embodiment further includes a remote server. As shown in FIG. 3, the remote server is connected to the secondary accelerator cards through a MAC module.
Several typical data transmission paths are described below:
When the data transmission path is remote server → secondary accelerator card, the remote server uses the secondary accelerator card over the network, with a MAC-to-MAC interface. In this scenario the software driver packetizes according to the format, and the PHY unpacks according to the format and matches the packet type.
When the data transmission path is remote server → secondary accelerator card → secondary accelerator card / first root-level accelerator card, the remote server uses a secondary accelerator card over the network to distribute acceleration data, with a MAC-to-MAC-to-MAC interface. In this scenario the software driver packetizes according to the format, and the secondary accelerator card re-encapsulates the MAC packet and forwards it.
When the data transmission path is secondary accelerator card → remote server, the secondary accelerator card sends the computation result to the remote server, with a MAC-to-MAC interface. In this scenario the secondary accelerator card packetizes according to the format, and the software driver unpacks according to the format.
When the data transmission path is local server → first root-level accelerator card, the first root-level accelerator card is used locally or through virtual machine passthrough, with a PCIE interface. In this scenario the software driver identifies the local board and uses it by configuring registers; data is transmitted directly.
When the data transmission path is local server → first root-level accelerator card → secondary accelerator card, the first root-level accelerator card is used locally or through virtual machine passthrough for data forwarding, with a PCIe-to-MAC-to-MAC interface. In this scenario the software driver identifies the local board and controls the first root-level accelerator card, by configuring registers, to forward the data, and the first root-level accelerator card packetizes and forwards according to the format described above.
When the data transmission path is first root-level accelerator card → local server, the first root-level accelerator card returns the computation result to the local server, with a PCIE interface. In this scenario the first root-level accelerator card returns the result directly to the local server, and the software driver receives the data directly.
When the data transmission path is secondary accelerator card → first root-level accelerator card, the secondary accelerator card sends the computation result to the first root-level accelerator card, with a MAC-to-MAC interface. In this scenario the secondary accelerator card packetizes according to the format described above.
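The paths above can be summarized in a small lookup table (a sketch for illustration only; the hop labels and table name are invented here, and the interface forms are taken directly from the description, not from any defined API):

```python
# Each key is a data transmission path (sequence of hops);
# each value is the interface form the description assigns to it.
TRANSMISSION_PATHS = {
    ("remote server", "secondary card"): "MAC to MAC",
    ("remote server", "secondary card", "next card"): "MAC to MAC to MAC",
    ("secondary card", "remote server"): "MAC to MAC",
    ("local server", "root card"): "PCIE",
    ("local server", "root card", "secondary card"): "PCIe to MAC to MAC",
    ("root card", "local server"): "PCIE",
    ("secondary card", "root card"): "MAC to MAC",
}

def interface_form(*hops):
    """Return the interface form for a given data transmission path."""
    return TRANSMISSION_PATHS[hops]
```

Note the pattern: every hop that crosses the network contributes a MAC stage, while the single hop between the local server and the first root-level accelerator card is plain PCIE.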
As a specific implementation, as shown in FIG. 3, this embodiment further includes a second root-level accelerator card, which is a PHY heterogeneous accelerator card connected directly to the local server through a MAC module.
Unlike with the first root-level accelerator card, in this embodiment, when the local server sends source operands to the second root-level accelerator card, it generates an Ethernet data frame according to the source operands and a target network-layer protocol and sends the Ethernet data frame to the second root-level accelerator card; the second root-level accelerator card then processes or forwards the Ethernet data frame according to the target network-layer protocol.
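Framing source operands for the MAC-attached path can be sketched with standard Ethernet II framing (an assumption; the patent does not specify the frame layout, and the MAC addresses and EtherType below are placeholders):

```python
import struct

def build_ethernet_frame(dst_mac, src_mac, ethertype, payload):
    """Build a minimal Ethernet II frame: 6-byte destination MAC,
    6-byte source MAC, 2-byte EtherType, then the payload."""
    header = struct.pack("!6s6sH", dst_mac, src_mac, ethertype)
    return header + payload

frame = build_ethernet_frame(
    b"\x02\x00\x00\x00\x00\x01",  # destination: second root-level card (made up)
    b"\x02\x00\x00\x00\x00\x02",  # source: local server NIC (made up)
    0x88B5,                       # EtherType reserved for local experimentation
    b"source-operands",           # source operands carried as the payload
)
```

The EtherType (standing in for the "target network layer protocol") is what lets the second root-level accelerator card decide whether to process the frame itself or forward it.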
The heterogeneous computing system provided in this embodiment includes a local server, a first root-level accelerator card, and secondary accelerator cards. The first root-level accelerator card is connected not only to the local server through a PCIE module but also to the secondary accelerator cards through a MAC module. Under this architecture, an application that needs acceleration can transmit its data to an accelerator card in either of two ways: PCIE or MAC. Moreover, the PHY resources a user can be allocated are no longer limited by the host, so PHY resources can be allocated and deployed more flexibly and connect seamlessly to the existing server cloud ecosystem.
An embodiment of the heterogeneous computing method provided by this application is described in detail below. Referring to FIG. 4, this embodiment is implemented on the heterogeneous computing system described above and includes:
S401: the local server sends source operands to the first root-level accelerator card through the PCIE module;
S402: the first root-level accelerator card distributes the source operands to the secondary accelerator cards through the MAC module;
S403: the secondary accelerator cards compute on the source operands to obtain computation results;
S404: the first root-level accelerator card obtains the computation result of each secondary accelerator card through the MAC module and feeds it back to the local server through the PCIE module.
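Steps S401-S404 amount to a scatter-compute-gather pipeline, sketched below for illustration only (the card class and the squaring kernel are hypothetical stand-ins, not part of the patented system):

```python
class SecondaryCard:
    """Stand-in for a secondary accelerator card reached over MAC."""
    def compute(self, operand):
        return operand * operand  # placeholder for the accelerated kernel

def heterogeneous_compute(operands, secondary_cards):
    # S401/S402: the root-level card scatters operands round-robin
    # to the secondary cards over MAC.
    assignments = [(secondary_cards[i % len(secondary_cards)], op)
                   for i, op in enumerate(operands)]
    # S403: each secondary card computes on its share.
    results = [card.compute(op) for card, op in assignments]
    # S404: the root-level card gathers the results and returns them
    # to the local server over PCIE.
    return results

print(heterogeneous_compute([1, 2, 3, 4], [SecondaryCard(), SecondaryCard()]))
# prints [1, 4, 9, 16]
```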
As a specific implementation, the method further includes:
the remote server sends source operands to a secondary accelerator card through the MAC module;
the secondary accelerator card computes on the source operands to obtain a computation result; and
the secondary accelerator card sends the computation result to the remote server through the MAC module.
As a specific implementation, the local server sending the source operands to the first root-level accelerator card includes:
the local server identifies the first root-level accelerator card through a software driver and controls it, by configuring registers, to read the source operands from the local server.
As a specific implementation, after the secondary accelerator card receives the data packet, the method includes:
the secondary accelerator card unpacks the data packet, and if the data volume of the unpacked source operands exceeds the preset threshold, calls an RTL-implemented interface to repacketize the unpacked source operands and sends the resulting packets to the secondary accelerator cards connected to it.
As a specific implementation, the method further includes:
the local server generates an Ethernet data frame according to the source operands and a target network-layer protocol and sends the Ethernet data frame to the second root-level accelerator card; and
the second root-level accelerator card processes or forwards the Ethernet data frame according to the target network-layer protocol.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The steps of the methods or algorithms described in the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The solution provided by this application has been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help understand the method of this application and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

  1. A heterogeneous computing system, comprising: a local server, a first root-level accelerator card, and a secondary accelerator card, wherein the first root-level accelerator card is a PHY heterogeneous accelerator card connected directly to the local server through a PCIE module, and the secondary accelerator card is a PHY heterogeneous accelerator card connected directly or indirectly to the first root-level accelerator card through a MAC module;
    wherein the local server is configured to send source operands to the first root-level accelerator card, and the first root-level accelerator card is configured to distribute the source operands to the secondary accelerator card for computation and to feed the computation result of each secondary accelerator card back to the local server.
  2. The system according to claim 1, wherein the secondary accelerator card is arranged in a PHY enclosure.
  3. The system according to claim 2, wherein the first root-level accelerator card is connected through a MAC module, via an Ethernet switch, to the secondary accelerator card in the PHY enclosure.
  4. The system according to claim 1, wherein when the local server sends source operands to the first root-level accelerator card, the local server identifies the first root-level accelerator card through a software driver and controls the first root-level accelerator card, by configuring registers, to read the source operands from the local server.
  5. The system according to claim 4, wherein when reading the source operands from the local server, the first root-level accelerator card determines the data volume of the source operands, and if the data volume of the source operands exceeds a preset threshold, packetizes the source operands according to the configuration registers on the local-server side and sends the data packet to the secondary accelerator card.
  6. The system according to claim 5, wherein after receiving the data packet, the secondary accelerator card unpacks the data packet, and if the data volume of the unpacked source operands exceeds the preset threshold, calls an RTL-implemented interface to repacketize the unpacked source operands and sends the resulting data packet to the secondary accelerator card connected to it.
  7. The system according to any one of claims 1-6, further comprising a second root-level accelerator card, wherein the second root-level accelerator card is a PHY heterogeneous accelerator card connected directly to the local server through a MAC module.
  8. The system according to claim 7, wherein when the local server sends source operands to the second root-level accelerator card, the local server generates an Ethernet data frame according to the source operands and a target network-layer protocol and sends the Ethernet data frame to the second root-level accelerator card; and the second root-level accelerator card processes or forwards the Ethernet data frame according to the target network-layer protocol.
  9. The system according to claim 1, further comprising: a remote server, wherein the remote server is connected to the secondary accelerator card through a MAC module.
  10. A heterogeneous computing method, implemented on the heterogeneous computing system according to any one of claims 1-9, the method comprising:
    sending, by a local server, source operands to a first root-level accelerator card through a PCIE module;
    distributing, by the first root-level accelerator card, the source operands to a secondary accelerator card through a MAC module;
    computing, by the secondary accelerator card, on the source operands to obtain a computation result; and
    obtaining, by the first root-level accelerator card, the computation result of each secondary accelerator card through the MAC module, and feeding it back to the local server through the PCIE module.
PCT/CN2020/110980 2019-12-29 2020-08-25 Heterogeneous computing system and computing method therefor WO2021135283A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911386453.1A CN111143276A (en) 2019-12-29 2019-12-29 Heterogeneous computing system and computing method thereof
CN201911386453.1 2019-12-29

Publications (1)

Publication Number Publication Date
WO2021135283A1 true WO2021135283A1 (en) 2021-07-08

Family

ID=70521404

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/110980 WO2021135283A1 (en) 2019-12-29 2020-08-25 Heterogeneous computing system and computing method therefor

Country Status (2)

Country Link
CN (1) CN111143276A (en)
WO (1) WO2021135283A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143276A (en) * 2019-12-29 2020-05-12 苏州浪潮智能科技有限公司 Heterogeneous computing system and computing method thereof
CN112416840B (en) 2020-11-06 2023-05-26 浪潮(北京)电子信息产业有限公司 Remote mapping method, device, equipment and storage medium for computing resources

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6275501B1 (en) * 1998-04-21 2001-08-14 Hewlett-Packard Company Media access controller capable of connecting to a serial physical layer device and a media independent interface (MII) physical layer device
CN108563595A (en) * 2018-04-17 2018-09-21 上海固高欧辰智能科技有限公司 A kind of system and method for remote transmission usb data
CN109117398A (en) * 2018-07-18 2019-01-01 维沃移动通信有限公司 A kind of sensor control and terminal
CN111143276A (en) * 2019-12-29 2020-05-12 苏州浪潮智能科技有限公司 Heterogeneous computing system and computing method thereof

Also Published As

Publication number Publication date
CN111143276A (en) 2020-05-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910245

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20910245

Country of ref document: EP

Kind code of ref document: A1