CN112463714B - Remote direct memory access method, heterogeneous computing system and electronic equipment - Google Patents

Remote direct memory access method, heterogeneous computing system and electronic equipment

Info

Publication number
CN112463714B
Authority
CN
China
Prior art keywords: accelerator, bus, physical address, processing unit, central processing
Legal status: Active
Application number
CN202011372146.0A
Other languages
Chinese (zh)
Other versions
CN112463714A (en)
Inventor
钟于义
Current Assignee
Chengdu Haiguang Integrated Circuit Design Co Ltd
Original Assignee
Chengdu Haiguang Integrated Circuit Design Co Ltd
Application filed by Chengdu Haiguang Integrated Circuit Design Co Ltd
Priority to CN202011372146.0A
Publication of CN112463714A
Application granted
Publication of CN112463714B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163: Interprocessor communication
    • G06F15/173: Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306: Intercommunication techniques
    • G06F15/17331: Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14: Handling requests for interconnection or transfer
    • G06F13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/32: Handling requests for interconnection or transfer for access to input/output bus using combination of interrupt and burst mode transfer
    • G06F2213/00: Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026: PCI express
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A remote direct memory access method, a heterogeneous computing system and an electronic device are provided. The method is based on a heterogeneous computing system comprising a central processing unit, a network card and at least one accelerator. The method comprises the following steps: translating a target virtual address into a target physical address, the target physical address corresponding to one of the at least one accelerator; based on the target physical address, reading the target information stored at the target physical address from the accelerator corresponding to that address; writing the target information into an interface control space of the central processing unit through a unified memory architecture bus; transmitting the target information from the interface control space of the central processing unit to the network card through a Peripheral Component Interconnect Express (PCIe) bus; and sending the target information using the network card. The method can reduce address mapping management and address translation overhead, improve communication efficiency and performance, and lower the latency of remote direct memory access.

Description

Remote direct memory access method, heterogeneous computing system and electronic equipment
Technical Field
Embodiments of the present disclosure relate to a remote direct memory access method, a heterogeneous computing system, and an electronic device.
Background
With the development of heterogeneous computing, more and more computing centers adopt a heterogeneous structure that combines a central processing unit (CPU) with accelerators: large amounts of data, and large amounts of computation, are offloaded from the CPU to the accelerators, yielding better performance at lower power consumption. In such a heterogeneous computing system, the CPU is mainly responsible for flow control, task analysis and a small amount of computation, while the bulk of the computation runs on the accelerators. As the computing model changes, the demand for accelerators on different nodes to exchange data directly with one another keeps growing: data in the accelerator of one node may need to be transferred to the accelerator of another node to meet computational demand.
Disclosure of Invention
At least one embodiment of the present disclosure provides a remote direct memory access method based on a heterogeneous computing system, where the heterogeneous computing system includes a central processing unit, a network card, and at least one accelerator, the central processing unit is connected to the at least one accelerator through a high-speed serial computer expansion bus standard (Peripheral Component Interconnect Express, PCIe) bus and a unified memory architecture (UMA) bus, and the central processing unit is connected to the network card through the PCIe bus. The method includes: translating a target virtual address into a target physical address, wherein the target physical address corresponds to one of the at least one accelerator; reading, based on the target physical address, target information stored at the target physical address from the accelerator corresponding to the target physical address; writing the target information into an interface control space of the central processing unit through the unified memory architecture bus; transmitting the target information from the interface control space of the central processing unit to the network card through the PCIe bus; and sending the target information using the network card.
For example, the method provided by an embodiment of the present disclosure further includes: setting an address buffer in a virtual address space of the accelerator and mapping the address buffer into a physical address space, where the physical address space is uniformly addressed by the central processing unit and the at least one accelerator under a unified memory architecture.
For example, in a method provided by an embodiment of the present disclosure, the physical address space includes a contiguous first address space and second address space, where the first address space is the address space of the central processing unit and the second address space is the address space of the at least one accelerator.
For example, in a method provided by an embodiment of the present disclosure, translating the target virtual address into the target physical address includes: converting the target virtual address into the target physical address according to the mapping relationship between the virtual address space and the physical address space.
For example, in a method provided by an embodiment of the present disclosure, reading, based on the target physical address, the target information stored at the target physical address from the accelerator corresponding to the target physical address includes: initiating a data read by a direct memory access engine based on the target physical address, and reading the target information stored at the target physical address from the accelerator corresponding to the target physical address.
For example, in a method provided by an embodiment of the present disclosure, transmitting the target information from the interface control space of the central processing unit to the network card through the PCIe bus includes: setting a memory address mapping interface space in the network card; and transmitting the target information from the interface control space of the central processing unit to the memory address mapping interface space of the network card through the PCIe bus.
For example, in a method provided by an embodiment of the present disclosure, the accelerator includes an accelerator memory, and the target information is stored in the accelerator memory.
For example, in a method provided by an embodiment of the present disclosure, the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
At least one embodiment of the present disclosure further provides a remote direct memory access method based on a heterogeneous computing system, where the heterogeneous computing system includes a central processing unit, a network card, and at least one accelerator, the central processing unit is connected to the at least one accelerator through a PCIe bus and a unified memory architecture bus, and the central processing unit is connected to the network card through the PCIe bus. The method includes: receiving target information using the network card; transmitting the target information from the network card to an interface control space of the central processing unit through the PCIe bus; and, based on a target physical address, writing the target information from the interface control space of the central processing unit into the accelerator corresponding to the target physical address through the unified memory architecture bus.
For example, the method provided by an embodiment of the present disclosure further includes: setting an address buffer in a virtual address space of the accelerator and mapping the address buffer into a physical address space, where the physical address space is uniformly addressed by the central processing unit and the at least one accelerator under a unified memory architecture.
For example, in a method provided by an embodiment of the present disclosure, the physical address space includes a contiguous first address space and second address space, where the first address space is the address space of the central processing unit and the second address space is the address space of the at least one accelerator.
For example, in the method provided by an embodiment of the present disclosure, the target physical address is obtained by converting an acquired PCIe system address.
For example, in a method provided by an embodiment of the present disclosure, the accelerator includes an accelerator memory, and the target information is written into the accelerator memory.
For example, in a method provided by an embodiment of the present disclosure, the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
At least one embodiment of the present disclosure further provides a heterogeneous computing system including a central processing unit, a network card and at least one accelerator, where the central processing unit is connected to the at least one accelerator through a PCIe bus and a unified memory architecture bus, and the central processing unit is connected to the network card through the PCIe bus. The heterogeneous computing system is configured to: translate a target virtual address into a target physical address, wherein the target physical address corresponds to one of the at least one accelerator; read, based on the target physical address, target information stored at the target physical address from the accelerator corresponding to the target physical address; write the target information into an interface control space of the central processing unit through the unified memory architecture bus; transmit the target information from the interface control space of the central processing unit to the network card through the PCIe bus; and send the target information using the network card.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the at least one accelerator includes a plurality of accelerators, and the plurality of accelerators are connected to each other through the unified memory architecture bus.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the central processing unit includes a plurality of dies packaged as a whole, the plurality of accelerators correspond one-to-one to the plurality of dies, and each accelerator is connected to its corresponding die through the PCIe bus and the unified memory architecture bus.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
For example, in the heterogeneous computing system provided by an embodiment of the present disclosure, the network card includes an InfiniBand network card, and the accelerator includes a graphics processing unit, a depth calculation unit, an artificial intelligence calculation unit, or a digital signal processing unit.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the accelerator includes an accelerator memory, and the accelerator memory includes a high bandwidth memory (HBM), a graphics double data rate memory (GDDR), or a low-power double data rate memory (LPDDR).
At least one embodiment of the present disclosure further provides a heterogeneous computing system including a central processing unit, a network card, and at least one accelerator, where the central processing unit is connected to the at least one accelerator through a PCIe bus and a unified memory architecture bus, and the central processing unit is connected to the network card through the PCIe bus. The heterogeneous computing system is configured to: receive target information using the network card; transmit the target information from the network card to an interface control space of the central processing unit through the PCIe bus; and, based on a target physical address, write the target information from the interface control space of the central processing unit into the accelerator corresponding to the target physical address through the unified memory architecture bus.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the at least one accelerator includes a plurality of accelerators, and the plurality of accelerators are connected to each other through the unified memory architecture bus.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the central processing unit includes a plurality of dies packaged as a whole, the plurality of accelerators correspond one-to-one to the plurality of dies, and each accelerator is connected to its corresponding die through the PCIe bus and the unified memory architecture bus.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
For example, in the heterogeneous computing system provided by an embodiment of the present disclosure, the network card includes an InfiniBand network card, and the accelerator includes a graphics processing unit, a depth calculation unit, an artificial intelligence calculation unit, or a digital signal processing unit.
For example, in a heterogeneous computing system provided by an embodiment of the present disclosure, the accelerator includes an accelerator memory, and the accelerator memory includes a high bandwidth memory (HBM), a graphics double data rate memory (GDDR), or a low-power double data rate memory (LPDDR).
At least one embodiment of the present disclosure further provides an electronic device including the heterogeneous computing system according to any one of the embodiments of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a flowchart illustrating a remote direct memory access method;
FIG. 2 is a flowchart illustrating another remote direct memory access method;
FIG. 3 is a schematic diagram of a heterogeneous computing system;
FIG. 4 is a schematic flowchart of a remote direct memory access method based on a heterogeneous computing system according to some embodiments of the present disclosure;
FIG. 5 is a schematic block diagram of a heterogeneous computing system according to some embodiments of the present disclosure;
FIG. 6A is a schematic flowchart of another remote direct memory access method based on a heterogeneous computing system according to some embodiments of the present disclosure;
FIG. 6B is a schematic diagram of address mapping provided by some embodiments of the present disclosure;
FIG. 7 is a flowchart illustrating a remote direct memory access method based on a heterogeneous computing system according to some embodiments of the present disclosure;
FIG. 8 is a flowchart illustrating another remote direct memory access method based on a heterogeneous computing system according to some embodiments of the present disclosure;
FIG. 9 is a flowchart illustrating a remote direct memory access method based on a heterogeneous computing system according to some embodiments of the present disclosure;
FIG. 10 is a schematic block diagram of a heterogeneous computing system provided by some embodiments of the present disclosure; and
FIG. 11 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used only to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Heterogeneous computing systems come in many types, and architectures that combine a CPU with accelerators are a common one. When such a heterogeneous computing system performs computation or data processing, accelerators on different nodes often need to exchange data directly with one another. Here, different nodes refer to, for example, different heterogeneous computing systems.
Fig. 1 is a flowchart illustrating a remote direct memory access method in which remote direct memory access (RDMA) is implemented in software. RDMA enables a device in one network node to access the memory of a device in a remote node directly, allowing direct transfers between remote-node memory and local-node memory. As shown in fig. 1, software allocates a buffer in the DDR (Double Data Rate Synchronous Dynamic Random-Access Memory, DDR SDRAM, conventionally abbreviated as DDR) memory of the CPU. The accelerator copies data from its device memory to the buffer of the local CPU by direct memory access (DMA). For example, the accelerator may be a depth calculation unit, and its device memory may be a high bandwidth memory (HBM).

A buffer is likewise allocated in the DDR memory of the remote node's CPU; the network card is then started, and the data in the local node's DDR memory is copied to the remote node's DDR memory. Here, the network card may be an InfiniBand card (IB network card), a network adapter based on the InfiniBand protocol that is mainly used for network data transmission between nodes. The CPU of the remote node starts DMA in the accelerator of the remote node, and the data in the buffer of the remote node's DDR memory is copied to the device memory (e.g., HBM) of the remote node's accelerator (e.g., a depth calculation unit). The accelerator of the remote node can then use the data in its device memory. The method shown in fig. 1 is the conventional approach and was used in early systems.
Fig. 2 is a flowchart illustrating another remote direct memory access method. Since accelerators are devices on a Peripheral Component Interconnect Express (PCIe) bus, a heterogeneous computing system can fully exploit PCIe's peer-to-peer (P2P) transport mechanism to implement RDMA. In a PCIe system architecture, a PCIe endpoint device may issue an upstream memory read/write access; if the access address hits the memory-mapped input/output (MMIO) buffer of another endpoint device in the PCIe system, the access is routed to that endpoint device. Such point-to-point access is referred to as P2P access. The MMIO buffer is the portion of an IO device's address space that can be addressed and accessed in the system through memory addresses.

As shown in fig. 2, the system software opens an MMIO buffer in the IB network card of the local node. The accelerator of the local node transfers data from its device memory to the MMIO buffer of the local node's IB network card using DMA in P2P fashion. The local node's IB network card then transmits the received data to the IB network card of the remote node. An MMIO buffer is opened in the accelerator of the remote node (for example, in the accelerator's device memory); the remote node's IB network card starts DMA and maps the received data into that MMIO buffer. The accelerator of the remote node can then use the data in its device memory. In current applications, the method shown in fig. 2 is the one typically employed to implement RDMA.

The method shown in fig. 2 still has several problems. Its RDMA bandwidth is bounded by the PCIe bandwidth: if the PCIe bandwidth is insufficient, RDMA efficiency suffers greatly. Moreover, RDMA here requires multiple address translations. For example, the virtual address GPUVM of the local node's accelerator must first be converted into a PCIe system address, and then, in the MMIO buffer of the remote node's accelerator, the remote node must map the received data to an actual physical memory address once again. These repeated address conversions during data transmission impose high hardware and software overhead, low communication efficiency, and large communication latency.
Fig. 3 is a schematic diagram of a heterogeneous computing system. As shown in fig. 3, the heterogeneous computing system includes a CPU, four accelerators (depth calculation units 0-3), and a network card 001. The CPU consists of four dies, DIE0-DIE3, packaged as a whole (a die is also referred to as a bare chip or silicon die). The four accelerator depth calculation units 0-3 are connected to the four dies DIE0-DIE3 in one-to-one correspondence, each accelerator being connected to its corresponding die through a PCIe bus. For example, each die has 32 serial-interface lanes, which can be used either for PCIe or for the xGMI bus that directly connects the CPU and the accelerators in the heterogeneous system. In this example the lanes are used for PCIe only, the CPU and accelerators are connected directly over PCIe, and the PCIe link width can be x16. The four accelerator depth calculation units 0-3 are connected to each other through a unified memory architecture (UMA) bus, for example an xGMI bus. Each die is connected to the network card 001, for example through a PCIe bus; the network card 001 is, for example, an IB network card.

In this heterogeneous computing system, the CPU and the accelerators are connected only through a PCIe bus; the system is therefore a non-UMA architecture and cannot exploit the advantages of a UMA architecture. If a UMA architecture were adopted, that is, if the CPU and the accelerators were interconnected through, for example, the xGMI bus, computing performance could be improved. However, because the xGMI bus and the PCIe bus share the serial-interface lanes, and the xGMI bus needs to occupy 16 lanes, the PCIe link width would be reduced from x16 to x8, halving the PCIe bandwidth and in turn affecting RDMA performance.
At least one embodiment of the disclosure provides a remote direct memory access method, a heterogeneous computing system and an electronic device. The method can reduce address mapping management, reduce the overhead of software and hardware in address translation and management, and ensure that the performance of remote direct memory access under a unified memory architecture is not restricted by PCIe bandwidth, thereby improving communication efficiency and performance, improving transmission efficiency and reducing the delay of remote direct memory access.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. It should be noted that the same reference numerals in different figures will be used to refer to the same elements that have been described.
At least one embodiment of the present disclosure provides a remote direct memory access method based on a heterogeneous computing system. The heterogeneous computing system includes a central processing unit, a network card, and at least one accelerator. The central processing unit is connected to the at least one accelerator through a PCIe bus and a unified memory architecture bus, and is connected to the network card through the PCIe bus. The method includes: translating a target virtual address into a target physical address, the target physical address corresponding to one of the at least one accelerator; reading, based on the target physical address, target information stored at the target physical address from the accelerator corresponding to that address; writing the target information into an interface control space of the central processing unit through the unified memory architecture bus; transmitting the target information from the interface control space of the central processing unit to the network card through the PCIe bus; and sending the target information using the network card.
Fig. 4 is a schematic flowchart of a remote direct memory access method based on a heterogeneous computing system according to some embodiments of the present disclosure; the method is used, for example, in the heterogeneous computing system serving as the sending end of a remote direct memory access interaction. As shown in fig. 4, the method includes the following operations.
Step S110: translating the target virtual address into a target physical address, wherein the target physical address corresponds to one of the at least one accelerator;
Step S120: based on the target physical address, reading target information stored at the target physical address from the accelerator corresponding to the target physical address;
Step S130: writing the target information into an interface control space of the central processing unit through the unified memory architecture bus;
Step S140: transmitting the target information from the interface control space of the central processing unit to the network card through the PCIe bus;
Step S150: sending the target information using the network card.
The above method is applied to, for example, a heterogeneous computing system shown in fig. 5. The heterogeneous computing system of fig. 5 is briefly described below, and then the method of fig. 4 is described in conjunction with fig. 5.
As shown in fig. 5, an embodiment of the present disclosure provides a heterogeneous computing system 100. The heterogeneous computing system 100 includes a central processing unit (CPU), a network card 001, and at least one accelerator, depth calculation units 0-3 (in this example the number of accelerators is 4, but embodiments of the present disclosure are not limited thereto). The CPU is connected to the network card 001 through a PCIe bus, and the network card 001 occupies 8 lanes (i.e., x8). The CPU is connected to the accelerator depth calculation units 0-3 through a PCIe bus and a unified memory architecture bus (e.g., an xGMI bus). For example, between the CPU and an accelerator, the xGMI bus occupies 16 lanes, i.e., the xGMI link width is x16; the remaining 8 lanes are allocated to the PCIe bus, i.e., the PCIe link width is x8. The accelerator depth calculation units 0-3 are connected to each other through the unified memory architecture bus (e.g., the xGMI bus).
For example, the CPU includes a plurality of dies DIE0-DIE3 packaged as a whole (in this example the number of dies is 4, but embodiments of the present disclosure are not limited thereto). The accelerator depth calculation units 0-3 correspond one-to-one to the dies DIE0-DIE3, and each accelerator is connected to its corresponding die through a PCIe bus and a unified memory architecture bus (e.g., an xGMI bus). The dies DIE0-DIE3 may have the same structure and function or different structures and functions; likewise, the accelerator depth calculation units 0-3 may have the same structure and function or different structures and functions. Embodiments of the present disclosure are not limited in this respect.
It should be noted that, in the embodiment of the present disclosure, the number of the accelerators is not limited, and may be 4, or may also be 1, 2, or any other number, which may be determined according to actual needs, and although 4 accelerators are taken as an example for description herein, this does not constitute a limitation to the embodiment of the present disclosure. Accordingly, the number of the dies included in the CPU is also not limited, and may be 4, or may be 1, 2, or any other number, which may be determined according to actual needs, and the embodiments of the present disclosure are not limited thereto. For example, in some examples, the number of accelerators is equal to the number of dies included in the CPU, thereby facilitating one-to-one connections.
For example, the unified memory architecture bus is not limited to the xGMI bus; it may also be a Gen-Z bus, a CCIX bus, a CXL bus, or any other suitable bus. For example, the network card 001 may be an InfiniBand (IB) network card or any device having a network transmission function. For example, the accelerator may be a graphics processing unit (GPU), a depth calculation unit, an artificial intelligence calculation unit, a digital signal processor (DSP), or any other type of accelerator. For example, the central processing unit (CPU) may be any type of processor and may adopt an x86 or ARM architecture, among others.
For example, the accelerator includes an accelerator memory, which is the device memory described above. The accelerator memory may be a High Bandwidth Memory (HBM), double data rate memory (DDR), graphics double data rate memory (GDDR), low power double data rate memory (LPDDR), or any other type of memory device.
The heterogeneous computing system 100 employs a UMA architecture, with the CPU and the accelerators connected through a PCIe bus as well as a unified memory architecture bus (e.g., an xGMI bus). In the heterogeneous computing system 100, the accelerator devices can still use standard PCIe for device enumeration and system initialization, and system interrupts are still delivered to the CPU through conventional PCIe interrupt vectors, ensuring maximum software compatibility with the existing ecosystem. Under this architecture, the main data path between the CPU and the accelerators is switched to the high-bandwidth, low-latency unified memory architecture bus (e.g., the xGMI bus) to support unified addressing.
The method shown in fig. 4 will be described with reference to fig. 5.
In step S110, the target virtual address is translated into a target physical address. For example, the target virtual address is the address of the data, stored in an accelerator, that needs to be transferred; it may be an address in the virtual address space of the accelerator. The target physical address may be an address in the physical address space that the central processing unit (CPU) and the accelerators address uniformly under the unified memory architecture (UMA). The virtual address space of the accelerator has a mapping relationship with this uniformly addressed physical address space, so translating the target virtual address into the target physical address may include: converting the target virtual address into the target physical address according to the mapping relationship between the virtual address space and the physical address space. For example, the mapping relationship may be stored as a page table in any local storage, and when address translation is required, the page table is queried to perform the translation, as sketched below.
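To make the lookup concrete, the following C sketch shows what such a page-table translation could look like. It is only an illustration under assumed parameters: the single-level table, the 4 KiB page size, and the names page_table and gpuvm_to_gpupa are inventions of this sketch, not the patented implementation.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12u                          /* assume 4 KiB pages */
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)
#define NUM_PAGES  (1u << 20)                   /* pages covered by this sketch */

/* One page-table entry: maps a virtual page of the address buffer to a
 * physical frame in the unified (UMA) physical address space. */
typedef struct {
    uint64_t pfn;    /* physical frame number (GPUPA >> PAGE_SHIFT) */
    int      valid;  /* nonzero if this virtual page is mapped      */
} pte_t;

static pte_t page_table[NUM_PAGES];  /* hypothetical single-level table */

/* Step S110 (sketch): translate a target virtual address (GPUVM) into
 * a target physical address (GPUPA) by a page-table lookup. */
int gpuvm_to_gpupa(uint64_t gpuvm, uint64_t *gpupa)
{
    uint64_t vpn = gpuvm >> PAGE_SHIFT;          /* virtual page number */
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return -1;                               /* unmapped address    */
    *gpupa = (page_table[vpn].pfn << PAGE_SHIFT) | (gpuvm & PAGE_MASK);
    return 0;
}
```

Note that this is the only translation the method requires: every later step of the send path works directly on the unified physical address.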
Since the target virtual address is an address in the virtual address space of the accelerator, the translated target physical address corresponds to the accelerator, and the target physical address is the physical address of the corresponding accelerator. For example, the target physical address may correspond to one of the accelerator depth calculation units 0-3, depending on the storage location of the data that needs to be transferred.
For example, in some examples, as shown in fig. 6A, before step S110 is executed, the direct memory access method provided in the embodiment of the present disclosure may further include step S160. Steps other than step S160 in fig. 6A are the same as the respective steps shown in fig. 4, and the description about fig. 4 may be referred to for the relevant explanation.
Step S160: an address buffer is set in the virtual address space of the accelerator and mapped into the physical address space. The physical address space is uniformly addressed by the central processing unit and the at least one accelerator under the unified memory architecture.
For example, as shown in fig. 6B, in some examples the virtual address GPUVM of the accelerator is 48 bits wide; these 48 bits define the virtual address space of the accelerator, which is the address space used directly by software and applications. The physical address GPUPA is 44 bits wide; these 44 bits define the physical address space uniformly addressed by the central processing unit and the accelerators under the unified memory architecture. The physical address GPUPA is the actual physical address generated from the accelerator's virtual address GPUVM through page-table lookup and translation. It should be noted that the virtual address GPUVM is not limited to 48 bits and the physical address GPUPA is not limited to 44 bits; each address may have any bit width, for example 32 bits, 52 bits, or 64 bits.
For example, software sets an address buffer FB (frame buffer) in the virtual address space of the accelerator for programs and applications; the address buffer FB is mapped onto the actual physical address space through the page table.
For example, the physical address space includes a contiguous first address space and second address space: the first address space is the address space of the central processing unit, and the second address space is the address space of the at least one accelerator. In the example shown in fig. 6B, the first address space is the CPU memory, and the second address space comprises the address spaces of the (e.g., 4) accelerator depth calculation units 0-3, i.e., the memories of depth calculation units 0-3 in fig. 6B. The address spaces of the accelerator depth calculation units 0-3 are themselves contiguous, and the CPU memory and the memories of depth calculation units 0-3 are uniformly addressed under the unified memory architecture.
For example, under the UMA architecture, the region from 0 to PF_OFFSET is set as the address space of the CPU; above the CPU's addresses, the address spaces of the accelerators are arranged in sequence, and the entire effective address space is contiguous. The specific value of PF_OFFSET may be set by a configuration register, for example according to the specification of the CPU.
For example, assuming the size of each accelerator's address space is X_OFFSET, the address range of accelerator depth calculation unit 0 is PF_OFFSET to PF_OFFSET + X_OFFSET, the address range of accelerator depth calculation unit 1 is PF_OFFSET + X_OFFSET to PF_OFFSET + X_OFFSET × 2, and the address ranges of accelerator depth calculation units 2 and 3 follow in the same way. Each accelerator also includes an accelerator memory; for example, accelerator depth calculation unit 1 includes a local HBM, whose UMC address is defined by 32 bits.
It should be noted that the sizes of the accelerators' address spaces may be equal or unequal, as actual requirements dictate; it is only necessary that the CPU and the accelerators be uniformly addressed under the unified memory architecture.
For example, the total size of the physical address space uniformly addressed by the CPU and the accelerators under the unified memory architecture is: CPU PA (i.e., PF_OFFSET) + depth-calculation-unit PA × W, where CPU PA denotes the size of the CPU's physical address space, depth-calculation-unit PA denotes the size of one accelerator's physical address space, and W denotes the number of accelerators. In the example shown in fig. 6B, W is 4.
For example, the address buffer FB is mapped into the entire available physical address space. When an accelerator needs to initiate an access, the object corresponding to the access's physical address can be reached directly. When the physical address lies in the region from 0 to PF_OFFSET, the accelerator sends the access onto the xGMI bus interconnected with the CPU, thereby accessing the CPU; when the physical address lies in another accelerator's space, the accelerator sends the access to that accelerator over the corresponding xGMI bus; and when the physical address falls within the accelerator's own address space, the accelerator directly accesses its local accelerator memory (e.g., HBM) through a unified memory controller (UMC). With this address allocation and management, unified partitioning and addressing of the physical addresses of the CPU and the accelerators is achieved, as the sketch below summarizes.
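The partitioning and routing decision just described can be condensed into a short sketch. The constants PF_OFFSET, X_OFFSET and NUM_ACCEL below are assumed placeholder values (the text leaves the real values to configuration registers), and route_access is an invented name; only the range checks mirror the description.

```c
#include <stdint.h>

#define PF_OFFSET  (1ull << 40)  /* assumed size of the CPU address space  */
#define X_OFFSET   (1ull << 36)  /* assumed size of each accelerator space */
#define NUM_ACCEL  4             /* W in the total-size formula above      */

/* Total unified space: PF_OFFSET + X_OFFSET * NUM_ACCEL. */

typedef enum { ROUTE_CPU, ROUTE_LOCAL_HBM, ROUTE_PEER } route_t;

/* Decide where an access issued by accelerator `self` must go, given a
 * physical address in the unified address space; `*peer` is set only
 * when the result is ROUTE_PEER. */
route_t route_access(uint64_t gpupa, int self, int *peer)
{
    if (gpupa < PF_OFFSET)
        return ROUTE_CPU;              /* sent over xGMI to the CPU       */

    int idx = (int)((gpupa - PF_OFFSET) / X_OFFSET);
    if (idx == self)
        return ROUTE_LOCAL_HBM;        /* local HBM, reached via the UMC  */

    *peer = idx;
    return ROUTE_PEER;                 /* another accelerator, over xGMI  */
}
```

One consequence visible in the sketch is that a single comparison against PF_OFFSET separates CPU traffic from accelerator traffic, which is what makes the single up-front translation of step S110 sufficient.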
It should be noted that, in fig. 6B, some preset spaces, such as the Legacy P2P Hole, AGP, and GART, are also arranged in the virtual address space defined by the accelerator's virtual address GPUVM. These preset spaces are required for the accelerator to be usable; they follow conventional designs and are not described in detail here.
For example, as shown in fig. 4, in step S120, based on the target physical address, the target information stored at the target physical address is read from the accelerator corresponding to that address. The target information is the data stored in an accelerator that needs to be transferred; more specifically, it is stored in the accelerator's memory, for example in the HBM. In some examples, this reading may include: based on the target physical address, a direct memory access engine (DMA engine) initiates a data read and reads the target information stored at the target physical address from the corresponding accelerator. That is, the target information can be read by DMA.
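As a rough picture of how a DMA engine might be handed such a read, the sketch below programs a descriptor with the source physical address and length. The descriptor layout, ring, and doorbell register are assumptions made for illustration; the actual engine interface is not specified in the text.

```c
#include <stdint.h>

/* Hypothetical DMA descriptor: the engine reads `len` bytes starting
 * at the unified physical address `src_gpupa` and delivers them to
 * `dst_addr`, which the address map routes (e.g., to the CPU's IOC). */
typedef struct {
    uint64_t src_gpupa;   /* target physical address in accelerator memory */
    uint64_t dst_addr;    /* destination in the unified address space      */
    uint32_t len;         /* transfer size in bytes                        */
    uint32_t flags;       /* e.g., raise a completion interrupt            */
} dma_desc_t;

static dma_desc_t         ring[64];   /* assumed descriptor ring        */
static unsigned           ring_tail;
static volatile uint32_t *doorbell;   /* assumed MMIO doorbell register */

/* Step S120 (sketch): program the DMA engine to read the target
 * information stored at `gpupa`. */
void dma_read_target(uint64_t gpupa, uint64_t dst, uint32_t len)
{
    ring[ring_tail % 64] = (dma_desc_t){
        .src_gpupa = gpupa, .dst_addr = dst, .len = len, .flags = 0 };
    ring_tail++;
    if (doorbell)
        *doorbell = ring_tail;        /* notify the engine of new work */
}
```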
For example, in step S130, the target information is written into the interface control space of the CPU via the unified memory architecture bus (e.g., the xGMI bus). In the heterogeneous computing system 100, since the CPU and the accelerator are connected by the xGMI bus, the transmission speed and the transmission efficiency can be improved by transmitting the target information from the accelerator to the CPU using the xGMI bus. Of course, the unified memory architecture bus is not limited to the xGMI bus, but may be a Gen-Z bus, a CCIX bus, a CXL bus, or any other suitable bus. For example, the interface control space of a CPU is the IO controller (IOC) of the CPU, which is used to interact with components connected to the CPU through IO.
For example, in step S140, the target information is transmitted from the interface control space (i.e., IOC) of the CPU to the network card 001 through the PCIe bus. For example, in some examples, step S140 may further include: setting a memory address mapping interface space in the network card 001; and transmitting the target information from the interface control space of the CPU to the memory address mapping interface space of the network card 001 through a PCIe bus. For example, the memory address mapping interface space is an MMIO space, and the target information is transmitted to the MMIO space of the network card 001.
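In driver terms, step S140 amounts to storing the data into the network card's memory-mapped window. The following is a minimal sketch, assuming the MMIO window has already been mapped through a PCIe BAR during initialization; nic_mmio_base and mmio_write_buf are invented names, not a real driver API.

```c
#include <stdint.h>
#include <stddef.h>

/* Base of the NIC's memory address mapping interface (MMIO) window,
 * assumed to have been mapped via a PCIe BAR during initialization. */
static volatile uint8_t *nic_mmio_base;

/* Step S140 (sketch): copy the target information from the CPU side
 * into the NIC's MMIO space; each store below becomes a PCIe memory
 * write routed to the network card. */
void mmio_write_buf(size_t offset, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        nic_mmio_base[offset + i] = src[i];
}
```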
For example, in step S150, the target information is sent using the network card 001. For example, the network card 001 may be an IB network card, in which case the target information is sent out over the network. The network card 001 may be any device having a network transmission function and is not limited to an IB network card; the target information is then sent out through the network suited to that card. For example, the target information may be sent to a receiving end, such as the heterogeneous computing system of a remote node. In some examples, when the target information is sent, it may also undergo conventional packaging, encoding, and similar processing, so that the processed information is better suited to transmission through the network card 001.
By performing the above steps S110-S150, the target information can be transmitted from a certain accelerator of the heterogeneous computing system 100 as a transmitting end to a receiving end, for example, from a local node to a remote node, thereby achieving the transmission of the target information. Of course, the heterogeneous computing system 100 as the sender may be located in any location, not limited to the local node, and this may be determined according to actual needs.
In the process of sending the target information, only one address translation is needed, that is, the target virtual address is converted into the target physical address in step S110, and other steps are performed based on unified addressing under the UMA architecture, and conversion between the virtual address and the physical address is no longer needed, so that address mapping management can be reduced, and overhead of software and hardware in address translation and management can be reduced. Although the bandwidth of PCIe in the heterogeneous computing system 100 used in the remote direct memory access method provided in the embodiment of the present disclosure is reduced by half compared with the heterogeneous computing system shown in fig. 3, due to the adoption of the UMA architecture, the performance of remote direct memory access is not limited by the PCIe bandwidth.
The remote direct memory access method provided by the embodiments of the present disclosure fully exploits the unified addressing of the CPU's and accelerators' physical address spaces under the UMA architecture: it replaces the PCIe P2P access path of common remote direct memory access methods with communication over the dedicated data path between the CPU and the accelerators, and it adopts direct physical address mapping. This reduces address mapping management and the software and hardware overhead of address translation and management, frees the performance of remote direct memory access from the PCIe bandwidth constraint, improves communication efficiency and performance, raises transmission efficiency, and lowers the latency of remote direct memory access.
At least one embodiment of the present disclosure further provides a remote direct memory access method based on a heterogeneous computing system. The heterogeneous computing system includes a central processing unit, a network card, and at least one accelerator; the central processing unit is connected to the at least one accelerator through a PCIe bus and a unified memory architecture bus, and is connected to the network card through the PCIe bus. The method includes: receiving target information using the network card; transmitting the target information from the network card to an interface control space of the central processing unit through the PCIe bus; and, based on a target physical address, writing the target information from the interface control space of the central processing unit into the accelerator corresponding to the target physical address through the unified memory architecture bus.
Fig. 7 is a flowchart illustrating a remote direct memory access method according to some embodiments of the present disclosure; the method is used, for example, in the heterogeneous computing system serving as the receiving end of a remote direct memory access interaction. As shown in fig. 7, the method includes the following operations.
Step S210: receiving target information using the network card;
Step S220: transmitting the target information from the network card to an interface control space of the central processing unit through the PCIe bus;
Step S230: based on a target physical address, writing the target information from the interface control space of the central processing unit into the accelerator corresponding to the target physical address through the unified memory architecture bus.
For example, the method illustrated in fig. 7 may be applied to the heterogeneous computing system 100 illustrated in fig. 5. The heterogeneous computing system 100 includes a central processing unit (CPU), a network card 001, and at least one accelerator, depth calculation units 0-3 (in this example the number of accelerators is 4, but embodiments of the present disclosure are not limited thereto). The CPU is connected to the network card 001 through a PCIe bus, and is connected to the accelerator depth calculation units 0-3 through a PCIe bus and a unified memory architecture bus (e.g., an xGMI bus). For a description of the heterogeneous computing system 100, reference is made to the above; it is not repeated here.
For example, in step S210, the target information is received by the network card 001. For example, the network card 001 may be an IB network card, in which case the target information is received over the corresponding network. The network card 001 may be any device having a network transmission function and is not limited to an IB network card; the target information is then received through the appropriate network. For example, the target information transmitted by a heterogeneous computing system acting as the sending end may be received, the target information being the data sent by that end. In some examples, if the received target information has been packaged, encoded, or similarly processed, it may be conventionally unpacked, decoded, and so on, so that the processed information is better suited to data transmission within the heterogeneous computing system 100.
For example, in step S220, the target information is transmitted from the network card 001 to the interface control space (i.e., the IOC) of the CPU through the PCIe bus. The IOC is used to interact with components connected to the CPU through IO. At this point, software provides a PCIe system address corresponding to the physical address at which the target information is to be stored. For example, the physical address for storing the target information is specified by software, depending on the software's operating conditions and actual requirements.
For example, in step S230, the PCIe system address is converted into a target physical address, and then based on the target physical address, target information is written from an interface control space (i.e., IOC) of the CPU to an accelerator corresponding to the target physical address through a unified memory architecture bus (e.g., xGMI bus). That is, the target physical address is obtained by converting the obtained PCIe system address, and the target physical address points to a certain accelerator, so that the target information can be written into the accelerator through the xGMI bus. The accelerator includes accelerator memory (e.g., HBM) into which the target information is written. In the heterogeneous computing system 100, since the CPU and the accelerator are connected by the xGMI bus, the transmission speed and the transmission efficiency can be improved by transmitting the target information from the CPU to the accelerator using the xGMI bus. Of course, the unified memory architecture bus is not limited to the xGMI bus, but may be a Gen-Z bus, a CCIX bus, a CXL bus, or any other suitable bus.
For example, taking the addresses shown in fig. 6B as an example, under the UMA architecture, when the target physical address lies in the range PF_OFFSET to PF_OFFSET + X_OFFSET, the target information is sent to accelerator depth calculation unit 0 through the xGMI bus; when the target physical address lies in the range PF_OFFSET + X_OFFSET to PF_OFFSET + X_OFFSET × 2, the target information is sent to accelerator depth calculation unit 1 through the xGMI bus; and so on.
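The receive-side conversion and routing can be sketched similarly. Note that the text does not detail how the PCIe system address is converted into the target physical address, so the linear WINDOW_BASE mapping below is purely an assumption; only the accelerator selection mirrors the ranges just described.

```c
#include <stdint.h>

#define PF_OFFSET (1ull << 40)  /* assumed size of the CPU address space  */
#define X_OFFSET  (1ull << 36)  /* assumed size of each accelerator space */

/* Assumption: the PCIe system-address window is a linear alias of the
 * unified physical address space at WINDOW_BASE; the real conversion
 * mechanism is not specified in the text. */
#define WINDOW_BASE 0x8000000000ull

static uint64_t pcie_to_gpupa(uint64_t pcie_sys_addr)
{
    return pcie_sys_addr - WINDOW_BASE;
}

/* Step S230 (sketch): choose the destination accelerator from the
 * translated target physical address; the write then travels over
 * the xGMI bus into that accelerator's memory. */
static int dest_accelerator(uint64_t gpupa)
{
    if (gpupa < PF_OFFSET)
        return -1;   /* address belongs to CPU memory, not an accelerator */
    return (int)((gpupa - PF_OFFSET) / X_OFFSET);
}
```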
For example, in some examples, as shown in fig. 8, before performing step S210, the direct memory access method provided in the embodiment of the present disclosure may further include step S240. The steps other than step S240 in fig. 8 are the same as the respective steps shown in fig. 7, and the description about fig. 7 may be referred to for the relevant explanation.
Step S240: an address buffer is set in the virtual address space of the accelerator and mapped into the physical address space. The physical address space is uniformly addressed by the central processing unit and the at least one accelerator under the unified memory architecture.
For example, the physical address space includes a contiguous first address space and second address space: the first address space is the address space of the central processing unit, and the second address space is the address space of the at least one accelerator.
For a detailed description of the physical address space uniformly addressed by the cpu and the at least one accelerator under the unified memory architecture, reference may be made to the description about step S160 in fig. 6A and fig. 6B, which are not repeated herein.
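As one way to picture step S240, the C sketch below fills a page table so that an address buffer in the accelerator's virtual address space is mapped, page by page, into the unified physical address space. The entry layout and page size are assumptions made for illustration, not the disclosure's actual format.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define PAGE_SHIFT 16u                 /* 64 KiB pages, an illustrative choice */
    #define PAGE_SIZE  (1ull << PAGE_SHIFT)

    /* One entry of a hypothetical accelerator page table: a page of the
     * accelerator virtual address space mapped onto a page of the unified
     * physical address space (first address space: CPU; second: accelerators). */
    struct pte {
        uint64_t virt_page;
        uint64_t phys_page;
    };

    /* Map an address buffer of len bytes at accelerator virtual address va
     * onto unified physical address pa; returns the number of pages mapped. */
    size_t map_address_buffer(struct pte *pt, size_t pt_cap,
                              uint64_t va, uint64_t pa, uint64_t len)
    {
        size_t n = 0;
        for (uint64_t off = 0; off < len && n < pt_cap; off += PAGE_SIZE, n++) {
            pt[n].virt_page = (va + off) >> PAGE_SHIFT;
            pt[n].phys_page = (pa + off) >> PAGE_SHIFT;
        }
        return n;
    }

    int main(void)
    {
        struct pte pt[8];
        size_t n = map_address_buffer(pt, 8, 0x10000, 0x1000000000ull, 3 * PAGE_SIZE);
        printf("%zu pages mapped\n", n);   /* -> 3 pages mapped */
        return 0;
    }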
By performing the above steps S210 to S230, the target information sent by the transmitting end can be received and stored in an accelerator of the heterogeneous computing system 100 acting as the receiving end, completing the reception of the target information.
In the process of receiving the target information, all steps are carried out on the basis of unified addressing under the UMA architecture, and no conversion between virtual addresses and physical addresses is required; this reduces address mapping management and the software and hardware overhead of address translation and management. Although the PCIe bandwidth of the heterogeneous computing system 100 used in the remote direct memory access method provided by the embodiments of the present disclosure is halved compared with the heterogeneous computing system shown in fig. 3, the UMA architecture means that the performance of remote direct memory access is not limited by the PCIe bandwidth. Communication efficiency and performance can therefore be improved, transmission efficiency is raised, and the latency of remote direct memory access is reduced.
Fig. 9 is a flowchart illustrating another remote direct memory access method based on a heterogeneous computing system according to some embodiments of the disclosure. The operations performed by the sender and the receiver in the Remote Direct Memory Access (RDMA) process are briefly described below with reference to fig. 9.
For example, as shown in fig. 9, the heterogeneous computing system 100A is the sending end and the heterogeneous computing system 100B is the receiving end, both being implemented as the heterogeneous computing system 100 shown in fig. 5. The heterogeneous computing system 100A is located at the local node, the heterogeneous computing system 100B is located at a remote node, and the heterogeneous computing system 100A needs to send target information to the heterogeneous computing system 100B to implement RDMA.
First, an address translator in the heterogeneous computing system 100A translates the target virtual address (GPUVM) into a target physical address (GPUPA) corresponding to one accelerator in the heterogeneous computing system 100A through a page table. Then, based on the target physical address, the DMA engine starts a data read and reads the target information stored at the target physical address from the accelerator memory (e.g., HBM) of the corresponding accelerator; that is, the target information is read in DMA fashion. The target information is then written to the interface control space (IOC) of the CPU over the xGMI bus; since the IOC occupies the address range [0, PF_OFFSET), the access is sent directly to the CPU-side IOC via the xGMI bus. The IOC then performs a linear mapping and transmits the target information from the interface control space of the CPU to the MMIO space of the network card over the PCIe bus. The network card in the heterogeneous computing system 100A receives the write operation and sends the target information over the network to the network card of the heterogeneous computing system 100B serving as the receiving end.
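This send path can be summarized in code. The sketch below simulates the stages with plain buffers so that it actually runs; every name is hypothetical, and a single memcpy stands in for the DMA read, the xGMI write into the IOC, and the IOC's linear mapping onto the network card's MMIO space, since each of those stages only forwards the same bytes.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    enum { HBM_SIZE = 4096, NIC_MMIO_SIZE = 4096 };
    static uint8_t hbm[HBM_SIZE];           /* accelerator memory (HBM) */
    static uint8_t nic_mmio[NIC_MMIO_SIZE]; /* network card MMIO space  */

    /* Stage 1: the single page-table walk, GPUVM -> GPUPA. A fixed offset
     * stands in for the real address translator. */
    static uint64_t address_translator_lookup(uint64_t va) { return va - 0x8000; }

    /* Stages 2-4: DMA read from HBM, xGMI write into the CPU's IOC, and
     * the IOC's linear mapping over PCIe into the NIC MMIO space. */
    static void send_target_info(uint64_t target_va, size_t len)
    {
        uint64_t target_pa = address_translator_lookup(target_va);
        memcpy(nic_mmio, &hbm[target_pa], len);
        printf("NIC sends %zu bytes to the remote node\n", len); /* stage 5 */
    }

    int main(void)
    {
        memcpy(hbm, "target information", 18);
        send_target_info(0x8000, 18);
        return 0;
    }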
After the network card in the heterogeneous computing system 100B receives the target information, the target information is transmitted to the interface control space (IOC) of the CPU through the PCIe bus. In the heterogeneous computing system 100B, software provides a PCIe system address, which is translated into a target physical address. Then, upon receiving the network card's access, the CPU writes the target information from its interface control space (IOC) into the accelerator memory (e.g., HBM) of the accelerator corresponding to the target physical address through the xGMI bus, based on the target physical address and according to the address range of each accelerator.
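The receive path can be sketched in the same style. Here the software-provided PCIe system address is modeled as a simple linear mapping into the unified physical address space, and the address decode picks which accelerator's HBM the write lands in; as before, all names and constants are illustrative assumptions, not the disclosure's values.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define PF_OFFSET  0x1000000000ULL   /* same hypothetical layout as above */
    #define DCU_OFFSET 0x0400000000ULL

    enum { NUM_DCU = 4, DCU_HBM_SIZE = 4096 };
    static uint8_t dcu_hbm[NUM_DCU][DCU_HBM_SIZE];  /* per-accelerator HBM */

    /* Software provides the PCIe system address; its conversion to a target
     * physical address is modeled as a fixed linear mapping. */
    static uint64_t pcie_sysaddr_to_pa(uint64_t sysaddr) { return PF_OFFSET + sysaddr; }

    /* NIC -> PCIe -> IOC -> xGMI -> accelerator HBM, collapsed into the
     * address decode plus one copy. */
    static void receive_target_info(const uint8_t *wire, size_t len, uint64_t sysaddr)
    {
        uint64_t pa  = pcie_sysaddr_to_pa(sysaddr);
        uint64_t dcu = (pa - PF_OFFSET) / DCU_OFFSET;  /* which accelerator   */
        uint64_t off = (pa - PF_OFFSET) % DCU_OFFSET;  /* offset into its HBM */
        if (dcu >= NUM_DCU || off + len > DCU_HBM_SIZE)
            return;                                /* outside the simulation */
        memcpy(&dcu_hbm[dcu][off], wire, len);     /* xGMI write into HBM    */
        printf("wrote %zu bytes into depth calculation unit %u\n",
               len, (unsigned)dcu);
    }

    int main(void)
    {
        const uint8_t msg[] = "target information";
        receive_target_info(msg, sizeof msg, 0);   /* lands in unit 0, offset 0 */
        return 0;
    }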
Thus, the target information is transferred from an accelerator of the heterogeneous computing system 100A to an accelerator of the heterogeneous computing system 100B, implementing RDMA. Over the whole transfer only one address translation is needed, namely the address translator in the heterogeneous computing system 100A converting the target virtual address into the target physical address; all other steps proceed on the basis of unified addressing under the UMA architecture, so no further conversion between virtual and physical addresses, and no page table walk, is required. This reduces address mapping management and the software and hardware overhead of address translation and management. Although the PCIe bandwidth of the heterogeneous computing system 100 used in the remote direct memory access method provided by the embodiments of the present disclosure is halved compared with the heterogeneous computing system shown in fig. 3, the UMA architecture means that the performance of remote direct memory access is not limited by the PCIe bandwidth. Communication efficiency and performance can therefore be improved, transmission efficiency is raised, and the latency of remote direct memory access is reduced.
It should be noted that the target physical address used by the heterogeneous computing system 100A when sending the target information and the target physical address used by the heterogeneous computing system 100B when receiving it are not related; they may be the same or different, as determined by the software settings in the RDMA process and influenced by the respective architectures and addressing modes of the heterogeneous computing systems 100A and 100B. Likewise, the unified addressing scheme of the heterogeneous computing system 100A and that of the heterogeneous computing system 100B may be the same or different. For example, the target information in accelerator depth calculation unit 0 of the heterogeneous computing system 100A may be sent to any one of accelerator depth calculation units 0 to 3 of the heterogeneous computing system 100B.
It should be noted that, in fig. 9, the memory controller and the GPU Data Fabric are components or modules required for the operation of the accelerator; CAKE is the interface and associated protocol required when the accelerator and the CPU are interconnected through the xGMI bus; the CPU Data Fabric is a component or module required for the operation of the CPU; and the PCIe Root Complex is the interface and associated protocol required when the CPU and the network card are interconnected through the PCIe bus. For their details, reference may be made to conventional designs, which are not described here.
It should be noted that the remote direct memory access method provided by the embodiment of the present disclosure is not limited to the steps described above, and may include further steps. The order of execution of the various steps is not limited, and although the various steps are described above in a particular order, this is not to be construed as a limitation on embodiments of the disclosure.
At least one embodiment of the present disclosure further provides a heterogeneous computing system which, when performing remote direct memory access, can reduce address mapping management and the software and hardware overhead of address translation and management, so that the performance of remote direct memory access under a unified memory architecture is not limited by the PCIe bandwidth; communication efficiency and performance are thereby improved, transmission efficiency is raised, and the latency of remote direct memory access is reduced. In addition, the heterogeneous computing system makes full use of the unified addressing of the CPU and accelerator physical address spaces under the UMA architecture, allowing software to remain compatible with the existing ecosystem to the greatest extent.
Fig. 10 is a schematic block diagram of a heterogeneous computing system provided in some embodiments of the present disclosure. As shown in fig. 10, the heterogeneous computing system 200 includes a central processing unit 210, a network card 220, and at least one accelerator 230. For example, the central processing unit 210 is connected to the at least one accelerator 230 via a high-speed serial computer expansion bus standard bus (PCIe bus) and a unified memory architecture bus (e.g., xGMI bus), and the central processing unit 210 is connected to the network card 220 via a high-speed serial computer expansion bus standard bus (PCIe bus).
For example, in some embodiments, the heterogeneous computing system 200 is configured to: translate the target virtual address to a target physical address, the target physical address corresponding to one of the at least one accelerator 230; read, based on the target physical address, the target information stored at the target physical address from the accelerator 230 corresponding to the target physical address; write the target information into the interface control space of the central processing unit 210 through the unified memory architecture bus (e.g., the xGMI bus); transmit the target information from the interface control space of the central processing unit 210 to the network card 220 through the high-speed serial computer expansion bus standard bus (PCIe bus); and send the target information using the network card 220. That is, the heterogeneous computing system 200 is configured to implement the remote direct memory access method shown in fig. 4, and the heterogeneous computing system 200 may be the heterogeneous computing system 100A in fig. 9, serving as the sending end that sends out the target information.
For example, in other embodiments, the heterogeneous computing system 200 is configured to: receive the target information using the network card 220; transmit the target information from the network card 220 to the interface control space of the central processing unit 210 through the high-speed serial computer expansion bus standard bus (PCIe bus); and, based on the target physical address, write the target information from the interface control space of the central processing unit 210 into the accelerator 230 corresponding to the target physical address through the unified memory architecture bus (e.g., the xGMI bus). That is, the heterogeneous computing system 200 is configured to implement the remote direct memory access method shown in fig. 7, and the heterogeneous computing system 200 may be the heterogeneous computing system 100B in fig. 9, serving as the receiving end that receives the target information.
For example, in still other embodiments, the heterogeneous computing system 200 is configured to: translate a first target virtual address into a first target physical address, the first target physical address corresponding to one of the at least one accelerator 230; read, based on the first target physical address, the first target information stored at the first target physical address from the accelerator 230 corresponding to the first target physical address; write the first target information into the interface control space of the central processing unit 210 through the unified memory architecture bus (e.g., the xGMI bus); transmit the first target information from the interface control space of the central processing unit 210 to the network card 220 through the high-speed serial computer expansion bus standard bus (PCIe bus); send the first target information using the network card 220; receive second target information using the network card 220; transmit the second target information from the network card 220 to the interface control space of the central processing unit 210 through the high-speed serial computer expansion bus standard bus (PCIe bus); and, based on a second target physical address, write the second target information from the interface control space of the central processing unit 210 into the accelerator 230 corresponding to the second target physical address through the unified memory architecture bus (e.g., the xGMI bus). That is, the heterogeneous computing system 200 is configured to implement both the remote direct memory access method shown in fig. 4 and that shown in fig. 7; it may act as a sending end that sends out the first target information and as a receiving end that receives the second target information, thereby exchanging data with multiple remote nodes and performing multi-node, parallel remote direct memory access.
The heterogeneous computing system 200 may be, for example, the heterogeneous computing system 100 shown in fig. 5. The heterogeneous computing system 100 includes a plurality of accelerator depth calculation units 0 to 3 (in this example the number of accelerators is 4, although the embodiments of the present disclosure are not limited thereto), and the plurality of accelerator depth calculation units 0 to 3 are connected to one another through the unified memory architecture bus (e.g., the xGMI bus).
For example, in some examples, the central processing unit (CPU) includes a plurality of dies DIE0-DIE3 packaged together (in this example the number of dies is 4, although the embodiments of the present disclosure are not limited thereto). The plurality of accelerator depth calculation units 0-3 correspond one-to-one with the plurality of dies DIE0-DIE3, and each accelerator is connected to its corresponding die through the high-speed serial computer expansion bus standard bus (PCIe bus) and the unified memory architecture bus (e.g., the xGMI bus).
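This one-to-one pairing can be written down directly; the C sketch below is a purely illustrative description of the topology of fig. 5 (the die and unit numbers follow the text, and everything else is an assumption).

    #include <stdio.h>

    /* Illustrative pairing for fig. 5: four CPU dies packaged together,
     * each wired to its own accelerator by a PCIe link and an xGMI link. */
    struct pairing { int die; int dcu; };

    static const struct pairing topology[] = {
        { 0, 0 }, { 1, 1 }, { 2, 2 }, { 3, 3 },
    };

    int main(void)
    {
        int n = (int)(sizeof topology / sizeof topology[0]);
        for (int i = 0; i < n; i++)
            printf("DIE%d <-> depth calculation unit %d via PCIe + xGMI\n",
                   topology[i].die, topology[i].dcu);
        return 0;
    }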
For example, the unified memory architecture bus is not limited to the xGMI bus; it may also be a Gen-Z bus, a CCIX bus, a CXL bus, or any other suitable bus. For example, the network card 220 may be an Infiniband (IB) network card or any other device having a network transmission function. For example, the accelerator may be a graphics processing unit (GPU), a depth calculation unit, an artificial intelligence calculation unit, a digital signal processing unit (DSP), or any other type of accelerator. For example, the central processing unit (CPU) may be any type of processor and may adopt an X86 or ARM architecture, among others.
For example, the accelerator includes an accelerator memory, which is the device memory described above. The accelerator memory may be a High Bandwidth Memory (HBM), double data rate memory (DDR), graphics double data rate memory (GDDR), low power double data rate memory (LPDDR), or any other type of memory device.
It should be noted that, in the embodiments of the present disclosure, the number of accelerators is not limited: it may be 4, or 1, 2, or any other number, as actual needs dictate; the 4 accelerators described herein are illustrative only and do not limit the embodiments of the present disclosure. Likewise, the number of dies included in the CPU is not limited and may be 4, 1, 2, or any other number, as actual needs dictate. For example, in some examples, the number of accelerators equals the number of dies included in the CPU, which facilitates the one-to-one connection.
It should be noted that the heterogeneous computing system 200 may also include more components or modules according to actual needs, and the embodiments of the present disclosure are not limited thereto. For detailed description and technical effects of the heterogeneous computing system 200, reference may be made to the above description of the heterogeneous computing system 100 and the remote direct memory access method, which are not described herein again.
At least one embodiment of the present disclosure also provides an electronic device including the heterogeneous computing system according to any one of the embodiments of the present disclosure. When the electronic device performs remote direct memory access, it can reduce address mapping management and the software and hardware overhead of address translation and management, so that the performance of remote direct memory access under a unified memory architecture is not limited by the PCIe bandwidth; communication efficiency and performance are thereby improved, transmission efficiency is raised, and the latency of remote direct memory access is reduced. In addition, the electronic device makes full use of the unified addressing of the CPU and accelerator physical address spaces under the UMA architecture, allowing software to remain compatible with the existing ecosystem to the greatest extent.
Fig. 11 is a schematic block diagram of an electronic device provided in some embodiments of the present disclosure. As shown in fig. 11, the electronic device 300 includes a heterogeneous computing system 310, where the heterogeneous computing system 310 is a heterogeneous computing system provided by any of the embodiments of the present disclosure, such as the heterogeneous computing system 100 shown in fig. 5 or the heterogeneous computing system 200 shown in fig. 10. The electronic device 300 may implement the remote direct memory access method shown in fig. 4, or may implement the remote direct memory access method shown in fig. 7, or may implement the remote direct memory access methods shown in fig. 4 and fig. 7 at the same time. For example, the electronic device 300 may be implemented as a personal computer, a computer system, a server cluster, or any other device including heterogeneous computing systems, and the embodiments of the present disclosure are not limited thereto.
It should be noted that the electronic device 300 may further include more components or modules according to actual needs, and the embodiments of the present disclosure are not limited in this regard. For detailed description and technical effects of the electronic device 300, reference may be made to the above description of the heterogeneous computing system 100, the heterogeneous computing system 200, and the remote direct memory access method, which is not repeated here.
The following points need to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; for other structures, reference may be made to conventional designs.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims (25)

1. A remote direct memory access method based on a heterogeneous computing system is disclosed, wherein the heterogeneous computing system comprises a central processing unit, a network card and at least one accelerator, the central processing unit is connected with the at least one accelerator through a high-speed serial computer expansion bus standard bus and a unified memory architecture bus, the central processing unit is connected with the network card through the high-speed serial computer expansion bus standard bus,
the method comprises the following steps:
setting an address buffer in a virtual address space of the accelerator, and mapping the address buffer into a physical address space, wherein the physical address space is a physical address space uniformly addressed by the central processing unit and the at least one accelerator under a uniform memory architecture;
translating a target virtual address to a target physical address, wherein the target physical address corresponds to one of the at least one accelerator;
reading target information stored in the target physical address from an accelerator corresponding to the target physical address based on the target physical address;
writing the target information into an interface control space of the central processing unit through the unified memory architecture bus;
transmitting the target information from an interface control space of the central processing unit to the network card through the high-speed serial computer expansion bus standard bus;
and sending the target information by utilizing the network card.
2. The method of claim 1, wherein the physical address space comprises a contiguous first address space and a second address space, the first address space being an address space of the central processing unit and the second address space being an address space of the at least one accelerator.
3. The method of claim 1, wherein translating the target virtual address to the target physical address comprises:
and converting the target virtual address into the target physical address according to the mapping relation between the virtual address space and the physical address space.
4. The method of claim 1, wherein reading target information stored in the target physical address from an accelerator corresponding to the target physical address based on the target physical address comprises:
and starting data reading by a direct memory access engine based on the target physical address, and reading target information stored in the target physical address from an accelerator corresponding to the target physical address.
5. The method of claim 1, wherein transmitting the target information from the interface control space of the central processing unit to the network card over the high speed serial computer expansion bus standard bus comprises:
setting a memory address mapping interface space in the network card;
and transmitting the target information from an interface control space of the central processing unit to the memory address mapping interface space of the network card through the high-speed serial computer expansion bus standard bus.
6. The method of any of claims 1-5, wherein the accelerator includes an accelerator memory, the target information being stored in the accelerator memory.
7. The method of any of claims 1-5, wherein the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
8. A remote direct memory access method based on a heterogeneous computing system is disclosed, wherein the heterogeneous computing system comprises a central processing unit, a network card and at least one accelerator, the central processing unit is connected with the at least one accelerator through a high-speed serial computer expansion bus standard bus and a unified memory architecture bus, the central processing unit is connected with the network card through the high-speed serial computer expansion bus standard bus,
the method comprises the following steps:
setting an address buffer in a virtual address space of the accelerator, and mapping the address buffer into a physical address space, wherein the physical address space is a physical address space uniformly addressed by the central processing unit and the at least one accelerator under a uniform memory architecture;
receiving target information by using the network card;
transmitting the target information from the network card to an interface control space of the central processing unit through the high-speed serial computer expansion bus standard bus;
and writing the target information into the accelerator corresponding to the target physical address from the interface control space of the central processing unit through the unified memory architecture bus based on the target physical address.
9. The method of claim 8, wherein the physical address space comprises a contiguous first address space and a second address space, the first address space being an address space of the central processing unit and the second address space being an address space of the at least one accelerator.
10. The method of claim 8, wherein the target physical address is translated from the retrieved high speed serial computer expansion bus standard system address.
11. The method of any of claims 8-10, wherein the accelerator includes an accelerator memory, the target information being written to the accelerator memory.
12. The method of any of claims 8-10, wherein the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
13. A heterogeneous computing system comprises a central processing unit, a network card and at least one accelerator, wherein the central processing unit is connected with the at least one accelerator through a high-speed serial computer expansion bus standard bus and a unified memory architecture bus, the central processing unit is connected with the network card through the high-speed serial computer expansion bus standard bus,
the heterogeneous computing system is configured to:
setting an address buffer in a virtual address space of the accelerator, and mapping the address buffer into a physical address space, wherein the physical address space is a physical address space uniformly addressed by the central processing unit and the at least one accelerator under a uniform memory architecture;
translating a target virtual address to a target physical address, wherein the target physical address corresponds to one of the at least one accelerator;
reading target information stored in the target physical address from an accelerator corresponding to the target physical address based on the target physical address;
writing the target information into an interface control space of the central processing unit through the unified memory architecture bus;
transmitting the target information from an interface control space of the central processing unit to the network card through the high-speed serial computer expansion bus standard bus;
and sending the target information by utilizing the network card.
14. The heterogeneous computing system of claim 13, wherein the at least one accelerator comprises a plurality of accelerators connected to one another by the unified memory architecture bus.
15. The heterogeneous computing system of claim 14, wherein the central processing unit comprises a plurality of dies packaged as a single unit, the plurality of accelerators being in one-to-one correspondence with the plurality of dies, each accelerator being connected to a corresponding die via the high speed serial computer expansion bus standard bus and the unified memory architecture bus.
16. The heterogeneous computing system of any of claims 13-15, wherein the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
17. The heterogeneous computing system of any of claims 13-15, wherein the network card comprises an infiniband network card, and the accelerator comprises a graphics processing unit, a depth computation unit, an artificial intelligence computation unit, or a digital signal processing unit.
18. The heterogeneous computing system of any of claims 13 to 15, wherein the accelerator includes accelerator memory, the accelerator memory including high bandwidth memory, double rate memory for graphics, or double rate memory for low power consumption.
19. A heterogeneous computing system comprises a central processing unit, a network card and at least one accelerator, wherein the central processing unit is connected with the at least one accelerator through a high-speed serial computer expansion bus standard bus and a unified memory architecture bus, the central processing unit is connected with the network card through the high-speed serial computer expansion bus standard bus,
the heterogeneous computing system is configured to:
setting an address buffer in a virtual address space of the accelerator, and mapping the address buffer into a physical address space, wherein the physical address space is a physical address space uniformly addressed by the central processing unit and the at least one accelerator under a uniform memory architecture;
receiving target information by using the network card;
transmitting the target information from the network card to an interface control space of the central processing unit through the high-speed serial computer expansion bus standard bus;
and writing the target information into the accelerator corresponding to the target physical address from the interface control space of the central processing unit through the unified memory architecture bus based on the target physical address.
20. The heterogeneous computing system of claim 19, wherein the at least one accelerator comprises a plurality of accelerators connected to one another by the unified memory architecture bus.
21. The heterogeneous computing system of claim 20, wherein the central processing unit comprises a plurality of dies packaged together, the plurality of accelerators being in a one-to-one correspondence with the plurality of dies, each accelerator being connected to a corresponding die through the high speed serial computer expansion bus standard bus and the unified memory architecture bus.
22. The heterogeneous computing system of any of claims 19-21, wherein the unified memory architecture bus comprises an xGMI bus, a Gen-Z bus, a CCIX bus, or a CXL bus.
23. The heterogeneous computing system of any of claims 19-21, wherein the network card comprises an infiniband network card and the accelerator comprises a graphics processing unit, a depth computation unit, an artificial intelligence computation unit, or a digital signal processing unit.
24. The heterogeneous computing system of any of claims 19 to 21, wherein the accelerator includes accelerator memory, the accelerator memory including high bandwidth memory, double rate memory for graphics, or double rate memory for low power consumption.
25. An electronic device comprising the heterogeneous computing system of any of claims 13-24.
CN202011372146.0A 2020-11-30 2020-11-30 Remote direct memory access method, heterogeneous computing system and electronic equipment Active CN112463714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011372146.0A CN112463714B (en) 2020-11-30 2020-11-30 Remote direct memory access method, heterogeneous computing system and electronic equipment

Publications (2)

Publication Number Publication Date
CN112463714A (en) 2021-03-09
CN112463714B (en) 2022-12-16

Family

ID=74805829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011372146.0A Active CN112463714B (en) 2020-11-30 2020-11-30 Remote direct memory access method, heterogeneous computing system and electronic equipment

Country Status (1)

Country Link
CN (1) CN112463714B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312300A (en) * 2021-06-17 2021-08-27 上海天玑科技股份有限公司 Nonvolatile memory caching method integrating data transmission and storage
CN114691385A (en) * 2021-12-10 2022-07-01 全球能源互联网研究院有限公司 Electric power heterogeneous computing system
CN114138702B (en) * 2022-01-29 2022-06-14 阿里云计算有限公司 Computing system, PCI device manager and initialization method thereof
CN114490222B (en) * 2022-02-14 2022-11-15 无锡众星微系统技术有限公司 PCIe P2P system test starting method and device
CN114745325A (en) * 2022-03-28 2022-07-12 合肥边缘智芯科技有限公司 MAC layer data exchange method and system based on PCIe bus
CN114866534B (en) * 2022-04-29 2024-03-15 浪潮电子信息产业股份有限公司 Image processing method, device, equipment and medium
CN115793983B (en) * 2022-12-23 2024-01-30 摩尔线程智能科技(北京)有限责任公司 Addressing method, apparatus, system, computing device and storage medium
CN116185910B (en) * 2023-04-25 2023-07-11 北京壁仞科技开发有限公司 Method, device and medium for accessing device memory and managing device memory

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104471540A (en) * 2012-08-17 2015-03-25 英特尔公司 Memory sharing via a unified memory architecture
WO2020023797A1 (en) * 2018-07-26 2020-01-30 Xilinx, Inc. Unified address space for multiple hardware accelerators using dedicated low latency links

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104094244B (en) * 2011-09-30 2017-05-31 英特尔公司 For the method and apparatus that the direct I/O for system coprocessor is accessed
WO2017189620A1 (en) * 2016-04-25 2017-11-02 Netlist, Inc. Method and apparatus for uniform memory access in a storage cluster
CN111247512B (en) * 2017-10-17 2021-11-09 华为技术有限公司 Computer system for unified memory access
CN109710544B (en) * 2017-10-26 2021-02-09 华为技术有限公司 Memory access method, computer system and processing device
CN111190837A (en) * 2019-08-13 2020-05-22 腾讯科技(深圳)有限公司 Information communication method, device and computer readable storage medium

Also Published As

Publication number Publication date
CN112463714A (en) 2021-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221122

Address after: 610216 building 3, No. 171, hele Second Street, Chengdu high tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan Province

Applicant after: CHENGDU HAIGUANG INTEGRATED CIRCUIT DESIGN Co.,Ltd.

Address before: 300392 North 2-204 industrial incubation-3-8, 18 Haitai West Road, Huayuan Industrial Zone, Tianjin

Applicant before: Haiguang Information Technology Co.,Ltd.

GR01 Patent grant