WO2022143019A1 - 异构计算系统以及相关设备 - Google Patents

异构计算系统以及相关设备 Download PDF

Info

Publication number
WO2022143019A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
memory
read
data
write
Prior art date
Application number
PCT/CN2021/135791
Other languages
English (en)
French (fr)
Inventor
刘晓
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2022143019A1 publication Critical patent/WO2022143019A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1668 Details of memory controller
    • G06F 13/1673 Details of memory controller using buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F 13/4204 Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
    • G06F 13/4221 Bus transfer protocol on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
    • G06F 13/4226 Bus transfer protocol on a parallel input/output bus with asynchronous protocol

Definitions

  • the present application relates to the field of computers, and in particular, to a heterogeneous computing system and related equipment.
  • Heterogeneous computing mainly refers to a computing approach in which computing units with different types of instruction sets and architectures are combined into one system. Heterogeneous computing has become ubiquitous: supercomputing systems, desktops, clouds, and terminals all contain computing units with different instruction sets and architectures. The advantages of heterogeneous computing mainly lie in metrics such as performance, cost-effectiveness, power consumption, and area; in specific scenarios, heterogeneous computing often exhibits remarkable computing advantages.
  • the present application proposes a heterogeneous computing system and related devices, which can improve system performance.
  • a heterogeneous computing system, including: a first processor, a second processor, a memory controller, and a first memory, where the first processor is connected to the memory controller through a high-speed bus, the second processor is connected to the memory controller through a high-speed bus, the memory controller is connected to the first memory through an interface, the first processor and the second processor are heterogeneous, the first processor can perform read and write operations on at least a part of the first memory, and the second processor can also perform read and write operations on at least a part of the first memory.
  • the first computing unit and the second computing unit can exchange data through the first memory rather than over a high-speed bus, which greatly improves system performance.
  • the first processor and the second processor employ different instruction sets.
  • the first processor and the second processor employ different microarchitectures.
  • the first processor includes more arithmetic units than the second processor.
  • the first processor is a central processing unit (CPU)
  • the second processor includes one or more of a graphics processor GPU, an artificial intelligence AI chip, and an encoding chip.
  • the first memory includes a first memory part
  • the first processor has read and write permissions to the first memory part
  • the second processor also has read and write permissions to the first memory part.
  • the first processor is configured to operate on input data to obtain first data and write the first data to the first memory part through the memory controller; the first memory is configured to store the first data in the first memory part; the second processor is configured to read the first data from the first memory and operate on the first data to obtain second data; and the first memory is configured to store the second data in the first memory part.
  • the first memory further includes a second memory part; the first processor has read and write permissions to the second memory part, and the second processor has read-only permission to the second memory part.
  • the first memory further includes a third memory part and a fourth memory part; the first processor has read and write permissions to the third memory part, and the second processor does not have read and write permissions to the third memory part; the first processor does not have read and write permissions to the fourth memory part, and the second processor has read and write permissions to the fourth memory part.
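The four kinds of partition permissions described above (shared read/write, read-only for the second processor, first-processor-only, and second-processor-only) can be sketched as a small access-control table. This is an illustrative model only; the partition names and processor labels below are made up for the sketch and do not come from the patent.

```python
# Hypothetical sketch of the per-partition permissions described above.
# Maps partition -> {processor class: set of allowed operations}.
PERMISSIONS = {
    "part1": {"first": {"read", "write"}, "second": {"read", "write"}},  # shared read/write
    "part2": {"first": {"read", "write"}, "second": {"read"}},           # second is read-only
    "part3": {"first": {"read", "write"}, "second": set()},              # first-processor only
    "part4": {"first": set(),            "second": {"read", "write"}},   # second-processor only
}

def check_access(processor: str, partition: str, op: str) -> bool:
    """Return True if the given processor class may perform op on the partition."""
    return op in PERMISSIONS[partition][processor]
```

Under this model, only "part1" must maintain strict memory consistency, since it is the only partition both processor classes can write.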
  • the system further includes a second memory; the first memory is connected to the first processor, and the second memory can be accessed by the first processor but cannot be accessed by the second processor.
  • the first processor and the second processor form a symmetric multiprocessing system.
  • a computing device including the heterogeneous computing system according to any one of the first aspects.
  • FIG. 1 is a schematic structural diagram of a heterogeneous computing system proposed by the present application.
  • FIG. 2 is a schematic diagram comparing a first processor involved in the present application with a heterogeneous acceleration module.
  • FIG. 3 is a schematic diagram of viewing a target scene from different angles provided by the present application.
  • FIG. 4 is a schematic diagram of a rasterization rendering process performed by a heterogeneous computing system involved in the present application.
  • FIG. 5 is a schematic structural diagram of a heterogeneous computing system provided by the present application.
  • FIG. 6 is a schematic structural diagram of another heterogeneous computing system provided by the present application.
  • FIGS. 7A-7C are schematic structural diagrams of some heterogeneous computing systems provided by the present application.
  • FIGS. 8A-8C are schematic diagrams of read and write permissions of some first memories provided by the present application.
  • FIG. 1 is a schematic structural diagram of a heterogeneous computing system proposed by the present application.
  • the heterogeneous computing system includes: multiple processors, and the multiple processors include: one or more first processors 110 and one or more second processors 120 .
  • the heterogeneous computing system also includes a plurality of memories 130 .
  • the first processor 110 and the second processor 120 may be connected through a bus, for example, peripheral component interconnect express (PCIE).
  • the first processor 110 may be connected to the memory 130 through PCIE.
  • the second processors 120 may be interconnected through PCIE or a high-speed bus.
  • a heterogeneous architecture is formed between the first processor 110 and the second processor 120 .
  • the meaning of the heterogeneous architecture may include: the instruction sets or microarchitectures of the first computing unit and the second computing unit are different.
  • the instruction sets and microarchitectures of the first computing unit and the second computing unit are both different.
  • the first computing unit has more arithmetic units than the second computing unit, the controller of the first computing unit is more powerful than that of the second computing unit, and the storage space of the first computing unit is larger than the storage space of the second computing unit.
  • the first processor 110 and the second processor 120 may specifically be:
  • the first processor 110 is usually dominant in a heterogeneous computing system, and is responsible for coordinating various computing tasks.
  • the first processor 110 may generally be a central processing unit (central processing unit, CPU) or the like.
  • the CPU includes an arithmetic logic unit (ALU), a control unit (CU), and a cache.
  • the number of ALUs in the CPU is small, but they have powerful logic operation capabilities; the controller is relatively powerful and can implement complex data control and data forwarding.
  • the storage space of the cache is large enough to store completed computation results, or data that will be needed immediately afterwards.
  • the ALU in the CPU may occupy 25% of the hardware resources
  • the CU may occupy 25% of the hardware resources
  • the cache may occupy 50% of the hardware resources.
  • the second processor 120 is usually in a subordinate position in a heterogeneous computing system and is responsible for executing large volumes of simple computing tasks.
  • the second processor 120 may include a graphics processing unit (GPU), a digital signal processor (DSP), an artificial intelligence (AI) chip, or a codec chip.
  • the second processor 120 may include one or more of a GPU, an AI chip, a codec chip, and the like.
  • the GPU includes an ALU, a CU, and a cache.
  • the number of ALUs in the GPU is very large, but they can only perform simple logic operations.
  • the controller is relatively weak, mainly responsible for merging and forwarding data, and the storage space of the cache is relatively small.
  • the ALU in the GPU may occupy 90% of the hardware resources, the CU may occupy 5% of the hardware resources, and the cache may occupy 5% of the hardware resources.
  • the first processor 110 has powerful logic capabilities and excels at computing tasks with complex computation steps and complex data dependencies, while the second processor 120 has little and simple logic but extremely high operation speed.
  • the heterogeneous computing system obtained by combining the first processor 110 and the second processor 120 can effectively meet the requirements of the business for multiple computing power, and improve the computing density of the first processor 110 .
  • the first processor 110 can coordinate and schedule the image rendering task to the second processor 120 for execution, and utilize the highly parallelized vector computing capability and special texture processing capability of the second processor 120 , which greatly improves offline and real-time rendering speed, and produces high-quality, photorealistic rendered images.
  • in the deep learning field, the second processor 120 is used to parallelize tensor operations and quickly complete matrix multiply-accumulate, achieving speedups of dozens of times over the first processor 110.
  • a separate memory 130 is usually provided for each first processor 110 and each second processor 120 , that is, the heterogeneous computing system adopts a hardware architecture of memory separation.
  • the first processor 1 has a separate memory 1
  • the first processor 2 has a separate memory 2
  • the second processor 1 has a separate memory 3
  • the second processor 2 has a separate memory 4.
  • the second processor 3 has a separate memory 5.
  • the memory space allocated to each processor is limited. As processor performance keeps improving, insufficient processor memory will become the bottleneck of the heterogeneous computing system.
  • for example, if the local memory of a processor is at most 32 GB and the heterogeneous computing system can integrate at most 16 processors, the distributed locally aggregated memory of the heterogeneous computing system is at most 512 GB, and larger data cannot be loaded into local memory during actual computation. In addition, frequent data movement between the second processors 120, and between the second processors 120 and the first processor 110, prevents the performance of the heterogeneous computing system from being fully exploited. Moreover, if there are dependencies among the data of the second processors 120, or between the second processors 120 and the first processor 110, a large number of intermediate computation results or parameter information must be synchronized between the second processors 120, wasting computing resources.
  • the target scene includes a light source and a three-dimensional model.
  • the light produced by the light source is projected on the 3D model.
  • suppose the target scene is as shown in the upper part of FIG. 3.
  • the rendered image that needs to be generated for the first user is shown on the left of FIG. 3.
  • the rendered image that needs to be generated for the second user is shown on the right of FIG. 3.
  • the first user and the second user can use the resources of the heterogeneous computing system to render the target scene, so as to obtain rendered images from different angles.
  • the process of rasterization rendering through heterogeneous computing systems can be as follows:
  • after the first processor receives the first rendering request from the first user, it schedules, according to the first rendering request, the image rendering pipeline 1 in the second processor 1 to rasterize the target scene from the perspective of the first user, obtaining a rendered image of the target scene generated from the first user's perspective.
  • the first rendering request indicates the first viewing angle and scene information of the target scene.
  • after the first processor receives the second rendering request sent by the second user, it schedules, according to the second rendering request, the image rendering pipeline 2 in the second processor 2 to rasterize the target scene from the perspective of the second user, obtaining a rendered image of the target scene generated from the second user's perspective.
  • the second rendering request indicates the second viewing angle and scene information of the target scene.
  • the first processor needs to synchronize the scene information of the target scene from its own memory to the memory of the second processor 1 and the memory of the second processor 2, and the memory of each second processor must store the scene information of the same target scene. This wastes a large amount of memory resources in existing heterogeneous computing systems, and because the bus performance between the first processor and the second processors can hardly meet the requirement of transmitting the scene information of the target scene between them, the real-time performance of image rendering is greatly reduced.
  • the present application provides a heterogeneous computing system in which the processors can share the same memory space, providing a centralized memory capacity that is an order of magnitude larger, and allowing the same data to be accessed by multiple processors, which avoids the problem of the data to be stored being too large.
  • FIG. 5 is a schematic structural diagram of a heterogeneous computing system provided by the present application.
  • the heterogeneous computing system includes: a plurality of processors 210 , a memory controller 230 and a first memory 240 .
  • the processor 210 is connected to the memory controller 230 through the high-speed bus 220, and the memory controller 230 is connected to the first memory 240 through an interface.
  • the plurality of processors 210 include a first processor and a second processor.
  • the first processor and the second processor constitute a heterogeneous architecture.
  • the memory controller 230 is a bus circuit controller for managing and planning the transmission speed between the processor 210 and the first memory 240 .
  • the memory controller 230 determines parameters such as the maximum memory capacity, memory type and speed, and memory-chip data depth and data width that the heterogeneous computing system can use; that is, it determines the memory performance of the heterogeneous computing system and therefore has a significant impact on the overall performance of the heterogeneous computing system.
  • the first memory 240 is a storage space for temporarily storing programs and data, that is, the first memory 240 is used to temporarily store operation data in the CPU and data exchanged with an external memory such as a hard disk.
  • the first memory 240 generally adopts a semiconductor storage unit, including a random access memory (RAM), a read only memory (ROM), and a cache (cache).
  • the first memory is composed of memory chips, circuit boards, memory particles and other parts.
  • the processor 210 and the memory controller 230 are interconnected through a high-speed bus 220, such as Gen-Z, CCIX, or a custom high-speed bus.
  • a symmetric multiprocessing (SMP) system may be formed among the processors 210; that is, there is no master-slave or subordinate relationship between the processors 210, and the processors 210 share the same bus, memory, and I/O devices. Therefore, any processor 210 can, as a master device, access the first memory 240 through the memory controller 230, and the time required for any processor 210 to access any address in the first memory 240 is the same.
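As a rough illustration of this symmetric arrangement (a sketch under the assumption that bus arbitration can be modeled as a single lock, not the patent's actual hardware), several worker threads standing in for heterogeneous processors can go through one shared controller object with no master/slave distinction:

```python
import threading

class MemoryController:
    """Toy stand-in for memory controller 230: serializes access to one shared memory."""
    def __init__(self, size: int):
        self._mem = [0] * size
        self._lock = threading.Lock()  # models bus arbitration between processors

    def write(self, addr: int, value: int) -> None:
        with self._lock:
            self._mem[addr] = value

    def read(self, addr: int) -> int:
        with self._lock:
            return self._mem[addr]

ctrl = MemoryController(size=8)

def worker(pid: int) -> None:
    # Every processor, first or second, goes through the same controller as a master.
    ctrl.write(pid, pid * 10)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all workers join, each address holds the value its "processor" wrote, with every access having gone through the one controller.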
  • the number of processors 210 may also be increased by setting the switch unit 260 .
  • a switch unit 260 is added to the heterogeneous computing system, and multiple processors 210 may be arranged on the switch unit 260 .
  • the processor 210 accesses the first memory 240 through the switch unit 260 and then through the memory controller 230 .
  • in FIG. 6, the processors 210 added above the switch unit 260 are second processors as an example.
  • in practice, the processors added above the switch unit 260 may also include first processors, which is not specifically limited here.
  • the heterogeneous computing system may include the following three specific implementation manners:
  • the first processor 1 can access the first memory 1 through the memory controller 1, the first processor 1 can access the first memory 2 through the memory controller 2, and the first processor 1 can access the first memory 3 through the memory controller 3.
  • the second processor 1 can access the first memory 1 through the memory controller 1 , the second processor 1 can access the first memory 2 through the memory controller 2 , and the second processor 1 can access the first memory 3 through the memory controller 3 .
  • the second processor 2 can access the first memory 1 through the memory controller 1 , the second processor 2 can access the first memory 2 through the memory controller 2 , and the second processor 2 can access the first memory 3 through the memory controller 3 . That is to say, the first processor 1 , the second processor 1 and the second processor 2 can share the first memories 1 - 3 through the memory controllers 1 - 3 .
  • some of the second processors exclusively occupy the second memory 250.
  • the first processor can access the first memory 1 through the memory controller 1
  • the first processor can access the first memory 2 through the memory controller 2 .
  • the second processor 1 can access the first memory 1 through the memory controller 1
  • the second processor 1 can access the first memory 2 through the memory controller 2 .
  • the second processor 2 exclusively occupies the second memory 1; that is, the second processor 2 can access the second memory 1, but none of the first processor, the second processor 1, and the second processor 3 can access the second memory 1.
  • the second processor 3 exclusively occupies the second memory 2; that is, the second processor 3 can access the second memory 2, but none of the first processor, the second processor 1, and the second processor 2 can access the second memory 2.
  • some of the first processors exclusively occupy the second memory 250.
  • the first processor 1 can access the first memory 1 through the memory controller 1
  • the first processor 1 can access the first memory 2 through the memory controller 2 .
  • the second processor can access the first memory 1 through the memory controller 1
  • the second processor can access the first memory 2 through the memory controller 2 .
  • the first processor 2 exclusively occupies the second memory 3; that is, the first processor 2 can access the second memory 3, but none of the first processor 1, the second processor, and the first processor 3 can access the second memory 3.
  • the first processor 3 exclusively occupies the second memory 4; that is, the first processor 3 can access the second memory 4, but none of the first processor 1, the second processor, and the first processor 2 can access the second memory 4.
  • weak bidirectional memory consistency is implemented between the first processor and the second processor.
  • the weak bidirectional memory consistency means that modifications to the first memory by the first processor can be perceived by the second processor, and modifications to the first memory by the second processor can also be perceived by the first processor, but the memory consistency guarantee is relatively weak. Specific implementations can include the following three:
  • the first processor divides the first memory into a first memory part and a second memory part. That is, the first memory may include a first memory part and a second memory part. Wherein, both the first processor and the second processor can read and write the first memory part, therefore, the first memory part needs to ensure strict memory consistency.
  • the first processor can read and write the second memory part, and the second processor can only read the second memory part, therefore, the second memory part does not need to guarantee strict memory consistency.
  • the first processor divides the first memory into a first memory part, a second memory part and a third memory part. That is, the first memory may include a first memory part, a second memory part and a third memory part. Wherein, both the first processor and the second processor can read and write the first memory part, therefore, the first memory part needs to ensure strict memory consistency.
  • the first processor can read and write the second memory portion, and the second processor can only read the second memory portion. Therefore, the second memory portion does not need to guarantee strict memory consistency.
  • the first processor 1 and the first processor 2 can read and write the third memory part, and neither the second processor 1 nor the second processor 2 can read or write the third memory part.
  • the first processor divides the first memory into a first memory part, a second memory part, a third memory part, and a fourth memory part. That is, the first memory may include a first memory part, a second memory part, a third memory part, and a fourth memory part. Both the first processor and the second processor can read and write the first memory part; therefore, the first memory part needs to ensure strict memory consistency.
  • the first processor can read and write the second memory portion, and the second processor can only read the second memory portion. Therefore, the second memory portion does not need to guarantee strict memory consistency.
  • the first processor 1 and the first processor 2 can read and write the third memory part, and neither the second processor 1 nor the second processor 2 can read or write the third memory part.
  • the second processor 1 and the second processor 2 can read and write the fourth memory part, and neither the first processor 1 nor the first processor 2 can read or write the fourth memory part.
  • the first processor is configured to operate on input data to obtain first data and write the first data to the first memory part through the memory controller; the first memory is configured to store the first data in the first memory part; the second processor is configured to read the first data from the first memory and operate on the first data to obtain second data; and the first memory is configured to store the second data in the first memory part.
  • the processor includes a first processor and a second processor, the first processor includes a CPU, and the second processor includes a GPU, an AI chip, and a codec chip.
  • the CPU writes the to-be-rendered data to the address A of the first memory part of the first memory
  • the GPU reads the to-be-rendered data from the address A of the first memory part of the first memory for rendering to obtain the rendering data, and writes the rendering data to Address B of the first memory portion of the first memory.
  • the AI chip reads the rendering data from the address B of the first memory part of the first memory, performs super-resolution processing to obtain the super-resolution data, and writes it into the address C of the first memory part of the first memory.
  • the codec chip reads the super-resolution data from the address C of the first memory, completes H.264/H.265 encoding to obtain the encoded data, and writes it into the address D of the first memory part of the first memory.
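The four-stage flow above (CPU, GPU, AI chip, codec, handing results off through addresses A-D of the shared first memory part) can be sketched as follows. The stage functions are placeholders that merely tag the data as it passes through; they stand in for the actual rendering, super-resolution, and H.264/H.265 encoding steps, which the patent does not specify in code.

```python
# Shared first memory modeled as a dict keyed by address; every stage reads its
# input in place, so no data is copied between per-processor memories.
shared_memory = {}

def cpu_stage(input_data: str) -> None:
    shared_memory["A"] = f"to_render({input_data})"   # CPU writes to-be-rendered data to address A

def gpu_stage() -> None:
    data = shared_memory["A"]                         # GPU reads address A
    shared_memory["B"] = f"rendered({data})"          # writes rendering result to address B

def ai_stage() -> None:
    data = shared_memory["B"]                         # AI chip reads address B
    shared_memory["C"] = f"super_res({data})"         # writes super-resolution result to address C

def codec_stage() -> None:
    data = shared_memory["C"]                         # codec reads address C
    shared_memory["D"] = f"encoded({data})"           # writes encoded stream to address D

cpu_stage("scene")
gpu_stage()
ai_stage()
codec_stage()
```

Each hand-off is a read of the previous stage's output at its shared address, which is the point of the shared first memory: no stage needs a bus transfer or a private copy of the intermediate data.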
  • before starting the heterogeneous computing system, the heterogeneous computing system needs to be configured, for example, page table management, interleaving methods, and weak bidirectional memory consistency. Specifically, weak cache consistency within the heterogeneous system (with MOESI strong cache consistency between homogeneous systems) is realized through the operating system's page-table access permissions and data-lock protection settings, together with a fast synchronization mechanism between hardware modules within a computing unit, trading off implementation cost against performance.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (SSDs)), among others.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present application provides a heterogeneous computing system and related devices. The system includes: a first processor, a second processor, a memory controller, and a first memory, where the first processor is connected to the memory controller through a high-speed bus, the second processor is connected to the memory controller through a high-speed bus, the memory controller is connected to the first memory through an interface, the first processor and the second processor are heterogeneous, the first processor can perform read and write operations on at least a part of the first memory, and the second processor can also perform read and write operations on at least a part of the first memory.

Description

Heterogeneous Computing System and Related Devices
This application claims priority to the Chinese patent application No. 202011641741.X, entitled "Heterogeneous Computing System", filed with the China National Intellectual Property Administration on December 31, 2020, the entire contents of which are incorporated herein by reference. This application also claims priority to the Chinese patent application No. 202110559382.1, entitled "Heterogeneous Computing System and Related Devices", filed with the China National Intellectual Property Administration on May 21, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computers, and in particular to a heterogeneous computing system and related devices.
Background
Heterogeneous computing mainly refers to a computing approach in which computing units with different types of instruction sets and architectures are combined into one system. Heterogeneous computing has become ubiquitous: supercomputing systems, desktops, clouds, and terminals all contain computing units with different instruction sets and architectures. The advantages of heterogeneous computing mainly lie in metrics such as performance, cost-effectiveness, power consumption, and area; in specific scenarios, heterogeneous computing often exhibits remarkable computing advantages.
However, the architecture of existing systems that employ heterogeneous computing limits the performance the system can deliver.
Summary
To solve the above problem, the present application proposes a heterogeneous computing system and related devices that can improve system performance.
In a first aspect, a heterogeneous computing system is provided, including: a first processor, a second processor, a memory controller, and a first memory, where the first processor is connected to the memory controller through a high-speed bus, the second processor is connected to the memory controller through a high-speed bus, the memory controller is connected to the first memory through an interface, the first processor and the second processor are heterogeneous, the first processor can perform read and write operations on at least a part of the first memory, and the second processor can also perform read and write operations on at least a part of the first memory.
In the above solution, the first computing unit and the second computing unit can exchange data through the first memory rather than over a high-speed bus, which greatly improves system performance.
In some possible designs, the first processor and the second processor use different instruction sets.
In some possible designs, the first processor and the second processor use different microarchitectures.
In some possible designs, the first processor includes more arithmetic units than the second processor.
In some possible designs, the first processor is a central processing unit (CPU), and the second processor includes one or more of a graphics processing unit (GPU), an artificial intelligence (AI) chip, and an encoding chip.
In some possible designs, the first memory includes a first memory part; the first processor has read and write permission to the first memory part, and the second processor also has read and write permission to the first memory part.
In some possible designs, the first processor is configured to operate on input data to obtain first data and write the first data to the first memory part through the memory controller; the first memory is configured to store the first data in the first memory part; the second processor is configured to read the first data from the first memory and operate on the first data to obtain second data; and the first memory is configured to store the second data in the first memory part.
In some possible designs, the first memory further includes a second memory part; the first processor has read and write permission to the second memory part, and the second processor has read-only permission to the second memory part.
In some possible designs, the first memory further includes a third memory part and a fourth memory part; the first processor has read and write permission to the third memory part, and the second processor does not have read and write permission to the third memory part; the first processor does not have read and write permission to the fourth memory part, and the second processor has read and write permission to the fourth memory part.
In some possible designs, the system further includes a second memory; the first memory is connected to the first processor, and the second memory can be accessed by the first processor but cannot be accessed by the second processor.
In some possible designs, the first processor and the second processor form a symmetric multiprocessing system.
In a second aspect, a computing device is provided, including the heterogeneous computing system according to any implementation of the first aspect.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the background more clearly, the drawings needed in the embodiments or the background are described below.
Fig. 1 is a schematic structural diagram of a heterogeneous computing system proposed in this application;
Fig. 2 is a schematic comparison of the first processor and a heterogeneous acceleration module involved in this application;
Fig. 3 is a schematic diagram, provided in this application, of viewing a target scene from different angles;
Fig. 4 is a schematic diagram of a rasterization rendering process performed by the heterogeneous computing system involved in this application;
Fig. 5 is a schematic structural diagram of a heterogeneous computing system provided in this application;
Fig. 6 is a schematic structural diagram of another heterogeneous computing system provided in this application;
Figs. 7A-7C are schematic structural diagrams of some heterogeneous computing systems provided in this application;
Figs. 8A-8C are schematic diagrams of the read-write permissions of some first memories provided in this application.
Detailed Description
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of a heterogeneous computing system proposed in this application. As shown in Fig. 1, the heterogeneous computing system includes multiple processors, the multiple processors including one or more first processors 110 and one or more second processors 120. The heterogeneous computing system further includes multiple memories 130. A first processor 110 and a second processor 120 can be connected via a bus, for example, peripheral component interconnect express (PCIe). A first processor 110 can be connected to a memory 130 via PCIe. The second processors 120 can be interconnected via PCIe or a high-speed bus.
In a specific implementation, the first processor 110 and the second processor 120 form a heterogeneous architecture. Specifically, a heterogeneous architecture may mean: the first processor and the second processor differ in instruction set or microarchitecture; or they differ in both instruction set and microarchitecture; or the first processor has more arithmetic units than the second processor, the controller of the first processor is more capable than that of the second processor, and the storage space of the first processor is larger than that of the second processor.
In a specific implementation, the first processor 110 and the second processor 120 may specifically be as follows.
The first processor 110 usually plays the leading role in the heterogeneous computing system and is responsible for orchestrating and coordinating the various computing tasks. The first processor 110 may typically be a central processing unit (CPU). As shown on the left of Fig. 2, taking a CPU as an example of the first processor 110, the CPU includes arithmetic and logic units (ALUs), a control unit (CU), and a cache. The CPU has few ALUs, but they have strong logic capability; its control unit is powerful and can implement complex data control and data forwarding; its cache is large enough to hold computed results or data about to be used. In a specific embodiment, the ALUs may occupy 25% of the CPU's hardware resources, the CU 25%, and the cache 50%.
The second processor 120 usually plays a subordinate role in the heterogeneous computing system and is responsible for executing large volumes of simple computing tasks. The second processor 120 may include a graphics processing unit (GPU), a digital signal processor (DSP), an artificial intelligence (AI) chip, or a codec chip. In a specific embodiment, in a rendering scenario, the second processor 120 may include one or more of a GPU, an AI chip, a codec chip, and the like. As shown on the right of Fig. 2, taking a GPU as an example of the second processor 120, the GPU also includes ALUs, a CU, and a cache. The GPU has a very large number of ALUs, but they can perform only simple logic operations; its control unit is relatively weak and mainly merges and forwards data; its cache is also relatively small. In a specific embodiment, the ALUs may occupy 90% of the GPU's hardware resources, the CU 5%, and the cache 5%.
Therefore, the first processor 110 has strong logic capability and excels at computing tasks with complex computation steps and complex data dependencies, while the second processor 120 has little and simple logic but extremely high computation speed. A heterogeneous computing system combining the first processor 110 and the second processor 120 can effectively meet the demand of workloads for diverse computing power and increase the computing density of the first processor 110. For example, when the heterogeneous computing system is used for image rendering, the first processor 110 can schedule image rendering tasks to the second processor 120 and exploit the second processor's highly parallel vector operations and special texture-processing capability, greatly accelerating offline and real-time rendering and producing high-quality, realistic rendered images. In addition, in the deep learning field, using the second processor 120 for parallel tensor operations completes matrix multiply-accumulate quickly, achieving speedups of tens of times relative to the first processor 110.
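Purely as a non-limiting illustration of this division of labor (nothing below appears in this application; the accelerators are simulated with ordinary worker threads, and all identifiers are invented for the sketch), the orchestration role of the first processor and the data-parallel role of the second processors can be sketched as:

```python
# Sketch only: a "host" (first-processor role) splits a data-parallel task
# across several "accelerators" (second-processor role) and gathers results.
from concurrent.futures import ThreadPoolExecutor

def accelerator_kernel(chunk):
    """Simple, highly parallel work: elementwise square (stand-in for a GPU kernel)."""
    return [x * x for x in chunk]

def host_schedule(data, num_accelerators=4):
    """Host orchestrates: partition the input, dispatch chunks, merge outputs."""
    size = (len(data) + num_accelerators - 1) // num_accelerators
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_accelerators) as pool:
        results = pool.map(accelerator_kernel, chunks)
    return [y for part in results for y in part]

print(host_schedule(list(range(8))))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The sketch only shows the scheduling pattern; it says nothing about the bus or memory architecture, which the embodiments below address.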
To increase data access speed, a separate memory 130 is usually provided for each first processor 110 and each second processor 120; that is, the heterogeneous computing system adopts a hardware architecture with separated memories. For example, as shown in Fig. 1, first processor 1 has its own memory 1, first processor 2 has its own memory 2, second processor 1 has its own memory 3, second processor 2 has its own memory 4, and second processor 3 has its own memory 5. However, because the total area of the heterogeneous computing system is limited and each processor (first processors and second processors alike) needs its own memory, the memory space allocated to each processor is limited. As processor performance keeps improving, insufficient per-processor memory becomes a bottleneck of the heterogeneous computing system. For example, if a processor's local memory is at most 32 GB and the system can integrate at most 16 processors, the system's distributed, locally aggregated memory is at most 512 GB, so larger datasets cannot be loaded into local memory during computation. In addition, frequent data movement between second processors, and between second processors and first processors, prevents the system's performance from being fully realized; moreover, if data dependencies exist between second processors, or between second processors and first processors, large numbers of intermediate results or parameters must be synchronized between the second processors, wasting computing resources.
In a specific embodiment, in a multi-user scenario, to give every user an immersive sense of realism, different users often need rendered images of the same target scene generated from different viewing angles. The target scene includes a light source and a three-dimensional model; rays from the light source are cast onto the three-dimensional model. Taking Fig. 3 as an example, assume the target scene is as shown at the top of Fig. 3: when a first user observes from a first viewing angle, the rendered image to be generated is as shown on the left of Fig. 3; when a second user observes from a second viewing angle, the rendered image to be generated is as shown on the right of Fig. 3. The first user and the second user can use the resources of the heterogeneous computing system to render the target scene and obtain rendered images from different angles. Taking rasterization rendering as an example, as shown in Fig. 4, the process of rasterization rendering by the heterogeneous computing system may be as follows:
After the first processor receives a first rendering request from the first user, it schedules, according to the first rendering request, image rendering pipeline 1 in second processor 1 to rasterize the target scene from the first user's viewing angle, obtaining a rendered image of the target scene generated from that viewing angle. The first rendering request indicates the first viewing angle and the scene information of the target scene.
After the first processor receives a second rendering request from the second user, it schedules, according to the second rendering request, image rendering pipeline 2 in second processor 2 to rasterize the target scene from the second user's viewing angle, obtaining a rendered image of the target scene generated from that viewing angle. The second rendering request indicates the second viewing angle and the scene information of the target scene.
To complete the above rasterization rendering, the first processor must synchronize the scene information of the target scene from its own memory to the memory of second processor 1 and the memory of second processor 2, and each second processor's memory must store an identical copy of the scene information. This wastes a large amount of memory in existing heterogeneous computing systems; furthermore, because the performance of the bus between the first processor and the second processors cannot meet the demands of transferring the scene information, the real-time performance of image rendering is greatly reduced.
To solve the above problems, this application provides a heterogeneous computing system in which processors can share the same memory space, providing centralized memory capacity that is orders of magnitude larger; moreover, the same data can be accessed by multiple processors, avoiding the problem of excessively large data having to be stored repeatedly.
Referring to Fig. 5, Fig. 5 is a schematic structural diagram of a heterogeneous computing system provided in this application. As shown in Fig. 5, the heterogeneous computing system includes multiple processors 210, a memory controller 230, and a first memory 240. The processors 210 are connected to the memory controller 230 via a high-speed bus 220, and the memory controller 230 is connected to the first memory 240 via an interface.
The multiple processors 210 include first processors and second processors, where the first processors and the second processors form a heterogeneous architecture.
The memory controller 230 is a bus circuit controller that manages and regulates data transfer from the processors 210 to the first memory 240. The memory controller 230 determines parameters such as the maximum memory capacity, memory type and speed, and memory chip data depth and data width that the heterogeneous computing system can use; in other words, it determines the system's memory performance and therefore has a considerable impact on the system's overall performance.
The first memory 240 is storage space that temporarily holds programs and data; that is, the first memory 240 temporarily stores operands for the CPU and data exchanged with external storage such as hard disks. The first memory 240 generally uses semiconductor storage cells, including random access memory (RAM), read-only memory (ROM), and cache, and is composed of memory chips, a circuit board, memory granules, and other parts.
In a specific embodiment of this application, the processors 210 and the memory controller 230 are interconnected via a high-speed bus 220, for example, Gen-Z, CCIX, or a custom high-speed bus.
In a specific embodiment of this application, the processors 210 may form a symmetric multiprocessing (SMP) system; that is, there is no master-subordinate relationship among the processors 210, and the processors 210 share the same bus, memory, and I/O devices. Therefore, any processor 210 can act as a master and access the first memory 240 through the memory controller 230, and any processor 210 takes the same time to access any address in the first memory 240.
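As a non-limiting sketch of this symmetric access pattern (all names below are invented for illustration; a lock merely stands in for bus arbitration and is not part of the claimed system), several "processors" modeled as threads all address the same shared memory through one controller object:

```python
# Sketch only: every processor, regardless of kind, reaches the same
# addresses of one shared memory through the same controller.
import threading

class MemoryController:
    """Toy memory controller: serializes access to a single shared memory."""
    def __init__(self, size):
        self.memory = [0] * size      # the shared "first memory"
        self.lock = threading.Lock()  # stand-in for bus arbitration

    def write(self, addr, value):
        with self.lock:
            self.memory[addr] = value

    def read(self, addr):
        with self.lock:
            return self.memory[addr]

controller = MemoryController(size=8)

def processor(pid):
    # Any processor can act as a master and write any address it may access.
    controller.write(pid, pid * 10)

threads = [threading.Thread(target=processor, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(controller.memory[:4])  # [0, 10, 20, 30]
```

The uniform-latency property of the SMP arrangement is, of course, a hardware characteristic that this software sketch cannot exhibit; the sketch only illustrates the shared address space.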
In a specific embodiment of this application, the number of processors 210 in the heterogeneous computing system can also be increased by providing a switch unit 260. As shown in Fig. 6, a switch unit 260 is added to the heterogeneous computing system, and multiple processors 210 can be attached above the switch unit 260. These processors 210 then access the first memory 240 through the switch unit 260 and the memory controller 230. It should be understood that Fig. 6 illustrates the case where the processors 210 added above the switch unit 260 are all second processors; in practice, however, the processors added above the switch unit 260 may also include first processors, which is not specifically limited here.
In specific embodiments of this application, the heterogeneous computing system may include the following three specific implementations:
In the first implementation, all processors 210 share all first memories 240. In this case, the first processor 211 and the second processors 212 can all share the same first memory. For example, as shown in Fig. 7A, first processor 1 can access first memory 1 through memory controller 1, first memory 2 through memory controller 2, and first memory 3 through memory controller 3; second processor 1 can likewise access first memories 1-3 through memory controllers 1-3, and so can second processor 2. That is, first processor 1, second processor 1, and second processor 2 share first memories 1-3 through memory controllers 1-3.
In the second implementation, some second processors exclusively own second memories 250. For example, as shown in Fig. 7B, the first processor can access first memory 1 through memory controller 1 and first memory 2 through memory controller 2; second processor 1 can do the same. Second processor 2 exclusively owns second memory 1: second processor 2 can access second memory 1, but the first processor, second processor 1, and second processor 3 cannot. Second processor 3 exclusively owns second memory 2: second processor 3 can access second memory 2, but the first processor, second processor 1, and second processor 2 cannot.
In the third implementation, some first processors exclusively own second memories 250. For example, as shown in Fig. 7C, first processor 1 can access first memory 1 through memory controller 1 and first memory 2 through memory controller 2; the second processor can do the same. First processor 2 exclusively owns second memory 3: first processor 2 can access second memory 3, but first processor 1, the second processor, and first processor 3 cannot. First processor 3 exclusively owns second memory 4: first processor 3 can access second memory 4, but first processor 1, the second processor, and first processor 2 cannot.
In a specific embodiment of this application, weak bidirectional memory coherence is implemented between the first processor and the second processor. Weak bidirectional memory coherence means that modifications made by the first processor to the first memory can be observed by the second processor, and modifications made by the second processor to the first memory can also be observed by the first processor, but the coherence guarantees are relatively weak. Specific implementations may include the following three:
In the first implementation, as shown in Fig. 8A, the first processor divides the first memory into a first memory portion and a second memory portion; that is, the first memory may include a first memory portion and a second memory portion. Both the first processor and the second processor can read and write the first memory portion, so the first memory portion must guarantee strict memory coherence. The first processor can read and write the second memory portion, while the second processor can only read it, so the second memory portion need not guarantee strict memory coherence.
In the second implementation, as shown in Fig. 8B, the first processor divides the first memory into a first memory portion, a second memory portion, and a third memory portion; that is, the first memory may include a first memory portion, a second memory portion, and a third memory portion. Both the first processor and the second processor can read and write the first memory portion, so the first memory portion must guarantee strict memory coherence. The first processor can read and write the second memory portion, while the second processor can only read it, so the second memory portion need not guarantee strict memory coherence. First processor 1 and first processor 2 can read and write the third memory portion, while second processor 1 and second processor 2 cannot.
In the third implementation, as shown in Fig. 8C, the first processor divides the first memory into a first memory portion, a second memory portion, a third memory portion, and a fourth memory portion; that is, the first memory may include a first memory portion, a second memory portion, a third memory portion, and a fourth memory portion. Both the first processor and the second processor can read and write the first memory portion, so the first memory portion must guarantee strict memory coherence. The first processor can read and write the second memory portion, while the second processor can only read it, so the second memory portion need not guarantee strict memory coherence. First processor 1 and first processor 2 can read and write the third memory portion, while second processor 1 and second processor 2 cannot. Second processor 1 and second processor 2 can read and write the fourth memory portion, while first processor 1 and first processor 2 cannot.
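Purely for illustration (the table and function below are invented for this sketch and are not part of the claimed system), the partition layout of Fig. 8C can be summarized as a per-partition access table:

```python
# Sketch only: which kind of processor may read/write each memory portion
# in the third implementation (Fig. 8C).
PARTITIONS = {
    "portion1": {"read": {"first", "second"}, "write": {"first", "second"}},  # shared; strict coherence
    "portion2": {"read": {"first", "second"}, "write": {"first"}},            # second processors read-only
    "portion3": {"read": {"first"},           "write": {"first"}},            # first processors only
    "portion4": {"read": {"second"},          "write": {"second"}},           # second processors only
}

def check_access(processor_kind, portion, op):
    """Return True if a processor of this kind may perform op ('read'/'write') on the portion."""
    return processor_kind in PARTITIONS[portion][op]

assert check_access("second", "portion2", "read")
assert not check_access("second", "portion2", "write")
assert not check_access("first", "portion4", "write")
```

In this view, only portion 1 needs strict coherence, because it is the only portion with more than one kind of writer.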
It should be understood that the above specific implementations are merely examples and should not be construed as limitations.
In a specific embodiment of this application, the first processor is configured to operate on input data to obtain first data and write the first data to the first memory portion through the memory controller; the first memory is configured to store the first data in the first memory portion; the second processor is configured to read the first data from the first memory and operate on the first data to obtain second data; and the first memory is configured to store the second data in the first memory portion. Taking a rendering scenario as an example, the processors include a first processor and second processors; the first processor includes a CPU, and the second processors include a GPU, an AI chip, and a codec chip. The CPU writes the data to be rendered to address A of the first memory portion of the first memory; the GPU reads the data to be rendered from address A of the first memory portion, renders it to obtain rendered data, and writes the rendered data to address B of the first memory portion. The AI chip then reads the rendered data from address B of the first memory portion, performs super-resolution processing to obtain super-resolution data, and writes it to address C of the first memory portion. The codec chip reads the super-resolution data from address C of the first memory, completes H.264/H.265 encoding to obtain encoded data, and writes it to address D of the first memory portion.
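The data flow of this rendering example can be sketched, purely for illustration (the stage functions below are invented stand-ins; actual rendering, super-resolution, and encoding are performed by the respective chips), as stages that communicate only through addresses of the shared first memory:

```python
# Sketch only: each pipeline stage reads its input from one address of the
# shared first memory and writes its output to the next address -- no data
# is copied between per-processor memories.
shared_memory = {}  # models the first memory portion, keyed by address

def cpu_stage(raw):
    shared_memory["A"] = f"prepared({raw})"                 # CPU writes data to be rendered

def gpu_stage():
    shared_memory["B"] = f"rendered({shared_memory['A']})"  # GPU reads A, writes B

def ai_stage():
    shared_memory["C"] = f"superres({shared_memory['B']})"  # AI chip reads B, writes C

def codec_stage():
    shared_memory["D"] = f"h265({shared_memory['C']})"      # codec chip reads C, writes D

cpu_stage("scene")
gpu_stage()
ai_stage()
codec_stage()
print(shared_memory["D"])  # h265(superres(rendered(prepared(scene))))
```

Because all four stages address the same first memory portion, the scene data exists once rather than being replicated into each processor's local memory.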
In a specific embodiment of this application, the heterogeneous computing system also needs to be configured before it is started, for example, page table management, the interleaving method, and weak bidirectional memory coherence. Specifically, through operating-system settings for page table access permissions and data lock protection, together with a fast synchronization mechanism between hardware modules inside a computing unit, weak cache coherence is implemented inside the heterogeneous system (rather than the strong MOESI cache coherence used between homogeneous systems), as a trade-off between implementation cost and performance.
In the above embodiments, implementation may be entirely or partly by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced entirely or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable media may be magnetic media (e.g., floppy disks, storage disks, magnetic tapes), optical media (e.g., DVD), or semiconductor media (e.g., solid state disks (SSD)), among others.

Claims (12)

  1. A heterogeneous computing system, characterized by comprising: a first processor, a second processor, a memory controller, and a first memory, wherein the first processor is connected to the memory controller via a high-speed bus, the second processor is connected to the memory controller via a high-speed bus, the memory controller is connected to the first memory via an interface, the first processor and the second processor are heterogeneous, the first processor can perform read and write operations on at least a part of the first memory, and the second processor can also perform read and write operations on at least a part of the first memory.
  2. The system according to claim 1, wherein the first processor and the second processor use different instruction sets.
  3. The system according to claim 1 or 2, wherein the first processor and the second processor use different microarchitectures.
  4. The system according to any one of claims 1-3, wherein the first processor includes more arithmetic units than the second processor.
  5. The system according to any one of claims 1-4, wherein the first processor is a central processing unit (CPU), and the second processor includes one or more of a graphics processing unit (GPU), an artificial intelligence (AI) chip, and an encoding chip.
  6. The system according to any one of claims 1-5, wherein the first memory includes a first memory portion,
    the first processor has read-write permission on the first memory portion, and the second processor has read-write permission on the first memory portion.
  7. The system according to claim 6, wherein
    the first processor is configured to operate on input data to obtain first data, and to write the first data to the first memory portion through the memory controller;
    the first memory is configured to store the first data in the first memory portion;
    the second processor is configured to read the first data from the first memory and to operate on the first data to obtain second data;
    the first memory is configured to store the second data in the first memory portion.
  8. The system according to claim 6 or 7, wherein the first memory further includes a second memory portion; the first processor has read-write permission on the second memory portion, and the second processor has read-only permission on the second memory portion.
  9. The system according to any one of claims 6-8, wherein the first memory further includes a third memory portion and a fourth memory portion,
    the first processor has read-write permission on the third memory portion, and the second processor does not have read-write permission on the third memory portion;
    the first processor does not have read-write permission on the fourth memory portion, and the second processor has read-write permission on the fourth memory portion.
  10. The system according to any one of claims 1-9, wherein the system further includes a second memory, the first memory is connected to the first processor, and the second memory can be accessed by the first processor and cannot be accessed by the second processor.
  11. The system according to any one of claims 1-10, wherein the first processor and the second processor form a symmetric multiprocessing system.
  12. A computing device, characterized by comprising the heterogeneous computing system according to any one of claims 1-11.
PCT/CN2021/135791 2020-12-31 2021-12-06 Heterogeneous computing system and related devices WO2022143019A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011641741.X 2020-12-31
CN202011641741 2020-12-31
CN202110559382.1A CN114691557A (zh) 2020-12-31 2021-05-21 Heterogeneous computing system and related devices
CN202110559382.1 2021-05-21

Publications (1)

Publication Number Publication Date
WO2022143019A1 true WO2022143019A1 (zh) 2022-07-07

Family

ID=82135922

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/135791 WO2022143019A1 (zh) 2020-12-31 2021-12-06 异构计算系统以及相关设备

Country Status (2)

Country Link
CN (1) CN114691557A (zh)
WO (1) WO2022143019A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407181B (zh) * 2023-12-14 2024-03-22 沐曦集成电路(南京)有限公司 一种基于屏障指令的异构计算进程同步方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424122A (zh) * 2013-09-09 2015-03-18 联想(北京)有限公司 一种电子设备及内存划分方法
CN105786400A (zh) * 2014-12-25 2016-07-20 研祥智能科技股份有限公司 一种异构混合内存组件、系统及存储方法
CN108089920A (zh) * 2016-11-23 2018-05-29 华为技术有限公司 一种数据处理的方法、装置和系统
US20180267722A1 (en) * 2017-03-17 2018-09-20 International Business Machines Corporation Partitioned memory with locally aggregated copy pools
CN109684085A (zh) * 2018-12-14 2019-04-26 北京中科寒武纪科技有限公司 内存访问方法及相关产品


Also Published As

Publication number Publication date
CN114691557A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
US11567780B2 (en) Apparatus, systems, and methods for providing computational imaging pipeline
US11367160B2 (en) Simultaneous compute and graphics scheduling
KR102218332B1 (ko) 확장 가능한 gpu에서 데이터 분배 패브릭
US8990833B2 (en) Indirect inter-thread communication using a shared pool of inboxes
TWI620128B (zh) 在中央處理單元與圖形處理單元間分享資源之裝置與系統
JP6006230B2 (ja) 組み合わせたcpu/gpuアーキテクチャシステムにおけるデバイスの発見およびトポロジーのレポーティング
US8773449B2 (en) Rendering of stereoscopic images with multithreaded rendering software pipeline
RU2597556C2 (ru) Структура компьютерного кластера для выполнения вычислительных задач и способ функционирования указанного кластера
CN106844048B (zh) 基于硬件特性的分布式共享内存方法及系统
US10275275B2 (en) Managing copy operations in complex processor topologies
EP4231242A1 (en) Graphics rendering method and related device thereof
US20230229524A1 (en) Efficient multi-device synchronization barriers using multicasting
EP3662376B1 (en) Reconfigurable cache architecture and methods for cache coherency
WO2022143019A1 (zh) 异构计算系统以及相关设备
US10180916B2 (en) Managing copy operations in complex processor topologies
Gantel et al. Dataflow programming model for reconfigurable computing
US8539516B1 (en) System and method for enabling interoperability between application programming interfaces
CN111274161A (zh) 用于加速串行化算法的具有可变等待时间的位置感知型存储器
GB2520603A (en) Atomic memory update unit and methods

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913760

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913760

Country of ref document: EP

Kind code of ref document: A1