WO2013097098A1 - Data processing method, graphics processing unit (gpu) and first node device - Google Patents


Info

Publication number
WO2013097098A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
communication data
communication
node device
cpu
Prior art date
Application number
PCT/CN2011/084764
Other languages
French (fr)
Chinese (zh)
Inventor
蒋吴军
卢彦超
郑龙
过敏意
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2011/084764 priority Critical patent/WO2013097098A1/en
Priority to CN201180003244.XA priority patent/CN103282888B/en
Publication of WO2013097098A1 publication Critical patent/WO2013097098A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a data processing method, a graphics processor (GPU), and a first node device.
  • the data communication mechanism between node devices is the basis of distributed parallel computing.
  • in a typical distributed parallel computing system, there is a certain amount of shared data or data flow between processes belonging to the same task, and these processes need to synchronize at specific points.
  • when a GPU (Graphics Processing Unit) is added to a node device, a distributed GPU system is formed.
  • in a distributed GPU system, each process belonging to the same task is run by the GPU of a different node device, where a node device may be a commercial server; since the processes share a certain amount of data, an inter-node communication mechanism is needed to implement the flow of the shared data.
  • owing to the slave-processor nature of the GPU, the CPU (Central Processing Unit) of the second node device, after GPU2 finishes running the second process, copies the communication data into its own memory and transmits it via CPU1 of the first node device to GPU1, so that GPU1 can execute the first process.
  • the inventors have found that the prior art has at least the following problems:
  • when the first process of GPU1 needs to share the intermediate running data of the second process of GPU2 while running, the first process must wait until GPU2 has run the entire second process before it can obtain that intermediate data, which extends the running time of the first process and thereby reduces the computing efficiency of the system.
  • in order to improve the computing efficiency of the system, an embodiment of the present invention provides a data processing method, a graphics processor GPU, and a first node device.
  • the technical solution is as follows:
  • a data processing method, comprising: when the central processing unit (CPU) of a first node device starts a kernel program of the graphics processor (GPU) of the node device, the GPU runs the kernel program, the kernel program including at least one preset GPU communication application programming interface (API);
  • when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data;
  • the GPU determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
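Read together, the bullets above describe a flag-and-buffer handshake between the GPU and the host CPU. The following Python sketch simulates only the send path with two threads standing in for the GPU kernel and the CPU; the buffer layout, names, and polling interval are illustrative assumptions, not the patent's implementation.

```python
import threading
import time

# Shared "preset buffer" (in real hardware: video memory) and host memory.
# All names here are assumptions for illustration only.
preset_buffer = {"flag": 0, "data": None}
host_memory = []
lock = threading.Lock()

def gpu_send(first_communication_data):
    """Simulated send-type communication operation on the GPU side."""
    with lock:
        preset_buffer["data"] = first_communication_data
        preset_buffer["flag"] = 1          # set state: data ready for the CPU
    while True:                            # wait until the CPU consumes it
        with lock:
            if preset_buffer["flag"] == 0:
                return
        time.sleep(0.001)

def cpu_receive():
    """Simulated CPU side: poll the flag, copy out, reset the flag."""
    while True:
        with lock:
            if preset_buffer["flag"] == 1:
                host_memory.append(preset_buffer["data"])  # copy to memory
                preset_buffer["flag"] = 0                  # reset state
                return
        time.sleep(0.001)

gpu = threading.Thread(target=gpu_send, args=(["intermediate", "data"],))
cpu = threading.Thread(target=cpu_receive)
gpu.start(); cpu.start(); gpu.join(); cpu.join()
print(host_memory)  # → [['intermediate', 'data']]
```

The key design point the claims rely on is that the GPU never initiates a host-side copy itself; it only writes into the preset buffer and raises a flag that the CPU polls.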
  • a graphics processor GPU comprising:
  • a running module, configured to run the kernel program when the central processing unit (CPU) of the first node device starts the kernel program of the graphics processor (GPU) of the node device, the kernel program including at least one preset GPU communication application programming interface (API);
  • an obtaining module, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API;
  • a determining processing module, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
  • a first node device comprising a central processing unit CPU and the above graphics processor GPU;
  • the technical solution provided by the embodiments of the present invention has the following beneficial effects: a preset GPU communication API is inserted in the kernel program of the GPU of the first node device wherever intermediate running data needs to be shared; when the kernel program runs to the preset GPU communication API, the GPU acquires the intermediate running data of the part of the kernel program that has already run, that is, the first communication data; the GPU determines whether the communication operation corresponding to the GPU communication API is for sending or for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data.
  • FIG. 5 is a schematic diagram of communication interaction between GPUs of different nodes according to Embodiment 3 of the present invention.
  • embodiments of the present invention provide a data processing method, a graphics processor GPU, and a first node device.
  • the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving; if it is for sending, the GPU stores the first communication data in a preset buffer of the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
  • a preset GPU communication API is inserted in the kernel program of the GPU of the first node device wherever intermediate running data needs to be shared; when the kernel program runs to the preset GPU communication API, the GPU acquires the intermediate running data of the part of the kernel program that has already run, that is, the first communication data; the GPU determines whether the corresponding communication operation is for sending or for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. In this way, the intermediate running data (the first and second communication data) is acquired in time while the GPU's kernel program is still running, so the second node device does not need to wait for the entire kernel program of the first node device to finish before acquiring the intermediate running data; this shortens the running time of the process on the second node device and improves the computing efficiency of the system.
  • FIG. 2 is a flowchart of an embodiment of a data processing method according to Embodiment 2 of the present invention.
  • the kernel program of GPU1 includes at least one preset GPU communication API.
  • the preset GPU communication APIs divide the kernel program of GPU1 into multiple sub-kernel programs, so the kernel program includes at least two sub-kernel programs, none of which contains a communication operation; a preset GPU communication API is a communication API supported by the GPU, and different APIs correspond to different communication operations, which include communication operations for sending and communication operations for receiving.
  • GPU1 determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is for sending, S204 is executed; if it is for receiving, S205 is executed.
  • GPU1 stores the first communication data in a preset buffer of the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device.
  • a communication operation for sending indicates that GPU1 wants to send the first communication data to CPU1 of the local node device; however, owing to the slave-processor nature of the GPU, the first communication data can only be fetched from the preset buffer by CPU1 of the local node.
  • after GPU1 stores the first communication data in the preset buffer of the video memory of the local node device, control switches from the kernel program to CPU code, and CPU1 runs its own program; when CPU1 runs to the CPU communication API corresponding to the communication operation for receiving, CPU1 copies the first communication data into the memory of the local node device.
  • the preset buffer is specified by the user.
  • the GPU 1 acquires second communication data from the preset buffer, where the second communication data is copied by the CPU1 into the preset buffer.
  • if the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, it indicates that CPU1 wants to transmit the second communication data to GPU1.
  • control switches from the kernel program to CPU code, and CPU1 runs its own program; when CPU1 runs to the CPU communication API corresponding to the communication operation for sending, CPU1 copies the second communication data from the memory of the node device into the preset buffer of the video memory of the node device.
  • the second communication data may be communication data of a program run by CPU1 itself, or second communication data generated by the kernel program of GPU2 on the second node device; in the latter case, CPU2 of the second node device copies the second communication data from the preset buffer on the second node device into the memory of the second node device, and then transmits it to CPU1.
  • the subsequent part of the GPU's kernel program then continues to execute, that is, the remaining sub-kernel programs of the kernel program are executed in sequence.
  • when there are multiple GPU communication APIs in the kernel program of the GPU, the GPU cyclically executes the above S202-S205 until the entire kernel program ends.
  • the method further includes: CPU1 of the first node device transmits the first communication data, via CPU2 of the second node device, to GPU2 of the second node device, so that GPU2 shares the first communication data; similarly, GPU2 on the second node device can transmit its second communication data to GPU1 through CPU2 and CPU1 in sequence, thereby realizing two-way communication between GPUs running on different node devices in the cluster; the communication mechanism between the CPUs on the different node devices may be implemented using existing techniques such as sockets or the Message Passing Interface (MPI), and is not described here.
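The inter-node CPU hop is left to existing mechanisms such as sockets or MPI. The snippet below sketches that hop with a socket pair standing in for a real network connection between CPU1 and CPU2; the variable names and the use of pickle for serialization are illustrative assumptions only.

```python
import pickle
import socket

# Rough sketch of the CPU1 -> CPU2 hop; a real cluster would use TCP
# sockets between hosts or MPI point-to-point calls instead.
cpu1_end, cpu2_end = socket.socketpair()

first_communication_data = {"step": 42, "partial_sum": 7.5}
cpu1_end.sendall(pickle.dumps(first_communication_data))  # CPU1 forwards
cpu1_end.close()

chunks = []
while True:                       # CPU2 reads until CPU1 closes its end
    part = cpu2_end.recv(4096)
    if not part:
        break
    chunks.append(part)
received = pickle.loads(b"".join(chunks))
cpu2_end.close()
print(received)  # → {'step': 42, 'partial_sum': 7.5}
```

After this hop, CPU2 would copy `received` into the second node's preset buffer so that GPU2 can pick it up as described above.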
  • the kernel program of the GPU includes a preset GPU communication API, so that the GPU has the function of active communication.
  • when the kernel program of the GPU executes to the preset GPU communication API, this indicates that the GPU wants to send or receive communication data; correspondingly, the CPU on the node device fetches communication data from the preset buffer or copies communication data into it, thereby indirectly implementing the GPU's communication operation and achieving two-way communication between the CPU and the GPU on the same node device while the GPU kernel program is running.
  • two-way communication between the GPU and the CPU on a single node device is implemented while the GPU's kernel program is running, and on that basis two-way communication between GPUs running on different node devices in the cluster is realized.
  • GPU1 stores the first communication data in the first communication data buffer of the video memory of the local node device and sets the state of the first indication signal bit to the set state.
  • GPU1 then continuously queries (i.e., polls) the state of the first indication signal bit; as long as the state remains set, GPU1 keeps querying it, waiting for the CPU to consume the data and reset the bit.
  • the system adopts task-scheduling policy optimization: computing tasks that need to perform a synchronization operation are identified before distribution, a global identification bit is set, and the computing tasks are then distributed to the system; only when the computing tasks on all nodes that need to synchronize are ready to run are they scheduled to run together. Because a user's GPU tasks run exclusively, the number of tasks to be synchronized cannot exceed the number of concurrent tasks the system allows, and the tasks to be synchronized need to be in the running state at the same time; otherwise, system performance will suffer.
  • the obtaining module 502 is configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API.
  • the determining processing module 503 is configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is for sending, the GPU stores the first communication data in a preset buffer of the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
  • the obtaining module 502 includes an obtaining unit 5021, as shown in FIG. 7, which is a second structural diagram of the graphics processor GPU embodiment according to Embodiment 4 of the present invention; the obtaining unit 5021 is configured to acquire the communication data of a sub-kernel program.
  • the preset buffer includes indication signal bits and communication data buffers; the indication signal bits include a first indication signal bit and a second indication signal bit, and the communication data buffers include a first communication data buffer and a second communication data buffer, where the first indication signal bit and the first communication data buffer serve as the indication signal bit and data buffer for the CPU receiving from the GPU, and the second indication signal bit and the second communication data buffer serve as the indication signal bit and data buffer for the GPU receiving from the CPU.
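The two-direction buffer just described can be modeled as a small data structure: one indication signal bit plus one data buffer per direction. The constant and field names below are illustrative assumptions, not taken from the patent; the reception-error state corresponds to the validity check mentioned later.

```python
from dataclasses import dataclass, field

RESET, SET, RECV_ERROR = 0, 1, 2   # hypothetical indication-signal-bit states


@dataclass
class PresetBuffer:
    """In-memory model of the preset buffer: one flag + buffer per direction."""
    first_flag: int = RESET                        # CPU receives from GPU
    first_data: list = field(default_factory=list)
    second_flag: int = RESET                       # GPU receives from CPU
    second_data: list = field(default_factory=list)


buf = PresetBuffer()

# GPU "send": write into the first pair, then raise the flag.
buf.first_data = [0.5, 1.5]
buf.first_flag = SET

# CPU side: after seeing the set state, copy out, validate, then reset
# the flag (or mark a reception error if the data is invalid).
host_copy = list(buf.first_data)
buf.first_flag = RESET if host_copy == buf.first_data else RECV_ERROR
print(buf.first_flag)  # → 0
```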
  • the determining processing module 503 includes a storage setting unit 5031, as shown in FIG. 8, which is a third structural diagram of the graphics processor GPU embodiment according to Embodiment 4 of the present invention; the storage setting unit 5031 is configured to store the first communication data in the first communication data buffer of the video memory of the local node device and set the state of the first indication signal bit to the set state, so that the CPU, after querying that the first indication signal bit is in the set state, copies the first communication data from the first communication data buffer into the memory of the node device.
  • the determining processing module 503 includes:
  • FIG. 9 is a fourth structural diagram of a GPU embodiment of a graphics processor according to Embodiment 4 of the present invention.
  • the CPU 40 is configured to start the kernel program of the graphics processor GPU of the node device, copy the first communication data from the preset buffer into the memory of the node device, and copy the second communication data into the preset buffer.
  • the CPU 40 is further configured to transmit the first communication data to a GPU of the second node device by using a CPU of the second node device, so that the GPU of the second node device shares the first communication data.
  • the CPU 40 is further configured to check whether the first communication data is valid; if it is, the state of the first indication signal bit is set to the reset state; if not, the state of the indication signal bit is set to the reception-error state.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Computer And Data Communications (AREA)

Abstract

Provided are a data processing method, a graphics processing unit (GPU) and a first node device, which relate to the technical field of communications. The data processing method comprises: when a CPU starts up a kernel program of a GPU of a node device, the GPU runs the kernel program, the kernel program comprising at least one preset GPU communication API; when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data; and the GPU judges whether a communication operation corresponding to the preset GPU communication API is a communication operation for transmitting or a communication operation for receiving, and if it is a communication operation for transmitting, then the GPU stores the first communication data in a preset buffer of a video memory, and allows the CPU to copy the first communication data from the preset buffer to a memory of the node device; and if it is a communication operation for receiving, then the GPU acquires second communication data from the preset buffer. The computational efficiency of the system is improved by the present invention.

Description

Data processing method, graphics processor GPU and first node device

Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data processing method, a graphics processor GPU, and a first node device.
Background Technique
In a distributed environment, the data communication mechanism between node devices is the basis of distributed parallel computing. In a typical distributed parallel computing system, there is a certain amount of shared data or data flow between processes belonging to the same task, and these processes need to synchronize at specific points. When a GPU (Graphics Processing Unit) is added to a node device, a distributed GPU system is formed.
In a distributed GPU system, each process belonging to the same task is run by the GPU of a different node device, where a node device may be a commercial server. Since the processes share a certain amount of data, an inter-node communication mechanism is needed to implement the flow of the shared data. For example, when the first process of GPU1 of the first node device needs to share the communication data of the second process of GPU2 of the second node device, then, owing to the slave-processor nature of the GPU, CPU2 (Central Processing Unit) of the second node device, after GPU2 has finished running the second process, copies the communication data into its own memory and transmits it via CPU1 of the first node device to GPU1, so that GPU1 executes the first process.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem: when the first process of GPU1 needs to share the intermediate running data of the second process of GPU2 while running, the first process must wait until GPU2 has run the entire second process before it can obtain the intermediate running data, which extends the running time of the first process and thus reduces the computing efficiency of the system.

Summary of the Invention
To improve the computing efficiency of the system, embodiments of the present invention provide a data processing method, a graphics processor GPU, and a first node device. The technical solution is as follows:
A data processing method, the method comprising: when the central processing unit CPU of a first node device starts the kernel program of the graphics processor GPU of the node device, the GPU runs the kernel program, the kernel program including at least one preset GPU communication application programming interface API;
when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data;
the GPU determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.
A graphics processor GPU, comprising:
a running module, configured to run the kernel program when the central processing unit CPU of the first node device starts the kernel program of the graphics processor GPU of the node device, the kernel program including at least one preset GPU communication application programming interface API;
an obtaining module, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API;
a determining processing module, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.
A first node device, comprising a central processing unit CPU and the above graphics processor GPU;
the CPU is configured to start the kernel program of the graphics processor GPU of the node device, copy the first communication data from the preset buffer into the memory of the node device, and copy the second communication data into the preset buffer.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows: a preset GPU communication API is inserted in the kernel program of the GPU of the first node device wherever intermediate running data needs to be shared; when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires the intermediate running data of the part of the kernel program that has already run, that is, the first communication data; the GPU determines whether the communication operation corresponding to the GPU communication API is for sending or for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first and second communication data) in time while the GPU's kernel program is running, so the second node device does not need to wait for the entire kernel program of the first node device to finish before acquiring the intermediate running data; this shortens the running time of the process on the second node device and improves the computing efficiency of the system.

Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention;
FIG. 3 is a flowchart of a data processing method according to Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a preset buffer according to Embodiment 3 of the present invention;
FIG. 5 is a schematic diagram of the communication interaction between GPUs on different nodes according to Embodiment 3 of the present invention;
FIG. 6 is a first schematic structural diagram of a graphics processing unit (GPU) according to Embodiment 4 of the present invention; FIG. 7 is a second schematic structural diagram of a GPU according to Embodiment 4 of the present invention; FIG. 8 is a third schematic structural diagram of a GPU according to Embodiment 4 of the present invention; FIG. 9 is a fourth schematic structural diagram of a GPU according to Embodiment 4 of the present invention; and FIG. 10 is a schematic structural diagram of a first node device according to Embodiment 5 of the present invention.

DETAILED DESCRIPTION
The embodiments of the present invention provide a data processing method, a graphics processing unit (GPU), and a first node device.
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
Referring to FIG. 1, FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention. The data processing method includes:
S101: When the central processing unit (CPU) of a first node device starts the kernel program of the graphics processing unit (GPU) of the local node device, the GPU runs the kernel program, where the kernel program includes at least one preset GPU communication application programming interface (API).
S102: When the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data.
S103: The GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving. If it is a communication operation for sending, the GPU stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied into the preset buffer by the CPU.
In this embodiment, the kernel program of the GPU includes a preset GPU communication API, which gives the GPU the ability to initiate communication. When the kernel program of the GPU executes the preset GPU communication API, this indicates that the GPU wants to send or receive communication data; correspondingly, the CPU of the local node device fetches the communication data from the preset buffer or copies communication data into the preset buffer, thereby indirectly implementing the communication operation of the GPU and, in turn, two-way communication between the CPU and the GPU on the same node device while the GPU kernel program is running.
In this embodiment, a preset GPU communication API is inserted into the kernel program of the GPU of the first node device at each point where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, that is, the first communication data, is acquired. The GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and, according to the result, the GPU and the CPU of the local node device perform the corresponding processing to complete the communication operation of the GPU, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) in time while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data. This shortens the running time of the process on the second node device and improves the computing efficiency of the system.

Embodiment 2
Referring to FIG. 2, FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention. The data processing method includes:
S201: When CPU1 of a first node device starts the kernel program of GPU1 of the local node device, GPU1 runs the kernel program.
In this step, the kernel program of GPU1 includes at least one preset GPU communication API (Application Programming Interface). The preset GPU communication APIs divide the kernel program of GPU1 into multiple sub-kernel programs, so the kernel program includes at least two sub-kernel programs, and no communication operation exists within any single sub-kernel program. A preset GPU communication API is a communication API supported by the GPU, and the APIs correspond to different communication operations, where the communication operations include communication operations for sending and communication operations for receiving.
S202: When the kernel program of GPU1 runs to the preset GPU communication API, GPU1 acquires first communication data.

In this step, when GPU1 runs to the preset GPU communication API, GPU1 ends the running of the current sub-kernel program and acquires the first communication data, where the first communication data is the communication data of the sub-kernel program that has just finished running.
S203: GPU1 determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving. If it is a communication operation for sending, S204 is performed; if it is a communication operation for receiving, S205 is performed.
S204: GPU1 stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device.
When the communication operation corresponding to the preset GPU communication API is a communication operation for sending, this indicates that GPU1 wants to send the first communication data to CPU1 of the local node device; however, because the GPU is a slave processor, the first communication data can only be fetched from the preset buffer by CPU1 of the local node.
Specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for sending, GPU1 stores the first communication data into the preset buffer in the video memory of the local node device, execution switches from the kernel program to CPU code, and CPU1 runs its own program. When CPU1 runs to the CPU communication API corresponding to the communication operation for receiving, CPU1 copies the first communication data into the memory of the local node device. The preset buffer is specified by the user.
S205: GPU1 acquires second communication data from the preset buffer, where the second communication data is copied into the preset buffer by CPU1.
When the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, this indicates that CPU1 wants to send second communication data to GPU1.
Specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, execution switches from the kernel program to CPU code, and CPU1 runs its own program. When CPU1 runs to the CPU communication API corresponding to the communication operation for sending, CPU1 copies the second communication data from the memory of the local node device into the preset buffer in the video memory of the local node device. The second communication data may be communication data of the program run by CPU1 itself, or may be second communication data generated by the kernel program of GPU2 on a second node device; specifically, CPU2 of the second node device copies the second communication data from the preset buffer on the second node device into the memory of the second node device, and CPU2 then transmits the second communication data to CPU1.
After the preset GPU communication API finishes executing, the subsequent part of the kernel program of the GPU continues to be executed, that is, the remaining sub-kernel programs of the kernel program of the GPU are executed in sequence.
When multiple GPU communication APIs exist in the kernel program of the GPU, the GPU executes the above flow of S202-S205 in a loop until the entire kernel program of the GPU ends.

In this embodiment, the method further includes: CPU1 of the first node device transmits the first communication data to GPU2 of the second node device via CPU2 of the second node device, so that GPU2 of the second node device shares the first communication data. Likewise, GPU2 on the second node device may transmit its second communication data to GPU1 via CPU2 and CPU1 in sequence, thereby implementing two-way communication between GPUs on different node devices within a cluster while their kernels are running. The communication mechanism between CPUs on different node devices may be implemented by using an existing technology such as socket or MPI (Message Passing Interface), and details are not described here again.
In this embodiment, the kernel program of the GPU includes a preset GPU communication API, which gives the GPU the ability to initiate communication. When the kernel program of the GPU executes the preset GPU communication API, this indicates that the GPU wants to send or receive communication data; correspondingly, the CPU of the local node device fetches the communication data from the preset buffer or copies communication data into the preset buffer, thereby indirectly implementing the communication operation of the GPU and, in turn, two-way communication between the CPU and the GPU on the same node device while the GPU kernel program is running.
In this embodiment, a preset GPU communication API is inserted into the kernel program of the GPU of the first node device at each point where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, that is, the first communication data, is acquired. The GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and, according to the result, the GPU and the CPU of the local node device perform the corresponding processing to complete the communication operation of the GPU, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) in time while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data. This shortens the running time of the process on the second node device and improves the computing efficiency of the system.
In addition, this embodiment implements two-way communication between the GPU and the CPU on a single node device while the kernel program of the GPU is running; and, on that basis, combined with the existing communication mechanism between CPUs on different node devices within a cluster, it implements two-way communication between GPUs on different node devices within the cluster while their kernels are running.

Embodiment 3
Referring to FIG. 3, FIG. 3 is a flowchart of a data processing method according to Embodiment 3 of the present invention. In this embodiment, the communication between the CPU and the GPU is encapsulated in a layer above the CPU device and the GPU device, and this layer provides basic communication operations for a distributed GPU system. The data processing method includes:

S301: When CPU1 of a first node device starts the kernel program of GPU1 of the local node device, GPU1 runs the kernel program.

In this step, the kernel program of GPU1 includes at least one preset GPU communication API (Application Programming Interface), and the preset GPU communication APIs correspond to different communication operations, where the communication operations include communication operations for sending and communication operations for receiving.
The kernel program of the GPU includes a preset GPU communication API, which gives the GPU the ability to initiate communication.
S302: When the kernel program of GPU1 runs to the preset GPU communication API, GPU1 acquires first communication data.
In this step, when GPU1 runs to the preset GPU communication API, GPU1 acquires the first communication data, where the first communication data is the communication data of the part of the kernel program that has just finished running.
S303: GPU1 determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving. If it is a communication operation for sending, S304 is performed; if it is a communication operation for receiving, S305 is performed.
S304: GPU1 stores the first communication data into a preset buffer in the video memory of the local node device, so that CPU1 copies the first communication data from the preset buffer into the memory of the local node device.
In this embodiment, because the CPU can directly access the video memory of the GPU of the local node device, a buffer is preset in the video memory of the local node device for each SM (streaming multiprocessor) of the GPU, where the preset buffer includes multiple fields that include at least flag signal bits and communication data buffers, as shown in FIG. 4, which is a schematic structural diagram of the preset buffer according to Embodiment 3 of the present invention. The communication data buffer may further include the length of the communication data, that is, the size of the data that the CPU or GPU program needs to communicate.
The flag signal bits may include a first flag signal bit and a second flag signal bit, and the communication data buffers may include a first communication data buffer and a second communication data buffer. The first flag signal bit and the first communication data buffer correspond to the communication operation for sending, that is, they are the flag signal bit and communication data buffer used when the CPU receives the communication data of the GPU; the second flag signal bit and the second communication data buffer correspond to the communication operation for receiving, that is, they are the flag signal bit and communication data buffer used when the GPU receives the communication data of the CPU.
The state of a flag signal bit includes a reset state, a set state, and a receive-error state, where the reset state may be 0, the set state may be 1, and the receive-error state may be any value other than 0 and 1.
In this step, specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for sending, GPU1 stores the first communication data into the first communication data buffer in the video memory of the local node device and sets the state of the first flag signal bit to the set state.

GPU1 continuously queries (that is, polls) the state of the first flag signal bit. When the state of the first flag signal bit is the set state, GPU1 continues querying it; when the state of the first flag signal bit is the receive-error state, GPU1 copies the first communication data into the first communication data buffer again and sets the state of the first flag signal bit to the set state; when the state of the first flag signal bit is the reset state, GPU1 queries whether the state of the second flag signal bit is the set state, and if so, performs a flow similar to the corresponding flow in S305; if not, it continues querying the state of the second flag signal bit until that state is the set state.
The thread responsible for communication on CPU1 also continuously queries the state of the first flag signal bit. When CPU1 finds that the state of the first flag signal bit is the set state, CPU1 clears a counter to zero and copies the first communication data in the first communication data buffer into the memory of the local node device.
After acquiring the first communication data, CPU1 verifies the first communication data; specifically, a check bit may be added to verify whether the first communication data is valid.
CPU1 checks whether the first communication data is valid. If it is, CPU1 sets the state of the first flag signal bit to the reset state; if it is not, CPU1 sets the state of the first flag signal bit to the receive-error state.
After CPU1 sets the state of the first flag signal bit to the receive-error state, CPU1 determines whether the counter has reached a preset threshold. If it has, CPU1 reports that errors have occurred too many times and the device may be faulty, and the program terminates; if it has not, the counter is incremented by 1. After the counter is incremented by 1, CPU1 loops back and performs the verification on the newly acquired first communication data.
After setting the state of the first flag signal bit to the reset state, CPU1 continues executing its own program to process data. When CPU1 does not need to continue communicating with GPU1, it simply ends its own program; when CPU1 needs to continue communicating with GPU1, CPU1 copies second communication data into the second communication data buffer and sets the state of the second flag signal bit to the set state.
CPU1 continuously queries the state of the second flag signal bit. When the state of the second flag signal bit is the set state, CPU1 continues querying it; when the state of the second flag signal bit is the receive-error state, CPU1 copies the second communication data into the second communication data buffer again and sets the state of the second flag signal bit to the set state; when the state of the second flag signal bit is the reset state, CPU1 determines whether it needs to receive the first communication data to be sent by the GPU, and if so, queries whether the state of the first flag signal bit is the set state; if not, it continues running its own program.
S305: GPU1 acquires the second communication data from the preset buffer, where the second communication data is copied into the preset buffer by CPU1.

In this step, specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, GPU1 continuously queries the state of the second flag signal bit. When the state of the second flag signal bit is the set state, this indicates that CPU1 has copied the second communication data into the second communication data buffer and has set the state of the second flag signal bit to the set state; GPU1 then clears the counter to zero and acquires the second communication data from the second communication data buffer.
S306: GPU1 checks whether the second communication data is valid. If it is, GPU1 sets the state of the second flag signal bit to the reset state; if it is not, GPU1 sets the state of the second flag signal bit to the receive-error state.
In this step, after the state of the second flag signal bit is set to the reset state, GPU1 continues executing the kernel program to process data, and when it encounters the preset GPU communication API again, it performs the corresponding processing.
S307: GPU1 determines whether the counter has reached a preset threshold. If it has, GPU1 reports that errors have occurred too many times and the device may be faulty, and the program terminates; if it has not, the counter is incremented by 1.
After the counter is incremented by 1, GPU1 returns to S306 and performs the verification flow on the newly acquired second communication data.
In this embodiment, the method further includes: CPU1 of the first node device transmits the first communication data to GPU2 of the second node device via CPU2 of the second node device, so that GPU2 of the second node device shares the first communication data. Likewise, GPU2 on the second node device may transmit its second communication data to GPU1 via CPU2 and CPU1 in sequence, thereby implementing two-way communication between GPUs on different node devices within a cluster while their kernels are running, as shown in FIG. 5, which is a schematic diagram of the communication interaction between GPUs on different nodes according to Embodiment 3 of the present invention. In FIG. 5, the solid lines indicate the communication path of the first communication data, and the dashed lines indicate the communication path of the second communication data. The communication mechanism between CPUs on different node devices may be implemented by using an existing technology such as socket or MPI (Message Passing Interface), and details are not described here again.
In this embodiment, the two-way communication between GPUs on different node devices may further be encapsulated as a cloud communication layer API placed in the cloud for use by upper-layer application developers (users) of the GPU distributed system. The cloud communication layer API also incorporates optimizations of the task scheduling policy (transparent to the user), which effectively avoids problems such as deadlock and low efficiency and ensures the correctness and stability of the GPU distributed system. The cloud communication layer API is used to write distributed GPU computing tasks and provides three APIs, specifically: a send operation API, a receive operation API, and a synchronization operation API.
The send operation API is: CLOUD_Send(data_type, data_length, data_buffer, destination)
The receive operation API is: CLOUD_Recv(data_type, data_length, data_buffer, source)

Here, data_type is the type of the data units to be sent/received, data_length is the size of the data content (that is, how many units of data), data_buffer is the buffer holding the data to be sent/received, destination is the destination address of the send operation, and source is the data source address of the receive operation. CLOUD_Send() and CLOUD_Recv() return an error code on failure.
When the user uses the send operation API or receive operation API of the cloud communication layer API, the system applies a task scheduling policy optimization. Specifically, in this embodiment a global compute task distribution queue is set up: before compute tasks are distributed, the tasks carrying send/receive operations are identified, and the global distribution queue is ordered so that every task carrying a send operation is placed before all tasks carrying a receive operation. When compute tasks are distributed, distribution follows this global task queue, which guarantees the correct ordering of the send/receive operations in the tasks of the user's distributed program and thus resolves the deadlock that mismatched send and receive operations would otherwise cause due to the exclusiveness of GPU tasks.
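The ordering rule can be sketched as a stable partition of the task list. The `Task` record is illustrative; where compute-only tasks sit relative to the two groups is an assumption of this sketch, since the text only fixes the send-before-receive order.

```python
from collections import namedtuple

# Illustrative task record; only the two flags matter for the ordering rule.
Task = namedtuple("Task", "name has_send has_recv")

def build_global_task_queue(tasks):
    """Order the global distribution queue so that every task carrying a send
    operation is dispatched before any task carrying a receive operation, so a
    receiver can never occupy the GPUs while its matching sender is still
    waiting to be launched."""
    sends = [t for t in tasks if t.has_send]
    rest  = [t for t in tasks if not t.has_send and not t.has_recv]
    recvs = [t for t in tasks if t.has_recv and not t.has_send]
    return sends + rest + recvs
```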
The synchronization operation API is: CLOUD_Sync()

At runtime, this method returns an error when the number of tasks performing the synchronization operation is excessive.
When the user uses the synchronization operation API of the cloud communication layer API, the system applies a task scheduling policy optimization. Specifically, before compute tasks are distributed, the compute tasks that need to synchronize are identified and distributed to different nodes in the system (i.e., no node may hold more than one of these compute tasks), and a global flag bit is set. When the compute tasks that need to synchronize are ready to run on all nodes, these compute tasks are scheduled to run together, which guarantees the correct scope of the synchronization operation in the tasks of the user's distributed program. The exclusiveness of GPU tasks dictates that the number of synchronizing tasks cannot exceed the number of tasks the system allows to run concurrently; when scheduling, the system must have the synchronizing tasks in the running state at the same time, otherwise system performance suffers.
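A toy dispatcher showing the three outcomes described above. The `ready` dictionary stands in for the global flag bit, and all names are illustrative assumptions of this sketch.

```python
def schedule_sync_tasks(sync_tasks, nodes, max_concurrent, ready):
    """Place each synchronizing task on its own node and launch them together.

    Returns a task->node placement once every participant is ready, an empty
    dict while still waiting, and None on oversubscription (the case in which
    CLOUD_Sync reports an error at runtime)."""
    if len(sync_tasks) > max_concurrent:
        return None                 # more sync tasks than can run concurrently
    if len(sync_tasks) > len(nodes):
        return None                 # cannot give each task its own node
    if not all(ready.get(task, False) for task in sync_tasks):
        return {}                   # wait until all participants are ready
    # At most one synchronizing task per node, all dispatched together.
    return dict(zip(sync_tasks, nodes))
```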
In this embodiment, a preset GPU communication API is inserted at the places in the kernel program of the GPU of the first node device where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, i.e., the first communication data, is acquired; the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) promptly while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data, which shortens the running time of the process on the second node device and improves the computational efficiency of the system.

In addition, this embodiment realizes bidirectional communication between the GPU and the CPU on a single node device while the kernel program of the GPU is running; and on that basis, combined with the existing communication mechanism between CPUs on different node devices within a cluster, bidirectional runtime communication between GPUs on different node devices within the cluster is realized.

Embodiment 4
Referring to FIG. 6, FIG. 6 is a first schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention. The GPU includes:

a running module 501, configured to run the kernel program of the graphics processing unit GPU of the local node device when the central processing unit CPU of the first node device starts the kernel program, where the kernel program includes at least one preset GPU communication application programming interface API;

an acquiring module 502, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API; and

a determination processing module 503, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.
The kernel program includes at least two sub-kernel programs, and one preset GPU communication API exists between every two sub-kernel programs.

The acquiring module 502 includes an acquiring unit 5021, as shown in FIG. 7; FIG. 7 is a second schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention.

The acquiring unit 5021 is configured to acquire the communication data of a sub-kernel program.
In another implementation of this embodiment, the preset buffer includes flag signal bits and communication data buffers. The flag signal bits include a first flag signal bit and a second flag signal bit, and the communication data buffers include a first communication data buffer and a second communication data buffer, where the first flag signal bit and the first communication data buffer are the flag signal bit and communication data buffer used by the CPU to receive from the GPU, and the second flag signal bit and the second communication data buffer are the flag signal bit and communication data buffer used by the GPU to receive from the CPU.

The determination processing module 503 includes the storage setting unit 5031, as shown in FIG. 8; FIG. 8 is a third schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention.

The storage setting unit 5031 is configured to store the first communication data into the first communication data buffer in the video memory of the local node device and set the state of the first flag signal bit to the set state, so that the CPU, after querying that the state of the first flag signal bit is the set state, copies the first communication data in the first communication data buffer into the memory of the local node device.

Alternatively, the determination processing module 503 includes:

a query acquiring unit 5032, configured to acquire second communication data from the second communication data buffer when the GPU queries that the state of the second flag signal bit is the set state, where the second communication data is copied by the CPU into the second communication data buffer, and the second flag signal bit is set to the set state by the CPU.
Further, the GPU also includes a verification setting module 504, as shown in FIG. 9; FIG. 9 is a fourth schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention.

The verification setting module 504 is configured to, after the second communication data is acquired from the second communication data buffer, verify whether the first communication data is valid; if so, set the state of the second flag signal bit to the reset state; if not, set the state of the second flag signal bit to the receive-error state.
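The two flag signal bits and two data areas can be pictured with a small simulation. The state names, polling helpers, and `valid` predicate below are illustrative assumptions, not the patented implementation; only the store-then-set, poll-then-copy, and reset/receive-error transitions come from the text.

```python
EMPTY, SET, RESET, RECV_ERROR = "empty", "set", "reset", "recv_error"

class PresetBuffer:
    """Toy model of the preset buffer in video memory: one flag bit plus one
    data area per direction."""
    def __init__(self):
        self.flag1, self.data1 = EMPTY, None   # CPU receives from GPU
        self.flag2, self.data2 = EMPTY, None   # GPU receives from CPU

def gpu_store(buf, first_data):
    buf.data1, buf.flag1 = first_data, SET     # store, then mark the set state

def cpu_poll(buf, host_memory):
    if buf.flag1 != SET:                       # CPU keeps querying the flag bit
        return False
    host_memory.append(buf.data1)              # copy into the node's memory
    buf.flag1 = RESET
    return True

def cpu_store(buf, second_data):
    buf.data2, buf.flag2 = second_data, SET

def gpu_poll(buf, valid=lambda d: d is not None):
    if buf.flag2 != SET:
        return None
    data = buf.data2
    # Validity check after the read: reset on success, receive-error otherwise.
    buf.flag2 = RESET if valid(data) else RECV_ERROR
    return data
```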
In this embodiment, a preset GPU communication API is inserted at the places in the kernel program of the GPU of the first node device where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, i.e., the first communication data, is acquired; the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) promptly while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data, which shortens the running time of the process on the second node device and improves the computational efficiency of the system.

Embodiment 5
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of an embodiment of a first node device according to Embodiment 5 of the present invention. In this embodiment, both the first node device and the second node device may be commercial servers, but are not limited thereto.

The first node device includes: a CPU 40 and a GPU 50, where the functions of the GPU 50 are similar to those of the GPU in Embodiment 4; for details, refer to the relevant description of Embodiment 4, which is not repeated here.

The CPU 40 is configured to start the kernel program of the graphics processing unit GPU of the local node device, copy the first communication data from the preset buffer into the memory of the local node device, and copy the second communication data into the preset buffer.

The CPU 40 is further configured to transmit the first communication data to the GPU of the second node device via the CPU of the second node device, so that the GPU of the second node device shares the first communication data.
The CPU 40 is further configured to verify whether the first communication data is valid; if so, set the state of the first flag signal bit to the reset state; if not, set the state of the flag signal bit to the receive-error state.

In this embodiment, a preset GPU communication API is inserted at the places in the kernel program of the GPU of the first node device where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, i.e., the first communication data, is acquired; the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) promptly while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data, which shortens the running time of the process on the second node device and improves the computational efficiency of the system.
In addition, this embodiment realizes bidirectional communication between the GPU and the CPU on a single node device while the kernel program of the GPU is running; and on that basis, combined with the existing communication mechanism between CPUs on different node devices within a cluster, bidirectional runtime communication between GPUs on different node devices within the cluster is realized.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts may refer to the description of the method embodiments.
It should also be noted that in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

A person of ordinary skill in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims
1. A data processing method, characterized in that the method comprises:

when a central processing unit CPU of a first node device starts a kernel program of a graphics processing unit GPU of the local node device, running, by the GPU, the kernel program, where the kernel program comprises at least one preset GPU communication application programming interface API; acquiring, by the GPU, first communication data when the kernel program of the GPU runs to the preset GPU communication API; and determining, by the GPU, whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, storing, by the GPU, the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; and if it is a communication operation for receiving, acquiring, by the GPU, second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.

2. The method according to claim 1, characterized in that the kernel program comprises at least two sub-kernel programs, and one preset GPU communication API exists between every two sub-kernel programs.

3. The method according to claim 2, characterized in that acquiring, by the GPU, the first communication data comprises: acquiring, by the GPU, communication data of a sub-kernel program.

4. The method according to claim 1, characterized in that the preset buffer comprises flag signal bits and communication data buffers; the flag signal bits comprise a first flag signal bit and a second flag signal bit, and the communication data buffers comprise a first communication data buffer and a second communication data buffer, where the first flag signal bit and the first communication data buffer correspond to the communication operation for sending, and the second flag signal bit and the second communication data buffer correspond to the communication operation for receiving.

5. The method according to claim 4, characterized in that storing, by the GPU, the first communication data into the preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device, comprises: storing, by the GPU, the first communication data into the first communication data buffer in the video memory of the local node device, and setting the state of the first flag signal bit to the set state, so that the CPU, after querying that the state of the first flag signal bit is the set state, copies the first communication data in the first communication data buffer into the memory of the local node device.

6. The method according to claim 4, characterized in that acquiring, by the GPU, the second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer, comprises: when the GPU queries that the state of the second flag signal bit is the set state, acquiring, by the GPU, the second communication data from the second communication data buffer, where the second communication data is copied by the CPU into the second communication data buffer, and the state of the second flag signal bit is set to the set state by the CPU.

7. The method according to claim 6, characterized in that after the GPU acquires the second communication data from the second communication data buffer, the method further comprises: verifying, by the GPU, whether the first communication data is valid; if so, setting the state of the second flag signal bit to the reset state; if not, setting the state of the second flag signal bit to the receive-error state.

8. The method according to any one of claims 1-7, characterized by further comprising: transmitting, by the CPU of the first node device, the first communication data to the GPU of a second node device via the CPU of the second node device, so that the GPU of the second node device shares the first communication data.
9. A graphics processing unit GPU, characterized by comprising:

a running module, configured to run the kernel program of the graphics processing unit GPU of the local node device when a central processing unit CPU of a first node device starts the kernel program, where the kernel program comprises at least one preset GPU communication application programming interface API;

an acquiring module, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API; and

a determination processing module, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.

10. The GPU according to claim 9, characterized in that the kernel program comprises at least two sub-kernel programs, and one preset GPU communication API exists between every two sub-kernel programs.

11. The GPU according to claim 10, characterized in that the acquiring module comprises:

an acquiring unit, configured to acquire communication data of a sub-kernel program.

12. The GPU according to claim 9, characterized in that the preset buffer comprises flag signal bits and communication data buffers; the flag signal bits comprise a first flag signal bit and a second flag signal bit, and the communication data buffers comprise a first communication data buffer and a second communication data buffer, where the first flag signal bit and the first communication data buffer correspond to the communication operation for sending, and the second flag signal bit and the second communication data buffer correspond to the communication operation for receiving.

13. The GPU according to claim 12, characterized in that the determination processing module comprises:

a storage setting unit, configured to store the first communication data into the first communication data buffer in the video memory of the local node device and set the state of the first flag signal bit to the set state, so that the CPU, after querying that the state of the first flag signal bit is the set state, copies the first communication data in the first communication data buffer into the memory of the local node device.

14. The GPU according to claim 12, characterized in that the determination processing module comprises: a query acquiring unit, configured to acquire second communication data from the second communication data buffer when the GPU queries that the state of the second flag signal bit is the set state, where the second communication data is copied by the CPU into the second communication data buffer, and the second flag signal bit is set to the set state by the CPU.

15. The GPU according to claim 14, characterized by further comprising:

a verification setting module, configured to, after the second communication data is acquired from the second communication data buffer, verify whether the first communication data is valid; if so, set the state of the second flag signal bit to the reset state; if not, set the state of the second flag signal bit to the receive-error state.
16. A first node device, characterized by comprising a central processing unit CPU and the graphics processing unit GPU according to any one of claims 9-15;

the CPU is configured to start the kernel program of the graphics processing unit GPU of the local node device, copy first communication data from the preset buffer into the memory of the local node device, and copy second communication data into the preset buffer.

17. The first node device according to claim 16, characterized in that the CPU is further configured to transmit the first communication data to the GPU of a second node device via the CPU of the second node device, so that the GPU of the second node device shares the first communication data.

18. The first node device according to claim 16, characterized in that the CPU is further configured to verify whether the first communication data is valid; if so, set the state of the first flag signal bit to the reset state; if not, set the state of the flag signal bit to the receive-error state.
PCT/CN2011/084764 2011-12-27 2011-12-27 Data processing method, graphics processing unit (gpu) and first node device WO2013097098A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2011/084764 WO2013097098A1 (en) 2011-12-27 2011-12-27 Data processing method, graphics processing unit (gpu) and first node device
CN201180003244.XA CN103282888B (en) 2011-12-27 2011-12-27 Data processing method, image processor GPU and primary nodal point equipment


Publications (1)

Publication Number Publication Date
WO2013097098A1

Family

ID=48696189


Country Status (2)

Country Link
CN (1) CN103282888B (en)
WO (1) WO2013097098A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716635A (en) * 2013-12-12 2014-04-09 Zhejiang Uniview Technologies Co., Ltd. Method and device for improving intelligent analysis performance
CN107333136A (en) * 2017-06-26 2017-11-07 Xi'an Wanxiang Electronics Technology Co., Ltd. Image encoding method and device
CN111506420A (en) * 2020-03-27 2020-08-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory synchronization method and device, electronic equipment and storage medium
TWI715613B (en) * 2015-09-25 2021-01-11 Intel Corporation Apparatus, system and method for performing gpu-cpu two-path memory copy

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969565B (en) * 2018-09-28 2023-05-16 Hangzhou Hikvision Digital Technology Co., Ltd. Image processing method and device
CN113986771B (en) * 2021-12-29 2022-04-08 Beijing Biren Technology Development Co., Ltd. Method and device for debugging target program code and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1250567A (en) * 1997-03-13 2000-04-12 International Business Machines Corporation Kiosk and server connected to computer network
CN101599009A (en) * 2009-04-30 2009-12-09 Inspur Electronic Information Industry Co., Ltd. Method for executing tasks in parallel on a heterogeneous multiprocessor
CN101802789A (en) * 2007-04-11 2010-08-11 Apple Inc. Parallel runtime execution on multiple processors
CN102099788A (en) * 2008-06-06 2011-06-15 Apple Inc. Application programming interfaces for data parallel computing on multiple processors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5572572A (en) * 1988-05-05 1996-11-05 Transaction Technology, Inc. Computer and telephone apparatus with user-friendly interface and enhanced integrity features

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716635A (en) * 2013-12-12 2014-04-09 Zhejiang Uniview Technologies Co., Ltd. Method and device for improving intelligent analysis performance
CN103716635B (en) * 2013-12-12 2017-04-19 Zhejiang Uniview Technologies Co., Ltd. Method and device for improving intelligent analysis performance
TWI715613B (en) * 2015-09-25 2021-01-11 Intel Corporation Apparatus, system and method for performing gpu-cpu two-path memory copy
CN107333136A (en) * 2017-06-26 2017-11-07 Xi'an Wanxiang Electronics Technology Co., Ltd. Image encoding method and device
CN111506420A (en) * 2020-03-27 2020-08-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory synchronization method and device, electronic equipment and storage medium
CN111506420B (en) * 2020-03-27 2023-09-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory synchronization method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103282888B (en) 2017-03-08
CN103282888A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
TWI543073B (en) Method and system for work scheduling in a multi-chip system
US7490089B1 (en) Methods and apparatus facilitating access to shared storage among multiple computers
JP6475625B2 (en) Inter-core communication apparatus and method
US7668923B2 (en) Master-slave adapter
US8032892B2 (en) Message passing with a limited number of DMA byte counters
KR102011949B1 (en) System and method for providing and managing message queues for multinode applications in a middleware machine environment
JP6353086B2 (en) Multi-database log with multi-item transaction support
US7797588B2 (en) Mechanism to provide software guaranteed reliability for GSM operations
JP2018163671A (en) Scalable log-based transaction management
US20050081080A1 (en) Error recovery for data processing systems transferring message packets through communications adapters
TWI547870B (en) Method and system for ordering i/o access in a multi-node environment
WO2013097098A1 (en) Data processing method, graphics processing unit (gpu) and first node device
US20050078559A1 (en) Global recovery for time of day synchronization
TW201543218A (en) Chip device and method for multi-core network processor interconnect with multi-node connection
TWI541649B (en) System and method of inter-chip interconnect protocol for a multi-chip system
US8086766B2 (en) Support for non-locking parallel reception of packets belonging to a single memory reception FIFO
US10185681B2 (en) Hybrid message-based scheduling technique
US20050080869A1 (en) Transferring message packets from a first node to a plurality of nodes in broadcast fashion via direct memory to memory transfer
KR20110047753A (en) Method and system of data processing for deadlock free
US20050080920A1 (en) Interpartition control facility for processing commands that effectuate direct memory to memory information transfer
US20090199191A1 (en) Notification to Task of Completion of GSM Operations by Initiator Node
US9830263B1 (en) Cache consistency
EP2676203B1 (en) Broadcast protocol for a network of caches
CN105373563B (en) Database switching method and device
US20170039143A1 (en) System and method of a shared memory hash table with notifications and reduced memory utilization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11879132

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11879132

Country of ref document: EP

Kind code of ref document: A1
