WO2013097098A1 - Data processing method, graphics processing unit (gpu) and first node device - Google Patents


Info

Publication number
WO2013097098A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
communication data
communication
node device
cpu
Prior art date
Application number
PCT/CN2011/084764
Other languages
French (fr)
Chinese (zh)
Inventor
蒋吴军
卢彦超
郑龙
过敏意
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2011/084764 priority Critical patent/WO2013097098A1/en
Priority to CN201180003244.XA priority patent/CN103282888B/en
Publication of WO2013097098A1 publication Critical patent/WO2013097098A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a data processing method, a graphics processor (GPU), and a first node device.
  • the data communication mechanism between node devices is the basis of distributed parallel computing.
  • in a typical distributed parallel computing system, there is a certain amount of shared data or data flow between processes belonging to the same task, and these processes need to synchronize at specific points.
  • when a GPU (Graphics Processing Unit) is added to a node device, a distributed GPU system is formed.
  • in a distributed GPU system, each process belonging to the same task is run by the GPU of a different node device, where a node device may be a commercial server; since the processes share a certain amount of data, an inter-node communication mechanism is needed to implement the flow of the shared data.
  • owing to the slave-processor nature of the GPU, the CPU (Central Processing Unit) of the second node device, after GPU2 finishes running the second process, copies the communication data into its own memory and transmits it via CPU1 of the first node device to GPU1, so that GPU1 can execute the first process.
  • the inventors have found that the prior art has at least the following problems:
  • when the first process of GPU1 needs to share the intermediate running data of the second process of GPU2 while running, the first process must wait until GPU2 has run the entire second process before it can obtain that intermediate data, which extends the running time of the first process and thereby reduces the computing efficiency of the system.
  • in order to improve the computing efficiency of the system, an embodiment of the present invention provides a data processing method, a graphics processor GPU, and a first node device.
  • the technical solution is as follows:
  • a data processing method, comprising: when the central processing unit (CPU) of a first node device starts a kernel program of the graphics processor (GPU) of the node device, the GPU runs the kernel program, the kernel program including at least one preset GPU communication application programming interface (API);
  • when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data;
  • the GPU determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
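Read together, the bullets above describe a flag-and-buffer handshake between the GPU and the host CPU. The following Python sketch simulates only the send path with two threads standing in for the GPU kernel and the CPU; the buffer layout, names, and polling interval are illustrative assumptions, not the patent's implementation.

```python
import threading
import time

# Shared "preset buffer" (in real hardware: video memory) and host memory.
# All names here are assumptions for illustration only.
preset_buffer = {"flag": 0, "data": None}
host_memory = []
lock = threading.Lock()

def gpu_send(first_communication_data):
    """Simulated send-type communication operation on the GPU side."""
    with lock:
        preset_buffer["data"] = first_communication_data
        preset_buffer["flag"] = 1          # set state: data ready for the CPU
    while True:                            # wait until the CPU consumes it
        with lock:
            if preset_buffer["flag"] == 0:
                return
        time.sleep(0.001)

def cpu_receive():
    """Simulated CPU side: poll the flag, copy out, reset the flag."""
    while True:
        with lock:
            if preset_buffer["flag"] == 1:
                host_memory.append(preset_buffer["data"])  # copy to memory
                preset_buffer["flag"] = 0                  # reset state
                return
        time.sleep(0.001)

gpu = threading.Thread(target=gpu_send, args=(["intermediate", "data"],))
cpu = threading.Thread(target=cpu_receive)
gpu.start(); cpu.start(); gpu.join(); cpu.join()
print(host_memory)  # → [['intermediate', 'data']]
```

The key design point the claims rely on is that the GPU never initiates a host-side copy itself; it only writes into the preset buffer and raises a flag that the CPU polls.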
  • a graphics processor GPU comprising:
  • a running module, configured to run the kernel program when the central processing unit (CPU) of the first node device starts the kernel program of the graphics processor (GPU) of the node device, the kernel program including at least one preset GPU communication application programming interface (API);
  • an obtaining module, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API;
  • a determining processing module, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
  • a first node device comprising a central processing unit CPU and the above graphics processor GPU;
  • the technical solution provided by the embodiments of the present invention has the following beneficial effects: a preset GPU communication API is inserted in the kernel program of the GPU of the first node device wherever intermediate running data needs to be shared; when the kernel program runs to the preset GPU communication API, the GPU acquires the intermediate running data of the part of the kernel program that has already run, that is, the first communication data; the GPU determines whether the communication operation corresponding to the GPU communication API is for sending or for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data.
  • FIG. 5 is a schematic diagram of communication interaction between GPUs of different nodes according to Embodiment 3 of the present invention.
  • embodiments of the present invention provide a data processing method, a graphics processor GPU, and a first node device.
  • the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving; if it is for sending, the GPU stores the first communication data in a preset buffer of the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
  • a preset GPU communication API is inserted in the kernel program of the GPU of the first node device wherever intermediate running data needs to be shared; when the kernel program runs to the preset GPU communication API, the GPU acquires the intermediate running data of the part of the kernel program that has already run, that is, the first communication data; the GPU determines whether the corresponding communication operation is for sending or for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. In this way, the intermediate running data (the first and second communication data) is acquired in time while the GPU's kernel program is still running, so the second node device does not need to wait for the entire kernel program of the first node device to finish before acquiring the intermediate running data; this shortens the running time of the process on the second node device and improves the computing efficiency of the system.
  • FIG. 2 is a flowchart of an embodiment of a data processing method according to Embodiment 2 of the present invention.
  • the kernel program of GPU1 includes at least one preset GPU communication API.
  • the preset GPU communication APIs divide the kernel program of GPU1 into multiple sub-kernel programs, so the kernel program includes at least two sub-kernel programs, none of which contains a communication operation; a preset GPU communication API is a communication API supported by the GPU, and different APIs correspond to different communication operations, which include communication operations for sending and communication operations for receiving.
  • GPU1 determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is for sending, S204 is executed; if it is for receiving, S205 is executed.
  • GPU1 stores the first communication data in a preset buffer of the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device.
  • a communication operation for sending indicates that GPU1 wants to send the first communication data to CPU1 of the local node device; however, owing to the slave-processor nature of the GPU, the first communication data can only be fetched from the preset buffer by CPU1 of the local node.
  • after GPU1 stores the first communication data in the preset buffer of the video memory of the local node device, control switches from the kernel program to CPU code, and CPU1 runs its own program; when CPU1 runs to the CPU communication API corresponding to the communication operation for receiving, CPU1 copies the first communication data into the memory of the local node device.
  • the preset buffer is specified by the user.
  • the GPU 1 acquires second communication data from the preset buffer, where the second communication data is copied by the CPU1 into the preset buffer.
  • if the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, it indicates that CPU1 wants to transmit the second communication data to GPU1.
  • control switches from the kernel program to CPU code, and CPU1 runs its own program; when CPU1 runs to the CPU communication API corresponding to the communication operation for sending, CPU1 copies the second communication data from the memory of the node device into the preset buffer of the video memory of the node device.
  • the second communication data may be communication data of a program run by CPU1 itself, or second communication data generated by the kernel program of GPU2 on the second node device; in the latter case, CPU2 of the second node device copies the second communication data from the preset buffer on the second node device into the memory of the second node device, and then transmits it to CPU1.
  • the subsequent part of the GPU's kernel program then continues to execute, that is, the remaining sub-kernel programs of the kernel program are executed in sequence.
  • when there are multiple GPU communication APIs in the kernel program of the GPU, the GPU cyclically executes the above S202-S205 until the entire kernel program ends.
  • the method further includes: CPU1 of the first node device transmits the first communication data, via CPU2 of the second node device, to GPU2 of the second node device, so that GPU2 shares the first communication data; similarly, GPU2 on the second node device can transmit its second communication data to GPU1 through CPU2 and CPU1 in sequence, thereby realizing two-way communication between GPUs running on different node devices in the cluster; the communication mechanism between the CPUs on the different node devices may be implemented using existing techniques such as sockets or the Message Passing Interface (MPI), and is not described here.
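The inter-node CPU hop is left to existing mechanisms such as sockets or MPI. The snippet below sketches that hop with a socket pair standing in for a real network connection between CPU1 and CPU2; the variable names and the use of pickle for serialization are illustrative assumptions only.

```python
import pickle
import socket

# Rough sketch of the CPU1 -> CPU2 hop; a real cluster would use TCP
# sockets between hosts or MPI point-to-point calls instead.
cpu1_end, cpu2_end = socket.socketpair()

first_communication_data = {"step": 42, "partial_sum": 7.5}
cpu1_end.sendall(pickle.dumps(first_communication_data))  # CPU1 forwards
cpu1_end.close()

chunks = []
while True:                       # CPU2 reads until CPU1 closes its end
    part = cpu2_end.recv(4096)
    if not part:
        break
    chunks.append(part)
received = pickle.loads(b"".join(chunks))
cpu2_end.close()
print(received)  # → {'step': 42, 'partial_sum': 7.5}
```

After this hop, CPU2 would copy `received` into the second node's preset buffer so that GPU2 can pick it up as described above.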
  • the kernel program of the GPU includes a preset GPU communication API, so that the GPU has the function of active communication.
  • when the kernel program of the GPU executes to the preset GPU communication API, this indicates that the GPU wants to send or receive communication data; correspondingly, the CPU on the node device fetches communication data from the preset buffer or copies communication data into it, thereby indirectly implementing the GPU's communication operation and achieving two-way communication between the CPU and the GPU on the same node device while the GPU kernel program is running.
  • two-way communication between the GPU and the CPU on a single node device is implemented while the GPU's kernel program is running, and on that basis two-way communication between GPUs running on different node devices in the cluster is realized.
  • GPU1 stores the first communication data in the first communication data buffer of the video memory of the local node device and sets the state of the first indication signal bit to the set state.
  • GPU1 then continuously queries (i.e., polls) the state of the first indication signal bit; as long as the state remains set, GPU1 keeps querying it, waiting for the CPU to consume the data and reset the bit.
  • the system adopts task-scheduling policy optimization: computing tasks that need to perform a synchronization operation are identified before distribution, a global identification bit is set, and the computing tasks are then distributed to the system; only when the computing tasks on all nodes that need to synchronize are ready to run are they scheduled to run together. Because a user's GPU tasks run exclusively, the number of tasks to be synchronized cannot exceed the number of concurrent tasks the system allows, and the tasks to be synchronized need to be in the running state at the same time; otherwise, system performance will suffer.
  • the obtaining module 502 is configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API.
  • the determining processing module 503 is configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is for sending, the GPU stores the first communication data in a preset buffer of the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is for receiving, the GPU acquires second communication data from the preset buffer, the second communication data having been copied by the CPU into the preset buffer.
  • the obtaining module 502 includes an obtaining unit 5021, as shown in FIG. 7, which is a second structural diagram of the graphics processor GPU embodiment according to Embodiment 4 of the present invention; the obtaining unit 5021 is configured to acquire the communication data of a sub-kernel program.
  • the preset buffer includes indication signal bits and communication data buffers; the indication signal bits include a first indication signal bit and a second indication signal bit, and the communication data buffers include a first communication data buffer and a second communication data buffer, where the first indication signal bit and the first communication data buffer serve as the indication signal bit and data buffer for the CPU receiving from the GPU, and the second indication signal bit and the second communication data buffer serve as the indication signal bit and data buffer for the GPU receiving from the CPU.
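The two-direction buffer just described can be modeled as a small data structure: one indication signal bit plus one data buffer per direction. The constant and field names below are illustrative assumptions, not taken from the patent; the reception-error state corresponds to the validity check mentioned later.

```python
from dataclasses import dataclass, field

RESET, SET, RECV_ERROR = 0, 1, 2   # hypothetical indication-signal-bit states


@dataclass
class PresetBuffer:
    """In-memory model of the preset buffer: one flag + buffer per direction."""
    first_flag: int = RESET                        # CPU receives from GPU
    first_data: list = field(default_factory=list)
    second_flag: int = RESET                       # GPU receives from CPU
    second_data: list = field(default_factory=list)


buf = PresetBuffer()

# GPU "send": write into the first pair, then raise the flag.
buf.first_data = [0.5, 1.5]
buf.first_flag = SET

# CPU side: after seeing the set state, copy out, validate, then reset
# the flag (or mark a reception error if the data is invalid).
host_copy = list(buf.first_data)
buf.first_flag = RESET if host_copy == buf.first_data else RECV_ERROR
print(buf.first_flag)  # → 0
```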
  • the determining processing module 503 includes a storage setting unit 5031, as shown in FIG. 8, which is a third structural diagram of the graphics processor GPU embodiment according to Embodiment 4 of the present invention; the storage setting unit 5031 is configured to store the first communication data in the first communication data buffer of the video memory of the local node device and set the state of the first indication signal bit to the set state, so that the CPU, after querying that the first indication signal bit is in the set state, copies the first communication data from the first communication data buffer into the memory of the node device.
  • the determining processing module 503 includes:
  • FIG. 9 is a fourth structural diagram of a GPU embodiment of a graphics processor according to Embodiment 4 of the present invention.
  • the CPU 40 is configured to start the kernel program of the graphics processor GPU of the node device, copy the first communication data from the preset buffer into the memory of the node device, and copy the second communication data into the preset buffer.
  • the CPU 40 is further configured to transmit the first communication data to a GPU of the second node device by using a CPU of the second node device, so that the GPU of the second node device shares the first communication data.
  • the CPU 40 is further configured to check whether the first communication data is valid; if it is, the state of the first indication signal bit is set to the reset state; if not, the state of the indication signal bit is set to the reception-error state.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Computer And Data Communications (AREA)

Abstract

Provided are a data processing method, a graphics processing unit (GPU) and a first node device, which relate to the technical field of communications. The data processing method comprises: when a CPU starts up a kernel program of a GPU of a node device, the GPU runs the kernel program, the kernel program comprising at least one preset GPU communication API; when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data; and the GPU judges whether a communication operation corresponding to the preset GPU communication API is a communication operation for transmitting or a communication operation for receiving, and if it is a communication operation for transmitting, then the GPU stores the first communication data in a preset buffer of a video memory, and allows the CPU to copy the first communication data from the preset buffer to a memory of the node device; and if it is a communication operation for receiving, then the GPU acquires second communication data from the preset buffer. The computational efficiency of the system is improved by the present invention.

Description

Data processing method, graphics processor GPU and first node device

Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data processing method, a graphics processor GPU, and a first node device.
Background Technique
In a distributed environment, the data communication mechanism between node devices is the basis of distributed parallel computing. In a typical distributed parallel computing system, there is a certain amount of shared data or data flow between processes belonging to the same task, and these processes need to synchronize at specific points. When a GPU (Graphics Processing Unit) is added to a node device, a distributed GPU system is formed.
In a distributed GPU system, each process belonging to the same task is run by the GPU of a different node device, where a node device may be a commercial server. Since the processes share a certain amount of data, an inter-node communication mechanism is needed to implement the flow of the shared data. For example, when the first process of GPU1 of the first node device needs to share the communication data of the second process of GPU2 of the second node device, then, owing to the slave-processor nature of the GPU, CPU2 (Central Processing Unit) of the second node device, after GPU2 has finished running the second process, copies the communication data into its own memory and transmits it via CPU1 of the first node device to GPU1, so that GPU1 executes the first process.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem: when the first process of GPU1 needs to share the intermediate running data of the second process of GPU2 while running, the first process must wait until GPU2 has run the entire second process before it can obtain the intermediate running data, which extends the running time of the first process and thus reduces the computing efficiency of the system.

Summary of the Invention
To improve the computing efficiency of the system, embodiments of the present invention provide a data processing method, a graphics processor GPU, and a first node device. The technical solution is as follows:
A data processing method, the method comprising: when the central processing unit CPU of a first node device starts the kernel program of the graphics processor GPU of the node device, the GPU runs the kernel program, the kernel program including at least one preset GPU communication application programming interface API;
when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data;
the GPU determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.
A graphics processor GPU, comprising:
a running module, configured to run the kernel program when the central processing unit CPU of the first node device starts the kernel program of the graphics processor GPU of the node device, the kernel program including at least one preset GPU communication application programming interface API;
an obtaining module, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API;
a determining processing module, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data in a preset buffer of the video memory of the node device, so that the CPU copies the first communication data from the preset buffer into the memory of the node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.
A first node device, comprising a central processing unit CPU and the above graphics processor GPU;
the CPU is configured to start the kernel program of the graphics processor GPU of the node device, copy the first communication data from the preset buffer into the memory of the node device, and copy the second communication data into the preset buffer.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows: a preset GPU communication API is inserted in the kernel program of the GPU of the first node device wherever intermediate running data needs to be shared; when the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires the intermediate running data of the part of the kernel program that has already run, that is, the first communication data; the GPU determines whether the communication operation corresponding to the GPU communication API is for sending or for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first and second communication data) in time while the GPU's kernel program is running, so the second node device does not need to wait for the entire kernel program of the first node device to finish before acquiring the intermediate running data; this shortens the running time of the process on the second node device and improves the computing efficiency of the system.

Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention;
FIG. 3 is a flowchart of a data processing method according to Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a preset buffer according to Embodiment 3 of the present invention;
FIG. 5 is a schematic diagram of the communication interaction between GPUs on different nodes according to Embodiment 3 of the present invention;
FIG. 6 is a first schematic structural diagram of a graphics processing unit (GPU) according to Embodiment 4 of the present invention; FIG. 7 is a second schematic structural diagram of a GPU according to Embodiment 4 of the present invention; FIG. 8 is a third schematic structural diagram of a GPU according to Embodiment 4 of the present invention; FIG. 9 is a fourth schematic structural diagram of a GPU according to Embodiment 4 of the present invention; and FIG. 10 is a schematic structural diagram of a first node device according to Embodiment 5 of the present invention.

DETAILED DESCRIPTION
The embodiments of the present invention provide a data processing method, a graphics processing unit (GPU), and a first node device.
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
Referring to FIG. 1, FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention. The data processing method includes:
S101: When the central processing unit (CPU) of a first node device starts the kernel program of the graphics processing unit (GPU) of the local node device, the GPU runs the kernel program, where the kernel program includes at least one preset GPU communication application programming interface (API).
S102: When the kernel program of the GPU runs to the preset GPU communication API, the GPU acquires first communication data.
S103: The GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving. If it is a communication operation for sending, the GPU stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied into the preset buffer by the CPU.
In this embodiment, the kernel program of the GPU includes a preset GPU communication API, which gives the GPU the ability to initiate communication. When the kernel program of the GPU executes the preset GPU communication API, this indicates that the GPU wants to send or receive communication data; correspondingly, the CPU of the local node device fetches the communication data from the preset buffer or copies communication data into the preset buffer, thereby indirectly implementing the communication operation of the GPU and, in turn, two-way communication between the CPU and the GPU on the same node device while the GPU kernel program is running.
In this embodiment, a preset GPU communication API is inserted into the kernel program of the GPU of the first node device at each point where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, that is, the first communication data, is acquired. The GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and, according to the result, the GPU and the CPU of the local node device perform the corresponding processing to complete the communication operation of the GPU, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) in time while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data. This shortens the running time of the process on the second node device and improves the computing efficiency of the system.

Embodiment 2
Referring to FIG. 2, FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention. The data processing method includes:
S201: When CPU1 of a first node device starts the kernel program of GPU1 of the local node device, GPU1 runs the kernel program.
In this step, the kernel program of GPU1 includes at least one preset GPU communication API (Application Programming Interface). The preset GPU communication APIs divide the kernel program of GPU1 into multiple sub-kernel programs, so the kernel program includes at least two sub-kernel programs, and no communication operation exists within any single sub-kernel program. A preset GPU communication API is a communication API supported by the GPU, and the APIs correspond to different communication operations, where the communication operations include communication operations for sending and communication operations for receiving.
S202: When the kernel program of GPU1 runs to the preset GPU communication API, GPU1 acquires first communication data.

In this step, when GPU1 runs to the preset GPU communication API, GPU1 ends the running of the current sub-kernel program and acquires the first communication data, where the first communication data is the communication data of the sub-kernel program that has just finished running.
S203: GPU1 determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving. If it is a communication operation for sending, S204 is performed; if it is a communication operation for receiving, S205 is performed.
S204: GPU1 stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device.
When the communication operation corresponding to the preset GPU communication API is a communication operation for sending, this indicates that GPU1 wants to send the first communication data to CPU1 of the local node device; however, because the GPU is a slave processor, the first communication data can only be fetched from the preset buffer by CPU1 of the local node.
Specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for sending, GPU1 stores the first communication data into the preset buffer in the video memory of the local node device, execution switches from the kernel program to CPU code, and CPU1 runs its own program. When CPU1 runs to the CPU communication API corresponding to the communication operation for receiving, CPU1 copies the first communication data into the memory of the local node device. The preset buffer is specified by the user.
S205: GPU1 acquires second communication data from the preset buffer, where the second communication data is copied into the preset buffer by CPU1.
When the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, this indicates that CPU1 wants to send second communication data to GPU1.
Specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, execution switches from the kernel program to CPU code, and CPU1 runs its own program. When CPU1 runs to the CPU communication API corresponding to the communication operation for sending, CPU1 copies the second communication data from the memory of the local node device into the preset buffer in the video memory of the local node device. The second communication data may be communication data of the program run by CPU1 itself, or may be second communication data generated by the kernel program of GPU2 on a second node device; specifically, CPU2 of the second node device copies the second communication data from the preset buffer on the second node device into the memory of the second node device, and CPU2 then transmits the second communication data to CPU1.
After the preset GPU communication API finishes executing, the subsequent part of the kernel program of the GPU continues to be executed, that is, the remaining sub-kernel programs of the kernel program of the GPU are executed in sequence.
When multiple GPU communication APIs exist in the kernel program of the GPU, the GPU executes the above flow of S202-S205 in a loop until the entire kernel program of the GPU ends.

In this embodiment, the method further includes: CPU1 of the first node device transmits the first communication data to GPU2 of the second node device via CPU2 of the second node device, so that GPU2 of the second node device shares the first communication data. Likewise, GPU2 on the second node device may transmit its second communication data to GPU1 via CPU2 and CPU1 in sequence, thereby implementing two-way communication between GPUs on different node devices within a cluster while their kernels are running. The communication mechanism between CPUs on different node devices may be implemented by using an existing technology such as socket or MPI (Message Passing Interface), and details are not described here again.
In this embodiment, the kernel program of the GPU includes a preset GPU communication API, which gives the GPU the ability to initiate communication. When the kernel program of the GPU executes the preset GPU communication API, this indicates that the GPU wants to send or receive communication data; correspondingly, the CPU of the local node device fetches the communication data from the preset buffer or copies communication data into the preset buffer, thereby indirectly implementing the communication operation of the GPU and, in turn, two-way communication between the CPU and the GPU on the same node device while the GPU kernel program is running.
In this embodiment, a preset GPU communication API is inserted into the kernel program of the GPU of the first node device at each point where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, that is, the first communication data, is acquired. The GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and, according to the result, the GPU and the CPU of the local node device perform the corresponding processing to complete the communication operation of the GPU, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) in time while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data. This shortens the running time of the process on the second node device and improves the computing efficiency of the system.
In addition, this embodiment implements two-way communication between the GPU and the CPU on a single node device while the kernel program of the GPU is running; and, on that basis, combined with the existing communication mechanism between CPUs on different node devices within a cluster, it implements two-way communication between GPUs on different node devices within the cluster while their kernels are running.

Embodiment 3
Referring to FIG. 3, FIG. 3 is a flowchart of a data processing method according to Embodiment 3 of the present invention. In this embodiment, the communication between the CPU and the GPU is encapsulated in a layer above the CPU device and the GPU device, and this layer provides basic communication operations for a distributed GPU system. The data processing method includes:

S301: When CPU1 of a first node device starts the kernel program of GPU1 of the local node device, GPU1 runs the kernel program.

In this step, the kernel program of GPU1 includes at least one preset GPU communication API (Application Programming Interface), and the preset GPU communication APIs correspond to different communication operations, where the communication operations include communication operations for sending and communication operations for receiving.
The kernel program of the GPU includes a preset GPU communication API, which gives the GPU the ability to initiate communication.
S302: When the kernel program of GPU1 runs to the preset GPU communication API, GPU1 acquires first communication data.
In this step, when GPU1 runs to the preset GPU communication API, GPU1 acquires the first communication data, where the first communication data is the communication data of the part of the kernel program that has just finished running.
S303: GPU1 determines whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving. If it is a communication operation for sending, S304 is performed; if it is a communication operation for receiving, S305 is performed.
S304: GPU1 stores the first communication data into a preset buffer in the video memory of the local node device, so that CPU1 copies the first communication data from the preset buffer into the memory of the local node device.
In this embodiment, because the CPU can directly access the video memory of the GPU of the local node device, a buffer is preset in the video memory of the local node device for each SM (streaming multiprocessor) of the GPU, where the preset buffer includes multiple fields that include at least flag signal bits and communication data buffers, as shown in FIG. 4, which is a schematic structural diagram of the preset buffer according to Embodiment 3 of the present invention. The communication data buffer may further include the length of the communication data, that is, the size of the data that the CPU or GPU program needs to communicate.
The flag signal bits may include a first flag signal bit and a second flag signal bit, and the communication data buffers may include a first communication data buffer and a second communication data buffer. The first flag signal bit and the first communication data buffer correspond to the communication operation for sending, that is, they are the flag signal bit and communication data buffer used when the CPU receives the communication data of the GPU; the second flag signal bit and the second communication data buffer correspond to the communication operation for receiving, that is, they are the flag signal bit and communication data buffer used when the GPU receives the communication data of the CPU.
The state of a flag signal bit includes a reset state, a set state, and a receive-error state, where the reset state may be 0, the set state may be 1, and the receive-error state may be any value other than 0 and 1.
In this step, specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for sending, GPU1 stores the first communication data into the first communication data buffer in the video memory of the local node device and sets the state of the first flag signal bit to the set state.

GPU1 continuously queries (that is, polls) the state of the first flag signal bit. When the state of the first flag signal bit is the set state, GPU1 continues querying it; when the state of the first flag signal bit is the receive-error state, GPU1 copies the first communication data into the first communication data buffer again and sets the state of the first flag signal bit to the set state; when the state of the first flag signal bit is the reset state, GPU1 queries whether the state of the second flag signal bit is the set state, and if so, performs a flow similar to the corresponding flow in S305; if not, it continues querying the state of the second flag signal bit until that state is the set state.
The thread responsible for communication on CPU1 also continuously queries the state of the first flag signal bit. When CPU1 finds that the state of the first flag signal bit is the set state, CPU1 clears a counter to zero and copies the first communication data in the first communication data buffer into the memory of the local node device.
After acquiring the first communication data, CPU1 verifies the first communication data; specifically, a check bit may be added to verify whether the first communication data is valid.
CPU1 checks whether the first communication data is valid. If it is, CPU1 sets the state of the first flag signal bit to the reset state; if it is not, CPU1 sets the state of the first flag signal bit to the receive-error state.
After CPU1 sets the state of the first flag signal bit to the receive-error state, CPU1 determines whether the counter has reached a preset threshold. If it has, CPU1 reports that errors have occurred too many times and the device may be faulty, and the program terminates; if it has not, the counter is incremented by 1. After the counter is incremented by 1, CPU1 loops back and performs the verification on the newly acquired first communication data.
After setting the state of the first flag signal bit to the reset state, CPU1 continues executing its own program to process data. When CPU1 does not need to continue communicating with GPU1, it simply ends its own program; when CPU1 needs to continue communicating with GPU1, CPU1 copies second communication data into the second communication data buffer and sets the state of the second flag signal bit to the set state.
CPU1 continuously queries the state of the second flag signal bit. When the state of the second flag signal bit is the set state, CPU1 continues querying it; when the state of the second flag signal bit is the receive-error state, CPU1 copies the second communication data into the second communication data buffer again and sets the state of the second flag signal bit to the set state; when the state of the second flag signal bit is the reset state, CPU1 determines whether it needs to receive the first communication data to be sent by the GPU, and if so, queries whether the state of the first flag signal bit is the set state; if not, it continues running its own program.
S305: GPU1 acquires the second communication data from the preset buffer, where the second communication data is copied into the preset buffer by CPU1.

In this step, specifically, when the communication operation corresponding to the preset GPU communication API is a communication operation for receiving, GPU1 continuously queries the state of the second flag signal bit. When the state of the second flag signal bit is the set state, this indicates that CPU1 has copied the second communication data into the second communication data buffer and has set the state of the second flag signal bit to the set state; GPU1 then clears the counter to zero and acquires the second communication data from the second communication data buffer.
S306: GPU1 checks whether the second communication data is valid. If it is, GPU1 sets the state of the second flag signal bit to the reset state; if it is not, GPU1 sets the state of the second flag signal bit to the receive-error state.
In this step, after the state of the second flag signal bit is set to the reset state, GPU1 continues executing the kernel program to process data, and when it encounters the preset GPU communication API again, it performs the corresponding processing.
S307: GPU1 determines whether the counter has reached a preset threshold. If it has, GPU1 reports that errors have occurred too many times and the device may be faulty, and the program terminates; if it has not, the counter is incremented by 1.
After the counter is incremented by 1, GPU1 returns to S306 and performs the verification flow on the newly acquired second communication data.
In this embodiment, the method further includes: CPU1 of the first node device transmits the first communication data to GPU2 of the second node device via CPU2 of the second node device, so that GPU2 of the second node device shares the first communication data. Likewise, GPU2 on the second node device may transmit its second communication data to GPU1 via CPU2 and CPU1 in sequence, thereby implementing two-way communication between GPUs on different node devices within a cluster while their kernels are running, as shown in FIG. 5, which is a schematic diagram of the communication interaction between GPUs on different nodes according to Embodiment 3 of the present invention. In FIG. 5, the solid lines indicate the communication path of the first communication data, and the dashed lines indicate the communication path of the second communication data. The communication mechanism between CPUs on different node devices may be implemented by using an existing technology such as socket or MPI (Message Passing Interface), and details are not described here again.
In this embodiment, the two-way communication between GPUs on different node devices may further be encapsulated as a cloud communication layer API placed in the cloud for use by upper-layer application developers (users) of the GPU distributed system. The cloud communication layer API also incorporates optimizations of the task scheduling policy (transparent to the user), which effectively avoids problems such as deadlock and low efficiency and ensures the correctness and stability of the GPU distributed system. The cloud communication layer API is used to write distributed GPU computing tasks and provides three APIs, specifically: a send operation API, a receive operation API, and a synchronization operation API.
The send operation API is: CLOUD_Send(data_type, data_length, data_buffer, destination)
The receive operation API is: CLOUD_Recv(data_type, data_length, data_buffer, source)

Here, data_type is the type of the data units to be sent/received, data_length is the size of the data content (that is, how many units of data), data_buffer is the buffer holding the data to be sent/received, destination is the destination address of the send operation, and source is the data source address of the receive operation. CLOUD_Send() and CLOUD_Recv() return an error code on failure.
When the user uses the send operation API or receive operation API of the cloud communication layer API, the system applies a task scheduling policy optimization. Specifically, in this embodiment a global compute task distribution queue is set up: before compute tasks are distributed, the tasks carrying send/receive operations are identified, and the global distribution queue is ordered so that every task carrying a send operation is placed before all tasks carrying a receive operation. When compute tasks are distributed, distribution follows this global task queue, which guarantees the correct ordering of the send/receive operations in the tasks of the user's distributed program and thus resolves the deadlock that mismatched send and receive operations would otherwise cause due to the exclusiveness of GPU tasks.
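The ordering rule can be sketched as a stable partition of the task list. The `Task` record is illustrative; where compute-only tasks sit relative to the two groups is an assumption of this sketch, since the text only fixes the send-before-receive order.

```python
from collections import namedtuple

# Illustrative task record; only the two flags matter for the ordering rule.
Task = namedtuple("Task", "name has_send has_recv")

def build_global_task_queue(tasks):
    """Order the global distribution queue so that every task carrying a send
    operation is dispatched before any task carrying a receive operation, so a
    receiver can never occupy the GPUs while its matching sender is still
    waiting to be launched."""
    sends = [t for t in tasks if t.has_send]
    rest  = [t for t in tasks if not t.has_send and not t.has_recv]
    recvs = [t for t in tasks if t.has_recv and not t.has_send]
    return sends + rest + recvs
```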
The synchronization operation API is: CLOUD_Sync()

At runtime, this method returns an error when the number of tasks performing the synchronization operation is excessive.
When the user uses the synchronization operation API of the cloud communication layer API, the system applies a task scheduling policy optimization. Specifically, before compute tasks are distributed, the compute tasks that need to synchronize are identified and distributed to different nodes in the system (i.e., no node may hold more than one of these compute tasks), and a global flag bit is set. When the compute tasks that need to synchronize are ready to run on all nodes, these compute tasks are scheduled to run together, which guarantees the correct scope of the synchronization operation in the tasks of the user's distributed program. The exclusiveness of GPU tasks dictates that the number of synchronizing tasks cannot exceed the number of tasks the system allows to run concurrently; when scheduling, the system must have the synchronizing tasks in the running state at the same time, otherwise system performance suffers.
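A toy dispatcher showing the three outcomes described above. The `ready` dictionary stands in for the global flag bit, and all names are illustrative assumptions of this sketch.

```python
def schedule_sync_tasks(sync_tasks, nodes, max_concurrent, ready):
    """Place each synchronizing task on its own node and launch them together.

    Returns a task->node placement once every participant is ready, an empty
    dict while still waiting, and None on oversubscription (the case in which
    CLOUD_Sync reports an error at runtime)."""
    if len(sync_tasks) > max_concurrent:
        return None                 # more sync tasks than can run concurrently
    if len(sync_tasks) > len(nodes):
        return None                 # cannot give each task its own node
    if not all(ready.get(task, False) for task in sync_tasks):
        return {}                   # wait until all participants are ready
    # At most one synchronizing task per node, all dispatched together.
    return dict(zip(sync_tasks, nodes))
```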
In this embodiment, a preset GPU communication API is inserted at the places in the kernel program of the GPU of the first node device where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, i.e., the first communication data, is acquired; the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) promptly while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data, which shortens the running time of the process on the second node device and improves the computational efficiency of the system.

In addition, this embodiment realizes bidirectional communication between the GPU and the CPU on a single node device while the kernel program of the GPU is running; and on that basis, combined with the existing communication mechanism between CPUs on different node devices within a cluster, bidirectional runtime communication between GPUs on different node devices within the cluster is realized.

Embodiment 4
Referring to FIG. 6, FIG. 6 is a first schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention. The GPU includes:

a running module 501, configured to run the kernel program of the graphics processing unit GPU of the local node device when the central processing unit CPU of the first node device starts the kernel program, where the kernel program includes at least one preset GPU communication application programming interface API;

an acquiring module 502, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API; and

a determination processing module 503, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.
The kernel program includes at least two sub-kernel programs, and one preset GPU communication API exists between every two sub-kernel programs.

The acquiring module 502 includes an acquiring unit 5021, as shown in FIG. 7; FIG. 7 is a second schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention.

The acquiring unit 5021 is configured to acquire the communication data of a sub-kernel program.
In another implementation of this embodiment, the preset buffer includes flag signal bits and communication data buffers. The flag signal bits include a first flag signal bit and a second flag signal bit, and the communication data buffers include a first communication data buffer and a second communication data buffer, where the first flag signal bit and the first communication data buffer are the flag signal bit and communication data buffer used by the CPU to receive from the GPU, and the second flag signal bit and the second communication data buffer are the flag signal bit and communication data buffer used by the GPU to receive from the CPU.

The determination processing module 503 includes the storage setting unit 5031, as shown in FIG. 8; FIG. 8 is a third schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention.

The storage setting unit 5031 is configured to store the first communication data into the first communication data buffer in the video memory of the local node device and set the state of the first flag signal bit to the set state, so that the CPU, after querying that the state of the first flag signal bit is the set state, copies the first communication data in the first communication data buffer into the memory of the local node device.

Alternatively, the determination processing module 503 includes:

a query acquiring unit 5032, configured to acquire second communication data from the second communication data buffer when the GPU queries that the state of the second flag signal bit is the set state, where the second communication data is copied by the CPU into the second communication data buffer, and the second flag signal bit is set to the set state by the CPU.
Further, the GPU also includes a verification setting module 504, as shown in FIG. 9; FIG. 9 is a fourth schematic structural diagram of an embodiment of a graphics processing unit GPU according to Embodiment 4 of the present invention.

The verification setting module 504 is configured to, after the second communication data is acquired from the second communication data buffer, verify whether the first communication data is valid; if so, set the state of the second flag signal bit to the reset state; if not, set the state of the second flag signal bit to the receive-error state.
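The two flag signal bits and two data areas can be pictured with a small simulation. The state names, polling helpers, and `valid` predicate below are illustrative assumptions, not the patented implementation; only the store-then-set, poll-then-copy, and reset/receive-error transitions come from the text.

```python
EMPTY, SET, RESET, RECV_ERROR = "empty", "set", "reset", "recv_error"

class PresetBuffer:
    """Toy model of the preset buffer in video memory: one flag bit plus one
    data area per direction."""
    def __init__(self):
        self.flag1, self.data1 = EMPTY, None   # CPU receives from GPU
        self.flag2, self.data2 = EMPTY, None   # GPU receives from CPU

def gpu_store(buf, first_data):
    buf.data1, buf.flag1 = first_data, SET     # store, then mark the set state

def cpu_poll(buf, host_memory):
    if buf.flag1 != SET:                       # CPU keeps querying the flag bit
        return False
    host_memory.append(buf.data1)              # copy into the node's memory
    buf.flag1 = RESET
    return True

def cpu_store(buf, second_data):
    buf.data2, buf.flag2 = second_data, SET

def gpu_poll(buf, valid=lambda d: d is not None):
    if buf.flag2 != SET:
        return None
    data = buf.data2
    # Validity check after the read: reset on success, receive-error otherwise.
    buf.flag2 = RESET if valid(data) else RECV_ERROR
    return data
```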
In this embodiment, a preset GPU communication API is inserted at the places in the kernel program of the GPU of the first node device where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, i.e., the first communication data, is acquired; the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) promptly while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data, which shortens the running time of the process on the second node device and improves the computational efficiency of the system.

Embodiment 5
Referring to FIG. 10, FIG. 10 is a schematic structural diagram of an embodiment of a first node device according to Embodiment 5 of the present invention. In this embodiment, both the first node device and the second node device may be commercial servers, but are not limited thereto.

The first node device includes: a CPU 40 and a GPU 50, where the functions of the GPU 50 are similar to those of the GPU in Embodiment 4; for details, refer to the relevant description of Embodiment 4, which is not repeated here.

The CPU 40 is configured to start the kernel program of the graphics processing unit GPU of the local node device, copy the first communication data from the preset buffer into the memory of the local node device, and copy the second communication data into the preset buffer.

The CPU 40 is further configured to transmit the first communication data to the GPU of the second node device via the CPU of the second node device, so that the GPU of the second node device shares the first communication data.
The CPU 40 is further configured to verify whether the first communication data is valid; if so, set the state of the first flag signal bit to the reset state; if not, set the state of the flag signal bit to the receive-error state.

In this embodiment, a preset GPU communication API is inserted at the places in the kernel program of the GPU of the first node device where intermediate running data needs to be shared. When the kernel program of the GPU runs to the preset GPU communication API, the intermediate running data of the portion of the kernel program that has finished running, i.e., the first communication data, is acquired; the GPU determines whether the communication operation corresponding to the GPU communication API is a communication operation for sending or a communication operation for receiving, and according to the result the GPU and the CPU of the local node device perform the corresponding processing to complete the GPU's communication operation, so that the CPU acquires the first communication data and the GPU acquires the second communication data. Compared with the prior art, this embodiment acquires the intermediate running data (the first communication data and the second communication data) promptly while the kernel program of the GPU is running, so that the second node device does not need to wait until the entire kernel program of the first node device has finished running before acquiring the intermediate running data, which shortens the running time of the process on the second node device and improves the computational efficiency of the system.
In addition, this embodiment realizes bidirectional communication between the GPU and the CPU on a single node device while the kernel program of the GPU is running; and on that basis, combined with the existing communication mechanism between CPUs on different node devices within a cluster, bidirectional runtime communication between GPUs on different node devices within the cluster is realized.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts may refer to the description of the method embodiments.
It should also be noted that in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

A person of ordinary skill in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. The above are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims
1. A data processing method, characterized in that the method comprises:

when a central processing unit CPU of a first node device starts a kernel program of a graphics processing unit GPU of the local node device, running, by the GPU, the kernel program, where the kernel program comprises at least one preset GPU communication application programming interface API; acquiring, by the GPU, first communication data when the kernel program of the GPU runs to the preset GPU communication API; and determining, by the GPU, whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, storing, by the GPU, the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; and if it is a communication operation for receiving, acquiring, by the GPU, second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.

2. The method according to claim 1, characterized in that the kernel program comprises at least two sub-kernel programs, and one preset GPU communication API exists between every two sub-kernel programs.

3. The method according to claim 2, characterized in that acquiring, by the GPU, the first communication data comprises: acquiring, by the GPU, communication data of a sub-kernel program.

4. The method according to claim 1, characterized in that the preset buffer comprises flag signal bits and communication data buffers; the flag signal bits comprise a first flag signal bit and a second flag signal bit, and the communication data buffers comprise a first communication data buffer and a second communication data buffer, where the first flag signal bit and the first communication data buffer correspond to the communication operation for sending, and the second flag signal bit and the second communication data buffer correspond to the communication operation for receiving.

5. The method according to claim 4, characterized in that storing, by the GPU, the first communication data into the preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device, comprises: storing, by the GPU, the first communication data into the first communication data buffer in the video memory of the local node device, and setting the state of the first flag signal bit to the set state, so that the CPU, after querying that the state of the first flag signal bit is the set state, copies the first communication data in the first communication data buffer into the memory of the local node device.

6. The method according to claim 4, characterized in that acquiring, by the GPU, the second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer, comprises: when the GPU queries that the state of the second flag signal bit is the set state, acquiring, by the GPU, the second communication data from the second communication data buffer, where the second communication data is copied by the CPU into the second communication data buffer, and the state of the second flag signal bit is set to the set state by the CPU.

7. The method according to claim 6, characterized in that after the GPU acquires the second communication data from the second communication data buffer, the method further comprises: verifying, by the GPU, whether the first communication data is valid; if so, setting the state of the second flag signal bit to the reset state; if not, setting the state of the second flag signal bit to the receive-error state.

8. The method according to any one of claims 1-7, characterized by further comprising: transmitting, by the CPU of the first node device, the first communication data to the GPU of a second node device via the CPU of the second node device, so that the GPU of the second node device shares the first communication data.
9. A graphics processing unit GPU, characterized by comprising:

a running module, configured to run the kernel program of the graphics processing unit GPU of the local node device when a central processing unit CPU of a first node device starts the kernel program, where the kernel program comprises at least one preset GPU communication application programming interface API;

an acquiring module, configured to acquire first communication data when the kernel program of the GPU runs to the preset GPU communication API; and

a determination processing module, configured to determine whether the communication operation corresponding to the preset GPU communication API is a communication operation for sending or a communication operation for receiving; if it is a communication operation for sending, the GPU stores the first communication data into a preset buffer in the video memory of the local node device, so that the CPU copies the first communication data from the preset buffer into the memory of the local node device; if it is a communication operation for receiving, the GPU acquires second communication data from the preset buffer, where the second communication data is copied by the CPU into the preset buffer.

10. The GPU according to claim 9, characterized in that the kernel program comprises at least two sub-kernel programs, and one preset GPU communication API exists between every two sub-kernel programs.

11. The GPU according to claim 10, characterized in that the acquiring module comprises:

an acquiring unit, configured to acquire communication data of a sub-kernel program.

12. The GPU according to claim 9, characterized in that the preset buffer comprises flag signal bits and communication data buffers; the flag signal bits comprise a first flag signal bit and a second flag signal bit, and the communication data buffers comprise a first communication data buffer and a second communication data buffer, where the first flag signal bit and the first communication data buffer correspond to the communication operation for sending, and the second flag signal bit and the second communication data buffer correspond to the communication operation for receiving.

13. The GPU according to claim 12, characterized in that the determination processing module comprises:

a storage setting unit, configured to store the first communication data into the first communication data buffer in the video memory of the local node device and set the state of the first flag signal bit to the set state, so that the CPU, after querying that the state of the first flag signal bit is the set state, copies the first communication data in the first communication data buffer into the memory of the local node device.

14. The GPU according to claim 12, characterized in that the determination processing module comprises: a query acquiring unit, configured to acquire second communication data from the second communication data buffer when the GPU queries that the state of the second flag signal bit is the set state, where the second communication data is copied by the CPU into the second communication data buffer, and the second flag signal bit is set to the set state by the CPU.

15. The GPU according to claim 14, characterized by further comprising:

a verification setting module, configured to, after the second communication data is acquired from the second communication data buffer, verify whether the first communication data is valid; if so, set the state of the second flag signal bit to the reset state; if not, set the state of the second flag signal bit to the receive-error state.
16. A first node device, characterized by comprising a central processing unit CPU and the graphics processing unit GPU according to any one of claims 9-15;

the CPU is configured to start the kernel program of the graphics processing unit GPU of the local node device, copy first communication data from the preset buffer into the memory of the local node device, and copy second communication data into the preset buffer.

17. The first node device according to claim 16, characterized in that the CPU is further configured to transmit the first communication data to the GPU of a second node device via the CPU of the second node device, so that the GPU of the second node device shares the first communication data.

18. The first node device according to claim 16, characterized in that the CPU is further configured to verify whether the first communication data is valid; if so, set the state of the first flag signal bit to the reset state; if not, set the state of the flag signal bit to the receive-error state.
PCT/CN2011/084764 2011-12-27 2011-12-27 Data processing method, graphics processing unit (gpu) and first node device WO2013097098A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2011/084764 WO2013097098A1 (en) 2011-12-27 2011-12-27 Data processing method, graphics processing unit (gpu) and first node device
CN201180003244.XA CN103282888B (en) 2011-12-27 2011-12-27 Data processing method, image processor GPU and primary nodal point equipment


Publications (1)

Publication Number Publication Date
WO2013097098A1

Family

ID=48696189


Country Status (2)

Country Link
CN (1) CN103282888B (en)
WO (1) WO2013097098A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716635A (en) * 2013-12-12 2014-04-09 Zhejiang Uniview Technologies Co., Ltd. Method and device for improving intelligent analysis performance
CN107333136A (en) * 2017-06-26 2017-11-07 Xi'an Wanxiang Electronics Technology Co., Ltd. Image encoding method and device
CN111506420A (en) * 2020-03-27 2020-08-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory synchronization method and device, electronic equipment and storage medium
TWI715613B (en) * 2015-09-25 2021-01-11 Intel Corporation Apparatus, system and method for performing gpu-cpu two-path memory copy

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969565B (en) * 2018-09-28 2023-05-16 Hangzhou Hikvision Digital Technology Co., Ltd. Image processing method and device
CN113986771B (en) * 2021-12-29 2022-04-08 Beijing Biren Technology Development Co., Ltd. Method and device for debugging target program code and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1250567A (en) * 1997-03-13 2000-04-12 International Business Machines Corporation Kiosk and server connected to computer network
CN101599009A (en) * 2009-04-30 2009-12-09 Inspur Electronic Information Industry Co., Ltd. Method for executing tasks in parallel on a heterogeneous multiprocessor
CN101802789A (en) * 2007-04-11 2010-08-11 Apple Inc. Parallel runtime execution on multiple processors
CN102099788A (en) * 2008-06-06 2011-06-15 Apple Inc. Application programming interfaces for data parallel computing on multiple processors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5572572A (en) * 1988-05-05 1996-11-05 Transaction Technology, Inc. Computer and telephone apparatus with user-friendly interface and enhanced integrity features

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716635A (en) * 2013-12-12 2014-04-09 Zhejiang Uniview Technologies Co., Ltd. Method and device for improving intelligent analysis performance
CN103716635B (en) * 2013-12-12 2017-04-19 Zhejiang Uniview Technologies Co., Ltd. Method and device for improving intelligent analysis performance
TWI715613B (en) * 2015-09-25 2021-01-11 Intel Corporation Apparatus, system and method for performing gpu-cpu two-path memory copy
CN107333136A (en) * 2017-06-26 2017-11-07 Xi'an Wanxiang Electronics Technology Co., Ltd. Image encoding method and device
CN111506420A (en) * 2020-03-27 2020-08-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory synchronization method and device, electronic equipment and storage medium
CN111506420B (en) * 2020-03-27 2023-09-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory synchronization method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103282888B (en) 2017-03-08
CN103282888A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
TWI543073B (en) Method and system for work scheduling in a multi-chip system
US7490089B1 (en) Methods and apparatus facilitating access to shared storage among multiple computers
JP6475625B2 (en) Inter-core communication apparatus and method
US7668923B2 (en) Master-slave adapter
US8032892B2 (en) Message passing with a limited number of DMA byte counters
KR102011949B1 (en) System and method for providing and managing message queues for multinode applications in a middleware machine environment
JP6353086B2 (en) Multi-database log with multi-item transaction support
US7797588B2 (en) Mechanism to provide software guaranteed reliability for GSM operations
JP2018163671A (en) Scalable log-based transaction management
US20050081080A1 (en) Error recovery for data processing systems transferring message packets through communications adapters
TWI547870B (en) Method and system for ordering i/o access in a multi-node environment
WO2013097098A1 (en) Data processing method, graphics processing unit (gpu) and first node device
US20050078559A1 (en) Global recovery for time of day synchronization
TW201543218A (en) Chip device and method for multi-core network processor interconnect with multi-node connection
TWI541649B (en) System and method of inter-chip interconnect protocol for a multi-chip system
US8086766B2 (en) Support for non-locking parallel reception of packets belonging to a single memory reception FIFO
US10185681B2 (en) Hybrid message-based scheduling technique
US20050080869A1 (en) Transferring message packets from a first node to a plurality of nodes in broadcast fashion via direct memory to memory transfer
KR20110047753A (en) Method and system of data processing for deadlock free
US20050080920A1 (en) Interpartition control facility for processing commands that effectuate direct memory to memory information transfer
US20090199191A1 (en) Notification to Task of Completion of GSM Operations by Initiator Node
US9830263B1 (en) Cache consistency
EP2676203B1 (en) Broadcast protocol for a network of caches
CN105373563B (en) Database switching method and device
US20170039143A1 (en) System and method of a shared memory hash table with notifications and reduced memory utilization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11879132

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11879132

Country of ref document: EP

Kind code of ref document: A1
