CN112231018B - Method, computing device, and computer-readable storage medium for offloading data - Google Patents


Info

Publication number
CN112231018B
CN112231018B (application CN202011470268.3A)
Authority
CN
China
Prior art keywords
offload
engine
synchronization
command
synchronization point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011470268.3A
Other languages
Chinese (zh)
Other versions
CN112231018A (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bilin Technology Development Co ltd
Shanghai Bi Ren Technology Co ltd
Original Assignee
Beijing Bilin Technology Development Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bilin Technology Development Co ltd, Shanghai Biren Intelligent Technology Co Ltd filed Critical Beijing Bilin Technology Development Co ltd
Priority to CN202011470268.3A priority Critical patent/CN112231018B/en
Publication of CN112231018A publication Critical patent/CN112231018A/en
Application granted granted Critical
Publication of CN112231018B publication Critical patent/CN112231018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44594Unloading
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The present disclosure relates to a method, computing device, and computer-readable storage medium for offloading data. The method comprises: sending a command packet to a command queue, the command packet defining an offload operation to offload data from a graphics processor and indicating at least a synchronization point and an offload operation command packet associated with the synchronization point; notifying a synchronization engine in response to a synchronization instruction corresponding to the synchronization point being triggered; the synchronization engine informing a command engine that the synchronization point has been reached, so that the command engine retrieves the offload operation command packet associated with the synchronization point from the command queue; and the command engine parsing the offload operation command packet associated with the synchronization point to determine an offload engine for performing the corresponding offload operation of the offload operation command packet. The method can effectively improve the overall performance and resource utilization of the computing device.

Description

Method, computing device, and computer-readable storage medium for offloading data
Technical Field
Embodiments of the present disclosure relate generally to the field of information processing, and more particularly, to a method, computing device, and computer-readable storage medium for offloading data.
Background
In a conventional computing device, for example one configured with a general-purpose graphics processing unit (GPGPU), the GPU, originally designed for graphics tasks, is commonly used to execute general-purpose computing tasks that would otherwise be processed by the central processing unit (CPU). These general-purpose computations involve a wide variety of operation types, some of which have no relationship to graphics processing and are not well suited to the GPU. Having the GPU handle operations better suited to other components (e.g., the central processing unit), such as prefetching (Preload) data in GPU memory into the level-two cache, is likely to consume many GPU cycles and much memory bandwidth, reducing the GPU's execution efficiency and working against the overall performance and resource utilization of the computing device.
In summary, conventional computing devices tend to reduce the execution efficiency of the GPU and limit the overall performance and resource utilization of the computing device.
Disclosure of Invention
The present disclosure provides a method, a computing device and a computer-readable storage medium for offloading data, which can effectively improve the overall performance and resource utilization of the computing device.
According to a first aspect of the present disclosure, a method for offloading data is provided. The method comprises: sending a command packet to a command queue, the command packet defining an offload operation to offload data from a graphics processor and indicating at least a synchronization point and an offload operation command packet associated with the synchronization point; notifying a synchronization engine in response to a synchronization instruction corresponding to the synchronization point being triggered; the synchronization engine informing a command engine that the synchronization point has been reached, so that the command engine retrieves the offload operation command packet associated with the synchronization point from the command queue; and the command engine parsing the offload operation command packet associated with the synchronization point to determine an offload engine for performing the corresponding offload operation of the offload operation command packet.
According to a second aspect of the present disclosure, there is also provided a method for offloading data. The method comprises: a command engine sending synchronization operation information, the synchronization operation information defining an offload operation to offload data from the graphics processor and indicating at least a synchronization point, an offload operation type associated with the synchronization point, and target data of the offload operation; notifying a synchronization engine in response to determining that a synchronization instruction corresponding to the synchronization point is triggered; the synchronization engine acquiring, based on the synchronization point, the offload operation type and the target data of the offload operation associated with the synchronization point in the synchronization operation information; and the synchronization engine determining, based on the offload operation type, an offload engine for performing the corresponding offload operation associated with the synchronization point.
According to a third aspect of the present disclosure, there is also provided a method for offloading data. The method comprises: acquiring a device vendor identification and a device identification of a computing device, the computing device comprising at least a graphics processor and a central processor; performing the method of the first aspect of the present disclosure to offload data from the graphics processor if it is determined that the device vendor identification is a predetermined identification and the device identification belongs to a first predetermined set; and performing the method of the second aspect of the present disclosure to offload data from the graphics processor if it is determined that the device vendor identification is the predetermined identification and the device identification belongs to a second predetermined set.
According to a fourth aspect of the present disclosure, a computing device is also provided. The computing device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the computing device to perform the method of the first aspect of the disclosure.
According to a fifth aspect of the present disclosure, there is also provided a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a machine, performs the method of the first aspect of the disclosure.
In some embodiments, the determined offload engine to perform the corresponding offload operation is one of a component of a graphics processor and a central processor.
In some embodiments, the method for offloading data further comprises: the command engine determining, via synchronization or via a notification from the determined offload engine, that the corresponding offload operation has been performed; the command engine notifying the synchronization engine that the corresponding offload operation has been completed; and the synchronization engine notifying the program core of the end of the wait associated with the synchronization point so that the program core continues to execute code.
In some embodiments, before the synchronization engine is notified in response to the synchronization instruction corresponding to the synchronization point being triggered, the method further comprises: the program core executing code to trigger the synchronization instruction corresponding to the synchronization point, wherein the number and order of synchronization points configured in the code correspond to the number and order of synchronization points indicated by the command packet.
In some embodiments, the command packet is generated by the driver based on information provided by the compiler, the offload operation command packet associated with the synchronization point indicating an offload operation type associated with the synchronization point and target data for the offload operation, the offload operation type comprising: an offload engine identification and an offload operation number identification.
In some embodiments, determining an offload engine for performing a corresponding offload operation of the offload operation command packet comprises: the command engine determines, among the plurality of offload engines, an offload engine for performing a corresponding offload operation of the offload operation command packet based on the offload engine identification and the offload operation number identification parsed from the offload operation command packet.
In some embodiments, the method for offloading data further comprises: the command engine determining, via synchronization or via a notification from the determined offload engine, that the corresponding offload operation has been performed; and the synchronization engine notifying the program core of the end of the wait associated with the synchronization point so that the program core continues to execute code.
In some embodiments, the method for offloading data further comprises: the command engine sending code of the program core, the code configured with one or more synchronization points; and the program core executing the code to trigger the synchronization instruction corresponding to the synchronization point.
In some embodiments, the synchronization operation information is a synchronization operation table generated by a compiler, and the offload operation type comprises: an offload engine identification and an offload operation number identification.
In some embodiments, the synchronization engine determining, based on the offload operation type, an offload engine for performing a corresponding offload operation associated with the synchronization point comprises: the synchronization engine determines, based on the offload engine identification and the offload operation number identification, one offload engine among a plurality of offload engines for performing a corresponding offload operation associated with the synchronization point, the plurality of offload engines including at least two of a central processor and a component in the graphics processor.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are for a better understanding of the present solution and do not constitute a limitation of the present application.
Fig. 1 shows a schematic diagram of a conventional computing device.
FIG. 2 shows a schematic diagram of a computing device for implementing a method for offloading data according to an embodiment of the disclosure.
Fig. 3 illustrates a flow diagram of a method for offloading data, in accordance with some embodiments of the present disclosure.
FIG. 4 shows a schematic diagram of a method for offloading data, in accordance with some embodiments of the present disclosure.
FIG. 5 illustrates a flow diagram of a method for offloading data, in accordance with further embodiments of the disclosure.
FIG. 6 shows a schematic diagram of a method for offloading data, in accordance with further embodiments of the present disclosure.
FIG. 7 illustrates a flow diagram of a method for offloading data, in accordance with further embodiments of the disclosure.
Like or corresponding reference characters designate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.
As described above, in a conventional computing device, a graphics processing unit (GPU) designed for graphics tasks is generally used to compute general-purpose computing tasks originally processed by the central processing unit (CPU). This tends to consume many GPU cycles and much memory bandwidth, reducing the execution efficiency of the GPU and working against the overall performance and resource utilization of the computing device.
The reasons for the reduced overall performance and resource utilization of a computing device are described below in conjunction with fig. 1. Fig. 1 shows a schematic diagram of a conventional computing device. As shown in fig. 1, computing device 100 includes: a graphics processor 110, a central processor 130, and a bus 140. Graphics processor 110 includes, for example, a command processor (CP) 112, a plurality of computing units (CUs) 114, a DMA/SDMA controller 116, a level-two cache 118, and a memory 120. The central processor 130 includes, for example, a memory 132. The bus 140 connects the graphics processor 110 and the central processor 130, for example to transfer data, addresses, and control signals. The bus 140 is, for example and without limitation, a Peripheral Component Interconnect Express (PCI-E) bus.
Generally, the central processor 130 and the graphics processor 110 are suited to different tasks. To exploit the capabilities of the graphics processor 110 efficiently, the central processor 130 may offload a series of device code to the graphics processor 110, in a process referred to as a first offload (First Offload). The process of offloading device code from the central processor 130 to the graphics processor 110 includes, for example: the central processor 130 prepares the data required by the device code (Device Code) in the memory 132, and then sends a command to the command processor 112 in the graphics processor 110 requesting that the data be copied from the memory 132 of the central processor 130 to the memory 120 in the graphics processor 110. The command processor 112 performs the copy from memory 132 to memory 120, for example via the DMA/SDMA controller 116. The central processor 130 sends the device code to be executed to the graphics processor 110, and then sends a command to the graphics processor 110 to initiate execution of the device code. The computing units 114 in the graphics processor 110 perform the tasks indicated by the device code: they read data from the memory 120 in the graphics processor 110, perform computations, and write the computation results to the memory 120. When the computing units 114 have finished executing the tasks indicated by the device code, the central processor 130 is notified by the command processor 112 that the device code has completed. The central processor 130 then migrates the computation results in the memory 120 of the graphics processor 110 back to the memory 132 via the bus 140.
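The host-side sequence just described can be summarized in pseudo-code. The following C++ sketch is purely illustrative: every type and function in it is a hypothetical stand-in for the steps the text attributes to the central processor 130, the command processor 112, and the DMA/SDMA controller 116, not a real driver API.

    // Hypothetical host-side view of the "first offload" sequence. The
    // "GPU" here is simulated with a plain buffer so the sketch compiles
    // and runs; a real system would cross the bus 140 via a driver.
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct GpuBuffer { std::vector<char> device_mem; };  // stand-in for memory 120

    GpuBuffer gpuAlloc(std::size_t n) { return GpuBuffer{std::vector<char>(n)}; }
    void gpuCopyToDevice(GpuBuffer& b, const void* src, std::size_t n) {
        std::memcpy(b.device_mem.data(), src, n);  // CP 112 drives DMA/SDMA 116
    }
    void gpuCopyToHost(void* dst, const GpuBuffer& b, std::size_t n) {
        std::memcpy(dst, b.device_mem.data(), n);  // results migrate back over bus 140
    }
    void gpuLaunchDeviceCode(GpuBuffer&) { /* compute units 114 run the device code */ }
    void gpuWaitForCompletion() { /* CP 112 notifies the CPU on completion */ }

    void firstOffload(const void* host_in, void* host_out, std::size_t n) {
        GpuBuffer buf = gpuAlloc(n);       // CPU prepared data in memory 132, then...
        gpuCopyToDevice(buf, host_in, n);  // ...requests a copy into GPU memory 120
        gpuLaunchDeviceCode(buf);          // command to start executing the device code
        gpuWaitForCompletion();            // wait until the task finishes
        gpuCopyToHost(host_out, buf, n);   // copy computation results back to memory 132
    }

    int main() {
        char in[16] = "input data";
        char out[16] = {};
        firstOffload(in, out, sizeof(in));
        return 0;
    }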
Because the graphics processor 110 executes all device code on the computing units 114, the computing units 114 carry a heavy workload while other components in the graphics processor 110 (e.g., the command processor 112, the level-two cache 118, the DMA/SDMA controller 116, etc.) sit idle, resulting in inefficient operation of the computing device 100. Furthermore, the tasks of some device code are not suitable for the computing units 114 and would be better performed by other components. Examples of such device code include: "preload data in the memory of the graphics processor into the level-two cache (i.e., Preload Data A to L2)", "flush the system level-two cache (i.e., Flush L2)", or "perform an All-Reduce operation on multiple pieces of data stored in the memory of the graphics processor (i.e., All-Reduce Data A/B to C)". If the device code "Flush L2" is executed by the computing unit 114, the computing unit 114 needs to perform a specific write operation to each address, which takes much time and blocks the execution of subsequent code. Likewise, if the device code "All-Reduce Data A/B to C" is executed by the computing unit 114, the memory data must first be read into the computing unit 114, the Reduce operation performed, and the result written back to memory, which consumes many GPU cycles and much memory bandwidth, thereby affecting the overall performance of the computing device 100.
To address, at least in part, one or more of the above issues and other potential issues, example embodiments of the present disclosure propose a method, computing device, and computer-readable storage medium for offloading data. In the disclosed method, a command packet defining an offload operation to be performed from the graphics processor is obtained, and when a synchronization instruction is triggered, a synchronization engine is notified so that it in turn notifies the command engine to obtain the offload operation command packet associated with the synchronization point from the command queue; an appropriate offload engine for performing the corresponding offload operation is then determined by parsing the offload operation command packet. In this way, target data at the graphics processor can be offloaded to components internal or external to the graphics processor (such offload is referred to as "secondary offload" or "reverse offload") without all device code having to be executed by the computing units 114 alone, which helps improve the overall performance and resource utilization of the computing device.
Fig. 2 shows a schematic diagram of a computing device 200 for implementing a method for offloading data according to an embodiment of the disclosure. As shown in fig. 2, computing device 200 includes: a graphics processor 210, a central processing unit 230, a bus 240, an application-specific integrated circuit (ASIC) 250, and a field-programmable gate array (FPGA) 260. In some embodiments, computing device 200 also includes other components, such as accelerators (not shown). Graphics processor 210 includes, for example, a command engine 212, a plurality of compute units 214, a DMA/SDMA controller 216, a level-two cache 218, a memory 220, and a synchronization engine 222. The central processor 230 includes, for example, a memory 232. Computing device 200 is configured with a device vendor identification and a device identification. The computing device 200 may implement the reverse offload described above. For example, computing device 200 determines an appropriate method of offloading data based on the device vendor identification and the device identification, for offloading data from the graphics processor to an internal component of graphics processor 210 (e.g., one of the DMA/SDMA controller 216 and the level-two cache 218) or to a component external to graphics processor 210 (e.g., one of the central processor 230, the application-specific integrated circuit 250, and the field-programmable gate array 260). The internal or external components of graphics processor 210 that perform the corresponding offload operations are hereinafter collectively referred to as offload engines.
To implement reverse offload and thereby improve overall performance and resource utilization, the computing device 200 configures the synchronization engine 222 and the command engine 212 in the graphics processor 210, and defines the offload operation for offloading data from the graphics processor by configuring a command packet in a command queue or by synchronization operation information, so that an offload engine for performing the corresponding offload operation (i.e., the reverse offload operation) can be determined among a plurality of offload engines.
With respect to the synchronization engine, in some embodiments it implements synchronization before the reverse offload and, after the reverse offload has finished executing, ends the wait so that code execution can continue. Specifically, if a synchronization instruction corresponding to the synchronization point is triggered, the synchronization engine is notified; the synchronization engine then notifies the command engine that the synchronization point has been reached, causing the command engine to fetch the offload operation command packet associated with the synchronization point from the command queue. In addition, when the synchronization engine is notified that the corresponding offload operation has been completed, it notifies the program core that the wait associated with the synchronization point has ended so that the program core continues to execute code. In other embodiments, besides performing the above synchronization before and after the reverse offload, the synchronization engine also identifies the offload engine for performing the corresponding offload operation based on the synchronization operation information. For example, the synchronization engine acquires, based on the synchronization point, the offload operation type and the target data of the offload operation associated with the synchronization point in the synchronization operation information; the synchronization engine then determines, based on the offload operation type, an offload engine for performing the corresponding offload operation associated with the synchronization point.
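The wait/notify behavior of the synchronization engine can be modeled in a few lines of software. The sketch below is a toy model under stated assumptions: real hardware would use counters or doorbell registers rather than a condition variable, and the class and method names are invented for illustration.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Toy model of the synchronization engine's two duties: block the
    // program core at a synchronization point, and end that wait once the
    // reverse offload tied to the point has completed.
    class SyncEngineModel {
    public:
        // Program core path: the synchronization instruction fires and the
        // core waits here until the offload for this point is done.
        void waitAtSyncPoint(std::uint32_t point) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return last_completed_ >= point; });
        }
        // Command engine path: called once the corresponding offload
        // operation has been completed.
        void offloadCompleted(std::uint32_t point) {
            { std::lock_guard<std::mutex> lk(m_); last_completed_ = point; }
            cv_.notify_all();
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::uint32_t last_completed_ = 0;
    };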
With respect to the command engine, in some embodiments it determines the offload engine for performing the corresponding offload operation associated with the synchronization point. For example, the command engine acquires the offload operation command packet associated with the synchronization point from the command queue according to the notification of the synchronization engine, and then parses that packet to determine the offload engine for performing the corresponding offload operation. In other embodiments, the command engine only sends the synchronization operation information and does not participate in determining the offload engine; instead, the synchronization engine determines, based on the offload operation type in the synchronization operation information, the offload engine for performing the corresponding offload operation associated with the synchronization point.
A method 300 for offloading data is described below in conjunction with fig. 3 and 4. Fig. 3 illustrates a flow diagram of a method 300 for offloading data, in accordance with some embodiments of the present disclosure. Fig. 4 illustrates a schematic diagram of a method for offloading data, in accordance with some embodiments of the present disclosure. It should be understood that the method 300 may be performed, for example, at the computing device 200 depicted in FIG. 2. It should also be understood that method 300 may include additional acts not shown and/or may omit illustrated acts, as the scope of the present disclosure is not limited in this respect.
At step 302, a command packet is sent to a command queue, the command packet defining an offload operation to offload data from a graphics processor, the command packet indicating at least a synchronization point and an offload operation command packet associated with the synchronization point. For example, the software program generates a command packet and sends the generated command packet to the command queue 440 shown in FIG. 4.
The command packet defines an offload operation for offloading data from the graphics processor (i.e., a reverse offload). The command packet is generated, for example, by a driver based on information provided by a compiler. The command packet indicates at least a synchronization point and an offload operation command packet associated with the synchronization point. The offload operation command packet may indicate, for example, an offload operation type associated with the synchronization point and the target data of the offload operation. The offload operation type comprises: an offload engine identification and an offload operation number identification. The information indicated by the offload operation command packet is described below in conjunction with Table One.
Table One (reconstructed from the description below; the original appears as an image):
Synchronization point | Offload operation type (32 BIT: engine ID 8 BIT + operation number 24 BIT) | Target data
Sync Counter #1 | second offload engine, operation No. 2 | Data A
Sync Counter #2 | first offload engine, operation No. 2 | Data B, Data C
Sync Counter #3 | third offload engine, operation No. 4 | Data B
In Table One above, "Sync Counter #1", "Sync Counter #2", and "Sync Counter #3" represent different synchronization points. The offload operation type comprises an offload engine identification and an offload operation number identification. As shown in Table One, the 32-bit offload operation type (e.g., 0x02000002) consists of two parts: an offload engine identification in the high 8 bits (e.g., 0x02) and an offload operation number identification in the low 24 bits (e.g., 0x000002). The offload engine identification defines which offload engine the corresponding offload operation associated with the synchronization point should be handed to for execution. The target data of the offload operation indicates the data used by the corresponding offload operation.
The command packet information indicated by Table One above defines, for example: for synchronization point "Sync Counter #1", the corresponding offload operation performs its 2nd operation at the second offload engine, and the target data of the offload operation is data A; for synchronization point "Sync Counter #2", the corresponding offload operation performs its 2nd operation at the first offload engine, and the target data are data B and data C; for synchronization point "Sync Counter #3", the corresponding offload operation performs its 4th operation at the third offload engine, and the target data is data B.
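In software, such a 32-bit offload operation type can be packed and unpacked with simple shifts and masks. The C++ sketch below mirrors the 8-bit/24-bit layout read off Table One; the function names, and the placement of the engine identification in the high 8 bits, are assumptions for illustration.

    #include <cstdint>
    #include <cstdio>

    // 32-bit offload operation type: 8 bits of offload engine
    // identification (assumed to be the high bits, matching the example
    // value 0x02000002) plus 24 bits of offload operation number.
    constexpr std::uint32_t makeOpType(std::uint8_t engine_id, std::uint32_t op_no) {
        return (static_cast<std::uint32_t>(engine_id) << 24) | (op_no & 0xFFFFFFu);
    }
    constexpr std::uint8_t engineIdOf(std::uint32_t t) {
        return static_cast<std::uint8_t>(t >> 24);
    }
    constexpr std::uint32_t opNumberOf(std::uint32_t t) { return t & 0xFFFFFFu; }

    int main() {
        // Row 1 of Table One: 2nd operation on the second offload engine.
        std::uint32_t t = makeOpType(/*engine*/ 2, /*operation*/ 2);
        std::printf("engine=%u operation=%u\n", engineIdOf(t), opNumberOf(t));
        return 0;
    }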
The number and order of synchronization points indicated in the command packets in command queue 440 correspond to the number and order of synchronization points associated with code 452 in program core 450. Thus, when the program core 450 executes a synchronization instruction corresponding to a synchronization point, the offload operation command packet corresponding to that synchronization point can be uniquely located in the command queue 440, so that in a subsequent step the command engine can parse that offload operation command packet to determine the offload engine for performing the corresponding offload operation.
At step 304, the synchronization engine is notified in response to a synchronization instruction corresponding to the synchronization point being triggered.
In some embodiments, before the synchronization engine is notified in response to a synchronization instruction corresponding to a synchronization point being triggered, the program core 450 executes the code 452 to trigger the synchronization instruction, the code being configured with a number and order of synchronization points corresponding to those indicated by the command packet. For example, the program core 450 in fig. 4 executes the code, and when execution reaches the synchronization point "Sync Counter #1", the synchronization instruction corresponding to "Sync Counter #1" is triggered. If it is determined that this synchronization instruction is triggered, the synchronization engine 460 is notified.
At step 306, the synchronization engine notifies the command engine that the synchronization point has been reached so that the command engine retrieves the offload operation command packet associated with the synchronization point from the command queue. For example, the command engine 462 shown in fig. 4 acquires the offload Operation command packet "Operation < OP1 >" associated with the synchronization point "Sync Counter # 1".
At step 308, the command engine parses the offload operation command packet associated with the synchronization point to determine an offload engine for performing the corresponding offload operation of the offload operation command packet. Determining the offload engine comprises: the command engine determining, among the plurality of offload engines, an offload engine for performing the corresponding offload operation based on the offload engine identification and the offload operation number identification parsed from the offload operation command packet. The determined offload engine is one of a component of the graphics processor and the central processor.
For example, command engine 462 of FIG. 4 parses the offload Operation command packet "Operation<OP1>" associated with synchronization point "Sync Counter #1" to extract the offload engine identification and the offload operation number identification, and based on the parsed offload engine identification determines, among the offload engines 464, that the corresponding offload operation of the command packet associated with synchronization point "Sync Counter #1" is to perform its 2nd operation at the second offload engine 464-2.
The offload engine performs the corresponding offload operation of the offload operation command packet. Each offload engine has a corresponding offload engine identification. In some embodiments, the computing device may include multiple offload engines, which may be components inside the graphics processor or external to it (e.g., the CPU). Different offload operations may be performed by different offload engines; that is, different offload engines may be suited to performing different types of offload operations. In some embodiments, the offload engine may perform the corresponding offload operation asynchronously.
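The dispatch that step 308 describes, parsing the engine identification out of the packet and handing the operation to the matching engine, can be sketched as a registry keyed by offload engine identification. The interface below is an illustrative assumption, not a description of the hardware; completion could be reported back asynchronously, as noted above.

    #include <cstdint>
    #include <map>
    #include <stdexcept>

    // Abstract offload engine: the DMA/SDMA controller, level-two cache,
    // CPU, etc. would each implement perform() for the operations they
    // are suited to.
    struct OffloadEngine {
        virtual ~OffloadEngine() = default;
        virtual void perform(std::uint32_t op_no, std::uint32_t target_data) = 0;
    };

    class CommandEngineModel {
    public:
        void registerEngine(std::uint8_t id, OffloadEngine* e) { engines_[id] = e; }

        // Parse the offload operation command packet's type field and
        // dispatch to the engine named by the engine identification.
        void dispatch(std::uint32_t op_type, std::uint32_t target_data) {
            std::uint8_t id = static_cast<std::uint8_t>(op_type >> 24); // engine id
            std::uint32_t op_no = op_type & 0xFFFFFFu;                  // operation number
            auto it = engines_.find(id);
            if (it == engines_.end()) throw std::runtime_error("unknown offload engine");
            it->second->perform(op_no, target_data);
        }
    private:
        std::map<std::uint8_t, OffloadEngine*> engines_;
    };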
In this scheme, when the synchronization instruction is triggered, the synchronization engine is notified so that it informs the command engine to acquire the offload operation command packet associated with the synchronization point from the command queue, and an appropriate offload engine is determined to perform the corresponding offload operation based on the parsing of that packet. This lets internal or external components of the GPU participate concurrently in offloading and executing GPU code tasks, significantly reducing execution time. Meanwhile, the CUs, the GPU's principal computing units, can concentrate on the pure computation tasks they are best at, reducing memory bandwidth occupation and the computation time spent on tasks they are ill-suited to, thereby improving execution efficiency. The present disclosure can therefore help improve the overall performance and resource utilization of the computing device.
In the above scheme, the synchronization engine is used, for example, to perform synchronization before reverse offload execution, and to notify the program core (Kernel) to continue to execute subsequent code operations after the offload engine has executed the corresponding offload operations.
With the above means, all modules inside the GPU, as well as external modules, can participate concurrently in the execution logic of GPU code tasks, improving concurrency and reducing execution time. Meanwhile, the CUs, the GPU's principal computing units, can concentrate on the pure computation tasks the processor is best at, significantly reducing memory bandwidth occupation, improving execution efficiency, and reducing execution time.
In addition, in conventional GPU task offloading, because some tasks must be executed on the CPU, a GPU task has to be divided into very small tasks; for example, after one GPU task completes, the CPU performs the subsequent task, so repeated interaction between the CPU and the GPU is required: the CPU issues a device code task to the GPU, waits for the GPU to finish executing it, and only then proceeds. Such frequent launching of GPU tasks and waiting on them reduces execution efficiency. With the above means, repeated interaction between the CPU and the GPU is reduced: logic that previously required multiple pieces of device code to be executed separately can be submitted at one time and driven by the synchronization engine. In this way, repeated task submission and waiting between the CPU and the GPU are eliminated, and resource utilization can be significantly improved.
In some embodiments, the method 300 further comprises: the command engine determining, via synchronization or via a notification from the determined offload engine, that the corresponding offload operation has been performed; the command engine notifying the synchronization engine that the corresponding offload operation has been completed; and the synchronization engine notifying the program core of the end of the wait associated with the synchronization point so that the program core continues to execute code.
Fig. 4 shows a schematic diagram of a method 400 for offloading data, in accordance with an embodiment of the disclosure. It should be understood that method 400 may include additional acts not shown and/or may omit illustrated acts, as the scope of the present disclosure is not limited in this respect.
At step 402, the software program generates a command packet and sends it to the command queue 440, and the program core 450 is started to execute code (e.g., code "kernel A"). The command packet indicates a synchronization point and an offload operation command packet associated with the synchronization point; the offload operation command packet defines, for example, the offload operation type and the target data of the offload operation.
At step 404, the program core 450 executes code 452 (e.g., code "kernel A") to trigger a synchronization instruction corresponding to a synchronization point (e.g., "Sync Counter #1").
At step 406, if it is determined that a synchronization instruction corresponding to a synchronization point (e.g., "Sync Counter #1") is triggered, the synchronization engine 460 is notified, and execution waits at that point in the code.
In step 408, synchronization engine 460 notifies command engine 462 that a synchronization point (e.g., "Sync Counter # 1") has been reached.
At step 410, command engine 462 notifies the command queue 440 that the wait has ended.
At step 412, command engine 462 retrieves an offload Operation command packet (e.g., "Operation < op1 >") associated with a synchronization point (e.g., "Sync Counter # 1") from command queue 440.
At step 414, the command engine 462 parses the offload operation command packet 444 associated with the synchronization point to determine an offload engine for performing the corresponding offload operation. For example, the command engine 462 selects, among the plurality of offload engines 464 (e.g., the first to Nth offload engines 464-1 to 464-N), an offload engine (e.g., the second offload engine 464-2) for performing the corresponding offload operation based on the offload operation type and the target data parsed from the offload operation command packet "Operation<OP1>".
At step 416, if the determined offload engine 464 completes the corresponding offload operation, the command engine 462 is notified. The offload engine 464 may notify the command engine 462 asynchronously that the corresponding offload operation has been completed (the dotted line in fig. 4 indicates that step 416 may be asynchronous). In some embodiments, the command engine may learn via synchronization that the corresponding offload operation has been performed by the offload engine 464, and may proceed directly to step 418 without the offload engine 464 having to issue the completion notification of step 416.
At step 418, the command engine 462 notifies the synchronization engine 460 that the corresponding offload operation has been completed.
At step 420, synchronization engine 460 notifies the program core 450 of the end of the wait associated with the synchronization point (e.g., "Sync Counter #1") so that the program core 450 continues to execute code 452.
The method 400 described above comprises three main stages: synchronizing before offloading tasks outward, performing the reverse offload operation, and notifying the program core to continue executing subsequent code when the offload operation is completed. In the stage of performing the reverse offload operation, the offload operation must be handed to the appropriate offload engine for execution. In the method 400, the synchronization engine is solely responsible for synchronization, while the command engine parses the offload operation command packet to determine which offload engine the corresponding offload operation should be handed to.
Further embodiments of a method 500 for offloading data are described below in conjunction with fig. 5 and 6. Fig. 5 illustrates a flow diagram of a method 500 for offloading data, in accordance with further embodiments of the present disclosure. It should be understood that the method 500 may be performed, for example, at the computing device 200 depicted in FIG. 2, and that it may include additional acts not shown and/or omit illustrated acts, as the scope of the present disclosure is not limited in this respect.
At step 502, the command engine sends synchronization operation information defining an offload operation to offload data from the graphics processor, the synchronization operation information indicating at least a synchronization point and an offload operation type associated with the synchronization point and target data for the offload operation. For example, as shown in FIG. 6, the command engine 662 sends the code 644 of the program core and the synchronization operation information 642, the code 644 configured with one or more synchronization points.
The synchronization operation information is, for example, a synchronization operation table generated by a compiler. The information indicated by the synchronization operation table is, for example, as shown in Table One above and is not repeated here.
The offload operation type indicates, for example, which offload engine should perform the corresponding reverse offload operation. The offload operation type includes, for example, an offload engine identification and an offload operation number identification. The offload engine identification defines which offload engine the corresponding offload operation associated with the synchronization point should be handed to for execution. The target data of the offload operation indicates the data used by the corresponding offload operation.
At step 504, a synchronization engine is notified in response to determining that a synchronization instruction corresponding to the synchronization point is triggered. For example, as shown in FIG. 6, the program core 640 executes code 644 to trigger a synchronization instruction corresponding to a synchronization point; when it is determined that the synchronization instruction corresponding to the synchronization point is triggered, the synchronization engine 660 is notified.
At step 506, the synchronization engine obtains, based on the synchronization point, the offload operation type and target data for the offload operation associated with the synchronization point in the synchronization operation information. For example, in FIG. 6, the synchronization engine 660 searches the synchronization operation table for the offload operation type and the target data of the offload operation corresponding to the reverse offload operation of the synchronization point (e.g., "Sync Counter # 1") according to the synchronization point (e.g., "Sync Counter # 1").
At step 508, the synchronization engine determines, based on the offload operation type, an offload engine for performing the corresponding offload operation associated with the synchronization point. For example, the synchronization engine 660 determines, among the plurality of offload engines 664, one offload engine for performing the corresponding offload operation associated with the synchronization point based on the offload engine identification and the offload operation number identification, the plurality of offload engines including, for example, at least two of the components in the graphics processor (e.g., the DMA/SDMA controller, the level-two cache, etc.) and external components (e.g., the central processor, the application-specific integrated circuit, the field-programmable gate array, etc.).
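Steps 506 and 508 amount to a table lookup followed by the same engine-identification decoding used in method 300. The sketch below models that path in C++ under the same assumed 8/24-bit layout; the types and names are illustrative only.

    #include <cstdint>
    #include <unordered_map>
    #include <utility>

    // One row of the synchronization operation table, keyed by sync point.
    struct SyncTableEntry {
        std::uint32_t op_type;      // offload engine id + operation number
        std::uint32_t target_data;  // identifier of the operation's target data
    };

    class SyncEngineDispatch {
    public:
        explicit SyncEngineDispatch(std::unordered_map<std::uint32_t, SyncTableEntry> t)
            : table_(std::move(t)) {}

        // Step 506: look up the entry for the sync point.
        // Step 508: derive the offload engine from the operation type.
        std::uint8_t engineFor(std::uint32_t point) const {
            const SyncTableEntry& e = table_.at(point);
            return static_cast<std::uint8_t>(e.op_type >> 24);
        }
    private:
        std::unordered_map<std::uint32_t, SyncTableEntry> table_;
    };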
With the above means, all modules inside the GPU, as well as external modules, can participate concurrently in the execution logic of GPU code tasks, and the CUs, the GPU's principal computing units, can concentrate on the pure computation tasks the processor is best at. This improves concurrency, reduces execution time, significantly reduces memory bandwidth occupation, and improves execution efficiency.
In addition, in the above scheme, the synchronization engine both performs synchronization and determines which offload engine the reverse offload operation is handed to, without the command engine participating in that determination. The present disclosure therefore achieves higher execution efficiency and can handle more complex offload operations more conveniently and flexibly.
In some embodiments, the method 500 further comprises: determining, by the command engine, that the corresponding offload operation is performed via synchronization or the determined offload engine notification; and the synchronization engine notifies the program core of the end of the wait associated with the synchronization point so that the program core continues to execute code.
Fig. 6 shows a schematic diagram of a method 600 for offloading data, in accordance with further embodiments of the present disclosure. It should be understood that method 600 may include additional acts not shown and/or may omit illustrated acts, as the scope of the present disclosure is not limited in this respect.
At step 602, command engine 662 reads the code 644 of the program core and a synchronization operation information table 642 from a compiler (not shown) and sends them to the program core 640 to initiate execution of the code 644, the synchronization operation information table 642 indicating a synchronization point (e.g., "Sync Counter #1"), the offload operation type associated with the synchronization point, and the target data of the offload operation.
At step 604, the program core executes code 644 to trigger a synchronization instruction corresponding to the synchronization point, notifying the synchronization engine 660 while waiting at the code.
At step 606, the synchronization engine 660 looks up, in the synchronization operation information table, the offload operation type and the target data of the offload operation associated with the synchronization point (e.g., "Sync Counter #1").
At step 608, the synchronization engine 660 selects, according to the offload operation type associated with the synchronization point in the synchronization operation information table, an offload engine (e.g., the first offload engine 664-1) for performing the corresponding offload operation from the plurality of offload engines 664 (e.g., the first offload engine 664-1 and the second offload engine 664-2 through the Nth offload engine 664-N); the selected offload engine notifies the synchronization engine 660 after completing the corresponding offload operation (step 610).
At step 612, the synchronization engine notifies the program core of the end of the wait associated with the synchronization point (e.g., "Sync Counter #1") so that the program core continues to execute code 644. For example, the program core continues to execute subsequent code until a synchronization instruction corresponding to the next synchronization point (e.g., "Sync Counter #2") is triggered.
In the method 600, the synchronization engine performs synchronization and determines the offload engine that performs the reverse offload, without the command engine participating in that determination; this is more efficient and can handle more complex offload operations more conveniently and flexibly.
FIG. 7 illustrates a flow diagram of a method 700 for offloading data, in accordance with further embodiments of the present disclosure. It should be understood that method 700 may include additional acts not shown and/or may omit illustrated acts, as the scope of the present disclosure is not limited in this respect.
At step 702, a device vendor identification and a device identification of a computing device are obtained, the computing device including a graphics processor and a central processor.
At step 704, if it is determined that the device vendor identification is the predetermined identification and the device identification belongs to the first predetermined set, the method 300 for offloading data is performed to offload data from the graphics processor.
At step 706, if it is determined that the device vendor identification is the predetermined identification and the device identification belongs to the second predetermined set, the method 500 for offloading data is performed to offload data from the graphics processor.
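Method 700's selection logic reduces to two set-membership tests. In the C++ sketch below, the vendor identification value and the contents of the two device-identification sets are invented placeholders; the patent states only that both sets belong to a single predetermined vendor.

    #include <cstdint>
    #include <unordered_set>

    enum class OffloadMethod {
        CommandEngineParses,  // method 300: command engine parses the packet
        SyncEngineParses,     // method 500: synchronization engine uses the table
        None
    };

    OffloadMethod selectOffloadMethod(std::uint16_t vendor_id, std::uint16_t device_id) {
        constexpr std::uint16_t kVendor = 0x1234;  // assumed predetermined identification
        static const std::unordered_set<std::uint16_t> kFirstSet  = {0x0001, 0x0002};
        static const std::unordered_set<std::uint16_t> kSecondSet = {0x0010, 0x0011};
        if (vendor_id != kVendor) return OffloadMethod::None;
        if (kFirstSet.count(device_id))  return OffloadMethod::CommandEngineParses;
        if (kSecondSet.count(device_id)) return OffloadMethod::SyncEngineParses;
        return OffloadMethod::None;
    }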
With this technique, a method for offloading data that matches the hardware configuration of the computing device can be selected, which further improves the execution efficiency of the computing device and reduces power consumption.
The various processes described above, such as methods 300 through 700, may be performed at a computing device. The computing device includes, for example: at least one processor (at least one graphics processor and at least one central processor); and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor. In some embodiments, the methods 300 through 700 may be implemented as a computer software program tangibly embodied on a machine-readable medium. In some embodiments, part or all of the computer program may be loaded and/or installed onto the computing device via ROM and/or a communication unit. When the computer program is loaded into RAM and executed by the GPU and CPU, one or more of the acts of the methods 300 through 700 described above may be performed.
The present disclosure may be methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a central processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the central processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments do not limit the scope of protection of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (13)

1. A method for offloading data, comprising:
sending a command packet to a command queue, the command packet defining an offload operation to offload data from a graphics processor, the command packet indicating at least a synchronization point and an offload operation command packet associated with the synchronization point;
notifying a synchronization engine in response to a synchronization instruction corresponding to the synchronization point being triggered;
the synchronization engine notifying a command engine that the synchronization point has been reached for the command engine to obtain an offload operation command packet associated with the synchronization point from the command queue; and
the command engine parsing the offload operation command packet associated with the synchronization point to determine an offload engine for performing a corresponding offload operation of the offload operation command packet, wherein the determined offload engine for performing the corresponding offload operation is one of a component of the graphics processor and a central processor.
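For illustration only, here is a minimal Python sketch of the notification flow in claim 1, assuming each engine can be modeled as a plain object. The names OffloadPacket, CommandQueue, CommandEngine, and SyncEngine are hypothetical and do not appear in the patent, and the resolver that maps parsed packet fields to an engine is deliberately left abstract (claim 5 refines it).

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class OffloadPacket:
        sync_point: int      # synchronization point the packet is associated with
        engine_id: str       # offload engine identification
        op_number: int       # offload operation number identification
        target_data: bytes   # target data to offload from the graphics processor

    class CommandQueue:
        def __init__(self):
            self._packets = deque()

        def send(self, packet):
            # "sending a command packet to a command queue"
            self._packets.append(packet)

        def take_for(self, sync_point):
            # retrieve the offload operation command packet associated with a synchronization point
            for packet in list(self._packets):
                if packet.sync_point == sync_point:
                    self._packets.remove(packet)
                    return packet
            raise KeyError(f"no packet for synchronization point {sync_point}")

    class CommandEngine:
        def __init__(self, queue, resolve_engine):
            self.queue = queue
            self.resolve_engine = resolve_engine  # maps parsed packet fields to an offload engine

        def on_sync_point(self, sync_point):
            packet = self.queue.take_for(sync_point)
            engine = self.resolve_engine(packet.engine_id, packet.op_number)
            engine(packet.target_data)  # perform the corresponding offload operation

    class SyncEngine:
        def __init__(self, command_engine):
            self.command_engine = command_engine

        def notify(self, sync_point):
            # invoked when the synchronization instruction for this point is triggered
            self.command_engine.on_sync_point(sync_point)

    # usage: one packet bound to synchronization point 0, offloaded when that point is reached
    queue = CommandQueue()
    command_engine = CommandEngine(
        queue, lambda eid, num: lambda data: print(eid, num, "offloads", len(data), "bytes"))
    sync_engine = SyncEngine(command_engine)
    queue.send(OffloadPacket(sync_point=0, engine_id="dma", op_number=0, target_data=b"\x00" * 64))
    sync_engine.notify(0)

The completion path of claim 2 would extend this sketch with notifications flowing back from the offload engine to the command engine, then to the synchronization engine, and finally to the waiting program core.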
2. The method of claim 1, further comprising:
the command engine determining, via a notification from the synchronization engine or from the determined offload engine, that the corresponding offload operation has been performed;
the command engine notifying the synchronization engine that the corresponding offload operation has been completed; and
the synchronization engine notifying the program core that the wait associated with the synchronization point has ended, so that the program core continues to execute code.
3. The method of claim 1, wherein, before notifying the synchronization engine in response to the synchronization instruction corresponding to the synchronization point being triggered, the method further comprises:
executing code by the program core to trigger the synchronization instruction corresponding to the synchronization point, wherein the number and order of synchronization points configured in the code correspond to the number and order of synchronization points indicated by the command packet.
4. The method of claim 1, wherein the command packet is generated by a driver based on information provided by a compiler, the offload operation command packet associated with the synchronization point indicating an offload operation type associated with the synchronization point and target data for an offload operation, the offload operation type comprising: an offload engine identification and an offload operation number identification.
5. The method of claim 4, wherein determining an offload engine for performing a corresponding offload operation of the offload operation command packet comprises:
the command engine determining, from among a plurality of offload engines, an offload engine for performing the corresponding offload operation of the offload operation command packet, based on the offload engine identification and the offload operation number identification parsed from the offload operation command packet.
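As a purely illustrative reading of claim 5, the two identifications can act as a composite key into a registry of offload engines. The registry below is a hypothetical sketch, not the patent's data structure; in hardware the "registry" would be selection logic rather than a dictionary.

    # hypothetical registry keyed by (offload engine identification, offload operation number identification)
    OFFLOAD_ENGINES = {
        ("dma", 0): lambda data: print("DMA engine: copy", len(data), "bytes to host memory"),
        ("dma", 1): lambda data: print("DMA engine: copy", len(data), "bytes to a peer device"),
        ("cpu", 0): lambda data: print("central processor: post-process", len(data), "bytes"),
    }

    def select_offload_engine(engine_id, op_number):
        # the composite key determines exactly one engine among the plurality
        return OFFLOAD_ENGINES[(engine_id, op_number)]

    select_offload_engine("cpu", 0)(b"\x01" * 16)  # -> central processor: post-process 16 bytes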
6. A method for offloading data, comprising:
a command engine sending synchronization operation information, the synchronization operation information defining an offload operation to offload data from a graphics processor, the synchronization operation information indicating at least a synchronization point, an offload operation type associated with the synchronization point, and target data of the offload operation;
notifying a synchronization engine in response to determining that a synchronization instruction corresponding to the synchronization point is triggered;
the synchronization engine acquiring, based on the synchronization point, the offload operation type and the target data of the offload operation associated with the synchronization point from the synchronization operation information; and
the synchronization engine determining, based on the offload operation type, an offload engine for performing a corresponding offload operation associated with the synchronization point, wherein the determined offload engine for performing the corresponding offload operation is one of a component of the graphics processor and a central processor.
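A hedged sketch of this second method, in the same hypothetical Python style as the sketch after claim 1: here the synchronization engine consults a synchronization operation table directly (claim 9 suggests a compiler could generate such a table), so no command packet is fetched at synchronization time. The table layout and all names are assumptions.

    # hypothetical synchronization operation table, as a compiler might emit it:
    # synchronization point -> (offload engine identification, operation number, target data)
    SYNC_OPERATION_TABLE = {
        1: ("dma", 0, b"\x00" * 128),
        2: ("cpu", 0, b"\x01" * 32),
    }

    ENGINES = {
        ("dma", 0): lambda data: print("DMA engine offloads", len(data), "bytes"),
        ("cpu", 0): lambda data: print("central processor offloads", len(data), "bytes"),
    }

    def on_sync_instruction(sync_point):
        # the synchronization engine itself looks up the offload operation type and target data ...
        engine_id, op_number, target_data = SYNC_OPERATION_TABLE[sync_point]
        # ... and determines the offload engine, without fetching a packet from the command queue
        ENGINES[(engine_id, op_number)](target_data)

    on_sync_instruction(1)
    on_sync_instruction(2)

Compared with the command-packet method, the determination moves from the command engine to the synchronization engine, which is what distinguishes claims 6-10 from claims 1-5.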
7. The method of claim 6, further comprising:
the command engine determining, via a notification from the synchronization engine or from the determined offload engine, that the corresponding offload operation has been performed; and
the synchronization engine notifying the program core that the wait associated with the synchronization point has ended, so that the program core continues to execute code.
8. The method of claim 6, further comprising:
the command engine sending code of a program core, the code being configured with one or more synchronization points; and
the program core executing the code to trigger a synchronization instruction corresponding to the synchronization point.
9. The method of claim 6, wherein the synchronization operation information is a synchronization operation table generated by a compiler, the offload operation type comprising: an offload engine identification and an offload operation number identification.
10. The method of claim 9, wherein the synchronization engine determining, based on the offload operation type, an offload engine for performing a corresponding offload operation associated with the synchronization point comprises:
the synchronization engine determining, based on the offload engine identification and the offload operation number identification, one offload engine from among a plurality of offload engines for performing the corresponding offload operation associated with the synchronization point.
11. A method for offloading data, comprising:
acquiring a device vendor identification and a device identification of a computing device, wherein the computing device comprises at least a graphics processor and a central processor;
performing the method of any of claims 1-5 for offloading data from the graphics processor if it is determined that the device vendor identification is a predetermined identification and the device identification belongs to a first predetermined set; and
performing the method of any of claims 6-10 for offloading data from the graphics processor if it is determined that the device vendor identification is a predetermined identification and the device identification belongs to a second predetermined set.
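A minimal sketch of the dispatch in claim 11, assuming PCI-style numeric identifiers; the vendor identification, the two device-identification sets, and the returned path names are placeholders, since the patent does not disclose concrete values.

    # hypothetical device-based dispatch; the vendor identification and the two
    # device-identification sets are made-up values, not taken from the patent
    PREDETERMINED_VENDOR = 0x1EFF
    FIRST_PREDETERMINED_SET = {0x9000, 0x9001}    # devices using the command-packet method (claims 1-5)
    SECOND_PREDETERMINED_SET = {0x9100, 0x9101}   # devices using the synchronization-table method (claims 6-10)

    def choose_offload_method(vendor_id, device_id):
        if vendor_id != PREDETERMINED_VENDOR:
            raise ValueError("device vendor identification is not the predetermined identification")
        if device_id in FIRST_PREDETERMINED_SET:
            return "command-packet offload method"
        if device_id in SECOND_PREDETERMINED_SET:
            return "synchronization-table offload method"
        raise ValueError("device identification is not in a predetermined set")

    print(choose_offload_method(0x1EFF, 0x9000))   # -> command-packet offload method
    print(choose_offload_method(0x1EFF, 0x9101))   # -> synchronization-table offload method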
12. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
13. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.
CN202011470268.3A 2020-12-15 2020-12-15 Method, computing device, and computer-readable storage medium for offloading data Active CN112231018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011470268.3A CN112231018B (en) 2020-12-15 2020-12-15 Method, computing device, and computer-readable storage medium for offloading data

Publications (2)

Publication Number Publication Date
CN112231018A CN112231018A (en) 2021-01-15
CN112231018B true CN112231018B (en) 2021-03-16

Family

ID=74124197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011470268.3A Active CN112231018B (en) 2020-12-15 2020-12-15 Method, computing device, and computer-readable storage medium for offloading data

Country Status (1)

Country Link
CN (1) CN112231018B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114168078A (en) * 2021-12-09 2022-03-11 中国建设银行股份有限公司 Data unloading method, device, system, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101025717A (en) * 2005-10-18 2007-08-29 威盛电子股份有限公司 Method and system for deferred command issuing in a computer system
CN108027804A (en) * 2015-09-23 2018-05-11 甲骨文国际公司 On piece atomic transaction engine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10178619B1 (en) * 2017-09-29 2019-01-08 Intel Corporation Advanced graphics power state management

Similar Documents

Publication Publication Date Title
US10877766B2 (en) Embedded scheduling of hardware resources for hardware acceleration
US10078879B2 (en) Process synchronization between engines using data in a memory location
CN104126179B (en) Method and apparatus for inter-core communication in multi-core processors
US10402223B1 (en) Scheduling hardware resources for offloading functions in a heterogeneous computing system
US10089263B2 (en) Synchronization of interrupt processing to reduce power consumption
US10481957B2 (en) Processor and task processing method therefor, and storage medium
CN104683472A (en) Data transmission method capable of supporting large data volume
CN112231018B (en) Method, computing device, and computer-readable storage medium for offloading data
CN110688160B (en) Instruction pipeline processing method, system, equipment and computer storage medium
CN110908716A (en) Method for implementing vector aggregation loading instruction
JP2018512661A (en) Shadow command ring for graphics processor virtualization
CN110119375B (en) Control method for linking multiple scalar cores into single-core vector processing array
CN115129480B (en) Scalar processing unit and access control method thereof
US10073810B2 (en) Parallel processing device and parallel processing method
CN116243983A (en) Processor, integrated circuit chip, instruction processing method, electronic device, and medium
CN111857830B (en) Method, system and storage medium for designing path for forwarding instruction data in advance
US9323575B1 (en) Systems and methods for improving data restore overhead in multi-tasking environments
CN116635829A (en) Compressed command packets for high throughput and low overhead kernel initiation
JP5630798B1 (en) Processor and method
JP2012203911A (en) Improvement of scheduling of task to be executed by asynchronous device
US20140201505A1 (en) Prediction-based thread selection in a multithreading processor
CN112486638A (en) Method, apparatus, device and storage medium for executing processing task
US10776139B2 (en) Simulation apparatus, simulation method, and computer readable medium
CN113703835B (en) High-speed data stream processing method and system based on multi-core processor
EP4195036A1 (en) Graph instruction processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: Room 0106-508, 1st floor, No.26, shangdixin Road, Haidian District, Beijing 100085
Patentee after: Beijing Bilin Technology Development Co.,Ltd.
Country or region after: China
Patentee after: Shanghai Bi Ren Technology Co.,Ltd.
Address before: Room 0106-508, 1st floor, No.26, shangdixin Road, Haidian District, Beijing 100085
Patentee before: Beijing Bilin Technology Development Co.,Ltd.
Country or region before: China
Patentee before: Shanghai Bilin Intelligent Technology Co.,Ltd.