CN111258574A - Programming method and system for accelerator architecture - Google Patents

Programming method and system for accelerator architecture

Info

Publication number
CN111258574A
Authority
CN
China
Prior art keywords
data
computing
output
input
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010038212.4A
Other languages
Chinese (zh)
Other versions
CN111258574B (en)
Inventor
鄢贵海 (Yan Guihai)
吴婧雅 (Wu Jingya)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yusur Technology Co ltd
Original Assignee
Yusur Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yusur Technology Co ltd filed Critical Yusur Technology Co ltd
Priority to CN202010038212.4A priority Critical patent/CN111258574B/en
Publication of CN111258574A publication Critical patent/CN111258574A/en
Application granted granted Critical
Publication of CN111258574B publication Critical patent/CN111258574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/35 Creation or generation of source code model driven

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The application provides a programming method and system for an accelerator architecture. The method comprises the following steps: inputting a structured data object into an input data channel; splitting the structured data object in the input data channel into a plurality of input atomic data elements executable by a computing core, and forming a working group containing a first number of input atomic data elements based on the number of computing cores in the accelerator device capable of executing the computing task and the data dependency relationships between those cores; performing, by the computing cores capable of executing the computing task, operations on all input atomic data elements in the working group to obtain a group of output atomic data elements; and obtaining an output structured data object based on the output atomic data elements, and outputting it through an output data channel connected with the computing core. Embodiments of the invention can use the hardware resources of a dedicated accelerator more efficiently and improve the execution efficiency of computing tasks.

Description

Programming method and system for accelerator architecture
Technical Field
The present application relates to programming methods and architecture design for computer systems, and more particularly to programming methods and systems for accelerator architectures that support dedicated computing cores.
Background
In the big data era, with the development of technologies such as the Internet of Things, cloud computing, and the Internet, the data generated in various application scenarios keeps growing and accumulating at an astonishing scale. According to a report by the International Data Corporation (IDC), the total amount of data worldwide would exceed 35 ZB by 2020. The information contained in big data can greatly support scientific research and industrial development. For data-centric applications, how to quickly mine useful information from massive data has become the central challenge of big data applications.
As transistor technology advances, the quantum tunneling effect becomes an unavoidable limitation, and in recent years academia and industry have come to recognize that Moore's law is about to fail. To address the computing-power demands of big data, research on high-performance computing is gradually turning to more efficient dedicated parallel computing architectures, such as semi-custom Field-Programmable Gate Arrays (FPGAs); Graphics Processing Units (GPUs) designed to perform complex mathematical and geometric calculations; Google's Tensor Processing Unit (TPU) and Neural Network Processors (NNUs) for accelerating machine learning algorithms; and IBM's BLU accelerator and other Application-Specific Integrated Circuits (ASICs) that accelerate database operations. Such dedicated computing architectures are combined with a CPU to form a heterogeneous system, whose basic structure comprises a CPU host and an accelerator device (such as a GPU, FPGA, or ASIC).
The execution efficiency of an application on a heterogeneous system depends not only on these hardware resources supporting dedicated computing, but is also limited by the programming model of the heterogeneous system. An efficient and flexible programming model can markedly improve the utilization of hardware resources and thereby the execution efficiency of computing tasks. Programming models oriented to heterogeneous systems mainly fall into the following categories: the Open Computing Language (OpenCL), the Compute Unified Device Architecture (CUDA), and Open Accelerators (OpenACC), among others. OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems, as well as a unified programming environment, and is widely applicable to CPUs, GPUs, FPGAs, digital signal processors, and other dedicated accelerator devices. CUDA is a general-purpose parallel computing platform and programming model oriented to GPUs, implemented as an extension of the C language. OpenACC aims to make code portable across general-purpose parallel processors, including heterogeneous CPU-GPU systems, while also supporting multi-core CPU systems.
Heterogeneous programming models provide an effective platform for using heterogeneous accelerators, allowing rich hardware computing and storage resources to be used effectively during programming. However, these programming models still have problems in practical applications.
CUDA exposes the memory hierarchy of the GPU and defines a hierarchical abstraction of threads. However, using the hardware resources efficiently requires deep knowledge of the GPU memory structure and thread management, which makes optimization and application difficult; moreover, CUDA supports only a single family of accelerator devices, namely NVIDIA GPUs, so its range of use is limited.
OpenACC is a simpler programming model for CPU-GPU heterogeneous systems; it hides inter-chip communication and low-level operations on the processor's computing and storage resources, enabling cross-platform reuse without code changes. However, OpenACC requires a complex, highly general compiler; furthermore, even with such a compiler, the OpenACC programming model supports only the NVIDIA and AMD families of GPU accelerator devices and cannot support custom dedicated accelerators.
OpenCL supports both task parallelism and data parallelism and can target different types of accelerator hardware. However, this programming model is relatively low-level: it requires the programmer to have some knowledge of the internal structure of the target processor, and optimizing and reusing source code is costly. Using OpenCL to scale up computing applications is therefore quite difficult.
How to use the hardware resources of a dedicated accelerator more efficiently and improve the execution efficiency of computing tasks is a problem that remains to be solved.
Disclosure of Invention
Accordingly, the present invention is directed to a data-driven programming method and system that overcome one or more of the problems of the conventional programming models described above.
In one aspect of the present invention, a method for programming an accelerator architecture is provided, the method comprising the steps of:
inputting a structured data object into an input data channel;
splitting the structured data object in the input data channel into a plurality of input atomic data elements executable by a computing core, and forming a working group containing a first number of input atomic data elements based on the number of computing elements corresponding to the computing cores in the accelerator device that can execute the computing task;
performing, by the computing cores capable of executing the computing task, operations on all input atomic data elements in the working group to obtain a group of output atomic data elements;
and obtaining an output structured data object based on the output atomic data elements, and outputting the output structured data object through an output data channel connected with the computing core.
Optionally, the structured data object is a primitive that can be directly manipulated by an accelerator; the input atomic data element is the smallest data unit of a computational task executed by a computational core.
Optionally, the method adopts pipeline operation, and different computing tasks correspond to different input atomic data elements in the working group and are executed in parallel by the computing core.
Optionally, in a case where there is no data dependency between the computing cores within the accelerator device that can execute the computing task, the first number is equal to the number of computing cores within the accelerator device that can execute the computing task.
Optionally, among the computing cores in the accelerator device that can execute the computing task, a plurality of computing cores that are data-dependent on one another correspond to one computing element, and each computing core without data dependency corresponds to one computing element; the first number is the same as the number of computing elements corresponding to the computing cores in the accelerator device that can execute the computing task.
Optionally, the first number is 1, and the computing cores that can execute the computing task in the accelerator device execute the computing task in series.
Optionally, the obtaining of an output structured data object based on the output atomic data elements, output via an output data channel connected to a computing core, includes: buffering the obtained output atomic data elements, packing them to obtain the output structured data object, and outputting the output structured data object through the output data channel connected with the computing core.
Optionally, the data dimensions of the input data channel and the output data channel are the same or different.
Another aspect of the present invention also provides a programming system comprising a processor and a memory, the memory having stored therein computer instructions, the processor being configured to execute the computer instructions stored in the memory, the system implementing the steps of the method as described above when the computer instructions are executed by the processor.
Yet another aspect of the invention provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above.
The programming method and system hide the operation of the underlying hardware architecture and drive the programming of the heterogeneous system with more intuitive computing tasks and computing data, which greatly improves programming efficiency and simplifies how programmers use a dedicated accelerator. Because a data-driven execution mode is adopted, complex control costs need not be considered during program execution, and the programming method provided by embodiments of the invention can use the hardware resources of the dedicated accelerator more efficiently and improve the execution efficiency of computing tasks.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments described in the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a program interface diagram corresponding to the programming method of the present invention.
Fig. 2 is a diagram showing a data structure supported by program execution in the embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating an implementation process of the programming method according to the embodiment of the present invention.
FIG. 4 is a flow chart illustrating a programming method according to an embodiment of the present invention.
FIG. 5 illustrates an example of a task execution process that does not employ pipelining in one embodiment of the invention.
FIG. 6 shows an example of a task execution process that employs pipelining in one embodiment of the invention.
Fig. 7 is a schematic diagram illustrating data structures of input channels and output channels in a video decoding process according to an embodiment of the present invention.
Fig. 8 is a diagram illustrating data structures of input atomic data elements and output atomic data elements in a video decoding process according to an embodiment of the present invention.
Fig. 9 is a schematic diagram illustrating an encoding flow of a video decoding process according to an embodiment of the present invention.
Detailed Description
In order to help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components. The terms "first" and "second," when used to modify a feature, element, step, or component, are used solely to distinguish one from another without necessarily implying any order or order between such features, elements, steps, or components, unless otherwise indicated herein.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
In researching the architecture design of dedicated accelerators, the inventors found that the problems of existing programming models arise either because the algorithm description language, the program interface design, and the program execution mode of the model are too low-level, or because the program interface design and the compiler are too complex. While lower-level programming methods give experienced programmers more flexible operation and a wider optimization space, they pose real difficulties for programmers who know the application well but the architecture less. In fact, optimization of the underlying architecture, such as management of program execution and storage, can be achieved through a compiler or other low-level interface and need not be completely exposed to programmers; yet complex compiler interfaces complicate programming.
Through research on how dedicated accelerators are used in practice, the inventors found that these problems can be solved by the accelerator-oriented data-driven programming method designed in this invention. Data-driven means that, for a computing task, the order of execution is determined by the validity of the input data. Data-driven computing is an architecture that differs from the traditional von Neumann architecture in that it needs neither program pointers nor a process-scheduling mechanism. The execution order depends on the interdependencies between data and on the validity of the input data (instruction operands): a computation can proceed as soon as its data arrive. Embodiments of the present invention use this non-traditional computing structure to increase the processing power of the system. The data-driven programming method provided by the invention offers a simple program interface and can greatly simplify programming. Starting from the application and its computed data, it considers directly the computing tasks the program must complete; it can use the accelerator's hardware resources effectively, optimize the execution efficiency of computing tasks, and, by supporting dedicated accelerators with different architectures, greatly improve the utilization of hardware computing resources.
To implement the programming method of the present invention, the embodiment of the present invention provides a program interface as shown in fig. 1. The program interface comprises an input data channel (input channel) 101, an output data channel (output channel) 102, and a computing core (kernel) 103. The input data channel 101 and the output data channel 102 are used by the programmer to input and output the data that the accelerator is to compute, and both are connected to the computing core 103. That is, the data structure for inputting data is an input channel; after the user invokes a specific computing core in the program, the input data is fed to the currently invoked computing core, the computing core executes its specific operation, and the data structure holding the output result is an output channel. The program interface in embodiments of the present invention may target the C language, C++, Python, Java, and the like, but is not limited thereto; the implementations are provided as library functions for the different programming languages and linked into the program files.
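As an illustration only, a minimal C++ sketch of what such an interface could look like follows; the names Channel, Kernel, push, pop, and step are assumptions made for this sketch, not the actual API disclosed by this application.

```cpp
#include <queue>

// Hypothetical sketch of the three-part interface of fig. 1: an input data
// channel, an output data channel, and a compute core connecting the two.
// All names here are illustrative assumptions, not the disclosed API.
template <typename SDO>
struct Channel {                          // data structure for input/output data
    std::queue<SDO> data;
    void push(const SDO& sdo) { data.push(sdo); }
    SDO pop() { SDO s = data.front(); data.pop(); return s; }
    bool empty() const { return data.empty(); }
};

template <typename InSDO, typename OutSDO>
struct Kernel {                           // compute core invoked by the programmer
    Channel<InSDO>& in;
    Channel<OutSDO>& out;
    OutSDO (*op)(const InSDO&);           // the specific operation this core executes
    void step() { out.push(op(in.pop())); }  // consume one SDO, produce one SDO
};
```

In use, the programmer would push an SDO into the input channel, invoke the kernel, and read the resulting SDO from the output channel, mirroring steps S110 to S140 described below.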
In the embodiment of the present invention, the data structures in the input channel and the output channel of the provided program interface are Structured Data Objects (SDO), and an SDO supports partitioning into a finer-grained data structure: the input Atomic Data Element (ADE) and the output Atomic Data Element, shown as the shaded portions in fig. 2. An SDO is the basic data unit that the accelerator can operate on directly. In embodiments of the present invention, an SDO may be one or several columns of a database or part of a column, one or several frames of a video stream, or one or several data streams collected by a sensor in the Internet of Things. An ADE is the smallest data unit that starts a computing core executing a computing task: each time an ADE is input into a computing core, the core updates its output result by performing the operation, and ADEs are managed by the memory management unit in the dedicated accelerator. An input ADE may be one or several elements of a database column, one or several image blocks of a frame of a video stream, or the data stream within some time window acquired by a sensor in the Internet of Things. An output ADE is the result a computing core outputs after executing a computing task; it may be the elements or records of one or several database columns that satisfy a query condition, one or several encoded or decoded image blocks of a video frame, a particular data set of a time window acquired by a sensor in the Internet of Things, or a specific numerical result of a complex computation. The input and output atomic data elements may have different dimensions. For example, the input ADE may be a data block in matrix form while the output ADE is a numerical or vector solution; alternatively, the input ADE may be a high-dimensional data vector while the output ADE is a low-dimensional vector or numerical solution.
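As a concrete illustration of the SDO/ADE relationship in the video case, the sketch below splits a frame-sized SDO into fixed-size image-block ADEs; the Frame and Block types and the 16x16 block size are assumptions chosen for this sketch, not structures defined by the application.

```cpp
#include <vector>

// Sketch (assumed layout): splitting an SDO, here one video frame, into ADEs,
// each a fixed-size image block that a compute core accepts as its smallest
// unit of work.
struct Frame {                            // SDO: one frame of a video stream
    int width = 0, height = 0;
    std::vector<unsigned char> pixels;    // width * height, grayscale for simplicity
};

struct Block {                            // input ADE: one 16x16 image block
    static constexpr int kSize = 16;
    unsigned char data[kSize][kSize];
};

std::vector<Block> split_into_ades(const Frame& sdo) {
    std::vector<Block> ades;
    for (int y = 0; y + Block::kSize <= sdo.height; y += Block::kSize)
        for (int x = 0; x + Block::kSize <= sdo.width; x += Block::kSize) {
            Block b{};
            for (int r = 0; r < Block::kSize; ++r)
                for (int c = 0; c < Block::kSize; ++c)
                    b.data[r][c] = sdo.pixels[(y + r) * sdo.width + (x + c)];
            ades.push_back(b);
        }
    return ades;                          // each ADE can start one compute core
}
```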
Based on the above program interface, the embodiment of the present invention provides a schematic implementation of programming with this interface and these data structures, as shown in fig. 3. The input data channel 101 and the output data channel 102 are connected to the computing core 105 and are used, respectively, to feed data into the computing core and to carry data out of it; the data within the input data channel 101 comprises a structured data object that can be broken up into a plurality of input atomic data elements 103. An input atomic data element 103 is the smallest data unit that drives a computing core 105 in the accelerator device to perform an operation; the computing cores 105 in fig. 3 are the hardware computing resources of the dedicated accelerator; and the workgroup 106 holds one or more data structures that the current computing cores process when performing a computing task, each of which is an atomic data element ADE. In general, when there is no data dependency between computing cores, one computing core processes one ADE in the workgroup, and multiple computing cores can process multiple ADEs in the workgroup in parallel. Each computing core may also execute its computing tasks serially, in which case the workgroup may contain only one ADE. When data dependencies exist between some of the computing cores, the number of ADEs within the workgroup may be determined from those dependencies, as described later. The smallest unit of the output result produced by a computing core 105 is the output atomic data element 104, which is output through the output channel 102.
Based on the above program interface and schematic programming execution process, as shown in fig. 4, the programming method of the data-driven accelerator architecture of the embodiment of the present invention includes the following steps:
step S110, inputting the structured data object to the input data channel.
Step S120, splitting the structured data object in the input data channel into a plurality of input atomic data elements executable by the computing cores, and forming a working group containing a first number of input atomic data elements based on the number of computing elements corresponding to the computing cores in the accelerator device that can execute the computing task.
More specifically, in this step, the SDO in the data channel is divided, according to the size of data that the computing cores able to execute the computing task can process at once, into a plurality of ADEs executable by those cores; all or part of the ADEs form a workgroup, and all the available computing cores execute the computation of all the ADEs in the current workgroup in parallel. All the ADEs may be finished in a single execution, or several rounds of computation may be needed.
In an embodiment of the invention, the number of ADEs within a workgroup may be determined by the number of computing cores within the accelerator device and the data dependencies between those cores. The number of computing cores is fixed by the hardware design: for example, an accelerator may contain computing cores that can simultaneously execute 6 computing tasks, which may be of the same or different types depending on the accelerator's hardware design. In addition, there may or may not be data dependencies between the computing cores, i.e., the start of a later computing core may depend on the output of an earlier one. If there are data dependencies, the several interdependent computing cores may be regarded as one computing element; if not, each computing core may be regarded as one computing element; and the number of ADEs in the workgroup is determined by the number of actual computing elements.
Assuming no data dependencies exist between computing cores, the number of ADEs within a workgroup depends on the number of computing cores of the accelerator hardware. If an accelerator device includes 6 computing cores capable of executing the operation, then after the SDO is split into ADEs, 6 of them may form a workgroup, which is input to all the computing-core resources. That is, when there is no data dependency between the computing cores within the accelerator device that can execute the computing task, the number of such computing cores equals the number of ADEs in the workgroup, and each independent computing core can be regarded as one computing element. Multiple computing tasks with no computational or resource conflicts among them are completed in one batch execution by the computing cores: all the available computing cores concurrently perform the computations of all the ADEs within the current workgroup at once.
If dependencies exist between some computing cores, the size of the workgroup (the number of ADEs within it) depends on the requirements of the program itself. It can be understood that a chain of interdependent computing cores requires only 1 ADE as input, whereas for computing cores that execute in parallel without dependencies, as many ADEs are required as there are such cores on the hardware. In other words, a plurality of mutually dependent computing cores may be regarded as 1 computing element and each computing core without data dependency as 1 computing element; the computing cores are thus assigned to computing elements, and the number of ADEs in the workgroup is determined by the number of computing elements. All available computing cores then execute the computations of the current workgroup's ADEs in parallel, round after round, until all ADEs are computed. When an SDO is split into ADEs, the splitting rule depends on the design of the computing core itself, i.e., an ADE is the smallest data unit that starts one computing core executing the computing task.
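The sizing rule just described can be made concrete with a small sketch; the dependency encoding used here (a group id per core, with cores sharing an id being mutually dependent) is an assumption for illustration.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the workgroup-sizing rule described above. Cores that depend on
// one another collapse into a single "compute element", and the workgroup
// holds one ADE per compute element. dependency_group[i] is the id of the
// dependence chain that core i belongs to (an assumed encoding).
int workgroup_size(const std::vector<int>& dependency_group) {
    std::vector<int> elements;                    // distinct compute elements
    for (int g : dependency_group)
        if (std::find(elements.begin(), elements.end(), g) == elements.end())
            elements.push_back(g);
    return static_cast<int>(elements.size());     // one ADE per compute element
}
```

For example, with 6 cores where cores 0 to 3 form one dependent chain and cores 4 and 5 are independent, `workgroup_size({0, 0, 0, 0, 1, 2})` returns 3, matching the rule that the first number equals the number of computing elements; with no dependencies at all, `workgroup_size({0, 1, 2, 3, 4, 5})` returns 6, the number of cores.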
Step S130, a computation core capable of executing a computation task performs an operation on all input atomic data elements in the working group to obtain a group of output atomic data elements.
Each workgroup is the entire set of data that the computing cores within the accelerator device can execute at one time.
With different SDO data sizes, the number of operations the computing cores must execute differs correspondingly, i.e., the number of workgroups required for the overall computation differs. An SDO is split into a plurality of ADEs, several of which constitute one workgroup; an SDO may therefore contain several workgroups, and each computing core may perform several rounds of operations. For example, if an SDO splits into 24 ADEs and each workgroup holds 6 ADEs, the SDO contains 4 workgroups and each computing core executes 4 rounds. When operations are performed for multiple workgroups, each input workgroup (composed of different ADEs) yields a different output result.
Step S140, obtaining an output structured data object based on the output atomic data elements, and outputting it through an output data channel connected with the computing core.
More specifically, the output ADEs produced by the workgroups may be placed in an output-data buffer; the result SDO obtained after packing is output to the output data channel and returned to the CPU or temporarily stored in the accelerator device.
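A minimal sketch of this buffer-and-pack step follows, reusing the kinds of types assumed in the earlier sketches (an SDO modeled as a container constructible from a range of ADEs, and a channel with a push() method); the name emit_results and its parameters are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Sketch of step S140 under the structures assumed in the earlier sketches:
// output ADEs are buffered until the whole result is present, packed into one
// SDO, and pushed into the output data channel.
template <typename Sdo, typename Ade, typename Chan>
void emit_results(std::vector<Ade>& buffer, const std::vector<Ade>& group_results,
                  std::size_t expected_total, Chan& output_channel) {
    buffer.insert(buffer.end(), group_results.begin(), group_results.end());
    if (buffer.size() == expected_total) {      // all workgroups have finished
        Sdo sdo(buffer.begin(), buffer.end());  // pack the ADEs into one SDO
        output_channel.push(sdo);               // return to the CPU or keep on device
        buffer.clear();
    }
}
```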
The data-driven programming method supporting dedicated computing cores provided by this embodiment hides the operation of the underlying hardware architecture and drives the programming of the heterogeneous system with more intuitive computing tasks and computing data, which greatly improves programming efficiency and simplifies how programmers use the dedicated accelerator. Because a data-driven execution mode is adopted, complex control costs need not be considered during program execution, and the programming method provided by embodiments of the invention can use the hardware resources of the dedicated accelerator more efficiently and improve the execution efficiency of computing tasks.
In the embodiment of the invention, execution of the data within each workgroup is strictly synchronous: if any workgroup has not finished executing the current instruction, no other workgroup can start executing the next instruction. The computing cores, however, may run asynchronously; that is, execution between tasks can proceed directly in parallel, and different computing tasks may be executed by different computing cores at the same time. Therefore, the embodiment of the invention also provides a pipeline-based programming optimization: different computing tasks are treated as different pipelines, the work of different pipelines corresponds to different ADEs in the workgroup, and the tasks are executed in parallel by the computing cores.
Fig. 5 shows an example of a task execution process without pipelining according to an embodiment of the present invention, and fig. 6 shows an example that uses pipelining. The difference in how the three tasks execute with and without pipeline optimization can be seen by comparing fig. 5 and fig. 6.
Referring to fig. 5, assume 3 computing tasks are given, each comprising 4 different specific computing steps, i.e., 4 hardware computing cores are required, with a sequential dependency among the 4 cores: computing core 2 must wait until the computation of computing core 1 is finished before proceeding, and so on. Thus, when pipelining is not used, the three computing tasks execute sequentially, i.e., the second computing task can start only after the first has finished. In this case, the computing cores of each task operate on one ADE after another, i.e., for each task the workgroup contains one ADE.
However, since the three computing tasks are independent, with no data or resource conflicts, a pipelined mode of operation can be adopted: after the first computing core of the first task has executed, the first computing core of the second task may start, and so on. In this case the individual computing cores execute different tasks in parallel, and the workgroup contains a number of ADEs equal to the number of computing cores, as shown in fig. 6. Pipelined operation can markedly improve the execution efficiency of the computing tasks.
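The speed-up can be illustrated with a small timing model, assuming, as in figs. 5 and 6, 3 tasks and 4 dependent computing cores with unit latency per core; the model itself is an illustrative assumption, not part of this application.

```cpp
#include <algorithm>
#include <cstdio>

// Contrast of the two schedules of figs. 5 and 6. Without pipelining the
// tasks run back to back; with pipelining, core k of task t starts as soon
// as core k is free and task t has finished core k-1.
int main() {
    const int tasks = 3, cores = 4;

    // Sequential schedule: total time = tasks * cores time units.
    int sequential = tasks * cores;

    // Pipelined schedule: finish[t][k] = 1 + max(finish[t][k-1], finish[t-1][k]).
    int finish[3][4] = {};
    for (int t = 0; t < tasks; ++t)
        for (int k = 0; k < cores; ++k) {
            int prev_stage = (k > 0) ? finish[t][k - 1] : 0;
            int prev_task  = (t > 0) ? finish[t - 1][k] : 0;
            finish[t][k] = 1 + std::max(prev_stage, prev_task);
        }
    int pipelined = finish[tasks - 1][cores - 1];

    std::printf("sequential: %d units, pipelined: %d units\n", sequential, pipelined);
    // Prints: sequential: 12 units, pipelined: 6 units
    return 0;
}
```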
Example 1
This embodiment provides an example of the data-driven programming method supporting dedicated computing cores, applied to a video decoding task. The computing tasks processed by the computing cores include, but are not limited to: arithmetic decoding, motion compensation/prediction, inverse discrete cosine transform, inverse quantization, and block splicing, as in the sub-steps of step 2 in fig. 9. In this embodiment, the input and output data structures are organized according to the actual requirements of the application: the SDOs in the input data channel and the output data channel may be one or several frames of images, and fig. 7 may represent the data structure in either the input or the output data channel. The size and organization of the SDO data within the input and output channels are typically determined by the input and output of the algorithm itself.
Each input ADE processed may be one or several partitioned image blocks, and correspondingly the output ADE is also one or several image blocks; fig. 8 may represent the data structure of an input ADE or of an output ADE, and the shaded portions in fig. 8 may represent different ADEs. Note that the input and output data processed by each workgroup may have the same or different dimensions. For example, in one case an input ADE is a row of the currently input frame image, and the output ADE is that row with some pixels removed by the operation, so the dimension of the output ADE is reduced compared with that of the input ADE.
Referring to fig. 9, the execution of the video decoding task provided by the present embodiment includes the following steps:
step 1, calling a function statement of a special accelerator to transmit the prepared SDO to an input data channel.
Step 2, performing specific operations of video decoding, wherein the step further comprises several sub-steps: arithmetic decoding, motion compensation/prediction, inverse discrete cosine transform/inverse quantization and block splicing.
First, the input SDO is split based on the computing resources; the split may differ according to the input requirements of the algorithm and the amount of data processed at one time.
While these sub-steps execute, if no pipelined mode of operation is used, one workgroup contains one ADE; execution of the data is strictly synchronous, and if any workgroup has not finished executing the current instruction, no other workgroup can start executing the next instruction.
However, since there is no data dependency or conflict among these sub-steps, the sequentially executed computing cores can run in a pipelined fashion, and parallel computation of the sub-steps is achieved through scheduling of the computing cores.
According to fig. 9, a video decoding computing task includes 4 computing sub-tasks, i.e., arithmetic decoding, motion compensation/prediction, inverse discrete cosine transform/inverse quantization, and block splicing, so 4 computing cores may be required. The four sub-tasks form a serial chain: after sub-task 1 completes, its output is passed as input to sub-task 2, and so on. Each sub-task may be executed by one computing core.
For the example of fig. 9, without pipelining optimization, the input contains only one ADE at a time. Assuming 4 different computing cores are designed on the accelerator, all 4 can be started at once, but they are in a serial relationship: the 2nd computing core performs its computation, taking the output of the 1st core as input, only after the 1st core completes its computing task, and so on. In this case, each computing core requires one workgroup, containing one ADE.
If pipelined optimization is adopted, the 4 computing tasks can execute at the same time (except during the first and last few moments of task start and task end), which is equivalent to inputting 4 ADEs simultaneously as one workgroup; the number of ADEs input at one time is 4.
Step 3, output data channel: after the workgroups of the computing cores have finished computing the ADEs, the resulting output ADE data are repacked into an SDO and sent to the output data channel.
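A sketch of the four-stage decode flow of fig. 9 as chained computing cores follows; the stage functions are empty placeholders standing in for the real arithmetic-decoding, motion-compensation/prediction, IDCT/inverse-quantization, and block-splicing kernels, and the Block type is an assumption.

```cpp
#include <vector>

// Sketch of the four-stage decode flow of fig. 9 as chained compute cores.
using Block = std::vector<int>;              // one ADE (an image block), simplified

Block arithmetic_decode(const Block& b)  { return b; }   // placeholder stage 1
Block motion_compensate(const Block& b)  { return b; }   // placeholder stage 2
Block idct_inverse_quant(const Block& b) { return b; }   // placeholder stage 3
Block splice_blocks(const Block& b)      { return b; }   // placeholder stage 4

// Serial (non-pipelined) execution: each ADE flows through the four cores in
// order, matching the one-ADE-per-workgroup case described above; a pipelined
// version would instead overlap the four stages across consecutive ADEs.
std::vector<Block> decode_frame(const std::vector<Block>& input_ades) {
    std::vector<Block> output_ades;
    for (const Block& ade : input_ades)
        output_ades.push_back(
            splice_blocks(idct_inverse_quant(motion_compensate(arithmetic_decode(ade)))));
    return output_ades;                      // later repacked into the output SDO
}
```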
The programming method shown in fig. 9 hides the operation of the underlying hardware architecture, drives the programming of the heterogeneous system with more intuitive computing tasks and computing data, greatly improves programming efficiency, and simplifies programmers' use of the dedicated accelerator. Because a data-driven execution mode is adopted, complex control costs need not be considered during program execution, and the programming method provided by the invention can use the hardware resources of the dedicated accelerator more efficiently and improve the execution efficiency of computing tasks.
Accordingly, the present invention also provides a programming system, which may be implemented in a CPU or a GPU, comprising a processor and a memory for storing computer instructions, the processor being adapted to execute the computer instructions stored in the memory; the system implements the respective method steps described above when the computer instructions are executed by the processor.
The present disclosure also relates to storage media, which may be tangible storage media such as optical disks, U-disks, floppy disks, hard disks, etc., on which computer program code may be stored, which when executed may implement various embodiments of the method of the present invention.
It will be apparent to those skilled in the art that embodiments of the present application may provide a method, computer program product, and computer storage medium. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
Although the present application provides method steps as described in an embodiment or flowchart, additional or fewer steps may be included based on conventional or non-inventive efforts. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
The apparatuses or modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. The functionality of the modules may be implemented in the same one or more software and/or hardware implementations of the present application. Of course, a module that implements a certain function may be implemented by a plurality of sub-modules or sub-units in combination.
The methods, apparatus, or modules described herein may be implemented as computer-readable program code in a controller realized in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, Application-Specific Integrated Circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the memory's control logic. Those skilled in the art will also appreciate that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for performing the various functions may also be regarded as structures within the hardware component, or even as both software modules for performing the method and structures within the hardware component.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of programming an accelerator architecture, the method comprising the steps of:
inputting a structured data object into an input data channel;
splitting a structured data object in an input data channel into a plurality of input atomic data elements executable by a computational core, and forming a working group containing a first number of input atomic data elements based on the number of computational cores capable of executing computational tasks in an accelerator device and a data dependency relationship between the computational cores;
executing operation on all input atomic data elements in the working group by a computing core capable of executing the computing task to obtain a group of output atomic data elements;
and obtaining an output structured data object based on the output atomic data element, and outputting the output structured data object through an output data channel connected with the computing core.
2. The method of claim 1,
the structured data object is a primitive that can be directly manipulated by an accelerator;
the input atomic data element is the smallest data unit of a computational task executed by a computational core.
3. The method of claim 1, wherein the method employs pipelining, and wherein different computing tasks correspond to different input atomic data elements within a workgroup and are performed in parallel by the computing cores.
4. The method of claim 3,
when there are no data dependencies between the computing cores within the accelerator device, the first number is equal to the number of computing cores within the accelerator device that can execute the computing task.
5. The method of claim 3,
among the computing cores capable of executing the computing task in the accelerator device, a plurality of computing cores that are data-dependent on one another correspond to one computing element, and each computing core without data dependency corresponds to one computing element; the first number is the same as the number of computing elements corresponding to the computing cores capable of executing the computing task in the accelerator device.
6. The method of claim 1, wherein the first number is 1, and wherein the computing tasks are performed serially by respective computing cores within the accelerator device that are capable of performing the computing tasks.
7. The method of claim 1, wherein obtaining an output structured data object based on the output atomic data elements, for output via an output data channel connected to a computational core, comprises:
buffering the obtained output atomic data elements, packing them to obtain the output structured data object, and outputting the output structured data object through the output data channel connected with the computing core.
8. The method of claim 1, wherein the input data channel and the output data channel have the same or different data dimensions.
9. A programming system comprising a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory, the system implementing the steps of the method of any one of claims 1-8 when the computer instructions are executed by the processor.
10. A computer storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 8.
CN202010038212.4A 2020-01-14 2020-01-14 Programming method and system for accelerator architecture Active CN111258574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010038212.4A CN111258574B (en) 2020-01-14 2020-01-14 Programming method and system for accelerator architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010038212.4A CN111258574B (en) 2020-01-14 2020-01-14 Programming method and system for accelerator architecture

Publications (2)

Publication Number Publication Date
CN111258574A (en) 2020-06-09
CN111258574B CN111258574B (en) 2021-01-15

Family

ID=70946958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010038212.4A Active CN111258574B (en) 2020-01-14 2020-01-14 Programming method and system for accelerator architecture

Country Status (1)

Country Link
CN (1) CN111258574B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101238454A (en) * 2005-08-11 2008-08-06 Coresonic AB Programmable digital signal processor having a clustered SIMD microarchitecture including a complex short multiplier and an independent vector load unit
EP3029592A1 (en) * 2010-08-18 2016-06-08 Security First Corp. Systems and methods for securing virtual machine computing environments
US20130198489A1 (en) * 2012-01-30 2013-08-01 International Business Machines Corporation Processing element management in a streaming data system
CN103345461A (en) * 2013-04-27 2013-10-09 电子科技大学 Multi-core processor on-chip network system based on FPGA and provided with accelerator
US10261760B1 (en) * 2013-12-05 2019-04-16 The Mathworks, Inc. Systems and methods for tracing performance information from hardware realizations to models
US20160147516A1 (en) * 2014-11-24 2016-05-26 Mentor Graphics Corporation Execution of complex recursive algorithms
CN105843679A (en) * 2016-03-18 2016-08-10 西北工业大学 Adaptive many-core resource scheduling method
CN107065842A (en) * 2017-05-26 2017-08-18 宁波大学 A kind of fault detection method based on particle group optimizing core independent component analysis model
CN107203406A (en) * 2017-06-26 2017-09-26 西安微电子技术研究所 A kind of processing method of Based on Distributed storage organization
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108958789A (en) * 2018-05-20 2018-12-07 湖北九州云仓科技发展有限公司 A kind of parallel streaming calculation method, electronic equipment, storage medium and system
CN109408745A (en) * 2018-09-17 2019-03-01 国美网安科技有限公司 Web data analysis and processing method and device
CN109388609A (en) * 2018-09-30 2019-02-26 中科驭数(北京)科技有限公司 Based on the data processing method and device for accelerating core
CN109948784A (en) * 2019-01-03 2019-06-28 重庆邮电大学 A kind of convolutional neural networks accelerator circuit based on fast filtering algorithm
US20190370631A1 (en) * 2019-08-14 2019-12-05 Intel Corporation Methods and apparatus to tile walk a tensor for convolution operations

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LOPEZ-NOVOA et al.: "Kernel density estimation in accelerators", Journal of Supercomputing *
JIANG Shuhao et al.: "Quantitative evaluation and analysis of the approximability of machine learning algorithms", Journal of Computer Research and Development *
CHEN Ying et al.: "GPU-based parallel simulation of electromagnetic transients in large-scale distribution networks", Automation of Electric Power Systems *

Also Published As

Publication number Publication date
CN111258574B (en) 2021-01-15

Similar Documents

Publication Publication Date Title
US12014265B2 (en) Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism
US10990648B2 (en) System and method for an optimized winograd convolution accelerator
US20230053289A1 (en) Machine learning accelerator mechanism
US11481218B2 (en) System and method enabling one-hot neural networks on a machine learning compute platform
EP3506095A2 (en) Communication optimizations for distributed machine learning
EP3561738A1 (en) Machine learning accelerator architecture
US9727377B2 (en) Reducing the scan cycle time of control applications through multi-core execution of user programs
CN112740235A (en) Energy saving for neural network architecture with zero activation during inference
CN112292667B (en) Method and apparatus for selecting processor
KR101622266B1 (en) Reconfigurable processor and Method for handling interrupt thereof
Ben-Nun et al. Workflows are the new applications: Challenges in performance, portability, and productivity
US8799858B2 (en) Efficient execution of human machine interface applications in a heterogeneous multiprocessor environment
US11314515B2 (en) Instructions and logic for vector multiply add with zero skipping
Berezovskyi et al. Makespan computation for GPU threads running on a single streaming multiprocessor
US10318261B2 (en) Execution of complex recursive algorithms
US20220366007A1 (en) Performing matrix value indication
US20220398751A1 (en) Computing motion of pixels among images
CN112753016A (en) Management method and device for computing resources in data preprocessing stage in neural network
CN111258574B (en) Programming method and system for accelerator architecture
WO2023108894A1 (en) Compute-intensive kernel generator, micro-kernel code cache, fused kernel generator and cyclic dependence free graph partitioning for deep learning workloads
WO2023141937A1 (en) Tensor modification based on processing resources
US20230195519A1 (en) Low power inference engine pipeline in a graphics processing unit
Sakai et al. Towards automating multi-dimensional data decomposition for executing a single-GPU code on a multi-GPU system
Bharmal Real-time gpu scheduling with preemption support for autonomous mobile robots
Khurge Strategic Infrastructural Developments to Reinforce Reconfigurable Computing for Indigenous AI Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant