CN103927150B - Parallel runtime execution on multiple processors - Google Patents

Parallel runtime execution on multiple processors

Info

Publication number
CN103927150B
CN103927150B (application CN201410187203.6A)
Authority
CN
China
Prior art keywords
executable
processing unit
application program
computing
computer implemented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410187203.6A
Other languages
Chinese (zh)
Other versions
CN103927150A (en)
Inventor
Aaftab Munshi
Jeremy Sandmel
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Computer Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 11/800,319 (granted as US8286196B2)
Application filed by Apple Computer Inc filed Critical Apple Computer Inc
Publication of CN103927150A
Application granted
Publication of CN103927150B
Legal status: Active

Landscapes

  • Stored Programmes (AREA)

Abstract

Parallel runtime execution on multiple processors. A method and an apparatus are described that schedule a plurality of executables in a schedule queue for concurrent execution in one or more physical compute devices, such as CPUs or GPUs. One or more executables are compiled online from a source having an existing executable for a type of physical compute device different from the one or more physical compute devices. Dependency relations among elements corresponding to the scheduled executables are determined to select executables to be executed concurrently by a plurality of threads in more than one of the physical compute devices. A thread initialized for executing an executable in a GPU of the physical compute devices is executed in another CPU of the physical compute devices if the GPU is busy with graphics processing threads. Sources and existing executables for an API function are stored in an API library to execute, in a plurality of physical compute devices, a plurality of executables including the existing executables and executables compiled online from the sources.

Description

Parallel runtime execution on multiple processors
This application is a divisional of Chinese patent application No. 200880011684.8, entitled "Parallel runtime execution on multiple processors", filed on April 9, 2008.
Cross-Reference to Related Applications
This application is related to, and claims the benefit of, U.S. Provisional Patent Application No. 60/923,030, entitled "DATA PARALLEL COMPUTING ON MULTIPLE PROCESSORS", filed by Aaftab Munshi et al. on April 11, 2007, and U.S. Provisional Patent Application No. 60/925,620, entitled "PARALLEL RUNTIME EXECUTION ON MULTIPLE PROCESSORS", filed by Aaftab Munshi on April 20, 2007, both of which are hereby incorporated by reference.
Technical field
The present invention relates generally to data parallel computing. More particularly, this invention relates to data parallel runtime execution across both central processing units (CPUs) and graphics processing units (GPUs).
Background technology
As GPUs continue to evolve into high performance parallel compute devices, more and more applications are written to perform data parallel computations in GPUs similar to general purpose compute devices. Today, these applications are designed to run on specific GPUs using vendor specific interfaces. Thus, they are not able to leverage (balance work across) the CPUs even when a data processing system includes both GPUs and CPUs, nor can such an application be leveraged when it is running on GPUs from different vendors.
However, as more and more CPUs embrace multiple cores to perform data parallel computations, more and more processing tasks can be supported by either CPUs and/or GPUs, whichever are available. Traditionally, GPUs and CPUs are configured through separate programming environments that are not compatible with each other. Most GPUs require dedicated, vendor specific programs. As a result, it is very difficult for an application to balance processing resources across both CPUs and GPUs, for example, GPUs with data parallel computing capabilities together with multi-core CPUs.
Therefore, there is a need in modern data processing systems to overcome the above problems and allow an application to perform a task using any available processing resource, such as a CPU or one or more GPUs, capable of performing the task.
Summary of the invention
An embodiment of the present invention includes methods and apparatuses that, in response to an API request from an application running in a host processing unit, load one or more executables for a data processing task of the application. In response to another API request from the application, one of the loaded executables is selected for execution in another processing unit, such as a CPU or a GPU, attached to the host processing unit.
In an alternative embodiment, an application program running in a host processing unit generates an API request to load one or more executables for a data processing task. The application program then generates a second API request to select one of the loaded executables for execution in another processing unit, such as a CPU or a GPU, attached to the host processing unit.
In an alternative embodiment, a source for a target processing unit is compiled during run time based on an executable loaded to a processing unit. The processing unit and the target processing unit may be central processing units (CPUs) or graphics processing units (GPUs). A difference between the processing unit and the target processing unit is detected to retrieve the source from the loaded executable.
In an alternative embodiment, in response to an API request from an application, a task queue associated with a plurality of processing units, such as CPUs or GPUs, is updated with a new task including a plurality of executables. A condition for scheduling the new task from the queue for execution in the plurality of processing units is determined. Based on the determined condition, one of the plurality of executables associated with the new task is selected for execution.
In an alternative embodiment, in response to an API request from an application, a source for performing a data processing function is loaded from the application to execute an executable in one or more of a plurality of target data processing units, such as CPUs or GPUs. The plurality of types of the target data processing units are automatically determined. The executable is compiled based on the determined types of the one or more target processing units in which it is to be executed.
In an alternative embodiment, sources and one or more corresponding executables compiled for a plurality of processing units are stored in an API library to implement an API function. In response to a request to the API library from an application running in a host processor, the source and the one or more corresponding executables of the API function are retrieved from the API library. An additional executable is compiled online from the retrieved source for an additional processing unit not included among the plurality of processing units. According to the API function, the additional executable and the one or more retrieved executables are executed concurrently in the additional processing unit together with one or more of the plurality of processing units.
In an alternative embodiment, an API call is received on a host processor to execute an application having a plurality of threads for execution. The host processor is coupled to a CPU and a GPU. The plurality of threads are scheduled asynchronously for parallel execution on the CPU and the GPU. A thread scheduled to be executed on the GPU may be executed in the CPU if the GPU is busy with graphics processing threads.
In an alternative embodiment, an API call is received on a host processor to execute an application having a plurality of threads for execution. The host processor is coupled to a CPU and a GPU. The plurality of threads are initialized asynchronously for parallel execution on the CPU and the GPU. A thread initialized to be executed on the GPU may be executed in the CPU if the GPU is busy with graphics processing threads.
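The GPU-busy fallback described in the last two embodiments can be sketched as follows. This is a minimal illustration in Python; the class and function names are hypothetical, not part of the patented API.

```python
from dataclasses import dataclass

@dataclass
class ComputeThread:
    name: str
    target: str  # "cpu" or "gpu"

def dispatch(thread, gpu_busy, cpu_queue, gpu_queue):
    """Queue a thread on its target device, falling back to the CPU
    when the GPU is occupied by graphics processing threads."""
    if thread.target == "gpu" and gpu_busy:
        cpu_queue.append(thread)  # GPU busy: run the GPU thread on a CPU instead
    elif thread.target == "gpu":
        gpu_queue.append(thread)
    else:
        cpu_queue.append(thread)

cpu_q, gpu_q = [], []
dispatch(ComputeThread("t0", "gpu"), True, cpu_q, gpu_q)   # falls back to CPU
dispatch(ComputeThread("t1", "gpu"), False, cpu_q, gpu_q)  # stays on GPU
```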
Other features of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
Brief Description of the Drawings
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:
Fig. 1 is a block diagram illustrating one embodiment of a system for configuring compute devices including CPUs and/or GPUs to perform data parallel computing for applications;
Fig. 2 is a block diagram illustrating an example of a compute device with multiple compute processors operating in parallel to execute multiple threads concurrently;
Fig. 3 is a block diagram illustrating one embodiment of a plurality of physical compute devices configured as a logical compute device via a compute device identifier;
Fig. 4 is a flow diagram illustrating an embodiment of a process to configure a plurality of physical compute devices with a compute device identifier by matching a capability requirement received from an application;
Fig. 5 is a flow diagram illustrating an embodiment of a process to execute a compute executable in a logical compute device;
Fig. 6 is a flow diagram illustrating an embodiment of a runtime process to load an executable, including compiling a source for one or more physical compute devices determined to execute the executable;
Fig. 7 is a flow diagram illustrating one embodiment of a process to select a compute kernel execution instance from an execution queue to execute in one or more physical compute devices corresponding to a logical compute device associated with the execution instance;
Fig. 8A is a flow diagram illustrating one embodiment of a process to build an API (Application Programming Interface) library, which stores a source and a plurality of executables for one or more API functions in the library according to a plurality of physical compute devices;
Fig. 8B is a flow diagram illustrating one embodiment of a process for an application to execute one of a plurality of executables together with a corresponding source retrieved from an API library based on an API request;
Fig. 9 is sample source code illustrating an example of a compute kernel source for a compute kernel executable to be executed in a plurality of physical compute devices;
Fig. 10 is sample source code illustrating an example of configuring, by calling APIs, a logical compute device for executing one of a plurality of executables in a plurality of physical compute devices;
Fig. 11 illustrates one example of a typical computer system with a plurality of CPUs and GPUs (Graphics Processing Units) which may be used in conjunction with the embodiments described herein.
Detailed description of the invention
A method and an apparatus for data parallel computing on multiple processors are described herein. In the following description, numerous specific details are set forth to provide a thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments of the present invention may be practiced without these specific details. In other instances, well-known components, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
The processes depicted in the figures that follow are performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a dedicated machine), or a combination of both. Although the processes are described below in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
A Graphics Processing Unit (GPU) may be a dedicated graphics processor implementing highly efficient graphics operations, such as 2D and 3D graphics operations and/or digital video related functions. A GPU may include special (programmable) hardware to perform graphics operations, e.g. blitter operations, texture mapping, polygon rendering, pixel shading, and vertex shading. GPUs are known to fetch data from a frame buffer and blend pixels together to render an image back into the frame buffer for display. GPUs may also control the frame buffer and allow the frame buffer to be used to refresh a display, such as a CRT or LCD display, which is a short persistence display requiring refresh at a rate of at least 20 Hz (e.g. every 1/30 of a second, the display is refreshed with data from the frame buffer). Usually, GPUs may take graphics processing tasks from CPUs coupled with the GPUs to output raster graphics images to display devices through display controllers. References in this specification to a "GPU" may be to a graphics processor or a programmable graphics processor as described in "Method and Apparatus for Multithreaded Processing of Data in a Programmable Graphics Processor", Lindholm et al., U.S. Patent No. 7015913, and "Method for Deinterlacing Interlaced Video by a Graphics Processor", Swan et al., U.S. Patent No. 6970206, both of which are hereby incorporated by reference.
In one embodiment, a plurality of different types of processors, such as CPUs or GPUs, may perform data parallel processing tasks for one or more applications concurrently to increase the usage efficiency of available processing resources in a data processing system. Processing resources of a data processing system may be based on a plurality of physical compute devices. A physical compute device may be a CPU or a GPU. In one embodiment, data parallel processing tasks may be delegated to a plurality of types of processors, for example, CPUs or GPUs capable of performing the tasks. A data processing task may require certain specific processing capabilities of a processor. Processing capabilities may be, for example, dedicated texturing hardware support, double precision floating point arithmetic, dedicated local memory, stream data cache, or synchronization primitives. Separate types of processors may provide different yet overlapping sets of processing capabilities. For example, both a CPU and a GPU may be capable of performing double precision floating point computation. In one embodiment, an application is capable of leveraging either a CPU or a GPU, whichever is available, to perform a data parallel processing task.
In another embodiment, selecting and allocating a plurality of different types of processing resources for a data parallel processing task may be performed automatically during run time. An application may send a hint including a list of desired capability requirements of a data processing task through an API (Application Programming Interface) to a runtime platform of a data processing system. Accordingly, the runtime platform may determine a plurality of currently available CPUs and/or GPUs with capabilities matching the received hint to delegate the data processing task for the application. In one embodiment, the list of capability requirements may depend on the underlying data processing task. A capability requirement list may be applicable across different sets of processors, including, for example, GPUs and multi-core CPUs from different vendors and of different versions. Consequently, an application may be insulated from providing programs targeting a particular type of CPU or GPU.
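The capability matching just described can be illustrated with a small sketch. This is not the patented implementation; the device names and capability strings are hypothetical.

```python
# Illustrative sketch: a runtime platform matches an application's
# capability-requirement hint against the capability sets of the
# attached physical compute devices.

DEVICES = {
    "multicore_cpu": {"double_fp", "synchronization", "dedicated_local_memory"},
    "vendor_a_gpu": {"dedicated_texture", "stream_data_cache", "double_fp"},
    "vendor_b_gpu": {"dedicated_texture", "stream_data_cache"},
}

def matching_devices(required):
    """Return every device whose capability set covers the requirement list."""
    return sorted(name for name, caps in DEVICES.items() if required <= caps)

print(matching_devices({"double_fp"}))  # ['multicore_cpu', 'vendor_a_gpu']
```

Because the requirement list names capabilities rather than device models, the same hint matches a multi-core CPU and a GPU from a different vendor alike, which is the point of the embodiment above.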
Fig. 1 is a block diagram illustrating one embodiment of a system for configuring compute devices including CPUs and/or GPUs to perform data parallel computing for applications. System 100 may implement a parallel computing architecture. In one embodiment, system 100 may be a graphics system including one or more host processors coupled, through a data bus 113, with one or more central processing units 117 and one or more other processors such as media processors 115. The plurality of host processors may be networked together in a hosting system 101. The plurality of central processing units 117 may include multi-core CPUs from different vendors. A media processor may be a GPU with dedicated texture rendering hardware. Another media processor may be a GPU supporting both dedicated texture rendering hardware and double precision floating point architecture. Multiple GPUs may be connected together in Scalable Link Interface (SLI) or CrossFire configurations.
In one embodiment, the hosting system 101 may support a software stack including software stack components such as applications 103, a compute platform layer 111, a compute runtime layer 109, a compute compiler 107, and compute application libraries 105. An application 103 may interface with the other stack components through API (Application Programming Interface) calls. One or more threads may run concurrently for the application 103 in the hosting system 101. The compute platform layer 111 may maintain a data structure, or a compute device data structure, storing the processing capabilities of each attached physical compute device. In one embodiment, an application may retrieve information about available processing resources of the hosting system 101 through the compute platform layer 111. An application may select and specify capability requirements for performing a processing task through the compute platform layer 111. Accordingly, the compute platform layer 111 may determine a configuration of physical compute devices for the processing task to allocate and initialize processing resources from the attached CPUs 117 and/or GPUs 115. In one embodiment, the compute platform layer 111 may generate one or more logical compute devices for the application corresponding to the one or more actual physical compute devices configured.
The compute runtime layer 109 may manage the execution of a processing task according to the processing resources configured for an application 103, for example, one or more logical compute devices. In one embodiment, executing a processing task may include creating a compute kernel object representing the processing task and allocating memory resources, e.g. for holding executables, input/output data, etc. An executable loaded for a compute kernel object may be a compute executable. A compute executable may be included in a compute kernel object to be executed in a compute processor, such as a CPU or a GPU. The compute runtime layer 109 may interact with the allocated physical devices to carry out the actual execution of the processing task. In one embodiment, the compute runtime layer 109 may coordinate executing multiple processing tasks from different applications according to the run time states of each processor, such as a CPU or a GPU, configured for the processing tasks. The compute runtime layer 109 may select, based on the run time states, one or more processors from the physical compute devices configured to perform the processing tasks. Performing a processing task may include executing multiple threads of one or more executables in a plurality of physical compute devices concurrently. In one embodiment, the compute runtime layer 109 may track the status of each executed processing task by monitoring the run time execution status of each processor.
The runtime layer may load one or more executables corresponding to a processing task from an application 103. In one embodiment, the compute runtime layer 109 automatically loads additional executables required to perform a processing task from a compute application library 105. The compute runtime layer 109 may load both an executable and its corresponding source program for a compute kernel object from an application 103 or a compute application library 105. A source program for a compute kernel object may be a compute kernel program. A plurality of executables based on a single source program may be loaded according to a logical compute device configured to include multiple types and/or different versions of physical compute devices. In one embodiment, the compute runtime layer 109 may activate the compute compiler 107 to online compile a loaded source program into an executable optimized for a target processor, e.g. a CPU or a GPU, configured to execute the executable.
An online compiled executable may be stored for future invocation in addition to the existing executables according to the corresponding source program. In addition, compute executables may be compiled offline and loaded to the compute runtime 109 via API calls. The compute application library 105 and/or applications 103 may load an associated executable in response to library API requests from an application. Newly compiled executables may be dynamically updated for the compute application library 105 or for applications 103. In one embodiment, the compute runtime 109 may replace an existing compute executable in an application by a new executable online compiled through the compute compiler 107 for a newly upgraded version of a compute device. The compute runtime 109 may insert the new online compiled executable to update the compute application library 105. In one embodiment, the compute runtime 109 may invoke the compute compiler 107 when loading an executable for a processing task. In another embodiment, the compute compiler 107 may be invoked offline to build executables for the compute application library 105. The compute compiler 107 may compile and link a compute kernel program to generate a compute kernel executable. In one embodiment, the compute application library 105 may include a plurality of functions to support, for example, development toolkits and/or image processing. Each library function may correspond to a compute source program and one or more executables stored in the compute application library 105 for a plurality of physical compute devices.
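The load-or-compile path above can be sketched as follows. The class and method names are hypothetical stand-ins for the compute runtime (109) and compute compiler (107), not the actual API.

```python
# Hypothetical sketch: reuse an existing executable when one matches the
# target device type, otherwise compile the kernel source online and cache
# the result for future invocations.

class ComputeRuntime:
    def __init__(self):
        self._cache = {}  # (source, device_type) -> executable

    def _compile_online(self, source, device_type):
        # stand-in for the compute compiler
        return f"executable<{source}@{device_type}>"

    def load_executable(self, source, device_type, existing=None):
        key = (source, device_type)
        if existing is not None:
            self._cache[key] = existing  # existing executable, e.g. from the API library
        elif key not in self._cache:
            self._cache[key] = self._compile_online(source, device_type)
        return self._cache[key]

rt = ComputeRuntime()
exe = rt.load_executable("kernel_src", "gpu")
assert exe == rt.load_executable("kernel_src", "gpu")  # cached, not recompiled
```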
Fig. 2 is a block diagram illustrating an example of a compute device with multiple compute processors operating in parallel to execute multiple threads concurrently. Each compute processor may execute a plurality of threads in parallel (or concurrently). Threads that can be executed in parallel may be referred to as a thread block. A compute device may have multiple thread blocks that can be executed in parallel. For example, M threads are shown to execute as a thread block in compute device 205. Threads in multiple thread blocks, e.g. thread 1 of compute processor_1 205 and thread N of compute processor_L 203, may execute in parallel across separate compute processors on one compute device or across multiple compute devices. A plurality of thread blocks across multiple compute processors may execute a compute kernel executable in parallel. More than one compute processor may be based on a single chip, such as an ASIC (Application Specific Integrated Circuit) device. In one embodiment, multiple threads from an application may be executed concurrently in more than one compute processor across multiple chips.
A compute device may include one or more compute processors, such as compute processor_1 205 and compute processor_L 203. A local memory may be coupled with a compute processor. The local memory coupled with a compute processor may support shared memory among threads of a single thread block running in the compute processor. Multiple threads across different thread blocks, such as thread 1 213 and thread N 209, may share a stream stored in a stream memory 217 coupled to the compute device 201. A stream may be a collection of elements that a compute kernel executable can operate on, such as an image stream or a variable stream. A variable stream may be allocated to store global variables operated on during a processing task. An image stream may be a buffer usable as an image buffer, a texture buffer, or a frame buffer.
In one embodiment, the local memory of a compute processor may be implemented as a dedicated local storage, such as local shared memory 219 of processor_1 and local shared memory 211 of processor_L. In another embodiment, the local memory of a compute processor may be implemented as a stream read-write cache of the stream memory for one or more compute processors of a compute device, such as stream data cache 215 for compute processors 205, 203 in compute device 201. In another embodiment, the local memory may implement a dedicated local storage shared among the threads of a thread block running in the compute processor coupled with the local memory, such as local shared memory 219 coupled with compute processor_1 205. A dedicated local storage may not be shared by threads across different thread blocks. If the local memory of a compute processor, such as processor_1 205, is implemented as a stream read-write cache, e.g. stream data cache 215, a variable declared to be in the local memory may be allocated from the stream memory 217 and cached in the stream read-write cache, e.g. stream data cache 215, that implements the local memory. Threads within a thread block may share local variables allocated in the stream memory 217 when, for example, neither a stream read-write cache nor a dedicated local storage is available for the corresponding compute device. In one embodiment, each thread is associated with a private memory to store thread private variables used by functions called in the thread. For example, private memory 1 211 may only be accessed by thread 1 213.
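The local-memory placement rules in the preceding paragraph can be sketched as a simple decision function; the return strings are hypothetical labels, not names used by the patent.

```python
# Illustrative sketch: a variable declared in local memory lives in the
# dedicated local storage when the compute processor has one, is allocated
# from stream memory and cached in the stream read-write cache when that
# cache implements local memory, and otherwise is shared by the thread
# block directly from stream memory.

def place_local_variable(has_dedicated_local, has_stream_rw_cache):
    if has_dedicated_local:
        return "dedicated_local_storage"
    if has_stream_rw_cache:
        return "stream_memory_cached_in_rw_cache"
    return "stream_memory"
```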
Fig. 3 is a block diagram illustrating one embodiment of a plurality of physical compute devices configured as a logical compute device via a compute device identifier. In one embodiment, an application 303 and a platform layer 305 may be running in a host CPU 301. The application 303 may be one of the applications 103 of Fig. 1. The hosting system 101 may include the host CPU 301. Each of the physical compute devices Physical_Compute_Device-1 305 ... Physical_Compute_Device-N 311 may be one of the CPUs 117 or GPUs 115 of Fig. 1. In one embodiment, the compute platform layer 111 may generate a compute device identifier 307 in response to API requests from the application 303 for configuring data parallel processing resources according to a list of capability requirements included in the API request. The compute device identifier 307 may refer to a selection of actual physical compute devices Physical_Compute_Device-1 305 ... Physical_Compute_Device-N 311 according to the configuration made by the compute platform layer 111. In one embodiment, a logical compute device 309 may represent the group of selected actual physical compute devices separate from the host CPU 301.
Fig. 4 is to illustrate for by mating from applying the ability need received, utilizing calculating device Identifier configures the flow chart of the embodiment of the process of multiple physical computing device.Can be according to Fig. 1 System 100, in the data handling system by mandatory system 101 trustship execution process 400.Data Processing system can include hosted platform layer (the calculating podium level 111 of such as Fig. 1) primary processor and Multiple physical computing device (such as, CPU117 and GPU of Fig. 1 being attached to primary processor 115)。
In block 401, in one embodiment, process 400 and can set up representative and one or more phases The data structure of the ability associated plurality of physical computing device answered (or computer data knot Structure).Each physical computing device can be attached to execution and process the processing system of 400.Such as CPU Or whether the ability of the physical computing device of GPU etc or computing capability can include physical computing device Support processing feature, memory access mechanism or specify extension.Processing feature can be hard with dedicated texture Part support, double-precision floating point computing or synchronization support that (such as mutual exclusion) is relevant.Physical processing apparatus Memory access mechanism can be with the type of variable stream cache, the type of image stream caching or special this locality Memory is supported relevant.The system application of data handling system can be in response to by new physical computing dress Put and be attached to data handling system and carry out more new data structure.In one embodiment, it may be predetermined that The ability of physical computing device.In another embodiment, the system application of data handling system can be The physical processing apparatus of new attachment is found the run time of between.The application of this system can take out newfound thing Reason calculates the ability of device, updates the physical computing device attached by representative and their respective capabilities Data structure.
According to an embodiment, at block 403, processing 400 computing capabilitys that can receive self-application needs Ask.This application can send capability requirement by calling API to system application.This system is applied Can be corresponding with the podium level of the software stack in the mandatory system of this application.In one embodiment, Capability requirement can identify for asking process resource to perform the required ability of the task of this application List.In one embodiment, this application may require that requested resource in multiple threads also Send out ground and perform task.As response, at block 405, processing 400 can be from attached physical computing dress Put one group of physical computing device of middle selection.Can be based on capability requirement and institute in capabilities data structure Coupling between the computing capability of storage determines selection.In one embodiment, 400 are processed permissible Coupling is performed according to the prompting that handling capacity demand provides.
Process 400 may determine a match score according to the number of compute capabilities matched between a physical compute device and the capability requirement. In one embodiment, process 400 may select multiple physical compute devices having the highest match scores. In another embodiment, process 400 may select a physical compute device if each capability in the capability requirement is matched. Process 400 may determine, at block 405, multiple groups of matching physical compute devices. In one embodiment, each group of matching physical devices may be selected according to load-balancing capabilities. In one embodiment, at block 407, process 400 may generate a compute device identifier for each group of physical compute devices selected at block 405. Process 400 may return the one or more generated compute device identifiers to the application through the API call. The application may choose, according to the compute device identifiers, which processing resources to employ for performing the task. In one embodiment, process 400 may generate at most one compute device identifier at block 407 for each capability requirement received.
In one embodiment, at block 409, process 400 may allocate resources to initialize a logical compute device for the group of physical compute devices selected at block 405, according to the corresponding compute device identifier. Process 400 may perform the initialization of the logical compute device, in accordance with the selection at block 405, in response to an API request from an application that has received one or more compute device identifiers. Process 400 may create a context object on the logical compute device for the application. In one embodiment, the context object is associated with an application thread in the hosting system running the application. Multiple threads performing processing tasks concurrently in one logical compute device, or across different logical compute devices, may be based on separate context objects.
In one embodiment, process 400 may be based on multiple APIs including cuCreateContext, cuRetainContext, and cuReleaseContext. The API cuCreateContext creates a compute context. A compute context may correspond to a compute context object. The API cuRetainContext increments the number of instances using the particular compute context identified by the input argument of cuRetainContext. The API cuCreateContext performs an implicit retain. This is helpful for third-party libraries, which typically obtain a context passed to them by an application. However, it is possible that the application may delete the context without notifying the library. Allowing multiple instances to attach to and release a context solves the problem of a compute context used by a library no longer being valid. If the input argument of cuRetainContext does not correspond to a valid compute context object, cuRetainContext returns CU_INVALID_CONTEXT. The API cuReleaseContext releases an instance from a valid compute context. If the input argument of cuReleaseContext does not correspond to a valid compute context object, cuReleaseContext returns CU_INVALID_CONTEXT.
Fig. 5 is a flowchart illustrating an embodiment of a process for executing a compute executable in a logical compute device. In one embodiment, process 500 may be performed by a runtime layer in a data processing system (e.g., the compute runtime layer 109 of Fig. 1). At block 501, process 500 may allocate one or more streams for a compute executable to be run on the logical compute device. A processing task may be performed by the compute executable operating on the streams. In one embodiment, a processing task may include input streams and output streams. Process 500 may map an allocated stream memory to, or from, the logical addresses of the application. In one embodiment, process 500 may perform the operations of block 501 based on an API request from the application.
At block 503, according to one embodiment, process 500 may create a compute kernel object for the logical compute device. A compute kernel object may be an object created for the associated streams and executables of the corresponding processing task to perform a function. Process 500 may set up function arguments for the compute kernel object at block 505. Function arguments may include streams allocated as function inputs or outputs, such as the streams allocated at block 501. Process 500 may load a compute kernel executable and/or a compute kernel source into the compute kernel object at block 507. A compute kernel executable may be an executable to be executed according to the logical compute device to perform the corresponding processing task associated with the kernel object. In one embodiment, a compute kernel executable may include description data associated with, for example, the type, version, and/or compilation options of the target physical compute device. A compute kernel source may be the source code from which the compute kernel executable is compiled. Process 500 may load, at block 507, multiple compute kernel executables corresponding to one compute kernel source. Process 500 may load a compute kernel executable from the application or through a compute library, such as the compute application library 105 of Fig. 1. A compute kernel executable may be loaded with the corresponding compute kernel source. In one embodiment, process 500 may perform the operations at blocks 503, 505, and 507 according to API requests from the application.
At block 511, process 500 may update an execution queue to execute the compute kernel object with the logical compute device. Process 500 may execute the compute kernel with appropriate arguments using the compute runtime (e.g., the compute runtime 109 of Fig. 1), in response to an API call from the application or from a compute application library (e.g., the application 103 or compute application library 105 of Fig. 1). In one embodiment, process 500 may generate a compute kernel execution instance for executing the compute kernel. The API call itself to the compute runtime (e.g., the compute runtime 109 of Fig. 1) for executing a compute kernel may be essentially asynchronous. An execution instance may be identified by a compute event object returned by the compute runtime (e.g., the compute runtime 109 of Fig. 1). A compute kernel execution instance may be added to the execution queue for executing compute kernel execution instances. In one embodiment, an API call to the execution queue for executing a compute kernel execution instance may include the number of threads to execute in parallel simultaneously on a compute processor and the number of compute processors to use. A compute kernel execution instance may include a preference value indicating a desired priority for executing the corresponding compute kernel object. A compute kernel execution instance may also include an event object identifying a previous execution instance and/or expected numbers of threads and expected numbers of thread blocks for performing the execution. The number of thread blocks and the number of threads may be specified in the API call.
In one embodiment, an event object may indicate an execution-order relationship between the execution instance that includes the event object and another execution instance identified by the event object. An execution instance including an event object may be required to execute after the other execution instance identified by the event object completes execution. The event object may be referred to as a queue_after_event_object. In one embodiment, the execution queue may include multiple compute kernel execution instances for executing corresponding compute kernel objects. One or more compute kernel execution instances for one compute kernel object may be scheduled for execution in the execution queue. In one embodiment, process 500 may update the execution queue in response to an API request from the application. The execution queue may be hosted by the hosting data system on which the application runs.
At block 513, process 500 may select a compute kernel execution instance from the execution queue for execution. In one embodiment, process 500 may select more than one compute kernel execution instance to be executed concurrently according to the corresponding logical compute devices. Process 500 may determine whether a compute kernel execution instance is selected from the execution queue based on its priority and its dependency relationships with other execution instances in the execution queue. A compute kernel execution instance may be executed by executing its corresponding compute kernel object according to an executable loaded into that compute kernel object.
At block 517, in one embodiment, process 500 may select, from the multiple executables loaded into the compute kernel object corresponding to the selected compute kernel execution instance, one executable to execute in a physical compute device associated with the logical compute device of the compute kernel object. Process 500 may select more than one executable for one compute kernel execution instance to execute in parallel in more than one physical compute device. The selection may be based on the current execution statuses of the physical compute devices corresponding to the logical compute device associated with the selected compute kernel execution instance. The execution status of a physical compute device may include the number of running threads, the local memory usage level, the processor utilization level (e.g., peak number of operations per unit time), and the like. In one embodiment, the selection may be based on predetermined utilization levels. In another embodiment, the selection may be based on the number of threads and the number of thread blocks associated with the compute kernel execution instance. Process 500 may retrieve an execution status from a physical compute device. In one embodiment, process 500 may perform the operations to select a compute kernel execution instance from the execution queue for execution at blocks 513 and 517 asynchronously with applications running in the hosting system.
At block 519, process 500 may check the execution status of a compute kernel execution instance scheduled for execution in the execution queue. Each execution instance may be identified by a unique compute event object. When the corresponding compute kernel execution instance is queued according to the compute runtime (e.g., the runtime 109 of Fig. 1), an event object may be returned to the application or compute application library (e.g., the application 103 or compute application library 105 of Fig. 1) that calls the API to execute the execution instance. In one embodiment, process 500 may perform the execution status check in response to an API request from the application. Process 500 may determine the completion of executing a compute kernel execution instance by querying the status of the compute event object identifying that compute kernel execution instance. Process 500 may wait until the execution of the compute kernel execution instance is complete before returning from an API call from the application. Process 500 may control the processing of reads from and/or writes to various streams by execution instances based on event objects.
At block 521, according to one embodiment, process 500 may retrieve the results of executing a compute kernel execution instance. Subsequently, process 500 may clean up the processing resources allocated for executing the compute kernel execution instance. In one embodiment, process 500 may copy a stream memory holding the results of executing the compute kernel executable into local memory. Process 500 may delete the variable streams or image streams allocated at block 501. Process 500 may delete a kernel event object used for detecting when the compute kernel execution has completed. If each compute kernel execution instance associated with a particular compute kernel object has been completely executed, process 500 may delete the particular compute kernel object. In one embodiment, process 500 may perform the operations of block 521 based on API requests initiated by the application.
Fig. 6 is a flowchart illustrating an embodiment of a runtime process for loading an executable, including compiling a source for one or more physical compute devices determined to execute the executable. Process 600 may be performed as part of process 500 at block 507 of Fig. 5. In one embodiment, process 600 may select, at block 601, for each physical compute device associated with a logical compute device, one or more existing compute kernel executables compatible with the physical compute device. A compute kernel executable may be executed in a compatible physical compute device. The existing compute kernel executables may be obtained from an application or through a compute library, such as the compute application library 105 of Fig. 1. Each of the selected compute kernel executables may be executable by at least one physical compute device. In one embodiment, the selection may be based on the description data associated with the existing compute kernel executables.
If there are selected existing compute kernel executables, process 600 may determine at block 603 whether any of the selected compute kernel executables is optimized for the physical compute device. The determination may be based on, for example, the version of the physical compute device. In one embodiment, process 600 may determine that an existing compute kernel executable is optimized for the physical compute device if the version of the target physical compute device in the description data matches the version of the physical compute device.
At block 605, in one embodiment, process 600 may build a new compute kernel executable optimized for the physical compute device from the corresponding compute kernel source using an online compiler (e.g., the compute compiler 107 of Fig. 1). Process 600 may perform the online build if none of the selected compute kernel executables is found, at block 603, to be optimized for the physical compute device. In one embodiment, process 600 may perform the online build if none of the existing compute kernel executables is found, at block 601, to be compatible with the physical compute device. The compute kernel source may be obtained from the application or through a compute library, such as the compute application library 105 of Fig. 1.
If the build at block 605 is successful, in one embodiment, process 600 may load the newly built compute kernel executable into the corresponding compute kernel object at block 607. Otherwise, process 600 may load the selected compute kernel executables into the kernel object at block 609. In one embodiment, process 600 may load a compute kernel executable into a compute kernel object if that compute kernel executable has not yet been loaded. In another embodiment, process 600 may generate an error message if none of the existing compute kernel executables for a compute kernel object is compatible with the physical compute device and the corresponding compute kernel source is unavailable.
Fig. 7 is a flowchart illustrating one embodiment of a process for selecting a compute kernel execution instance from an execution queue to execute in one or more physical compute devices corresponding to the logical compute device associated with the execution instance. Process 700 may be performed as part of process 500 at block 513 of Fig. 5. In one embodiment, process 700 may identify, at block 701, dependency conditions among the compute kernel execution instances currently scheduled in the execution queue. A dependency condition of a compute kernel execution instance may prevent execution of that compute kernel execution instance if the condition is outstanding. In one embodiment, a dependency condition may be based on relationships between input streams fed by output streams. In one embodiment, process 700 may detect dependency conditions between execution instances according to the input streams and output streams of the corresponding functions of the execution instances. In another embodiment, an execution instance with a lower priority may have a dependency relationship with another execution instance with a higher priority.
At block 703, in one embodiment, process 700 may select, for execution from the multiple scheduled compute kernel execution instances, a compute kernel execution instance without any outstanding dependency condition. The selection may be based on the priorities assigned to the execution instances. In one embodiment, the selected compute kernel execution instance may be associated with the highest priority among the multiple compute kernel execution instances without outstanding dependency conditions. At block 705, process 700 may retrieve the current execution statuses of the physical compute devices corresponding to the selected compute kernel execution instance. In one embodiment, the execution status of a physical compute device may be retrieved from a predetermined memory location. In another embodiment, process 700 may send a status request to a physical compute device and receive an execution status report. Process 700 may designate, at block 707, one or more of the physical compute devices to execute the selected compute kernel execution instance based on the retrieved execution statuses. In one embodiment, a physical compute device may be designated for execution according to load-balancing considerations with other physical compute devices. A selected physical compute device may be associated with an execution status satisfying predetermined criteria (e.g., below predetermined processor utilization and/or memory utilization levels). In one embodiment, the predetermined criteria may depend on the number of threads and the number of thread blocks associated with the selected compute kernel execution instance. Process 700 may load separate compute kernel executables for the same execution instance, or for multiple instances, into one or more designated physical compute devices, to execute in parallel in multiple threads.
Fig. 8A is a flowchart illustrating one embodiment of a process for building an application programming interface (API) library, which stores in the library a source and multiple executables for one or more APIs according to multiple physical compute devices. Process 800A may be performed offline, at block 801, to load the source code of an API function into a data processing system. The source code may be a compute kernel source to be executed in one or more physical compute devices. In one embodiment, process 800A may designate, at block 803, multiple target physical compute devices for the API function. A target physical compute device may be designated according to type (e.g., CPU or GPU), version, or vendor. Process 800A may compile, at block 805, the source code into an executable, such as a compute kernel executable, for each designated target physical compute device. In one embodiment, process 800A may perform the offline compilation based on an online compiler (e.g., the compute compiler 107 of Fig. 1). At block 807, process 800A may store the source code of the API function into the API library, together with the corresponding executables compiled for the designated target physical compute devices. In one embodiment, each executable may be stored with description data including, for example, the type, version, and vendor of the target physical compute device and/or the compilation options. The description data may be retrieved by a process during run time (e.g., process 500 of Fig. 5).
Fig. 8B is a flowchart illustrating one embodiment of a process for an application to execute one of multiple executables, together with the corresponding source, retrieved from an API library based on an API request. In one embodiment, process 800B runs an application program (e.g., the application 103 of Fig. 1) in a data processing system (e.g., the hosting system 101 of Fig. 1) including an API library (e.g., the compute application library 105 of Fig. 1). At block 811, process 800B may retrieve, from the API library based on an API request, a source (e.g., a compute kernel source) and one or more corresponding executables (e.g., compute kernel executables), as in process 500 at block 507 of Fig. 5. Each executable may be associated with one or more target physical compute devices. In one embodiment, a compute kernel executable may be backward compatible with multiple versions of physical compute devices. At block 813, process 800B may execute one of the executables retrieved based on the API request in multiple physical compute devices to perform the associated API function, as in process 500 at block 517 of Fig. 5. Process 800B may run the application at block 809 asynchronously with performing the API function at block 813.
Fig. 9 illustrates a sample source code for an example compute kernel source for a compute kernel executable to be executed in multiple physical compute devices. Example 900 may be an API function with arguments including variables 901 and streams 903. Example 900 may be based on a programming language for a parallel computing environment, such as the system 101 of Fig. 1. In one embodiment, the parallel programming language may be specified according to the ANSI (American National Standards Institute) C standard with additional extensions and restrictions designed to implement one or more of the embodiments described herein. The extensions may include a function qualifier to specify a compute kernel function to be executed in a compute device, such as qualifier 905. A compute kernel function may not be called by other compute kernel functions. In one embodiment, a compute kernel function may be called by a host function of the parallel programming language. A host function may be a regular ANSI C function. A host function may be executed in a host processor separate from the compute device executing a compute kernel function. In one embodiment, the extensions may include a local qualifier to describe variables that need to be allocated in local memory associated with a compute device and shared by all threads of a thread block. The local qualifier may be declared inside a compute kernel function. Restrictions of the parallel programming language may be enforced during compile time or run time to generate an error condition, such as outputting an error message or exiting the execution, when the restrictions are violated.
Figure 10 illustrates a sample source code for an example of configuring a logical compute device for executing one of multiple executables in multiple physical compute devices by calling APIs. Example 1000 may be executed by an application running in a host system attached to multiple physical compute devices (e.g., the hosting system 101 of Fig. 1). Example 1000 may specify a host function of a parallel programming language. The processing operations in example 1000 may be performed as API calls by a process such as process 500 of Fig. 5. The processing operations to allocate streams 1001 and to load stream images 1003 may be performed by process 500 at block 501 of Fig. 5. The processing operation to create a compute kernel object 1005 may be performed by process 500 at block 503 of Fig. 5. Processing operation 1007 may load a compute kernel source, such as example 900 of Fig. 9, into the created compute kernel object. Processing operation 1009 may explicitly build a compute kernel executable from the loaded compute kernel source. In one embodiment, processing operation 1009 may load the built compute kernel executable into the created compute kernel object. Subsequently, processing operation 1011 may explicitly select the built compute kernel executable for executing the created compute kernel object.
In one embodiment, processing operation 1013 may attach variables and streams as function arguments of the created compute kernel object. Processing operation 1013 may be performed by process 500 at block 505 of Fig. 5. Processing operation 1015 may execute the created compute kernel object. In one embodiment, processing operation 1015 may be performed by process 500 at block 511 of Fig. 5. Processing operation 1015 may cause the execution queue to be updated with a compute kernel execution instance corresponding to the created compute kernel object. Processing operation 1017 may synchronously wait for the completion of executing the created compute kernel object. In one embodiment, processing operation 1019 may retrieve results from executing the compute kernel object. Subsequently, processing operation 1021 may clean up the resources allocated for executing the compute kernel object, such as an event object, the created compute kernel object, and the allocated memories. In one embodiment, processing operation 1017 may be based on whether a kernel event object is set. Processing operation 1017 may be performed by process 500 at block 519 of Fig. 5.
Figure 11 shows one example of a computer system that may be used with one embodiment of the present invention. For example, system 1100 may be implemented as part of the system shown in Fig. 1. Note that while Figure 11 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the present invention. It will also be appreciated that network computers and other data processing systems which have fewer components, or perhaps more components (e.g., handheld computers, personal digital assistants (PDAs), cellular telephones, entertainment systems, consumer electronic devices, etc.), may also be used to implement one or more embodiments of the present invention.
As shown in Figure 11, the computer system 1101, which is a form of data processing system, includes a bus 1103 coupled to one or more microprocessors 1105, such as CPUs and/or GPUs, a ROM (read-only memory) 1107, volatile RAM 1109, and non-volatile memory 1111. The microprocessors 1105 may retrieve instructions from the memories 1107, 1109, 1111 and execute the instructions to perform the operations described above. The bus 1103 interconnects these various components together and also interconnects the components 1105, 1107, 1109, and 1111 to a display controller and display device 1113 and to peripheral devices such as input/output (I/O) devices, which may be mice, keyboards, modems, network interfaces, printers, and other devices well known in the art. Typically, the input/output devices 915 are coupled to the system through input/output controllers 1117. The volatile RAM (random access memory) 1109 is typically implemented as dynamic RAM (DRAM), which requires power continually in order to refresh or maintain the data in the memory. The display controller coupled with the display device 1108 may optionally include one or more GPUs to process display data. Optionally, a GPU memory 1111 may be provided to support GPUs included in the display device 1108.
The mass storage 1111 is typically a magnetic hard drive, a magneto-optical drive, an optical drive, a DVD RAM, a flash memory, or another type of memory system which maintains data (e.g., large amounts of data) even after power is removed from the system. Typically, the mass storage 1111 will also be a random access memory, although this is not required. While Figure 11 shows the mass storage 1111 as a local device coupled directly to the rest of the components in the data processing system, it will be appreciated that the present invention may utilize a non-volatile memory which is remote from the system, such as a network storage device coupled to the data processing system through a network interface such as a modem, Ethernet interface, or wireless networking interface. The bus 1103 may include one or more buses connected to each other through various bridges, controllers, and/or adapters as is well known in the art.
Portions of what was described above may be implemented with logic circuitry such as a dedicated logic circuit, or with a microcontroller or another form of processing core that executes program code instructions. Thus, the processes taught by the discussion above may be performed with program code, such as machine-executable instructions, that cause a machine executing those instructions to perform certain functions. In this context, a "machine" may be a machine that converts intermediate-form (or "abstract") instructions into processor-specific instructions (e.g., an abstract execution environment such as a "virtual machine" (e.g., a Java Virtual Machine), an interpreter, a Common Language Runtime, a high-level language virtual machine, etc.), and/or electronic circuitry disposed on a semiconductor chip (e.g., "logic circuitry" implemented with transistors) designed to execute instructions, such as a special-purpose processor and/or a general-purpose processor. The processes taught by the discussion above may also be performed by (as an alternative to a machine or in combination with a machine) electronic circuitry designed to perform those processes (or a portion thereof) without the execution of program code.
An article of manufacture may be used to store program code. An article of manufacture that stores program code may be embodied as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic, or other)), optical disks, CD-ROMs, DVD-ROMs, EPROMs, EEPROMs, magnetic or optical cards, or other types of machine-readable media suitable for storing electronic instructions. Program code may also be downloaded from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a propagation medium (e.g., via a communication link, such as a network connection).
The preceding detailed descriptions are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the tools used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, or as is apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "determining" or "displaying" or the like refer to the action and processes of a computer system, or similar electronic computing device, which manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices.
The present invention also relates to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purpose, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk (including floppy disks, optical disks, CD-ROMs, and magneto-optical disks), read-only memories (ROMs), RAMs, EPROMs, EEPROMs, or magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the described operations. The required structure for a variety of these systems will be evident from the description. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
The foregoing discussion merely describes some exemplary embodiments of the present invention. One skilled in the art will readily recognize from such discussion, the accompanying drawings, and the claims that various modifications can be made without departing from the spirit and scope of the invention.

Claims (22)

1. A computer implemented method, comprising:
during a runtime of an application program in a first processing unit, in response to a second API request received from the application program, loading one or more executables for a data processing task of the application program, wherein the one or more executables are compatible with a second processing unit identified by a compute device identifier specified by the application program in the second API request, the compute device identifier having previously been associated, during the runtime, with a processing unit matching one or more requirements specified by the application program via a first API request; and
in response to a third API request received from the application program during the runtime, selecting one of the one or more executables for the second processing unit.
2. The computer implemented method of claim 1, wherein the first processing unit and the second processing unit are each a central processing unit (CPU) or a graphics processing unit (GPU).
3. The computer implemented method of claim 1, wherein a selected one of the one or more executables is associated with the third API request.
4. The computer implemented method of claim 1, wherein the one or more executables include description data for at least one of the one or more executables, the description data including a version and a type of processing unit supported.
5. The computer implemented method of claim 4, wherein the one or more executables include a source, the source being compiled to generate the one or more executables.
6. The computer implemented method of claim 5, wherein the source is loaded from the application program via the second API.
7. The computer implemented method of claim 5, wherein the source is loaded from a library associated with the one or more executables.
8. The computer implemented method of claim 5, wherein the loading comprises:
comparing the description data with information about the second processing unit; and
compiling online, from the source, one of the one or more executables for the second processing unit.
9. The computer implemented method of claim 8, wherein the one of the one or more executables is associated with the second API request.
10. The computer implemented method of claim 8, wherein the compiling is based on the comparison indicating that at least one of the one or more executables is not optimized for the second processing unit.
11. The computer implemented method of claim 8, wherein the compiling is based on the comparison indicating that at least one of the one or more executables does not support the second processing unit.
12. The computer implemented method of claim 8, wherein the compiling comprises:
generating, for the one of the one or more executables, updated description data including a version for the second processing unit; and
storing the one of the one or more executables, the one of the one or more executables including the updated description data.
13. The computer implemented method of claim 12, wherein the one of the one or more executables is stored to replace at least one of the one or more executables.
14. The computer implemented method of claim 1, wherein the one or more executables include description data for the one of the one or more executables, and wherein the selecting is based on the description data.
15. The computer implemented method of claim 14, wherein the selected one of the one or more executables is associated, based on the description data, with the most recent version among the one or more executables for the second processing unit.
16. The computer implemented method of claim 14, wherein the selected one of the one or more executables is associated with an execution ordering relationship indicated by the description data.
17. A computer implemented method, comprising:
generating, by an application program in a first processing unit during a runtime, a first API request specifying one or more requirements of a second processing unit;
generating, by the application program during the runtime, a second API request to load one or more executables for a data processing task of the application program, wherein the one or more executables are compatible with the second processing unit identified by a compute device identifier specified by the application program in the second API request, the compute device identifier being associated with a processing unit matching the one or more requirements previously specified by the application program in the first API request; and
generating, by the application program during the runtime, a third API request to select an executable from the one or more executables for execution in the second processing unit.
18. The computer implemented method of claim 17, wherein the first processing unit and the second processing unit are each a central processing unit (CPU) or a graphics processing unit (GPU).
19. The computer implemented method of claim 17, wherein the second API request is associated with a source from which the one or more executables are compiled.
20. The computer implemented method of claim 19, wherein the selected executable is compiled offline from the source.
21. A data processing system, comprising:
means for, during a runtime of an application program in a first processing unit, in response to a second API request received from the application program, loading one or more executables for a data processing task of the application program, wherein the one or more executables are compatible with a second processing unit identified by a compute device identifier specified by the application program in the second API request, the compute device identifier having previously been associated, during the runtime, with a processing unit matching one or more requirements specified by the application program via a first API request; and
means for, in response to a third API request received from the application program during the runtime, selecting one of the one or more executables for the second processing unit.
22. A data processing system, comprising:
means for generating, by an application program in a first processing unit during a runtime, a first API request specifying one or more requirements of a second processing unit;
means for generating, by the application program during the runtime, a second API request to load one or more executables for a data processing task of the application program, wherein the one or more executables are compatible with the second processing unit identified by a compute device identifier specified by the application program in the second API request, the compute device identifier being associated with a processing unit matching the one or more requirements previously specified by the application program in the first API request; and
means for generating, by the application program during the runtime, a third API request to select an executable from the one or more executables for execution in the second processing unit.
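The claims above describe a three-request API workflow: a first request that resolves one or more device requirements to a compute device identifier, a second request that loads executables compatible with that device (compiling online from source when no stored binary supports the device or is up to date, per claims 8-13), and a third request that selects an executable for execution based on its description data (claims 14-15). The following is a minimal Python sketch of that logic under stated assumptions: every name (`Executable`, `Program`, `load_for_device`, `select_executable`) is hypothetical, chosen to mirror the claim language, and does not come from any real API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Executable:
    device_type: str   # description data: type of processing unit supported ("CPU"/"GPU")
    version: int       # description data: version of this binary

@dataclass
class Program:
    source: str                                        # source kept alongside the binaries
    executables: List[Executable] = field(default_factory=list)

def compile_online(source: str, device_type: str, version: int) -> Executable:
    """Stand-in for an online compiler producing a device-specific binary
    with updated description data (claim 12)."""
    return Executable(device_type=device_type, version=version)

def load_for_device(program: Program, device_type: str, current_version: int) -> None:
    """Second API request: compare description data against the identified
    device; if no stored executable supports it or is optimal, compile
    online from source and store the result, replacing any stale binary
    for that device (claims 8-13)."""
    for exe in program.executables:
        if exe.device_type == device_type and exe.version >= current_version:
            return  # a compatible, up-to-date executable is already loaded
    program.executables = [e for e in program.executables
                           if e.device_type != device_type]
    program.executables.append(
        compile_online(program.source, device_type, current_version))

def select_executable(program: Program, device_type: str) -> Optional[Executable]:
    """Third API request: select, via description data, the most recent
    executable for the device (claims 14-15)."""
    candidates = [e for e in program.executables
                  if e.device_type == device_type]
    return max(candidates, key=lambda e: e.version) if candidates else None

# Example: a program shipped with only a CPU binary is asked to run on a GPU.
prog = Program(source="kernel source text",
               executables=[Executable("CPU", 1)])
load_for_device(prog, "GPU", 2)          # no GPU binary: compiled online, stored
gpu_exe = select_executable(prog, "GPU")
```

The in-memory list stands in for the API library of claim-style stored executables; a real runtime would key binaries by the compute device identifier returned from the first API request rather than by a bare type string.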
CN201410187203.6A 2007-04-11 2008-04-09 Parallel runtime execution on multiple processors Active CN103927150B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US92303007P 2007-04-11 2007-04-11
US60/923,030 2007-04-11
US92562007P 2007-04-20 2007-04-20
US60/925,620 2007-04-20
US11/800,319 US8286196B2 (en) 2007-05-03 2007-05-03 Parallel runtime execution on multiple processors
US11/800,319 2007-05-03
CN200880011684.8A CN101802789B (en) 2007-04-11 2008-04-09 Parallel runtime execution on multiple processors

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN200880011684.8A Division CN101802789B (en) 2007-04-11 2008-04-09 Parallel runtime execution on multiple processors

Publications (2)

Publication Number Publication Date
CN103927150A CN103927150A (en) 2014-07-16
CN103927150B true CN103927150B (en) 2016-09-07

Family

ID=51145382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410187203.6A Active CN103927150B (en) 2007-04-11 2008-04-09 Parallel runtime execution on multiple processors

Country Status (1)

Country Link
CN (1) CN103927150B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113542382B (en) * 2018-03-30 2024-04-26 北京忆芯科技有限公司 KV storage device in cloud computing and fog computing system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301324A (en) * 1992-11-19 1994-04-05 International Business Machines Corp. Method and apparatus for dynamic work reassignment among asymmetric, coupled processors
CN1877490A (en) * 2006-07-04 2006-12-13 浙江大学 Method for saving energy by optimizing running frequency through combination of static compiler and dynamic frequency modulation techniques

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6513057B1 (en) * 1996-10-28 2003-01-28 Unisys Corporation Heterogeneous symmetric multi-processing system
US20070208956A1 (en) * 2004-11-19 2007-09-06 Motorola, Inc. Energy efficient inter-processor management method and system
JP4367337B2 (en) * 2004-12-28 2009-11-18 セイコーエプソン株式会社 Multimedia processing system and multimedia processing method


Also Published As

Publication number Publication date
CN103927150A (en) 2014-07-16

Similar Documents

Publication Publication Date Title
US11106504B2 (en) Application interface on multiple processors
CN101802789B (en) Parallel runtime execution on multiple processors
US20200250005A1 (en) Data parallel computing on multiple processors
CN102099788B (en) Application programming interfaces for data parallel computing on multiple processors
US20180203737A1 (en) Data parallel computing on multiple processors
US8108633B2 (en) Shared stream memory on multiple processors
CN103927150B (en) Parallel runtime execution on multiple processors
AU2018226440B2 (en) Data parallel computing on multiple processors
AU2014100505A4 (en) Parallel runtime execution on multiple processors

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant