CN103927150B - Parallel runtime execution on multiprocessors - Google Patents
- Publication number
- CN103927150B CN103927150B CN201410187203.6A CN201410187203A CN103927150B CN 103927150 B CN103927150 B CN 103927150B CN 201410187203 A CN201410187203 A CN 201410187203A CN 103927150 B CN103927150 B CN 103927150B
- Authority
- CN
- China
- Prior art keywords
- executable
- processing unit
- application program
- calculating
- computer implemented
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Stored Programmes (AREA)
Abstract
Parallel runtime execution on multiprocessors. A method and an apparatus are described that schedule a plurality of executables in a schedule queue for concurrent execution in one or more physical compute devices, such as CPUs or GPUs. One or more executables are compiled online from a source having an existing executable for a type of physical compute device different from the one or more physical compute devices. Dependency relations among elements corresponding to the scheduled executables are determined to select an executable to be executed by a plurality of threads concurrently in more than one of the physical compute devices. A thread initialized for executing an executable in a GPU of the physical compute devices is executed in a CPU of the physical compute devices if the GPU is busy with graphics processing threads. Sources and existing executables for an API function are stored in an API library to execute a plurality of executables in a plurality of physical compute devices, including the existing executables and executables compiled online from the sources.
Description
This application is a divisional of Chinese patent application No. 200880011684.8, filed April 9, 2008, entitled "Parallel runtime execution on multiprocessors".
Cross-Reference to Related Applications
Entitled " the DATA that the application submits on April 11st, 2007 with Aaftab Munshi etc.
PARALLEL COMPUTING ON MULTIPLE PROCESSORS " (on multiprocessor
Data parallel) U.S. Provisional Patent Application No.60/923,030 and Aaftab Munshi exist
Entitled " the PARALLEL RUNTIME EXECUTION ON that on April 20th, 2007 submits to
MULTIPLE PROCESSORS " U.S. of when parallel running (perform) on multiprocessor faces
Time patent application No.60/925,620 are correlated with, and require both rights and interests, and both is by drawing
With being incorporated herein.
Technical field
The present invention relates generally to data parallel computing, and more particularly to data parallel runtime execution across both central processing units (CPUs) and graphics processing units (GPUs).
Background
As GPUs continue to evolve into high-performance parallel compute devices, more and more applications are written to perform data parallel computations in GPUs similar to general-purpose compute devices. Today, these applications are designed to run on specific GPUs using vendor-specific interfaces. Thus, they are not able to leverage a CPU even when both a GPU and a CPU are available in a data processing system, nor can they be leveraged across GPUs from different vendors on which such an application may be running.

However, as more and more CPUs include multiple cores to perform data parallel computations, more and more processing tasks can be supported by whichever CPUs and/or GPUs are available. Traditionally, GPUs and CPUs are configured through separate programming environments that are not compatible with each other. Most GPUs require dedicated vendor-specific programs. As a result, it is very difficult for an application to leverage both CPUs and GPUs as processing resources, e.g., GPUs with data parallel computing capabilities together with multi-core CPUs.

Therefore, there is a need in modern data processing systems to overcome the above problems and allow an application to perform a task in any available processing resources, such as CPUs or one or more GPUs, capable of performing the task.
Summary of the invention
One embodiment of the present invention includes a method and an apparatus that load one or more executables for a data processing task of an application in response to an API request from the application running in a host processing unit. One of the loaded executables is selected, in response to another API request from the application, to be executed in another processing unit, such as a CPU or a GPU, attached to the host processing unit.
In an alternate embodiment, an application program running in a host processing unit generates an API request to load one or more executables for a data processing task. A second API request is then generated by the application program to select one of the loaded executables for execution in another processing unit, such as a CPU or a GPU, attached to the host processing unit.
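The two-request flow above can be sketched as follows. This is an illustrative model only: the patent does not name a concrete API surface here, and every identifier in this sketch is invented.

```python
# Toy model of the two API requests described above: one request loads
# executables for a data processing task, a second selects one of them
# for a target processing unit. All names are hypothetical.

class Runtime:
    def __init__(self):
        self.loaded = {}  # task name -> list of (device_type, executable)

    def load_executables(self, task, executables):
        """First API request: load one or more executables for a task."""
        self.loaded[task] = list(executables)

    def select_executable(self, task, device_type):
        """Second API request: pick a loaded executable for a CPU or GPU."""
        for dev, exe in self.loaded.get(task, []):
            if dev == device_type:
                return exe
        return None

rt = Runtime()
rt.load_executables("convolve", [("cpu", "convolve_cpu.bin"),
                                 ("gpu", "convolve_gpu.bin")])
print(rt.select_executable("convolve", "gpu"))  # convolve_gpu.bin
```

The key point the sketch captures is that loading and selection are separate requests, so the same loaded task can later be dispatched to whichever attached unit is appropriate.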
In an alternate embodiment, a source for a target processing unit is compiled during runtime based on an executable loaded to a processing unit. The processing unit and the target processing unit may be central processing units (CPUs) or graphics processing units (GPUs). A difference between the processing unit and the target processing unit is detected to retrieve the source from the loaded executable.
In an alternate embodiment, in response to an API request from an application, a task queue associated with a plurality of processing units, such as CPUs or GPUs, is updated with a new task including a plurality of executables. A condition is determined for scheduling the new task from the queue for execution in the plurality of processing units. Based on the determined condition, one of the plurality of executables associated with the new task is selected for execution.
In an alternate embodiment, in response to an API request from an application, a source for performing a data processing function is loaded from the application, to execute an executable in one or more of a plurality of target data processing units, such as CPUs or GPUs. The types of the target data processing units are determined automatically. An executable is compiled based on the determined types of the one or more target processing units in which it is to be executed.
In an alternate embodiment, a source and one or more corresponding executables compiled for a plurality of processing units are stored in an API library implementing an API function. In response to a request to the API library from an application running in a host processor, the source and the one or more corresponding executables of the API function are retrieved from the API library. An additional executable is compiled online from the retrieved source for an additional processing unit not included among the plurality of processing units. According to the API function, the additional executable and the one or more retrieved executables are executed concurrently in the additional processing unit together with one or more of the processing units.
In an alternate embodiment, an API call is received on a host processor to execute an application having a plurality of threads for execution. The host processor is coupled to a CPU and a GPU. The plurality of threads are scheduled asynchronously for parallel execution on the CPU and the GPU. A thread scheduled to be executed on the GPU may be executed in the CPU if the GPU is busy with graphics processing threads.
In an alternate embodiment, an API call is received on a host processor to execute an application having a plurality of threads for execution. The host processor is coupled to a CPU and a GPU. The plurality of threads are initialized asynchronously for parallel execution on the CPU and the GPU. A thread initialized to be executed on the GPU may be executed in the CPU if the GPU is busy with graphics processing threads.
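The GPU-busy fallback described in the two embodiments above can be sketched as a single placement decision. A real runtime does this asynchronously per thread; the toy function below only shows the decision rule, and all names are illustrative.

```python
# Illustrative sketch of the fallback described above: a thread
# initialized (or scheduled) for the GPU runs on a CPU instead when
# the GPU is busy with graphics processing threads.

def place_thread(preferred, gpu_busy):
    """Return the unit a thread actually runs on ('cpu' or 'gpu')."""
    if preferred == "gpu" and gpu_busy:
        return "cpu"   # GPU occupied by graphics work; fall back to CPU
    return preferred

placements = [place_thread(p, gpu_busy=True) for p in ("gpu", "cpu", "gpu")]
print(placements)  # ['cpu', 'cpu', 'cpu']
```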
Other features of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
Brief Description of the Drawings
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

Fig. 1 is a block diagram illustrating one embodiment of a system for configuring compute devices including CPUs and/or GPUs to perform data parallel computing for an application;

Fig. 2 is a block diagram illustrating an example of a compute device with multiple compute processors operating in parallel to execute multiple threads concurrently;

Fig. 3 is a block diagram illustrating one embodiment of a plurality of physical compute devices configured as a logical compute device via a compute device identifier;

Fig. 4 is a flow diagram illustrating one embodiment of a process to configure a plurality of physical compute devices with a compute device identifier by matching capability requirements received from an application;

Fig. 5 is a flow diagram illustrating one embodiment of a process to execute a compute executable in a logical compute device;

Fig. 6 is a flow diagram illustrating one embodiment of a runtime process to load an executable, including compiling a source for one or more physical compute devices determined to execute the executable;

Fig. 7 is a flow diagram illustrating one embodiment of a process to select a compute kernel execution instance from an execution queue for execution in one or more physical compute devices corresponding to a logical compute device associated with the execution instance;

Fig. 8A is a flow diagram illustrating one embodiment of a process to build an API library, which stores a source and a plurality of executables for one or more APIs in the library according to a plurality of physical compute devices;

Fig. 8B is a flow diagram illustrating one embodiment of a process for an application to execute one of a plurality of executables together with a corresponding source retrieved from an API library based on an API request;

Fig. 9 is sample source code illustrating an example of a compute kernel source for a compute kernel executable to be executed in a plurality of physical compute devices;

Fig. 10 is sample source code illustrating an example of a logical compute device configured by calling APIs to execute one of a plurality of executables in a plurality of physical compute devices;

Fig. 11 illustrates one example of a typical computer system with a plurality of CPUs and GPUs (graphics processing units) which may be used in conjunction with the embodiments described herein.
Detailed description of the invention
Methods and apparatuses for data parallel computing on multiple processors are described herein. In the following description, numerous specific details are set forth to provide a thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art, that embodiments of the present invention may be practiced without these specific details. In other instances, well-known components, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
The processes depicted in the figures that follow are performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a dedicated machine), or a combination of both. Although the processes are described below in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
A graphics processing unit (GPU) may be a dedicated graphics processor implementing highly efficient graphics operations, such as 2D and 3D graphics operations, and/or digital video related functions. A GPU may include special (programmable) hardware to perform graphics operations, e.g., blitter operations, texture mapping, polygon rendering, pixel shading, and vertex shading. GPUs are known to fetch data from a frame buffer and to blend pixels together to render an image back into the frame buffer for display. GPUs may also control the frame buffer and allow the frame buffer to be used to refresh a display, such as a CRT or LCD display, which is a short-persistence display that requires refresh at a rate of at least 20 Hz (e.g., every 1/30 of a second, the display is refreshed with data from the frame buffer). Usually, GPUs may take graphics processing tasks from CPUs coupled with the GPUs to output raster graphics images to display devices through display controllers.
" GPU " mentioned in this manual can be United States Patent (USP) No. such as Lindholdm etc.
7015913“Method and Apparatus for Multitheraded Processing of Data In a
Programmable Graphics Processor " (data in programmable graphics processor are many
The method and apparatus of thread process) and United States Patent (USP) No.6970206 " the Method for of Swan etc.
Deinterlacing Interlaced Video by A Graphics Processor " (at by figure
Reason device is to the method that deinterleaves of video after interweaving) described in image processor or able to programme
Graphic process unit, the two patent is incorporated herein by reference.
In one embodiment, a plurality of different types of processors, such as CPUs or GPUs, may perform data parallel processing tasks for one or more applications concurrently to increase the usage efficiency of available processing resources in a data processing system. Processing resources of a data processing system may be based on a plurality of physical compute devices. A physical compute device may be a CPU or a GPU. In one embodiment, data parallel processing tasks may be delegated to a plurality of types of processors, for example, CPUs or GPUs capable of performing the tasks. A data processing task may require certain specific processing capabilities from a processor. Processing capabilities may be, for example, dedicated texturing hardware support, double-precision floating-point arithmetic, dedicated local memory, stream data cache, or synchronization primitives. Separate types of processors may provide different yet overlapping sets of processing capabilities. For example, both a CPU and a GPU may be capable of performing double-precision floating-point computation. In one embodiment, an application is capable of leveraging either a CPU or a GPU, whichever is available, to perform a data parallel processing task.
In another embodiment, selecting and allocating a plurality of different types of processing resources for a data parallel processing task may be performed automatically during runtime. An application may send a hint including a list of desired capability requirements for a data processing task through an API (application programming interface) to a runtime platform of a data processing system. In accordance, the runtime platform may determine a plurality of currently available CPUs and/or GPUs with capabilities matching the received hints to delegate the data processing task of the application. In one embodiment, the list of capability requirements may depend on the underlying data processing task. A capability requirement list may be applicable across different sets of processors, including, for example, GPUs and multi-core CPUs of different versions from different vendors. Consequently, an application may be insulated from providing programs targeting a particular type of CPU or GPU.
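The capability-hint mechanism above can be illustrated with a small sketch in which the application names required capabilities and the runtime keeps only the attached devices whose capability set covers the list. The device names and capability strings below are invented for illustration; the patent does not fix a naming scheme.

```python
# Sketch of matching a capability-requirement hint against attached
# devices. Capability sets per device are hypothetical examples.

DEVICES = {
    "cpu0": {"double_fp", "local_mem"},
    "gpu0": {"double_fp", "texturing", "stream_cache"},
    "gpu1": {"texturing"},
}

def matching_devices(required):
    """Return attached devices whose capabilities cover `required`."""
    return sorted(d for d, caps in DEVICES.items() if required <= caps)

print(matching_devices({"double_fp"}))  # ['cpu0', 'gpu0']
```

Because the application states what it needs rather than which device to use, the same hint works across vendors and device generations, which is the insulation property the paragraph describes.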
Fig. 1 is a block diagram illustrating one embodiment of a system for configuring compute devices including CPUs and/or GPUs to perform data parallel computing for an application. System 100 may implement a parallel computing architecture. In one embodiment, system 100 may be a graphics system including one or more host processors coupled, through a data bus 113, with one or more central processing units 117 and one or more other processors such as media processors 115. The plurality of host processors may be networked together in a hosting system 101. The plurality of central processing units 117 may include multi-core CPUs from different vendors. A media processor may be a GPU with dedicated texture rendering hardware. Another media processor may be a GPU supporting both dedicated texture rendering hardware and double-precision floating-point architecture. Multiple GPUs may be connected together for Scalable Link Interface (SLI) or CrossFire configurations.
In one embodiment, the hosting system 101 may support a software stack including software stack components such as applications 103, a compute platform layer 111, a compute runtime layer 109, a compute compiler 107, and compute application libraries 105. An application 103 may interface with other stack components through API (application programming interface) calls. One or more threads may run concurrently for the application 103 in the hosting system 101. The compute platform layer 111 may maintain a data structure, or a computing device data structure, storing processing capabilities for each attached physical compute device. In one embodiment, an application may retrieve information about available processing resources of the hosting system 101 through the compute platform layer 111. An application may select and specify capability requirements for performing a processing task through the compute platform layer 111. Accordingly, the compute platform layer 111 may determine a configuration of physical compute devices to allocate and initialize processing resources for the processing task from the attached CPUs 117 and/or GPUs 115. In one embodiment, the compute platform layer 111 may generate one or more logical compute devices for the application corresponding to the one or more actual physical compute devices configured.
The compute runtime layer 109 may manage the execution of a processing task according to the configured processing resources for the application 103, for example, one or more logical compute devices. In one embodiment, executing a processing task may include creating a compute kernel object representing the processing task and allocating memory resources, e.g., for holding executables, input/output data, etc. An executable loaded for a compute kernel object may be a compute kernel executable. A compute executable may be included in a compute kernel object to be executed in a compute processor, such as a CPU or a GPU. The compute runtime layer 109 may interact with the allocated physical devices to carry out the actual execution of the processing task. In one embodiment, the compute runtime layer 109 may coordinate executing multiple processing tasks from different applications according to runtime states of each processor, such as a CPU or a GPU, configured for the processing tasks. The compute runtime layer 109 may select, based on the runtime states, one or more processors from the physical compute devices configured to perform the processing tasks. Performing a processing task may include executing multiple threads of one or more executables in a plurality of physical processing devices concurrently. In one embodiment, the compute runtime layer 109 may track the status of each executed processing task by monitoring the runtime execution status of each processor.
The runtime layer may load one or more executables corresponding to a processing task from the application 103. In one embodiment, the compute runtime layer 109 automatically loads additional executables required to perform the processing task from the compute application library 105. The compute runtime layer 109 may load both an executable and its corresponding source program for a compute kernel object from the application 103 or the compute application library 105. A source program for a compute kernel object may be a compute kernel program. A plurality of executables based on a single source program may be loaded according to a logical compute device configured to include multiple types and/or different versions of physical compute devices. In one embodiment, the compute runtime layer 109 may activate the compute compiler 107 to online compile a loaded source program into an executable optimized for a target processor, e.g., a CPU or a GPU, configured to execute the executable.
An online compiled executable may be stored for future invocation, in addition to the existing executables for the corresponding source program. In addition, compute executables may be compiled offline and loaded to the compute runtime 109 via API calls. The compute application library 105 and/or the application 103 may load an associated executable in response to library API requests from an application. Newly compiled executables may be dynamically updated for the compute application library 105 or for the application 103. In one embodiment, the compute runtime 109 may replace an existing compute executable in an application by a new executable online compiled through the compute compiler 107 for a newly upgraded version of a compute device. The compute runtime 109 may insert the new online compiled executable to update the compute application library 105. In one embodiment, the compute runtime 109 may invoke the compute compiler 107 when loading an executable for a processing task. In another embodiment, the compute compiler 107 may be invoked offline to build executables for the compute application library 105. The compute compiler 107 may compile and link a compute kernel program to generate a compute kernel executable. In one embodiment, the compute application library 105 may include a plurality of functions to support, for example, development toolkits and/or image processing. Each library function may correspond to a compute source program and one or more executables stored in the compute application library 105 for a plurality of physical compute devices.
Fig. 2 is a block diagram illustrating an example of a compute device with multiple compute processors operating in parallel to execute multiple threads concurrently. Each compute processor may execute a plurality of threads in parallel (or concurrently). Threads that can be executed in parallel may be referred to as a thread block. A compute device may have multiple thread blocks that can be executed in parallel. For example, M threads are shown to execute as a thread block in compute device 205. Threads in multiple thread blocks, e.g., thread 1 of compute processor_1 205 and thread N of compute processor_L 203, may execute in parallel across separate compute processors on one compute device or across multiple compute devices. A plurality of thread blocks across multiple compute processors may execute a compute kernel executable in parallel. More than one compute processor may be based on a single chip, such as an ASIC (application-specific integrated circuit) device. In one embodiment, multiple threads from an application may be executed concurrently in more than one compute processor across multiple chips.
A compute device may include one or more compute processors, such as compute processor_1 205 and compute processor_L 203. A local memory may be coupled with a compute processor. The local memory, coupled with a compute processor, may be shared among threads in a single thread block running in that compute processor. Multiple threads across different thread blocks, such as thread 1 213 and thread N 209, may share a stream stored in a stream memory 217 coupled to the compute device 201. A stream may be a collection of elements that can be operated on by a compute kernel executable, such as an image stream or a variable stream. A variable stream may be allocated to store global variables operated on during a processing task. An image stream may be a buffer that may be used for an image buffer, a texture buffer, or a frame buffer.
In one embodiment, a local memory for a compute processor may be implemented as a dedicated local storage, such as local shared memory 219 for processor_1 and local shared memory 211 for processor_L. In another embodiment, a local memory for a compute processor may be implemented as a stream read-write cache for the stream memory of one or more compute processors of a compute device, such as stream data cache 215 for the compute processors 205, 203 in compute device 201. In another embodiment, a local memory may implement a dedicated local storage shared among threads in a thread block running in the compute processor coupled with the local memory, such as local shared memory 219 coupled with compute processor_1 205. A dedicated local storage may not be shared by threads across different thread blocks. If the local memory of a compute processor, such as processor_1 205, is implemented as a stream read-write cache, e.g., stream data cache 215, a variable declared to be in the local memory may be allocated from the stream memory 217 and cached in the implemented stream read-write cache, e.g., stream data cache 215, that implements the local memory. Threads within a thread block may share local variables allocated in the stream memory 217 when, for example, neither the stream read-write cache nor a dedicated local storage is available for the corresponding compute device. In one embodiment, each thread is associated with a private memory to store thread private variables used by functions invoked in the thread. For example, private memory 1 211 may be accessed only by thread 1 213.
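The three-level visibility described above — private memory per thread, local (shared) memory per thread block, stream memory across all threads — can be modeled with a small scope check. The model is illustrative only; threads are identified here by an invented (block, index) pair.

```python
# Toy model of the memory hierarchy described above: private memory is
# visible to one thread, local memory to threads of one thread block,
# and stream (global) memory to every thread.

def visible(scope, reader, owner):
    """Can `reader` (block, thread) access memory owned by `owner`?"""
    if scope == "private":
        return reader == owner        # same thread only
    if scope == "local":
        return reader[0] == owner[0]  # same thread block
    return True                       # stream memory: all threads

t0, t1, u0 = ("block0", 0), ("block0", 1), ("block1", 0)
print(visible("private", t1, t0))  # False
print(visible("local",   t1, t0))  # True
print(visible("stream",  u0, t0))  # True
```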
Fig. 3 is a block diagram illustrating one embodiment of a plurality of physical compute devices configured as a logical compute device via a compute device identifier. In one embodiment, an application 303 and a platform layer 305 may be running in a host CPU 301. The application 303 may be one of the applications 103 of Fig. 1. The hosting system 101 may include the host CPU 301. Each of the physical compute devices Physical_Compute_Device-1 305 through Physical_Compute_Device-N 311 may be one of the CPUs 117 or GPUs 115 of Fig. 1. In one embodiment, the compute platform layer 111 may generate a compute device identifier 307 in response to API requests from the application 303, for configuring data parallel processing resources according to a list of capability requirements included in the API requests. The compute device identifier 307 may refer to a selection of actual physical compute devices Physical_Compute_Device-1 305 through Physical_Compute_Device-N 311 according to the configuration performed by the compute platform layer 111. In one embodiment, a logical compute device 309 may represent the group of selected actual physical compute devices, separate from the host CPU 301.
Fig. 4 is a flow diagram illustrating one embodiment of a process to configure a plurality of physical compute devices with a compute device identifier by matching capability requirements received from an application. Process 400 may be performed, in accordance with the system 100 of Fig. 1, in a data processing system hosted by the hosting system 101. The data processing system may include a host processor hosting a platform layer (such as the compute platform layer 111 of Fig. 1) and multiple physical compute devices attached to the host processor (e.g., the CPUs 117 and GPUs 115 of Fig. 1).
At block 401, in one embodiment, process 400 may build a data structure (or a computing device data structure) representing a plurality of physical compute devices associated with one or more corresponding capabilities. Each physical compute device may be attached to the processing system performing the process 400. Capabilities, or compute capabilities, of a physical compute device such as a CPU or a GPU may include whether the physical compute device supports a processing feature, a memory access mechanism, or a named extension. A processing feature may be related to dedicated texturing hardware support, double-precision floating-point arithmetic, or synchronization support (e.g., mutex). A memory access mechanism for a physical processing device may be related to a type of variable stream cache, a type of image stream cache, or dedicated local memory support. A system application of the data processing system may update the data structure in response to a new physical compute device being attached to the data processing system. In one embodiment, the capabilities of a physical compute device may be predetermined. In another embodiment, a system application of the data processing system may discover a newly attached physical processing device during runtime. The system application may retrieve the capabilities of the newly discovered physical compute device to update the data structure representing the attached physical compute devices and their corresponding capabilities.
According to one embodiment, at block 403, process 400 may receive a compute capability requirement from an application. The application may send the capability requirement to a system application by calling APIs. The system application may correspond to a platform layer of a software stack in the hosting system for the application. In one embodiment, a capability requirement may identify a list of required capabilities for requesting processing resources to perform a task for the application. In one embodiment, the application may require the requested processing resources to perform the task in a plurality of threads concurrently. In response, at block 405, process 400 may select a group of physical compute devices from the attached physical compute devices. The selection may be determined based on a matching between the capability requirement and the compute capabilities stored in the capability data structure. In one embodiment, process 400 may perform the matching according to a hint provided by the capability requirement.
Process 400 may determine a match score according to the number of computing capabilities matched between a physical computing device and the capability requirement. In one embodiment, process 400 may select multiple physical computing devices having the highest match scores. In another embodiment, process 400 may select a physical computing device if each capability in the capability requirement is matched. Process 400 may determine, at block 405, multiple groups of matching physical computing devices. In one embodiment, each group of matching physical devices is selected according to load balancing capabilities. In one embodiment, at block 407, process 400 may generate a computing device identifier for each group of physical computing devices selected at block 405. Process 400 may return the one or more generated computing device identifiers to the application through an API call. The application may choose, according to the computing device identifiers, which processing resources to employ for performing its tasks. In one embodiment, process 400 may generate at most one computing device identifier at block 407 for each capability requirement received.
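The capability match of blocks 403-405 can be sketched in C. The bitmask representation and the capability names below are illustrative assumptions, not part of the patent's API; they merely show counting matched capabilities versus requiring that every capability match.

```c
#include <assert.h>

/* One capability flag per bit; a device is compared bit-by-bit against
   a requirement. The specific capability names are hypothetical. */
typedef unsigned int caps_t;

enum { CAP_FP64 = 1u << 0, CAP_IMAGES = 1u << 1, CAP_LOCAL_MEM = 1u << 2 };

/* Count how many required capabilities the device satisfies (match score). */
static int match_score(caps_t required, caps_t device_caps) {
    caps_t matched = required & device_caps;
    int score = 0;
    while (matched) { score += matched & 1u; matched >>= 1; }
    return score;
}

/* A device qualifies outright only if every required capability is present. */
static int matches_all(caps_t required, caps_t device_caps) {
    return (required & device_caps) == required;
}
```

A runtime could rank devices by `match_score` (the "highest match score" embodiment) or filter with `matches_all` (the "each capability matched" embodiment).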
In one embodiment, at block 409, process 400 may allocate resources to initialize a logical computing device for the group of physical computing devices selected at block 405, according to the corresponding computing device identifier. Process 400 may perform the initialization of the logical computing device, in accordance with the selection at block 405, in response to an API request from an application that has received one or more computing device identifiers. Process 400 may create a context object on the logical computing device for the application. In one embodiment, the context object is associated with an application thread in the hosting system running the application. Multiple threads performing processing tasks concurrently, whether in one logical computing device or across different logical computing devices, may be based on separate context objects.
In one embodiment, process 400 may be based on multiple APIs including cuCreateContext, cuRetainContext and cuReleaseContext. The API cuCreateContext creates a compute context. A compute context may correspond to a compute context object. The API cuRetainContext increments the number of instances using the particular compute context identified by the context passed as the input argument of cuRetainContext. The API cuCreateContext performs an implicit retain. This is helpful for third-party libraries, which typically obtain a context passed to them by the application. However, it is possible that the application may delete the context without informing the library. Allowing multiple instances to attach to and release a context solves the problem of a compute context used by a library no longer being valid. If the input argument of cuRetainContext does not correspond to a valid compute context object, cuRetainContext returns CU_INVALID_CONTEXT. The API cuReleaseContext releases an instance from a valid compute context. If the input argument of cuReleaseContext does not correspond to a valid compute context object, cuReleaseContext returns CU_INVALID_CONTEXT.
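The retain/release semantics above amount to reference counting. A minimal C sketch follows; the `cu_context` structure and function signatures are simplified assumptions for illustration, not the actual API definitions.

```c
#include <assert.h>
#include <stddef.h>

#define CU_SUCCESS          0
#define CU_INVALID_CONTEXT -1

/* Hypothetical simplified context: a validity flag plus a reference count. */
typedef struct { int valid; int refcount; } cu_context;

/* cuCreateContext performs an implicit retain, so the count starts at 1. */
static int cu_create_context(cu_context *ctx) {
    ctx->valid = 1;
    ctx->refcount = 1;
    return CU_SUCCESS;
}

static int cu_retain_context(cu_context *ctx) {
    if (ctx == NULL || !ctx->valid) return CU_INVALID_CONTEXT;
    ctx->refcount++;
    return CU_SUCCESS;
}

static int cu_release_context(cu_context *ctx) {
    if (ctx == NULL || !ctx->valid) return CU_INVALID_CONTEXT;
    if (--ctx->refcount == 0) ctx->valid = 0;  /* last release destroys it */
    return CU_SUCCESS;
}
```

A library handed a context can retain it on entry and release it on exit, so the context stays valid even if the application releases its own reference in between.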
Fig. 5 is a flow chart illustrating an embodiment of a process for executing a compute executable in a logical computing device. In one embodiment, process 500 may be performed by a runtime layer in a data processing system (e.g., the compute runtime layer 109 of Fig. 1). At block 501, process 500 may allocate one or more streams for a compute executable to be run on a logical computing device. A processing task may be performed by a compute executable operating on the streams. In one embodiment, a processing task may include input streams and output streams. Process 500 may map an allocated stream memory to or from a logical address of an application. In one embodiment, process 500 may perform the operations of block 501 based on an API request from an application.
At block 503, according to one embodiment, process 500 may create a compute kernel object for the logical computing device. A compute kernel object may be an object created for the streams and executables associated with the corresponding processing task for performing a function. Process 500 may set up function arguments for the compute kernel object at block 505. Function arguments may include streams allocated as function inputs or outputs, such as the streams allocated at block 501. Process 500 may load a compute kernel executable and/or a compute kernel source into the compute kernel object at block 507. A compute kernel executable may be an executable to be executed according to a logical computing device to perform the corresponding processing task associated with the kernel object. In one embodiment, a compute kernel executable may include description data associated with, for example, the type, version and/or compilation options of the target physical computing device. A compute kernel source may be the source code from which the compute kernel executable is compiled. Process 500 may load, at block 507, multiple compute kernel executables corresponding to one compute kernel source. Process 500 may load a compute kernel executable from an application or through a compute library such as the compute application library 105 of Fig. 1. A compute kernel executable may be loaded together with the corresponding compute kernel source. In one embodiment, process 500 may perform the operations at blocks 503, 505 and 507 according to API requests from an application.
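The compute kernel object of blocks 503-507 can be pictured as a container for argument streams and loaded executables. The C structure below is a hypothetical sketch under that reading; none of these type or field names come from the patent.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ARGS 8
#define MAX_EXES 4

typedef struct { int id; size_t size; } stream_t;              /* input/output stream */
typedef struct { int device_type; int device_version; } exe_t; /* description data */

/* Compute kernel object: function arguments (block 505) plus the
   executables loaded for it (block 507). */
typedef struct {
    stream_t *args[MAX_ARGS];
    int       nargs;
    exe_t     exes[MAX_EXES];
    int       nexes;
} kernel_object;

static int kernel_set_arg(kernel_object *k, stream_t *s) {
    if (k->nargs >= MAX_ARGS) return -1;
    k->args[k->nargs++] = s;
    return 0;
}

static int kernel_load_executable(kernel_object *k, exe_t e) {
    if (k->nexes >= MAX_EXES) return -1;
    k->exes[k->nexes++] = e;  /* one kernel source may yield several executables */
    return 0;
}
```

Loading several executables per object is what later lets the scheduler (block 517) pick the one matching whichever physical device is available.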
At block 511, process 500 may update an execution queue to execute the compute kernel object with the logical computing device. Process 500 may execute the compute kernel with appropriate arguments using the compute runtime (e.g., the compute runtime 109 of Fig. 1) in response to an API call from an application or a compute application library (e.g., the application 103 or compute application library 105 of Fig. 1). In one embodiment, process 500 may generate a compute kernel execution instance to execute the compute kernel. The API call itself to the compute runtime (e.g., the compute runtime 109 of Fig. 1) for executing a compute kernel may be essentially asynchronous in nature. An execution instance may be identified by a compute event object returned by the compute runtime (e.g., the compute runtime 109 of Fig. 1). A compute kernel execution instance may be added to the execution queue for executing the instance. In one embodiment, the API call to the execution queue for executing a compute kernel execution instance may include the number of threads to execute simultaneously in parallel on a compute processor and the number of compute processors to use. A compute kernel execution instance may include a priority value indicating the desired priority for executing the corresponding compute kernel object. A compute kernel execution instance may also include an event object identifying a previous execution instance and/or the expected number of threads and the expected number of thread blocks for performing the execution. The number of thread blocks and the number of threads may be specified in the API call. In one embodiment, an event object may indicate an execution order relationship between the execution instance that includes the event object and another execution instance identified by the event object. The execution instance including the event object may be required to execute after the other execution instance identified by the event object finishes execution. The event object may be referred to as a queue_after_event_object. In one embodiment, an execution queue may include multiple compute kernel execution instances for executing the corresponding compute kernel objects. One or more compute kernel execution instances for a compute kernel object may be scheduled for execution in the execution queue. In one embodiment, process 500 may update the execution queue in response to API requests from an application. The execution queue may be hosted by the hosting data system in which the application runs.
At block 513, process 500 may select a compute kernel execution instance from the execution queue for execution. In one embodiment, process 500 may select more than one compute kernel execution instance to be executed concurrently, according to the corresponding logical computing devices. Process 500 may determine whether a compute kernel execution instance is selected from the execution queue based on its priority and its dependency relationships with other execution instances in the queue. A compute kernel execution instance may be executed by executing its corresponding compute kernel object according to an executable loaded into that compute kernel object.
At block 517, in one embodiment, process 500 may select one of the multiple executables loaded into the compute kernel object corresponding to the selected compute kernel execution instance, for execution in a physical computing device associated with the logical computing device of the compute kernel object. Process 500 may select more than one executable for one compute kernel execution instance, to be executed in parallel in more than one physical computing device. The selection may be based on the current execution statuses of the physical computing devices corresponding to the logical computing device associated with the selected compute kernel execution instance. The execution status of a physical computing device may include the number of running threads, the local memory usage level and the processor utilization level (e.g., the peak number of operations per unit time), etc. In one embodiment, the selection may be based on predetermined utilization levels. In another embodiment, the selection may be based on the number of threads and the number of thread blocks associated with the compute kernel execution instance. Process 500 may retrieve an execution status from a physical computing device. In one embodiment, process 500 may perform the operations to select a compute kernel execution instance from the execution queue, at blocks 513 and 517, asynchronously to the applications running in the hosting system.
At block 519, process 500 may check the execution status of a compute kernel execution instance scheduled for execution in the execution queue. Each execution instance may be identified by a unique compute event object. An event object may be returned to the application or compute application library (e.g., the application 103 or compute application library 105 of Fig. 1) that called the API to execute the execution instance, when the corresponding compute kernel execution instance was queued according to the compute runtime (e.g., the compute runtime 109 of Fig. 1). In one embodiment, process 500 may perform the execution status check in response to an API request from an application. Process 500 may determine the completion of a compute kernel execution instance by querying the status of the compute event object identifying that instance. Process 500 may wait until the execution of the compute kernel execution instance is complete before returning from the application's API call. Process 500 may control processing reads from and/or writes to various streams based on event objects.
At block 521, according to one embodiment, process 500 may retrieve the results of executing a compute kernel execution instance. Subsequently, process 500 may clean up the processing resources allocated for executing that instance. In one embodiment, process 500 may copy the stream memory holding the results of executing the compute kernel executable into a local memory. Process 500 may delete variable streams or image streams allocated at block 501. Process 500 may delete the kernel event object used for detecting when the compute kernel execution is complete. If each compute kernel execution instance associated with a particular compute kernel object has been completely executed, process 500 may delete that compute kernel object. In one embodiment, process 500 may perform the operations of block 521 based on an API request initiated by an application.
Fig. 6 is a flow chart illustrating an embodiment of a runtime process for loading an executable, including compiling a source for one or more physical computing devices determined to execute the executable. Process 600 may be performed as part of process 500 at block 507 of Fig. 5. In one embodiment, process 600 may, at block 601, select for each physical computing device associated with a logical computing device one or more existing compute kernel executables compatible with that physical computing device. A compute kernel executable may be executed in a compatible physical computing device. The existing compute kernel executables may be obtained from an application or through a compute library such as the compute application library 105 of Fig. 1. Each of the selected compute kernel executables may be executable by at least one physical computing device. In one embodiment, the selection may be based on the description data associated with the existing compute kernel executables.
If there are selected existing compute kernel executables, process 600 may determine, at block 603, whether any of them is optimal for the physical computing device. The determination may be based, for example, on the version of the physical computing device. In one embodiment, process 600 may determine that an existing compute kernel executable is optimal for the physical computing device if the version of the target physical computing device in its description data matches the version of the physical computing device.
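The "optimal" test at block 603 reduces to comparing the description data of an executable with the device it is to run on. A minimal C sketch, with hypothetical structure names:

```c
#include <assert.h>

/* A physical computing device's identity, as matched against description data. */
typedef struct { int type; int version; } device_desc;

/* Description data stored with each existing executable. */
typedef struct { device_desc target; } executable_desc;

/* An executable is "optimal" when the target version (and type) in its
   description data matches the physical computing device. */
static int is_optimal(const executable_desc *e, const device_desc *dev) {
    return e->target.type == dev->type && e->target.version == dev->version;
}
```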
At block 605, in one embodiment, process 600 may use an online compiler (e.g., the compute compiler 107 of Fig. 1) to build an optimal new compute kernel executable for the physical computing device from the corresponding compute kernel source. Process 600 may perform the online build if none of the compute kernel executables selected at block 603 is found to be optimal for the physical computing device. In one embodiment, process 600 may perform the online build if none of the existing compute kernel executables found at block 601 is compatible with the physical computing device. The compute kernel source may be obtained from an application or through a compute library such as the compute application library 105 of Fig. 1.
If the build at block 605 is successful, in one embodiment, process 600 may load the newly built compute kernel executable into the corresponding compute kernel object at block 607. Otherwise, process 600 may load the selected compute kernel executables into the kernel object at block 609. In one embodiment, process 600 may load a compute kernel executable into a compute kernel object only if it has not yet been loaded. In another embodiment, if none of the existing compute kernel executables of a compute kernel object is compatible with the physical computing device, and the corresponding compute kernel source is not available, process 600 may generate an error message.
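The decision logic of blocks 601-609 can be condensed into a single C function. The flag-based encoding is an assumption made for illustration; the patent describes the outcomes, not this representation.

```c
#include <assert.h>

enum load_action { LOAD_NEW_BUILD, LOAD_EXISTING, LOAD_ERROR };

/* Blocks 601-609 in miniature: build online when nothing compatible or
   nothing optimal exists and a kernel source is available; otherwise fall
   back to a compatible existing executable; error when neither exists. */
static enum load_action choose_load(int have_compatible, int have_optimal,
                                    int have_source, int build_ok) {
    if ((!have_compatible || !have_optimal) && have_source && build_ok)
        return LOAD_NEW_BUILD;                 /* block 607 */
    if (have_compatible)
        return LOAD_EXISTING;                  /* block 609 */
    return LOAD_ERROR;                         /* no executable and no source */
}
```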
Fig. 7 is a flow chart illustrating one embodiment of a process to select a compute kernel execution instance from an execution queue, for execution in one or more physical computing devices corresponding to the logical computing device associated with that execution instance. Process 700 may be performed as part of process 500 at block 513 of Fig. 5. In one embodiment, process 700 may identify, at block 701, dependency conditions among the compute kernel execution instances currently scheduled in the execution queue. A dependency condition of a compute kernel execution instance may prevent its execution if the condition is outstanding. In one embodiment, a dependency may be based on relationships between input streams fed by output streams. In one embodiment, process 700 may detect dependency conditions between execution instances according to the input streams and output streams of their corresponding functions. In another embodiment, an execution instance with a lower priority may have a dependency relationship with another execution instance with a higher priority.
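Detecting the stream-based dependency described above is a matter of checking whether one instance's output stream feeds another's input stream. A C sketch, with hypothetical stream identifiers:

```c
#include <assert.h>

#define MAX_STREAMS 4

/* Input and output stream ids of one execution instance's function. */
typedef struct {
    int inputs[MAX_STREAMS];  int nin;
    int outputs[MAX_STREAMS]; int nout;
} instance_io;

/* b depends on a when one of a's output streams feeds one of b's inputs. */
static int depends_on(const instance_io *b, const instance_io *a) {
    for (int i = 0; i < b->nin; i++)
        for (int j = 0; j < a->nout; j++)
            if (b->inputs[i] == a->outputs[j]) return 1;
    return 0;
}
```

A scheduler can run this pairwise over the queued instances and withhold any instance whose dependencies are still outstanding.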
At block 703, in one embodiment, process 700 may select, from the multiple scheduled compute kernel execution instances, one without any outstanding dependency condition for execution. The selection may be based on the priorities assigned to the execution instances. In one embodiment, the selected compute kernel execution instance may be associated with the highest priority among the compute kernel execution instances without outstanding dependency conditions. At block 705, process 700 may retrieve the current execution statuses of the physical computing devices corresponding to the selected compute kernel execution instance. In one embodiment, the execution status of a physical computing device may be retrieved from a predetermined memory location. In another embodiment, process 700 may send a status request to a physical computing device and receive an execution status report. Process 700 may designate one or more of the physical computing devices to execute the selected compute kernel execution instance, based on the retrieved execution statuses, at block 707. In one embodiment, a physical computing device may be designated for execution according to load balancing considerations with the other physical computing devices. A selected physical computing device may be associated with an execution status satisfying predetermined criteria (e.g., below predetermined processor utilization and/or memory utilization levels). In one embodiment, the predetermined criteria may depend on the number of threads and the number of thread blocks associated with the selected compute kernel execution instance. Process 700 may load separate compute kernel executables for the same execution instance, or for multiple instances, into one or more designated physical computing devices, to execute in parallel in multiple threads.
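The load-balancing designation at block 707 can be sketched as picking the least-utilized device whose status falls below the predetermined thresholds. The structure fields and thresholds below are illustrative assumptions:

```c
#include <assert.h>

/* Execution status of one physical computing device. */
typedef struct {
    int threads_running;
    int mem_usage_pct;   /* local memory usage level, percent */
    int cpu_usage_pct;   /* processor utilization level, percent */
} device_status;

/* Designate the least-utilized device satisfying the predetermined
   criteria (utilization below the given thresholds); -1 if none qualifies. */
static int pick_device(const device_status *devs, int n,
                       int max_cpu_pct, int max_mem_pct) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (devs[i].cpu_usage_pct >= max_cpu_pct) continue;
        if (devs[i].mem_usage_pct >= max_mem_pct) continue;
        if (best < 0 || devs[i].cpu_usage_pct < devs[best].cpu_usage_pct)
            best = i;
    }
    return best;
}
```

In the GPU-busy scenario from the abstract, the GPU's high utilization would disqualify it here, and the instance would be dispatched to a CPU instead.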
Fig. 8 A is the stream of the embodiment illustrating the process setting up API (API) storehouse
Cheng Tu, this process performs being used for the multiple of one or more API according to multiple physical computing devices
Body and source are stored in storehouse.Process 800A can at block 801 by off-line execution with by the source of api function
Code is loaded in data handling system.Source code can be in one or more physical computing devices
Calculating kernel source to be performed.In one embodiment, process 800A can at block 803 for
Api function assigns multiple target physical to calculate device.Can according to type (such as, CPU or
GPU), version or supplier assign target physical to calculate device.Processing 800A can be at block 805
Place calculates device for each target physical assigned and compiles source code into executable, such as,
Calculate kernel executable.In one embodiment, processing 800A can be based on compiled online device (example
Calculating compiler 107 such as Fig. 1) carry out compilation offline.At block 807, processing 800A can be by API
The source code of function calculates, to for the target physical assigned, the corresponding executable that device is compiled out
Store in API library.In one embodiment, each executable can be stored and describe data,
Describe data and such as include that target physical calculates the type of device, version and supplier and/or compiling choosing
?.By run time between process (such as, the process 500 of Fig. 5) description data can be taken out.
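The stored pairing of source, executables and description data can be modeled as a small lookup table. The structures below are hypothetical; the patent specifies what is stored, not the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DEV_CPU, DEV_GPU };

/* Each stored executable carries description data: target type, version
   and vendor (compile options omitted here for brevity). */
typedef struct {
    int         type;
    int         version;
    const char *vendor;
    const char *binary;   /* stands in for the compiled executable bytes */
} lib_entry;

/* One API function's slot in the library: its kernel source plus the
   executables compiled offline for each designated target device. */
typedef struct {
    const char      *source;
    const lib_entry *entries;
    int              nentries;
} api_library;

/* Runtime lookup (as in process 500) by the description data. */
static const lib_entry *lib_find(const api_library *lib, int type, int version) {
    for (int i = 0; i < lib->nentries; i++)
        if (lib->entries[i].type == type && lib->entries[i].version == version)
            return &lib->entries[i];
    return NULL;
}
```

When `lib_find` returns NULL, the runtime still has `source` available for an online build (the Fig. 6 path).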
Fig. 8 B is to illustrate application to perform in multiple executable and based on API request from API library
The flow chart of one embodiment of the process of the respective sources taken out.In one embodiment, process
800B is (example in the data handling system including API library (such as, the calculating application library 105 of Fig. 1)
As, in the mandatory system 101 of Fig. 1) run application program (such as, the application 103 of Fig. 1).At block
At 811, process 800B and can take out source (such as, calculating kernel from API library based on API request
Source) and one or more corresponding executable (such as, calculate kernel executable), such as Fig. 5
Block 507 at process 500.Each executable can calculate device with one or more target physical
It is associated.In one embodiment, calculating kernel executable can be with the physical computing of miscellaneous editions
Device backward compatibility.At block 813, process 800B and can perform base in multiple physical computing devices
In the executable that API request is taken out one performs the api function being associated, such as Fig. 5
Block 517 at process 500.Processing 800B can be with execution api function at block 813 asynchronously at block
Application is performed at 809.
Fig. 9 is sample source code illustrating an example of a compute kernel source for a compute kernel executable to be executed in multiple physical computing devices. Example 900 may be an API function with arguments including a variable 901 and streams 903. Example 900 may be based on a programming language for a parallel computing environment such as the system 101 of Fig. 1. In one embodiment, the parallel programming language may be specified according to the ANSI (American National Standards Institute) C standard with additional extensions and restrictions designed to implement one or more of the embodiments described herein. The extensions may include a function qualifier, such as qualifier 905, to specify a compute kernel function to be executed in a computing device. A compute kernel function may not be called by other compute kernel functions. In one embodiment, a compute kernel function may be called by a host function in the parallel programming language. A host function may be a regular ANSI C function. A host function may be executed in a host processor separate from the computing device executing the compute kernel function. In one embodiment, the extensions may include a local qualifier to describe variables that need to be allocated in a local memory associated with the computing device and shared by all threads of a thread block. The local qualifier may be declared inside a compute kernel function. The restrictions of the parallel programming language may be enforced during compile time or run time to generate error conditions, such as outputting error messages or exiting the execution, when the restrictions are violated.
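A sketch of what such a kernel source might look like. The qualifier spellings `__kernel` and `__local` are assumptions for illustration (Fig. 9's actual spellings are not reproduced here); defining them as empty macros lets the sketch also compile as plain ANSI C on a host compiler, while a compute compiler would interpret them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical spellings of the function qualifier and local qualifier. */
#define __kernel          /* function runs on a computing device */
#define __local           /* variable lives in per-thread-block local memory */

/* A compute kernel function in the style described for example 900:
   one scalar variable plus input/output streams as arguments. */
__kernel void scale_stream(float factor, const float *in, float *out, size_t n) {
    __local float shared_scratch;   /* example of a local-qualified declaration */
    (void)shared_scratch;
    for (size_t i = 0; i < n; i++)  /* a device would split this across threads */
        out[i] = factor * in[i];
}
```

A host function (ordinary ANSI C, running on the host processor) would not call `scale_stream` directly but would enqueue it through the runtime APIs, as Figure 10 illustrates.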
Figure 10 is sample source code illustrating an example of configuring a logical computing device, by calling APIs, to execute one of multiple executables in multiple physical computing devices. Example 1000 may be executed by an application running in a host system attached to multiple physical computing devices (e.g., the hosting system 101 of Fig. 1). Example 1000 may specify a host function of the parallel programming language. The processing operations in example 1000 may be performed as API calls by a process such as process 500 of Fig. 5. The processing operations of allocating streams 1001 and loading a stream image 1003 may be performed by process 500 at block 501 of Fig. 5. The processing operation of creating a compute kernel object 1005 may be performed by process 500 at block 503 of Fig. 5. Processing operation 1007 may load a compute kernel source, such as example 900 of Fig. 9, into the created compute kernel object. Processing operation 1009 may explicitly build a compute kernel executable from the loaded compute kernel source. In one embodiment, processing operation 1009 may load the built compute kernel executable into the created compute kernel object. Subsequently, processing operation 1011 may explicitly select the built compute kernel executable for executing the created compute kernel object.
In one embodiment, processing operation 1013 may attach variables and streams as function arguments for the created compute kernel object. Processing operation 1013 may be performed by process 500 at block 505 of Fig. 5. Processing operation 1015 may execute the created compute kernel object. In one embodiment, processing operation 1015 may be performed by process 500 at block 511 of Fig. 5. Processing operation 1015 may cause the execution queue to be updated with a compute kernel execution instance corresponding to the created compute kernel object. Processing operation 1017 may synchronously wait for the completion of executing the created compute kernel object. In one embodiment, processing operation 1019 may retrieve results from executing the compute kernel object. Subsequently, processing operation 1021 may clean up the resources allocated for executing the compute kernel object, such as an event object, the created compute kernel object and the allocated memories. In one embodiment, processing operation 1017 may be based on whether a kernel event object is set. Processing operation 1017 may be performed by process 500 at block 519 of Fig. 5.
Figure 11 shows one example of a computer system which may be used with one embodiment of the present invention. For example, system 1100 may be implemented as part of the systems shown in Fig. 1. Note that while Figure 11 illustrates various components of a computer system, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the present invention. It will also be appreciated that network computers and other data processing systems which have fewer components, or perhaps more components (e.g., handheld computers, personal digital assistants (PDAs), cellular telephones, entertainment systems, consumer electronic devices, etc.), may also be used to implement one or more embodiments of the present invention.
As shown in Figure 11, the computer system 1101, which is a form of data processing system, includes a bus 1103 coupled to one or more microprocessors 1105, such as CPUs and/or GPUs, a ROM (read-only memory) 1107, a volatile RAM 1109 and a non-volatile memory 1111. The microprocessors 1105 may retrieve instructions from the memories 1107, 1109, 1111 and execute the instructions to perform the operations described above. The bus 1103 interconnects these various components together, and also interconnects the components 1105, 1107, 1109 and 1111 to a display controller and display device 1113 and to peripheral devices such as input/output (I/O) devices, which may be mice, keyboards, modems, network interfaces, printers and other devices well known in the art. Typically, the input/output devices 1115 are coupled to the system through input/output controllers 1117. The volatile RAM (random access memory) 1109 is typically implemented as dynamic RAM (DRAM), which requires power continually in order to refresh or maintain the data in the memory. The display controller coupled with the display device 1113 may optionally include one or more GPUs to process display data. Optionally, a GPU memory may be provided to support GPUs included in the display device 1113.
The mass storage 1111 is typically a magnetic hard drive, a magneto-optical drive, an optical drive, a DVD RAM, flash memory or another type of memory system which maintains data (e.g., large amounts of data) even after power is removed from the system. Typically, the mass storage 1111 will also be a random access memory, although this is not required. While Figure 11 shows the mass storage 1111 as a local device coupled directly to the rest of the components in the data processing system, it will be appreciated that the present invention may utilize a non-volatile memory which is remote from the system, such as a network storage device coupled to the data processing system through a network interface such as a modem, an Ethernet interface or a wireless networking interface. The bus 1103 may include one or more buses connected to each other through various bridges, controllers and/or adapters, as is well known in the art.
Portions of what was described above may be implemented with logic circuitry, such as dedicated logic circuits, or with a microcontroller or another form of processing core that executes program code instructions. Thus processes taught by the discussion above may be performed with program code, such as machine-executable instructions, that cause a machine executing these instructions to perform certain functions. In this context, a "machine" may be a machine that converts intermediate form (or "abstract") instructions into processor-specific instructions (e.g., an abstract execution environment such as a "virtual machine" (e.g., a Java Virtual Machine), an interpreter, a Common Language Runtime, a high-level language virtual machine, etc.), and/or electronic circuitry disposed on a semiconductor chip (e.g., "logic circuitry" implemented with transistors) designed to execute instructions, such as a special-purpose processor and/or a general-purpose processor. Processes taught by the discussion above may also be performed by electronic circuitry designed to perform the processes (or a portion thereof), in the alternative to a machine or in combination with a machine, without the execution of program code.
An article of manufacture may be used to store program code. An article of manufacture that stores program code may be embodied as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic or otherwise)), optical disks, CD-ROMs, DVD ROMs, EPROMs, EEPROMs, magnetic or optical cards, or another type of machine-readable medium suitable for storing electronic instructions. Program code may also be downloaded from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a propagation medium (e.g., via a communication link, such as a network connection).
The preceding detailed descriptions are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the tools used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be kept in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities. Unless specifically stated otherwise, or as is apparent from the above discussion, it will be appreciated that throughout the specification, discussions utilizing terms such as "processing," "calculating," "determining," or "displaying" refer to the action and processes of a computer system, or a similar electronic computing device, that manipulates data represented as physical (electronic) quantities within the computer system's registers and memories and transforms them into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices.
The invention also relates to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk (including floppy disks, optical disks, CD-ROMs, and magneto-optical disks), read-only memories (ROMs), RAMs, EPROMs, EEPROMs, or magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the described operations. The required structure for a variety of such systems will be evident from the description. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
The foregoing discussion merely describes some exemplary embodiments of the present invention. One skilled in the art will readily recognize from such discussion, the accompanying drawings, and the claims that various modifications can be made without departing from the spirit and scope of the invention.
Claims (22)
1. A computer implemented method, comprising:
during runtime of an application in a first processing unit, in response to a second API request received from the application, loading one or more executables for a data processing task of the application, wherein the one or more executables are compatible with a second processing unit identified by a compute device identifier specified by the application in the second API request, the compute device identifier being associated with a processing unit matching one or more requirements previously specified by the application during the runtime via a first API request; and
in response to a third API request received from the application during the runtime, selecting one of the one or more executables for the second processing unit.
2. The computer implemented method of claim 1, wherein the first processing unit and the second processing unit are each a central processing unit (CPU) or a graphics processing unit (GPU).
3. The computer implemented method of claim 1, wherein a selected executable of the one or more executables is associated with the third API request.
4. The computer implemented method of claim 1, wherein the one or more executables include description data for at least one of the one or more executables, the description data including a version and a type of supported processing unit.
5. The computer implemented method of claim 4, wherein the one or more executables include a source, the source being compiled to generate the one or more executables.
6. The computer implemented method of claim 5, wherein the source is loaded from the application via the second API.
7. The computer implemented method of claim 5, wherein the source is loaded from a library associated with the one or more executables.
8. The computer implemented method of claim 5, wherein the loading comprises:
comparing the description data with information on the second processing unit; and
compiling online, from the source, one of the one or more executables for the second processing unit.
9. The computer implemented method of claim 8, wherein the one of the one or more executables is associated with the second API request.
10. The computer implemented method of claim 8, wherein the compiling is based on the comparison indicating that at least one of the one or more executables is not optimized for the second processing unit.
11. The computer implemented method of claim 8, wherein the compiling is based on the comparison indicating that at least one of the one or more executables does not support the second processing unit.
12. The computer implemented method of claim 8, wherein the compiling comprises:
generating updated description data for the one of the one or more executables, the updated description data including a version for the second processing unit; and
storing the one of the one or more executables, the one of the one or more executables including the updated description data.
13. The computer implemented method of claim 12, wherein the one of the one or more executables is stored to replace at least one of the one or more executables.
14. The computer implemented method of claim 1, wherein the one or more executables include description data for the one of the one or more executables, and wherein the selecting is based on the description data.
15. The computer implemented method of claim 14, wherein the selected one of the one or more executables is associated, based on the description data, with the most recent version among the one or more executables for the second processing unit.
16. The computer implemented method of claim 14, wherein the selected one of the one or more executables is associated with an execution order relationship indicated by the description data.
17. A computer implemented method, comprising:
generating, during runtime, a first API request by an application in a first processing unit, the first API request specifying one or more requirements of a second processing unit;
generating, during the runtime, a second API request by the application to load one or more executables for a data processing task of the application, wherein the one or more executables are compatible with the second processing unit identified by a compute device identifier specified by the application in the second API request, the compute device identifier being associated with a processing unit matching the one or more requirements previously specified by the application in the first API request; and
generating, during the runtime, a third API request by the application to select, from the one or more executables, an executable to be executed in the second processing unit.
18. The computer implemented method of claim 17, wherein the first processing unit and the second processing unit are each a central processing unit (CPU) or a graphics processing unit (GPU).
19. The computer implemented method of claim 17, wherein the second API request is associated with a source from which the one or more executables are compiled.
20. The computer implemented method of claim 19, wherein the selected executable is compiled offline from the source.
21. A data processing system, comprising:
means for, during runtime of an application in a first processing unit, in response to a second API request received from the application, loading one or more executables for a data processing task of the application, wherein the one or more executables are compatible with a second processing unit identified by a compute device identifier specified by the application in the second API request, the compute device identifier being associated with a processing unit matching one or more requirements previously specified by the application during the runtime via a first API request; and
means for, in response to a third API request received from the application during the runtime, selecting one of the one or more executables for the second processing unit.
22. A data processing system, comprising:
means for generating, during runtime, a first API request by an application in a first processing unit, the first API request specifying one or more requirements of a second processing unit;
means for generating, during the runtime, a second API request by the application to load one or more executables for a data processing task of the application, wherein the one or more executables are compatible with the second processing unit identified by a compute device identifier specified by the application in the second API request, the compute device identifier being associated with a processing unit matching the one or more requirements previously specified by the application in the first API request; and
means for generating, during the runtime, a third API request by the application to select, from the one or more executables, an executable to be executed in the second processing unit.
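Claims 1 and 17 above recite a three-step runtime flow: a first API request asks for a processing unit matching stated requirements, a second loads executables compatible with the identified compute device, and a third selects one executable for execution on that device. The sketch below illustrates that flow in Python; the `ComputeRuntime` class, its method names, and the description-data fields are hypothetical illustrations invented for this example, not the API actually claimed.

```python
class Executable:
    """An executable plus its "description data" (cf. claim 4)."""
    def __init__(self, name, device_type, version):
        self.name = name
        self.device_type = device_type  # type of supported processing unit
        self.version = version          # version recorded in the description data

class ComputeRuntime:
    def __init__(self, devices):
        # devices: compute device identifier -> device type ("CPU" / "GPU")
        self.devices = devices
        self.loaded = {}

    def request_device(self, requirements):
        """First API request: return the identifier of a processing unit
        matching the application's requirements."""
        for dev_id, dev_type in self.devices.items():
            if dev_type == requirements["type"]:
                return dev_id
        raise LookupError("no matching processing unit")

    def load_executables(self, dev_id, executables):
        """Second API request: load the executables of a data processing
        task that are compatible with the identified device."""
        dev_type = self.devices[dev_id]
        compatible = [e for e in executables if e.device_type == dev_type]
        self.loaded[dev_id] = compatible
        return len(compatible)

    def select_executable(self, dev_id):
        """Third API request: select one loaded executable, here by the
        newest version in its description data (cf. claim 15)."""
        return max(self.loaded[dev_id], key=lambda e: e.version)

runtime = ComputeRuntime({"dev0": "CPU", "dev1": "GPU"})
gpu_id = runtime.request_device({"type": "GPU"})   # first API request
n = runtime.load_executables(gpu_id, [             # second API request
    Executable("task_v1", "GPU", 1),
    Executable("task_v2", "GPU", 2),
    Executable("task_cpu", "CPU", 1),
])
chosen = runtime.select_executable(gpu_id)         # third API request
print(gpu_id, n, chosen.name)
```

The selection step here follows the version-based rule of claim 15; claim 16's execution-order rule would substitute a different key in `select_executable`.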
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US92303007P | 2007-04-11 | 2007-04-11 | |
US60/923,030 | 2007-04-11 | ||
US92562007P | 2007-04-20 | 2007-04-20 | |
US60/925,620 | 2007-04-20 | ||
US11/800,319 US8286196B2 (en) | 2007-05-03 | 2007-05-03 | Parallel runtime execution on multiple processors |
US11/800,319 | 2007-05-03 | ||
CN200880011684.8A CN101802789B (en) | 2007-04-11 | 2008-04-09 | Parallel runtime execution on multiple processors |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200880011684.8A Division CN101802789B (en) | 2007-04-11 | 2008-04-09 | Parallel runtime execution on multiple processors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103927150A CN103927150A (en) | 2014-07-16 |
CN103927150B true CN103927150B (en) | 2016-09-07 |
Family
ID=51145382
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410187203.6A Active CN103927150B (en) | 2007-04-11 | 2008-04-09 | Perform during parallel running on multiprocessor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103927150B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113542382B (en) * | 2018-03-30 | 2024-04-26 | 北京忆芯科技有限公司 | KV storage device in cloud computing and fog computing system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5301324A (en) * | 1992-11-19 | 1994-04-05 | International Business Machines Corp. | Method and apparatus for dynamic work reassignment among asymmetric, coupled processors |
CN1877490A (en) * | 2006-07-04 | 2006-12-13 | 浙江大学 | Method for saving energy by optimizing running frequency through combination of static compiler and dynamic frequency modulation techniques |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6513057B1 (en) * | 1996-10-28 | 2003-01-28 | Unisys Corporation | Heterogeneous symmetric multi-processing system |
US20070208956A1 (en) * | 2004-11-19 | 2007-09-06 | Motorola, Inc. | Energy efficient inter-processor management method and system |
JP4367337B2 (en) * | 2004-12-28 | 2009-11-18 | セイコーエプソン株式会社 | Multimedia processing system and multimedia processing method |
- 2008-04-09: CN application CN201410187203.6A filed; granted as patent CN103927150B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5301324A (en) * | 1992-11-19 | 1994-04-05 | International Business Machines Corp. | Method and apparatus for dynamic work reassignment among asymmetric, coupled processors |
CN1877490A (en) * | 2006-07-04 | 2006-12-13 | 浙江大学 | Method for saving energy by optimizing running frequency through combination of static compiler and dynamic frequency modulation techniques |
Also Published As
Publication number | Publication date |
---|---|
CN103927150A (en) | 2014-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11106504B2 (en) | Application interface on multiple processors | |
CN101802789B (en) | Parallel runtime execution on multiple processors | |
US20200250005A1 (en) | Data parallel computing on multiple processors | |
CN102099788B (en) | Application programming interfaces for data parallel computing on multiple processors | |
US20180203737A1 (en) | Data parallel computing on multiple processors | |
US8108633B2 (en) | Shared stream memory on multiple processors | |
CN103927150B (en) | Perform during parallel running on multiprocessor | |
AU2018226440B2 (en) | Data parallel computing on multiple processors | |
AU2014100505A4 (en) | Parallel runtime execution on multiple processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |