CN106164881A - Work stealing in heterogeneous computing systems - Google Patents
- Publication number
- CN106164881A CN106164881A CN201380073056.3A CN201380073056A CN106164881A CN 106164881 A CN106164881 A CN 106164881A CN 201380073056 A CN201380073056 A CN 201380073056A CN 106164881 A CN106164881 A CN 106164881A
- Authority
- CN
- China
- Prior art keywords
- work
- hardware
- queue
- computational unit
- unit
- Prior art date
- Legal status
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4204—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
- G06F13/4234—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being a memory bus
- G06F13/4239—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being a memory bus with asynchronous protocol
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
- Hardware Redundancy (AREA)
- Multi Processors (AREA)
Abstract
A work stealer apparatus includes a determination module. The determination module is to determine to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second type that is different from the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is to be stored in a shared memory that is shared by the first and second hardware compute units. A synchronized work stealer module is to steal the work through a synchronized memory access to the first work queue. The synchronized memory access is to be synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Description
Copyright notice
A portion of the material contained herein is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent document by anyone as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all rights to the copyright whatsoever.
Technical field
Embodiment described herein and relate generally to heterogeneous computing system.Specifically, embodiment described herein and typically relate to
And the work in heterogeneous computing system steals.
Background
In addition to general-purpose processors deployed as central processing units (CPUs), computers often have one or more hardware accelerator devices. One widely used example of such a hardware accelerator device is the graphics processing unit (GPU). GPUs have traditionally been used as part of the graphics subsystem that drives one or more displays. The GPU helps offload work from the CPU, which frees CPU resources for other tasks. The CPU is typically implemented as a general-purpose processor and is generally suited to executing many different types of algorithms (e.g., processing of a general nature). By contrast, the GPU is often implemented as a special-purpose processor, or at least a more specialized processor, which is typically better suited to executing a limited number of specialized algorithms (e.g., graphics-related algorithms). GPUs often have highly parallel computational hardware, which tends to allow them to process graphics data quickly.
More recently, it has been recognized that GPUs may be used for other, non-graphics applications. For example, general-purpose computing on GPUs (GPGPU) refers to using the GPU to perform general-purpose computations of the kind that have traditionally been reserved for the CPU. The highly parallel computational hardware of the GPU can often be used to significantly accelerate such computations. When used in this way, at least when implemented correctly, the GPU can help to substantially speed up compute-intensive applications.
However, one challenge with using the GPU to accelerate non-graphics applications (e.g., in the case of GPGPU) involves work scheduling. In order to utilize both the CPU and the GPU effectively, it is important to schedule and distribute work between the CPU and the GPU in an efficient way. In one possible approach, the CPU may merely offload or assign particular tasks to the GPU. The CPU may then wait, or perform other tasks, while the GPU completes the assigned tasks and provides the results. In this approach, however, there is no real cooperation between the CPU and the GPU when processing the same workload. That is, the CPU and the GPU may work on different portions of the same workload at different times; there is no truly mixed execution.
According to another approach, the CPU and the GPU may participate in mixed execution, in which they work collaboratively on different portions of a parallel workload that has been statically scheduled across the CPU and the GPU. Such mixed execution may offer the potential advantage of utilizing both the CPU and the GPU on the same workload at the same time. However, it tends to be challenging to schedule the different portions of the workload between the CPU and the GPU effectively. For example, the CPU and the GPU may not operate at the same clock rate, and/or may have different memory hierarchies, and/or may have different performance characteristics depending on the workload. Code that executes efficiently on one of the CPU and the GPU may not execute efficiently on the other. The presence of other competing workloads and/or power constraints can further complicate the ability to schedule the workload effectively a priori. A disadvantage is that, if the workload is not scheduled effectively, one of the CPU and the GPU may become overburdened while the other may be underutilized.
Brief description of the drawings
The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
Fig. 1 is a block diagram of an embodiment of a heterogeneous computing or computer system.
Fig. 2 is a block diagram of an embodiment of a synchronized work stealer system.
Fig. 3 is a block flow diagram of an embodiment of a method of stealing work in a heterogeneous computer system.
Fig. 4 is a block diagram illustrating an embodiment of bidirectional synchronized work stealing between an embodiment of a CPU and an embodiment of a GPU.
Fig. 5 is a block diagram of an environment in which an embodiment of a synchronized work stealer system may be used.
Fig. 6 is a block diagram of an example embodiment of a heterogeneous computing or computer system.
Fig. 7 is a block diagram of another example embodiment of a heterogeneous computing or computer system.
Detailed description of the invention
In the following description, numerous specific details are set forth (e.g., particular processor units, heterogeneous computer systems, sequences of operations, logic partitioning/integration details, types and interrelationships of system components, and the like). It will be understood, however, that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
Fig. 1 is a block diagram of an embodiment of a heterogeneous computing or computer system 100. In various embodiments, the heterogeneous computing system may represent a desktop computer, a laptop computer, a notebook computer, a netbook computer, a workstation, a personal digital assistant (PDA), a smartphone, a cellular phone, a mobile computing device, a server, an Internet appliance, or one of various other types of computer systems or computing systems known in the art.
The heterogeneous computing system includes at least two heterogeneous (i.e., different) types of hardware compute units. The hardware compute units are also referred to herein simply as compute units. In the illustrated embodiment, the heterogeneous computer system includes a first hardware compute unit 101-1 of a first type and a second hardware compute unit 101-2 of a second, different type. The first and second compute units may be coupled together, for example, by an interconnect 111. Other embodiments may include three or more heterogeneous compute units. Examples of suitable compute units include, but are not limited to, processors, cores, hardware threads, thread slots, hardware capable of maintaining an independent execution state, and the like.
In some embodiments, the first compute unit 101-1 may be a general-purpose compute unit (or at least relatively more general-purpose than the second compute unit 101-2), whereas the second compute unit 101-2 may not be a general-purpose compute unit and/or may be a special-purpose compute unit (or at least relatively more special-purpose than the first compute unit 101-1). In some embodiments, the first compute unit 101-1 may be one of a CPU, a general-purpose processor, and a general-purpose core, whereas the second compute unit may not be. For example, the second compute unit 101-2 may be one of a graphics processor (e.g., a GPU, a graphics co-processor, a graphics core, etc.), a hardware accelerator device (e.g., a special-purpose accelerator, a fixed-function accelerator, etc.), a cryptographic processor, a communication processor, a network processor, a special-purpose processor, a special-purpose core, highly parallel special-purpose hardware, a digital signal processor (DSP), a field programmable gate array (FPGA), and the like.
In some embodiments, the first compute unit 101-1 may be a CPU and the second compute unit may be a graphics processor (e.g., a GPU, a graphics co-processor, a graphics core, etc.), although the scope of the invention is not so limited. In such embodiments, the heterogeneous computing system may represent a GPGPU system, a CPU-GPU collaborative system, or the like. In this detailed description, a CPU and a GPU are frequently used as examples of the first and second heterogeneous compute units, but it is to be appreciated that, in alternative embodiments, other combinations of heterogeneous compute units may be used instead (e.g., a relatively more general-purpose processor together with a relatively more special-purpose processor selected from among a special-purpose processor, a special-purpose core, a hardware accelerator device, a DSP, an FPGA, etc.).
Referring again to Fig. 1, when deployed in the heterogeneous computing system, the first compute unit 101-1 and the second compute unit 101-2 are each operable to be coupled with a shared memory 102, for example through interconnects 108, 109. The shared memory is accessible to, and shared by, the first and second compute units. In some embodiments, the shared memory may represent shared virtual memory. The shared memory or shared virtual memory may represent a portion of physical memory implemented in one or more memory devices of one or more memory types. In some embodiments, the shared memory may be implemented in dynamic random access memory (DRAM), although the scope of the invention is not so limited.
In some embodiments, a first work queue 104-1 corresponding to the first compute unit 101-1 and a second work queue 104-2 corresponding to the second compute unit 101-2 may be stored in the shared memory 102. The first work queue 104-1 may be operable to receive and queue work for the first compute unit 101-1. The second work queue 104-2 may be operable to receive and queue work for the second compute unit 101-2. In the illustrated example, only two work queues are shown for simplicity, but in some embodiments there may be multiple work queues for the first compute unit (e.g., one for each of multiple cores) and/or multiple work queues for the second compute unit (e.g., optionally one for each of multiple multiprocessors or other groups of cores).
Referring again to Fig. 1, in some embodiments the first compute unit 101-1 (e.g., a CPU, general-purpose processor, general-purpose core, etc.) may include a first work scheduler module 103-1. The first work scheduler module 103-1 may be operable to schedule work on the first compute unit 101-1. In some embodiments, the second compute unit 101-2 (e.g., a GPU, graphics core, hardware accelerator, special-purpose processor, special-purpose core, etc.) may include a second work scheduler module 103-2. The second work scheduler module 103-2 may be operable to schedule work on the second compute unit 101-2. In some embodiments, the first and second work scheduler modules may be operable to schedule work on the first and second compute units so as to achieve mixed execution of, and/or cooperative/collaborative computation on, different portions of a common workload (e.g., different portions of a data-parallel workload) across the heterogeneous compute units. For example, in some embodiments, the CPU and the GPU may work together on a GPGPU workload.
As an example, a workload may be divided into chunks, units of work, or other portions. These chunks, units of work, or portions may be scheduled between the first and second compute units and queued in the corresponding first and second work queues. In some embodiments, such queuing may be implemented, for example, by encoding data-parallel operations as a series of minimum and maximum index ranges, where an individual unit of work is represented by a tuple (e.g., (min1, max1)). The overall index range of a data-parallel operation may be divided into chunks of work. In some cases, the chunk size may be chosen so that it corresponds to a multiple of the number of hardware threads, SIMD lanes, or cores of a compute unit (e.g., optionally the GPU). Moreover, it may optionally be greater than the total number of hardware threads, SIMD lanes, or cores divided by the number of barriers the compute unit supports concurrently. Although not required, this can help ensure that the chunks of work utilize the compute resources efficiently and that the initial schedule produced by the hardware thread scheduler, if there is one, is efficient. The chunks may then be distributed among the work queues, for example uniformly, or according to various algorithms/criteria. At run time, each compute unit may retrieve chunks or units of work from its corresponding work queue and process them. This may continue until the workload is complete. As an example, completion may be indicated by all of the queues being empty. In some cases, a termination token (e.g., a max-int termination token) may optionally be used. In embodiments where all workers steal in a deterministic order and no worker generates more work, a return value of EMPTY typically indicates termination. In embodiments that allow stealing in a random order, it is sufficient to let at least one worker (e.g., a CPU thread) traverse all queues in a deterministic order after it has finished its own work. If it finds all queues empty, it may write the EMPTY termination token to all queues, or at least to all work queues of the other compute unit (e.g., the GPU). This can help terminate the data-parallel workload. A sketch of such chunk partitioning is shown below.
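As an illustration of the chunking just described, the following is a minimal sketch in plain C; the chunk_t type, the helper names, and the specific sizing heuristic are assumptions for illustration only (chunk_t plays the role of the (min, max) pair_t tuple used in the code listing later in this document).

#include <stdio.h>

/* One unit of work: a [min, max) slice of the overall iteration space. */
typedef struct { int min; int max; } chunk_t;

/* Illustrative chunk-size heuristic: a multiple of the compute unit's hardware threads x SIMD lanes,
   and no smaller than that total divided by the number of concurrently supported barriers. */
static int choose_chunk_size(int hw_threads, int simd_lanes, int concurrent_barriers) {
    int size = hw_threads * simd_lanes;
    int floor_size = size / (concurrent_barriers > 0 ? concurrent_barriers : 1);
    return size > floor_size ? size : floor_size;
}

/* Split the index range [0, total) into chunks; returns how many chunks were produced. */
static int make_chunks(int total, int chunk_size, chunk_t *out, int max_chunks) {
    int n = 0;
    for (int start = 0; start < total && n < max_chunks; start += chunk_size) {
        out[n].min = start;
        out[n].max = (start + chunk_size < total) ? start + chunk_size : total;
        n++;
    }
    return n;
}

int main(void) {
    chunk_t chunks[64];
    int chunk_size = choose_chunk_size(/*hw_threads=*/4, /*simd_lanes=*/16, /*concurrent_barriers=*/2);
    int n = make_chunks(/*total=*/1000, chunk_size, chunks, 64);
    for (int i = 0; i < n; i++)
        printf("chunk %d: [%d, %d)\n", i, chunks[i].min, chunks[i].max);
    return 0;
}

In such a sketch, the resulting chunks would then be distributed across the per-unit work queues before execution begins.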
In some embodiments, work stealing may be performed, for example, to help improve load balancing, improve performance, reduce power consumption, and the like. The term "work stealing" is a term used in the art. As used herein, the term "work stealing" is used broadly to refer to reassigning work from one compute unit to another, redistributing a task to another compute unit, moving work from the queue of one compute unit to the queue of another compute unit, allowing a compute unit to claim or take responsibility for work that was previously claimed by, or the responsibility of, another compute unit, and the like.
Referring again to Fig. 1, in some embodiments the first compute unit 101-1 (e.g., a CPU, general-purpose processor, general-purpose core, etc.) may optionally include an embodiment of a first synchronized work stealer system 105-1, although this is not required. The first synchronized work stealer system 105-1 may be operable to perform synchronized work stealing from the second compute unit 101-2, so that the first compute unit 101-1 can do that work, although the scope of the invention is not so limited. In some embodiments, the first synchronized work stealer system 105-1 may have access to the first work queue 104-1 and the second work queue 104-2. The first synchronized work stealer system may be implemented in any combination of software, firmware, and hardware.
In some embodiments, the second compute unit 101-2 (e.g., a GPU, graphics core, hardware accelerator, special-purpose processor, special-purpose core, etc.) may include an embodiment of a second synchronized work stealer system 105-2. In some embodiments, the second synchronized work stealer system 105-2 may be operable to perform synchronized work stealing from the first compute unit 101-1, so that the second compute unit 101-2 can do that work. The stolen work may be taken from the first work queue 104-1 and added to the second work queue 104-2. In some embodiments, the second synchronized work stealer system 105-2 may have access to the first work queue 104-1 and the second work queue 104-2. The second synchronized work stealer system may be implemented in any combination of software, firmware, and hardware.
In some embodiments, work stealing may be based on how full the first work queue 104-1 and the second work queue 104-2 currently are. For example, in some embodiments, if the second work queue 104-2 is empty, is filled below a threshold level, or is otherwise perceived by the second synchronized work stealer system 105-2 as insufficiently full, then the second synchronized work stealer system 105-2 may steal work from the first work queue 104-1 and place the stolen work in the second work queue 104-2. As another example, in some embodiments, if the second synchronized work stealer system 105-2 perceives the first work queue 104-1 to be full, to be filled above a threshold level, or to be overly full, then the second synchronized work stealer system 105-2 may steal work from the over-filled first work queue 104-1. In other embodiments, when the first work queue 104-1 is under-filled and/or the second work queue 104-2 is over-filled, the first synchronized work stealer system 105-1 may perform a similar or reciprocal type of work stealing. Other reasons for work stealing are also contemplated. A sketch of such a fullness-based decision appears below.
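As a hedged illustration of that decision, the check might look like the following C sketch; the threshold parameter and the function names are assumptions, and the bottom/top bookkeeping mirrors the WSQueue fields shown in the code listing later in this document.

/* Number of queued work chunks, from the queue's bottom and top indices. */
static int queue_size(int bottom, int top) {
    return bottom - top;
}

/* Decide to steal when our own queue is empty and the victim's queue still holds
   work above a threshold; the threshold itself is an illustrative assumption. */
static int should_steal(int own_bottom, int own_top,
                        int victim_bottom, int victim_top, int threshold) {
    return queue_size(own_bottom, own_top) == 0 &&
           queue_size(victim_bottom, victim_top) > threshold;
}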
In some embodiments, work stealing may be bidirectional, in either direction between the first and second compute units. For example, the first synchronized work stealer system 105-1 may steal work from the second compute unit 101-2 for the first compute unit 101-1 to do (e.g., moving work from the second work queue 104-2 to the first work queue 104-1), and the second synchronized work stealer system 105-2 may steal work from the first compute unit 101-1 for the second compute unit 101-2 to do (e.g., moving work from the first work queue 104-1 to the second work queue 104-2). In some embodiments, mutually synchronized bidirectional work stealing may be performed. In some embodiments, substantially concurrent, optionally random, mutually synchronized bidirectional work stealing may optionally be used. In other embodiments, the second compute unit 101-2 may optionally steal work from the first compute unit 101-1 without the first compute unit stealing work from the second compute unit. This may not offer as great an advantage but, if desired, can help allow a simpler implementation. In such cases, the first compute unit may optionally omit the synchronized work stealer system 105-1.
Referring again to Fig. 1, in some embodiments, synchronized work stealing operations 106, 107 may optionally be performed. In some embodiments, the synchronized work stealing operations may be performed through synchronized accesses to the work queues 104-1, 104-2 and/or synchronized accesses to the shared memory 102. For example, the second compute unit 101-2 may issue a synchronization (e.g., an atomic access/operation 110) that helps achieve a synchronized work steal (e.g., the synchronized work steal 106). In some embodiments, the synchronized work stealing operations may be used to help prevent two entities from stealing the same work and/or performing the same work. They may also help prevent corruption of the shared data structures used to implement the work queues, and can help allow those data structures to transition from one valid state to another valid state. In some embodiments, the synchronized work stealing operation may be complete, in the sense that work will not be stolen from the queue without also being performed by the stealing compute unit.
In some embodiments, the synchronized work stealing operations may be performed through one or more memory access synchronization primitives and/or instructions and/or operations. In some embodiments, the dequeue and steal operations may use a set of memory access ordering primitives/instructions/operations that are supported by both the first and second compute units. In some embodiments, the memory access synchronization primitives/instructions/operations may be implemented on any hardware that supports the desired memory access synchronization primitives/instructions/operations between the first and second compute units (e.g., between the CPU and the GPU). Examples of such suitable primitives/instructions/operations include, but are not limited to, memory fence and/or barrier macroinstructions, atomic memory access macroinstructions, OpenCL atomic operations, CUDA memory access synchronization operations, and other approaches known in the art.
As an example, in an atomic memory access operation, a processor may read a memory location and write to that memory location substantially simultaneously (e.g., in the same bus cycle). Such an atomic operation can help prevent other compute units or external devices from writing to or reading from the memory location until the atomic operation has completed. Generally, an atomic memory access operation will be performed either completely or not at all. Examples of such atomic memory access primitives/instructions/operations include, but are not limited to, read-modify-write, compare-and-exchange, compare-and-swap, test-and-set, compare-and-set, load-link/store-conditional instructions, and the like, and various combinations thereof. These may be implemented through macroinstructions, OpenCL atomic operations, CUDA memory access synchronization operations, or other approaches known in the art.
In some embodiments, these may include one or more memory access fence instructions. Memory access fence instructions are also sometimes referred to in the art (e.g., in some architectures) as memory access barrier instructions. Examples of such memory access fence and/or barrier instructions include, but are not limited to, load fences/barriers (e.g., the LFENCE instruction), store fences/barriers (e.g., the SFENCE instruction), and load and store fences/barriers (e.g., the MFENCE instruction), and the like, and various combinations thereof. Such fence or barrier operations may be implemented through macroinstructions, OpenCL operations, CUDA operations, or other approaches known in the art. A brief sketch of such atomic and fence operations follows.
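The following is a minimal sketch of such primitives using C11 atomics rather than any particular macroinstruction set or the OpenCL code shown later in this document; the structure and names are illustrative assumptions, not the patent's implementation.

#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative queue header: 'top' is advanced atomically by thieves. */
typedef struct {
    atomic_int top;
    atomic_int bottom;
} queue_hdr_t;

/* Try to claim the entry at 'observed_top' by atomically bumping 'top'.
   Succeeds only if no other thief (or the owner) changed 'top' in the meantime. */
static bool try_claim_top(queue_hdr_t *q, int observed_top) {
    int expected = observed_top;
    /* Compare-and-swap; sequentially consistent ordering also acts as a full fence. */
    return atomic_compare_exchange_strong_explicit(
        &q->top, &expected, observed_top + 1,
        memory_order_seq_cst, memory_order_seq_cst);
}

/* Owner-side publish: make a newly enqueued entry visible before advancing 'bottom',
   analogous to issuing a store fence (e.g., SFENCE) before the update. */
static void publish_bottom(queue_hdr_t *q, int new_bottom) {
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&q->bottom, new_bottom, memory_order_relaxed);
}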
Advantageously, such work stealing can help allow more efficient scheduling and distribution of work between the first and second compute units, which may result in better utilization of resources and improved performance. When work stealing, and in particular bidirectional work stealing, is allowed, it is not required that the workload be statically scheduled across the heterogeneous compute units in a highly effective way. As discussed in the background section, such static a-priori workload scheduling tends to be difficult for a number of reasons (e.g., due to potentially different clock speeds, different memory hierarchies, different code execution efficiencies, the presence of other workloads, power limits, etc.). However, if for some reason the work is initially scheduled in an inefficient way and this causes one compute unit to become overburdened (or underutilized), then work stealing may be performed along with load balancing in order to help rebalance the workload, and thereby help alleviate over-utilization and/or under-utilization of the various compute units. In other embodiments, work stealing may optionally be performed for purposes other than load balancing, for example to help maximize processing speed, reduce power consumption, and the like.
Fig. 2 is a block diagram of an embodiment of a synchronized work stealer system 205. A first work queue 204-1 and a second work queue 204-2 in a shared memory 202 are also shown. In some embodiments, the synchronized work stealer system of Fig. 2 may be included in the second hardware compute unit 101-2 and/or the heterogeneous computing system 100 of Fig. 1. Alternatively, the synchronized work stealer system of Fig. 2 may be included in a similar or different compute unit or heterogeneous system. Moreover, the second hardware compute unit and heterogeneous system of Fig. 1 may include a synchronized work stealer system that is the same as, similar to, or different from that of Fig. 2.
The synchronized work stealer system 205 includes a determination module 220. The determination module is operable to determine whether to steal work 222 from a first hardware compute unit of a first type (e.g., the first compute unit 101-1) for a second hardware compute unit of a second, different type (e.g., the second compute unit 101-2). In some embodiments, the second type is more special-purpose than the first type. The work 222 may be queued in the first work queue 204-1 corresponding to the first hardware compute unit. The first work queue is in the shared memory 202, which is shared by the first and second hardware compute units. As shown, in some embodiments, the determination module may be coupled with, or otherwise in communication with, the first work queue 204-1 and/or the second work queue 204-2. In some embodiments, the determination module may determine whether to steal the work based on the fullness of one or more of the work queues. This may be done as described elsewhere herein.
The synchronized work stealer system 205 also includes a synchronized work stealer module 221. The synchronized work stealer module is operable to steal the work 222 from the first hardware compute unit and/or the first work queue 204-1, to provide it as stolen work 223 for the second hardware compute unit, and/or to add it to the second work queue 204-2. In some embodiments, the synchronized work stealer module is operable to steal the work through a synchronized memory access 210 to the first work queue 204-1. In some embodiments, the synchronized memory access 210 may be synchronized relative to memory accesses to the first work queue from the first hardware compute unit. A sketch combining these two modules is given below.
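A minimal software sketch of how the determination module 220 and the synchronized work stealer module 221 might fit together is given below; the StealQueue type, the threshold, and the function names are assumptions for illustration only and are not the patent's implementation.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { int min; int max; } work_t;

typedef struct {
    atomic_int top;       /* oldest unstolen entry */
    atomic_int bottom;    /* next free slot        */
    work_t    *entries;   /* backing array (assumed, for illustration) */
} StealQueue;

/* Determination module: steal only when our own queue is empty and the victim's is not. */
static bool decide_steal(StealQueue *own, StealQueue *victim, int threshold) {
    int own_size    = atomic_load(&own->bottom)    - atomic_load(&own->top);
    int victim_size = atomic_load(&victim->bottom) - atomic_load(&victim->top);
    return own_size == 0 && victim_size > threshold;
}

/* Synchronized work stealer module: claim the victim's top entry with a compare-and-swap. */
static bool steal_one(StealQueue *victim, work_t *out) {
    int top    = atomic_load(&victim->top);
    int bottom = atomic_load(&victim->bottom);
    if (bottom - top <= 0)
        return false;                       /* nothing to steal */
    work_t w = victim->entries[top];        /* read the entry before claiming it */
    int expected = top;
    if (!atomic_compare_exchange_strong(&victim->top, &expected, top + 1))
        return false;                       /* another thief or the owner won the race */
    *out = w;
    return true;
}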
Fig. 3 is a block flow diagram of an embodiment of a method 335 of stealing work in a heterogeneous computer system. In some embodiments, the operations and/or method of Fig. 3 may be performed by, and/or within, the synchronized work stealer system 105-2 of Fig. 1 and/or the synchronized work stealer system 205 of Fig. 2. The components, features, and specific optional details described herein for those systems also optionally apply to the operations and/or methods that may, in embodiments, be performed by and/or within those systems. Alternatively, the operations and/or method of Fig. 3 may be performed by and/or within a similar or different synchronized work stealer system. Moreover, the synchronized work stealer systems of Fig. 1 and/or Fig. 2 may perform operations and/or methods that are the same as, similar to, or different from those of Fig. 3.
The method includes, at block 336, determining to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second type that is more special-purpose than the first type. In some embodiments, the work may be queued in a first work queue corresponding to the first hardware compute unit. In some embodiments, the first work queue may be stored in a shared memory that is shared by the first and second hardware compute units.
The method also includes, at block 337, stealing the work. In some embodiments, stealing the work may include performing a synchronized memory access to the first work queue. In some embodiments, the synchronized memory access may be synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Fig. 4 is a block diagram illustrating an embodiment of bidirectional synchronized work stealing 406, 407 between an embodiment of a CPU 401-1 and an embodiment of a GPU 401-2. The illustrated example of the CPU has four cores, namely CPU core 0, CPU core 1, CPU core 2, and CPU core 3. Other CPUs may have fewer or more cores. In the illustrated embodiment, each of these cores has a corresponding, different work queue in a first set of work queues 404-1 for the CPU. Specifically, in the illustrated example, work queue ws_q[0] corresponds to CPU core 0, work queue ws_q[1] corresponds to CPU core 1, work queue ws_q[2] corresponds to CPU core 2, and work queue ws_q[3] corresponds to CPU core 3. In other embodiments, a many-to-one or one-to-many correspondence between cores and work queues may be used.
The illustrated example of the GPU has four work-groups, namely work-group 0, work-group 1, work-group 2, and work-group 3. These work-groups may also be referred to as streaming units or multiprocessors. In the illustrated example, each work-group has 16 SIMD lanes, although other embodiments may have fewer or more SIMD lanes. Moreover, other GPUs may have fewer or more work-groups. For example, some GPUs have on the order of dozens of multiprocessors or work-groups. Each of these work-groups includes one or more cores, and potentially many cores. For example, each multiprocessor or work-group of some GPUs has on the order of dozens of cores. In the illustrated embodiment, the work-groups all share a second work queue 404-2 corresponding to the GPU. In other embodiments, multiple work queues may each be used for one or more work-groups. The work queues 404-1, 404-2 may be in the shared memory.
The illustrated example shows bidirectional synchronized work stealing operations 406, 407. While CPU core 0, CPU core 1, and CPU core 3 each work from their own corresponding work queues ws_q[0], ws_q[1], and ws_q[3], CPU core 2 performs a synchronized work stealing operation 407 from the GPU queue 404-2. Similarly, work-group 0 (also referred to as streaming unit 0) performs a synchronized work stealing operation 406 from the work queue ws_q[3] corresponding to CPU core 3. This is one illustrative example. The work stealing operations 406, 407 may be similar to, or the same as, other work stealing operations described herein. The queue indexing that this layout implies is sketched below.
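A minimal sketch of the queue layout that Fig. 4 and the code listing below imply (one steal queue per CPU core, plus one queue at index num_cpu_threads shared by all GPU work-groups); the initialization shown here is an illustrative assumption.

/* One work stealing queue per CPU core, plus one queue shared by the GPU work-groups. */
enum { NUM_CPU_THREADS = 4 };

typedef struct {
    volatile int bottom;
    volatile int top;
    /* ... backing array of (min, max) chunks ... */
} ws_queue_t;

static ws_queue_t ws_q[NUM_CPU_THREADS + 1];    /* ws_q[NUM_CPU_THREADS] is the GPU queue */

static ws_queue_t *cpu_queue(int core) { return &ws_q[core]; }          /* ws_q[0] .. ws_q[3] */
static ws_queue_t *gpu_queue(void)     { return &ws_q[NUM_CPU_THREADS]; }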
Fig. 5 is a block diagram of an environment in which an embodiment of a synchronized work stealer system 505 may be used. As shown, in some embodiments, the synchronized work stealer system may be implemented on top of a conventional runtime, such as, for example, a conventional OpenCL runtime 542. In some embodiments, the synchronized work stealer system may wrap a workload, for example a conventional OpenCL workload kernel 540, with additional work stealing scheduler code. In some embodiments, the OpenCL runtime, or another runtime (e.g., a CUDA runtime), may need to be substantially modified.
Fig. 6 is a block diagram of an example embodiment of a heterogeneous computing or computer system 600. The system includes a CPU 601-1 and a card 652 having a GPU 601-2. The CPU includes any desired number of cores 650. The GPU includes any desired number of cores 651. The card also has a memory 653 coupled with the GPU. In some embodiments, the memory may include DRAM, although this is not required. The CPU is coupled with the GPU, and with the memory 653, by an interconnect 608. Any known interconnect is suitable, such as, for example, Peripheral Component Interconnect and derivatives or extensions thereof.
The memory 654 includes a shared memory 602. The shared memory includes a CPU work queue 604-1 and a GPU work queue 604-2. The CPU has a first synchronized work stealer system 605-1. The GPU has a second synchronized work stealer system 605-2. In some embodiments, either or both of these synchronized work stealer systems may be similar to, or the same as, those described elsewhere herein. For example, the second synchronized work stealer system 605-2 may be similar to, or the same as, the second synchronized work stealer system 105-2 of Fig. 1 and/or the synchronized work stealer system 205 of Fig. 2.
Fig. 7 is a block diagram of another example embodiment of a heterogeneous computing or computer system 700. The system includes a chip and/or die 761 coupled with a memory 754 by an interconnect. In some embodiments, the memory 754 may include DRAM, although this is not required. Any known interconnect is suitable, such as, for example, Peripheral Component Interconnect and derivatives or extensions thereof. The chip/die includes a CPU 701-1 and integrated graphics 701-2. The CPU includes any desired number of cores 750. The integrated graphics includes any desired number of cores 751. The CPU and the integrated graphics are coupled with an on-chip interconnect 760. Any on-chip interconnect mechanism known in the art is suitable. An integrated memory controller 762 is also coupled with the on-chip interconnect. The memory controller couples the chip/die with the memory 754 through an interconnect 708.
The memory 754 includes a shared memory 702. The shared memory includes a CPU work queue 704-1 and an integrated graphics work queue 704-2. The CPU has a first synchronized work stealer system 705-1. The integrated graphics has a second synchronized work stealer system 705-2. In some embodiments, either or both of these synchronized work stealer systems may be similar to, or the same as, those described elsewhere herein. For example, the second synchronized work stealer system 705-2 may be similar to, or the same as, the second synchronized work stealer system 105-2 of Fig. 1 and/or the synchronized work stealer system 205 of Fig. 2.
The following code illustrates the use of an example embodiment of a suitable synchronized work stealer system:
/* Notes:
1. The global work stealing queue data structure ws_q holds all of the work stealing queues of the CPU cores and the GPU cores. These data structures are allocated in the shared memory between the CPU and the GPU.
2. The work stealing queues ws_q[0], ..., ws_q[num_cpu_threads-1] represent the queues of the CPU cores. ws_q[num_cpu_threads] represents the GPU work stealing queue. num_cpu_threads represents the total number of CPU cores in the heterogeneous system. All of these work stealing queues are allocated in the shared memory between the CPU and the GPU. Each work stealing queue is described by the "WSQueue" data structure below.
3. Each GPU compute kernel is defined by the subroutine "actual_kernel", which takes as an argument the index in the iteration space on which it operates. The details of this subroutine are specific to the user application.
4. The subroutine "kernel_wrapper" described below is a software-based wrapper, which is generated by the compiler that compiles the GPU kernel, by the application itself, or by the OpenCL runtime.
5. The subroutine "steal_wrapper" described below looks for work across the work stealing queues of the CPU and GPU cores. If it finds any work, it performs a steal operation on the corresponding work stealing queue. If nothing is found in any of the queues, it returns EMPTY.
6. The subroutine "steal" performs the actual steal from a given work stealing queue. When it steals, it first makes sure that there is at least one piece of work in the queue, and then atomically updates the top of the queue to ensure that concurrent steals are performed correctly.
7. The subroutine "cas" performs a compare-and-set operation on the shared memory between the CPU and the GPU.
*/
/* Each compute kernel, such as "actual_kernel", is wrapped with this subroutine in order to perform
   work stealing between the CPU and GPU work stealing queues; this wrapper code is generated by the
   user/compiler or by the OpenCL runtime. */
__kernel void kernel_wrapper (
    __global WSQueue *ws_q,   /* the work stealing queue data structures for the CPU and the GPU, allocated in shared memory */
    int num_cpu_threads       /* the number of CPU cores */
) {
    int local_id = get_local_id(0);   // the local_id of this work-item within the work-group
    __local int work_idx;             // index into the overall iteration space; shared by the work-items of the work-group
    if (local_id == 0) {
        work_idx = steal_wrapper(ws_q, num_cpu_threads);   // one work-item finds work (locally or by stealing) for the whole work-group
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // local memory barrier; this ensures that work_idx is visible to all work-items in the work-group
    if (work_idx == EMPTY) return;    // if we did not find work locally or through stealing, return
    /* call the actual OpenCL kernel */
    actual_kernel(work_idx + local_id);
}
/* This subroutine traverses all of the work stealing queues in order and tries to find work.
   If nothing is found in any of the queues, it returns EMPTY, which indicates termination. */
int steal_wrapper ( __global WSQueue *ws_q, int num_cpu_threads ) {
    /* ws_q[num_cpu_threads] represents the GPU work stealing queue, so we traverse in reverse
       order to find work first from the GPU queue and then from the CPU queues. */
    for (int i = num_cpu_threads; i >= 0; i--) {
        __global WSQueue *each_ws_q = &ws_q[i];
        int work_idx = steal(each_ws_q);     // perform the actual steal operation on this work stealing queue
        if (work_idx >= 0) return work_idx;  // if work was found, return the work index
    }
    return EMPTY;   // if no work was found in any of the queues, return EMPTY
}
/* This subroutine performs the actual steal operation that steals work from the given queue. */
int steal ( __global WSQueue *a_ws_q ) {
    int top = a_ws_q->top;          // points to the top entry of the work stealing queue
    int bottom = a_ws_q->bottom;    // points to the bottom entry of the work stealing queue
    __global CircularArray *array = a_ws_q->activeArray;
    // find the array that stores the (min, max) pairs
    int size = bottom - top;        // find the number of entries in the work stealing queue
    if (size <= 0) {                // no work; return EMPTY
        return EMPTY;
    }
    int o = (array->segment[top % (1 << array->log_size)]).min;
    // find the index of the first chunk at the top
    /* Since we have found work and are ready to steal it, we use compare-and-set (cas) on the
       top of the queue to perform an atomic steal. */
    if (!cas(a_ws_q, top, top + 1)) {   // compare-and-set on the top of the queue
        return ABORT;                   // if we did not succeed in the atomic operation, ABORT
    }
    return o;   // return the minimum index from the (min, max) tuple, which indicates the actual work index
}
/* This subroutine shows an implementation of the compare-and-set (cas) operation on the shared memory between the CPU and the GPU. */
bool cas ( __global WSQueue *a_ws_q, int oldVal, int newVal ) {
    int result;
    result = atomic_cmpxchg((volatile __global int *) &(a_ws_q->top), oldVal, newVal);
    // perform an atomic compare-and-exchange operation on the shared memory
    return (result == oldVal);
}
/* List of data structures used to implement the work stealing queues */

typedef struct pair_s {
    int min;    // the minimum index from which to start working, if stolen
    int max;    // the maximum index up to which to work, if stolen
} pair_t;

typedef struct CircularArray_s {
    int log_size;       // log of the size of the circular array used to hold the work chunks
    pair_t *segment;    // the array of work chunks
} CircularArray;

/* Data structure for each individual work stealing queue */
typedef struct WS_Q_s {
    volatile int bottom;            // the bottom of the work stealing queue
    volatile int top;               // the top of the work stealing queue
    CircularArray *activeArray;     // the underlying array that holds all of the work chunks
} WSQueue;

enum WSQ_Tag {
    EMPTY = -2,   // indicates that the queue is EMPTY, and therefore termination
    ABORT = -1,   // indicates that the steal operation did not succeed due to contention; the caller may continue
};
The components, features, and details described for any of Fig. 1 and Figs. 4-7 may also optionally be used in any of Figs. 2-3. Moreover, the components, features, and details described herein for any of the apparatus may also optionally be used in any of the methods described herein, which, in embodiments, may be performed by and/or with such apparatus.
Example embodiment
The following examples pertain to further embodiments. Specifics in the examples may be used anywhere in one or more embodiments.
Example 1 is a work stealer apparatus. The work stealer includes a determination module to determine to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second type that is different from the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is stored in a shared memory that is shared by the first and second hardware compute units. The work stealer apparatus also includes a synchronized work stealer module to steal the work through a synchronized memory access to the first work queue, the synchronized memory access to be synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 2 includes the subject matter of Example 1, and optionally, in which the synchronized work stealer module is to add the work to a second work queue. The second work queue corresponds to the second hardware compute unit and is stored in the shared memory. The second type is optionally more special-purpose than the first type.
Example 3 includes the subject matter of Example 1, and optionally, in which the synchronized work stealer module is to steal the work through said synchronized memory access, which includes an atomic operation performed on the first work queue.
Example 4 includes the subject matter of Example 3, and optionally, in which the atomic operation includes one of a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-link/store-conditional operation.
Example 5 includes the subject matter of Example 3, and optionally, in which the atomic operation includes an Open Computing Language (OpenCL) atomic operation.
Example 6 includes the subject matter of Example 3, and optionally, in which the synchronized work stealer module is to steal the work through a CUDA synchronized memory access.
Example 7 includes the subject matter of Example 1, and optionally, further including a second determination module to determine to steal second work from the second hardware compute unit for the first hardware compute unit. The second work is queued in a second work queue, which corresponds to the second hardware compute unit and is stored in the shared memory.
Example 8 includes the subject matter of Example 7, and optionally, further including a second synchronized work stealer module to steal the second work through a second synchronized memory access to the second work queue from the first hardware compute unit. The second synchronized memory access is to be synchronized relative to memory accesses to the second work queue from the second hardware compute unit.
Example 9 includes the subject matter of any of Examples 1-8, and optionally, in which the determination module is to determine to steal the work when a second work queue is one of empty and filled below a threshold level, the second work queue corresponding to the second hardware compute unit and stored in the shared memory.
Example 10 includes the subject matter of any of Examples 1-8, and optionally, in which the first hardware compute unit includes one selected from a general-purpose processor and a central processing unit (CPU), and the second hardware compute unit includes one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communication processor, a network processor, a special-purpose processor, a special-purpose core, highly parallel special-purpose hardware, a digital signal processor (DSP), and a field programmable gate array (FPGA).
Example 11 includes the subject matter of Example 10, and optionally, in which the first hardware compute unit includes said CPU, and in which the second hardware compute unit includes said graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
Example 12 includes the subject matter of Example 11, and optionally, in which the graphics processor includes the integrated graphics core, and in which the integrated graphics core and the CPU share a same last-level cache.
Example 13 is a method in a heterogeneous computing system. The method includes determining to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second, different type that is more special-purpose than the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is stored in a shared memory that is shared by the first and second hardware compute units. The method also includes stealing the work, including performing a synchronized memory access to the first work queue stored in the shared memory, the synchronized memory access synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 14 includes the subject matter of Example 13, and optionally, further including adding the work to a second work queue, which corresponds to the second hardware compute unit and which is also stored in the shared memory.
Example 15 includes the subject matter of Example 13, and optionally, in which performing the synchronized memory access includes performing an atomic operation.
Example 16 includes the subject matter of Example 15, and optionally, in which performing the atomic operation includes performing an atomic operation selected from a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-link/store-conditional operation.
Example 17 includes the subject matter of Example 15, and optionally, in which performing the atomic operation includes performing an Open Computing Language (OpenCL) atomic operation.
Example 18 includes the subject matter of Example 15, and optionally, in which the stealing includes stealing the work by performing a CUDA synchronized memory access.
Example 19 includes the subject matter of Example 13, and optionally, further including determining to steal second work, and stealing the second work from the second hardware compute unit for the first hardware compute unit. The second work is queued in a second work queue, which corresponds to the second hardware compute unit and is stored in the shared memory.
Example 20 includes the subject matter of Example 19, and optionally, in which stealing the second work includes performing a synchronized memory access to the second work queue from the first hardware compute unit, that synchronized memory access synchronized relative to memory accesses to the second work queue from the second hardware compute unit.
Example 21 includes the subject matter of Example 13, and optionally, in which the determining includes determining to steal the work in response to the second work queue being one of empty and filled below a threshold level, the second work queue corresponding to the second hardware compute unit and stored in the shared memory.
Example 22 includes the subject matter of Example 13, and optionally, in which the first hardware compute unit is one selected from a general-purpose processor, a central processing unit (CPU), and a system on a chip having a plurality of general-purpose cores, and in which the second hardware compute unit is one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communication processor, a network processor, a special-purpose processor, a special-purpose core, a special-purpose core of a system on a chip, highly parallel special-purpose hardware, a digital signal processor (DSP), and a field programmable gate array (FPGA).
Example 23 includes the subject matter of Example 22, and optionally, in which the first hardware compute unit includes said CPU, and in which the second hardware compute unit includes said graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
Example 24 includes the subject matter of Example 22, and optionally, in which the graphics processor includes the integrated graphics core, and in which the integrated graphics core and a core of the CPU share a same last-level cache.
Example 25 is a heterogeneous computer system. The heterogeneous computer system includes an interconnect. The heterogeneous computer system also includes a first hardware compute unit of a first type coupled with the interconnect, and a second hardware compute unit of a second, different type coupled with the interconnect. The second type is more special-purpose than the first type. The heterogeneous computer system also includes a dynamic random access memory (DRAM) coupled with the interconnect, the DRAM including a shared memory that is shared by the first and second hardware compute units. The shared memory includes a first work queue to queue work for the first hardware compute unit and a second work queue to queue work for the second hardware compute unit. The heterogeneous computer system also includes a work stealer apparatus to determine to steal work from the first queue, and to steal it through a synchronized memory access to the first work queue and add it to the second queue. The synchronized memory access is synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 26 includes the subject matter of Example 25, and optionally, in which the work stealer apparatus is to steal the work through a synchronized memory access that includes an atomic operation on the first work queue.
Example 27 includes the subject matter of any of Examples 25-26, and optionally, further including a second work stealer apparatus to determine to steal second work from the second work queue, and to steal it through a second synchronized memory access to the second work queue and add it to the first work queue, the second synchronized memory access synchronized relative to memory accesses to the second work queue from the second hardware compute unit.
Example 28 is one or more computer-readable storage media storing instructions that, if executed by a machine, will cause the machine to perform operations. The operations include an operation to determine to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second, different type that is more special-purpose than the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is stored in a shared memory that is shared by the first and second hardware compute units. The operations also include an operation to steal the work by performing a synchronized memory access to the first work queue, the synchronized memory access synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 29 includes the subject matter of Example 28, and optionally, in which the machine-readable storage media further provide instructions that, if executed by the machine, will cause the machine to perform operations including adding the work to a second work queue, the second work queue corresponding to the second hardware compute unit and stored in the shared memory.
Example 30 includes the subject matter of any of Examples 28-29, and optionally, in which the machine-readable storage media further provide instructions that, if executed by the machine, will cause the machine to perform operations including stealing the work through a synchronized memory access that includes an atomic operation performed on the first queue.
Example 31 is a machine-readable storage medium storing instructions that, if executed by a machine, cause the machine to perform the method of any of Examples 13-24.
Example 32 is an apparatus including means for performing the method of any of Examples 13-24.
Example 33 is an apparatus to perform the method of any of Examples 13-24.
Example 34 is an apparatus to perform a method substantially as described herein.
Example 35 is an apparatus including means for performing a method substantially as described herein.
In the description and claims, the terms "coupled" and "connected", along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. For example, a processor may be coupled with a device through one or more intervening components (e.g., one or more interconnects and/or chipsets). In the figures, arrows are used to show connections and couplings.
In the description and claims, the term "logic" may have been used. As used herein, logic may include a module such as hardware, firmware, software, or a combination thereof. Examples of logic include integrated circuitry, application specific integrated circuits, analog circuits, digital circuits, programmed logic devices, memory devices including instructions, and the like.
Term "and/or" can be used for.Term as used herein "and/or" mean one or the other or two
Person (such as A and/or B means A or B, or A and B).
In the description above, for purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art, that one or more other embodiments may be practiced without some of these specific details. The particular embodiments described are not provided to limit the invention but to illustrate it through example embodiments. The scope of the invention is not to be determined by the specific examples but only by the claims. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form or without detail in order to avoid obscuring the understanding of the description.
Where considered appropriate, reference numerals, or terminal portions of reference numerals, have been repeated among the figures to indicate corresponding or analogous elements, which may optionally have similar or the same characteristics, unless specified or otherwise clearly apparent. In some cases, where multiple components have been described, they may be incorporated into a single component. In other cases, where a single component has been described, it may be partitioned into multiple components.
Various operations and methods have been described. Some of the methods have been described in a relatively basic form in the flow diagrams, but operations may optionally be added to and/or removed from the methods. In addition, while the flow diagrams show a particular order of the operations according to example embodiments, that particular order is exemplary. Alternate embodiments may optionally perform the operations in different order, combine certain operations, overlap certain operations, and so on.
Some embodiments include an article of manufacture (e.g., a computer program product) that includes a machine-readable medium. The medium may include a mechanism that provides, for example stores, information in a form that is readable by the machine. The machine-readable medium may provide, or have stored thereon, one or more instructions that, if and/or when executed by a machine, are operable to cause the machine to perform, and/or result in the machine performing, one or more of the operations, methods, or techniques disclosed herein.
In some embodiments, the machine-readable medium may include a tangible and/or non-transitory machine-readable storage medium. For example, the tangible and/or non-transitory machine-readable storage medium may include a floppy diskette, an optical storage medium, an optical disk, an optical data storage device, a CD-ROM, a magnetic disk, a magneto-optical disk, a read only memory (ROM), a programmable ROM (PROM), an erasable-and-programmable ROM (EPROM), an electrically-erasable-and-programmable ROM (EEPROM), a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a Flash memory, a phase-change memory, a phase-change data storage material, a non-volatile memory, a non-volatile data storage device, a non-transitory memory, a non-transitory data storage device, or the like. A non-transitory machine-readable storage medium does not consist of a transitory propagated signal. In another embodiment, the machine-readable medium may include a transitory machine-readable communication medium, for example an electrical, optical, acoustical, or other form of propagated signal, such as a carrier wave, an infrared signal, a digital signal, or the like.
Examples of suitable machines include, but are not limited to, desktop computers, laptop computers, notebook computers, tablet computers, netbooks, smartphones, cellular phones, servers, network devices (e.g., routers and switches), Mobile Internet Devices (MIDs), media players, smart televisions, nettops, set-top boxes, and video game consoles, and other computer systems, computing devices, or electronic devices having one or more processors.
It should also be appreciated that references throughout this specification to "one embodiment", "an embodiment", or "one or more embodiments", for example, mean that a particular feature may be included in the practice of the invention. Similarly, it should be appreciated that in the description various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects may lie in less than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.
Claims (25)
1. A work stealer apparatus, comprising:
a determination module to determine to steal work from a first hardware computational unit of a first type for a second hardware computational unit of a second type that is different than the first type, wherein the work is queued in a first work queue, the first work queue corresponding to the first hardware computational unit and stored in a shared memory that is shared by the first and second hardware computational units; and
a synchronized work stealer module to steal the work through a synchronized memory access to the first work queue, the synchronized memory access being synchronized relative to memory accesses to the first work queue from the first hardware computational unit.
2. The apparatus of claim 1, wherein the synchronized work stealer module is to add the work to a second work queue, the second work queue corresponding to the second hardware computational unit and stored in the shared memory, and wherein the second type is more specialized than the first type.
3. The apparatus of claim 1, wherein the synchronized work stealer module is to steal the work through the synchronized memory access comprising an atomic operation performed on the first work queue.
4. The apparatus of claim 3, wherein the atomic operation comprises one selected from a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-link/store-conditional operation.
5. The apparatus of claim 3, wherein the atomic operation comprises an Open Computing Language (OpenCL) atomic operation.
6. The apparatus of claim 3, wherein the synchronized work stealer module is to steal the work through a CUDA synchronized memory access.
7. The apparatus of claim 1, further comprising a second determination module to determine to steal a second work from the second hardware computational unit for the first hardware computational unit, wherein the second work is queued in a second work queue, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
8. The apparatus of claim 7, further comprising a second synchronized work stealer module to steal the second work through a second synchronized memory access to the second work queue from the first hardware computational unit, the second synchronized memory access being synchronized relative to memory accesses to the second work queue from the second hardware computational unit.
9. The apparatus of any of claims 1-8, wherein the determination module is to determine to steal the work when a second work queue is one of empty and filled below a threshold level, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
10. The apparatus of any of claims 1-8, wherein the first hardware computational unit comprises one selected from a general-purpose processor and a central processing unit (CPU), and wherein the second hardware computational unit comprises one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communications processor, a network processor, a special-purpose processor, a special-purpose core, a highly-parallel special-purpose hardware, a digital signal processor (DSP), and a field programmable gate array (FPGA).
11. The apparatus of claim 10, wherein the first hardware computational unit comprises the CPU, and wherein the second hardware computational unit comprises the graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
12. A method in a heterogeneous computing system, comprising:
determining to steal work from a first hardware computational unit of a first type for a second hardware computational unit of a second, different type that is more specialized than the first type, wherein the work is queued in a first work queue, the first work queue corresponding to the first hardware computational unit and stored in a shared memory that is shared by the first and second hardware computational units; and
stealing the work, including performing a synchronized memory access to the first work queue stored in the shared memory, the synchronized memory access being synchronized relative to memory accesses to the first work queue from the first hardware computational unit.
13. The method of claim 12, further comprising adding the work to a second work queue, the second work queue corresponding to the second hardware computational unit and also stored in the shared memory.
14. The method of claim 12, wherein performing the synchronized memory access comprises performing an atomic operation.
15. The method of claim 14, wherein performing the atomic operation comprises performing an atomic operation selected from a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-link/store-conditional operation.
16. The method of claim 14, wherein performing the atomic operation comprises performing an Open Computing Language (OpenCL) atomic operation.
17. The method of claim 12, further comprising determining to steal a second work, and stealing the second work from the second hardware computational unit for the first hardware computational unit, wherein the second work is queued in a second work queue, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
18. The method of claim 13, wherein determining comprises determining to steal the work in response to the second work queue being one of empty and filled below a threshold level, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
19. The method of claim 13, wherein the first hardware computational unit is one selected from a general-purpose processor, a central processing unit (CPU), and a system-on-chip (SoC) having a plurality of general-purpose cores, and wherein the second hardware computational unit is one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communications processor, a network processor, a special-purpose processor, a special-purpose core, a special-purpose core in a system-on-chip, a highly-parallel special-purpose hardware, a digital signal processor (DSP), and a field programmable gate array (FPGA).
20. The method of claim 19, wherein the first hardware computational unit comprises the CPU, and wherein the second hardware computational unit comprises the graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
21. A heterogeneous computer system, comprising:
an interconnect;
a first hardware computational unit of a first type coupled with the interconnect;
a second hardware computational unit of a second, different type coupled with the interconnect, the second type being more specialized than the first type;
a dynamic random access memory (DRAM) coupled with the interconnect, the DRAM comprising a shared memory shared by the first and second hardware computational units, the shared memory comprising a first work queue to queue work of the first hardware computational unit and a second work queue to queue work of the second hardware computational unit; and
a work stealer apparatus to determine to steal work, and to steal the work from the first queue and add it to the second queue through a synchronized memory access to the first work queue, the synchronized memory access being synchronized relative to memory accesses to the first work queue from the first hardware computational unit.
22. The system of claim 21, wherein the work stealer apparatus is to steal the work through the synchronized memory access comprising an atomic operation on the first work queue.
23. The system of any of claims 21-22, further comprising a second work stealer apparatus to determine to steal a second work, and to steal the second work from the second work queue and add it to the first work queue through a second synchronized memory access to the second work queue, the second synchronized memory access being synchronized relative to memory accesses to the second work queue from the second hardware computational unit.
24. A machine-readable storage medium storing instructions that, if executed by a machine, cause the machine to perform the method of any of claims 12-20.
25. An apparatus comprising means for performing the method of any of claims 12-20.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710028738.2A CN107092573B (en) | 2013-03-15 | 2013-03-15 | Method and apparatus for work stealing in heterogeneous computing systems |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2013/032707 WO2014143067A1 (en) | 2013-03-15 | 2013-03-15 | Work stealing in heterogeneous computing systems |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710028738.2A Division CN107092573B (en) | 2013-03-15 | 2013-03-15 | Method and apparatus for work stealing in heterogeneous computing systems |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106164881A true CN106164881A (en) | 2016-11-23 |
CN106164881B CN106164881B (en) | 2022-01-14 |
Family
ID=51537399
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710028738.2A Active CN107092573B (en) | 2013-03-15 | 2013-03-15 | Method and apparatus for work stealing in heterogeneous computing systems |
CN201380073056.3A Active CN106164881B (en) | 2013-03-15 | 2013-03-15 | Work stealing in heterogeneous computing systems |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710028738.2A Active CN107092573B (en) | 2013-03-15 | 2013-03-15 | Method and apparatus for work stealing in heterogeneous computing systems |
Country Status (4)
Country | Link |
---|---|
US (3) | US20160154677A1 (en) |
EP (2) | EP3242210B1 (en) |
CN (2) | CN107092573B (en) |
WO (1) | WO2014143067A1 (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3242210B1 (en) | 2013-03-15 | 2022-08-03 | INTEL Corporation | Work stealing in heterogeneous computing systems |
US9811467B2 (en) * | 2014-02-03 | 2017-11-07 | Cavium, Inc. | Method and an apparatus for pre-fetching and processing work for procesor cores in a network processor |
US9753785B2 (en) * | 2014-07-24 | 2017-09-05 | Home Box Office, Inc. | Asynchronous dispatcher for application framework |
US9804883B2 (en) * | 2014-11-14 | 2017-10-31 | Advanced Micro Devices, Inc. | Remote scoped synchronization for work stealing and sharing |
US9678806B2 (en) * | 2015-06-26 | 2017-06-13 | Advanced Micro Devices, Inc. | Method and apparatus for distributing processing core workloads among processing cores |
US10089155B2 (en) * | 2015-09-22 | 2018-10-02 | Advanced Micro Devices, Inc. | Power aware work stealing |
US20170083365A1 (en) * | 2015-09-23 | 2017-03-23 | Qualcomm Incorporated | Adaptive Chunk Size Tuning for Data Parallel Processing on Multi-core Architecture |
US9772878B2 (en) | 2015-11-09 | 2017-09-26 | Unity IPR ApS | Determining a job group status based on a relationship between a generation counter value and a ticket value for scheduling the job group for execution |
US9892544B2 (en) * | 2015-12-22 | 2018-02-13 | Intel Corporation | Method and apparatus for load balancing in a ray tracing architecture |
US10114681B2 (en) * | 2016-03-30 | 2018-10-30 | Qualcomm Incorporated | Identifying enhanced synchronization operation outcomes to improve runtime operations |
JP6645348B2 (en) * | 2016-05-06 | 2020-02-14 | 富士通株式会社 | Information processing apparatus, information processing program, and information processing method |
US10437616B2 (en) * | 2016-12-31 | 2019-10-08 | Intel Corporation | Method, apparatus, system for optimized work submission to an accelerator work queue |
US10444817B2 (en) * | 2017-04-17 | 2019-10-15 | Intel Corporation | System, apparatus and method for increasing performance in a processor during a voltage ramp |
US11237872B2 (en) * | 2017-05-23 | 2022-02-01 | Kla-Tencor Corporation | Semiconductor inspection and metrology systems for distributing job among the CPUs or GPUs based on logical image processing boundaries |
US10686728B2 (en) * | 2017-07-06 | 2020-06-16 | Huawei Technologies Co., Ltd. | Systems and methods for allocating computing resources in distributed computing |
KR102482896B1 (en) | 2017-12-28 | 2022-12-30 | 삼성전자주식회사 | Memory device including heterogeneous volatile memory chips and electronic device including the same |
US10430246B2 (en) | 2018-01-18 | 2019-10-01 | International Business Machines Corporation | Virtualized and synchronous access to hardware accelerators |
US10705849B2 (en) * | 2018-02-05 | 2020-07-07 | The Regents Of The University Of Michigan | Mode-selectable processor for execution of a single thread in a first mode and plural borrowed threads in a second mode |
CN108920260B (en) * | 2018-05-16 | 2021-11-26 | 成都淞幸科技有限责任公司 | Interaction method and device for heterogeneous system |
US10957095B2 (en) * | 2018-08-06 | 2021-03-23 | Intel Corporation | Programmable ray tracing with hardware acceleration on a graphics processor |
US10963300B2 (en) * | 2018-12-06 | 2021-03-30 | Raytheon Company | Accelerating dataflow signal processing applications across heterogeneous CPU/GPU systems |
WO2020257976A1 (en) * | 2019-06-24 | 2020-12-30 | Intel Corporation | Apparatus and method for scheduling graphics processing resources |
US11449339B2 (en) * | 2019-09-27 | 2022-09-20 | Red Hat, Inc. | Memory barrier elision for multi-threaded workloads |
US11327793B2 (en) * | 2020-02-18 | 2022-05-10 | International Business Machines Corporation | Garbage collection work stealing mechanism |
US11340942B2 (en) * | 2020-03-19 | 2022-05-24 | Raytheon Company | Cooperative work-stealing scheduler |
CN113886057B (en) * | 2020-07-01 | 2024-06-28 | 西南科技大学 | Dynamic resource scheduling method based on analysis technology and data stream information on heterogeneous many-core |
US11698816B2 (en) * | 2020-08-31 | 2023-07-11 | Hewlett Packard Enterprise Development Lp | Lock-free work-stealing thread scheduler |
CN115705194B (en) * | 2021-08-13 | 2024-09-24 | 华为技术有限公司 | Code processing method and corresponding device under hardware memory sequence architecture |
US11875200B2 (en) * | 2021-09-23 | 2024-01-16 | International Business Machines Corporation | Adjunct processor extended message limit facility |
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6748593B1 (en) * | 2000-02-17 | 2004-06-08 | International Business Machines Corporation | Apparatus and method for starvation load balancing using a global run queue in a multiple run queue system |
US6826583B1 (en) * | 2000-05-15 | 2004-11-30 | Sun Microsystems, Inc. | Local allocation buffers for parallel garbage collection |
US7117502B1 (en) | 2000-11-10 | 2006-10-03 | Sun Microsystems, Inc. | Linked-list implementation of a data structure with concurrent non-blocking insert and remove operations |
US7167916B2 (en) * | 2002-08-30 | 2007-01-23 | Unisys Corporation | Computer OS dispatcher operation with virtual switching queue and IP queues |
US7516456B2 (en) * | 2003-09-25 | 2009-04-07 | International Business Machines Corporation | Asymmetric heterogeneous multi-threaded operating system |
US20050210472A1 (en) * | 2004-03-18 | 2005-09-22 | International Business Machines Corporation | Method and data processing system for per-chip thread queuing in a multi-processor system |
ITMI20040600A1 (en) * | 2004-03-26 | 2004-06-26 | Atmel Corp | DSP SYSTEM ON DOUBLE PROCESSOR WITH MOBILE COMB IN THE COMPLEX DOMAIN |
GB2427045B (en) * | 2005-06-06 | 2007-11-21 | Transitive Ltd | Method and apparatus for converting program code with access coordination for a shared resource |
US8539493B1 (en) * | 2006-03-23 | 2013-09-17 | Emc Corporation | Configurable prioritization and aging of queued tasks |
US8209493B2 (en) * | 2008-03-26 | 2012-06-26 | Intel Corporation | Systems and methods for scheduling memory requests during memory throttling |
US8607237B2 (en) * | 2008-06-02 | 2013-12-10 | Microsoft Corporation | Collection with local lists for a multi-processor system |
US8266394B2 (en) * | 2008-07-14 | 2012-09-11 | International Business Machines Corporation | Methods for single-owner multi-consumer work queues for repeatable tasks |
US8813091B2 (en) * | 2008-08-04 | 2014-08-19 | Oracle America, Inc. | Distribution data structures for locality-guided work stealing |
US8321558B1 (en) * | 2009-03-31 | 2012-11-27 | Amazon Technologies, Inc. | Dynamically monitoring and modifying distributed execution of programs |
US8572622B2 (en) | 2009-12-30 | 2013-10-29 | International Business Machines Corporation | Reducing queue synchronization of multiple work items in a system with high memory latency between processing nodes |
US8464026B2 (en) | 2010-02-17 | 2013-06-11 | International Business Machines Corporation | Method and apparatus for computing massive spatio-temporal correlations using a hybrid CPU-GPU approach |
US8806497B2 (en) * | 2010-03-05 | 2014-08-12 | Microsoft Corporation | System and method for altering the work definitions in an iterative parallel opportunistic work stealing scheduler |
WO2011148553A1 (en) * | 2010-05-24 | 2011-12-01 | 株式会社ソニー・コンピュータエンタテインメント | Information processing device and information processing method |
US9152468B2 (en) | 2010-10-25 | 2015-10-06 | Samsung Electronics Co., Ltd. | NUMA aware system task management |
JP5576305B2 (en) * | 2011-01-20 | 2014-08-20 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Computer operation control method, program, and system |
KR101895453B1 (en) * | 2011-11-09 | 2018-10-25 | 삼성전자주식회사 | Apparatus and method for guarantee security in heterogeneous computing environment |
US8842122B2 (en) * | 2011-12-15 | 2014-09-23 | Qualcomm Incorporated | Graphics processing unit with command processor |
US9116739B2 (en) * | 2013-03-14 | 2015-08-25 | Intel Corporation | Fast and scalable concurrent queuing system |
EP3242210B1 (en) | 2013-03-15 | 2022-08-03 | INTEL Corporation | Work stealing in heterogeneous computing systems |
2013
- 2013-03-15 EP EP17177561.2A patent/EP3242210B1/en active Active
- 2013-03-15 US US13/976,579 patent/US20160154677A1/en not_active Abandoned
- 2013-03-15 CN CN201710028738.2A patent/CN107092573B/en active Active
- 2013-03-15 CN CN201380073056.3A patent/CN106164881B/en active Active
- 2013-03-15 WO PCT/US2013/032707 patent/WO2014143067A1/en active Application Filing
- 2013-03-15 EP EP13877641.4A patent/EP2972907B1/en active Active
2016
- 2016-12-27 US US15/391,549 patent/US11138048B2/en active Active
2021
- 2021-10-04 US US17/493,419 patent/US20220027210A1/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101091175A (en) * | 2004-09-16 | 2007-12-19 | 辉达公司 | Load balancing |
US20110055838A1 (en) * | 2009-08-28 | 2011-03-03 | Moyes William A | Optimized thread scheduling via hardware performance monitoring |
CN102053870A (en) * | 2009-10-28 | 2011-05-11 | 国际商业机器公司 | Systems and methods for affinity driven distributed scheduling of parallel computations |
US20120054771A1 (en) * | 2010-08-31 | 2012-03-01 | International Business Machines Corporation | Rescheduling workload in a hybrid computing environment |
WO2012082557A2 (en) * | 2010-12-15 | 2012-06-21 | Advanced Micro Devices, Inc. | Dynamic work partitioning on heterogeneous processing devices |
CN102360313A (en) * | 2011-09-29 | 2012-02-22 | 中国科学技术大学苏州研究院 | Performance acceleration method of heterogeneous multi-core computing platform on chip |
Non-Patent Citations (1)
Title |
---|
VINICIUS GARCIA PINTO et al.: "Scheduling by Work-Stealing in Hybrid Parallel Architectures", HTTPS://WWW.INF.UFRGS.BR/GPPD/WSPPD/2012/PAPERS/WSPPD2012_SUBMISSION_23.PDF.MOD.PDF *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777710A (en) * | 2016-12-22 | 2017-05-31 | 中国兵器装备集团自动化研究所 | A kind of method of the CUDA kernels realized on FPGA |
CN108319510A (en) * | 2017-12-28 | 2018-07-24 | 大唐软件技术股份有限公司 | A kind of isomery processing method and processing device |
CN112463709A (en) * | 2019-09-09 | 2021-03-09 | 上海登临科技有限公司 | Configurable heterogeneous artificial intelligence processor |
Also Published As
Publication number | Publication date |
---|---|
US20220027210A1 (en) | 2022-01-27 |
US20170109213A1 (en) | 2017-04-20 |
CN107092573B (en) | 2023-04-18 |
WO2014143067A1 (en) | 2014-09-18 |
EP2972907B1 (en) | 2019-09-25 |
CN106164881B (en) | 2022-01-14 |
EP3242210A1 (en) | 2017-11-08 |
CN107092573A (en) | 2017-08-25 |
EP3242210B1 (en) | 2022-08-03 |
US11138048B2 (en) | 2021-10-05 |
EP2972907A4 (en) | 2016-11-30 |
EP2972907A1 (en) | 2016-01-20 |
US20160154677A1 (en) | 2016-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106164881A (en) | Work in heterogeneous computing system is stolen | |
CN103765376B (en) | Graphic process unit with clog-free parallel architecture | |
Chen et al. | Dynamic load balancing on single-and multi-GPU systems | |
Hermann et al. | Multi-GPU and multi-CPU parallelization for interactive physics simulations | |
CN103365631B (en) | For the dynamic base pattern addressing of memory access | |
US11934826B2 (en) | Vector reductions using shared scratchpad memory | |
CN103294536B (en) | Control to be distributed for the work of process task | |
CN103559014A (en) | Method and system for processing nested stream events | |
CN109997115A (en) | Low-power and low latency GPU coprocessor for persistently calculating | |
CN104572568A (en) | Read lock operation method, write lock operation method and system | |
CN101978350A (en) | Vector instructions to enable efficient synchronization and parallel reduction operations | |
US20100031267A1 (en) | Distribution Data Structures for Locality-Guided Work Stealing | |
US11340942B2 (en) | Cooperative work-stealing scheduler | |
CN103154892A (en) | Method, system and apparatus for multi-level processing | |
CN103885903A (en) | Technique For Performing Memory Access Operations Via Texture Hardware | |
US8413151B1 (en) | Selective thread spawning within a multi-threaded processing system | |
Chavarria-Miranda et al. | Global Futures: A multithreaded execution model for Global Arrays-based applications | |
WO2021154732A1 (en) | Shared scratchpad memory with parallel load-store | |
Vinkler et al. | Massively parallel hierarchical scene processing with applications in rendering | |
CN112346879B (en) | Process management method, device, computer equipment and storage medium | |
CN116775265A (en) | Collaborative group array | |
CN116774914A (en) | Distributed shared memory | |
Gaster | HSA memory model. | |
Silva et al. | Terrame hpa: parallel simulation of multi-agent systems over smps | |
Masko et al. | Task scheduling for SoC-based dynamic SMP clusters with communication on the fly |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |