CN106164881A - Work stealing in heterogeneous computing systems - Google Patents

Work stealing in heterogeneous computing systems

Info

Publication number
CN106164881A
CN106164881A (application CN201380073056.3A)
Authority
CN
China
Prior art keywords
work
hardware
queue
computational unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201380073056.3A
Other languages
Chinese (zh)
Other versions
CN106164881B (en)
Inventor
R. Barik
S. A. Herhut
J. Sreeram
T. Shpeisman
R. L. Hudson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp filed Critical Intel Corp
Priority to CN201710028738.2A (published as CN107092573B)
Publication of CN106164881A
Application granted
Publication of CN106164881B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/42Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4204Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
    • G06F13/4234Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being a memory bus
    • G06F13/4239Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being a memory bus with asynchronous protocol
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Hardware Redundancy (AREA)
  • Multi Processors (AREA)

Abstract

A work stealer apparatus includes a determination module. The determination module determines to steal work from a first hardware compute unit of a first type on behalf of a second hardware compute unit of a second type different from the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is to be stored in a shared memory shared by the first and second hardware compute units. A synchronized work stealer module steals the work through a synchronized memory access to the first work queue. The synchronized memory access is synchronized relative to memory accesses to the first work queue from the first hardware compute unit.

Description

Work stealing in heterogeneous computing systems
Copyright notice
A portion of the material contained herein is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of this patent by anyone as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Technical field
The embodiments described herein relate generally to heterogeneous computing systems. In particular, the embodiments described herein relate to work stealing in heterogeneous computing systems.
Background
Computers often have one or more hardware accelerator devices in addition to the general-purpose processor deployed as the central processing unit (CPU). One widely used example of such a hardware accelerator device is the graphics processing unit (GPU). GPUs have traditionally served as part of the graphics subsystem that drives one or more displays. The GPU helps offload work from the CPU, freeing CPU resources for other tasks. The CPU is typically implemented as a general-purpose processor and is generally well suited to executing many different kinds of algorithms (e.g., processing of a general nature). By contrast, the GPU is often implemented as a special-purpose processor, or at least a more specialized one, typically better suited to executing a limited number of specialized algorithms (e.g., graphics-related algorithms). GPUs often have highly parallel computational hardware that tends to allow them to process graphics data quickly.
More recently, it has been recognized that GPUs can be used for other, non-graphics applications. For example, general-purpose computing on GPUs (GPGPU) represents the use of GPUs to perform general-purpose computations of the type that were previously reserved for CPUs. The highly parallel computational hardware of the GPU can often significantly accelerate such computations. When used in this way, at least when correctly implemented, the GPU can help substantially speed up computationally intensive applications.
One challenge in using GPUs to accelerate non-graphics applications (e.g., in the case of GPGPU), however, involves work scheduling. To make effective use of both the CPU and the GPU, it is important to schedule and distribute work between them efficiently. In one possible approach, the CPU may simply offload or assign particular tasks to the GPU. The CPU may then wait, or perform other tasks, while the GPU completes the assigned tasks and provides the results. In this approach, however, there is no actual cooperation between the CPU and the GPU in processing the same workload. That is, the CPU and the GPU may work on different parts of the same workload at different times, but they do not truly execute in a blended fashion.
In another approach, the CPU and the GPU may participate in hybrid execution, in which they work cooperatively on different portions of a parallel workload that has been statically scheduled onto the CPU and the GPU. Such hybrid execution offers the potential advantage of using the CPU and the GPU on the same workload simultaneously. However, it tends to be challenging to schedule the different portions of the workload between the CPU and the GPU effectively. For example, the CPU and the GPU may operate at different clock rates, and/or may have different memory hierarchies, and/or may have different performance characteristics depending on the workload. Code that executes efficiently on one of the CPU and the GPU may not execute as efficiently on the other. The presence of other competing workloads and/or power constraints can further complicate the ability to schedule the workload effectively a priori. A drawback is that, if the workload is scheduled ineffectively, one of the CPU and the GPU may become overburdened while the other is underutilized.
Brief Description of the Drawings
The invention may best be understood by reference to the following description and accompanying drawings, which are used to illustrate embodiments of the invention. In the drawings:
FIG. 1 is a block diagram of an embodiment of a heterogeneous computing or computer system.
FIG. 2 is a block diagram of an embodiment of a synchronized work stealer system.
FIG. 3 is a block flow diagram of an embodiment of a method of stealing work in a heterogeneous computer system.
FIG. 4 is a block diagram illustrating an embodiment of bidirectional synchronized work stealing between an embodiment of a CPU and an embodiment of a GPU.
FIG. 5 is a block diagram of an environment in which an embodiment of a synchronized work stealer system may be used.
FIG. 6 is a block diagram of an example embodiment of a heterogeneous computing or computer system.
FIG. 7 is a block diagram of another example embodiment of a heterogeneous computing or computer system.
Detailed description of the invention
In the following description, numerous specific details are set forth (e.g., particular processor units, heterogeneous computer systems, sequences of operations, logic partitioning/integration details, types and interrelationships of system components, and the like). It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
FIG. 1 is a block diagram of an embodiment of a heterogeneous computing or computer system 100. In various embodiments, the heterogeneous computing system may represent a desktop computer, laptop computer, notebook computer, netbook computer, workstation, personal digital assistant (PDA), smartphone, cell phone, mobile computing device, server, Internet appliance, or any of various other types of computer systems or computing systems known in the art.
The heterogeneous computing system includes at least two heterogeneous (i.e., different) types of hardware compute units. The hardware compute units are also referred to herein simply as compute units. In the illustrated embodiment, the heterogeneous computer system includes a first hardware compute unit 101-1 of a first type and a second hardware compute unit 101-2 of a second, different type. The first and second compute units may be coupled, for example, by an interconnect 111. Other embodiments may include three or more heterogeneous compute units. Examples of suitable compute units include, but are not limited to, processors, cores, hardware threads, thread slots, hardware capable of maintaining an independent execution state, and the like.
In some embodiments, the first compute unit 101-1 may be a general-purpose compute unit (or at least relatively more general-purpose than the second compute unit 101-2), while the second compute unit 101-2 may not be a general-purpose compute unit and/or may be a special-purpose compute unit (or at least relatively more special-purpose than the first compute unit 101-1). In some embodiments, the first compute unit 101-1 may be one of a CPU, a general-purpose processor, and a general-purpose core, while the second compute unit is not. For example, the second compute unit 101-2 may be one of a graphics processor (e.g., a GPU, graphics coprocessor, graphics core, etc.), a hardware accelerator device (e.g., a special-purpose accelerator, a fixed-function accelerator, etc.), a cryptographic processor, a communications processor, a network processor, a special-purpose processor, a special-purpose core, highly parallel special-purpose hardware, a digital signal processor (DSP), a field-programmable gate array (FPGA), and the like.
In some embodiments, the first compute unit 101-1 may be a CPU and the second compute unit may be a graphics processor (e.g., a GPU, graphics coprocessor, graphics core, etc.), although the scope of the invention is not so limited. In such embodiments, the heterogeneous computing system may represent a GPGPU system, a CPU-GPU cooperation system, or the like. In this detailed description, a CPU and a GPU are frequently used as examples of the first and second heterogeneous compute units, but it is to be appreciated that, in alternative embodiments, other combinations of heterogeneous compute units may be used instead (e.g., a relatively more general-purpose processor together with a relatively more special-purpose processor selected from among special-purpose processors, special-purpose cores, hardware accelerator devices, DSPs, FPGAs, etc.).
Referring again to FIG. 1, when deployed in the heterogeneous computing system, the first compute unit 101-1 and the second compute unit 101-2 are each operable to be coupled with a shared memory 102, for example by interconnects 108, 109. The shared memory is accessible to, and shared by, the first and second compute units. In some embodiments, the shared memory may represent shared virtual memory. The shared memory or shared virtual memory may represent a portion of physical memory implemented in one or more memory devices of one or more memory types. In some embodiments, the shared memory may be implemented in dynamic random access memory (DRAM), although the scope of the invention is not so limited.
In some embodiments, a first work queue 104-1 corresponding to the first compute unit 101-1 and a second work queue 104-2 corresponding to the second compute unit 101-2 may be stored in the shared memory 102. The first work queue 104-1 may be operable to receive and queue work for the first compute unit 101-1. The second work queue 104-2 may be operable to receive and queue work for the second compute unit 101-2. For simplicity, only two work queues are shown in the illustrated example, but in some embodiments there may be multiple work queues for the first compute unit (e.g., one for each of multiple cores) and/or multiple work queues for the second compute unit (e.g., optionally one for each of multiple multiprocessors or other groups of cores).
Referring again to FIG. 1, in some embodiments the first compute unit 101-1 (e.g., a CPU, general-purpose processor, general-purpose core, etc.) may include a first work scheduler module 103-1. The first work scheduler module 103-1 may be operable to schedule work on the first compute unit 101-1. In some embodiments, the second compute unit 101-2 (e.g., a GPU, graphics core, hardware accelerator, special-purpose processor, special-purpose core, etc.) may include a second work scheduler module 103-2. The second work scheduler module 103-2 may be operable to schedule work on the second compute unit 101-2. In some embodiments, the first and second work scheduler modules may be operable to schedule work on the first and second compute units so as to achieve hybrid execution and/or cooperative computation on different portions of a common workload (e.g., different portions of a data-parallel workload) across the heterogeneous compute units. For example, in some embodiments, the CPU and the GPU may work together on a GPGPU workload.
As an example, the workload may be divided into chunks, units of work, or other portions. These chunks, units of work, or portions may be scheduled between the first and second compute units and queued in the corresponding first and second work queues. In some embodiments, such queuing may be implemented, for example, by encoding the data-parallel operation as a series of minimum and maximum index ranges, where an individual unit of work is represented by a tuple (e.g., (min1, max1)). The overall index range of the data-parallel operation may be divided into chunks of work. In some cases, the chunk size may be selected so that it corresponds to a multiple of the number of hardware threads, SIMD lanes, or cores of a compute unit (e.g., optionally the GPU). Further, it may optionally be related to the total number of hardware threads, SIMD lanes, or cores divided by the number of barriers that the compute unit can support simultaneously. While not required, this can help ensure that the chunks of work make effective use of the computational resources and that the initial schedule produced by a hardware thread scheduler, if there is one, is effective. The chunks may then be distributed among the work queues, for example evenly, or based on various algorithms/criteria. At run time, each compute unit may retrieve and process chunks or units of work from its corresponding work queue.
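The chunking scheme described above can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: the helper names and the size multiplier are assumptions.

```python
def chunk_size_for(hw_threads, simd_lanes, multiplier=4):
    # Heuristic from the text: a multiple of the compute unit's
    # hardware threads x SIMD lanes (the multiplier is an assumption).
    return hw_threads * simd_lanes * multiplier

def make_chunks(total, chunk_size):
    # Encode a data-parallel index range [0, total) as (min, max)
    # tuples, each tuple being one unit of work.
    return [(lo, min(lo + chunk_size, total))
            for lo in range(0, total, chunk_size)]

# Example: a hypothetical GPU with 4 workgroups of 16 SIMD lanes.
chunks = make_chunks(1000, chunk_size_for(4, 16))
```

Splitting 1000 iterations with a chunk size of 256 yields (0, 256), (256, 512), (512, 768), (768, 1000); each tuple can then be enqueued on either unit's work queue.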
This may continue until the workload is complete. As an example, completion may be indicated by all queues being empty. In some cases, a termination token may optionally be used. For example, a termination token TERM EMPTY may be represented as max-int. In embodiments in which all workers steal in a deterministic order and no worker generates more work, the return value EMPTY typically indicates termination. In embodiments that allow stealing operations in a random order, it suffices to let at least one worker (e.g., a CPU thread), after it has finished its own work, traverse all of the queues in a deterministic order. If it finds all queues empty, it may write the EMPTY termination token to all queues, or at least to all work queues of the other compute unit (e.g., the GPU). This can help terminate the data-parallel workload.
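The termination sweep described above might look roughly as follows. The function name and the use of a simple string sentinel in place of the max-int token are assumptions for illustration:

```python
EMPTY = "TERM_EMPTY"  # stand-in for the max-int termination token

def sweep_and_terminate(all_queues):
    # A worker that has exhausted its own work scans every queue in a
    # deterministic order; if all are empty, it broadcasts the token
    # so workers on the other compute unit also stop.
    if any(q for q in all_queues):
        return False          # someone still has work; keep stealing instead
    for q in all_queues:
        q.append(EMPTY)       # write the termination token to every queue
    return True
```

A real system would perform the scan and the token writes through synchronized accesses to the shared-memory queues; the plain list operations here only model the control flow.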
In some embodiments, work stealing may be performed, for example, to help improve load balancing, improve performance, reduce power consumption, and the like. The term "work stealing" is a term used in the art. As used herein, "work stealing" refers broadly to reassigning work from one compute unit to another, redistributing a task from one compute unit to another, moving work from the queue of one compute unit to the queue of another compute unit, allowing a compute unit to claim or become responsible for work previously claimed by, or the responsibility of, another compute unit, and the like.
Referring again to FIG. 1, in some embodiments the first compute unit 101-1 (e.g., a CPU, general-purpose processor, general-purpose core, etc.) may optionally include an embodiment of a first synchronized work stealer system 105-1, although this is not required. The first synchronized work stealer system 105-1 may be operable to perform synchronized work stealing from the second compute unit 101-2, so that the first compute unit 101-1 can do that work, although the scope of the invention is not so limited. In some embodiments, the first synchronized work stealer system 105-1 may access the first work queue 104-1 and the second work queue 104-2. The first synchronized work stealer system may be implemented in any combination of software, firmware, and hardware.
In some embodiments, the second compute unit 101-2 (e.g., a GPU, graphics core, hardware accelerator, special-purpose processor, special-purpose core, etc.) may include an embodiment of a second synchronized work stealer system 105-2. In some embodiments, the second synchronized work stealer system 105-2 may be operable to perform synchronized work stealing from the first compute unit 101-1, so that the second compute unit 101-2 can do that work. The stolen work may be taken from the first work queue 104-1 and added to the second work queue 104-2. In some embodiments, the second synchronized work stealer system 105-2 may access the first work queue 104-1 and the second work queue 104-2. The second synchronized work stealer system may be implemented in any combination of software, firmware, and hardware.
In some embodiments, work stealing may be based on how full the first work queue 104-1 and the second work queue 104-2 currently are. For example, in some embodiments, if the second work queue 104-2 becomes empty, is filled below a threshold level, or is otherwise perceived by the second synchronized work stealer system 105-2 as insufficiently full, the second synchronized work stealer system 105-2 may steal work from the first work queue 104-1 and place the stolen work in the second work queue 104-2. As another example, in some embodiments, if the second synchronized work stealer system 105-2 perceives the first work queue 104-1 as overly full, or as filled above a threshold level, it may steal work from the overly full first work queue 104-1. In other embodiments, when the first work queue 104-1 is underfilled and/or when the second work queue 104-2 is overfilled, the first synchronized work stealer system 105-1 may perform a similar or reciprocal type of work stealing. Other reasons for work stealing are also contemplated.
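The fullness test can be sketched as below. The patent leaves the thresholds open, so the watermark values and names here are arbitrary assumptions:

```python
def should_steal(own_queue, victim_queue, low_watermark=1, high_watermark=4):
    # Steal when our own queue looks insufficiently full and the
    # victim's queue holds enough work to be worth raiding.
    starving = len(own_queue) <= low_watermark
    victim_rich = len(victim_queue) >= high_watermark
    return starving and victim_rich
```

Either stealer system could run the same test with the roles of the two queues swapped, matching the reciprocal stealing the text describes.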
In some embodiments, work stealing may be bidirectional, proceeding in either direction between the first and second compute units. For example, the first synchronized work stealer system 105-1 may steal work from the second compute unit 101-2 for the first compute unit 101-1 to do (e.g., move work from the second work queue 104-2 to the first work queue 104-1), and the second synchronized work stealer system 105-2 may steal work from the first compute unit 101-1 for the second compute unit 101-2 to do (e.g., move work from the first work queue 104-1 to the second work queue 104-2). In some embodiments, mutually synchronized bidirectional work stealing may be performed. In some embodiments, random, substantially concurrent, mutually synchronized bidirectional work stealing may optionally be used. In other embodiments, the second compute unit 101-2 may optionally steal work from the first compute unit 101-1 without the first compute unit stealing work from the second compute unit. This may not provide as great an advantage but, if desired, can help allow a simpler implementation. In such cases, the first compute unit may optionally omit the synchronized work stealer system 105-1.
Referring again to FIG. 1, in some embodiments, synchronized work stealing operations 106, 107 may optionally be performed. In some embodiments, the synchronized work stealing operations may be performed through synchronized accesses to the work queues 104-1, 104-2 and/or synchronized accesses to the shared memory 102. For example, the second compute unit 101-2 may issue synchronizations (e.g., atomic accesses/operations 110) that help achieve synchronized work stealing (e.g., synchronized work steal 106). In some embodiments, the synchronized work stealing operations may be operable to help prevent two entities from stealing the same work and/or performing the same work. They may also help prevent corruption of the shared data structures used to implement the work queues, and can help allow those structures to transition from one valid state to another valid state. In some embodiments, a synchronized work stealing operation may be made fully complete, in the sense that work is not stolen from a queue and then left unexecuted by the stealing compute unit.
In some embodiments, the synchronized work stealing operations may be performed with one or more memory access synchronization primitives and/or instructions and/or operations. In some embodiments, the dequeue and steal operations may use a set of memory access synchronization primitives/instructions/operations supported by both the first and second compute units. In some embodiments, the memory access synchronization primitives/instructions/operations may be implemented on hardware that supports any desired such memory access synchronization primitives/instructions/operations between the first and second compute units (e.g., a CPU and a GPU). Examples of suitable primitives/instructions/operations include, but are not limited to, memory fence and/or barrier macroinstructions, atomic memory access macroinstructions, OpenCL atomic operations, CUDA memory access synchronization operations, and other approaches known in the art.
As an example, in an atomic memory access operation, a processor may read a memory location and write to the memory location substantially simultaneously (e.g., in the same bus cycle). Such an atomic operation can help prevent other compute units or external devices from writing to or reading from the memory location until the atomic operation completes. Generally, an atomic memory access operation will be performed entirely or not at all. Examples of suitable atomic memory access primitives/instructions/operations include, but are not limited to, read-modify-write, compare-and-exchange, compare-and-swap, test-and-set, compare-and-set, load-linked/store-conditional instructions, and the like, and various combinations thereof. These may be carried out through macroinstructions, OpenCL atomic operations, CUDA memory access synchronization operations, or other approaches known in the art.
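To show why an all-or-nothing compare-and-swap lets two stealers contend safely over one queue slot, here is a toy model in Python. A lock stands in for the hardware atomicity; on a real CPU/GPU pair this would be a shared-memory atomic operation, and all class and function names here are illustrative:

```python
import threading

class AtomicIndex:
    # Toy compare-and-swap cell; the lock models the atomicity that
    # hardware provides within a single operation.
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False      # another thief got there first; retry
            self._value = new
            return True

def steal_one(head, items):
    # Claim the slot at head; the CAS guarantees each slot is won
    # by exactly one stealer, never two.
    while True:
        h = head.load()
        if h >= len(items):
            return None           # queue drained, nothing to steal
        if head.compare_and_swap(h, h + 1):
            return items[h]
```

The retry loop is the usual CAS pattern: a stealer that loses the race simply re-reads the head and tries the next slot, so the queue moves from one valid state to another without duplicated or lost work.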
In some embodiments, these may include one or more memory access fence instructions. Memory access fence instructions are also sometimes referred to in the art as memory access barrier instructions (e.g., in certain architectures). Examples of such memory access fence and/or barrier instructions include, but are not limited to, load fences/barriers (e.g., the LFENCE instruction), store fences/barriers (e.g., the SFENCE instruction), and combined load and store fences/barriers (e.g., the MFENCE instruction), and the like, and various combinations thereof. Such fence or barrier operations may be implemented through macroinstructions, OpenCL operations, CUDA operations, or other approaches known in the art.
Advantageously, such work stealing can help allow more effective scheduling and distribution of work between the first and second compute units, which may result in better resource utilization and improved performance. When work stealing is allowed, and particularly when bidirectional work stealing is allowed, it is not required that the workload be statically scheduled between the heterogeneous compute units in a highly effective way. As discussed in the background section, such a priori static workload scheduling tends to be difficult for a number of reasons (e.g., potentially different clock rates, different memory hierarchies, different code execution efficiencies, the presence of other workloads, power limits, etc.). However, if for some reason the work is initially scheduled in an ineffective way that causes one compute unit to become overburdened (or underutilized), work stealing may be performed in conjunction with load balancing to help rebalance the workload, and thereby help mitigate over- and/or under-utilization of the various compute units. In other embodiments, work stealing may optionally be performed for purposes other than load balancing, for example to maximize processing speed, reduce power consumption, and so on.
FIG. 2 is a block diagram of an embodiment of a synchronized work stealer system 205. Also shown are a first work queue 204-1 and a second work queue 204-2 in a shared memory 202. In some embodiments, the synchronized work stealer system of FIG. 2 may be included in the second hardware compute unit 101-2 and/or the heterogeneous computing system 100 of FIG. 1. Alternatively, the synchronized work stealer system of FIG. 2 may be included in a similar or different compute unit or heterogeneous system. Moreover, the second hardware compute unit and heterogeneous system of FIG. 1 may include a synchronized work stealer system that is similar to, or different from, that of FIG. 2.
The synchronized work stealer system 205 includes a determination module 220. The determination module is operable to determine whether to steal work 222 from a first hardware compute unit of a first type (e.g., the first compute unit 101-1) for a second hardware compute unit of a second, different type (e.g., the second compute unit 101-2). In some embodiments, the second type is more special-purpose than the first type. The work 222 may be queued in a first work queue 204-1 corresponding to the first hardware compute unit. The first work queue is in the shared memory 202, which is shared by the first and second hardware compute units. As shown, in some embodiments the determination module may be coupled with, or otherwise in communication with, the first work queue 204-1 and/or the second work queue 204-2. In some embodiments, the determination module may determine whether to steal work based on the fullness of one or more of the work queues. This may be done as described elsewhere herein.
The synchronized work stealer system 205 also includes a synchronized work stealer module 221. The synchronized work stealer module is operable to steal the work 222 from the first hardware compute unit and/or the first work queue 204-1 and provide it as stolen work 223 for the second hardware compute unit, and/or add it to the second work queue 204-2. In some embodiments, the synchronized work stealer module is operable to steal the work through a synchronized memory access 210 to the first work queue 204-1. In some embodiments, the synchronized memory access 210 may be synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
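Putting the two modules together, the determine-then-steal flow of FIG. 2 might be sketched as below. The class shape, the watermark default, and the pop-from-the-front policy are illustrative assumptions; in the real system the pop and append would be synchronized memory accesses to queues in shared memory:

```python
class SynchronizedWorkStealer:
    # Pairs a determination step (is a steal warranted?) with a steal
    # step that moves one unit of work from the victim's queue to ours.
    def __init__(self, victim_queue, own_queue, low_watermark=1):
        self.victim_queue = victim_queue
        self.own_queue = own_queue
        self.low_watermark = low_watermark

    def determine(self):
        # Steal only when our queue is below the watermark and the
        # victim actually has queued work.
        return (len(self.own_queue) < self.low_watermark
                and bool(self.victim_queue))

    def steal(self):
        if not self.determine():
            return None
        work = self.victim_queue.pop(0)   # synchronized access in reality
        self.own_queue.append(work)
        return work
```

A stealer built for the opposite direction would simply swap the two queue arguments, which is how the bidirectional case described earlier could be composed from two such objects.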
FIG. 3 is a block flow diagram of an embodiment of a method 335 of stealing work in a heterogeneous computer system. In some embodiments, the operations and/or method of FIG. 3 may be performed by, and/or within, the synchronized work stealer system 105-2 of FIG. 1 and/or the synchronized work stealer system 205 of FIG. 2. The components, features, and specific optional details described herein for those systems also optionally apply to the operations and/or methods that may, in embodiments, be performed by and/or within those systems. Alternatively, the operations and/or method of FIG. 3 may be performed by, and/or within, a similar or different synchronized work stealer system. Moreover, the synchronized work stealer systems of FIG. 1 and/or FIG. 2 may perform operations and/or methods that are similar to, or different from, those of FIG. 3.
The method includes, at block 336, determining to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second type that is more special-purpose than the first type. In some embodiments, the work may be queued in a first work queue corresponding to the first hardware compute unit. In some embodiments, the first work queue may be stored in a shared memory shared by the first and second hardware compute units.
The method is also included in block 337 and steals work.In certain embodiments, steal work and can comprise execution to the first work Make the synchronous memories access of queue.In certain embodiments, synchronous memories access can be relative to based on the first hardware The memory access of the first work queue calculating unit synchronizes.
Fig. 4 is a block diagram illustrating an embodiment of bidirectional synchronized work stealing 406, 407 between an embodiment of a CPU 401-1 and an embodiment of a GPU 401-2. The illustrated example of the CPU has four cores, namely CPU core 0, CPU core 1, CPU core 2, and CPU core 3. Other CPUs may have fewer or more cores. In the illustrated embodiment, each of these cores has a different work queue in a first set of work queues 404-1 corresponding to the CPU. Specifically, in the illustrated example, work queue ws_q[0] corresponds to CPU core 0, work queue ws_q[1] corresponds to CPU core 1, work queue ws_q[2] corresponds to CPU core 2, and work queue ws_q[3] corresponds to CPU core 3. In other embodiments, a many-to-one or one-to-many correspondence between cores and work queues may be used.
The illustrated example of the GPU has four workgroups, namely workgroup 0, workgroup 1, workgroup 2, and workgroup 3. These workgroups may also be referred to as streaming units or multiprocessors. In the illustrated example, each workgroup has 16 SIMD lanes, although other embodiments may have fewer or more SIMD lanes. Moreover, other GPUs may have fewer or more workgroups. For example, some GPUs have on the order of tens of multiprocessors or workgroups. Each of these workgroups includes one or more cores, potentially many cores. For example, each multiprocessor or workgroup of some GPUs has on the order of tens of cores. In the illustrated embodiment, each of these workgroups shares a second work queue 404-2 corresponding to the GPU. In other embodiments, multiple work queues may each be used by one or more workgroups. The work queues 404-1, 404-2 may be in shared memory.
The illustrated example shows examples of bidirectional synchronized work stealing operations 406, 407. While CPU core 0, CPU core 1, and CPU core 3 each work from their own corresponding work queues ws_q[0], ws_q[1], and ws_q[3], CPU core 2 performs a synchronized work stealing operation 407 from the GPU queue 404-2. Similarly, workgroup 0 (also referred to as streaming unit 0) performs a synchronized work stealing operation 406 from the work queue ws_q[3] corresponding to CPU core 3. This is an illustrative example. The work stealing operations 406, 407 may be similar to, or the same as, other work stealing operations described herein.
Fig. 5 is a block diagram of an environment in which an embodiment of a synchronized work stealer system 505 may be used. As shown, in some embodiments the synchronized work stealer system may be implemented on a conventional runtime, for example a conventional OpenCL runtime 542. In some embodiments, the synchronized work stealer system may wrap a workload, for example a conventional OpenCL workload kernel 540, in additional work stealing scheduler code. In some embodiments, the OpenCL runtime or another runtime (e.g., a CUDA runtime) need not be substantially modified.
Fig. 6 is a block diagram of an example embodiment of a heterogeneous computing or computer system 600. The system includes a CPU 601-1 and a card 652 having a GPU 601-2. The CPU includes any desired number of cores 650. The GPU includes any desired number of cores 651. The card also has a memory 653 coupled with the GPU. In some embodiments, the memory may include DRAM, although this is not required. The CPU is coupled with the GPU, and with the memory 653, by an interconnect 608. Any known interconnect is suitable, for example Peripheral Component Interconnect and derivatives or extensions thereof.
The memory 654 includes a shared memory 602. The shared memory includes a CPU work queue 604-1 and a GPU work queue 604-2. The CPU has a first synchronized work stealer system 605-1. The GPU has a second synchronized work stealer system 605-2. In some embodiments, either or both of these synchronized work stealer systems may be similar to, or the same as, those described elsewhere herein. For example, the second synchronized work stealer system 605-2 may be similar to, or the same as, the second synchronized work stealer system 105-2 of Fig. 1 and/or the synchronized work stealer system 205 of Fig. 2.
Fig. 7 is a block diagram of an example embodiment of a heterogeneous computing or computer system 700. The system includes a chip and/or die 761 coupled with a memory 754 by an interconnect. In some embodiments, the memory 754 may include DRAM, although this is not required. Any known interconnect is suitable, for example Peripheral Component Interconnect and derivatives or extensions thereof. The chip/die includes a CPU 701-1 and integrated graphics 701-2. The CPU includes any desired number of cores 750. The integrated graphics includes any desired number of cores 751. The CPU and the integrated graphics are coupled with an on-chip interconnect 760. Any on-chip interconnect mechanism known in the art is suitable. An integrated memory controller 762 is also coupled with the on-chip interconnect. The memory controller couples the chip/die with the memory 754 through an interconnect 708.
The memory 754 includes a shared memory 702. The shared memory includes a CPU work queue 704-1 and an integrated graphics work queue 704-2. The CPU has a first synchronized work stealer system 705-1. The integrated graphics has a second synchronized work stealer system 705-2. In some embodiments, either or both of these synchronized work stealer systems may be similar to, or the same as, those described elsewhere herein. For example, the second synchronized work stealer system 705-2 may be similar to, or the same as, the second synchronized work stealer system 105-2 of Fig. 1 and/or the synchronized work stealer system 205 of Fig. 2.
The following code illustrates the use of a suitable example embodiment of a synchronized work stealer system:
/* Notes:
1. The global work stealing queue data structure ws_q holds all of the work stealing queues of the CPU cores and the GPU. This data structure is allocated in the shared memory between the CPU and the GPU.
2. Work stealing queues ws_q[0], ..., ws_q[num_cpu_threads-1] represent the queues of the CPU cores. ws_q[num_cpu_threads] represents the GPU work stealing queue. num_cpu_threads represents the total number of CPU cores in the heterogeneous system. All of these work stealing queues are allocated in the shared memory between the CPU and the GPU. Each work stealing queue is described by the "WSQueue" data structure below.
3. Each GPU compute kernel is defined by the subroutine "actual_kernel", which takes as an argument the index in the iteration space on which it operates. The details of this subroutine are specific to the user application.
4. The subroutine "kernel_wrapper" described below is a software-based wrapper, generated by the compiler that compiles the GPU kernel, by the application itself, or by the OpenCL runtime.
5. The subroutine "steal_wrapper" described below looks for work across the work stealing queues of the CPU cores and the GPU. If it finds any work, it performs a steal operation on the corresponding work stealing queue. If nothing is found in any of the queues, it returns EMPTY.
6. The subroutine "steal" performs the actual steal from a given work stealing queue. When it steals, it first ensures that there is at least one item of work in the queue, and then atomically updates the top of the queue to ensure the steal is performed correctly.
7. The subroutine "cas" performs a compare-and-set operation on the shared memory between the CPU and the GPU.
*/
/* Each compute kernel, e.g. "actual_kernel", is wrapped using this subroutine to perform work stealing between the CPU and GPU work stealing queues; this wrapper code is generated by the user/compiler or by the OpenCL runtime. */
__kernel void kernel_wrapper (
    __global WSQueue *ws_q,  /* work stealing queue data structures for the CPU and GPU, allocated in shared memory */
    int num_cpu_threads      /* number of CPU cores */
) {
    int local_id = get_local_id(0);  // local_id of this work item within the work group
    __local int work_idx;            // index into the overall iteration space; shared by the work items in the work group
    if (local_id == 0)
        work_idx = steal_wrapper(ws_q, num_cpu_threads);  // one work item finds work, locally or by stealing
    barrier(CLK_LOCAL_MEM_FENCE);
    // local memory barrier; this ensures work_idx is visible to all work items in the work group
    if (work_idx == EMPTY) return;   // if we found no work, locally or through stealing, return
    /* call the actual OpenCL kernel */
    actual_kernel(work_idx + local_id);
}
/* This subroutine iterates through all of the work stealing queues in turn and attempts to find work. If nothing is found in any of the queues, it returns EMPTY, indicating termination. */
int steal_wrapper ( __global WSQueue *ws_q, int num_cpu_threads ) {
    /* ws_q[num_cpu_threads] represents the GPU work stealing queue, so we iterate in reverse order to look for work first in the GPU queue and then in the CPU queues. */
    for (int i = num_cpu_threads; i >= 0; i--) {
        __global WSQueue *each_ws_q = &ws_q[i];
        int work_idx = steal(each_ws_q);  // perform the actual steal operation on the work stealing queue
        if (work_idx >= 0) return work_idx;
        // if work is found, return the work index.
    }
    return EMPTY;  // if no work is found in any queue, return EMPTY.
}
/* This subroutine performs the actual steal operation to steal work from a given queue */
int steal ( __global WSQueue *a_ws_q ) {
    int top = a_ws_q->top;        // points to the top entry of the work stealing queue
    int bottom = a_ws_q->bottom;  // points to the bottom entry of the work stealing queue
    __global CircularArray *array = a_ws_q->activeArray;
    // find the array storing the (min, max) pairs
    int size = bottom - top;      // find the number of entries in the work stealing queue
    if (size <= 0) {              // no work; return EMPTY
        return EMPTY;
    }
    int o = (array->segment[top % (1 << array->log_size)]).min;
    // find the index of the first chunk at the top
    /* Since we have found work and are ready to steal, we use compare-and-set (cas) on the top of the queue to perform an atomic steal. */
    if ( !cas(a_ws_q, top, top+1) ) {  // compare-and-set on the top of the queue
        return ABORT;             // if we fail to perform the atomic operation, ABORT
    }
    return o;  // return the min index from the (min, max) pair, indicating the actual work index
}
/* This subroutine shows an implementation of the compare-and-set (cas) operation on the shared memory between the CPU and the GPU. */
bool cas ( __global WSQueue *a_ws_q, int oldVal, int newVal ) {
    int result;
    result = atomic_cmpxchg ( (volatile __global int *) &(a_ws_q->top), oldVal, newVal );
    // perform an atomic compare-and-swap operation on the shared memory
    return (result == oldVal);
}
/* List of data structures used to implement the work stealing queues */
/* Data structure for each individual work stealing queue */
typedef struct WS_Q_s {
    volatile int bottom;         // bottom of the work stealing queue
    volatile int top;            // top of the work stealing queue
    CircularArray *activeArray;  // underlying array holding all of the work chunks
} WSQueue;
typedef struct CircularArray_s {
    int log_size;     // log of the size of the circular array used to hold the work chunks
    pair_t *segment;  // array of work chunks
} CircularArray;
typedef struct pair_s {
    int min;  // minimum index from which to start work, if stolen
    int max;  // maximum index up to which to work, if stolen
} pair_t;
enum WSQ_Tag {
    EMPTY = -2,  // indicates the queue is EMPTY, and therefore termination
    ABORT = -1,  // indicates a steal operation failed to complete due to contention
};
Components, features, and details described for any of Fig. 1 and Figs. 4-7 may also optionally be used in any of Figs. 2-3. Moreover, components, features, and details described herein for any of the apparatus may also optionally be used in any of the methods described herein, which in embodiments may be performed by and/or with such apparatus.
Example Embodiments
The following examples pertain to further embodiments. Specifics in the examples may be used anywhere in one or more embodiments.
Example 1 is a work stealer apparatus. The work stealer includes a determination module to determine to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second type that is different than the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is stored in a shared memory that is shared by the first and second hardware compute units. The work stealer apparatus also includes a synchronized work stealer module to steal the work through a synchronized memory access to the first work queue, the synchronized memory access synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 2 includes the subject matter of Example 1, and optionally, in which the synchronized work stealer module is to add the work to a second work queue. The second work queue corresponds to the second hardware compute unit and is stored in the shared memory. The second type is optionally more specialized than the first type.
Example 3 includes the subject matter of Example 1, and optionally, in which the synchronized work stealer module is to steal the work through the synchronized memory access that includes an atomic operation performed on the first work queue.
Example 4 includes the subject matter of Example 3, and optionally, in which the atomic operation includes one of a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-link/store-conditional operation.
Example 5 includes the subject matter of Example 3, and optionally, in which the atomic operation includes an Open Computing Language (OpenCL) atomic operation.
Example 6 includes the subject matter of Example 3, and optionally, in which the synchronized work stealer module is to steal the work through a CUDA synchronized memory access.
Example 7 includes the subject matter of Example 1, and optionally, further including a second determination module to determine to steal second work from the second hardware compute unit for the first hardware compute unit. The second work is queued in a second work queue, which corresponds to the second hardware compute unit and is stored in the shared memory.
Example 8 includes the subject matter of Example 7, and optionally, further including a second synchronized work stealer module to steal the second work through a second synchronized memory access to the second work queue from the first hardware compute unit. The second synchronized memory access is synchronized relative to memory accesses to the second work queue from the second hardware compute unit.
Example 9 includes the subject matter of any of Examples 1-8, and optionally, in which the determination module is to determine to steal the work when a second work queue is one of empty and filled below a threshold level, the second work queue corresponding to the second hardware compute unit and stored in the shared memory.
Example 10 includes the subject matter of any of Examples 1-8, and optionally, in which the first hardware compute unit includes one selected from a general-purpose processor and a central processing unit (CPU). The second hardware compute unit includes one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communication processor, a network processor, a special-purpose processor, a special-purpose core, a highly-parallel special-purpose hardware, a digital signal processor (DSP), and a field-programmable gate array (FPGA).
Example 11 includes the subject matter of Example 10, and optionally, in which the first hardware compute unit includes the CPU, and in which the second hardware compute unit includes the graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
Example 12 includes the subject matter of Example 11, and optionally, in which the graphics processor includes the integrated graphics core, and in which the integrated graphics core and the CPU share a same last-level cache.
Example 13 is a method in a heterogeneous computing system. The method includes determining to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second, different type that is more specialized than the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is stored in a shared memory that is shared by the first and second hardware compute units. The method also includes stealing the work, including performing a synchronized memory access to the first work queue stored in the shared memory, the synchronized memory access synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 14 includes the subject matter of Example 13, and optionally, further including adding the work to a second work queue, which corresponds to the second hardware compute unit and is also stored in the shared memory.
Example 15 includes the subject matter of Example 13, and optionally, in which performing the synchronized memory access includes performing an atomic operation.
Example 16 includes the subject matter of Example 15, and optionally, in which performing the atomic operation includes performing an atomic operation selected from a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-link/store-conditional operation.
Example 17 includes the subject matter of Example 15, and optionally, in which performing the atomic operation includes performing an Open Computing Language (OpenCL) atomic operation.
Example 18 includes the subject matter of Example 15, and optionally, in which the stealing includes stealing the work by performing a CUDA synchronized memory access.
Example 19 includes the subject matter of Example 13, and optionally, further including determining to steal second work, and stealing the second work from the second hardware compute unit for the first hardware compute unit. The second work is queued in a second work queue, which corresponds to the second hardware compute unit and is stored in the shared memory.
Example 20 includes the subject matter of Example 19, and optionally, in which stealing the second work includes performing a synchronized memory access to the second work queue from the first hardware compute unit, the synchronized memory access synchronized relative to memory accesses to the second work queue from the second hardware compute unit.
Example 21 includes the subject matter of Example 13, and optionally, in which the determining includes determining to steal the work in response to a second work queue being one of empty and filled below a threshold level, the second work queue corresponding to the second hardware compute unit and stored in the shared memory.
Example 22 includes the subject matter of Example 13, and optionally, in which the first hardware compute unit is one selected from a general-purpose processor, a central processing unit (CPU), and a system-on-chip having multiple general-purpose cores. Moreover, the second hardware compute unit is one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communication processor, a network processor, a special-purpose processor, a special-purpose core, a special-purpose core of a system-on-chip, a highly-parallel special-purpose hardware, a digital signal processor (DSP), and a field-programmable gate array (FPGA).
Example 23 includes the subject matter of Example 22, and optionally, in which the first hardware compute unit includes the CPU, and in which the second hardware compute unit includes the graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
Example 24 includes the subject matter of Example 22, and optionally, in which the graphics processor includes the integrated graphics core, and in which the integrated graphics core and the cores of the CPU share a same last-level cache.
Example 25 is a heterogeneous computer system. The heterogeneous computer system includes an interconnect, a first hardware compute unit of a first type coupled with the interconnect, and a second hardware compute unit of a second, different type coupled with the interconnect. The second type is more specialized than the first type. The heterogeneous computer system also includes a dynamic random access memory (DRAM) coupled with the interconnect, the DRAM including a shared memory shared by the first and second hardware compute units. The shared memory includes a first work queue to queue work for the first hardware compute unit and a second work queue to queue work for the second hardware compute unit. The heterogeneous computer system also includes a work stealer apparatus to determine to steal work from the first queue through a synchronized memory access to the first work queue and to add it to the second queue. The synchronized memory access is synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 26 includes the subject matter of Example 25, and optionally, in which the work stealer apparatus is to steal the work through a synchronized memory access that includes an atomic operation on the first work queue.
Example 27 includes the subject matter of any of Examples 25-26, and optionally, further including a second work stealer apparatus to determine to steal second work from the second work queue through a second synchronized memory access to the second work queue and to add it to the first work queue, the second synchronized memory access synchronized relative to memory accesses to the second work queue from the second hardware compute unit.
Example 28 is one or more computer-readable storage media storing instructions that, if executed by a machine, will cause the machine to perform operations. The operations include determining to steal work from a first hardware compute unit of a first type for a second hardware compute unit of a second, different type that is more specialized than the first type. The work is queued in a first work queue, which corresponds to the first hardware compute unit and is stored in a shared memory shared by the first and second hardware compute units. The operations also include stealing the work through a synchronized memory access to the first work queue. The synchronized memory access is synchronized relative to memory accesses to the first work queue from the first hardware compute unit.
Example 29 includes the subject matter of Example 28, and optionally, in which the machine-readable storage medium further provides instructions that, if executed by the machine, will cause the machine to perform operations including adding the work to a second work queue, which corresponds to the second hardware compute unit and is stored in the shared memory.
Example 30 includes the subject matter of any of Examples 28-29, and optionally, in which the machine-readable storage medium further provides instructions that, if executed by the machine, will cause the machine to perform operations including stealing the work through a synchronized memory access that includes an atomic operation performed on the first queue.
Example 31 is a machine-readable storage medium storing instructions that, if executed by a machine, cause the machine to perform the method of any of Examples 13-24.
Example 32 is an apparatus including means for performing the method of any of Examples 13-24.
Example 33 is an apparatus to perform the method of any of Examples 13-24.
Example 34 is an apparatus to perform a method substantially as described herein.
Example 35 is an apparatus including means for performing a method substantially as described herein.
In the description and claims, the terms "coupled" and "connected", along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. For example, a processor may be coupled with a device through one or more intervening components (e.g., one or more interconnects and/or a chipset). In the figures, arrows are used to show connections and couplings.
In the description and claims, the term "logic" may have been used. As used herein, logic may include a module such as hardware, firmware, software, or a combination thereof. Examples of logic include integrated circuitry, application specific integrated circuits, analog circuits, digital circuits, programmed logic devices, memory devices including instructions, and the like.
The term "and/or" may have been used. As used herein, the term "and/or" means one or the other or both (e.g., A and/or B means A or B, or both A and B).
In the description above, for purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of embodiments of the invention. It will be apparent, however, to one skilled in the art, that one or more other embodiments may be practiced without some of these specific details. The particular embodiments described are not provided to limit the invention but to illustrate it through example embodiments. The scope of the invention is not to be determined by the specific examples but only by the claims. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form or without detail in order to avoid obscuring the understanding of the description.
Where considered appropriate, reference numerals, or terminal portions of reference numerals, have been repeated among the figures to indicate corresponding or analogous elements, which may optionally have similar or the same characteristics, unless specified or otherwise clearly apparent. In some cases, where multiple components have been described, they may be incorporated into a single component. In other cases, where a single component has been described, it may be partitioned into multiple components.
Various operations and methods have been described. Some of the methods have been described in a relatively basic form in the flow diagrams, but operations may optionally be added to and/or removed from the methods. In addition, while the flow diagrams show a particular order of the operations according to example embodiments, that particular order is exemplary. Alternate embodiments may optionally perform the operations in different order, combine certain operations, overlap certain operations, etc.
Some embodiments include an article of manufacture (e.g., a computer program product) that includes a machine-readable medium. The medium may include a mechanism that provides, for example stores, information in a form that is readable by the machine. The machine-readable medium may provide, or have stored thereon, one or more instructions that, if and/or when executed by a machine, are operable to cause the machine to perform and/or result in the machine performing one or more of the operations, methods, or techniques disclosed herein.
In some embodiments, the machine-readable medium may include a tangible and/or non-transitory machine-readable storage medium. For example, the tangible and/or non-transitory machine-readable storage medium may include a floppy diskette, an optical storage medium, an optical disk, an optical data storage device, a CD-ROM, a magnetic disk, a magneto-optical disk, a read only memory (ROM), a programmable ROM (PROM), an erasable-and-programmable ROM (EPROM), an electrically-erasable-and-programmable ROM (EEPROM), a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a flash memory, a phase-change memory, a phase-change data storage material, a non-volatile memory, a non-volatile data storage device, a non-transitory memory, a non-transitory data storage device, or the like. The non-transitory machine-readable storage medium does not include a transitory propagated signal. In another embodiment, the machine-readable medium may include a transitory machine-readable communication medium, for example an electrical, optical, acoustical, or other form of propagated signal, such as carrier waves, infrared signals, digital signals, or the like.
Examples of suitable machines include, but are not limited to, desktop computers, laptop computers, notebook computers, tablet computers, netbooks, smartphones, cellular phones, servers, network devices (e.g., routers and switches), Mobile Internet Devices (MIDs), media players, smart televisions, nettops, set-top boxes, and video game controllers, and other computer systems, computing devices, or electronic devices having one or more processors.
It should also be appreciated that references throughout this specification to "one embodiment", "an embodiment", or "one or more embodiments", for example, mean that a particular feature may be included in the practice of the invention. Similarly, it should be appreciated that in the description various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects may lie in less than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Claims (25)

1. A work stealing apparatus, comprising:
a determination module to determine to steal work from a first hardware computational unit of a first type for a second hardware computational unit of a second type different from the first type, wherein the work is queued in a first work queue, the first work queue corresponding to the first hardware computational unit and stored in a shared memory shared by the first and second hardware computational units; and
a synchronized work stealer module to steal the work through a synchronized memory access to the first work queue, the synchronized memory access being synchronized relative to memory accesses to the first work queue from the first hardware computational unit.
2. The apparatus of claim 1, wherein the synchronized work stealer module is to add the work to a second work queue, the second work queue corresponding to the second hardware computational unit and stored in the shared memory, and wherein the second type is more specialized than the first type.
3. The apparatus of claim 1, wherein the synchronized work stealer module is to steal the work through the synchronized memory access comprising an atomic operation performed on the first work queue.
4. The apparatus of claim 3, wherein the atomic operation comprises one of a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-linked/store-conditional operation.
5. The apparatus of claim 3, wherein the atomic operation comprises an Open Computing Language (OpenCL) atomic operation.
6. The apparatus of claim 3, wherein the synchronized work stealer module is to steal the work through a CUDA synchronized memory access.
7. The apparatus of claim 1, further comprising a second determination module to determine to steal second work from the second hardware computational unit for the first hardware computational unit, wherein the second work is queued in a second work queue, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
8. The apparatus of claim 7, further comprising a second synchronized work stealer module to steal the second work through a second synchronized memory access to the second work queue from the first hardware computational unit, the second synchronized memory access being synchronized relative to memory accesses to the second work queue from the second hardware computational unit.
9. The apparatus of any one of claims 1-8, wherein the determination module is to determine to steal the work when a second work queue is one of empty and filled below a threshold level, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
10. The apparatus of any one of claims 1-8, wherein the first hardware computational unit comprises one selected from a general-purpose processor and a central processing unit (CPU), and wherein the second hardware computational unit comprises one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communications processor, a network processor, a special-purpose processor, a special-purpose core, highly parallel special-purpose hardware, a digital signal processor (DSP), and a field-programmable gate array (FPGA).
11. The apparatus of claim 10, wherein the first hardware computational unit comprises the CPU, and wherein the second hardware computational unit comprises the graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
12. A method in a heterogeneous computing system, comprising:
determining to steal work from a first hardware computational unit of a first type for a second hardware computational unit of a second, different type that is more specialized than the first type, wherein the work is queued in a first work queue, the first work queue corresponding to the first hardware computational unit and stored in a shared memory shared by the first and second hardware computational units; and
stealing the work, including performing a synchronized memory access to the first work queue stored in the shared memory, the synchronized memory access being synchronized relative to memory accesses to the first work queue from the first hardware computational unit.
13. The method of claim 12, further comprising adding the work to a second work queue, the second work queue corresponding to the second hardware computational unit and also stored in the shared memory.
14. The method of claim 12, wherein performing the synchronized memory access comprises performing an atomic operation.
15. The method of claim 14, wherein performing the atomic operation comprises performing an atomic operation selected from a read-modify-write operation, a compare-and-exchange operation, a compare-and-swap operation, a test-and-set operation, a compare-and-set operation, and a load-linked/store-conditional operation.
16. The method of claim 14, wherein performing the atomic operation comprises performing an Open Computing Language (OpenCL) atomic operation.
17. The method of claim 12, further comprising determining to steal second work, and stealing the second work from the second hardware computational unit for the first hardware computational unit, wherein the second work is queued in a second work queue, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
18. The method of claim 13, wherein the determining comprises determining to steal the work in response to the second work queue being one of empty and filled below a threshold level, the second work queue corresponding to the second hardware computational unit and stored in the shared memory.
19. The method of claim 13, wherein the first hardware computational unit is one selected from a general-purpose processor, a central processing unit (CPU), and a system-on-chip (SoC) having a plurality of general-purpose cores, and wherein the second hardware computational unit is one selected from a graphics processor, a hardware accelerator device, a cryptographic processor, a communications processor, a network processor, a special-purpose processor, a special-purpose core, a special-purpose core in a system-on-chip, highly parallel special-purpose hardware, a digital signal processor (DSP), and a field-programmable gate array (FPGA).
20. The method of claim 19, wherein the first hardware computational unit comprises the CPU, and wherein the second hardware computational unit comprises the graphics processor selected from a graphics processing unit (GPU) and an integrated graphics core.
21. A heterogeneous computer system, comprising:
an interconnect;
a first hardware computational unit of a first type coupled with the interconnect;
a second hardware computational unit of a second, different type coupled with the interconnect, the second type being more specialized than the first type;
a dynamic random access memory (DRAM) coupled with the interconnect, the DRAM comprising a shared memory shared by the first and second hardware computational units, the shared memory comprising a first work queue to queue work for the first hardware computational unit and a second work queue to queue work for the second hardware computational unit; and
a work stealer apparatus to determine to steal work, to steal the work from the first work queue through a synchronized memory access to the first work queue, and to add the work to the second work queue, the synchronized memory access being synchronized relative to memory accesses to the first work queue from the first hardware computational unit.
22. The system of claim 21, wherein the work stealer apparatus is to steal the work through the synchronized memory access comprising an atomic operation on the first work queue.
23. The system of any one of claims 21-22, further comprising a second work stealer apparatus to determine to steal second work from the second work queue, to steal the second work through a second synchronized memory access to the second work queue, and to add the second work to the first work queue, the second synchronized memory access being synchronized relative to memory accesses to the second work queue from the second hardware computational unit.
24. A machine-readable storage medium storing instructions that, if executed by a machine, cause the machine to perform the method of any one of claims 12-20.
25. An apparatus comprising means for performing the method of any one of claims 12-20.
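The core mechanism the claims describe is stealing work through a synchronized memory access, such as an atomic compare-and-swap on a work queue held in memory shared by both compute units (claims 3-4 and 14-15). A minimal single-process sketch of that idea follows, using ordinary CPU threads and C++ `std::atomic` in place of the cross-device shared memory and OpenCL/CUDA atomics of the claims; the `WorkQueue`, `push`, and `try_steal` names are illustrative assumptions, not the patent's implementation.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <optional>

// Fixed-size single-owner work queue. Thieves claim items with an atomic
// compare-and-swap on `head`, so concurrent steals are synchronized with
// respect to each other and to the owner's enqueues.
struct WorkQueue {
    std::array<int, 64> items{};   // queued work items (plain ints here)
    std::atomic<int> head{0};      // next index a thief may claim
    std::atomic<int> tail{0};      // next free slot, advanced by the owner

    void push(int work) {          // owner enqueues at the tail
        int t = tail.load(std::memory_order_relaxed);
        items[t] = work;
        // Release so a thief that sees the new tail also sees the item.
        tail.store(t + 1, std::memory_order_release);
    }

    // A thief (modeling the second hardware computational unit) claims
    // one item, or returns empty if there is nothing to steal.
    std::optional<int> try_steal() {
        int h = head.load(std::memory_order_acquire);
        while (h < tail.load(std::memory_order_acquire)) {
            // The CAS advances head past the item only if no other
            // thief claimed it first; on failure, h is reloaded.
            if (head.compare_exchange_weak(h, h + 1,
                                           std::memory_order_acq_rel))
                return items[h];
        }
        return std::nullopt;       // queue empty
    }
};
```

Here the compare-and-swap plays the role of the claimed "synchronized memory access": a steal either atomically observes and claims an item or retries, so the owner and any number of thieves can never dequeue the same work twice.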
CN201380073056.3A 2013-03-15 2013-03-15 Work stealing in heterogeneous computing systems Active CN106164881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710028738.2A CN107092573B (en) 2013-03-15 2013-03-15 Method and apparatus for work stealing in heterogeneous computing systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2013/032707 WO2014143067A1 (en) 2013-03-15 2013-03-15 Work stealing in heterogeneous computing systems

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201710028738.2A Division CN107092573B (en) 2013-03-15 2013-03-15 Method and apparatus for work stealing in heterogeneous computing systems

Publications (2)

Publication Number Publication Date
CN106164881A true CN106164881A (en) 2016-11-23
CN106164881B CN106164881B (en) 2022-01-14

Family

ID=51537399

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710028738.2A Active CN107092573B (en) 2013-03-15 2013-03-15 Method and apparatus for work stealing in heterogeneous computing systems
CN201380073056.3A Active CN106164881B (en) 2013-03-15 2013-03-15 Work stealing in heterogeneous computing systems

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201710028738.2A Active CN107092573B (en) 2013-03-15 2013-03-15 Method and apparatus for work stealing in heterogeneous computing systems

Country Status (4)

Country Link
US (3) US20160154677A1 (en)
EP (2) EP3242210B1 (en)
CN (2) CN107092573B (en)
WO (1) WO2014143067A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777710A (en) * 2016-12-22 2017-05-31 中国兵器装备集团自动化研究所 A kind of method of the CUDA kernels realized on FPGA
CN108319510A (en) * 2017-12-28 2018-07-24 大唐软件技术股份有限公司 A kind of isomery processing method and processing device
CN112463709A (en) * 2019-09-09 2021-03-09 上海登临科技有限公司 Configurable heterogeneous artificial intelligence processor

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3242210B1 (en) 2013-03-15 2022-08-03 INTEL Corporation Work stealing in heterogeneous computing systems
US9811467B2 (en) * 2014-02-03 2017-11-07 Cavium, Inc. Method and an apparatus for pre-fetching and processing work for processor cores in a network processor
US9753785B2 (en) * 2014-07-24 2017-09-05 Home Box Office, Inc. Asynchronous dispatcher for application framework
US9804883B2 (en) * 2014-11-14 2017-10-31 Advanced Micro Devices, Inc. Remote scoped synchronization for work stealing and sharing
US9678806B2 (en) * 2015-06-26 2017-06-13 Advanced Micro Devices, Inc. Method and apparatus for distributing processing core workloads among processing cores
US10089155B2 (en) * 2015-09-22 2018-10-02 Advanced Micro Devices, Inc. Power aware work stealing
US20170083365A1 (en) * 2015-09-23 2017-03-23 Qualcomm Incorporated Adaptive Chunk Size Tuning for Data Parallel Processing on Multi-core Architecture
US9772878B2 (en) 2015-11-09 2017-09-26 Unity IPR ApS Determining a job group status based on a relationship between a generation counter value and a ticket value for scheduling the job group for execution
US9892544B2 (en) * 2015-12-22 2018-02-13 Intel Corporation Method and apparatus for load balancing in a ray tracing architecture
US10114681B2 (en) * 2016-03-30 2018-10-30 Qualcomm Incorporated Identifying enhanced synchronization operation outcomes to improve runtime operations
JP6645348B2 (en) * 2016-05-06 2020-02-14 富士通株式会社 Information processing apparatus, information processing program, and information processing method
US10437616B2 (en) * 2016-12-31 2019-10-08 Intel Corporation Method, apparatus, system for optimized work submission to an accelerator work queue
US10444817B2 (en) * 2017-04-17 2019-10-15 Intel Corporation System, apparatus and method for increasing performance in a processor during a voltage ramp
US11237872B2 (en) * 2017-05-23 2022-02-01 Kla-Tencor Corporation Semiconductor inspection and metrology systems for distributing job among the CPUs or GPUs based on logical image processing boundaries
US10686728B2 (en) * 2017-07-06 2020-06-16 Huawei Technologies Co., Ltd. Systems and methods for allocating computing resources in distributed computing
KR102482896B1 (en) 2017-12-28 2022-12-30 삼성전자주식회사 Memory device including heterogeneous volatile memory chips and electronic device including the same
US10430246B2 (en) 2018-01-18 2019-10-01 International Business Machines Corporation Virtualized and synchronous access to hardware accelerators
US10705849B2 (en) * 2018-02-05 2020-07-07 The Regents Of The University Of Michigan Mode-selectable processor for execution of a single thread in a first mode and plural borrowed threads in a second mode
CN108920260B (en) * 2018-05-16 2021-11-26 成都淞幸科技有限责任公司 Interaction method and device for heterogeneous system
US10957095B2 (en) * 2018-08-06 2021-03-23 Intel Corporation Programmable ray tracing with hardware acceleration on a graphics processor
US10963300B2 (en) * 2018-12-06 2021-03-30 Raytheon Company Accelerating dataflow signal processing applications across heterogeneous CPU/GPU systems
WO2020257976A1 (en) * 2019-06-24 2020-12-30 Intel Corporation Apparatus and method for scheduling graphics processing resources
US11449339B2 (en) * 2019-09-27 2022-09-20 Red Hat, Inc. Memory barrier elision for multi-threaded workloads
US11327793B2 (en) * 2020-02-18 2022-05-10 International Business Machines Corporation Garbage collection work stealing mechanism
US11340942B2 (en) * 2020-03-19 2022-05-24 Raytheon Company Cooperative work-stealing scheduler
CN113886057B (en) * 2020-07-01 2024-06-28 西南科技大学 Dynamic resource scheduling method based on analysis technology and data stream information on heterogeneous many-core
US11698816B2 (en) * 2020-08-31 2023-07-11 Hewlett Packard Enterprise Development Lp Lock-free work-stealing thread scheduler
CN115705194B (en) * 2021-08-13 2024-09-24 华为技术有限公司 Code processing method and corresponding device under hardware memory sequence architecture
US11875200B2 (en) * 2021-09-23 2024-01-16 International Business Machines Corporation Adjunct processor extended message limit facility

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
US20110055838A1 (en) * 2009-08-28 2011-03-03 Moyes William A Optimized thread scheduling via hardware performance monitoring
CN102053870A (en) * 2009-10-28 2011-05-11 国际商业机器公司 Systems and methods for affinity driven distributed scheduling of parallel computations
CN102360313A (en) * 2011-09-29 2012-02-22 中国科学技术大学苏州研究院 Performance acceleration method of heterogeneous multi-core computing platform on chip
US20120054771A1 (en) * 2010-08-31 2012-03-01 International Business Machines Corporation Rescheduling workload in a hybrid computing environment
WO2012082557A2 (en) * 2010-12-15 2012-06-21 Advanced Micro Devices, Inc. Dynamic work partitioning on heterogeneous processing devices

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6748593B1 (en) * 2000-02-17 2004-06-08 International Business Machines Corporation Apparatus and method for starvation load balancing using a global run queue in a multiple run queue system
US6826583B1 (en) * 2000-05-15 2004-11-30 Sun Microsystems, Inc. Local allocation buffers for parallel garbage collection
US7117502B1 (en) 2000-11-10 2006-10-03 Sun Microsystems, Inc. Linked-list implementation of a data structure with concurrent non-blocking insert and remove operations
US7167916B2 (en) * 2002-08-30 2007-01-23 Unisys Corporation Computer OS dispatcher operation with virtual switching queue and IP queues
US7516456B2 (en) * 2003-09-25 2009-04-07 International Business Machines Corporation Asymmetric heterogeneous multi-threaded operating system
US20050210472A1 (en) * 2004-03-18 2005-09-22 International Business Machines Corporation Method and data processing system for per-chip thread queuing in a multi-processor system
ITMI20040600A1 (en) * 2004-03-26 2004-06-26 Atmel Corp DSP SYSTEM ON DOUBLE PROCESSOR WITH MOBILE COMB IN THE COMPLEX DOMAIN
GB2427045B (en) * 2005-06-06 2007-11-21 Transitive Ltd Method and apparatus for converting program code with access coordination for a shared resource
US8539493B1 (en) * 2006-03-23 2013-09-17 Emc Corporation Configurable prioritization and aging of queued tasks
US8209493B2 (en) * 2008-03-26 2012-06-26 Intel Corporation Systems and methods for scheduling memory requests during memory throttling
US8607237B2 (en) * 2008-06-02 2013-12-10 Microsoft Corporation Collection with local lists for a multi-processor system
US8266394B2 (en) * 2008-07-14 2012-09-11 International Business Machines Corporation Methods for single-owner multi-consumer work queues for repeatable tasks
US8813091B2 (en) * 2008-08-04 2014-08-19 Oracle America, Inc. Distribution data structures for locality-guided work stealing
US8321558B1 (en) * 2009-03-31 2012-11-27 Amazon Technologies, Inc. Dynamically monitoring and modifying distributed execution of programs
US8572622B2 (en) 2009-12-30 2013-10-29 International Business Machines Corporation Reducing queue synchronization of multiple work items in a system with high memory latency between processing nodes
US8464026B2 (en) 2010-02-17 2013-06-11 International Business Machines Corporation Method and apparatus for computing massive spatio-temporal correlations using a hybrid CPU-GPU approach
US8806497B2 (en) * 2010-03-05 2014-08-12 Microsoft Corporation System and method for altering the work definitions in an iterative parallel opportunistic work stealing scheduler
WO2011148553A1 (en) * 2010-05-24 2011-12-01 株式会社ソニー・コンピュータエンタテインメント Information processing device and information processing method
US9152468B2 (en) 2010-10-25 2015-10-06 Samsung Electronics Co., Ltd. NUMA aware system task management
JP5576305B2 (en) * 2011-01-20 2014-08-20 インターナショナル・ビジネス・マシーンズ・コーポレーション Computer operation control method, program, and system
KR101895453B1 (en) * 2011-11-09 2018-10-25 삼성전자주식회사 Apparatus and method for guarantee security in heterogeneous computing environment
US8842122B2 (en) * 2011-12-15 2014-09-23 Qualcomm Incorporated Graphics processing unit with command processor
US9116739B2 (en) * 2013-03-14 2015-08-25 Intel Corporation Fast and scalable concurrent queuing system
EP3242210B1 (en) 2013-03-15 2022-08-03 INTEL Corporation Work stealing in heterogeneous computing systems


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VINICIUS GARCIA PINTO et al.: "Scheduling by Work-Stealing in Hybrid Parallel Architectures", HTTPS://WWW.INF.UFRGS.BR/GPPD/WSPPD/2012/PAPERS/WSPPD2012_SUBMISSION_23.PDF.MOD.PDF *


Also Published As

Publication number Publication date
US20220027210A1 (en) 2022-01-27
US20170109213A1 (en) 2017-04-20
CN107092573B (en) 2023-04-18
WO2014143067A1 (en) 2014-09-18
EP2972907B1 (en) 2019-09-25
CN106164881B (en) 2022-01-14
EP3242210A1 (en) 2017-11-08
CN107092573A (en) 2017-08-25
EP3242210B1 (en) 2022-08-03
US11138048B2 (en) 2021-10-05
EP2972907A4 (en) 2016-11-30
EP2972907A1 (en) 2016-01-20
US20160154677A1 (en) 2016-06-02

Similar Documents

Publication Publication Date Title
CN106164881A (en) Work in heterogeneous computing system is stolen
CN103765376B (en) Graphic process unit with clog-free parallel architecture
Chen et al. Dynamic load balancing on single-and multi-GPU systems
Hermann et al. Multi-GPU and multi-CPU parallelization for interactive physics simulations
CN103365631B (en) For the dynamic base pattern addressing of memory access
US11934826B2 (en) Vector reductions using shared scratchpad memory
CN103294536B (en) Control to be distributed for the work of process task
CN103559014A (en) Method and system for processing nested stream events
CN109997115A (en) Low-power and low latency GPU coprocessor for persistently calculating
CN104572568A (en) Read lock operation method, write lock operation method and system
CN101978350A (en) Vector instructions to enable efficient synchronization and parallel reduction operations
US20100031267A1 (en) Distribution Data Structures for Locality-Guided Work Stealing
US11340942B2 (en) Cooperative work-stealing scheduler
CN103154892A (en) Method, system and apparatus for multi-level processing
CN103885903A (en) Technique For Performing Memory Access Operations Via Texture Hardware
US8413151B1 (en) Selective thread spawning within a multi-threaded processing system
Chavarria-Miranda et al. Global Futures: A multithreaded execution model for Global Arrays-based applications
WO2021154732A1 (en) Shared scratchpad memory with parallel load-store
Vinkler et al. Massively parallel hierarchical scene processing with applications in rendering
CN112346879B (en) Process management method, device, computer equipment and storage medium
CN116775265A (en) Collaborative group array
CN116774914A (en) Distributed shared memory
Gaster HSA memory model.
Silva et al. Terrame hpa: parallel simulation of multi-agent systems over smps
Masko et al. Task scheduling for SoC-based dynamic SMP clusters with communication on the fly

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant