WO2008046716A1 - A multi-processor computing system and its task allocating method - Google Patents

A multi-processor computing system and its task allocating method

Info

Publication number
WO2008046716A1
Authority
WO
WIPO (PCT)
Prior art keywords
microtask
task
instruction
synchronization
computing system
Prior art date
Application number
PCT/EP2007/060028
Other languages
French (fr)
Inventor
Hai Ju
Guo Hui Lin
Qiang Liu
Lu Wan
Yu Dong Yang
Roderick Michael Peters West
Original Assignee
International Business Machines Corporation
Ibm United Kingdom Limited
Priority date
Filing date
Publication date
Application filed by International Business Machines Corporation, Ibm United Kingdom Limited filed Critical International Business Machines Corporation
Publication of WO2008046716A1 publication Critical patent/WO2008046716A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition

Definitions

  • The invention relates generally to a multi-processor computer and its task allocating method, and particularly to a computing system with an array of homogeneous and/or heterogeneous computing elements that are configured dynamically at runtime to fulfill special purpose computing tasks, and its task allocating method.
  • Multi-processor computing systems are generally designed to satisfy two classes of computing requirements: (1) general purpose applications, and (2) special purpose applications.
  • For general purpose applications, SMP with homogeneous processors is the most common architecture.
  • For special purpose applications such as, for example, multimedia and digital signal processing, it is quite common that heterogeneous processors or hardware offload elements are combined to address different aspects of the applications.
  • the architecture is usually built upon an array of interconnected computing elements. These computing elements can be programmable processors or fixed function units. Despite the differences between the computing elements, the most distinct point that discriminates these architectures from each other is the way in which these computing elements are connected and controlled to handle the computing tasks.
  • The heartbeat can be as fast as one clock cycle or up to a longer period, according to the architectural design and stream pipeline delays.
  • One advantage of heartbeat synchronization is that it is easier to implement and program, as the computing elements are decoupled from the synchronization tasks.
  • The drawback is that it lacks the flexibility to run asynchronous tasks, thus limiting the application fields.
  • Another drawback is that its efficiency is adversely affected if pipeline stages cannot be well balanced.
  • The computing elements pause at some specified point in their execution flow until all of the elements in a synchronization group come to that point, and then the execution flow resumes from that point.
  • With specific applications, the synchronization logic design and computing element partition could be optimized to give very good performance.
  • The drawback of this implementation is its lack of flexibility when there are changes in the configuration and applications.
  • A first aspect of the invention provides a multi-processor computing system comprising a host processor, a global memory and at least one processing unit, wherein the multi-processor computing system further comprises a microtask sequencer that comprises a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the indicated processing unit can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
  • Preferred embodiments enable the host processor to be released from dispatching tasks to computing elements.
  • the host processor and computing elements are released from synchronisation control.
  • The task acquisition device may comprise: a task dispatcher; at least one task queue; and at least one execution control unit (ECU), each of which is associated with a different one of the at least one task queue, and with a different one of the at least one processing unit, wherein the task dispatcher may be configured to accept the command fetched by the task acquisition device and put said each microtask instruction in the one of the at least one task queue associated with said indicated processing unit, and may be further configured to perform said detection and notification; and each of the at least one execution control units may be configured to obtain one microtask instruction from the associated task queue in a FIFO manner and deliver the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable of processing a further microtask instruction, and may be further configured to collect the completion of the obtained microtask instruction for the purpose of said detection.
  • ECU execution control units
  • The command may further include synchronization instructions for achieving synchronization among two or more of the at least one processing units.
  • The task dispatcher may be further configured to put the synchronization instructions into the task queues associated with the synchronization instructions.
  • The task scheduling device may further comprise: a synchronization control means operated by the execution control units, based on the synchronization instructions obtained from the respective task queues, for achieving the synchronization.
  • the synchronization may be based on semaphore technique, and the synchronization control means may be implemented based on a shared semaphore array.
  • The command may further include a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units; the task dispatcher may be further configured to put the flow control instruction into the task queue associated with the flow control instruction, and the task scheduling device may further comprise a memory for storing the values, wherein the execution control unit may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
  • The command may further include a flow control instruction for the task dispatcher to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units, and the task scheduling device may further comprise a memory for storing the values, wherein the task dispatcher may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
  • the multi-processor computing system may be implemented in a system on chip accelerator or a microprocessor.
  • the access by the processing unit to the global memory may be implemented via DMA.
  • the multi-processor computing system may further comprise a configuration device for initializing the microtask sequencer.
  • A preferred embodiment of the invention provides an improved multiprocessor computer architecture. It includes a dedicated programmable task sequencer, the "Microtask Sequencer".
  • The microtask sequencer is programmed by the host processor with information describing the tasks to be issued to the computing elements. The microtask sequencer then dispatches these tasks and monitors their execution status automatically, without intervention by the host processor.
  • The microtask sequencer also has software programmable synchronization logic that is based on a signal posting and waiting mechanism. This synchronization logic is used when synchronization between tasks is required. Thus, the synchronization logic is decoupled from the computing elements.
  • The invention provides a computing task handling mechanism that has performance comparable to that of hardwired logic and flexibility comparable to fully software-implemented controls. By decoupling the synchronization logic from the computing elements, the invention makes it simple to integrate heterogeneous computing elements into the same architecture. It also allows the software programmers of the computing elements to focus on only the functional part of the algorithm. Another improvement is potentially better power efficiency than software approaches, since no instruction cycles are spent waiting for synchronization signals.
  • Another aspect of the invention provides a task allocating method in a multi-processor computing system comprising a host processor, a global memory and at least one processing unit, the method comprising: providing, by the host processor, commands including microtask descriptions in the global memory; fetching, by the microtask sequencer, a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands; dispatching, by the microtask sequencer, each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit to which it is dispatched can execute the microtask instruction; detecting, by the microtask sequencer, the completion of all the microtasks in the command; and notifying, by the microtask sequencer, the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
  • Figure 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention.
  • Figure 2 shows an example of the configuration of the microtask sequencer including the mechanism of synchronization.
  • FIG. 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention.
  • the system 100 comprises a host processor 101, a global memory 102, and a multi-processor subsystem 110, which are connected to a host system bus 103.
  • the host processor 101 runs the OS and high level application logic.
  • the global memory 102 is used to hold code and data of the application.
  • the multi-processor subsystem 110 comprises a microtask sequencer 104, optional DMA elements 105-1, ..., 105 -m, computing elements 106-1, ..., 106-n, and optional shared local memory blocks 107-1, ..., 107-k.
  • the computing elements 106-1, ..., 106-n do the actual computing of microtasks.
  • These computing elements 106-1, ..., 106-n can range from generic processors to dedicated hardwired logic units. In comparison with most other multi-processor systems, a combination of different computing element types is also allowed in this architecture.
  • Private local memories that hold code and data are normally included in the computing elements 106-1, ..., 106-n.
  • the DMA elements 105-1, ..., 105-m are then used to move data between the global memory 102, the local memories of the computing elements 106-1, ..., 106-n, and the shared local memory blocks 107-1, ..., 107-k.
  • a local bus/crossbar switch 109 is used as the internal data path between the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n.
  • the microtask sequencer 104 is used to control these function units such as the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n. These function units are connected to the microtask sequencer 104 through the task control bus 108.
  • The microtask sequencer 104 usually gets commands containing descriptions of microtasks from the global memory 102, where these descriptions are generated by the application software running on the host processor 101 according to the programmer's design or the arrangement by the compiler or the interpreter.
  • For a computing task including three separate computations, the task can be divided into three microtasks as follows:
  • microtask 1: Compute #1; microtask 2: Compute #2; microtask 3: Compute #3.
  • The microtask sequencer 104 executes these commands and dispatches task descriptions (i.e., microtasks) to the corresponding function units, based on the destination information specified in the task descriptions.
  • The microtasks for the DMA elements 105-1, ..., 105-m may be as small as moving several tens of bytes of data, and the computing elements 106-1, ..., 106-n then process these data in tens or hundreds of clock cycles. In a system clocked at more than 200MHz, these microtasks may last less than one microsecond or even less than a tenth of a microsecond.
  • If the host processor 101 were used to control these microtasks without the help of the microtask sequencer 104, its processing capability could be fully occupied by this control. In this regard, the multiprocessor computing system with the microtask sequencer can at least release the host processor from the scheduling and management of the function units.
  • the function unit may notify the microtask sequencer through the task control bus or any other known signaling method such as interrupt, so that the microtask sequencer may perform further dispatching or control.
  • The microtask sequencer and the function units may be implemented by any physical or logical processing units, such as one or more central processing units (CPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc.
  • CPUs central processing units
  • DSPs digital signal processors
  • ASICs application-specific integrated circuits
  • the host system bus, task control bus and local bus/crossbar switch can employ any known bus or switch structures or architectures.
  • Suppose a task is to be executed in which a DMA element such as the DMA elements 105-1, ..., 105-m will load input data for a computing element such as the computing elements 106-1, ..., 106-n before the corresponding microtask on the computing element starts to run. After the computing element finishes the job, the DMA element will move the result data to the next destination.
  • The computing task can be divided into three microtasks as follows:
  • microtask 1': Data In; microtask 2': Compute; microtask 3': Data Out.
  • A microtask running on a computing element and a corresponding microtask running on a DMA element may be dependent on each other.
  • The computation of microtask 2' requires that the data in of microtask 1' has already been carried out, and
  • the data out of microtask 3' requires that the computation of microtask 2' has already been carried out.
  • These function units must operate in a correct order to get a correct result. For microtasks that do not depend on each other, for example those running on two independent computing elements 106-1, ..., 106-n or two DMA elements 105-1, ..., 105-m, the microtasks can be issued simultaneously to achieve better utilization of these hardware resources.
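The ordering constraint above can be sketched in miniature. The following Python model is an illustration only, not the patented hardware: the "post"/"wait"/"run" opcode names and the round-robin scheduler are assumptions. It interleaves a DMA command queue and a compute command queue, using semaphore post/wait entries to force Data In, then Compute, then Data Out, while independent entries remain free to proceed:

```python
def run(queues):
    """Round-robin scheduler sketch: each queue advances by at most one
    entry per pass, and stalls while its head is a blocked wait."""
    sem = set()                      # posted (and not yet consumed) semaphores
    order = []                       # global completion order of microtasks
    pcs = {name: 0 for name in queues}
    while any(pcs[name] < len(ops) for name, ops in queues.items()):
        progressed = False
        for name, ops in queues.items():
            if pcs[name] >= len(ops):
                continue
            op, arg = ops[pcs[name]]
            if op == "wait" and arg not in sem:
                continue             # blocked: let the other queues run
            if op == "wait":
                sem.discard(arg)     # a successful wait also clears the bit
            elif op == "post":
                sem.add(arg)
            else:                    # "run": the microtask itself
                order.append(arg)
            pcs[name] += 1
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: every queue is blocked")
    return order

# DMA queue: load data, post s0, wait for s1, then move results out.
dma = [("run", "data_in"), ("post", "s0"), ("wait", "s1"), ("run", "data_out")]
# Compute queue: wait for s0, compute, then post s1.
cpu = [("wait", "s0"), ("run", "compute"), ("post", "s1")]
print(run({"dma": dma, "cpu": cpu}))   # ['data_in', 'compute', 'data_out']
```

Note that neither function unit busy-waits in this model: a blocked wait simply leaves the entry at the head of its queue, which mirrors how the sequencer keeps synchronization out of the computing elements themselves.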
  • The invention provides a mechanism of synchronization to the microtask sequencer, so that the microtask sequencer is capable of coordinating the parallel execution of microtasks, thus releasing the function units from the burden of synchronization.
  • FIG. 2 shows an example of the configuration of the microtask sequencer 200 including the mechanism of synchronization.
  • the microtask sequencer 200 connects to the host system via the host system bus, and comprises a task acquisition device 210 for fetching the commands from the location as specified by the host processor via a memory bus of the host system bus, and a task scheduling device 220 for dispatching the microtasks described in the commands to relevant function units and managing the synchronization among the microtasks.
  • The microtask sequencer 200 may further comprise a configuration device 230 by which the host processor may initialize or control the microtask sequencer 200 and the associated function units via a config bus of the host system bus.
  • The task acquisition device 210 comprises a memory bus interface 201, a FIFO loader 203-1 and an income FIFO 203-2.
  • the task scheduling device 220 comprises a task dispatcher 205, multiple task queues 206a, 206b, 206c, corresponding execution control units (ECU) 207a, 207b, 207c and a shared semaphore array 208.
  • the task scheduling device 220 may also comprise a flag array 209 that holds some flags corresponding to some or all the function units.
  • The configuration device 230 comprises a config bus interface 202 and a system configuration unit 204.
  • the task control bus 210a, 210b, 210c connects the execution control units 207a, 207b, 207c to corresponding function units such as the DMA elements 105-1, ..., 105-m and computing elements 106-1, ..., 106-n, respectively.
  • Each function unit corresponds to a task queue 206 and an execution control unit 207 in the microtask sequencer 104.
  • the FIFO loader 203-1 is a DMA controller that is used to load programmed commands from global memory 102 to the income FIFO 203-2.
  • The FIFO loader 203-1 accepts requests from the task dispatcher 205 to load the commands as long as the task dispatcher 205 is capable of dealing with a further command.
  • the task acquisition device 210 may be initialized, preferably through the configuration device 230, by the host processor to start the loading from one specified location inside the global memory 102 address space and it will load contents up to a given count.
  • The FIFO loader 203-1 will report the status to the system configuration unit 204 of the configuration device 230, and a host interrupt is asserted to inform the host processor 101 of the status.
  • The above interaction between the task acquisition device 210 and the host processor 101 may be achieved by other signaling mechanisms known in the art.
  • the task dispatcher 205 dispatches these commands from the income FIFO 203-2 to the task queues 206 of each execution control unit 207.
  • the task queue 206a, 206b, 206c is a FIFO storage that holds the commands to be issued to the corresponding function unit.
  • the depth of the FIFO is a tradeoff of efficiency and area. In the embodiment, we assume a size of 32 to 64 entries.
  • The task dispatcher 205 monitors the status of the associated task queues 206a, 206b, 206c and fills them once the immediately available commands from the FIFO loader 203-1 match queues that have empty slots.
  • the command consists of a segment that indicates which task queue 206 it shall be dispatched to.
  • the execution control unit 207 then checks the status of its corresponding function unit and dispatches commands to it accordingly.
  • Each function unit attached to the microtask sequencer 200 is controlled by a corresponding execution control unit 207.
  • The execution control unit 207 is a simple sequentially programmable controller that fetches commands from the corresponding task queue 206 and either forwards the command to the function unit or handles it by itself according to the type of command, for example if an inter-function unit synchronization operation needs to be performed or, preferably, if some flow control within the task needs to be performed. In the latter case, the execution control unit 207 has access to the semaphore array 208 and preferably the flag array 209.
  • The execution control unit 207 can post semaphores or poll for some specific semaphores. With the flag array 209, the execution control unit 207 can also support conditional commands to implement an "IF THEN ELSE" style of execution path. This can be utilized to decouple the control of the host processor 101 even more.
  • The semaphore array 208 is an array of one-bit flags that can be set and reset by the execution control units 207a, 207b, 207c according to semaphore commands in the task queues 206a, 206b, 206c.
  • The semaphore array 208 is a shared resource in the whole microtask sequencer 200. Every execution control unit 207 can access every bit of the semaphore array 208 under command control.
  • An execution control unit 207 can either set multiple semaphore bits or can wait for multiple semaphore bits.
  • The semaphore bits are set by the semaphore post commands, and the semaphore wait commands will be blocked until the semaphore bits they wait for are all set.
  • The semaphore wait commands will also reset the corresponding semaphore bits before continuing.
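These post/wait semantics can be summarized in a short sketch. The class and method names below are illustrative assumptions (the real array is shared hardware polled by the ECUs): post sets the masked bits, and a wait succeeds only once every masked bit is set, clearing those bits before the ECU continues.

```python
class SemaphoreArray:
    """Toy model of the shared one-bit semaphore array."""

    def __init__(self, width=32):
        self.bits = 0
        self.mask_all = (1 << width) - 1

    def post(self, mask):
        """Set every bit named in the mask."""
        self.bits |= mask & self.mask_all

    def try_wait(self, mask):
        """Non-blocking probe an ECU could retry each cycle: returns True
        and clears the masked bits once they are all set."""
        if self.bits & mask == mask:
            self.bits &= ~mask       # a successful wait resets the bits
            return True
        return False                 # not all set yet: the ECU stays blocked

sem = SemaphoreArray()
sem.post(0b1101)                     # set bits 0, 2 and 3 (cf. the example command)
print(sem.try_wait(0b1101))          # True: all three masked bits were set
print(sem.try_wait(0b0001))          # False: bit 0 was cleared by the wait
```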
  • Each function unit can return a single-bit or multi-bit result when execution of the microtask finishes.
  • This value shall be set in the flag array 209.
  • The data in the flag array 209 can be used to implement conditional execution of commands and thus allow an "IF-THEN-ELSE"-like execution flow of microtasks. More specifically, the flag array 209 is an array of single- or multi-bit results. Each entry corresponds to one function unit. Thus the number of entries in the flag array 209 may be equal to or less than the number of function units, the number of task queues 206a, 206b, 206c, or the number of execution control units 207a, 207b, 207c, as required.
  • the result is set by the function unit before associated microtask execution ends.
  • The definition of the result value is not limited to those described in this document. It shall be determined by the function unit designer. Some possible examples are status reports, simple return values, etc.
  • All of the execution control units 207a, 207b, 207c can read any entry of the flag array. For the execution control unit 207, there is a command to fetch and store the desired flag array 209 contents for future reference. This allows new result updates while keeping the local serialized execution flow consistent. With the flag array 209 and the conditional execution capability, "IF THEN ELSE" and "SWITCH CASE" execution control can be implemented and, in applicable scenarios, the host processor 101 can be decoupled even more.
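How a fetched flag can drive IF-THEN-ELSE flow can be illustrated with a small model. The opcode names ("fetch_flag", "skip_if_ne", "skip") are hypothetical stand-ins for the fetch and conditional-skip commands described in the text: a conditional skip drops the following queue entries when the latched flag does not match, so only one branch's microtasks ever reach the function unit.

```python
def execute(queue, flag_array):
    """Sketch of one ECU draining its task queue with conditional skips."""
    local_flag = 0                   # condition section of the local flag register
    issued = []                      # commands forwarded to the function unit
    pc = 0
    while pc < len(queue):
        op, *args = queue[pc]
        if op == "fetch_flag":       # latch a function unit's result for later use
            local_flag = flag_array[args[0]]
        elif op == "skip_if_ne":     # skip n entries unless the latched flag matches
            want, n = args
            if local_flag != want:
                pc += n
        elif op == "skip":           # unconditional skip
            pc += args[0]
        else:                        # ordinary microtask command
            issued.append(op)
        pc += 1
    return issued

program = [
    ("fetch_flag", 1),               # read the flag reported by function unit 1
    ("skip_if_ne", 1, 2),            # IF flag != 1: jump past the THEN branch
    ("then_task",),
    ("skip", 1),                     # end of THEN branch: jump over the ELSE branch
    ("else_task",),
]
print(execute(program, [0, 1]))      # ['then_task']
print(execute(program, [0, 0]))      # ['else_task']
```

A "SWITCH CASE" flow is the same idea repeated: a chain of conditional skips, one per case.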
  • The configuration device 230 is a preferred interface to the host system for configuration and control. More specifically, the system configuration unit 204 therein has access, through the configuration bus interface 202, to the host system, and can be used by the host system to initialize the microtask sequencer and control the operations of different internal/external function units. In general, the configuration device 230 may have a slave mode I/O interface to the system bus and one interrupt request line for status reporting.
  • The microtask sequencer 104 is able to issue jobs to the function units (the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n) automatically with minimal intervention by the host processor 101.
  • All commands of the microtask sequencer are 32 bits in length.
  • the command is segmented into four fixed fields as shown in table 1.
  • the QID field is the ID number of a task queue which the command shall be dispatched to. This field is interpreted by the task dispatcher.
  • The task dispatcher also inspects the NUMARG field, which indicates how many 32-bit arguments follow. The task dispatcher will treat the following NUMARG 32-bit words as the arguments of the preceding command, and it will dispatch these data together to the task queue without further interpretation.
  • The ECU field in the command word is used to distinguish whether the Execution Control Unit shall interpret the command itself or just send the command out to the corresponding processing element or function unit for processing.
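The dispatch path for such a command word can be sketched as follows. Only the field names QID, NUMARG, ECU and DATA come from the text; the bit positions and widths below are assumptions for illustration, since Table 1 is not reproduced here.

```python
def decode(words):
    """Decode a 32-bit command stream into (qid, ecu, data, args) tuples."""
    out, i = [], 0
    while i < len(words):
        w = words[i]
        qid    = (w >> 28) & 0xF     # assumed position: target task queue ID
        numarg = (w >> 24) & 0xF     # assumed position: count of argument words
        ecu    = (w >> 23) & 0x1     # assumed position: 1 = interpreted by the ECU
        data   = w & 0xFFFF          # assumed position: payload / bit mask
        args   = words[i + 1 : i + 1 + numarg]
        out.append((qid, ecu, data, args))
        i += 1 + numarg              # argument words pass through undecoded
    return out

# One command for queue 2 carrying two argument words, then one ECU command.
stream = [(2 << 28) | (2 << 24) | 0x0034, 0xDEAD, 0xBEEF,
          (1 << 28) | (1 << 23) | 0x000D]
print(decode(stream))                # [(2, 0, 52, [57005, 48879]), (1, 1, 13, [])]
```

As in the text, the dispatcher interprets only QID and NUMARG; whether DATA is a semaphore mask, a flag index or a function unit opcode is decided later, by the ECU field.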
  • the command is for the execution control unit
  • This command sets the corresponding Semaphore Array bits to one. The bits affected are given by the bit mask in the DATA section. If any corresponding bit is already set to '1', this command will wait until all the corresponding bits are cleared to '0' by another module.
  • This command sets bits 0, 2 and 3 of the Semaphore Array.
  • This command waits for the corresponding Semaphore Array bits to be set.
  • The bits monitored are given by the bit mask in the DATA section.
  • The command shall hold the execution flow of the ECU until all the masked bits are set.
  • The corresponding Semaphore Array bits are then cleared to zero.
  • There is one global timeout counter that defines a maximum waiting time in system clock cycles. If the semaphore bits are not triggered within the defined timeout value, a status report event shall be generated, and that signal can be used to trigger a host interrupt to let the host processor intervene in the process.
  • This command waits until bits 0, 2 and 3 of the Semaphore Array are all set.
  • This command clears the corresponding Semaphore Array bits to zero. The bits are given by the bit mask in the DATA section. The command can be used to initialize the Semaphore Array to the correct status.
  • This command waits for the corresponding Semaphore Array bits to be cleared.
  • The bits monitored are given by the bit mask in the DATA section.
  • The command shall hold the execution flow of the ECU until all the masked bits are cleared.
  • The command does nothing to the Semaphore Array.
  • There is one global timeout counter that defines a maximum waiting time in system clock cycles. If the semaphore bits are not triggered within the defined timeout value, a status report event shall be generated, and that signal can be used to trigger a host interrupt to let the host processor intervene in the process.
  • This command waits until bits 0, 2 and 3 of the Semaphore Array are all cleared to zero.
  • DATA: Bits 12:15 contain an index into the flag array, which must fall inside the valid range. The index is zero-based. The other bits are unused and should be set to zero. Description: This command retrieves the 4-bit value from the specified Flag Array element and stores it in the condition section of the local flag register for conditional execution usage.
  • The Flag Array holds the flags returned by function units when the function units finish a microtask.
  • This command retrieves the element of index 3 from the Flag Array.
  • This command skips the next 15 command/data entries in the task queue.
  • Condition evaluation: assuming the mask is ABCD, the match value is EFGH and the actual flags are efgh, we can define the condition evaluation as:
  • V = !(A & (E ⊕ e)) & !(B & (F ⊕ f)) & !(C & (G ⊕ g)) & !(D & (H ⊕ h))
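The condition evaluation can be transcribed directly into code, reading the garbled operator in the original as XOR (⊕): a masked bit contributes a failure only when the match bit and the actual flag bit differ, so unmasked bits are ignored.

```python
def condition(mask, match, flags):
    """True iff every masked flag bit equals the corresponding match bit.

    Bitwise form of V = !(A & (E^e)) & !(B & (F^f)) & !(C & (G^g)) & !(D & (H^h)),
    written over 4-bit values instead of individual bits.
    """
    return (mask & (match ^ flags)) == 0

print(condition(0b1010, 0b1000, 0b1001))  # True: only bits 1 and 3 are checked
print(condition(0b1010, 0b1000, 0b0001))  # False: bit 3 differs from the match value
```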
  • the specific device or modules indicated in the microtask sequencer 200 herein may be implemented in hardware and/or software.
  • a specific device or module may be performed using software and/or firmware executed on one or more processing modules.
  • the processing module can be a single processing device or a plurality of processing devices.
  • Such a processing device may be a microprocessor, microcontroller, digital processor, microcomputer, a portion of the central processing unit, a state machine, logic circuitry, hardwired logic and/or any device that can perform a desired processing or control.
  • Although the multiprocessor subsystem in the described embodiments comprises the microtask sequencer, the DMA elements, the computing elements and the shared local memory blocks, the DMA elements and the shared local memory blocks are not necessary if the computing elements are configured to directly access the global memory.
  • Although microtasks are dispatched to the respective function units based on the destination information specified in the task descriptions in the embodiments, it is also possible to specify only the types of the function units (e.g., DMA units, calculating units, etc.) for performing the microtasks in the task descriptions, and to have the task dispatcher assign function units for performing the microtasks according to the availability of the respective function units.
  • Information on the type, quantity and status of the function units may be defined in advance, set by the host through the system configuration unit, or obtained or maintained by polling or other status detecting mechanisms. If it is necessary to designate that several microtasks should be performed by the same function unit, it is possible to specify the logical identification of this function unit in the task descriptions and to have the task dispatcher bind this logical identification to a physical function unit.
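The type-based dispatch with logical-to-physical binding just described can be sketched as follows. Every name here (the unit names, the type strings, the logical IDs) is a hypothetical illustration, not from the patent: task descriptions name only a unit type plus an optional logical identification, and the dispatcher binds each logical ID to a free physical unit on first use so that later microtasks with the same logical ID land on the same unit.

```python
def dispatch(tasks, units):
    """tasks: list of (unit_type, logical_id_or_None); units: {name: type}.
    Returns {task_index: physical_unit_name}."""
    bound = {}                           # logical ID -> bound physical unit
    assignment = {}
    for i, (utype, logical) in enumerate(tasks):
        if logical in bound:             # same logical ID -> same physical unit
            assignment[i] = bound[logical]
            continue
        # pick any unit of the requested type not already bound to a logical ID
        taken = set(bound.values())
        free = [u for u, t in units.items() if t == utype and u not in taken]
        if not free:
            raise RuntimeError(f"no free unit of type {utype}")
        assignment[i] = free[0]
        if logical is not None:
            bound[logical] = free[0]
    return assignment

units = {"dma0": "DMA", "dma1": "DMA", "ce0": "COMPUTE"}
tasks = [("DMA", "L1"), ("COMPUTE", None), ("DMA", "L1")]  # tasks 0 and 2 share L1
a = dispatch(tasks, units)
print(a[0] == a[2])                      # True: the shared logical ID binds to one unit
```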

Abstract

A multi-processor computing system is disclosed. The multi-processor computing system comprises a host processor, a global memory and at least one processing unit, wherein the multi-processor computing system further comprises a microtask sequencer comprising a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the indicated processing unit can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.

Description

A MULTI-PROCESSOR COMPUTING SYSTEM AND ITS TASK ALLOCATING METHOD
Technical field
The invention relates generally to a multi-processor computer and its task allocating method, and particularly to a computing system with an array of homogeneous and/or heterogeneous computing elements that are configured dynamically at runtime to fulfill special purpose computing tasks, and its task allocating method.
Background art
Multi-processor computing systems are generally designed to satisfy two classes of computing requirements: (1) general purpose applications, and (2) special purpose applications. For general purpose applications, SMP with homogeneous processors is the most common architecture. For special purpose applications, such as multimedia and digital signal processing, it is quite common that heterogeneous processors or hardware offload elements are combined to address different aspects of the applications.
With special purpose multi-processor computing systems, the architecture is usually built upon an array of interconnected computing elements. These computing elements can be programmable processors or fixed function units. Despite the differences between the computing elements, the most distinctive point that differentiates these architectures from each other is the way in which the computing elements are connected and controlled to handle the computing tasks.
Many of the prior art architectures can only program the processor array in a fully synchronous "stream" mode, where there is some kind of globally shared "heartbeat" that each computing element synchronizes to, and the computing elements exchange data at the beginning of each heartbeat according to the preconfigured interconnection topology. The heartbeat can be as short as one clock cycle or longer, according to the architectural design and stream pipeline delays. One advantage of heartbeat synchronization is that it is easier to implement and program, as the computing elements are decoupled from the synchronization tasks. The drawback is that it lacks the flexibility to run asynchronous tasks, thus limiting the application fields. Another drawback is that its efficiency is adversely affected if the pipeline stages cannot be well balanced.
On the other side, there is another class of architecture that is programmed in an asynchronous plus synchronous mode, where the whole system does not rely on a synchronous heartbeat. Computing elements can be partitioned into groups, where a synchronization can be an intra-group synchronization or an inter-group synchronization. Thus multiple threads of tasks can be handled on different computing elements or in different element groups until a data dependency or an access to a shared resource takes place, and only then is synchronization required. Many of the implementations in this class take the approach of hardwired barrier synchronization, where one or several shared global synchronization logic circuits are responsible for the synchronization of all computing elements. The computing elements pause at some specified point in their execution flow until all of the elements in a synchronization group reach that point, and then the execution flow resumes from that point. For specific applications, the synchronization logic design and computing element partition can be optimized to give very good performance. The drawback of this implementation is its lack of flexibility when the configuration and applications change.
Besides hardwired implementations, another approach is based on software, where the synchronization is done with software running on a host processor or on the computing elements themselves. Normally the host processor or each of the computing elements checks the status of the others by polling and/or interrupts, and then asserts the synchronization signal by way of message communication or the like. Compared to the hardwired approach, the software solution is the most flexible one. However, the side effects are: (1) interrupt handling or polling requires a large amount of processor cycles, so the granularity of synchronization can only be very coarse; (2) the synchronization logic is entangled with low-level functional/computing code, so the complexity of software maintenance increases greatly. The present invention aims to address these problems.
Summary of Invention
A first aspect of the invention provides a multi-processor computing system comprising a host processor, a global memory and at least one processing unit, wherein the multiprocessor computing system further comprises a microtask sequencer that comprises a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the indicated processing unit can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
Preferred embodiments enable the host processor to be released from dispatching tasks to the computing elements. Preferably, the host processor and the computing elements are also released from synchronization control.
The task acquisition device may comprise: a task dispatcher; at least one task queue; and at least one execution control unit (ECU), each of which is associated with a different one of the at least one task queue and with a different one of the at least one processing unit. The task dispatcher may be configured to accept the command fetched by the task acquisition device and put said each microtask instruction into the one of the at least one task queue associated with said indicated processing unit, and may be further configured to perform said detection and notification. Each of the at least one execution control units may be configured to obtain one microtask instruction from the associated task queue in a FIFO manner and deliver the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable of processing a further microtask instruction, and may be further configured to collect the completion of the obtained microtask instruction for the purpose of said detection. The command may further include synchronization instructions for achieving synchronization among two or more of the at least one processing units; the task dispatcher may be further configured to put the synchronization instructions into the task queues associated with the synchronization instructions, and the task scheduling device may further comprise a synchronization control means operated by the execution control units, based on the synchronization instructions obtained from the respective task queues, for achieving the synchronization.
The synchronization may be based on a semaphore technique, and the synchronization control means may be implemented based on a shared semaphore array.
The command may further include a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units. The task dispatcher may be further configured to put the flow control instruction into the task queue associated with the flow control instruction, and the task scheduling device may further comprise a memory for storing the values, wherein the execution control unit may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
The command may further include a flow control instruction for the task dispatcher to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units, and the task scheduling device may further comprise a memory for storing the values, wherein the task dispatcher may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
The multi-processor computing system may be implemented in a system on chip accelerator or a microprocessor.
The access by the processing unit to the global memory may be implemented via DMA. The multi-processor computing system may further comprise a configuration device for initializing the microtask sequencer.
A preferred embodiment of the invention provides an improved architecture of a multiprocessor computer. It includes a dedicated programmable task sequencer, referred to as the "Microtask Sequencer".
Further, the microtask sequencer is programmed by the host processor with information describing the tasks to be issued to the computing elements. The microtask sequencer then dispatches these tasks and monitors their execution status automatically, without intervention by the host processor. The microtask sequencer also has software programmable synchronization logic that is based on a signal posting and waiting mechanism. This synchronization logic is used when synchronization between tasks is required; the synchronization logic is thus decoupled from the computing elements. Compared to the known prior art, the invention provides a computing task handling mechanism with performance comparable to that of hardwired logic and flexibility comparable to fully software implemented controls. By decoupling the synchronization logic from the computing elements, the invention makes it simple to integrate heterogeneous computing elements into the same architecture. It also allows the software programmers of the computing elements to focus only on the functional part of the algorithm. Another potential improvement is power efficiency over software approaches, since no instruction cycles are spent waiting for synchronization signals.
Another aspect of the invention provides a task allocating method in a multi-processor computing system comprising a host processor, a global memory, a microtask sequencer and at least one processing unit, the method comprising: providing, by the host processor, commands including microtask descriptions in the global memory; fetching, by the microtask sequencer, a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands; dispatching, by the microtask sequencer, each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit dispatched to can execute the microtask instruction; detecting, by the microtask sequencer, the completion of all the microtasks in the command; and notifying, by the microtask sequencer, the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
Description of Accompanying Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention, and together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the invention.
Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. In the drawings:
Figure 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention; and
Figure 2 shows an example of the configuration of the microtask sequencer including the mechanism of synchronization.
Embodiments for Carrying Out the Invention
In the following description, certain specific details are set forth in order to provide a thorough understanding of various embodiments of the invention. However, one skilled in the art will understand that the invention may be practiced without these details. In other instances, well-known structures associated with computers, processors and the like have not been shown or described in detail to avoid unnecessarily obscuring the descriptions of the embodiments of the invention.
For reasons of resource limitation, exploiting parallelism, increasing code and data locality, code manageability, and so on, it is common practice for applications running on multiprocessor systems to divide complex computing tasks into small pieces. Each of the small tasks is then dispatched to one of the computing elements for execution according to the designed algorithm flow. For example, in typical real-time video coding applications, one small task may process only one 4x4 or 8x8 image block, and the total number of small tasks per second can be in the hundreds of thousands. Mapping these tasks to a multi-processor system, each small task may run for only several microseconds from start to finish. In addition, these small tasks may have complex dependencies between them; for example, some of them need the output of others as their input data. Because the number of small tasks is so large and the granularity of their execution control is so fine, these small tasks are distinctly different from the common definition of tasks in traditional operating systems. To make this less confusing, we define a term for a small task that represents a single run on a single computing element: one "microtask".
Because of the large quantity and the extremely fine granularity of control of the microtasks, the approach of utilizing task scheduling in a traditional operating system would be far too heavy for most host processors to handle. Our invention solves this problem by introducing hardware sequencer logic that works as an autonomous controller. It releases the host processor from the heavy load of microtask scheduling. All that the host processor needs to do is program the sequencer and handle some infrequent interrupts with predefined conditions.
Figure 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention. The system 100 comprises a host processor 101, a global memory 102, and a multi-processor subsystem 110, which are connected to a host system bus 103. The host processor 101 runs the OS and high level application logic. The global memory 102 is used to hold code and data of the application.
The multi-processor subsystem 110 comprises a microtask sequencer 104, optional DMA elements 105-1, ..., 105-m, computing elements 106-1, ..., 106-n, and optional shared local memory blocks 107-1, ..., 107-k. The computing elements 106-1, ..., 106-n do the actual computing of the microtasks. These computing elements 106-1, ..., 106-n can range from generic processors to dedicated hardwired logic units. In comparison with most other multi-processor systems, a combination of different computing element types is also allowed in this architecture.
Preferably, to reduce the data traffic between the computing elements 106-1, ..., 106-n and the global memory 102, private local memories that hold code and data are normally included in the computing elements 106-1, ..., 106-n. Preferably, for the same reason, there can also be some shared local memory blocks 107-1, ..., 107-k to store intermediate or common data. Accordingly, the DMA elements 105-1, ..., 105-m are used to move data between the global memory 102, the local memories of the computing elements 106-1, ..., 106-n, and the shared local memory blocks 107-1, ..., 107-k. A local bus/crossbar switch 109 is used as the internal data path between the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n.
The microtask sequencer 104 is used to control the function units, such as the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n. These function units are connected to the microtask sequencer 104 through the task control bus 108. The microtask sequencer 104 usually gets commands containing descriptions of microtasks from the global memory 102, where these descriptions are generated by the application software running on the host processor 101 according to the programmer's design or the arrangement by the compiler or the interpreter. By way of example, a computing task including three separate computations can be divided into three microtasks as follows:
microtask 1: Compute #1
microtask 2: Compute #2
microtask 3: Compute #3
Then the microtask sequencer 104 executes these commands and dispatches the task descriptions (i.e., microtasks) to the corresponding function units, based on the destination information specified in the task descriptions. As described before, the microtasks for the DMA elements 105-1, ..., 105-m may be as small as moving several tens of bytes of data, and the computing elements 106-1, ..., 106-n then process these data in tens or hundreds of clock cycles. In a system clocked at more than 200 MHz, these microtasks may last for less than one microsecond, or even less than a tenth of a microsecond. Apparently, if the host processor 101 were used to control these microtasks without the help of the microtask sequencer 104, its processing capability could be fully occupied with this control. In this respect, the multiprocessor computing system with the microtask sequencer can at least release the host processor from the scheduling and management of the function units.
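By way of illustration only, the dispatching behavior described above can be sketched in Python. The command fields and unit names here are hypothetical, since the actual task description format is implementation specific:

```python
from collections import deque

# Hypothetical microtask descriptions; each carries the destination
# information that the task dispatcher uses for routing.
commands = [
    {"unit": "DMA-1", "op": "data_in", "bytes": 64},
    {"unit": "CE-1", "op": "compute", "block": "8x8"},
    {"unit": "DMA-1", "op": "data_out", "bytes": 64},
]

# One FIFO task queue per function unit, as in the subsystem of Figure 1.
queues = {"DMA-1": deque(), "CE-1": deque()}

def dispatch(commands, queues):
    """Route each microtask to the queue of the function unit named in
    its description, without involving the host processor."""
    for cmd in commands:
        queues[cmd["unit"]].append(cmd)

dispatch(commands, queues)
assert len(queues["DMA-1"]) == 2 and len(queues["CE-1"]) == 1
```

In the real system the queues live inside the microtask sequencer; the host processor only writes the command list to the global memory and is not involved in this routing step.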
Upon finishing its microtask, the function unit may notify the microtask sequencer through the task control bus or any other known signaling method such as interrupt, so that the microtask sequencer may perform further dispatching or control.
In general, the microtask sequencer and the function units may be implemented by any physical or logical processing units, such as one or more central processing units (CPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), and the like. The host system bus, task control bus and local bus/crossbar switch can employ any known bus or switch structures or architectures.
Further, there may be dependencies between microtasks running on different function units. For example, suppose a task is to be executed where a DMA element (such as one of the DMA elements 105-1, ..., 105-m) loads input data for a computing element (such as one of the computing elements 106-1, ..., 106-n) before the corresponding microtask on the computing element starts to run. After the computing element finishes the job, the DMA element moves the result data to the next destination. In this example, the computing task can be divided into three microtasks as follows:
microtask 1': Data In
microtask 2': Compute
microtask 3': Data Out
Thus, a microtask running on a computing element and a corresponding microtask running on a DMA element may depend on each other. For example, the computation of microtask 2' requires that the data in of microtask 1' have already been carried out, and the data out of microtask 3' requires that the computation of microtask 2' have already been carried out. Thus these function units must operate in the correct order to get a correct result. Microtasks that do not depend on each other, for example those running on two independent computing elements 106-1, ..., 106-n or two DMA elements 105-1, ..., 105-m, can be issued simultaneously to achieve better utilization of the hardware resources.
To deal with such dependencies among multiple microtasks, i.e., multiple function units, the invention provides a synchronization mechanism in the microtask sequencer, so that the microtask sequencer is capable of coordinating parallel execution of microtasks, thus releasing the function units from the burden of synchronization.
Figure 2 shows an example of the configuration of the microtask sequencer 200 including the mechanism of synchronization. As shown in Figure 2, the microtask sequencer 200 connects to the host system via the host system bus, and comprises a task acquisition device 210 for fetching the commands from the location as specified by the host processor via a memory bus of the host system bus, and a task scheduling device 220 for dispatching the microtasks described in the commands to relevant function units and managing the synchronization among the microtasks. The microtask sequencer 200 may further comprise a configuration device 230 by which the host processor may initialize or control the microtask sequencer 200 and the associated function units via a config bus of the host system bus.
The task acquisition device 210 comprises a memory bus interface 201, a FIFO loader 203-1 and an income FIFO 203-2. The task scheduling device 220 comprises a task dispatcher 205, multiple task queues 206a, 206b, 206c, corresponding execution control units (ECUs) 207a, 207b, 207c and a shared semaphore array 208. The task scheduling device 220 may also comprise a flag array 209 that holds flags corresponding to some or all of the function units. The configuration device 230 comprises a config bus interface 202 and a system configuration unit 204.
The task control bus 210a, 210b, 210c connects the execution control units 207a, 207b, 207c to corresponding function units such as the DMA elements 105-1, ..., 105-m and computing elements 106-1, ..., 106-n, respectively. Each function unit corresponds to a task queue 206 and an execution control unit 207 in the microtask sequencer 104.
The FIFO loader 203-1 is a DMA controller that is used to load programmed commands from the global memory 102 to the income FIFO 203-2. The FIFO loader 203-1 accepts requests from the task dispatcher 205 to load the commands as long as the task dispatcher 205 is capable of dealing with a further command. The task acquisition device 210 may be initialized, preferably through the configuration device 230, by the host processor to start the loading from a specified location inside the global memory 102 address space, and it will load contents up to a given count. Preferably, once the transfer is finished, the FIFO loader 203-1 will set the status in the system configuration unit 204 of the configuration device 230, and a host interrupt is asserted to inform the host processor 101 of the status. Alternatively, the above interaction between the task acquisition device 210 and the host processor 101 may be achieved by other signaling mechanisms known in the art. Preferably, there can be some buffers inside the income FIFO 203-2 to allow efficient burst access to the memory bus.
After the FIFO loader 203-1 fetches the programmed commands from the global memory 102, the task dispatcher 205 dispatches these commands from the income FIFO 203-2 to the task queues 206 of each execution control unit 207. Each task queue 206a, 206b, 206c is a FIFO storage that holds the commands to be issued to the corresponding function unit. The depth of the FIFO is a tradeoff between efficiency and area. In this embodiment, we assume a size of 32 to 64 entries. The task dispatcher 205 monitors the status of the associated task queues 206a, 206b, 206c and fills them once the immediately available commands from the FIFO loader 203-1 match queues that have empty slots. The command contains a segment that indicates which task queue 206 it shall be dispatched to.
The execution control unit 207 then checks the status of its corresponding function unit and dispatches commands to it accordingly. Each function unit attached to the microtask sequencer 200 is controlled by a corresponding execution control unit 207. The execution control unit 207 is a simple sequentially programmable controller that fetches commands from the corresponding task queue 206 and, according to the type of command, either forwards the command to the function unit or handles it by itself, if for example an inter-function unit synchronization operation needs to be performed or, preferably, if some flow control within the task needs to be performed. In the latter case, the execution control unit 207 has access to the semaphore array 208 and preferably the flag array 209. Under the control of semaphore commands, the execution control unit 207 can post semaphores or poll for some specific semaphores. With the flag array 209, the execution control unit 207 can also support conditional commands to implement an "IF THEN ELSE" style of execution path. This can be utilized to decouple the control of the host processor 101 even more.
As mentioned above, synchronization between the execution control units 207 is maintained by using the semaphore array 208. The semaphore array 208 is an array of one-bit flags that can be set and reset by the execution control units 207a, 207b, 207c according to semaphore commands in the task queues 206a, 206b, 206c. The semaphore array 208 is a shared resource in the whole microtask sequencer 200: every execution control unit 207 can access every bit of the semaphore array 208 under command control. The allocation of the semaphore bits to the task queues 206a, 206b, 206c is under the full control of the programmer, and it is the programmer's responsibility to avoid usage conflicts. An execution control unit 207 can either set multiple semaphore bits or wait for multiple semaphore bits. The semaphore bits are set by the semaphore post commands, and the semaphore wait commands will be blocked until the semaphore bits they wait for are all set. The semaphore wait commands will also reset the corresponding semaphore bits before continuing. With this simple design, complex synchronization requirements between multiple microtasks delivered to multiple queues can be handled easily.
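The post/wait discipline of the semaphore array can be modeled with a short Python sketch (illustrative only; a 16-bit array is assumed, and the hardware's blocking wait is reduced to a non-blocking check so the ordering rule is visible):

```python
class SemaphoreArray:
    """Model of the shared one-bit semaphore array (16 bits assumed)."""
    def __init__(self):
        self.bits = 0

    def post(self, mask):
        # SEMPOST: set the masked bits.
        self.bits |= mask

    def try_wait(self, mask):
        # SEMWAIT: succeed only when all masked bits are set, then clear
        # them before continuing; otherwise the ECU would remain blocked.
        if (self.bits & mask) != mask:
            return False
        self.bits &= ~mask
        return True

# Microtask 2' (Compute) must not start before microtask 1' (Data In):
sem = SemaphoreArray()
assert not sem.try_wait(0b01)   # Compute's ECU blocks: Data In not done
sem.post(0b01)                  # Data In's ECU posts on completion
assert sem.try_wait(0b01)       # Compute's ECU proceeds, bit is consumed
assert sem.bits == 0
```

Because the wait consumes the bits it waited on, each post/wait pair expresses exactly one producer-to-consumer handover between two queues.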
Also as mentioned above, to enable more interaction between the microtask sequencer 200 and the controlled function units, each function unit can return a single-bit or multiple-bit result as the execution of its microtask finishes. This value is written to the flag array 209. The data in the flag array 209 can be used to implement conditional execution of commands and thus allow an "IF-THEN-ELSE"-like execution flow of microtasks. More specifically, the flag array 209 is an array of single- or multiple-bit results, with each entry corresponding to one function unit. Thus the number of entries in the flag array 209 may be equal to or less than the number of function units, the number of task queues 206a, 206b, 206c, or the number of execution control units 207a, 207b, 207c, as required. The result is set by the function unit before the associated microtask execution ends. The definition of the result value, however, is not limited to those described in this document; it shall be determined by the function unit designer. Some possible examples are status reports, simple return values, etc. All of the execution control units 207a, 207b, 207c can read any entry of the flag array. For the execution control unit 207, there is a command to fetch and store the desired flag array 209 contents for future reference. This allows new result updates while keeping the local serialized execution flow consistent. With the flag array 209 and the conditional execution capability, "IF THEN ELSE" and "SWITCH CASE" execution control can be implemented and, in applicable scenarios, the host processor 101 can be decoupled even more.
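The latching behavior of the flag-fetch command can be sketched as follows; the 4-bit entry width matches the FLGLOAD command format, while the entry count is an assumption:

```python
class FlagArray:
    """One entry per function unit; a unit writes its 4-bit result as its
    microtask finishes, and any execution control unit may read any entry."""
    def __init__(self, n_units=8):
        self.entries = [0] * n_units

    def set_result(self, unit, value):
        # Written by the function unit before its microtask execution ends.
        self.entries[unit] = value & 0xF

    def flgload(self, index):
        # The ECU latches a snapshot into its local flag register, so later
        # result updates do not disturb an in-flight conditional sequence.
        return self.entries[index]

fa = FlagArray()
fa.set_result(3, 0b0101)
latched = fa.flgload(3)      # ECU's local copy for conditional execution
fa.set_result(3, 0b1111)     # a newer result does not change the latched copy
assert latched == 0b0101
```

The explicit latch is what keeps the serialized execution flow consistent even while function units keep reporting fresh results.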
The configuration device 230 is a preferred interface to the host system for configuration and control. More specifically, the system configuration unit 204 therein has access, through the configuration bus interface 202, to the host system, and can be used by the host system to initialize the microtask sequencer and control the operations of different internal/external function units. In general, the configuration device 230 may have a slave-mode I/O interface to the system bus and one interrupt request line for status reporting.
In the above embodiment, the microtask sequencer 104 is able to issue jobs to the function units (the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n) automatically with minimal intervention by the host processor 101.
To facilitate understanding of the invention, specific examples of the format and content of the task descriptions (microtask commands) are given in the following. It should be noted that the invention is not limited to these examples.
1. Microtask Sequencer Command Format
All commands of the microtask sequencer are 32 bits in length. The command is segmented into fixed fields as shown in Table 1. The QID field is the ID number of the task queue which the command shall be dispatched to. This field is interpreted by the task dispatcher. The task dispatcher also inspects the NUMARG field, which indicates how many 32-bit arguments follow. The task dispatcher will treat the following NUMARG 32-bit words as the arguments of the preceding command and will dispatch these data together to the task queue without further interpretation. The ECU field in the command word is used to distinguish whether the Execution Control Unit shall interpret the command or just send the command out to the corresponding processing element or function unit to process.
Table 1 Microtask Sequencer Command Format
Bits Mnemonic Descriptions
0:3 QID ID of the task queue which this command shall be dispatched to.
4 ECU 0: The command is for attached processing elements. 1: The command is for the execution control unit.
5:11 CMD The command definition.
12:15 NUMARG The number of 32-bit argument data following the command. 0000: The command does not have additional data.
16:31 DATA Optional 16-bit data that the command uses.
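Assuming that bit 0 denotes the most significant bit of the 32-bit command word (consistent with the bit numbering used in the command examples below), the field layout of Table 1 can be checked with a small pack/unpack sketch in Python:

```python
def pack_command(qid, ecu, cmd, numarg, data):
    """Pack the Table 1 fields into one 32-bit command word.
    Bit 0 is taken as the most significant bit; this is an assumption
    about the patent's bit-numbering convention."""
    assert qid < 16 and ecu < 2 and cmd < 128 and numarg < 16 and data < 65536
    return (qid << 28) | (ecu << 27) | (cmd << 20) | (numarg << 16) | data

def unpack_command(word):
    """Recover the five fields from a 32-bit command word."""
    return {
        "QID":    (word >> 28) & 0xF,    # bits 0:3
        "ECU":    (word >> 27) & 0x1,    # bit 4
        "CMD":    (word >> 20) & 0x7F,   # bits 5:11
        "NUMARG": (word >> 16) & 0xF,    # bits 12:15
        "DATA":   word & 0xFFFF,         # bits 16:31
    }

# A SEMPOST (CMD b'0010000') for the ECU of task queue 2, mask b'1011...':
w = pack_command(2, 1, 0b0010000, 0, 0b1011000000000000)
assert unpack_command(w) == {"QID": 2, "ECU": 1, "CMD": 0b0010000,
                             "NUMARG": 0, "DATA": 0b1011000000000000}
```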
1.1 ECU Commands
1.1.1 Semaphore Commands
1.1.1.1 Semaphore Post
Definition: SEMPOST semaphore vector
CMD: b'0010000'
NUMARG: 0
DATA: 0:15 bit mask of the semaphore vector
Description: This command sets the corresponding Semaphore Array bits to one. The bits affected are given by the bit mask in the DATA section. If any corresponding bit is already set to '1', this command will wait until all the corresponding bits have been cleared to '0' by another module.
Example: SEMPOST b'1011000000000000'
This command sets the 0,2,3 bits of the Semaphore Array.
1.1.1.2 Semaphore Wait
Definition: SEMWAIT semaphore vector
CMD: b'0010001'
NUMARG: 0
DATA: 0:15 bit mask of the semaphore vector
Description: This command waits for the corresponding Semaphore Array bits to be set. The bits monitored are given by the bit mask in the DATA section. The command shall hold the execution flow of the ECU until all the masked bits are set. The corresponding Semaphore Array bits are then cleared to zero. There is one global timeout counter that defines a maximum waiting time in system clock cycles. If the semaphore bits are not all set within the defined timeout value, a status report event shall be generated, and that signal can be used to trigger a host interrupt to let the host processor intervene in the process.
Example: SEMWAIT b'1011000000000000'
This command waits until the 0,2,3 bits of the Semaphore Array are all set.
1.1.1.3 Semaphore Clear
Definition: SEMCLR semaphore vector
CMD: b'0010010'
NUMARG: 0
DATA: 0:15 bit mask of the semaphore vector
Description: This command clears the corresponding Semaphore Array bits to zero. The bits are given by the bit mask in the DATA section. The command can be used to initialize the Semaphore Array to a correct status.
Example: SEMCLR b'1111111111111111'
This command clears all of the bits of the Semaphore Array.
1.1.1.4 Semaphore Pending
Definition: SEMPEND semaphore vector
CMD: b'OOlOOl l1
NUMARG: 0
DATA: 0 : 15 bit mask of semaphore vector
Description: This command waits for the correspondent Semaphore Array bits to be cleared. The bits monitored are given by the bit mask in DATA section. The command shall hold execution flow of ECU until all the masked bits are cleared. The command does nothing to the Semaphore Array. There is one global timeout counter that defines a maximum waiting time in system clock cycles. Once the semaphore bits are not triggered within the defined timeout value, a status report event shall be generated and that signal can be used to trigger a host interrupt to let the host processor intervene the process.
Example: SEMPEND b'1011000000000000'
This command waits until the 0,2,3 bits of the Semaphore Array are all cleared to zero.
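SEMPEND is the read-only complement of SEMWAIT: it waits for zeros instead of ones and leaves the array untouched. A hypothetical sketch (the `bits` callable and `tick` hook are illustrative, not part of the specification):

```python
def sem_pend(bits, mask, timeout_cycles, tick):
    """SEMPEND: block until every bit in `mask` reads as zero.

    Unlike SEMWAIT, the semaphore array is not modified. `bits` is a
    callable returning the current 16-bit array value; `tick()` models
    one system clock cycle. Returns True when all masked bits are
    clear, False when the global timeout expires (which would raise a
    status report event toward the host).
    """
    for _ in range(timeout_cycles):
        if bits() & mask == 0:
            return True  # all monitored bits cleared
        tick()
    return False  # timeout: host intervention required
```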
1.1.2 Flag Array and Conditional Execution Commands
1.1.2.1 Get Conditional Flag Result from Flag Array
Definition: FLGLOAD flag array index number
CMD: b'0011000'
NUMARG: 0
DATA: The 12:15 bits contain an index into the Flag Array, which must fall inside the valid range. The index is zero based. The other bits are unused and should be set to zero.
Description: This command retrieves the 4-bit value from the specified Flag Array element and stores it to the condition section of the local flag register for conditional execution usage.
The Flag Array holds the flags returned by the function units when they finish a microtask.
Example: FLGLOAD 3
This command retrieves the element of index 3 from the Flag Array.
1.1.2.2 Unconditional Jump Forward
Definition: JMP number of words
CMD: b'0011100'
NUMARG: 0
DATA: The 8:15 bits hold the number of words in the Task List that we shall skip.
The other bits are unused and should be set to zero.
Description: This command skips the following command and data words on the current task queue by the given number of words.
Example: JMP 15
This command omits the next 15 command/data words from the task queue.
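The skip-forward semantics can be illustrated with a toy task-queue interpreter. This is a sketch, not the ECU implementation; the tuple encoding of `("JMP", n)` is purely for illustration:

```python
def run_task_list(words):
    """Toy interpreter showing JMP's skip-forward semantics.

    `words` models the task queue: each entry is either ("JMP", n) or
    an opaque command/data word. JMP advances the read pointer past
    the next n words; everything else is "executed" by appending it
    to the result. Returns the words actually processed.
    """
    processed = []
    pc = 0
    while pc < len(words):
        word = words[pc]
        pc += 1
        if isinstance(word, tuple) and word[0] == "JMP":
            pc += word[1]  # skip the following n command/data words
        else:
            processed.append(word)
    return processed
```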
1.1.2.3 Conditional Jump Forward
Definition: JMPWHEN conditions, number of words
CMD: b'0011101'
NUMARG: 0
DATA: The 0:7 bits hold the condition description as shown in Table 2. The 8:15 bits hold the number of words in the Task List that we shall skip.
Description: This command skips the following command and data words on the current task queue by the given number of words when the conditions are met. If the conditions are not met, it does nothing.
Example: JMPWHEN b'10101000', 15
This command omits the next 15 command/data words from the task queue when conditional bit 0 == 1 and bit 2 == 0.
Table 2 Condition Description Format
Bits Description
0:3 A four bit mask of which condition flags we need to evaluate.
4:7 A four bit value that we expect to be matched.
Assume the mask is ABCD, the match value is EFGH and the actual flags are efgh; the condition evaluation V can then be defined as:
V = !(A & (E ^ e)) & !(B & (F ^ f)) & !(C & (G ^ g)) & !(D & (H ^ h))
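Since each term only vetoes the result where a masked flag differs from its expected value, the whole expression collapses to one bitwise comparison. A minimal check of the rule (treating mask, match and flags as 4-bit integers with A/E/e in the most significant bit):

```python
def eval_condition(mask, match, flags):
    """Evaluate V for 4-bit mask/match/flags values.

    A flag participates only where its mask bit is 1; the condition
    holds when every selected flag equals the corresponding match bit.
    This is exactly: AND over all bits of !(mask & (match ^ flags)),
    i.e. (mask & (match ^ flags)) == 0.
    """
    return (mask & (match ^ flags)) == 0
```

With the JMPWHEN example b'10101000' (mask 1010, match 1000), the jump is taken only when flag bit 0 is 1 and flag bit 2 is 0, regardless of bits 1 and 3.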
It should be understood that the specific devices or modules indicated in the microtask sequencer 200 herein may be implemented in hardware and/or software. For example, a specific device or module may be implemented using software and/or firmware executed on one or more processing modules. The processing module can be a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, microcontroller, digital processor, microcomputer, a portion of the central processing unit, a state machine, logic circuitry, hardwired logic and/or any device that can perform a desired processing or control.
While the invention has been described according to the embodiment in which the multiprocessor subsystem comprises the microtask sequencer, the DMA elements, the computing elements and the shared local memory blocks, the DMA elements, the shared local memory blocks and the local bus/crossbar switch are unnecessary if the computing elements are configured to access the global memory directly.
While the shared semaphore array is employed for synchronization in the above embodiment, other known synchronization mechanisms can be implemented in the invention.
While particular numbers of task queues, execution control units, and function units of each type are described in the embodiment, these numbers do not limit the invention; any number can be used in an implementation as required. While the location in the global memory from which the commands are fetched is specified by the host processor, it can also be statically determined in advance. Other relevant parameters can be predetermined similarly.
While the invention has been described in a general multi-processor computing system, it can be particularly applicable to system on chip (SOC) accelerators or microprocessors.
While the microtasks are dispatched to respective function units based on the destination information specified in the task descriptions in the embodiments, it is also possible to specify only the types of the function units (e.g., DMA units, calculating units, etc.) for performing the microtasks in the task descriptions, and to let the task dispatcher designate the function units for performing the microtasks according to the availability of the respective function units. Information on the type, quantity and status of the function units may be defined in advance, set by the host through the system configuration units, or obtained or maintained by polling or other status detecting mechanisms. If it is necessary to designate that several microtasks be performed by the same function unit, it is possible to specify the logical identification of this function unit in the task descriptions and to have the task dispatcher bind this logical identification to a physical function unit.
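The type-based dispatch with logical-to-physical binding described above can be sketched as follows. This is a hypothetical model: the unit-type names, the `busy` set, and the `bindings` table are illustrative choices, not structures defined by the embodiments:

```python
class TaskDispatcher:
    """Sketch of dispatch by function-unit type with logical-id binding.

    `units` maps a unit type (e.g. "DMA", "CALC") to its physical
    units; `busy` tracks availability; `bindings` pins a logical id to
    the physical unit first chosen for it, so that several related
    microtasks land on the same function unit.
    """

    def __init__(self, units):
        self.units = units
        self.busy = set()
        self.bindings = {}

    def dispatch(self, unit_type, logical_id=None):
        # Reuse an existing binding so related microtasks share a unit.
        if logical_id is not None and logical_id in self.bindings:
            return self.bindings[logical_id]
        # Otherwise pick any available unit of the requested type.
        for unit in self.units[unit_type]:
            if unit not in self.busy:
                self.busy.add(unit)
                if logical_id is not None:
                    self.bindings[logical_id] = unit
                return unit
        return None  # no unit of this type currently available
```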
The above-disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments that fall within the true spirit and scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description.
The scope of the present disclosure includes any novel feature or combination of features disclosed herein. The applicant hereby gives notice that new claims may be formulated to such features or combination of features during prosecution of this application or of any such further applications derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.
For the avoidance of doubt, the term "comprising", as used herein throughout the description and claims, is not to be construed as meaning "consisting only of".

Claims

1. A multi-processor computing system, comprising a host processor, a global memory and at least one processing units, wherein the multi-processor computing system further comprises: a microtask sequencer, comprising a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit as dispatched to can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or by another of the at least one processing units.
2. The multi-processor computing system according to claim 1, wherein the task scheduling device comprises: a task dispatcher; at least one task queues; and at least one execution control units, each of which is associated with a different one of the at least one task queues, and with a different one of the at least one processing units, wherein the task dispatcher is configured to accept the command fetched by the task acquisition device and put said each microtask instruction in one of the at least one task queues associated with said processing unit as dispatched to, and is further configured to perform said detection and notification; and each of the at least one execution control units is configured to obtain one microtask instruction from the associated task queue in a FIFO manner and deliver the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable of processing a further microtask instruction, and is further configured to collect the completion of the obtained microtask instruction for said detection.
3. The multi-processor computing system according to claim 2, wherein the command further includes synchronization instructions for achieving the synchronization among two or more of the at least one processing units, the task dispatcher is further configured to put the synchronization instructions into the task queues associated with the synchronization instructions, and the task scheduling device further comprises: a synchronization control means being operated by the execution control units based on the synchronization instructions obtained from respective task queues for achieving the synchronization.
4. The multi-processor computing system according to claim 3, wherein the synchronization is based on semaphore technique, and the synchronization control means is implemented based on a shared semaphore array.
5. The multi-processor computing system according to any of claims 2-4, wherein the command further includes a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more of values reported from all or some of the processing units, the task dispatcher is further configured to put the flow control instruction into the task queue associated with the flow control instruction, and the task scheduling device further comprises: a memory for storing the values, wherein the execution control unit is further configured to skip the part of the microtask instructions based on the flow control instruction and associated values in the memory.
6. The multi-processor computing system according to any of claims 2 to 4, wherein the command further includes a flow control instruction for the task dispatcher to selectively skip a part of the microtask instructions based on one or more of values reported from all or some of the processing units, and the task scheduling device further comprises: a memory for storing the values, wherein the task dispatcher is further configured to skip the part of the microtask instructions based on the flow control instruction and associated values in the memory.
7. The multi-processor computing system according to any of claims 1 to 6, wherein the multi-processor computing system is implemented in a system on chip accelerator or a microprocessor.
8. The multi-processor computing system according to any of claims 1 to 7, wherein the access by the processing unit to the global memory is implemented via DMA.
9. The multi-processor computing system according to any of claims 1 to 8, wherein the multi-processor computing system further comprises a configuration device for initializing the microtask sequencer.
10. The multi-processor computing system according to any of claims 1 to 9, wherein the microtask sequencer further comprises a local shared memory.
11. The multi-processor computing system according to any preceding claim, wherein the task scheduling device is configured to dispatch each microtask instruction defined in each microtask description to a processing unit as indicated in the microtask description.
12. The multi-processor computing system according to any preceding claim, wherein the task scheduling device is configured to dispatch each microtask instruction defined in each microtask description to a processing unit meeting the requirement as specified in the microtask description.
13. A task allocating method in a multi-processor computing system comprising a host processor, a global memory and at least one processing units, the method comprising: providing, by the host processor, commands including microtask descriptions in the global memory; fetching, by the microtask sequencer, a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands; dispatching, by the microtask sequencer, each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit as dispatched to can execute the microtask instruction; detecting, by the microtask sequencer, the completion of all the microtasks in the command; and notifying, by the microtask sequencer, the host processor of the completion, wherein the processing unit can access the global memory by itself or by another of the at least one processing units.
14. The method according to claim 13, wherein the microtask sequencer comprises at least one task queues and at least one execution control units, each of which is associated with a different one of the at least one task queues, and with a different one of the at least one processing units, the dispatching comprises: putting said each microtask instruction in one of the at least one task queues associated with said processing unit as dispatched to; and obtaining by an associated execution control unit one microtask instruction from the associated task queue in a FIFO manner, delivering the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable to process a further microtask instruction, and collecting the completion of the obtained microtask instruction for said detection.
15. The method according to claim 14, wherein the command further includes synchronization instructions for achieving the synchronization among two or more of the at least one processing units, the dispatching further comprises: putting the synchronization instructions into the task queues associated with the synchronization instructions; and operating by the associated execution control units based on the synchronization instructions obtained from respective task queues for achieving the synchronization.
16. The method according to claim 15, wherein the synchronization is based on semaphore technique.
17. The method according to any of claims 14 to 16, wherein the command further comprises a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more of values reported from all or some of the processing units, the dispatching further comprises: putting the flow control instruction into the task queue associated with the flow control instruction; storing the values; and skipping, through the execution control unit, the part of the microtask instructions based on the flow control instruction and the stored associated values.
18. The method according to any of claims 14 to 16, wherein the command further comprises a flow control instruction for selectively skipping a part of the microtask instructions based on one or more of values reported from all or some of the processing units, the dispatching further comprises: storing the values; and skipping the part of the microtask instructions based on the flow control instruction and the stored associated values.
19. The method according to any of claims 13 to 18, wherein the dispatching comprises dispatching each microtask instruction defined in each microtask description to a processing unit as indicated in the microtask description.
20. The method according to any of claims 13 to 19, wherein the dispatching comprises dispatching each microtask instruction defined in each microtask description to a processing unit meeting the requirement as specified in the microtask description.
PCT/EP2007/060028 2006-10-20 2007-09-21 A multi-processor computing system and its task allocating method WO2008046716A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200610135656.X 2006-10-20
CN 200610135656 CN101165655A (en) 2006-10-20 2006-10-20 Multiple processor computation system and its task distribution method

Publications (1)

Publication Number Publication Date
WO2008046716A1 true WO2008046716A1 (en) 2008-04-24

Family

ID=38776273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2007/060028 WO2008046716A1 (en) 2006-10-20 2007-09-21 A multi-processor computing system and its task allocating method

Country Status (2)

Country Link
CN (1) CN101165655A (en)
WO (1) WO2008046716A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661405B (en) * 2008-08-28 2012-08-29 国际商业机器公司 Multi-database function invoking method and system based on multiprocessor system
CN102129390B (en) * 2011-03-10 2013-06-12 中国科学技术大学苏州研究院 Task scheduling system of on-chip multi-core computing platform and method for task parallelization
WO2013095338A1 (en) * 2011-12-19 2013-06-27 Intel Corporation Simd integer multiply-accumulate instruction for multi-precision arithmetic
US9921873B2 (en) * 2012-01-31 2018-03-20 Nvidia Corporation Controlling work distribution for processing tasks
US9483263B2 (en) * 2013-03-26 2016-11-01 Via Technologies, Inc. Uncore microcode ROM
CN105051689A (en) * 2013-09-29 2015-11-11 华为技术有限公司 Method, apparatus and system for scheduling resource pool in multi-core system
CN104572266B (en) * 2015-01-04 2018-03-06 华东师范大学 MPSoC task schedulings modeling based on UPPAAL SMC and appraisal procedure under process variation
CN108089915B (en) * 2016-11-22 2021-10-15 北京京东尚科信息技术有限公司 Method and system for business control processing based on message queue
CN108416433B (en) * 2018-01-22 2020-11-24 上海熠知电子科技有限公司 Neural network heterogeneous acceleration method and system based on asynchronous event
CN109743390B (en) * 2019-01-04 2022-02-22 深圳壹账通智能科技有限公司 Task scheduling method and device, computer equipment and storage medium
CN113032015B (en) * 2019-12-24 2022-02-18 中国科学院沈阳自动化研究所 Communication method for precision motion control
CN115176240A (en) * 2020-03-13 2022-10-11 北京希姆计算科技有限公司 Task allocation method and device, electronic equipment and computer readable storage medium
CN111431892B (en) * 2020-03-20 2022-03-25 上海金卓科技有限公司 Accelerator management architecture and method and accelerator interface controller
CN112486575A (en) * 2020-12-07 2021-03-12 广西电网有限责任公司电力科学研究院 Electric artificial intelligence chip sharing acceleration operation component and application method
CN113326221B (en) * 2021-06-30 2024-03-22 上海阵量智能科技有限公司 Data processing device, method, chip, computer device and storage medium
CN114253694B (en) * 2022-02-25 2022-06-24 杭州雄迈集成电路技术股份有限公司 Asynchronous processing method and device based on neural network accelerator

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002046887A2 (en) * 2000-10-23 2002-06-13 Xyron Corporation Concurrent-multitasking processor
EP1416377A1 (en) * 2002-10-30 2004-05-06 STMicroelectronics, Inc. Processor system with a plurality of processor cores for executing tasks sequentially or in parallel


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ADLER R M: "DISTRIBUTED COORDINATION MODELS FOR CLIENT/SERVER COMPUTING", COMPUTER, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 28, no. 4, 1 April 1995 (1995-04-01), pages 14 - 22, XP000507856, ISSN: 0018-9162 *
JIANWEN C: "Chen Jianwen Homepage", 2 November 2006 (2006-11-02), pages 1 - 4, XP002461679, Retrieved from the Internet <URL:http://learn.tsinghua.edu.cn:8080/2002315365/index.html> *
SHONDIP SEN, HENK MULLER, DAVID MAY: "Synchronisation in a Multithreaded Processor", COMMUNICATING PROCESS ARCHITECTURES 2000, IOS PRESS, 2000, Amsterdam, pages 137 - 144, XP002461678, Retrieved from the Internet <URL:http://citeseer.ist.psu.edu/rd/0%2C360899%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/17245/http:zSzzSzwww.cs.bris.ac.ukzSzToolszSzReportszSzPszSz2000-may-2.pdf/sen00synchronisation.pdf> *
VENKATARAMAN R: "Designing SONET/ATM layer processing ASICs using embedded approach", COMPUTERS AND COMMUNICATIONS, 1995., CONFERENCE PROCEEDINGS OF THE 1995 IEEE FOURTEENTH ANNUAL INTERNATIONAL PHOENIX CONFERENCE ON SCOTTSDALE, AZ, USA 28-31 MARCH 1995, NEW YORK, NY, USA,IEEE, US, 28 March 1995 (1995-03-28), pages 437 - 443, XP010149352, ISBN: 0-7803-2492-7 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2131278A1 (en) * 2008-06-02 2009-12-09 Mobileye Technologies Ltd Scheduling of multiple tasks in a system including multiple computing elements
US9038073B2 (en) 2009-08-13 2015-05-19 Qualcomm Incorporated Data mover moving data to accelerator for processing and returning result data based on instruction received from a processor utilizing software and hardware interrupts
WO2011020053A1 (en) * 2009-08-13 2011-02-17 Qualcomm Incorporated Apparatus and method for efficient data processing
US8762532B2 (en) 2009-08-13 2014-06-24 Qualcomm Incorporated Apparatus and method for efficient memory allocation
US8788782B2 (en) 2009-08-13 2014-07-22 Qualcomm Incorporated Apparatus and method for memory management and efficient data processing
CN103729228A (en) * 2012-10-11 2014-04-16 三星电子株式会社 Method for compiling program, task mapping method and task scheduling method
CN103353851A (en) * 2013-07-01 2013-10-16 华为技术有限公司 Method and equipment for managing tasks
WO2015013458A1 (en) * 2013-07-23 2015-01-29 Qualcomm Incorporated Providing queue barriers when unsupported by an i/o protocol or target device
WO2015106687A1 (en) * 2014-01-14 2015-07-23 Tencent Technology (Shenzhen) Company Limited Method and apparatus for processing computational task
US10146588B2 (en) 2014-01-14 2018-12-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for processing computational task having multiple subflows
US10437650B2 (en) * 2014-06-19 2019-10-08 Nec Corporation Controlling execution of tasks in a series of operational processing by identifying processing units based on task command, task setting information, state of operational processing
CN104462302A (en) * 2014-11-28 2015-03-25 北京京东尚科信息技术有限公司 Distributed data processing coordination method and system
WO2020047337A1 (en) * 2018-08-29 2020-03-05 Qualcomm Incorporated Method, apparatus, and system for an architecture for machine learning acceleration
US11010313B2 (en) 2018-08-29 2021-05-18 Qualcomm Incorporated Method, apparatus, and system for an architecture for machine learning acceleration
CN111045979A (en) * 2018-10-11 2020-04-21 力晶科技股份有限公司 Multi-processing architecture based on memory processor and method of operation thereof
CN111045979B (en) * 2018-10-11 2023-12-19 力晶积成电子制造股份有限公司 Multi-processing architecture based on memory processor and method of operation thereof

Also Published As

Publication number Publication date
CN101165655A (en) 2008-04-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07803577

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07803577

Country of ref document: EP

Kind code of ref document: A1