WO2008046716A1 - A multi-processor computing system and its task allocating method - Google Patents

A multi-processor computing system and its task allocating method

Info

Publication number
WO2008046716A1
Authority
WO
WIPO (PCT)
Prior art keywords
microtask
task
instruction
synchronization
computing system
Prior art date
Application number
PCT/EP2007/060028
Other languages
French (fr)
Inventor
Hai Ju
Guo Hui Lin
Qiang Liu
Lu Wan
Yu Dong Yang
Roderick Michael Peters West
Original Assignee
International Business Machines Corporation
Ibm United Kingdom Limited
Priority date
Filing date
Publication date
Application filed by International Business Machines Corporation, Ibm United Kingdom Limited filed Critical International Business Machines Corporation
Publication of WO2008046716A1 publication Critical patent/WO2008046716A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition

Definitions

  • The invention relates generally to a multi-processor computer and its task allocating method, and particularly to a computing system with an array of homogeneous and/or heterogeneous computing elements that are configured dynamically at runtime to fulfill special purpose computing tasks, and its task allocating method.
  • Multi-processor computing systems are generally designed to satisfy two classes of computing requirements: (1) general purpose applications, and (2) special purpose applications.
  • For general purpose applications, SMP with homogeneous processors is the most common architecture.
  • For special purpose applications such as, for example, multimedia and digital signal processing, it is quite common that heterogeneous processors or hardware offload elements are combined to address different aspects of the applications.
  • the architecture is usually built upon an array of interconnected computing elements. These computing elements can be programmable processors or fixed function units. Despite the differences between the computing elements, the most distinct point that discriminates these architectures from each other is the way in which these computing elements are connected and controlled to handle the computing tasks.
  • The heartbeat can be as fast as one clock cycle or up to a longer period, according to the architectural design and stream pipeline delays.
  • One advantage of heartbeat synchronization is that it is easier to implement and program, as the computing elements are decoupled from the synchronization tasks.
  • The drawback is that it lacks the flexibility to run asynchronous tasks, thus limiting the application fields.
  • Another drawback is that its efficiency is adversely affected if pipeline stages cannot be well balanced.
  • The computing elements pause at some specified point in their execution flow until all of the elements in a synchronization group come to that point, and then the execution flow resumes from that point.
  • With specific applications, the synchronization logic design and computing element partition could be optimized to give very good performance.
  • The drawback of this implementation is its lack of flexibility when there are changes in the configuration and applications.
  • A first aspect of the invention provides a multi-processor computing system comprising a host processor, a global memory and at least one processing unit, wherein the multi-processor computing system further comprises a microtask sequencer that comprises a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the indicated processing unit can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
  • Preferred embodiments enable the host processor to be released from dispatching tasks to computing elements.
  • the host processor and computing elements are released from synchronisation control.
  • The task acquisition device may comprise: a task dispatcher; at least one task queue; and at least one execution control unit (ECU), each of which is associated with a different one of the at least one task queue, and with a different one of the at least one processing unit, wherein the task dispatcher may be configured to accept the command fetched by the task acquisition device and put said each microtask instruction in the one of the at least one task queue associated with said indicated processing unit, and may be further configured to perform said detection and notification; and each of the at least one execution control units may be configured to obtain one microtask instruction from the associated task queue in a FIFO manner and deliver the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable of processing a further microtask instruction, and may be further configured to collect the completion of the obtained microtask instruction for the purpose of said detection.
  • ECU execution control units
  • The command may further include synchronization instructions for achieving synchronization among two or more of the at least one processing units.
  • The task dispatcher may be further configured to put the synchronization instructions into the task queues associated with the synchronization instructions.
  • The task scheduling device may further comprise: a synchronization control means operated by the execution control units, based on the synchronization instructions obtained from the respective task queues, for achieving the synchronization.
  • the synchronization may be based on semaphore technique, and the synchronization control means may be implemented based on a shared semaphore array.
  • The command may further include a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units; the task dispatcher may be further configured to put the flow control instruction into the task queue associated with the flow control instruction, and the task scheduling device may further comprise a memory for storing the values, wherein the execution control unit may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
  • The command may further include a flow control instruction for the task dispatcher to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units, and the task scheduling device may further comprise a memory for storing the values, wherein the task dispatcher may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
  • the multi-processor computing system may be implemented in a system on chip accelerator or a microprocessor.
  • the access by the processing unit to the global memory may be implemented via DMA.
  • the multi-processor computing system may further comprise a configuration device for initializing the microtask sequencer.
  • A preferred embodiment of the invention provides an improved multiprocessor computer architecture. It includes a dedicated programmable task sequencer, the "Microtask Sequencer".
  • The microtask sequencer is programmed by the host processor with information describing the tasks to be issued to the computing elements. The microtask sequencer then dispatches these tasks and monitors their execution status automatically, without intervention by the host processor.
  • The microtask sequencer also has software programmable synchronization logic that is based on a signal posting and waiting mechanism. This synchronization logic is used when synchronization between tasks is required. Thus, the synchronization logic is decoupled from the computing elements.
  • The invention provides a computing task handling mechanism that has performance comparable to that of hardwired logic and flexibility comparable to fully software-implemented controls. By decoupling the synchronization logic from the computing elements, the invention makes it simple to integrate heterogeneous computing elements into the same architecture. It also allows the software programmers of the computing elements to focus on only the functional part of the algorithm. Another improvement is potentially better power efficiency than software approaches, since no instruction cycles are spent waiting for synchronization signals.
  • Another aspect of the invention provides a task allocating method in a multi-processor computing system comprising a host processor, a global memory and at least one processing unit, the method comprising: providing, by the host processor, commands including microtask descriptions in the global memory; fetching, by the microtask sequencer, a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands; dispatching, by the microtask sequencer, each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit to which it is dispatched can execute the microtask instruction; detecting, by the microtask sequencer, the completion of all the microtasks in the command; and notifying, by the microtask sequencer, the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
  • Figure 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention.
  • Figure 2 shows an example of the configuration of the microtask sequencer including the mechanism of synchronization.
  • FIG. 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention.
  • the system 100 comprises a host processor 101, a global memory 102, and a multi-processor subsystem 110, which are connected to a host system bus 103.
  • the host processor 101 runs the OS and high level application logic.
  • the global memory 102 is used to hold code and data of the application.
  • the multi-processor subsystem 110 comprises a microtask sequencer 104, optional DMA elements 105-1, ..., 105 -m, computing elements 106-1, ..., 106-n, and optional shared local memory blocks 107-1, ..., 107-k.
  • the computing elements 106-1, ..., 106-n do the actual computing of microtasks.
  • These computing elements 106-1, ..., 106-n can range from generic processors to dedicated hardwired logic units. In comparison with most other multi-processor systems, a combination of different computing element types is also allowed in this architecture.
  • Private local memories that hold code and data are normally included in the computing elements 106-1, ..., 106-n.
  • the DMA elements 105-1, ..., 105-m are then used to move data between the global memory 102, the local memories of the computing elements 106-1, ..., 106-n, and the shared local memory blocks 107-1, ..., 107-k.
  • a local bus/crossbar switch 109 is used as the internal data path between the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n.
  • the microtask sequencer 104 is used to control these function units such as the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n. These function units are connected to the microtask sequencer 104 through the task control bus 108.
  • The microtask sequencer 104 usually gets commands containing descriptions of microtasks from the global memory 102, where these descriptions are generated by the application software running on the host processor 101 according to the programmer's design or the arrangement by the compiler or the interpreter.
  • For a computing task including three separate computations, the task can be divided into three microtasks as follows:
  • microtask 1: Compute #1; microtask 2: Compute #2; microtask 3: Compute #3.
  • The microtask sequencer 104 executes these commands and dispatches task descriptions (i.e., microtasks) to the corresponding function units, based on the destination information specified in the task descriptions.
  • The microtasks for the DMA elements 105-1, ..., 105-m may be as small as moving several tens of bytes of data, and the computing elements 106-1, ..., 106-n then process these data in tens or hundreds of clock cycles. In a system clocked at more than 200MHz, these microtasks may last less than one microsecond or even less than a tenth of a microsecond.
  • If the host processor 101 were used to control these microtasks without the help of the microtask sequencer 104, its processing capability could be fully occupied by this control. In this regard, the multiprocessor computing system with the microtask sequencer can at least release the host processor from the scheduling and management of the function units.
  • the function unit may notify the microtask sequencer through the task control bus or any other known signaling method such as interrupt, so that the microtask sequencer may perform further dispatching or control.
  • The microtask sequencer and the function units may be implemented by any physical or logical processing units, such as one or more central processing units (CPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc.
  • CPUs central processing units
  • DSPs digital signal processors
  • ASICs application-specific integrated circuits
  • the host system bus, task control bus and local bus/crossbar switch can employ any known bus or switch structures or architectures.
  • Suppose a task is to be executed in which a DMA element such as the DMA elements 105-1, ..., 105-m will load input data for a computing element such as the computing elements 106-1, ..., 106-n before the corresponding microtask on the computing element starts to run. After the computing element finishes the job, the DMA element will move the result data to the next destination.
  • The computing task can be divided into three microtasks as follows:
  • microtask 1': Data In; microtask 2': Compute; microtask 3': Data Out.
  • A microtask running on a computing element and a corresponding microtask running on a DMA element may be dependent on each other.
  • The computation of microtask 2' requires that the data in of microtask 1' has already been carried out, and
  • the data out of microtask 3' requires that the computation of microtask 2' has already been carried out.
  • These function units must operate in a correct order to get a correct result. For microtasks that do not depend on each other, for example those running on two independent computing elements 106-1, ..., 106-n or two DMA elements 105-1, ..., 105-m, the microtasks can be issued simultaneously to achieve better utilization of these hardware resources.
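The ordering constraint above can be sketched in miniature. The following Python model is an illustration only, not the patented hardware: the "post"/"wait"/"run" opcode names and the round-robin scheduler are assumptions. It interleaves a DMA command queue and a compute command queue, using semaphore post/wait entries to force Data In, then Compute, then Data Out, while independent entries remain free to proceed:

```python
def run(queues):
    """Round-robin scheduler sketch: each queue advances by at most one
    entry per pass, and stalls while its head is a blocked wait."""
    sem = set()                      # posted (and not yet consumed) semaphores
    order = []                       # global completion order of microtasks
    pcs = {name: 0 for name in queues}
    while any(pcs[name] < len(ops) for name, ops in queues.items()):
        progressed = False
        for name, ops in queues.items():
            if pcs[name] >= len(ops):
                continue
            op, arg = ops[pcs[name]]
            if op == "wait" and arg not in sem:
                continue             # blocked: let the other queues run
            if op == "wait":
                sem.discard(arg)     # a successful wait also clears the bit
            elif op == "post":
                sem.add(arg)
            else:                    # "run": the microtask itself
                order.append(arg)
            pcs[name] += 1
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: every queue is blocked")
    return order

# DMA queue: load data, post s0, wait for s1, then move results out.
dma = [("run", "data_in"), ("post", "s0"), ("wait", "s1"), ("run", "data_out")]
# Compute queue: wait for s0, compute, then post s1.
cpu = [("wait", "s0"), ("run", "compute"), ("post", "s1")]
print(run({"dma": dma, "cpu": cpu}))   # ['data_in', 'compute', 'data_out']
```

Note that neither function unit busy-waits in this model: a blocked wait simply leaves the entry at the head of its queue, which mirrors how the sequencer keeps synchronization out of the computing elements themselves.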
  • The invention provides a mechanism of synchronization to the microtask sequencer, so that the microtask sequencer is capable of coordinating the parallel execution of microtasks, thus releasing the function units from the burden of synchronization.
  • FIG. 2 shows an example of the configuration of the microtask sequencer 200 including the mechanism of synchronization.
  • the microtask sequencer 200 connects to the host system via the host system bus, and comprises a task acquisition device 210 for fetching the commands from the location as specified by the host processor via a memory bus of the host system bus, and a task scheduling device 220 for dispatching the microtasks described in the commands to relevant function units and managing the synchronization among the microtasks.
  • The microtask sequencer 200 may further comprise a configuration device 230 by which the host processor may initialize or control the microtask sequencer 200 and the associated function units via a config bus of the host system bus.
  • The task acquisition device 210 comprises a memory bus interface 201, a FIFO loader 203-1 and an income FIFO 203-2.
  • the task scheduling device 220 comprises a task dispatcher 205, multiple task queues 206a, 206b, 206c, corresponding execution control units (ECU) 207a, 207b, 207c and a shared semaphore array 208.
  • the task scheduling device 220 may also comprise a flag array 209 that holds some flags corresponding to some or all the function units.
  • The configuration device 230 comprises a config bus interface 202 and a system configuration unit 204.
  • the task control bus 210a, 210b, 210c connects the execution control units 207a, 207b, 207c to corresponding function units such as the DMA elements 105-1, ..., 105-m and computing elements 106-1, ..., 106-n, respectively.
  • Each function unit corresponds to a task queue 206 and an execution control unit 207 in the microtask sequencer 104.
  • the FIFO loader 203-1 is a DMA controller that is used to load programmed commands from global memory 102 to the income FIFO 203-2.
  • The FIFO loader 203-1 accepts requests from the task dispatcher 205 to load the commands as long as the task dispatcher 205 is capable of dealing with a further command.
  • the task acquisition device 210 may be initialized, preferably through the configuration device 230, by the host processor to start the loading from one specified location inside the global memory 102 address space and it will load contents up to a given count.
  • The FIFO loader 203-1 will report the status to the system configuration unit 204 of the configuration device 230, and a host interrupt is asserted to inform the host processor 101 of the status.
  • The above interaction between the task acquisition device 210 and the host processor 101 may be achieved by other signaling mechanisms known in the art.
  • the task dispatcher 205 dispatches these commands from the income FIFO 203-2 to the task queues 206 of each execution control unit 207.
  • the task queue 206a, 206b, 206c is a FIFO storage that holds the commands to be issued to the corresponding function unit.
  • the depth of the FIFO is a tradeoff of efficiency and area. In the embodiment, we assume a size of 32 to 64 entries.
  • The task dispatcher 205 monitors the status of the associated task queues 206a, 206b, 206c and fills them once the immediately available commands from the FIFO loader 203-1 match queues that have empty slots.
  • the command consists of a segment that indicates which task queue 206 it shall be dispatched to.
  • the execution control unit 207 then checks the status of its corresponding function unit and dispatches commands to it accordingly.
  • Each function unit attached to the microtask sequencer 200 is controlled by a corresponding execution control unit 207.
  • The execution control unit 207 is a simple sequentially programmable controller that fetches commands from the corresponding task queue 206 and either forwards the command to the function unit or handles it by itself according to the type of command, for example if an inter-function unit synchronization operation needs to be performed or, preferably, if some flow control within the task needs to be performed. In the latter case, the execution control unit 207 has access to the semaphore array 208 and preferably the flag array 209.
  • The execution control unit 207 can post semaphores or poll for some specific semaphores. With the flag array 209, the execution control unit 207 can also support conditional commands to implement an "IF THEN ELSE" style of execution path. This can be utilized to decouple the control of the host processor 101 even more.
  • The semaphore array 208 is an array of one-bit flags that can be set and reset by the execution control units 207a, 207b, 207c according to semaphore commands in the task queues 206a, 206b, 206c.
  • The semaphore array 208 is a shared resource in the whole microtask sequencer 200. Every execution control unit 207 can access every bit of the semaphore array 208 under command control.
  • An execution control unit 207 can either set multiple semaphore bits or can wait for multiple semaphore bits.
  • The semaphore bits are set by the semaphore post commands, and the semaphore wait commands will be blocked until the semaphore bits they wait for are all set.
  • The semaphore wait commands will also reset the corresponding semaphore bits before continuing.
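These post/wait semantics can be summarized in a short sketch. The class and method names below are illustrative assumptions (the real array is shared hardware polled by the ECUs): post sets the masked bits, and a wait succeeds only once every masked bit is set, clearing those bits before the ECU continues.

```python
class SemaphoreArray:
    """Toy model of the shared one-bit semaphore array."""

    def __init__(self, width=32):
        self.bits = 0
        self.mask_all = (1 << width) - 1

    def post(self, mask):
        """Set every bit named in the mask."""
        self.bits |= mask & self.mask_all

    def try_wait(self, mask):
        """Non-blocking probe an ECU could retry each cycle: returns True
        and clears the masked bits once they are all set."""
        if self.bits & mask == mask:
            self.bits &= ~mask       # a successful wait resets the bits
            return True
        return False                 # not all set yet: the ECU stays blocked

sem = SemaphoreArray()
sem.post(0b1101)                     # set bits 0, 2 and 3 (cf. the example command)
print(sem.try_wait(0b1101))          # True: all three masked bits were set
print(sem.try_wait(0b0001))          # False: bit 0 was cleared by the wait
```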
  • Each function unit can return a single-bit or multi-bit result when execution of the microtask finishes.
  • This value shall be set in the flag array 209.
  • The data in the flag array 209 can be used to implement conditional execution of commands and thus allow an "IF-THEN-ELSE"-like execution flow of microtasks. More specifically, the flag array 209 is an array of single- or multi-bit results. Each entry corresponds to one function unit. Thus the number of entries in the flag array 209 may be equal to or less than the number of function units, the number of task queues 206a, 206b, 206c, or the number of execution control units 207a, 207b, 207c, as required.
  • the result is set by the function unit before associated microtask execution ends.
  • The definition of the result value is not limited to those described in this document. It shall be determined by the function unit designer. Some possible examples are status reports, simple return values, etc.
  • All of the execution control units 207a, 207b, 207c can read any entry of the flag array. For the execution control unit 207, there is a command to fetch and store the desired flag array 209 contents for future reference. This allows new result updates while keeping the local serialized execution flow consistent. With the flag array 209 and the conditional execution capability, "IF THEN ELSE" and "SWITCH CASE" execution control can be implemented and, in applicable scenarios, the host processor 101 can be decoupled even more.
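How a fetched flag can drive IF-THEN-ELSE flow can be illustrated with a small model. The opcode names ("fetch_flag", "skip_if_ne", "skip") are hypothetical stand-ins for the fetch and conditional-skip commands described in the text: a conditional skip drops the following queue entries when the latched flag does not match, so only one branch's microtasks ever reach the function unit.

```python
def execute(queue, flag_array):
    """Sketch of one ECU draining its task queue with conditional skips."""
    local_flag = 0                   # condition section of the local flag register
    issued = []                      # commands forwarded to the function unit
    pc = 0
    while pc < len(queue):
        op, *args = queue[pc]
        if op == "fetch_flag":       # latch a function unit's result for later use
            local_flag = flag_array[args[0]]
        elif op == "skip_if_ne":     # skip n entries unless the latched flag matches
            want, n = args
            if local_flag != want:
                pc += n
        elif op == "skip":           # unconditional skip
            pc += args[0]
        else:                        # ordinary microtask command
            issued.append(op)
        pc += 1
    return issued

program = [
    ("fetch_flag", 1),               # read the flag reported by function unit 1
    ("skip_if_ne", 1, 2),            # IF flag != 1: jump past the THEN branch
    ("then_task",),
    ("skip", 1),                     # end of THEN branch: jump over the ELSE branch
    ("else_task",),
]
print(execute(program, [0, 1]))      # ['then_task']
print(execute(program, [0, 0]))      # ['else_task']
```

A "SWITCH CASE" flow is the same idea repeated: a chain of conditional skips, one per case.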
  • The configuration device 230 is a preferred interface to the host system for configuration and control. More specifically, the system configuration unit 204 therein has access, through the configuration bus interface 202, to the host system, and can be used by the host system to initialize the microtask sequencer and control the operations of different internal/external function units. In general, the configuration device 230 may have a slave mode I/O interface to the system bus and one interrupt request line for status reporting.
  • The microtask sequencer 104 is able to issue jobs to the function units (the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n) automatically with minimal intervention by the host processor 101.
  • All commands of the microtask sequencer are 32 bits in length.
  • the command is segmented into four fixed fields as shown in table 1.
  • the QID field is the ID number of a task queue which the command shall be dispatched to. This field is interpreted by the task dispatcher.
  • The task dispatcher also inspects the NUMARG field, which indicates how many 32-bit arguments follow. The task dispatcher will treat the following NUMARG 32-bit words as the arguments of the preceding command, and it will dispatch these data together to the task queue without further interpretation.
  • The ECU field in the command word is used to distinguish whether the Execution Control Unit shall interpret the command itself or just send the command out to the corresponding processing element or function unit for processing.
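The dispatch path for such a command word can be sketched as follows. Only the field names QID, NUMARG, ECU and DATA come from the text; the bit positions and widths below are assumptions for illustration, since Table 1 is not reproduced here.

```python
def decode(words):
    """Decode a 32-bit command stream into (qid, ecu, data, args) tuples."""
    out, i = [], 0
    while i < len(words):
        w = words[i]
        qid    = (w >> 28) & 0xF     # assumed position: target task queue ID
        numarg = (w >> 24) & 0xF     # assumed position: count of argument words
        ecu    = (w >> 23) & 0x1     # assumed position: 1 = interpreted by the ECU
        data   = w & 0xFFFF          # assumed position: payload / bit mask
        args   = words[i + 1 : i + 1 + numarg]
        out.append((qid, ecu, data, args))
        i += 1 + numarg              # argument words pass through undecoded
    return out

# One command for queue 2 carrying two argument words, then one ECU command.
stream = [(2 << 28) | (2 << 24) | 0x0034, 0xDEAD, 0xBEEF,
          (1 << 28) | (1 << 23) | 0x000D]
print(decode(stream))                # [(2, 0, 52, [57005, 48879]), (1, 1, 13, [])]
```

As in the text, the dispatcher interprets only QID and NUMARG; whether DATA is a semaphore mask, a flag index or a function unit opcode is decided later, by the ECU field.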
  • the command is for the execution control unit
  • This command sets the corresponding Semaphore Array bits to one. The bits affected are given by the bit mask in the DATA section. If any corresponding bit is already set to '1', this command will wait until all the corresponding bits are cleared to '0' by another module.
  • This command sets bits 0, 2 and 3 of the Semaphore Array.
  • This command waits for the corresponding Semaphore Array bits to be set.
  • The bits monitored are given by the bit mask in the DATA section.
  • The command shall hold the execution flow of the ECU until all the masked bits are set.
  • The corresponding Semaphore Array bits are then cleared to zero.
  • There is one global timeout counter that defines a maximum waiting time in system clock cycles. If the semaphore bits are not triggered within the defined timeout value, a status report event shall be generated, and that signal can be used to trigger a host interrupt to let the host processor intervene in the process.
  • This command waits until bits 0, 2 and 3 of the Semaphore Array are all set.
  • This command clears the corresponding Semaphore Array bits to zero. The bits are given by the bit mask in the DATA section. The command can be used to initialize the Semaphore Array to the correct status.
  • This command waits for the corresponding Semaphore Array bits to be cleared.
  • The bits monitored are given by the bit mask in the DATA section.
  • The command shall hold the execution flow of the ECU until all the masked bits are cleared.
  • The command does nothing to the Semaphore Array.
  • There is one global timeout counter that defines a maximum waiting time in system clock cycles. If the semaphore bits are not triggered within the defined timeout value, a status report event shall be generated, and that signal can be used to trigger a host interrupt to let the host processor intervene in the process.
  • This command waits until bits 0, 2 and 3 of the Semaphore Array are all cleared to zero.
  • DATA: Bits 12:15 contain an index into the flag array, which must fall inside the valid range. The index is zero-based. The other bits are unused and should be set to zero. Description: This command retrieves the 4-bit value from the specified Flag Array element and stores it in the condition section of the local flag register for conditional execution usage.
  • The Flag Array holds the flags returned by function units when the function units finish a microtask.
  • This command retrieves the element of index 3 from the Flag Array.
  • This command skips the next 15 command/data entries in the task queue.
  • Condition evaluation: assuming the mask is ABCD, the match value is EFGH and the actual flags are efgh, we can define the condition evaluation as:
  • V = !(A & (E ⊕ e)) & !(B & (F ⊕ f)) & !(C & (G ⊕ g)) & !(D & (H ⊕ h))
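The condition evaluation can be transcribed directly into code, reading the garbled operator in the original as XOR (⊕): a masked bit contributes a failure only when the match bit and the actual flag bit differ, so unmasked bits are ignored.

```python
def condition(mask, match, flags):
    """True iff every masked flag bit equals the corresponding match bit.

    Bitwise form of V = !(A & (E^e)) & !(B & (F^f)) & !(C & (G^g)) & !(D & (H^h)),
    written over 4-bit values instead of individual bits.
    """
    return (mask & (match ^ flags)) == 0

print(condition(0b1010, 0b1000, 0b1001))  # True: only bits 1 and 3 are checked
print(condition(0b1010, 0b1000, 0b0001))  # False: bit 3 differs from the match value
```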
  • the specific device or modules indicated in the microtask sequencer 200 herein may be implemented in hardware and/or software.
  • a specific device or module may be performed using software and/or firmware executed on one or more processing modules.
  • the processing module can be a single processing device or a plurality of processing devices.
  • Such a processing device may be a microprocessor, microcontroller, digital processor, microcomputer, a portion of the central processing unit, a state machine, logic circuitry, hardwired logic and/or any device that can perform a desired processing or control.
  • Although the multiprocessor subsystem in the described embodiments comprises the microtask sequencer, the DMA elements, the computing elements and the shared local memory blocks, the DMA elements and the shared local memory blocks are not necessary if the computing elements are configured to directly access the global memory.
  • Although microtasks are dispatched to the respective function units based on the destination information specified in the task descriptions in the embodiments, it is also possible to specify only the types of the function units (e.g., DMA units, calculating units, etc.) for performing the microtasks in the task descriptions, and to have the task dispatcher assign function units for performing the microtasks according to the availability of the respective function units.
  • Information on the type, quantity and status of the function units may be defined in advance, set by the host through the system configuration unit, or obtained or maintained by polling or other status detecting mechanisms. If it is necessary to designate that several microtasks should be performed by the same function unit, it is possible to specify the logical identification of this function unit in the task descriptions and to have the task dispatcher bind this logical identification to a physical function unit.
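The type-based dispatch with logical-to-physical binding just described can be sketched as follows. Every name here (the unit names, the type strings, the logical IDs) is a hypothetical illustration, not from the patent: task descriptions name only a unit type plus an optional logical identification, and the dispatcher binds each logical ID to a free physical unit on first use so that later microtasks with the same logical ID land on the same unit.

```python
def dispatch(tasks, units):
    """tasks: list of (unit_type, logical_id_or_None); units: {name: type}.
    Returns {task_index: physical_unit_name}."""
    bound = {}                           # logical ID -> bound physical unit
    assignment = {}
    for i, (utype, logical) in enumerate(tasks):
        if logical in bound:             # same logical ID -> same physical unit
            assignment[i] = bound[logical]
            continue
        # pick any unit of the requested type not already bound to a logical ID
        taken = set(bound.values())
        free = [u for u, t in units.items() if t == utype and u not in taken]
        if not free:
            raise RuntimeError(f"no free unit of type {utype}")
        assignment[i] = free[0]
        if logical is not None:
            bound[logical] = free[0]
    return assignment

units = {"dma0": "DMA", "dma1": "DMA", "ce0": "COMPUTE"}
tasks = [("DMA", "L1"), ("COMPUTE", None), ("DMA", "L1")]  # tasks 0 and 2 share L1
a = dispatch(tasks, units)
print(a[0] == a[2])                      # True: the shared logical ID binds to one unit
```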

Abstract

A multi-processor computing system is disclosed. The multi-processor computing system comprises a host processor, a global memory and at least one processing unit, wherein the multi-processor computing system further comprises a microtask sequencer comprising a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the indicated processing unit can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.

Description

A MULTI-PROCESSOR COMPUTING SYSTEM AND ITS TASK ALLOCATING METHOD
Technical field
The invention relates generally to a multi-processor computer and its task allocating method, and particularly to a computing system with an array of homogeneous and/or heterogeneous computing elements that are configured dynamically at runtime to fulfill special purpose computing tasks, and its task allocating method.
Background art
Multi-processor computing systems are generally designed to satisfy two classes of computing requirements: (1) general purpose applications, and (2) special purpose applications. For general purpose applications, SMP with homogeneous processors is the most common architecture. For special purpose applications, such as multimedia and digital signal processing, it is quite common that heterogeneous processors or hardware offload elements are combined to address different aspects of the applications.
With special purpose multi-processor computing systems, the architecture is usually built upon an array of interconnected computing elements. These computing elements can be programmable processors or fixed function units. Despite the differences between the computing elements, the most distinctive point that differentiates these architectures from each other is the way in which the computing elements are connected and controlled to handle the computing tasks.
Many of the prior art architectures can only program the processor array in a fully synchronous "stream" mode, where there is some kind of globally shared "heartbeat" that each computing element synchronizes to, and the computing elements exchange data at the beginning of each heartbeat according to the preconfigured interconnection topology. The heartbeat can be as short as one clock cycle or longer, according to the architectural design and stream pipeline delays. One advantage of heartbeat synchronization is that it is easier to implement and program, as the computing elements are decoupled from the synchronization tasks. The drawback is that it lacks the flexibility to run asynchronous tasks, thus limiting the application fields. Another drawback is that its efficiency is adversely affected if the pipeline stages cannot be well balanced.
On the other side, there is another class of architecture that is programmed in an asynchronous plus synchronous mode, where the whole system does not rely on a synchronous heartbeat. Computing elements can be partitioned into groups, where a synchronization can be an intra-group synchronization or an inter-group synchronization. Thus multiple threads of tasks can be handled on different computing elements or in different element groups until a data dependency or an access to a shared resource takes place, and only then is synchronization required. Many of the implementations in this class take the approach of hardwired barrier synchronization, where one or several shared global synchronization logic circuits are responsible for the synchronization of all computing elements. The computing elements pause at some specified point in their execution flow until all of the elements in a synchronization group reach that point, and then the execution flow resumes from that point. For specific applications, the synchronization logic design and computing element partition can be optimized to give very good performance. The drawback of this implementation is its lack of flexibility when the configuration and applications change.
Besides hardwired implementations, another approach is based on software, where the synchronization is done with software running on a host processor or on the computing elements themselves. Normally the host processor or each of the computing elements checks the status of the others by polling and/or interrupts, and then asserts the synchronization signal by way of message communication or the like. Compared to the hardwired approach, the software solution is the most flexible one. However, the side effects are: (1) interrupt handling or polling requires a large amount of processor cycles, so the granularity of synchronization can only be very coarse; (2) the synchronization logic is entangled with low-level functional/computing code, so the complexity of software maintenance increases greatly. The present invention aims to address these problems.
Summary of Invention
A first aspect of the invention provides a multi-processor computing system comprising a host processor, a global memory and at least one processing unit, wherein the multiprocessor computing system further comprises a microtask sequencer that comprises a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the indicated processing unit can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
Preferred embodiments enable the host processor to be released from dispatching tasks to the computing elements. Preferably, the host processor and the computing elements are also released from synchronization control.
The task acquisition device may comprise: a task dispatcher; at least one task queue; and at least one execution control unit (ECU), each of which is associated with a different one of the at least one task queue and with a different one of the at least one processing unit. The task dispatcher may be configured to accept the command fetched by the task acquisition device and put said each microtask instruction into the one of the at least one task queue associated with said indicated processing unit, and may be further configured to perform said detection and notification. Each of the at least one execution control units may be configured to obtain one microtask instruction from the associated task queue in a FIFO manner and deliver the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable of processing a further microtask instruction, and may be further configured to collect the completion of the obtained microtask instruction for the purpose of said detection. The command may further include synchronization instructions for achieving synchronization among two or more of the at least one processing units; the task dispatcher may be further configured to put the synchronization instructions into the task queues associated with the synchronization instructions, and the task scheduling device may further comprise a synchronization control means operated by the execution control units, based on the synchronization instructions obtained from the respective task queues, for achieving the synchronization.
The synchronization may be based on a semaphore technique, and the synchronization control means may be implemented based on a shared semaphore array.
The command may further include a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units. The task dispatcher may be further configured to put the flow control instruction into the task queue associated with the flow control instruction, and the task scheduling device may further comprise a memory for storing the values, wherein the execution control unit may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
The command may further include a flow control instruction for the task dispatcher to selectively skip a part of the microtask instructions based on one or more values reported from all or some of the processing units, and the task scheduling device may further comprise a memory for storing the values, wherein the task dispatcher may be further configured to skip the part of the microtask instructions based on the flow control instruction and the associated values in the memory.
The multi-processor computing system may be implemented in a system on chip accelerator or a microprocessor.
The access by the processing unit to the global memory may be implemented via DMA. The multi-processor computing system may further comprise a configuration device for initializing the microtask sequencer.
A preferred embodiment of the invention provides an improved architecture of a multiprocessor computer. It includes a dedicated programmable task sequencer, referred to as the "Microtask Sequencer".
Further, the microtask sequencer is programmed by the host processor with information describing the tasks to be issued to the computing elements. The microtask sequencer then dispatches these tasks and monitors their execution status automatically, without intervention by the host processor. The microtask sequencer also has software programmable synchronization logic that is based on a signal posting and waiting mechanism. This synchronization logic is used when synchronization between tasks is required; the synchronization logic is thus decoupled from the computing elements. Compared to the known prior art, the invention provides a computing task handling mechanism with performance comparable to that of hardwired logic and flexibility comparable to fully software implemented controls. By decoupling the synchronization logic from the computing elements, the invention makes it simple to integrate heterogeneous computing elements into the same architecture. It also allows the software programmers of the computing elements to focus only on the functional part of the algorithm. Another potential improvement is power efficiency over software approaches, since no instruction cycles are spent waiting for synchronization signals.
Another aspect of the invention provides a task allocating method in a multi-processor computing system comprising a host processor, a global memory, a microtask sequencer and at least one processing unit, the method comprising: providing, by the host processor, commands including microtask descriptions in the global memory; fetching, by the microtask sequencer, a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands; dispatching, by the microtask sequencer, each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit dispatched to can execute the microtask instruction; detecting, by the microtask sequencer, the completion of all the microtasks in the command; and notifying, by the microtask sequencer, the host processor of the completion, wherein the processing unit can access the global memory by itself or via another of the at least one processing units.
Description of Accompanying Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention, and together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the invention.
Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. In the drawings:
Figure 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention; and
Figure 2 shows an example of the configuration of the microtask sequencer including the mechanism of synchronization.
Embodiments for Carrying Out the Invention
In the following description, certain specific details are set forth in order to provide a thorough understanding of various embodiments of the invention. However, one skilled in the art will understand that the invention may be practiced without these details. In other instances, well-known structures associated with computers, processors and the like have not been shown or described in detail to avoid unnecessarily obscuring the descriptions of the embodiments of the invention.
For reasons of resource limitation, exploiting parallelism, increasing code and data locality, code manageability, and so on, it is common practice for applications running on multiprocessor systems to divide complex computing tasks into small pieces. Each of the small tasks is then dispatched to one of the computing elements for execution according to the designed algorithm flow. For example, in typical real-time video coding applications, one small task may process only one 4x4 or 8x8 image block, and the total number of small tasks per second can be in the hundreds of thousands. Mapping these tasks to a multi-processor system, each small task may run for only several microseconds from start to finish. In addition, these small tasks may have complex dependencies between them; for example, some of them need the output of others as their input data. Because the number of small tasks is so large and the granularity of their execution control is so fine, these small tasks are distinctly different from the common definition of tasks in traditional operating systems. To make this less confusing, we define a term for a small task that represents a single run on a single computing element: one "microtask".
Because of the large quantity and the extremely fine granularity of control of the microtasks, the approach of utilizing task scheduling in a traditional operating system would be far too heavy for most host processors to handle. Our invention solves this problem by introducing hardware sequencer logic that works as an autonomous controller. It releases the host processor from the heavy load of microtask scheduling. All that the host processor needs to do is program the sequencer and handle some infrequent interrupts with predefined conditions.
Figure 1 shows an example of a typical system architecture where the invention can be implemented, according to an embodiment of the invention. The system 100 comprises a host processor 101, a global memory 102, and a multi-processor subsystem 110, which are connected to a host system bus 103. The host processor 101 runs the OS and high level application logic. The global memory 102 is used to hold code and data of the application.
The multi-processor subsystem 110 comprises a microtask sequencer 104, optional DMA elements 105-1, ..., 105-m, computing elements 106-1, ..., 106-n, and optional shared local memory blocks 107-1, ..., 107-k. The computing elements 106-1, ..., 106-n do the actual computing of the microtasks. These computing elements 106-1, ..., 106-n can range from generic processors to dedicated hardwired logic units. In comparison with most other multi-processor systems, a combination of different computing element types is also allowed in this architecture.
Preferably, to reduce the data traffic between the computing elements 106-1, ..., 106-n and the global memory 102, private local memories that hold code and data are normally included in the computing elements 106-1, ..., 106-n. Preferably, for the same reason, there can also be some shared local memory blocks 107-1, ..., 107-k to store intermediate or common data. Accordingly, the DMA elements 105-1, ..., 105-m are used to move data between the global memory 102, the local memories of the computing elements 106-1, ..., 106-n, and the shared local memory blocks 107-1, ..., 107-k. A local bus/crossbar switch 109 is used as the internal data path between the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n.
The microtask sequencer 104 is used to control the function units, such as the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n. These function units are connected to the microtask sequencer 104 through the task control bus 108. The microtask sequencer 104 usually gets commands containing descriptions of microtasks from the global memory 102, where these descriptions are generated by the application software running on the host processor 101 according to the programmer's design or the arrangement by the compiler or the interpreter. By way of example, a computing task including three separate computations can be divided into three microtasks as follows:
microtask 1: Compute #1
microtask 2: Compute #2
microtask 3: Compute #3
Then the microtask sequencer 104 executes these commands and dispatches the task descriptions (i.e., microtasks) to the corresponding function units, based on the destination information specified in the task descriptions. As described before, the microtasks for the DMA elements 105-1, ..., 105-m may be as small as moving several tens of bytes of data, and the computing elements 106-1, ..., 106-n then process these data in tens or hundreds of clock cycles. In a system clocked at more than 200 MHz, these microtasks may last for less than one microsecond, or even less than a tenth of a microsecond. Apparently, if the host processor 101 were used to control these microtasks without the help of the microtask sequencer 104, its processing capability could be fully occupied with this control. In this respect, the multiprocessor computing system with the microtask sequencer can at least release the host processor from the scheduling and management of the function units.
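By way of illustration only, the dispatching behavior described above can be sketched in Python. The command fields and unit names here are hypothetical, since the actual task description format is implementation specific:

```python
from collections import deque

# Hypothetical microtask descriptions; each carries the destination
# information that the task dispatcher uses for routing.
commands = [
    {"unit": "DMA-1", "op": "data_in", "bytes": 64},
    {"unit": "CE-1", "op": "compute", "block": "8x8"},
    {"unit": "DMA-1", "op": "data_out", "bytes": 64},
]

# One FIFO task queue per function unit, as in the subsystem of Figure 1.
queues = {"DMA-1": deque(), "CE-1": deque()}

def dispatch(commands, queues):
    """Route each microtask to the queue of the function unit named in
    its description, without involving the host processor."""
    for cmd in commands:
        queues[cmd["unit"]].append(cmd)

dispatch(commands, queues)
assert len(queues["DMA-1"]) == 2 and len(queues["CE-1"]) == 1
```

In the real system the queues live inside the microtask sequencer; the host processor only writes the command list to the global memory and is not involved in this routing step.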
Upon finishing its microtask, the function unit may notify the microtask sequencer through the task control bus or any other known signaling method such as interrupt, so that the microtask sequencer may perform further dispatching or control.
In general, the microtask sequencer and the function units may be implemented by any physical or logical processing units, such as one or more central processing units (CPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), and the like. The host system bus, task control bus and local bus/crossbar switch can employ any known bus or switch structures or architectures.
Further, there may be dependencies between microtasks running on different function units. For example, suppose a task is to be executed where a DMA element (such as one of the DMA elements 105-1, ..., 105-m) loads input data for a computing element (such as one of the computing elements 106-1, ..., 106-n) before the corresponding microtask on the computing element starts to run. After the computing element finishes the job, the DMA element moves the result data to the next destination. In this example, the computing task can be divided into three microtasks as follows:
microtask 1': Data In
microtask 2': Compute
microtask 3': Data Out
Thus, a microtask running on a computing element and a corresponding microtask running on a DMA element may depend on each other. For example, the computation of microtask 2' requires that the data in of microtask 1' have already been carried out, and the data out of microtask 3' requires that the computation of microtask 2' have already been carried out. Thus these function units must operate in the correct order to get a correct result. Microtasks that do not depend on each other, for example those running on two independent computing elements 106-1, ..., 106-n or two DMA elements 105-1, ..., 105-m, can be issued simultaneously to achieve better utilization of the hardware resources.
To deal with such dependencies among multiple microtasks, i.e., multiple function units, the invention provides a synchronization mechanism in the microtask sequencer, so that the microtask sequencer is capable of coordinating parallel execution of microtasks, thus releasing the function units from the burden of synchronization.
Figure 2 shows an example of the configuration of the microtask sequencer 200 including the mechanism of synchronization. As shown in Figure 2, the microtask sequencer 200 connects to the host system via the host system bus, and comprises a task acquisition device 210 for fetching the commands from the location as specified by the host processor via a memory bus of the host system bus, and a task scheduling device 220 for dispatching the microtasks described in the commands to relevant function units and managing the synchronization among the microtasks. The microtask sequencer 200 may further comprise a configuration device 230 by which the host processor may initialize or control the microtask sequencer 200 and the associated function units via a config bus of the host system bus.
The task acquisition device 210 comprises a memory bus interface 201, a FIFO loader 203-1 and an income FIFO 203-2. The task scheduling device 220 comprises a task dispatcher 205, multiple task queues 206a, 206b, 206c, corresponding execution control units (ECUs) 207a, 207b, 207c and a shared semaphore array 208. The task scheduling device 220 may also comprise a flag array 209 that holds flags corresponding to some or all of the function units. The configuration device 230 comprises a config bus interface 202 and a system configuration unit 204.
The task control bus 210a, 210b, 210c connects the execution control units 207a, 207b, 207c to corresponding function units such as the DMA elements 105-1, ..., 105-m and computing elements 106-1, ..., 106-n, respectively. Each function unit corresponds to a task queue 206 and an execution control unit 207 in the microtask sequencer 104.
The FIFO loader 203-1 is a DMA controller that is used to load programmed commands from the global memory 102 to the income FIFO 203-2. The FIFO loader 203-1 accepts requests from the task dispatcher 205 to load the commands as long as the task dispatcher 205 is capable of dealing with a further command. The task acquisition device 210 may be initialized, preferably through the configuration device 230, by the host processor to start the loading from a specified location inside the global memory 102 address space, and it will load contents up to a given count. Preferably, once the transfer is finished, the FIFO loader 203-1 will set the status in the system configuration unit 204 of the configuration device 230, and a host interrupt is asserted to inform the host processor 101 of the status. Alternatively, the above interaction between the task acquisition device 210 and the host processor 101 may be achieved by other signaling mechanisms known in the art. Preferably, there can be some buffers inside the income FIFO 203-2 to allow efficient burst access to the memory bus.
After the FIFO loader 203-1 fetches the programmed commands from the global memory 102, the task dispatcher 205 dispatches these commands from the income FIFO 203-2 to the task queues 206 of each execution control unit 207. Each task queue 206a, 206b, 206c is a FIFO storage that holds the commands to be issued to the corresponding function unit. The depth of the FIFO is a tradeoff between efficiency and area. In this embodiment, we assume a size of 32 to 64 entries. The task dispatcher 205 monitors the status of the associated task queues 206a, 206b, 206c and fills them once the immediately available commands from the FIFO loader 203-1 match queues that have empty slots. The command contains a segment that indicates which task queue 206 it shall be dispatched to.
The execution control unit 207 then checks the status of its corresponding function unit and dispatches commands to it accordingly. Each function unit attached to the microtask sequencer 200 is controlled by a corresponding execution control unit 207. The execution control unit 207 is a simple sequentially programmable controller that fetches commands from the corresponding task queue 206 and, according to the type of command, either forwards the command to the function unit or handles it by itself, if for example an inter-function unit synchronization operation needs to be performed or, preferably, if some flow control within the task needs to be performed. In the latter case, the execution control unit 207 has access to the semaphore array 208 and preferably the flag array 209. Under the control of semaphore commands, the execution control unit 207 can post semaphores or poll for some specific semaphores. With the flag array 209, the execution control unit 207 can also support conditional commands to implement an "IF THEN ELSE" style of execution path. This can be utilized to decouple the control of the host processor 101 even more.
As mentioned above, synchronization between the execution control units 207 is maintained by using the semaphore array 208. The semaphore array 208 is an array of one-bit flags that can be set and reset by the execution control units 207a, 207b, 207c according to semaphore commands in the task queues 206a, 206b, 206c. The semaphore array 208 is a shared resource in the whole microtask sequencer 200: every execution control unit 207 can access every bit of the semaphore array 208 under command control. The allocation of the semaphore bits to the task queues 206a, 206b, 206c is under the full control of the programmer, and it is the programmer's responsibility to avoid usage conflicts. An execution control unit 207 can either set multiple semaphore bits or wait for multiple semaphore bits. The semaphore bits are set by the semaphore post commands, and the semaphore wait commands will be blocked until the semaphore bits they wait for are all set. The semaphore wait commands will also reset the corresponding semaphore bits before continuing. With this simple design, complex synchronization requirements between multiple microtasks delivered to multiple queues can be handled easily.
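The post/wait discipline of the semaphore array can be modeled with a short Python sketch (illustrative only; a 16-bit array is assumed, and the hardware's blocking wait is reduced to a non-blocking check so the ordering rule is visible):

```python
class SemaphoreArray:
    """Model of the shared one-bit semaphore array (16 bits assumed)."""
    def __init__(self):
        self.bits = 0

    def post(self, mask):
        # SEMPOST: set the masked bits.
        self.bits |= mask

    def try_wait(self, mask):
        # SEMWAIT: succeed only when all masked bits are set, then clear
        # them before continuing; otherwise the ECU would remain blocked.
        if (self.bits & mask) != mask:
            return False
        self.bits &= ~mask
        return True

# Microtask 2' (Compute) must not start before microtask 1' (Data In):
sem = SemaphoreArray()
assert not sem.try_wait(0b01)   # Compute's ECU blocks: Data In not done
sem.post(0b01)                  # Data In's ECU posts on completion
assert sem.try_wait(0b01)       # Compute's ECU proceeds, bit is consumed
assert sem.bits == 0
```

Because the wait consumes the bits it waited on, each post/wait pair expresses exactly one producer-to-consumer handover between two queues.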
Also as mentioned above, to enable more interaction between the microtask sequencer 200 and the controlled function units, each function unit can return a single-bit or multiple-bit result as the execution of its microtask finishes. This value is written to the flag array 209. The data in the flag array 209 can be used to implement conditional execution of commands and thus allow an "IF-THEN-ELSE"-like execution flow of microtasks. More specifically, the flag array 209 is an array of single- or multiple-bit results, with each entry corresponding to one function unit. Thus the number of entries in the flag array 209 may be equal to or less than the number of function units, the number of task queues 206a, 206b, 206c, or the number of execution control units 207a, 207b, 207c, as required. The result is set by the function unit before the associated microtask execution ends. The definition of the result value, however, is not limited to those described in this document; it shall be determined by the function unit designer. Some possible examples are status reports, simple return values, etc. All of the execution control units 207a, 207b, 207c can read any entry of the flag array. For the execution control unit 207, there is a command to fetch and store the desired flag array 209 contents for future reference. This allows new result updates while keeping the local serialized execution flow consistent. With the flag array 209 and the conditional execution capability, "IF THEN ELSE" and "SWITCH CASE" execution control can be implemented and, in applicable scenarios, the host processor 101 can be decoupled even more.
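The latching behavior of the flag-fetch command can be sketched as follows; the 4-bit entry width matches the FLGLOAD command format, while the entry count is an assumption:

```python
class FlagArray:
    """One entry per function unit; a unit writes its 4-bit result as its
    microtask finishes, and any execution control unit may read any entry."""
    def __init__(self, n_units=8):
        self.entries = [0] * n_units

    def set_result(self, unit, value):
        # Written by the function unit before its microtask execution ends.
        self.entries[unit] = value & 0xF

    def flgload(self, index):
        # The ECU latches a snapshot into its local flag register, so later
        # result updates do not disturb an in-flight conditional sequence.
        return self.entries[index]

fa = FlagArray()
fa.set_result(3, 0b0101)
latched = fa.flgload(3)      # ECU's local copy for conditional execution
fa.set_result(3, 0b1111)     # a newer result does not change the latched copy
assert latched == 0b0101
```

The explicit latch is what keeps the serialized execution flow consistent even while function units keep reporting fresh results.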
The configuration device 230 is a preferred interface to the host system for configuration and control. More specifically, the system configuration unit 204 therein has access, through the configuration bus interface 202, to the host system, and can be used by the host system to initialize the microtask sequencer and control the operations of different internal/external function units. In general, the configuration device 230 may have a slave-mode I/O interface to the system bus and one interrupt request line for status reporting.
In the above embodiment, the microtask sequencer 104 is able to issue jobs to the function units (the DMA elements 105-1, ..., 105-m and the computing elements 106-1, ..., 106-n) automatically with minimal intervention by the host processor 101.
To facilitate understanding of the invention, specific examples of the format and content of the task descriptions (microtask commands) are given in the following. It should be noted that the invention is not limited to these examples.
1. Microtask Sequencer Command Format
All commands of the microtask sequencer are 32 bits in length. The command is segmented into fixed fields as shown in Table 1. The QID field is the ID number of the task queue which the command shall be dispatched to. This field is interpreted by the task dispatcher. The task dispatcher also inspects the NUMARG field, which indicates how many 32-bit arguments follow. The task dispatcher will treat the following NUMARG 32-bit words as the arguments of the preceding command and will dispatch these data together to the task queue without further interpretation. The ECU field in the command word is used to distinguish whether the Execution Control Unit shall interpret the command or just send the command out to the corresponding processing element or function unit to process.
Table 1 Microtask Sequencer Command Format
Bits Mnemonic Descriptions
0:3 QID ID of the task queue which this command shall be dispatched to.
4 ECU 0: The command is for attached processing elements. 1: The command is for the execution control unit.
5:11 CMD The command definition.
12:15 NUMARG The number of 32-bit argument data following the command. 0000: The command does not have additional data.
16:31 DATA Optional 16-bit data that the command uses.
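Assuming that bit 0 denotes the most significant bit of the 32-bit command word (consistent with the bit numbering used in the command examples below), the field layout of Table 1 can be checked with a small pack/unpack sketch in Python:

```python
def pack_command(qid, ecu, cmd, numarg, data):
    """Pack the Table 1 fields into one 32-bit command word.
    Bit 0 is taken as the most significant bit; this is an assumption
    about the patent's bit-numbering convention."""
    assert qid < 16 and ecu < 2 and cmd < 128 and numarg < 16 and data < 65536
    return (qid << 28) | (ecu << 27) | (cmd << 20) | (numarg << 16) | data

def unpack_command(word):
    """Recover the five fields from a 32-bit command word."""
    return {
        "QID":    (word >> 28) & 0xF,    # bits 0:3
        "ECU":    (word >> 27) & 0x1,    # bit 4
        "CMD":    (word >> 20) & 0x7F,   # bits 5:11
        "NUMARG": (word >> 16) & 0xF,    # bits 12:15
        "DATA":   word & 0xFFFF,         # bits 16:31
    }

# A SEMPOST (CMD b'0010000') for the ECU of task queue 2, mask b'1011...':
w = pack_command(2, 1, 0b0010000, 0, 0b1011000000000000)
assert unpack_command(w) == {"QID": 2, "ECU": 1, "CMD": 0b0010000,
                             "NUMARG": 0, "DATA": 0b1011000000000000}
```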
1.1 ECU Commands
1.1.1 Semaphore Commands
1.1.1.1 Semaphore Post
Definition: SEMPOST semaphore vector
CMD: b'0010000'
NUMARG: 0
DATA: 0:15 bit mask of the semaphore vector
Description: This command sets the corresponding Semaphore Array bits to one. The bits affected are given by the bit mask in the DATA section. If any corresponding bit is already set to '1', this command will wait until all the corresponding bits have been cleared to '0' by another module.
Example: SEMPOST b'1011000000000000'
This command sets the 0,2,3 bits of the Semaphore Array.
1.1.1.2 Semaphore Wait
Definition: SEMWAIT semaphore vector
CMD: b'0010001'
NUMARG: 0
DATA: 0:15 bit mask of the semaphore vector
Description: This command waits for the corresponding Semaphore Array bits to be set. The bits monitored are given by the bit mask in the DATA section. The command shall hold the execution flow of the ECU until all the masked bits are set. The corresponding Semaphore Array bits are then cleared to zero. There is one global timeout counter that defines a maximum waiting time in system clock cycles. If the semaphore bits are not all set within the defined timeout value, a status report event shall be generated, and that signal can be used to trigger a host interrupt to let the host processor intervene in the process.
Example: SEMWAIT b'1011000000000000'
This command waits until the 0,2,3 bits of the Semaphore Array are all set.
1.1.1.3 Semaphore Clear
Definition: SEMCLR semaphore vector
CMD: b'0010010'
NUMARG: 0
DATA: 0:15 bit mask of the semaphore vector
Description: This command clears the corresponding Semaphore Array bits to zero. The bits are given by the bit mask in the DATA section. The command can be used to initialize the Semaphore Array to a correct status.
Example: SEMCLR b'1111111111111111'
This command clears all of the bits of the Semaphore Array.
1.1.1.4 Semaphore Pending
Definition: SEMPEND semaphore vector
CMD: b'OOlOOl l1
NUMARG: 0
DATA: 0 : 15 bit mask of semaphore vector
Description: This command waits for the correspondent Semaphore Array bits to be cleared. The bits monitored are given by the bit mask in DATA section. The command shall hold execution flow of ECU until all the masked bits are cleared. The command does nothing to the Semaphore Array. There is one global timeout counter that defines a maximum waiting time in system clock cycles. Once the semaphore bits are not triggered within the defined timeout value, a status report event shall be generated and that signal can be used to trigger a host interrupt to let the host processor intervene the process.
Example: SEMPEND b'1011000000000000'
This command waits until the 0,2,3 bits of the Semaphore Array are all cleared to zero.
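SEMPEND is the read-only complement of SEMWAIT: it waits for zeros instead of ones and leaves the array untouched. A hypothetical sketch (the `bits` callable and `tick` hook are illustrative, not part of the specification):

```python
def sem_pend(bits, mask, timeout_cycles, tick):
    """SEMPEND: block until every bit in `mask` reads as zero.

    Unlike SEMWAIT, the semaphore array is not modified. `bits` is a
    callable returning the current 16-bit array value; `tick()` models
    one system clock cycle. Returns True when all masked bits are
    clear, False when the global timeout expires (which would raise a
    status report event toward the host).
    """
    for _ in range(timeout_cycles):
        if bits() & mask == 0:
            return True  # all monitored bits cleared
        tick()
    return False  # timeout: host intervention required
```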
1.1.2 Flag Array and Conditional Execution Commands
1.1.2.1 Get Conditional Flag Result from Flag Array
Definition: FLGLOAD flag array index number
CMD: b'0011000'
NUMARG: 0
DATA: The 12:15 bits contain an index into the Flag Array, which must fall inside the valid range. The index is zero based. The other bits are unused and should be set to zero.
Description: This command retrieves the 4-bit value from the specified Flag Array element and stores it to the condition section of the local flag register for conditional execution usage.
The Flag Array holds the flags returned by the function units when they finish a microtask.
Example: FLGLOAD 3
This command retrieves the element of index 3 from the Flag Array.
1.1.2.2 Unconditional Jump Forward
Definition: JMP number of words
CMD: b'0011100'
NUMARG: 0
DATA: The 8:15 bits hold the number of words in the Task List that we shall skip.
The other bits are unused and should be set to zero.
Description: This command skips the following command and data words on the current task queue by the given number of words.
Example: JMP 15
This command omits the next 15 command/data words from the task queue.
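The skip-forward semantics can be illustrated with a toy task-queue interpreter. This is a sketch, not the ECU implementation; the tuple encoding of `("JMP", n)` is purely for illustration:

```python
def run_task_list(words):
    """Toy interpreter showing JMP's skip-forward semantics.

    `words` models the task queue: each entry is either ("JMP", n) or
    an opaque command/data word. JMP advances the read pointer past
    the next n words; everything else is "executed" by appending it
    to the result. Returns the words actually processed.
    """
    processed = []
    pc = 0
    while pc < len(words):
        word = words[pc]
        pc += 1
        if isinstance(word, tuple) and word[0] == "JMP":
            pc += word[1]  # skip the following n command/data words
        else:
            processed.append(word)
    return processed
```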
1.1.2.3 Conditional Jump Forward
Definition: JMPWHEN conditions, number of words
CMD: b'0011101'
NUMARG: 0
DATA: The 0:7 bits hold the condition description as shown in Table 2. The 8:15 bits hold the number of words in the Task List that we shall skip.
Description: This command skips the following command and data words on the current task queue by the given number of words when the conditions are met. If the conditions are not met, it does nothing.
Example: JMPWHEN b'10101000', 15
This command omits the next 15 command/data words from the task queue when conditional bit 0 == 1 and bit 2 == 0.
Table 2 Condition Description Format
Bits Description
0:3 A four bit mask of which condition flags we need to evaluate.
4:7 A four bit value that we expect to be matched.
Assume the mask is ABCD, the match value is EFGH and the actual flags are efgh; the condition evaluation V can then be defined as:
V = !(A & (E ^ e)) & !(B & (F ^ f)) & !(C & (G ^ g)) & !(D & (H ^ h))
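Since each term only vetoes the result where a masked flag differs from its expected value, the whole expression collapses to one bitwise comparison. A minimal check of the rule (treating mask, match and flags as 4-bit integers with A/E/e in the most significant bit):

```python
def eval_condition(mask, match, flags):
    """Evaluate V for 4-bit mask/match/flags values.

    A flag participates only where its mask bit is 1; the condition
    holds when every selected flag equals the corresponding match bit.
    This is exactly: AND over all bits of !(mask & (match ^ flags)),
    i.e. (mask & (match ^ flags)) == 0.
    """
    return (mask & (match ^ flags)) == 0
```

With the JMPWHEN example b'10101000' (mask 1010, match 1000), the jump is taken only when flag bit 0 is 1 and flag bit 2 is 0, regardless of bits 1 and 3.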
It should be understood that the specific devices or modules indicated in the microtask sequencer 200 herein may be implemented in hardware and/or software. For example, a specific device or module may be implemented using software and/or firmware executed on one or more processing modules. The processing module can be a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, microcontroller, digital processor, microcomputer, a portion of the central processing unit, a state machine, logic circuitry, hardwired logic and/or any device that can perform a desired processing or control.
While the invention has been described according to the embodiment in which the multiprocessor subsystem comprises the microtask sequencer, the DMA elements, the computing elements and the shared local memory blocks, the DMA elements, the shared local memory blocks and the local bus/crossbar switch are unnecessary if the computing elements are configured to access the global memory directly.
While the shared semaphore array is employed for synchronization in the above embodiment, other known synchronization mechanisms can be implemented in the invention.
While particular numbers of task queues, execution control units, and function units of each type are described in the embodiment, these numbers do not limit the invention; any number can be used in an implementation as required. While the location in the global memory from which the commands are fetched is specified by the host processor, it can also be statically determined in advance. Other relevant parameters can be predetermined similarly.
While the invention has been described in a general multi-processor computing system, it can be particularly applicable to system on chip (SOC) accelerators or microprocessors.
While the microtasks are dispatched to respective function units based on the destination information specified in the task descriptions in the embodiments, it is also possible to specify only the types of the function units (e.g., DMA units, calculating units, etc.) for performing the microtasks in the task descriptions, and to let the task dispatcher designate the function units for performing the microtasks according to the availability of the respective function units. Information on the type, quantity and status of the function units may be defined in advance, set by the host through the system configuration units, or obtained or maintained by polling or other status detecting mechanisms. If it is necessary to designate that several microtasks be performed by the same function unit, it is possible to specify the logical identification of this function unit in the task descriptions and to have the task dispatcher bind this logical identification to a physical function unit.
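The type-based dispatch with logical-to-physical binding described above can be sketched as follows. This is a hypothetical model: the unit-type names, the `busy` set, and the `bindings` table are illustrative choices, not structures defined by the embodiments:

```python
class TaskDispatcher:
    """Sketch of dispatch by function-unit type with logical-id binding.

    `units` maps a unit type (e.g. "DMA", "CALC") to its physical
    units; `busy` tracks availability; `bindings` pins a logical id to
    the physical unit first chosen for it, so that several related
    microtasks land on the same function unit.
    """

    def __init__(self, units):
        self.units = units
        self.busy = set()
        self.bindings = {}

    def dispatch(self, unit_type, logical_id=None):
        # Reuse an existing binding so related microtasks share a unit.
        if logical_id is not None and logical_id in self.bindings:
            return self.bindings[logical_id]
        # Otherwise pick any available unit of the requested type.
        for unit in self.units[unit_type]:
            if unit not in self.busy:
                self.busy.add(unit)
                if logical_id is not None:
                    self.bindings[logical_id] = unit
                return unit
        return None  # no unit of this type currently available
```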
The above-disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments that fall within the true spirit and scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description.
The scope of the present disclosure includes any novel feature or combination of features disclosed herein. The applicant hereby gives notice that new claims may be formulated to such features or combination of features during prosecution of this application or of any such further applications derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.
For the avoidance of doubt, the term "comprising", as used herein throughout the description and claims, is not to be construed as meaning "consisting only of".

Claims

1. A multi-processor computing system, comprising a host processor, a global memory and at least one processing units, wherein the multi-processor computing system further comprises: a microtask sequencer, comprising a task acquisition device and a task scheduling device, wherein the task acquisition device is configured to fetch a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands, and the task scheduling device is configured to dispatch each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit as dispatched to can execute the microtask instruction, and the task scheduling device is further configured to detect the completion of all the microtasks in the command and notify the host processor of the completion, wherein the processing unit can access the global memory by itself or by another of the at least one processing units.
2. The multi-processor computing system according to claim 1, wherein the task scheduling device comprises: a task dispatcher; at least one task queues; and at least one execution control units, each of which is associated with a different one of the at least one task queues, and with a different one of the at least one processing units, wherein the task dispatcher is configured to accept the command fetched by the task acquisition device and put said each microtask instruction in one of the at least one task queues associated with said processing unit as dispatched to, and is further configured to perform said detection and notification; and each of the at least one execution control units is configured to obtain one microtask instruction from the associated task queue in a FIFO manner and deliver the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable of processing a further microtask instruction, and is further configured to collect the completion of the obtained microtask instruction for said detection.
3. The multi-processor computing system according to claim 2, wherein the command further includes synchronization instructions for achieving the synchronization among two or more of the at least one processing units, the task dispatcher is further configured to put the synchronization instructions into the task queues associated with the synchronization instructions, and the task scheduling device further comprises: a synchronization control means being operated by the execution control units based on the synchronization instructions obtained from respective task queues for achieving the synchronization.
4. The multi-processor computing system according to claim 3, wherein the synchronization is based on semaphore technique, and the synchronization control means is implemented based on a shared semaphore array.
5. The multi-processor computing system according to any of claims 2-4, wherein the command further includes a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more of values reported from all or some of the processing units, the task dispatcher is further configured to put the flow control instruction into the task queue associated with the flow control instruction, and the task scheduling device further comprises: a memory for storing the values, wherein the execution control unit is further configured to skip the part of the microtask instructions based on the flow control instruction and associated values in the memory.
6. The multi-processor computing system according to any of claims 2 to 4, wherein the command further includes a flow control instruction for the task dispatcher to selectively skip a part of the microtask instructions based on one or more of values reported from all or some of the processing units, and the task scheduling device further comprises: a memory for storing the values, wherein the task dispatcher is further configured to skip the part of the microtask instructions based on the flow control instruction and associated values in the memory.
7. The multi-processor computing system according to any of claims 1 to 6, wherein the multi-processor computing system is implemented in a system on chip accelerator or a microprocessor.
8. The multi-processor computing system according to any of claims 1 to 7, wherein the access by the processing unit to the global memory is implemented via DMA.
9. The multi-processor computing system according to any of claims 1 to 8, wherein the multi-processor computing system further comprises a configuration device for initializing the microtask sequencer.
10. The multi-processor computing system according to any of claims 1 to 9, wherein the microtask sequencer further comprises a local shared memory.
11. The multi-processor computing system according to any preceding claim, wherein the task scheduling device is configured to dispatch each microtask instruction defined in each microtask description to a processing unit as indicated in the microtask description.
12. The multi-processor computing system according to any preceding claim, wherein the task scheduling device is configured to dispatch each microtask instruction defined in each microtask description to a processing unit meeting the requirement as specified in the microtask description.
13. A task allocating method in a multi-processor computing system comprising a host processor, a global memory and at least one processing units, the method comprising: providing, by the host processor, commands including microtask descriptions in the global memory; fetching, by the microtask sequencer, a command including microtask descriptions from the global memory if the microtask sequencer is capable of accommodating further commands; dispatching, by the microtask sequencer, each microtask instruction defined by each of the microtask descriptions to one of the at least one processing units, so that the processing unit as dispatched to can execute the microtask instruction; detecting, by the microtask sequencer, the completion of all the microtasks in the command; and notifying, by the microtask sequencer, the host processor of the completion, wherein the processing unit can access the global memory by itself or by another of the at least one processing units.
14. The method according to claim 13, wherein the microtask sequencer comprises at least one task queues and at least one execution control units, each of which is associated with a different one of the at least one task queues, and with a different one of the at least one processing units, the dispatching comprises: putting said each microtask instruction in one of the at least one task queues associated with said processing unit as dispatched to; and obtaining by an associated execution control unit one microtask instruction from the associated task queue in a FIFO manner, delivering the obtained microtask instruction to the associated processing unit for execution, in response to the associated processing unit being capable to process a further microtask instruction, and collecting the completion of the obtained microtask instruction for said detection.
15. The method according to claim 14, wherein the command further includes synchronization instructions for achieving the synchronization among two or more of the at least one processing units, the dispatching further comprises: putting the synchronization instructions into the task queues associated with the synchronization instructions; and operating by the associated execution control units based on the synchronization instructions obtained from respective task queues for achieving the synchronization.
16. The method according to claim 15, wherein the synchronization is based on semaphore technique.
17. The method according to any of claims 14 to 16, wherein the command further comprises a flow control instruction for the associated execution control unit to selectively skip a part of the microtask instructions based on one or more of values reported from all or some of the processing units, the dispatching further comprises: putting the flow control instruction into the task queue associated with the flow control instruction; storing the values; and skipping, through the execution control unit, the part of the microtask instructions based on the flow control instruction and the stored associated values.
18. The method according to any of claims 14 to 16, wherein the command further comprises a flow control instruction for selectively skipping a part of the microtask instructions based on one or more of values reported from all or some of the processing units, the dispatching further comprises: storing the values; and skipping the part of the microtask instructions based on the flow control instruction and the stored associated values.
19. The method according to any of claims 13 to 18, wherein the dispatching comprises dispatching each microtask instruction defined in each microtask description to a processing unit as indicated in the microtask description.
20. The method according to any of claims 13 to 19, wherein the dispatching comprises dispatching each microtask instruction defined in each microtask description to a processing unit meeting the requirement as specified in the microtask description.
PCT/EP2007/060028 2006-10-20 2007-09-21 A multi-processor computing system and its task allocating method WO2008046716A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200610135656.X 2006-10-20
CN 200610135656 CN101165655A (en) 2006-10-20 2006-10-20 Multiple processor computation system and its task distribution method

Publications (1)

Publication Number Publication Date
WO2008046716A1 true WO2008046716A1 (en) 2008-04-24

Family

ID=38776273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2007/060028 WO2008046716A1 (en) 2006-10-20 2007-09-21 A multi-processor computing system and its task allocating method

Country Status (2)

Country Link
CN (1) CN101165655A (en)
WO (1) WO2008046716A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661405B (en) * 2008-08-28 2012-08-29 国际商业机器公司 Multi-database function invoking method and system based on multiprocessor system
CN102129390B (en) * 2011-03-10 2013-06-12 中国科学技术大学苏州研究院 Task scheduling system of on-chip multi-core computing platform and method for task parallelization
WO2013095338A1 (en) * 2011-12-19 2013-06-27 Intel Corporation Simd integer multiply-accumulate instruction for multi-precision arithmetic
US9921873B2 (en) * 2012-01-31 2018-03-20 Nvidia Corporation Controlling work distribution for processing tasks
US9483263B2 (en) * 2013-03-26 2016-11-01 Via Technologies, Inc. Uncore microcode ROM
CN105051689A (en) * 2013-09-29 2015-11-11 华为技术有限公司 Method, apparatus and system for scheduling resource pool in multi-core system
CN104572266B (en) * 2015-01-04 2018-03-06 华东师范大学 MPSoC task schedulings modeling based on UPPAAL SMC and appraisal procedure under process variation
CN108089915B (en) * 2016-11-22 2021-10-15 北京京东尚科信息技术有限公司 Method and system for business control processing based on message queue
CN108416433B (en) * 2018-01-22 2020-11-24 上海熠知电子科技有限公司 Neural network heterogeneous acceleration method and system based on asynchronous event
CN109743390B (en) * 2019-01-04 2022-02-22 深圳壹账通智能科技有限公司 Task scheduling method and device, computer equipment and storage medium
CN113032015B (en) * 2019-12-24 2022-02-18 中国科学院沈阳自动化研究所 Communication method for precision motion control
CN115176240A (en) * 2020-03-13 2022-10-11 北京希姆计算科技有限公司 Task allocation method and device, electronic equipment and computer readable storage medium
CN111431892B (en) * 2020-03-20 2022-03-25 上海金卓科技有限公司 Accelerator management architecture and method and accelerator interface controller
CN112486575A (en) * 2020-12-07 2021-03-12 广西电网有限责任公司电力科学研究院 Electric artificial intelligence chip sharing acceleration operation component and application method
CN113326221B (en) * 2021-06-30 2024-03-22 上海阵量智能科技有限公司 Data processing device, method, chip, computer device and storage medium
CN114253694B (en) * 2022-02-25 2022-06-24 杭州雄迈集成电路技术股份有限公司 Asynchronous processing method and device based on neural network accelerator

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002046887A2 (en) * 2000-10-23 2002-06-13 Xyron Corporation Concurrent-multitasking processor
EP1416377A1 (en) * 2002-10-30 2004-05-06 STMicroelectronics, Inc. Processor system with a plurality of processor cores for executing tasks sequentially or in parallel


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ADLER R M: "DISTRIBUTED COORDINATION MODELS FOR CLIENT/SERVER COMPUTING", COMPUTER, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 28, no. 4, 1 April 1995 (1995-04-01), pages 14 - 22, XP000507856, ISSN: 0018-9162 *
JIANWEN C: "Chen Jianwen Homepage", 2 November 2006 (2006-11-02), pages 1 - 4, XP002461679, Retrieved from the Internet <URL:http://learn.tsinghua.edu.cn:8080/2002315365/index.html> *
SHONDIP SEN, HENK MULLER, DAVID MAY: "Synchronisation in a Multithreaded Processor", COMMUNICATING PROCESS ARCHITECTURES 2000, IOS PRESS, 2000, Amsterdam, pages 137 - 144, XP002461678, Retrieved from the Internet <URL:http://citeseer.ist.psu.edu/rd/0%2C360899%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/17245/http:zSzzSzwww.cs.bris.ac.ukzSzToolszSzReportszSzPszSz2000-may-2.pdf/sen00synchronisation.pdf> *
VENKATARAMAN R: "Designing SONET/ATM layer processing ASICs using embedded approach", COMPUTERS AND COMMUNICATIONS, 1995., CONFERENCE PROCEEDINGS OF THE 1995 IEEE FOURTEENTH ANNUAL INTERNATIONAL PHOENIX CONFERENCE ON SCOTTSDALE, AZ, USA 28-31 MARCH 1995, NEW YORK, NY, USA,IEEE, US, 28 March 1995 (1995-03-28), pages 437 - 443, XP010149352, ISBN: 0-7803-2492-7 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2131278A1 (en) * 2008-06-02 2009-12-09 Mobileye Technologies Ltd Scheduling of multiple tasks in a system including multiple computing elements
US9038073B2 (en) 2009-08-13 2015-05-19 Qualcomm Incorporated Data mover moving data to accelerator for processing and returning result data based on instruction received from a processor utilizing software and hardware interrupts
WO2011020053A1 (en) * 2009-08-13 2011-02-17 Qualcomm Incorporated Apparatus and method for efficient data processing
US8762532B2 (en) 2009-08-13 2014-06-24 Qualcomm Incorporated Apparatus and method for efficient memory allocation
US8788782B2 (en) 2009-08-13 2014-07-22 Qualcomm Incorporated Apparatus and method for memory management and efficient data processing
CN103729228A (en) * 2012-10-11 2014-04-16 三星电子株式会社 Method for compiling program, task mapping method and task scheduling method
CN103353851A (en) * 2013-07-01 2013-10-16 华为技术有限公司 Method and equipment for managing tasks
WO2015013458A1 (en) * 2013-07-23 2015-01-29 Qualcomm Incorporated Providing queue barriers when unsupported by an i/o protocol or target device
WO2015106687A1 (en) * 2014-01-14 2015-07-23 Tencent Technology (Shenzhen) Company Limited Method and apparatus for processing computational task
US10146588B2 (en) 2014-01-14 2018-12-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for processing computational task having multiple subflows
US10437650B2 (en) * 2014-06-19 2019-10-08 Nec Corporation Controlling execution of tasks in a series of operational processing by identifying processing units based on task command, task setting information, state of operational processing
CN104462302A (en) * 2014-11-28 2015-03-25 北京京东尚科信息技术有限公司 Distributed data processing coordination method and system
WO2020047337A1 (en) * 2018-08-29 2020-03-05 Qualcomm Incorporated Method, apparatus, and system for an architecture for machine learning acceleration
US11010313B2 (en) 2018-08-29 2021-05-18 Qualcomm Incorporated Method, apparatus, and system for an architecture for machine learning acceleration
CN111045979A (en) * 2018-10-11 2020-04-21 力晶科技股份有限公司 Multi-processing architecture based on memory processor and method of operation thereof
CN111045979B (en) * 2018-10-11 2023-12-19 力晶积成电子制造股份有限公司 Multi-processing architecture based on memory processor and method of operation thereof

Also Published As

Publication number Publication date
CN101165655A (en) 2008-04-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07803577

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07803577

Country of ref document: EP

Kind code of ref document: A1