GB2484907A - Data processing system with a plurality of data processing units and a task-based scheduling scheme - Google Patents

Data processing system with a plurality of data processing units and a task-based scheduling scheme

Info

Publication number
GB2484907A
Authority
GB
United Kingdom
Prior art keywords
data
data processing
task
task descriptor
operable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1017752.5A
Other versions
GB201017752D0 (en)
GB2484907B (en)
GB2484907A8 (en)
Inventor
Ray Mcconnell
Paul Winser
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bluwireless Technology Ltd
Original Assignee
Bluwireless Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bluwireless Technology Ltd filed Critical Bluwireless Technology Ltd
Priority to GB1017752.5A priority Critical patent/GB2484907B/en
Publication of GB201017752D0 publication Critical patent/GB201017752D0/en
Priority to US13/880,567 priority patent/US20140040909A1/en
Priority to PCT/GB2011/052041 priority patent/WO2012052773A1/en
Publication of GB2484907A publication Critical patent/GB2484907A/en
Publication of GB2484907A8 publication Critical patent/GB2484907A8/en
Application granted granted Critical
Publication of GB2484907B publication Critical patent/GB2484907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206Monitoring of events, devices or parameters that trigger a change in power modality
    • G06F1/3228Monitoring task completion, e.g. by use of idle timers, stop commands or wait commands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Data processing system 5 includes a cluster of data processing units 521-52N, shared storage 56 and bus system 54. Each data processing unit 52 comprises scalar processor device 101, heterogeneous processor device (HPU) 102 and local memory device 103. The scalar processor executes a series of instructions including instructions for execution on the HPU which it forwards automatically. Task-based scheduling includes the maintenance of a task descriptor list TL to hold task information and a free list FL of data processing units available to perform tasks. Each data processing unit is operable to retrieve PB a task descriptor from the list and to update that task descriptor PC,PD based upon a state of execution of that task. Upon completion of a task PG a data processing unit may append PH the address of its wake-up mechanism to the free list and enter a low-power mode. A data processing unit on the free list may be awoken PF to perform a task by writing the address of a task descriptor to the address of the wake-up mechanism.

Description

DATA PROCESSING SYSTEMS
The present invention relates to data processing systems.
BACKGROUND OF THE INVENTION
A simplified wireless communications system is illustrated schematically in Figure 1 of the accompanying drawings. A transmitter 1 communicates with a receiver 2 over an air interface 3 using radio frequency signals. In digital radio wireless communications systems, a signal to be transmitted is encoded into a stream of data samples that represent the signal.
The data samples are digital values in the form of complex numbers. A simplified transmitter 1 is illustrated in Figure 2 of the accompanying drawings, and comprises a signal input 11, a digital to analogue converter 12, a modulator 13, and an antenna 14. A digital datastream is supplied to the signal input 11, and is converted into analogue form at a baseband frequency using the digital to analogue converter 12. The resulting analogue signal is used to modulate a carrier waveform having a higher frequency than the baseband signal by the modulator 13. The modulated signal is supplied to the antenna 14 for transmission over the air interface 3.
At the receiver 2, the reverse process takes place. Figure 3 illustrates a simplified receiver 2 which comprises an antenna 21 for receiving radio frequency signals, a demodulator 22 for demodulating those signals to baseband frequency, and an analogue to digital converter 23 which operates to convert such analogue baseband signals to a digital output datastream 24.
Since wireless communications devices typically provide both transmission and reception functions, and since, generally, transmission and reception occur at different times, the same digital processing resources may be reused for both purposes.
In a packet-based system, the datastream is divided into 'Data Packets', each of which contains up to hundreds of kilobytes of data. Each data packet generally comprises: 1. A Preamble, used by the receiver to synchronise its decoding operation to the incoming signal.
2. A Header, which contains information about the packet such as its length and coding style.
3. The Payload, which is the actual data to be transferred.
4. A Checksum, which is computed from the entirety of the data and allows the receiver to verify that all data bits have been correctly received.
Each of these data packet sections must be processed and decoded in order to provide the original datastream to the receiver. Figure 4 illustrates that a packet processor 5 is provided in order to process a received datastream 24 into a decoded output datastream 58.
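Purely by way of illustration, the packet structure described above might be modelled in C as follows; the field names and widths are assumptions made for this sketch and are not specified in the present document.

    #include <stdint.h>

    #define MAX_PAYLOAD_BYTES 4096           /* assumed upper bound, for illustration */

    struct data_packet {
        uint8_t  preamble[16];               /* synchronisation pattern for the receiver */
        struct {
            uint32_t payload_length;         /* length of the payload in bytes */
            uint8_t  coding_style;           /* identifies the coding scheme used */
        } header;
        uint8_t  payload[MAX_PAYLOAD_BYTES]; /* the actual data to be transferred */
        uint32_t checksum;                   /* computed from the entirety of the data */
    };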
The different types of processing required by these sections of the packet and the complexity of the coding algorithms suggest that a software-based processing system is to be preferred, in order to reduce the complexity of the hardware. However, a pure software approach is difficult since each packet comprises a continuous stream of samples with no time gaps in between. As such, a pipelined hardware implementation may be preferred.
For multi-gigabit wireless communications, the baseband sample rate required is typically in the range of 1 GHz to over 5 GHz. This presents a problem when implementing the baseband processing in a digital device, since this sample rate is comparable to or higher than the clock rate of the processing circuits that are generally available. The number of processing cycles available per sample can then fall to a very low level, sometimes less than unity. Existing solutions to this problem have drawbacks as follows:
1. Run the baseband processing circuitry at high speed, equal to or greater than the sample rate: operating CMOS circuits at GHz frequencies consumes excessive amounts of power, more than is acceptable in small, low-power, battery-operated devices. The design of such high-frequency processing circuits is also very labour-intensive.
2. Decompose the processing into a large number of stages and implement a pipeline of hardware blocks, each of which performs only one section of the processing: moving all the data through a large number of hardware units uses considerable power in the movement, in addition to the power consumed in the actual processing itself. In addition, the functions of the stages are quite specific, and so flexibility in the processing algorithms is lost.
Existing solutions make use of a combination of (1) and (2) above to achieve the required processing performance.
An alternative approach is one of parallel processing; that is, to split the stream of samples into a number of slower streams which are processed by an array of identical processor units, each operating at a clock frequency low enough to ease their design effort and avoid excessive power consumption. However, this approach also has drawbacks. If too many processors are used, the hardware overhead of instruction fetch and issue becomes undesirably large and, therefore, inefficient. If processors are arranged together into a Single Instruction Multiple Data (SIMD) arrangement, then the latency of waiting for them to fill with data can exceed the upper limit for latency, as specified in the protocol standard being implemented.
An architecture with multiple processors communicating via shared memory can have the problem of contention for a shared memory resource. This is a particular disadvantage in a system that needs to process a continual stream of data and cannot tolerate delays in processing.
SUMMARY OF THE INVENTION
According to one aspect of the present invention, there is provided a data processing system comprising a plurality of data processing units, and a shared data storage device operable to store data for each of the plurality of data processing units, and to store a task descriptor list accessible by each of the data processing units, wherein the data processing units each comprise a scalar processor device, and a heterogeneous processor device connected to receive instruction information from the scalar processor, and to receive incoming data, and operable to process incoming data in accordance with received instruction information, the heterogeneous processor device comprising a heterogeneous controller unit connected to receive instruction information from the scalar processor, and operable to output instruction information, an instruction sequencer connected to receive instruction information from the heterogeneous controller unit, and operable to output a sequence of instructions, and a vector processor array including a plurality of vector processor elements operable to process received data items in accordance with instructions received from the instruction sequencer, wherein each data processing unit is operable to access a task descriptor list stored in the shared storage device, to retrieve a task descriptor in such a task descriptor list, and to update that task descriptor in the task descriptor list in dependence upon a state of execution of a task described by the task descriptor.
In one example, the data processing units are operable to execute tasks described by retrieved task descriptors substantially simultaneously in predefined processing phases.
In one example, each data processing unit is operable to transfer a modified task descriptor to another data processing unit by modifying that task descriptor in the task descriptor list.
In one example, the data processing units are operable to execute respective different tasks defined by task descriptors retrieved from the task descriptor list.
Each data processing unit may be operable to enter a low power mode upon completion of a task defined by a task descriptor retrieved from the task list. In such a case, each data processing unit may be operable to be caused to exit the low power mode upon initiation of a processing phase.
In one example, a bus system is provided which includes a data input network, a data output network, and a shared memory network.
The data processing system may receive a substantially continual stream of data items at an incoming data rate, and the plurality of data processing units can then be arranged to process such a stream of data items, such that each of the data processing units is substantially continually utilised.
According to another aspect of the present invention, there is provided a method of processing an incoming data stream using such a data processing system, the method comprising receiving instruction information, defining a task descriptor from the instruction information, defining a task descriptor list accessible by each of the data processing units, storing the task descriptor in the task descriptor list, accessing the task descriptor list to retrieve a task descriptor stored therein, and updating that task descriptor in the task descriptor list in dependence upon a state of execution of a task described by the task descriptor.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a simplified schematic view of a wireless communications system;
Figure 2 is a simplified schematic view of a transmitter of the system of Figure 1;
Figure 3 is a simplified schematic view of a receiver of the system of Figure 1;
Figure 4 illustrates a data processor;
Figure 5 illustrates a data processor including processing units embodying one aspect of the present invention;
Figure 6 illustrates data packet processing by the data processor of Figure 5;
Figure 7 illustrates a processing unit embodying one aspect of the present invention for use in the data processor of Figure 5;
Figure 8 illustrates a method embodying another aspect of the present invention; and
Figure 9 illustrates steps in a method related to that shown in Figure 8.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Embodiments of the present invention will be described with reference to one particular implementation for a wireless communications system. However, it will be readily appreciated that the techniques to be described are more widely applicable to a wide variety of data processing systems.
Figure 5 illustrates a data processor which includes a processing unit embodying one aspect of the present invention. Such a processor is suitable for processing a continual datastream, or data arranged as packets. Indeed, data within a data packet is also continual for the length of the data packet, or for part of the data packet.
The processor 5 includes a cluster of N data processing units (or "physical processing units") 521-52N, hereafter referred to as "PPUs". The PPUs 521-52N receive data from a first data unit 51, and send processed data to a second data unit 57. The first and second data units 51, 57 are hardware blocks that may contain buffering, data formatting, or timing functions.
In the example to be described, the first data unit 51 is connected to transfer data with the radio sections of a wireless communications device, and the second data unit is connected to transfer data with the user data processing sections of the device. It will be appreciated that the first and second data units 51, 57 are suitable for transferring data to be processed by the PPUs 52 with any appropriate data source or data sink. In the present example, in a receive mode of operation, data flows from the first data unit 51, through the processor array, to the second data unit 57. In a transmit mode of operation, the data flow is in the opposite direction; that is, from the second data unit 57 to the first data unit 51 via the processing array.
The PPUs 521-52N are under the control of a control processor 55, and make use of a shared memory resource 56. Data and control signals are transferred between the PPUs 521-52N, the control processor 55, and the memory resource 56 using a bus system 54c.
It can be seen that the workload of processing a data stream from source to destination is divided N ways between the PPUs 521-52N on the basis of time-slicing the data. Each PPU then needs only 1/Nth of the performance that a single processor would have needed. This translates into simpler hardware design, lower clock speed, and lower overall power consumption. The control processor 55 and shared memory resource 56 may be provided in the device itself, or may be provided by one or more external units.
The control processor 55 has different capabilities to the PPUs 521-52N, since its tasks are more comparable to a general purpose processor running a body of control software. It may also be a degenerate control block with no software. It may therefore be an entirely different type of processor, as long as it can perform shared memory communications with the PPUs. However, the control processor 55 may be simply another instance of a PPU, or it may be of the same type but with minor modifications suited to its tasks.
It should be noted that the bandwidth of the radio data stream is usually considerably higher than the unencoded user data it represents. This means that the first data unit 51, which is at the radio end of the processing, operates at high bandwidth, and the second data unit 57 operates at a lower bandwidth related to the stream of user data.
At the radio interface, the data stream is substantially continual within a data packet. In the digital baseband processing, the data stream does not have to be continual, but the average data rate must match that of the radio frequency datastream. This means that if the baseband processing peak rate is faster than the radio data rate, the baseband processing can be executed in a non-continual, burst-like fashion. In practice, however, a large difference in processing rate will require more buffering in the first and second data units 51, 57 in order to match the rates, and this is undesirable both for the cost of the data buffer storage and for the latency of data being buffered for extended periods. Therefore, baseband processing should execute as near to continually as possible, and at a rate that needs to be only slightly faster than the rate of the radio data stream, in order to allow for small temporal gaps in the processing.
In the context of Figure 5, this means that data should be near-continually streamed either to or from the radio end of the processing (to and from the first data unit 51). In a receive mode, the high bandwidth stream of near-continual data is time-sliced between the PPUs 521-52N. Consider the receiving case where high bandwidth radio sample data is being transferred from the first data unit 51 to the PPU cluster: in the simple case, a batch of radio data, being a fixed number of samples, is transferred to each PPU in turn, in round-robin sequence. This is illustrated for a received packet in Figure 6, for the case of a cluster of four PPUs.
Each PPU 521-52N receives 621, 622, 623, 624, 625, and 626 a portion of the packet data 62 from the incoming data stream 6. The received data portion is then processed 71, 72, 73, 74, 75, and 76, and output 81, 82, 83, 84, 85, and 86 to form a decoded data packet 8.
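A minimal sketch of this round-robin time-slicing, assuming a fixed batch size, a four-PPU cluster, and a hypothetical send_to_ppu transfer function (all names and sizes here are illustrative assumptions, not taken from the patent):

    #include <stddef.h>
    #include <stdint.h>

    #define N_PPUS        4                /* cluster size, as in Figure 6 */
    #define BATCH_SAMPLES 256              /* assumed fixed batch size */

    typedef struct { int16_t re, im; } sample_t;   /* complex radio sample */

    /* Dispatch fixed-size batches of samples to the PPUs in round-robin order;
     * send_to_ppu stands in for the transfer over the input data network. */
    void distribute_round_robin(const sample_t *stream, size_t n_samples,
                                void (*send_to_ppu)(int ppu,
                                                    const sample_t *batch,
                                                    size_t n))
    {
        int ppu = 0;
        for (size_t off = 0; off + BATCH_SAMPLES <= n_samples; off += BATCH_SAMPLES) {
            send_to_ppu(ppu, stream + off, BATCH_SAMPLES);  /* next PPU in turn */
            ppu = (ppu + 1) % N_PPUS;                       /* wrap around the cluster */
        }
    }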
Each PPU 521-52N must have finished processing its previous batch of samples by the time it is sent a new batch. In this way, all N PPUs 521-52N execute the same processing sequence, but their execution is 'out of phase' with each other, such that in combination they can accept a continuous stream of sample data.
In this simple receive case described above, each PPU 521-52N produces decoded output user data, at a lower bandwidth than the radio data, and supplies that data to the second data unit 57. Since the processing is uniform, the data output from all N PPUs 521-52N arrives at the data sink unit 57 in the correct order, so as to produce a decoded data packet.
In a simple transmit mode case, this arrangement is simply reversed, with the PPUs 521-52N accepting user data from the second data unit 57 and outputting encoded sample data to the first data unit 51 for radio transmission.
However, wireless data processing is more complex than in the simple case described above. The processing will not always be uniform; it will depend on the section of the data packet being processed, and may depend on factors determined by the data packet itself.
For example, the Header section of a received packet may contain information on how to process the following payload. The processing algorithms may need to be modified during reception of the packet in response to degradation of the wireless signal. On the completion of receiving a packet, an acknowledgement packet may need to be immediately transmitted in response. These and other examples of more complex processing demand that the PPUs 521-52N have a flexibility of scheduling and operation that is driven by the software running on them, and not just a simple pattern of operation that is fixed in hardware.
Under this more complex processing regime, the following considerations must be taken into account: * A control process, thread or agent defines the overall tasks to be performed. It may modify the priority of tasks depending on data-driven events. It may have a list of several tasks to be performed at the same time, by the available PPUs 521-52N of the cluster.
* The data of a received packet is split into a number of sections. The lengths of the sections may vary, and some sections may be absent in some packets. Furthermore, the sections often comprise blocks of data of a fixed number of samples. These blocks of sample data are termed 'Symbols' in this description. It is highly desirable that all the data for any symbol be processed in its entirety by one PPU 521-52N of the cluster, since splitting a symbol between two PPUs 521-52N would involve undue communication between the PPUs 521-52N in order to process that symbol. In some cases it is also desirable that several symbols be processed together in one PPU 521-52N, for example if the Header section 61 (Figure 6) of the data packet comprises several symbols. The PPUs 521-52N must in general therefore be able to dictate how much data they receive in any given processing phase from the data source unit 51, since this quantity may need to vary throughout the processing of a packet.
* Non-uniform processing conditions could potentially result in out-of-order processed data being available from the PPUs 521-52N. In order to prevent such a possibility, a mechanism is provided to ensure that processed data are provided to the first data unit 51 (in a transmit mode) or to the second data unit 57 (in a receive mode) in the correct order.
* The processing algorithms for one section of a data packet may depend on previous sections of the data packet. This means that PPUs 521-52N must communicate with each other about the exact processing to be performed on subsequent data. This is in addition to, and may be a modification of, the original task specified by the control process, thread, or agent.
* The combined processing power of all N PPUs 521-52N in the cluster must be at least sufficient for handling the wireless data stream in the mode that demands the greatest processing resources. In some situations, however, the data stream may require a lighter processing load, and this may result in PPUs 521-52N completing their processing of a data batch ahead of schedule. It is highly desirable that any PPU 521-52N with no immediate work load to execute be able to enter an inactive, low-power 'sleep' mode, from which it can be awoken when a workload becomes available.
The cluster arrangement provides the software with the ability for the PPUs 521-52N in the cluster to decide collectively the optimal DSP algorithms and modes in which the system should be placed. This reduction of the collective information is available to the control processor via the SDN network. This localised processing and decision reduction allows the control processor to view the PPU cluster as a single logical entity.
A PPU is illustrated in Figure 7, and comprises a scalar processor unit 101 (which could be a 32-bit processor) closely connected with a heterogeneous processor unit (HPU) 102. High bandwidth real time data is coupled directly into and out of the HPU 102, via a system data network (SDN) 106a and 106b (54a and 54b in Figure 5). Scalar processor data and control data are transferred using a PPU-SMP (PPU-symmetrical multiprocessor) network PSN 104 (54c in Figure 5). A local memory device 103 is provided for access by the scalar processor unit 101, and by the heterogeneous processor unit 102.
The data processor includes hierarchical data networks which are designed to localise high bandwidth transactions and to maximise bandwidth with minimal data latency and power dissipation. These networks make use of an addressing scheme which is common to both the local data storage and the processor-wide data storage, in order to simplify the programming model.
Data are substantially continually dispatched, in real time, into the HPU 102, in sequence via the SDN 106a, and are then processed. Processed data exit from the HPU 102 on the SDN 106b.
The scalar processor unit 101 operates by executing a series of instructions defined in a high level program. Embedded in this program are specific coprocessor instructions that are customised for computation within the HPU 102.
A task-based scheduling scheme embodying one aspect of the present invention is shown in Figure 8, which shows the sequence of steps in the case of a PPU 521-52N being allocated a task by the control processor 55. The operation of a second PPU 521-52N executing a second fragment of the task, and so on, is not shown in this simplified diagram.
Two lists are defined in the shared memory resource 56. Each list is accessible by each of the PPUs 521-52N and by the control processor 55 for mutual communications. Figure 9 illustrates initialisation steps for the two lists, and shows the state of each list after initialisation of the system. The control processor 55 creates a task descriptor list TL and a free list FL in shared memory. Both lists are created empty. The task descriptor list TL is used to hold task information for access by the PPUs 521-52N, as described below. The free list FL is used to provide information regarding free processing resources.
The control processor initialises each PPU belonging to the cluster with the address of the free list FL, which the PPUs 521-52N need in order to participate in the task-sharing scheme. Each PPU 52 then adds itself to the free list FL, in no particular order.
Specifically, a PPU 52 appends to the free list FL an entry containing the address of the PPU's wake-up mechanism. After adding itself to the free list, a PPU can enter a low-power sleep state. It can subsequently be awoken, for example by another PPU, by the control processor, or by another processor, to perform a task by the writing of the address of a task descriptor to the address of the PPU's wake-up mechanism.
Management of lists in memory (creation, appending, and deletion of items) is a well-known technique in software engineering, and the details of the implementation are not described here, for the sake of clarity.
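Nevertheless, purely as an illustration of one possible shape for the free list FL and wake-up mechanism, the following C sketch assumes a simple array in shared memory; the types, the enter_low_power_sleep and run_task primitives, and the omitted locking are all assumptions of this sketch, not the patented implementation.

    #include <stdint.h>
    #include <string.h>

    #define MAX_PPUS 8

    struct task_desc;                             /* defined by the scheduling scheme */
    void enter_low_power_sleep(void);             /* assumed hardware sleep primitive */
    void run_task(struct task_desc *td);

    typedef volatile uintptr_t wakeup_word_t;     /* a PPU's wake-up mechanism */

    struct free_list {                            /* resides in shared memory 56 */
        wakeup_word_t *entry[MAX_PPUS];
        int count;                                /* protected by a lock, omitted here */
    };

    /* A PPU appends the address of its wake-up word to the free list FL and
     * sleeps; it is awoken when a task descriptor address is written there. */
    void ppu_join_free_list(struct free_list *fl, wakeup_word_t *my_wakeup)
    {
        *my_wakeup = 0;
        fl->entry[fl->count++] = my_wakeup;       /* step PH: queue up as free */
        while (*my_wakeup == 0)
            enter_low_power_sleep();              /* low-power 'sleep' state */
        run_task((struct task_desc *)*my_wakeup); /* awoken with a task to run */
    }

    /* Any agent (the control processor or another PPU) wakes the PPU at the
     * head of the free list by writing a task descriptor's address to its
     * wake-up word (step PF). */
    void wake_free_ppu(struct free_list *fl, struct task_desc *td)
    {
        wakeup_word_t *w = fl->entry[0];
        memmove(&fl->entry[0], &fl->entry[1],
                --fl->count * sizeof fl->entry[0]); /* pop the head entry */
        *w = (uintptr_t)td;                         /* this write wakes the PPU */
    }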
Referring back to Figure 8, items on the task descriptor list TL represent work that is to be done by the PPUs 521-52N. The free list FL allows the PPUs 521-52N to 'queue up' to be allocated tasks by the control processor 55.
Generally, a task represents too much work for a single PPU 521-52N to complete in a single processing phase. For example, a task could cause a single PPU 521-52N to consume more data than it can contain, or at least so much that the continuous compute and I/O operations depicted in Figure 6 would be prevented. For this reason, a PPU 521-52N that has been allocated a task will remove PB a task descriptor from the task descriptor list TL, but then return PD a modified task descriptor to the task descriptor list TL. The PPU 52 modifies the task descriptor to show that a processing phase has been accounted for by the PPU concerned, and to represent any remaining processing phases for the task in hand.
The PPU also then allocates PF any remaining processing phases of the task to another PPU 521-52N that is at the head of the free list FL. In other words, the first PPU 521-52N takes PB a task descriptor from the task descriptor list TL, modifies PC the task descriptor to remove from it the work that it is going to do or has done, and then returns PD a modified task descriptor to the task descriptor list TL for another PPU 521-52N to pick up and continue. This process may repeat any number of times before the task is finally fully completed. Whenever a PPU 521-52N completes a task, or a phase of it, it adds itself PH to the free list FL so that it is available to be allocated a new task, either by the control processor 55 or by another PPU 521-52N. It may also update the task descriptor in the task descriptor list to indicate that the overall task has been completed (or is close to completion), along with any other relevant information, such as the timestamp of completion or any errors that were encountered in processing. The PPU 52 that completes the final processing phase for a given task may signal the control processor directly to indicate the completion of the task. As an alternative, a PPU prior to the final PPU for a task can indicate the expectation of completion of the task, in order that the control processor is able to schedule the next task at an appropriate time to ensure that all of the processing resources are kept busy.
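In outline, one processing phase of this take/modify/return cycle might be sketched as follows, keyed to the step labels PB, PC, PD, PF, PG and PH of Figure 8; the descriptor fields and helper functions are assumptions made for illustration.

    #include <stdint.h>

    struct task_desc {
        uint32_t remaining_phases;        /* processing phases still to be done */
        uint32_t next_offset;             /* where the next PPU should resume */
        uint32_t checksum_running;        /* running total carried between phases */
    };

    /* Assumed helpers: shared-memory list operations and local processing. */
    struct task_desc *task_list_remove(void);
    void task_list_return(struct task_desc *td);
    void wake_free_ppu_with(struct task_desc *td);
    void free_list_append_self(void);
    void do_one_phase(struct task_desc *td);
    void signal_control_processor(struct task_desc *td);

    void ppu_processing_phase(void)
    {
        struct task_desc *td = task_list_remove();  /* PB: take a descriptor */
        do_one_phase(td);                           /* consume one phase of work */
        td->remaining_phases--;                     /* PC: account for this phase */
        if (td->remaining_phases > 0) {
            task_list_return(td);                   /* PD: return the remainder */
            wake_free_ppu_with(td);                 /* PF: hand off to head of FL */
        } else {
            signal_control_processor(td);           /* PG: final phase, task done */
        }
        free_list_append_self();                    /* PH: ready for a new task */
    }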
It should be noted that in this scheme, after the initial allocation of a task to a free PPU 521-52N, the control processor 55 is not involved in subsequent handover of the task to other PPUs for completion of the task. Indeed, the order in which physical PPUs 521-52N get to work on a task is determined purely by their position on the free list FL, which in turn depends on when they completed their previous task phase. In the case of uniform processing as depicted in Figure 6, it can be seen that a 'round-robin' order of processing between the PPUs 521-52N naturally emerges, without being explicitly orchestrated by the control processor 55.
In the scheme described, a more general case of non-uniform processing automatically allocates free PPU 521-52N resources to available tasks as they become available. The list mechanism supports simultaneous execution of multiple tasks: the control processor 55 can create any number of tasks on the task descriptor list TL and allocate a number of them to PPUs 521-52N, up to a maximum being the number of PPUs 521-52N on the free list FL at that time. In order to avoid undesirable delays in waiting for a PPU 521-52N to be free, the system is preferably designed with a sufficient number of PPUs 521-52N, each with sufficient processing power, so that there is always at least one PPU 521-52N on the free list FL during processing of a single task. Such provision ensures that the hand-off to the next PPU does not cause a delay in the processing of the current PPU. In an alternative technique, the current PPU can hand over the next processing phase at an appropriate point relative to its own processing phase; that is, before, during, or after the current processing phase.
Furthermore, the control processor 55 does not need to know how many PPUs 521-52N there are in the cluster, since it only sees them in terms of a queue of available processing resources. This permits PPUs 521-52N to join or leave the cluster dynamically without explicit interaction with the control processor 55. This may be advantageous for fault tolerance or power management, where one or more PPUs 521-52N may leave the cluster either permanently or for long durations where it is known that the overall processing load will be light.
In the scheme described, PPUs 521-52N are passively allocated tasks by another PPU 521-52N or by the control processor 55. An alternative scheme has free PPUs actively monitoring the task list TL for new tasks to arrive. However, the described scheme is preferable, since it has the advantage that an idle PPU 521-52N can be deactivated into an inactive, low-power state, from which it is awoken by the agent allocating it a new task. Such an inactive state would be difficult to achieve if the PPU 521-52N were actively seeking a new task by itself.
The basic interaction scheme described above can be extended to include additional functions. For example, PPUs 521-52N may need to interact with each other to exchange information and to ensure that their input and output data portions are transferred in the correct order to and from the first and second data units 51 and 57. Such interactions could be direct between PPUs, or via shared memory, either as additional fields in the task descriptor or as separate data structures.
It may be seen that interaction with the two memory-based lists of the described scheme may itself consume some time, which represents undesirable delay and may require extra buffering of data streams. This can be minimised by PPUs 521-52N negotiating their next task ahead of when that task can actually start execution. Thus, the time taken to manage the task list can be overlapped with the processing of a previous task item. This represents another elaboration of the scheme using handshake operations.
Another option for speeding up inter-processor communications is for each PPU 521-52N to locally cache contents of the shared memory 56, such as the list structures described above, and for conventional cache coherency mechanisms to keep each PPU's local copy of the data synchronised with the others.
A task that is defined by the control processor 55 will typically consist of several sub-tasks.
For example, to decode a received data packet, firstly the packet header must be decoded to determine the length and style of encoding of the following payload. Then, the payload itself must be decoded, and finally a checksum field will be compared to that calculated during decoding of the packet to check for any errors in the decoding process. This whole process will generally take many processing phases, with each phase being executed on a different PPU 521-52N according to the free list FL mechanism described above. In each processing phase, the PPU 521-52N executing the task must modify the task description so that the next PPU 521-52N can perform the correct sub-task or part thereof.
An example would be in the decoding of the data payload part of a received packet. The length of the payload is specified in the packet header. The PPU 521-52N which decodes the header can insert the payload length into the modified task list entry, which is then passed to the next PPU 521-52N. That second PPU 521-52N will in turn subtract the amount of payload data that it will decode during its processing phase from the task description before passing the task on to a third PPU 521-52N. This sequence continues until a PPU 521-52N can complete decoding of the final section of the payload.
To continue the above example, the PPU 521-52N that completes payload data decoding may then modify the task entry so that the next PPU 521-52N performs the checksum processing. For this to be possible, each PPU 521-52N that performs partial decoding of the payload data must also append the 'running total' result of the checksum calculation to the modified task list. The checksum running total is therefore passed along the processing sequence, via the task descriptor, so that the PPU 521-52N that performs the final check has access to the total checksum calculation of the whole payload. Other items of information may be similarly appended to the task descriptor on a continuous basis, such as signal quality metrics.
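As a sketch, assuming a simple additive checksum and the checksum_running descriptor field introduced in the earlier sketch (a real system would typically use a CRC; all names here are illustrative assumptions):

    #include <stddef.h>
    #include <stdint.h>

    /* Fold one PPU's portion of the payload into the running checksum that
     * is carried between processing phases in the task descriptor. */
    uint32_t update_running_checksum(uint32_t running,
                                     const uint8_t *chunk, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            running += chunk[i];      /* illustrative byte sum, not a real CRC */
        return running;
    }

    /* Within a payload-decoding phase (td being the retrieved descriptor):
     *     td->checksum_running =
     *         update_running_checksum(td->checksum_running, chunk, n);
     * The PPU performing the final check compares the accumulated total
     * against the packet's checksum field. */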
In some cases, the actual processing to be performed will be directed by the content of the data. An obvious case is that the header of a received packet specifies the modulation and coding scheme of the following payload. The header will also typically contain the source and destination addresses of the packet. If the receiver is not the addressed destination device, or does not lie on a valid route towards the destination address, then the remainder of the packet, i.e. the payload, may be ignored instead of decoded. This represents an early termination of a task, rather than a modification of a task, and can achieve considerable overall power savings in a network consisting of many devices.
Information gained in the payload decoding process may also cause processing to be modified. For example, if received signal quality is poor, more sophisticated algorithms may be required to recover the data correctly. If a PPU 521-52N identifies a change to the processing algorithms required, it can communicate that change to subsequent PPUs 521-52N dealing with subsequent portions of the packet, again by passing such information through the task descriptor list TL in shared memory.
Many such decisions about processing methods may be taken individually by one PPU 521-52N and communicated to subsequent processing phases. Alternatively, such decisions may be made cooperatively by several or all PPUs 521-52N communicating via shared memory structures outside of the task descriptor list TL. This would typically be used for changes that occur due to longer-term effects and need many individual data points to be combined for decision making. Overall processing policies such as error protection or power management may be folded into the collective decision-making process. This may be performed entirely by the PPUs, or also involve the control processor 55.
In a receive mode, the function of the first data unit 51 is to distribute the incoming data stream to the PPUs 521-52N. The amount of data that a PPU 521-52N requires for any processing phase is known to the PPU 521-52N, and may depend on previous processing of packet data. Therefore, the PPU 521-52N must request a defined amount of data from the first data unit 51, which then streams the requested amount of data back to the requesting PPU 521-52N. The first data unit 51 should be able to deal with multiple requests for data arriving from PPUs 521-52N in quick succession. It contains a request queue of depth equal to or greater than the number of PPUs 521-52N. It executes each request in the order received, as data becomes available to it to service the requests.
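Such a request queue might be sketched as a simple FIFO whose depth equals the PPU count; the structure and names below are assumptions for illustration only.

    #include <stddef.h>

    #define N_PPUS 4                     /* queue depth equal to the PPU count */

    struct data_request {
        int    ppu_id;                   /* which PPU is asking for data */
        size_t n_samples;                /* amount of data wanted this phase */
    };

    struct request_queue {
        struct data_request q[N_PPUS];
        int head, count;
    };

    /* Requests are serviced strictly in the order received. */
    void enqueue_request(struct request_queue *rq, struct data_request r)
    {
        rq->q[(rq->head + rq->count++) % N_PPUS] = r;
    }

    struct data_request dequeue_request(struct request_queue *rq)
    {
        struct data_request r = rq->q[rq->head];
        rq->head = (rq->head + 1) % N_PPUS;
        rq->count--;
        return r;
    }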
Again in the receive mode, the function of the second data unit 57 is simply to combine the output data produced by each processing phase on a PPU 521-52N. Each PPU 521-52N will in turn stream its output data to the data sink unit over the output data bus. In the case of non-uniform processing, it might be possible that output data from two PPUs arrives at the data sink in an incorrect order. To prevent this, the PPUs 521-52N may exchange a software 'token' via shared memory that can be used to force serialisation of output data to the data sink in the correct order.
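The 'token' exchange could be as simple as a shared counter that each PPU compares against its own phase number before streaming output; the following sketch uses assumed names, with busy-waiting standing in for whatever synchronisation mechanism the hardware actually provides.

    #include <stddef.h>
    #include <stdint.h>

    void stream_to_data_sink(const uint8_t *buf, size_t n);  /* assumed bus operation */

    volatile uint32_t output_token;      /* in shared memory, initialised to 0 */

    /* A PPU streams its output only when the token reaches its phase number,
     * then passes the token on, forcing output into the correct order. */
    void output_in_order(uint32_t my_phase, const uint8_t *buf, size_t n)
    {
        while (output_token != my_phase)
            ;                            /* wait for our turn */
        stream_to_data_sink(buf, n);
        output_token = my_phase + 1;     /* hand the token to the next phase */
    }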
Both requesting data from the first data unit 51 and negotiating access to the second data unit 57 could add unwanted delay to the execution of a PPU processing phase. Both of these operations can be performed in advance, and overlapped with other processing in a pipelined manner to avoid such delays.
For a transmit mode, the functions of the first and second data units are reversed, with the second data unit 57 supplying data for processing, and the first data unit 51 receiving processed data for transmission.

Claims (13)

  1. A data processing system comprising: a plurality of data processing units; and a shared data storage device operable to store data for each of the plurality of data processing units, and to store a task descriptor list accessible by each of the data processing units, wherein the data processing units each comprise: a scalar processor device; and a heterogeneous processor device connected to receive instruction information from the scalar processor, and to receive incoming data, and operable to process incoming data in accordance with received instruction information, the heterogeneous processor device comprising: a heterogeneous controller unit connected to receive instruction information from the scalar processor, and operable to output instruction information; an instruction sequencer connected to receive instruction information from the heterogeneous controller unit, and operable to output a sequence of instructions; and a vector processor array including a plurality of vector processor elements operable to process received data items in accordance with instructions received from the instruction sequencer; wherein each data processing unit is operable to access a task descriptor list stored in the shared storage device, to retrieve a task descriptor in such a task descriptor list, and to update that task descriptor in the task descriptor list in dependence upon a state of execution of a task described by the task descriptor.
  2. A data processing system as claimed in claim 1, wherein the data processing units are operable to execute tasks described by retrieved task descriptors substantially simultaneously in predefined processing phases.
  3. A data processing system as claimed in claim 2, wherein the processing phases are defined upon receipt of incoming data items.
  4. A data processing system as claimed in any one of claims 1 to 3, wherein each data processing unit is operable to transfer a modified task descriptor to another data processing unit by modifying that task descriptor in the task descriptor list.
  5. A data processing system as claimed in any one of claims 1 to 4, wherein the data processing units are operable to execute respective different tasks defined by task descriptors retrieved from the task descriptor list.
  6. A data processing system as claimed in any one of claims 1 to 5, wherein each data processing unit is operable to enter a low power mode upon completion of a processing phase of a task defined by a retrieved task descriptor.
  7. A data processing system as claimed in claim 6, wherein each data processing unit is operable to be caused to exit such a low power mode upon initiation of a processing phase of a task.
  8. A data processing system as claimed in any one of the preceding claims, further comprising a bus system which provides a data input network, a data output network, and a shared memory network.
  9. A data processing system as claimed in any one of the preceding claims, operable to receive a substantially continual stream of data items at an incoming data rate, wherein the plurality of data processing units is arranged to process such a stream of data items, such that each of the data processing units is substantially continually utilised.
  10. A data processing system as claimed in any one of the preceding claims, wherein each data processing unit is operable to access a free list which relates to available processing resources, and to add a reference to itself when a processing phase has completed, and to transfer a modified task descriptor to a data processing unit included on the free list.
  11. A method of processing an incoming data stream using a data processing system as claimed in any one of the preceding claims, the method comprising: receiving instruction information; defining a task descriptor from the instruction information; defining a task descriptor list accessible by each of the data processing units; storing the task descriptor in the task descriptor list; accessing the task descriptor list to retrieve a task descriptor stored therein; and updating that task descriptor in the task descriptor list in dependence upon a state of execution of a task described by the task descriptor.
  12. A device substantially as hereinbefore described with reference to, and as shown in, Figures 5 to 7 of the accompanying drawings.
  13. A method substantially as hereinbefore described with reference to Figures 8 and 9 of the accompanying drawings.
GB1017752.5A 2010-10-21 2010-10-21 Data processing systems Active GB2484907B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1017752.5A GB2484907B (en) 2010-10-21 2010-10-21 Data processing systems
US13/880,567 US20140040909A1 (en) 2010-10-21 2011-10-20 Data processing systems
PCT/GB2011/052041 WO2012052773A1 (en) 2010-10-21 2011-10-20 Data processing systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1017752.5A GB2484907B (en) 2010-10-21 2010-10-21 Data processing systems

Publications (4)

Publication Number Publication Date
GB201017752D0 GB201017752D0 (en) 2010-12-01
GB2484907A true GB2484907A (en) 2012-05-02
GB2484907A8 GB2484907A8 (en) 2013-07-10
GB2484907B GB2484907B (en) 2014-07-16

Family

ID=43334138

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1017752.5A Active GB2484907B (en) 2010-10-21 2010-10-21 Data processing systems

Country Status (1)

Country Link
GB (1) GB2484907B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594758B (en) * 2023-07-18 2023-09-26 山东三未信安信息科技有限公司 Password module call optimization system and optimization method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0064142A2 (en) * 1981-05-04 1982-11-10 International Business Machines Corporation Multiprocessor system comprising a task handling arrangement
US5197130A (en) * 1989-12-29 1993-03-23 Supercomputer Systems Limited Partnership Cluster architecture for a highly parallel scalar/vector multiprocessor system
EP1011052A2 (en) * 1998-12-15 2000-06-21 Nec Corporation Shared memory type vector processing system and control method thereof
US20050132380A1 (en) * 2003-12-11 2005-06-16 International Business Machines Corporation Method for hiding latency in a task-based library framework for a multiprocessor environment
US20050188372A1 (en) * 2004-02-20 2005-08-25 Sony Computer Entertainment Inc. Methods and apparatus for processor task migration in a multi-processor system


Also Published As

Publication number Publication date
GB201017752D0 (en) 2010-12-01
GB2484907B (en) 2014-07-16
GB2484907A8 (en) 2013-07-10

Similar Documents

Publication Publication Date Title
Meng et al. Dedas: Online task dispatching and scheduling with bandwidth constraint in edge computing
Meng et al. Online deadline-aware task dispatching and scheduling in edge computing
US20140040909A1 (en) Data processing systems
US20150143073A1 (en) Data processing systems
Yang et al. A framework for partitioning and execution of data stream applications in mobile cloud computing
Tan et al. Sora: high-performance software radio using general-purpose multi-core processors
Wu et al. PRAN: Programmable radio access networks
US9285793B2 (en) Data processing unit including a scalar processing unit and a heterogeneous processor unit
US20050138622A1 (en) Apparatus and method for parallel processing of network data on a single processing thread
US20100088703A1 (en) Multi-core system with central transaction control
US20120324462A1 (en) Virtual flow pipelining processing architecture
CN105247817A (en) A method, apparatus and system for a source-synchronous circuit-switched network on a chip (NoC)
US11347546B2 (en) Task scheduling method and device, and computer storage medium
CN109379303A (en) Parallelization processing framework system and method based on improving performance of gigabit Ethernet
US20140068625A1 (en) Data processing systems
US20170097854A1 (en) Task placement for related tasks in a cluster based multi-core system
JP2023501870A (en) Autonomous virtual radio access network control
CN102609307A (en) Multi-core multi-thread dual-operating system network equipment and control method thereof
Sundar et al. Communication augmented latest possible scheduling for cloud computing with delay constraint and task dependency
Ma et al. Maximizing container-based network isolation in parallel computing clusters
GB2484907A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme
GB2484906A (en) Data processing unit with scalar processor and vector processor array
Zeng et al. Scheduling coflows of multi-stage jobs under network resource constraints
GB2484899A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme
GB2484905A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme