GB2484906A - Data processing unit with scalar processor and vector processor array - Google Patents


Info

Publication number
GB2484906A
Authority
GB
United Kingdom
Prior art keywords
data
data processing
processing unit
unit
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1017751.7A
Other versions
GB201017751D0 (en)
Inventor
Ray Mcconnell
Paul Winser
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bluwireless Technology Ltd
Original Assignee
Bluwireless Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bluwireless Technology Ltd filed Critical Bluwireless Technology Ltd
Priority to GB1017751.7A (GB2484906A)
Publication of GB201017751D0
Priority to US13/880,473 (US9285793B2)
Priority to PCT/GB2011/052042 (WO2012052774A2)
Publication of GB2484906A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/324Power saving characterised by the action undertaken by lowering clock frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/329Power saving characterised by the action undertaken by task scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7814Specially adapted for real time processing, e.g. comprising hardware timers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A data processing unit comprises a scalar processor 101 and a heterogeneous processor (HPU) 102 which includes a heterogeneous controller unit (HCU) 120 and a vector processing array 122 including a plurality of vector processors 123 which may be operable in a single instruction multiple data (SIMD) configuration. The scalar processor executes a series of instructions which include instructions for execution on the HPU; when these instructions are encountered they are automatically forwarded to the HCU. The HCU includes an instruction decode unit (150; fig 9) to decode instructions and forward them to one of a number of sequencers (1550-155N; fig 9), each of which may include a FIFO instruction queue in storage (1540-154N; fig 9) for execution on a corresponding function unit. The data processing unit is for a data processing system which may include a cluster of these data processing units (521-52N; fig 5).

Description

DATA PROCESSING UNITS
The present invention relates to data processing units, for example for use in wireless communications systems.
BACKGROUND OF THE INVENTION
A simplified wireless communications system is illustrated schematically in Figure 1 of the accompanying drawings. A transmitter 1 communicates with a receiver 2 over an air interface 3 using radio frequency signals. In digital radio wireless communications systems, a signal to be transmitted is encoded into a stream of data samples that represent the signal.
The data samples are digital values in the form of complex numbers. A simplified transmitter 1 is illustrated in Figure 2 of the accompanying drawings, and comprises a signal input 11, a digital to analogue converter 12, a modulator 13, and an antenna 14. A digital datastream is supplied to the signal input 11, and is converted into analogue form at a baseband frequency using the digital to analogue converter 12. The resulting analogue signal is used to modulate a carrier waveform having a higher frequency than the baseband signal by the modulator 13. The modulated signal is supplied to the antenna 14 for transmission over the air interface 3.
At the receiver 2, the reverse process takes place. Figure 3 illustrates a simplified receiver 2 which comprises an antenna 21 for receiving radio frequency signals, a demodulator 22 for demodulating those signals to baseband frequency, and an analogue to digital converter 23 which operates to convert such analogue baseband signals to a digital output datastream 24.
Since wireless communications devices typically provide both transmission and reception functions, and since transmission and reception generally occur at different times, the same digital processing resources may be reused for both purposes.
In a packet-based system, the datastream is divided into 'Data Packets', each of which contains up to hundreds of kilobytes of data. Each data packet generally comprises:
1. A Preamble, used by the receiver to synchronise its decoding operation to the incoming signal.
2. A Header, which contains information about the packet such as its length and coding style.
3. The Payload, which is the actual data to be transferred.
4. A Checksum, which is computed from the entirety of the data and allows the receiver to verify that all data bits have been correctly received.
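The four packet sections above can be modelled as a simple frame. The following sketch is illustrative only: the field contents, the sync pattern, and the choice of CRC-32 as the checksum are assumptions, not details from this document. It shows how a receiver can verify that all data bits have been correctly received by recomputing the checksum over the entirety of the data:

```python
import zlib
from dataclasses import dataclass

@dataclass
class Packet:
    preamble: bytes   # known pattern, used for synchronisation
    header: bytes     # packet length, coding style, etc.
    payload: bytes    # the actual data to be transferred
    checksum: int     # computed from the entirety of the data

def make_packet(header: bytes, payload: bytes) -> Packet:
    preamble = b"\xAA" * 8                      # illustrative sync pattern
    body = preamble + header + payload
    return Packet(preamble, header, payload, zlib.crc32(body))

def verify(pkt: Packet) -> bool:
    # the receiver recomputes the checksum over all received bits
    body = pkt.preamble + pkt.header + pkt.payload
    return zlib.crc32(body) == pkt.checksum

pkt = make_packet(b"\x01\x10", b"hello world")
assert verify(pkt)
```

Any single corrupted bit in the preamble, header, or payload changes the recomputed CRC-32, so `verify` fails for a damaged packet.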
Each of these data packet sections must be processed and decoded in order to provide the original datastream to the receiver. Figure 4 illustrates that a packet processor 5 is provided in order to process a received datastream 24 into a decoded output datastream 58.
The different types of processing required by these sections of the packet and the complexity of the coding algorithms suggest that a software-based processing system is to be preferred, in order to reduce the complexity of the hardware. However, a pure software approach is difficult since each packet comprises a continuous stream of samples with no time gaps in between. As such, a pipelined hardware implementation may be preferred.
For multi-gigabit wireless communications, the baseband sample rate required is typically in the range of 1GHz to over 5GHz. This presents a problem when implementing the baseband processing in a digital device, since this sample rate is comparable to or higher than the clock rate of the processing circuits that are generally available. The number of processing cycles available per sample can then fall to a very low level, sometimes less than unity. Existing solutions to this problem have drawbacks as follows:
1. Run the baseband processing circuitry at high speed, equal to or greater than the sample rate: Operating CMOS circuits at GHz frequencies consumes excessive amounts of power, more than is acceptable in small, low-power, battery-operated devices. The design of such high frequency processing circuits is also very labour-intensive.
2. Decompose the processing into a large number of stages and implement a pipeline of hardware blocks, each of which perform only one section of the processing: Moving all the data through a large number of hardware units uses considerable power in the movement, in addition to the power consumed in the actual processing itself. In addition, the functions of the stages are quite specific and so flexibility in the processing algorithms is lost.
Existing solutions make use of a combination of (1) and (2) above to achieve the required processing performance.
An alternative approach is one of parallel processing; that is, to split the stream of samples into a number of slower streams which are processed by an array of identical processor units, each operating at a clock frequency low enough to ease their design effort and avoid excessive power consumption. However, this approach also has drawbacks. If too many processors are used, the hardware overhead of instruction fetch and issue becomes undesirably large, and, therefore, inefficient. If processors are arranged together into a Single Instruction Multiple Data (SIMD) arrangement, then the latency of waiting for them to fill with data can exceed the upper limit for latency, as specified in the protocol standard being implemented.
An architecture with multiple processors communicating via shared memory can have the problem of contention for a shared memory resource. This is a particular disadvantage in a system that needs to process a continual stream of data and cannot tolerate delays in processing.
SUMMARY OF THE INVENTION
According to one aspect of the present invention, there is provided a data processing unit for a data processing system, the unit comprising a scalar processor device, and a heterogeneous processor device connected to receive first instruction information from the scalar processor, and to receive incoming data items, and operable to process incoming data items in accordance with received first instruction information, wherein the heterogeneous processor device comprises a heterogeneous controller unit connected to receive instruction information from the scalar processor, and operable to output second instruction information, an instruction sequencer connected to receive instruction information from the heterogeneous controller unit, and operable to output a sequence of instructions; and a vector processor array including a plurality of vector processor elements operable to process received data items in accordance with instructions received from the instruction sequencer.
Each vector processor may include a storage unit, and the data processing unit may further comprise a data distribution unit operable to distribute data items to such storage units of the vector processors.
In one example, the vector processors have respective data storage elements associated therewith, and the data storage elements are addressable using a common addressing scheme. The common addressing scheme may also be common to storage devices external to the data processing unit.
The second instruction information may represent very long instruction words (VLIWs).
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a simplified schematic view of a wireless communications system; Figure 2 is a simplified schematic view of a transmitter of the system of Figure 1; Figure 3 is a simplified schematic view of a receiver of the system of Figure 1; Figure 4 illustrates a data processor; Figure 5 illustrates a data processor including processing units embodying one aspect of the present invention; Figure 6 illustrates data packet processing by the data processor of Figure 5; Figure 7 illustrates a processing unit embodying one aspect of the present invention for use in the data processor of Figure 5; Figure 8 illustrates the processing unit of Figure 7 in more detail; Figure 9 illustrates a scalar processing unit and a heterogeneous controller unit of the processing unit of Figure 8; Figure 10 illustrates a controller of the heterogeneous controller unit of Figure 9; and Figure 11 illustrates data processing according to another aspect of the present invention, performed by the processing unit of Figures 7 to 10.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figure 5 illustrates a data processor which includes a processing unit embodying one aspect of the present invention. Such a processor is suitable for processing a continual datastream, or data arranged as packets. Indeed, data within a data packet is also continual for the length of the data packet, or for part of the data packet.
The processor 5 includes a cluster of N physical processing units 521...52N, hereafter referred to as PPUs. The PPUs 521...52N receive data from a first data unit 51, and send processed data to a second data unit 57. The first and second data units 51, 57 are hardware blocks that may contain buffering or data formatting or timing functions. In the example to be described, the first data unit 51 is connected to transfer data with the radio sections of a wireless communications device, and the second data unit is connected to transfer data with the user data processing sections of the device. It will be appreciated that the first and second data units 51, 57 are suitable for transferring data to be processed by the PPUs 52 with any appropriate data source or data sink. In the present example, in a receive mode of operation, data flows from the first data unit 51, through the processor array to the second data unit 57. In a transmit mode of operation, the data flow is in the opposite direction; that is, from the second data unit 57 to the first data unit 51 via the processing array.
The PPUs 521...52N are under the control of a control processor 55, and make use of a shared memory resource 56. Data and control signals are transferred between the PPUs 521...52N, the control processor 55, and the memory resource 56 using a bus system 54c.
It can be seen that the workload of processing a data stream from source to destination is divided N ways between the PPUs 521...52N on the basis of time-slicing the data. Each PPU then needs only 1/Nth of the performance that a single processor would have needed. This translates into simpler hardware design, lower clock speed, and lower overall power consumption. The control processor 55 and shared memory resource 56 may be provided in the device itself, or may be provided by one or more external units.
The control processor 55 has different capabilities to the PPUs 521...52N, since its tasks are more comparable to a general purpose processor running a body of control software. It may also be a degenerate control block with no software. It may therefore be an entirely different type of processor, as long as it can perform shared memory communications with the PPUs 521...52N. However, the control processor 55 may be simply another instance of a PPU, or it may be of the same type but with minor modifications suited to its tasks.
It should be noted that the bandwidth of the radio data stream is usually considerably higher than the unencoded user data it represents. This means that the first data unit 51, which is at the radio end of the processing, operates at high bandwidth, and the second data unit 57 operates at a lower bandwidth related to the stream of user data.
At the radio interface, the data stream is substantially continual within a data packet. In the digital baseband processing, the data stream does not have to be continual, but the average data rate must match that of the radio frequency datastream. This means that if the baseband processing peak rate is faster than the radio data rate, the baseband processing can be executed in a non-continual, burst-like fashion. In practice, however, a large difference in processing rate will require more buffering in the first and second data units 51, 57 in order to match the rates, and this is undesirable both for the cost of the data buffer storage, and the latency of data being buffered for extended periods. Therefore, baseband processing should execute as near to continually as possible, and at a rate that needs to be only slightly faster than the rate of the radio data stream, in order to allow for small temporal gaps in the processing.
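The trade-off described above, between a baseband peak rate only slightly above the radio rate and the buffering needed to absorb gaps in the processing, can be sketched numerically. The simulation below is an illustrative model only; the rates, the tick granularity, and the burst pattern are assumptions, not figures from this document:

```python
def max_fifo_depth(radio_rate: float, peak_rate: float,
                   burst_len: int, idle_len: int, ticks: int = 10_000) -> float:
    """Samples arrive continually at radio_rate per tick; the baseband
    engine alternates burst_len active ticks (draining at peak_rate)
    with idle_len inactive ticks. Returns the deepest end-of-tick
    occupancy of the rate-matching FIFO."""
    depth = 0.0
    worst = 0.0
    period = burst_len + idle_len
    for t in range(ticks):
        depth += radio_rate                # the radio side never pauses
        if t % period < burst_len:         # engine active: drain the FIFO
            depth = max(0.0, depth - peak_rate)
        worst = max(worst, depth)
    return worst

# A peak rate only slightly above the radio rate, with short idle gaps,
# keeps the FIFO (and hence the buffering cost and latency) small.
small_gap = max_fifo_depth(1.0, 1.25, burst_len=8, idle_len=2)
```

A larger rate mismatch with longer idle periods lets the engine run in longer bursts, but the FIFO must then hold all the samples that arrive while the engine is idle, which is the buffering cost the passage warns about.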
In the context of Figure 5, this means that data should be near-continually streamed either to or from the radio end of the processing (to and from the first data unit 51). In a receive mode, the high bandwidth stream of near-continual data is time-sliced between the PPUs 521...52N. Consider the receiving case where high bandwidth radio sample data is being transferred from the first data unit 51 to the PPU cluster: in the simple case, a batch of radio data, being a fixed number of samples, is transferred to each PPU in turn, in round-robin sequence. This is illustrated for a received packet in Figure 6, for the case of a cluster of four PPUs.
Each PPU 521...52N receives 621, 622, 623, 624, 625, and 626 a portion of the packet data 62 from the incoming data stream 6. The received data portion is then processed 71, 72, 73, 74, 75, and 76, and output 81, 82, 83, 84, 85, and 86 to form a decoded data packet 8.
Each PPU 521...52N must have finished processing its previous batch of samples by the time it is sent a new batch. In this way, all N PPUs 521...52N execute the same processing sequence, but their execution is 'out of phase' with each other, such that in combination they can accept a continuous stream of sample data.
In this simple receive case described above, each PPU 521...52N produces decoded output user data, at a lower bandwidth than the radio data, and supplies that data to the second data unit 57. Since the processing is uniform, the data output from all N PPUs 521...52N arrives at the data sink unit 57 in the correct order, so as to produce a decoded data packet.
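The round-robin time-slicing described above can be sketched as follows. The batch size, the PPU count, and the trivial per-batch "decode" function are placeholders invented for illustration; the point of the sketch is that submitting batches in round-robin order and gathering results in submission order keeps the decoded stream correctly ordered even when the workers run out of phase with one another:

```python
from concurrent.futures import ThreadPoolExecutor

N_PPUS = 4   # cluster size, as in the Figure 6 example
BATCH = 16   # samples per batch (illustrative)

def decode(batch):                 # stand-in for a PPU's per-batch DSP work
    return [s * 2 for s in batch]  # trivial "processing" for illustration

def process_stream(samples):
    """Time-slice the stream into fixed-size batches, hand batch i to
    worker i % N in round-robin sequence, and gather results in
    submission order so the decoded stream leaves in the correct
    sequence, as the data sink requires."""
    batches = [samples[i:i + BATCH] for i in range(0, len(samples), BATCH)]
    with ThreadPoolExecutor(max_workers=N_PPUS) as pool:
        # the futures list preserves submission order, so the output is
        # reassembled correctly even if workers finish out of order
        futures = [pool.submit(decode, b) for b in batches]
        out = []
        for f in futures:
            out.extend(f.result())
    return out

stream = list(range(64))
assert process_stream(stream) == [s * 2 for s in stream]
```

Each worker only needs 1/Nth of the throughput of a single fast processor, which is the motivation given for the cluster arrangement.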
In a simple transmit mode case, this arrangement is simply reversed, with the PPUs 521...52N accepting user data from the second data unit 57 and outputting encoded sample data to the first data unit 51 for radio transmission.
The data processor includes hierarchical data networks which are designed to localise high bandwidth transactions and to maximise bandwidth with minimal data latency and power dissipation. These networks make use of an addressing scheme which is common to both the local data storage and to processor-wide data storage, such as the shared memory 56, in order to simplify the programming model.
However, wireless data processing is more complex than in the simple case described above. The processing will not always be uniform; it will depend on the section of the data packet being processed, and may depend on factors determined by the data packet itself.
For example, the Header section of a received packet may contain information on how to process the following payload. The processing algorithms may need to be modified during reception of the packet in response to degradation of the wireless signal. On the completion of receiving a packet, an acknowledgement packet may need to be immediately transmitted in response. These and other examples of more complex processing demand that the PPUs 521...52N have a flexibility of scheduling and operation that is driven by the software running on them, and not just a simple pattern of operation that is fixed in hardware.
Under this more complex processing regime, the following considerations must be taken into account: * A control process, thread or agent defines the overall tasks to be performed. It may modify the priority of tasks depending on data-driven events. It may have a list of several tasks to be performed at the same time, by the available PPUs 521...52N of the cluster.
* The data of a received packet is split into a number of sections. The lengths of the sections may vary, and some sections may be absent in some packets. Furthermore, the sections often comprise blocks of data of a fixed number of samples. These blocks of sample data are termed 'Symbols' in this description. It is highly desirable that all the data for any symbol be processed in its entirety by one PPU 521...52N of the cluster, since splitting a symbol between two PPUs would involve undue communication between the PPUs in order to process that symbol. In some cases it is also desirable that several symbols be processed together in one PPU 521...52N, for example if the Header section 61 (Figure 6) of the data packet comprises several symbols. The PPUs 521...52N must in general therefore be able to dictate how much data they receive in any given processing phase from the data source unit 51, since this quantity may need to vary throughout the processing of a packet.
* Non-uniform processing conditions could potentially result in out-of-order processed data being available from the PPUs 521...52N. In order to prevent such a possibility, a mechanism is provided to ensure that processed data are provided to the first data unit 51 (in a transmit mode) or to the second data unit 57 (in a receive mode), in the correct order.
* The processing algorithms for one section of a data packet may depend on previous sections of the data packet. This means that the PPUs 521...52N must communicate with each other about the exact processing to be performed on subsequent data. This is in addition to, and may be a modification of, the original task specified by the control process, thread, or agent.
* The combined processing power of all N PPUs 521...52N in the cluster must be at least sufficient to handle the wireless data stream in the mode that demands the greatest processing resources. In some situations, however, the data stream may require a lighter processing load, and this may result in PPUs 521...52N completing their processing of a data batch ahead of schedule. It is highly desirable that any PPU 521...52N with no immediate workload to execute be able to enter an inactive, low-power 'sleep' mode, from which it can be awoken when a workload becomes available.
The cluster arrangement provides the software with the ability for each of the PPUs 521...52N in the cluster to collectively decide the optimal DSP algorithms and modes in which the system should be placed. This reduction of the collective information is available for the lower MAC layer processing via the SDN network. This localised processing and reduction hierarchy provides the MAC with the optimal level of control of the PHY DSP.
A PPU is illustrated in Figure 7, and comprises a scalar processor unit 101 (which could be a 32-bit processor) closely connected with a heterogeneous processor unit (HPU) 102. High bandwidth real time data is coupled directly into and out of the HPU 102, via a system data network (SDN) 106a and 106b (54a and 54b in Figure 5). Scalar processor data and control data are transferred using a PPU-SMP (PPU symmetrical multiprocessor) network PSN 104 (54c in Figure 5).
Data are substantially continually dispatched, in real time, into the HPU 102, in sequence via the SDN 106a, and are then processed. Processed data exit from the HPU 102 on the SDN 106b.
The scalar processor unit 101 operates by executing a series of instructions defined in a high level program. Embedded in this program are specific coprocessor instructions that are customised for computation within the HPU 102. The scalar unit 101 is connected in such a way that these coprocessor instructions are routed automatically to a heterogeneous controller unit (HCU) (120 in Figure 8) within the HPU 102, which handles control of the HPU 102.
Figure 8 illustrates the processing unit of Figure 7 in more detail. The scalar processor unit 101 comprises a scalar processor 110, a data cache 111 for temporarily storing data to be transferred with the PPU-SMP network 104, 105, and a co-processor interface 112 for providing interface functions to the heterogeneous processor unit 102.
The HPU 102 comprises the heterogeneous controller unit (HCU) 120 for directly controlling a number of heterogeneous function units (HFUs) and a number of connected hierarchical data networks. The total number of HFUs in the HPU 102 is scalable depending on required performance. The HFUs are provided by a number of different processing units, as described below.
The HPU 102 contains a programmable vector processor array (VPA) 122 which comprises a plurality of vector processor units (VPUs) 123. The HPU also includes a number of fixed function Accelerator Units (AUs) 140a, 140b, and a number of memory to memory DMA (direct memory access) units 135, 136. The VPA, AUs, and DMA units provide the HFUs mentioned above.
The HCU 120 is shown in more detail in Figure 9, and comprises an instruction decode unit 150, which is operable to decode (at least partially) instructions and to forward them to one of a number of parallel sequencers 1550...155N, each controlling its own heterogeneous function unit (HFU). Each sequencer has storage 1540...154N for a number of queued dispatched instructions ready for execution in a local dispatch FIFO buffer. Using a chosen selection from a number of synchronous status signals (SSS), each HFU sequencer can trigger execution of the next queued instruction stored in another HFU dispatch FIFO buffer.
Referring back to Figure 8, the VPA 122 comprises a plurality of vector processor units (VPUs) 123 arranged in a single instruction multiple data (SIMD) parallel processing architecture. Each VPU 123 comprises a vector processor element (VPE) 130 which includes a plurality of processing elements (PEs) 1301...1304. The PEs in a VPE are arranged in a SIMD-within-a-register configuration (known as a SWAR configuration). The PEs have a high bandwidth data path interconnect function unit so that data items can be exchanged within the SWAR configuration between PEs.
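A minimal software model of this dispatch scheme might look as follows. The instruction encoding, the unit names, and the queue behaviour are invented for illustration and are not taken from this document; the sketch only shows a decode unit routing each instruction to a per-HFU sequencer that drains its own FIFO independently and in order:

```python
from collections import deque

class Sequencer:
    """One per heterogeneous function unit (HFU): holds a FIFO of
    dispatched instructions and executes them on its function unit."""
    def __init__(self, name):
        self.name = name
        self.fifo = deque()          # the local dispatch FIFO buffer

    def dispatch(self, instr):
        self.fifo.append(instr)      # queue the decoded instruction

    def step(self):
        # execute (here: return) the next queued instruction, if any
        return self.fifo.popleft() if self.fifo else None

class HCU:
    """Decodes incoming coprocessor instructions and forwards each to
    the sequencer for the HFU it targets (a simplified stand-in for
    the instruction decode unit 150 and the sequencers 155)."""
    def __init__(self, hfu_names):
        self.seq = {n: Sequencer(n) for n in hfu_names}

    def issue(self, instr):
        target, op = instr.split(":", 1)   # e.g. "vpa:mul" -> VPA sequencer
        self.seq[target].dispatch(op)

hcu = HCU(["vpa", "dma", "au0"])
for instr in ["dma:load", "vpa:mul", "vpa:add", "dma:store", "au0:fft"]:
    hcu.issue(instr)
# each sequencer now drains its own queue independently and in order
assert hcu.seq["vpa"].step() == "mul"
assert hcu.seq["dma"].step() == "load"
```

The synchronous status signals of the real design, by which one sequencer triggers the next instruction in another unit's FIFO, would be an extra cross-sequencer handshake on top of this structure.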
Each VPE 130 is closely coupled to a VPU partitioned data memory (VPU-PDM) 132 subsystem via an optimised high bandwidth VPU network (VPUN) 131. The VPUN 131 is optimised for data movement operations into the localised VPU-PDM 132, and to various other localised networks. One such localised data network is an Accelerator Data Network (ADN) 139 which is provided in order to allow data to be transferred between the VPUs 123 and the AUs 140a, 140b.
The VPE 130 addresses its local VPU-PDM 132 using an address scheme that is compatible with the overall hierarchical address scheme. The VPE 130 uses a vector SIMD address (VSA) to transfer data with its local VPU-PDM 132. A VSA is supplied to all of the VPUs 123 in the VPA 122, such that all of the VPUs access respective local memory with the same address. A VSA is an internal address which allows addressing of the VPU-PDM only, and does not specify which HFU or VPE is being addressed.
Adding additional address bits to the basic VSA forms a heterogeneous MIMD address (HMA). An HMA identifies a memory location in a particular heterogeneous function unit (HFU), and again is compatible with the overall system-level addressing scheme. HMAs are used to address specific memory in a specific HFU of a PPU 52.
The VSA and HMA are compatible with the overall system addressing scheme, which means that in order to address a memory location inside an HFU of a particular PPU, the system merely adds PPU-identifying bits to an HMA to produce a system-level address for accessing the memory concerned. The resulting system-level address is unique in the system-level addressing scheme, and is compatible with other system-level addresses, such as those for the local shared memory 56.
Each PPU has a unique address range within the system-level addresses.
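The layered addressing scheme described above can be sketched as simple bit-field composition: HFU-identifying bits are prepended to a VSA to form an HMA, and PPU-identifying bits are prepended to an HMA to form a system-level address. The field widths below are invented for illustration; the document does not specify actual sizes:

```python
# Illustrative bit widths only; not specified in the source document.
VSA_BITS = 16   # addresses within one VPU-PDM
HFU_BITS = 4    # selects one HFU within the HPU
PPU_BITS = 6    # selects one PPU within the system

def hma(hfu_id: int, vsa: int) -> int:
    """Add HFU-identifying bits to a vector SIMD address (VSA) to
    form a heterogeneous MIMD address (HMA)."""
    assert 0 <= vsa < (1 << VSA_BITS) and 0 <= hfu_id < (1 << HFU_BITS)
    return (hfu_id << VSA_BITS) | vsa

def system_address(ppu_id: int, hfu_id: int, vsa: int) -> int:
    """Add PPU-identifying bits to an HMA to form a unique
    system-level address for that memory location."""
    assert 0 <= ppu_id < (1 << PPU_BITS)
    return (ppu_id << (HFU_BITS + VSA_BITS)) | hma(hfu_id, vsa)

# the same VSA in two different PPUs yields two distinct system-level
# addresses, while the low VSA field is identical in both
a = system_address(ppu_id=1, hfu_id=2, vsa=0x0100)
b = system_address(ppu_id=3, hfu_id=2, vsa=0x0100)
assert a != b and (a & 0xFFFF) == (b & 0xFFFF) == 0x0100
```

This is why a single VSA can be broadcast to all VPUs in the VPA (each accesses its own local memory at the same offset), while DMA engines that carry the full system-level address can move data anywhere in the hierarchy.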
Since all the HFUs are uniquely addressable, and have access to all other HFUs and PDMs in the HPU 102, stored data items are uniquely addressable, and, therefore, can be moved amongst these units using direct memory access (DMA).
DMA units 135, 136 are provided, and are arranged such that they may be programmed by the HCU 120, as are the other HFUs, using instructions dispatched from the SU 101 and specifically targeted at each unit individually. The DMA units 135, 136 can be programmed to add the appropriate address fields so that data can automatically be moved through the hierarchies.
Since the DMA units in the HPU 102 use HMAs, they can be instructed by the HCU 120 to move data between the various HFUs, PDMs and the SDN network. A parallel pipeline of sequential computational tasks can then be routed seamlessly through the HFUs by executing a series of DMA instructions, followed by execution of appropriate HFU instructions. Thus, these instruction pipelines run autonomously and concurrently.
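The pipeline routing described above can be sketched schematically as follows. This is an illustrative model only: the unit names and instruction strings are invented for the example, and in hardware the queues drain concurrently rather than under a software loop.

```python
from collections import deque

# One dispatch queue per unit (hypothetical unit names).
queues = {"DMA": deque(), "VPU": deque(), "LDPC": deque()}

# A chain of tasks issued up front by the controller: DMA moves
# interleaved with compute instructions for the function units.
program = [
    ("DMA",  "move samples: PDM -> VPU"),
    ("VPU",  "vector filter"),
    ("DMA",  "move results: VPU -> LDPC"),
    ("LDPC", "decode"),
]
for unit, instr in program:      # dispatch phase: fast and sequential
    queues[unit].append(instr)

# The DMA queue carries the data-movement half of the pipeline; each
# unit then executes its own queue autonomously.
assert len(queues["DMA"]) == 2
```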
The DMA units 135, 136 are managed explicitly by the HCU 120 with respective HFU dispatch FIFO buffers (as is the case for the VPU's FDM). The DMA units 135, 136 can be integrated into specific HFUs, such as the accelerator units 140a, 140b, and can share the same dispatch FIFO buffer as that HFU.
Instructions are issued to the VPA 122 in the form of Very Long Instruction Word (VLIW) microinstructions by a vector micro-coded controller (VMC) within the instruction decode unit 150 of the HCU 120. The VMC is shown in more detail in Figure 10, and includes an instruction decoder 181, which receives instruction information 180. The instruction decoder 181 derives instruction addresses from received instruction information, and passes those derived addresses to an instruction descriptor store 182. The instruction descriptor store 182 uses the received instruction addresses to access a store of instruction descriptors, and passes the descriptors indicated by the received instruction addresses to a code sequencer 183. The code sequencer 183 translates the instruction descriptors into microcode addresses for use by a microcode store 184. The microcode store 184 forms multi-cycle VLIW micro-sequenced instructions defined by the received microcode addresses, and outputs the completed VLIW 186 to the sequencer 155 (Figure 9) appropriate to the HFU being instructed. The microcode store can be programmed to expand such VLIWs into a long series of repeated vectorised instructions that operate on sequences of addresses in the VPU-PDM 132. The VMC is thus able to achieve significant parallel efficiency of control, and thereby reduce the instruction bandwidth required from the PPU SU 101.
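The VMC stages just described can be modelled in miniature as follows. This is a sketch under stated assumptions: the descriptor contents, microcode strings and repeat mechanism are hypothetical, and serve only to show how one dispatched instruction can expand into a repeated sequence of vectorised operations over successive VPU-PDM addresses.

```python
# Hypothetical descriptor store: maps a decoded instruction to a
# microcode address plus a repeat count (both invented for the example).
descriptor_store = {"fft_pass": {"ucode_addr": 0x10, "repeats": 4}}

# Hypothetical microcode store: maps a microcode address to a VLIW body.
microcode_store = {0x10: "vec_load; vec_mul; vec_store"}

def expand(instruction: str) -> list:
    desc = descriptor_store[instruction]          # descriptor store lookup
    ucode = microcode_store[desc["ucode_addr"]]   # code sequencer -> microcode
    # Emit one VLIW per repeat, each advancing the VPU-PDM address,
    # so a single dispatched instruction drives a multi-cycle sequence.
    return [f"{ucode} @pdm+{i}" for i in range(desc["repeats"])]
```

Expanding in the controller rather than the scalar processor is what reduces the instruction bandwidth: one short instruction from the SU stands in for the whole repeated sequence.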
In order to ensure that instructions for a specific HFU only execute on data after the previous computation, or a DMA operation, has terminated, a selection of synchronous status signals (SS signals) is provided that is used to indicate the status of execution of each HFU to other HFUs. These signals are used to start execution of an instruction that has been halted in another HFU's instruction dispatch FIFO buffer. Thus, one HFU can be caused to await the end of processing of an instruction in another HFU before commencing its own instruction processing.
The selection of which synchronous status to use is under program control, and the status is passed as one of the parameters with the instruction for the specific HFU. This allows many instructions to be dispatched into HFU dispatch FIFO buffers ahead of the execution of that instruction, and guarantees that each stage of processing will wait until the data is ready for that HFU. Since the vector instructions in the HFUs can last many cycles, the instruction dispatch time is likely to be very short compared to the actual execution time. Since many instructions can wait in each HFU dispatch FIFO buffer, the HFUs can execute concurrently and optimally without the need for interaction with the SU 101 or any other HFU, except for the arrival of the data to be processed.
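The status-gated dispatch described in the two preceding paragraphs can be sketched as follows. This is an assumed model, not the patent's implementation: the signal name and instruction tuple format are invented, and the point illustrated is only that an instruction carries the status it must wait on, so it halts at the head of its FIFO until the producing unit signals completion.

```python
from collections import deque

# Synchronous status signals (hypothetical name): set by a producer
# unit when its operation terminates.
status_done = {"DMA_0": False}

# Each queued instruction carries the status signal it waits on.
fifo = deque()
fifo.append(("process_block", "DMA_0"))

def try_execute(fifo, log):
    while fifo:
        op, wait_on = fifo[0]
        if wait_on and not status_done.get(wait_on):
            return                  # halted at head of FIFO until data ready
        fifo.popleft()
        log.append(op)              # stand-in for multi-cycle execution

log = []
try_execute(fifo, log)        # nothing runs: DMA_0 has not completed
status_done["DMA_0"] = True   # the DMA unit signals completion
try_execute(fifo, log)        # the halted instruction now proceeds
```

Because the wait is resolved locally in each FIFO, dispatch can run far ahead of execution without any round trip to the scalar unit.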
A group of synchronous status signals is connected to the SU 101, both via interrupt mechanisms and via an HPU status (HPU-STA) signal. This provides synchronisation between SU 101 processes and the HFUs. These are collectively known as SU-SS signals.
Another group of synchronous status signals is connected to the SDN and PSN network interfaces. This provides synchronisation across the SoC, such that system-wide DMAs can be made synchronous with the HPU.
Another group of synchronous status signals is connected to programmable timer hardware 153, both local and global to the SoC. This provides a method for accurately timing the start of a processing task, and for controlling DMA of data around the SoC.
Some of the synchronous status signals can be programmed to map onto the HPU power saving controls (HPU-PSC). These signals can switch the VPUs and AUs on and off, in various power saving modes, as data is passed to them. They are used for fine-grained control of power saving modes.
A combination of FFT accelerator units, LDPC accelerator units and vector processor units is used to optimally offload different sequential stages of computation of an algorithm to the appropriate optimised HFU. Thus the HFUs that constitute the HPU 102 operate automatically and optimally on data, in the strict sequential manner described by a software program created using conventional software tools.
The status of the HPU 102 can be read back using instructions issued through the co-processor interface (CPI) 112. Depending on which instructions are used, various status conditions can be returned to the SU 101 to direct the program flow of the SU 101.
Figure 11 illustrates an example instruction sequence which demonstrates the sequential order of instruction dispatch into parallel execution unit FIFO buffers, and the subsequent sequential chaining of parallel execution units using the synchronous status.
An entire code sequence is dispatched into the heterogeneous function unit dispatch FIFO buffers before execution completes. This frees the SU 101 to proceed with other housekeeping operations in parallel with execution by the HPU 102.
The code sequence of Figure 11 illustrates optimisation of the execution order. The first sequence illustrates an execution order without optimisation, whilst sequence 202 illustrates that optimisation can be achieved by moving two lines of code (VPU_MACRO_0). Execution time is significantly reduced as the VPU 123 becomes fully utilised. The ease with which this re-ordering is applied is due to the linking of the instructions with the synchronous status.

Claims (10)

  1. A data processing unit for a data processing system, the unit comprising: a scalar processor device; and a heterogeneous processor device connected to receive first instruction information from the scalar processor, and to receive incoming data items, and operable to process incoming data items in accordance with received first instruction information, wherein the heterogeneous processor device comprises: a heterogeneous controller unit connected to receive instruction information from the scalar processor, and operable to output second instruction information; an instruction sequencer connected to receive instruction information from the heterogeneous controller unit, and operable to output a sequence of instructions; and a vector processor array including a plurality of vector processor elements operable to process received data items in accordance with instructions received from the instruction sequencer.
  2. A data processing unit as claimed in claim 1, wherein the heterogeneous controller unit is operable to construct a second instruction information portion from a plurality of first instruction information portions.
  3. A data processing unit as claimed in claim 1 or 2, wherein each vector processor includes a storage unit, and wherein the data processing unit further comprises a data distribution unit operable to distribute data items to the storage units of the vector processors.
  4. A data processing unit as claimed in any one of the preceding claims, wherein the vector processors have respective data storage elements associated therewith, the data storage elements being addressable using a common addressing scheme.
  5. A data processing unit as claimed in claim 4, wherein the common addressing scheme is also common to storage devices external to the data processing unit.
  6. A data processing unit as claimed in any one of the preceding claims, wherein the second instruction information represents very long instruction words (VLIWs).
  7. A data processing unit as claimed in any one of the preceding claims, further comprising at least one additional accelerator unit.
  8. A data packet processing system comprising a data processing unit as claimed in any one of the preceding claims.
  9. A wireless communications device comprising an RF transceiver, and a data processing unit as claimed in any one of claims 1 to 7, operable to transfer data packets with the RF transceiver.
  10. A device substantially as hereinbefore described with reference to, and as shown in, Figures 5 to 11 of the accompanying drawings.
GB1017751.7A 2010-10-21 2010-10-21 Data processing unit with scalar processor and vector processor array Withdrawn GB2484906A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1017751.7A GB2484906A (en) 2010-10-21 2010-10-21 Data processing unit with scalar processor and vector processor array
US13/880,473 US9285793B2 (en) 2010-10-21 2011-10-20 Data processing unit including a scalar processing unit and a heterogeneous processor unit
PCT/GB2011/052042 WO2012052774A2 (en) 2010-10-21 2011-10-20 Data processing units

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1017751.7A GB2484906A (en) 2010-10-21 2010-10-21 Data processing unit with scalar processor and vector processor array

Publications (2)

Publication Number Publication Date
GB201017751D0 GB201017751D0 (en) 2010-12-01
GB2484906A true GB2484906A (en) 2012-05-02

Family

ID=43334137

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1017751.7A Withdrawn GB2484906A (en) 2010-10-21 2010-10-21 Data processing unit with scalar processor and vector processor array

Country Status (1)

Country Link
GB (1) GB2484906A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2136172A (en) * 1983-03-02 1984-09-12 Hitachi Ltd Vector processor
US4685076A (en) * 1983-10-05 1987-08-04 Hitachi, Ltd. Vector processor for processing one vector instruction with a plurality of vector processing units
JPH01318161A (en) * 1988-06-20 1989-12-22 Hitachi Ltd Vector processor
US5197130A (en) * 1989-12-29 1993-03-23 Supercomputer Systems Limited Partnership Cluster architecture for a highly parallel scalar/vector multiprocessor system
EP1011052A2 (en) * 1998-12-15 2000-06-21 Nec Corporation Shared memory type vector processing system and control method thereof
US20060176308A1 (en) * 2004-11-15 2006-08-10 Ashish Karandikar Multidimensional datapath processing in a video processor
WO2007018467A1 (en) * 2005-08-11 2007-02-15 Coresonic Ab Programmable digital signal processor having a clustered simd microarchitecture including a complex short multiplier and an independent vector load unit


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2539407A (en) * 2015-06-15 2016-12-21 Bluwireless Tech Ltd Data processing
GB2539409A (en) * 2015-06-15 2016-12-21 Bluwireless Tech Ltd Data processing
GB2539409B (en) * 2015-06-15 2017-12-06 Bluwireless Tech Ltd Data processing
GB2539407B (en) * 2015-06-15 2017-12-06 Bluwireless Tech Ltd Data processing
EP3373152A1 (en) * 2017-03-09 2018-09-12 Google LLC Vector processing unit
WO2018164730A1 (en) * 2017-03-09 2018-09-13 Google Llc Vector processing unit
US10261786B2 (en) 2017-03-09 2019-04-16 Google Llc Vector processing unit
US10915318B2 (en) 2017-03-09 2021-02-09 Google Llc Vector processing unit
US11016764B2 (en) 2017-03-09 2021-05-25 Google Llc Vector processing unit
US11520581B2 (en) 2017-03-09 2022-12-06 Google Llc Vector processing unit

Also Published As

Publication number Publication date
GB201017751D0 (en) 2010-12-01

Similar Documents

Publication Publication Date Title
US9285793B2 (en) Data processing unit including a scalar processing unit and a heterogeneous processor unit
US20140040909A1 (en) Data processing systems
US20150143073A1 (en) Data processing systems
Meng et al. Dedas: Online task dispatching and scheduling with bandwidth constraint in edge computing
US8010593B2 (en) Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements
US6079008A (en) Multiple thread multiple data predictive coded parallel processing system and method
US8667252B2 (en) Method and apparatus to adapt the clock rate of a programmable coprocessor for optimal performance and power dissipation
US7353516B2 (en) Data flow control for adaptive integrated circuitry
US8296764B2 (en) Internal synchronization control for adaptive integrated circuitry
US11281506B2 (en) Virtualised gateways
US20140068625A1 (en) Data processing systems
KR20210029725A (en) Data through gateway
US8341394B2 (en) Data encryption/decryption method and data processing device
GB2484906A (en) Data processing unit with scalar processor and vector processor array
GB2484903A (en) Power saving in a data processing unit with scalar processor, vector processor array, parity and FFT accelerator units
GB2484900A (en) Data processing unit with scalar processor, vector processor array, parity and FFT accelerator units
GB2484902A (en) Data processing system with a plurality of data processing units each with scalar processor, vector processor array, parity and FFT accelerator units
CN112673351A (en) Streaming engine
GB2484901A (en) Data processing unit with scalar processor, vector processor array, parity and FFT accelerator units
GB2484907A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme
GB2484899A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme
GB2484905A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme
GB2484904A (en) Data processing system with a plurality of data processing units and a task-based scheduling scheme
US20240104049A1 (en) Operation result broadcasting solutions for programmable processing array architectures
EP1760579B1 (en) Data processor unit for high-throughput wireless communications

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)