CA2220993A1

CA2220993A1 - Single-instruction-multiple-data processor

Info

Publication number: CA2220993A1
Application number: CA 2220993
Authority: CA
Inventors: Paul Marriott; Tahar Ali Yahia; Qunshan Gu; Yvon Savaria
Original assignee: Ecole Polytechnique de Montreal
Current assignee: Ecole Polytechnique de Montreal
Priority date: 1997-11-07
Filing date: 1997-11-07
Publication date: 1999-05-07

Abstract

A single-instruction multiple-data array processor optimised for both linear and non-linear signal processing is disclosed. The array processor includes processing elements each with three data input ports, three data output ports, a processing unit, an output selection unit, two comparators, and a selection control unit. The processing elements are capable of providing several data values at each of several output ports and in so doing, support execution of non-linear operations such as sort within a single instruction execution cycle. The processor is also provided with two shift register channels for flexible data organization concurrent with data processing. The two shift register channels are complemented by an inter-processing element communication channel. In use, the comparators set flags for controlling selection of a data value to provide to each of the output ports.

Description

Single-Instruction-Multiple-Data Processor

Field of the Invention

The present invention relates generally to parallel processors and more particularly to single-instruction multiple-data (SIMD) processor architectures for high-speed digital signal processing supporting linear and non-linear processing.

Background of the Invention

Parallel processing is a well-known technique for providing efficient solutions in computationally intensive applications. Examples of applications employing parallel processing include image/video processing, neural network simulation, and artificial intelligence. In the past, two main approaches to parallel processing have been proposed and implemented: multiple-instruction multiple-data (MIMD) and SIMD. Each has advantages and disadvantages.

MIMD processors are more complicated than SIMD processors due to execution of multiple instructions simultaneously. In effect, a MIMD processor is a plurality of independent processors all executing programs concurrently. This allows for maximum flexibility at the expense of complexity in inter-processor communication design and programming. Much of the programming complexity results from meeting inter-processor communication and synchronisation requirements. Many mechanisms have been proposed for determining whether or not an individual processing element has completed a task, often requiring significant complexity.

On the other hand, SIMD processors offer a simple and efficient solution for parallel processing. A SIMD processor executes a single program using a single controller for all processing elements, each processing element receiving a same instruction sequence. In order to obtain a same peak theoretical performance, SIMD processors occupy significantly less circuit area than similar MIMD devices; however, SIMD processors require suitable application and algorithm mappings. Because at any time all processing elements within a SIMD processor execute a same instruction provided by the single controller, ensuring proper multiple-data in, and proper multiple-data out of the multiple processing elements through execution of a single-instruction is critical in SIMD architecture design and programming. Ideally, data transfer and communications take place in parallel to computations within the processing elements. This poses a very significant problem that has heretofore been addressed in a variety of ways as discussed below.

Another very significant problem with SIMD processors is evident when executing conditional branch instructions, which are very common instructions in data processing. Because each processor executes a same instruction, upon execution of a conditional branch, only those processors that meet the specified condition are required to execute the subsequent instruction. A common prior art solution is to disable the unused processing elements within a SIMD processor in order to support conditional branch instructions. In such a case, SIMD processors lose a great deal of computation power, since each processing element is not being used for each of the branches. For example, when a simple instruction sequence for setting X to 5 if y>3 or setting X to 1 if y<=3 is executed, when only one processing element determines that y>3, then all other processing elements lose a cycle while X within that processing element is set to 5. Since all elements receive the same instruction, in order to handle instructions of the form:

    If y>3
        X=5
    Else
        X=1

a SIMD architecture requires disabling of some processing elements according to predefined rules, while executing local data dependent branches. It will be evident to those of skill in the art that since many conditional branch instructions are required, each processing element is often disabled for at least one instruction execution cycle during execution of conditional branch instructions. The term instruction execution cycle, as used in the disclosure and claims that follow, indicates the time required for execution of a single instruction. For SIMD processors, the instruction execution cycle is the longest time required by any of the parallel processing elements to execute a given instruction.
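The cost of the disable-based approach described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function name `masked_branch`, the three-step schedule, and the idle-slot counter are assumptions made for illustration only. Every processing element receives both broadcast assignments and sits idle for the one that does not match its own condition.

```python
def masked_branch(ys):
    """Lock-step evaluation of 'If y>3 X=5 Else X=1' using enable masks.

    Hypothetical model: one list entry per processing element.
    Returns the X values and the number of idle element-cycles.
    """
    n = len(ys)
    xs = [None] * n
    idle_slots = 0

    # Step 1: every element evaluates the condition and sets its mask flag.
    mask = [y > 3 for y in ys]

    # Step 2: 'X=5' is broadcast; elements whose mask is clear are disabled.
    for i in range(n):
        if mask[i]:
            xs[i] = 5
        else:
            idle_slots += 1

    # Step 3: 'X=1' is broadcast; elements whose mask is set are disabled.
    for i in range(n):
        if not mask[i]:
            xs[i] = 1
        else:
            idle_slots += 1

    return xs, idle_slots


if __name__ == "__main__":
    print(masked_branch([1, 2, 7, 3]))
```

In this model each element is idle for exactly one of the two broadcast assignments, so the idle count always equals the number of processing elements, whatever the data.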

It would be advantageous to improve communications to, from and between processors in a SIMD architecture. Such improved communication would allow more efficient use of the processing elements and would increase SIMD processor flexibility.

It would be advantageous to eliminate a need to execute conditional instruction sequences in a SIMD processor for at least some non-linear operations.

In the past, linear methods based on multiply-accumulate and multiply-add operations have played an important role in digital signal processing; however, advances in the field of digital signal processing for use in image processing require a variety of non-linear processing steps, which heretofore are executed through conditional instruction sequences.

For example, rank-order and median filtering require efficient sorting of data elements; and motion estimation and pattern matching require determination of a minimum or maximum value and an associated index of a vector for an array of data. Similar algorithms are also used in the fields of image/video coding, computer vision, artificial intelligence, pattern recognition, neural network simulation, object tracking, etc. There are also data manipulation operations that require non-linear processing. Examples of data manipulation operations of this type include: clipping a data value such that it falls between an upper and a lower bound, a core function which sets a value to 0 only when certain conditions are met, and many other non-linear operations.
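For readers unfamiliar with these operations, the following Python sketch shows what each one computes when written as comparisons followed by a selection. The function names are hypothetical, and the coring condition (magnitude below a threshold) is an assumption, since the text only says "certain conditions".

```python
def clip(value, lower, upper):
    """Clamp value into [lower, upper]: two comparisons, then a selection."""
    below = value < lower
    above = value > upper
    return lower if below else (upper if above else value)


def core(value, threshold):
    """Coring: force small magnitudes to 0 (the threshold rule is an assumption)."""
    return 0 if abs(value) < threshold else value


def min_with_index(vector):
    """Return the minimum value of a vector and the index of its first occurrence."""
    best_index = 0
    for i in range(1, len(vector)):
        if vector[i] < vector[best_index]:
            best_index = i
    return vector[best_index], best_index


if __name__ == "__main__":
    print(clip(300, 0, 255), core(2, 3), min_with_index([9, 4, 7, 4]))
```

On a conventional processor each of these selections is typically a compare followed by a conditional branch, which is the pattern the surrounding text identifies as costly on SIMD hardware.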

These non-linear operations have been commonly implemented by compare and conditional execution instructions as described above. Many instruction execution cycles are often required to implement a very simple non-linear operation and conditional execution. On SIMD processors, heretofore, this required disabling of some processing elements. Due to the above disadvantages, conventional processors, including conventional SIMD processors, do not meet the needs for non-linear processing such as median filtering required in demanding image/video processing applications.

Prior Art

In U.S. Patent 5,361,370 issued November 1, 1994 to David L. Sprague et al. a SIMD processor having dual-ported local memory architecture for simultaneous data transmission on local memory ports and global ports is disclosed. Further, a method is provided for data transfers between local memory and global memory using dual-ported local memory. One port of the local memory is dedicated to data transfers, permitting each access to global memory to be overlapped in time with instruction processing. Though the increased data transfer capabilities overcome known problems in SIMD processor design, such a system does not significantly reduce conditional instruction execution time. Also, the implementation disclosed in U.S. Patent 5,361,370 uses two dedicated ports and, thereby, provides limited flexibility for improving communications within such an architecture.

In U.S. Patent 5,430,854 issued July 4, 1995 in the name of David L. Sprague et al. a method is disclosed for efficient use of parallel data-paths when executing an instruction sequence having conditional instruction sequences, which are referred to as conditionals in said reference. At least two conditional outcomes are sequentially determined according to the instruction sequence. A respective mask flag is set for each outcome, wherein the mask flag is effective to determine whether to execute an instruction or to idle during a selected instruction cycle. Though such an approach improves efficiency of conditional instruction execution on a SIMD processor, it requires disabling some processing elements.

In "Augmenting Ada for SIMD Parallel Processing" (IEEE Trans. on Software Engineering, Vol. SE-11, No. 9, September, 1985), C.L. Cline et al. propose extensions to the Ada programming language to allow specification of operations such as inter-processor communication and activation of particular groups of processing elements. Such an extension is useful with the invention of Sprague described above. In "Efficient masking techniques for large scale SIMD architectures" (Third Symposium on the Frontiers of Massively Parallel Computations, IEEE Computer Society Press, Oct. 1990, pp. 259-64), W.G. Nation et al. disclose various masking techniques for use in optimising conditional instruction execution. Again these are based on an architecture similar to that set out by Sprague et al. in the above-mentioned patent.

In U.S. Patent 5,253,308 issued on October 12, 1993 in the name of William K. Johnson, a massively parallel digital-image data processor is disclosed. The processor is arranged in the form of a two-dimensional matrix wherein relative indexed addressing among the processing elements is employed. Image data may be accessed by and shared among all processing elements. A controller provides instructions to all the processing elements. Each processing element employs a triple-ported register for internal memory, which may input and output data independently and simultaneously. Unfortunately, the architecture does not eliminate processor idling during conditional instruction execution. The use of triple-ported memories and indexed addressing increases complexity in an attempt to enhance communication within the SIMD architecture. Though such an implementation is advantageous, it would be preferable to provide communication flexibility with less complexity.

In U.S. Patent 4,380,046 issued on April 12, 1983 in the name of Robert A. Frosch et al. a massively parallel processor computer is disclosed employing a large number of parallel processing elements. The elements operate under control of a single set of instructions simultaneously and independently on single bit slices of a corresponding array of incoming data streams. Data flow between the processing elements, the controller and peripheral devices is managed by a program and data management unit in the form of a general purpose computer having N-bit input and N-bit output data registers. The architecture of each processing element is formed of three basic components: an arithmetic logic and routing unit (ALRU), an I/O unit and a local memory unit. A bi-directional data bus interconnects all units. Such an architecture also fails to reduce processor element idling during conditional instruction sequences.

In the CNAPS Data Book (Document release: 2.0, March 1, 1995) from Adaptive Solutions, Inc. of Beaverton, Oregon detailed information on the CNAPS chip set is disclosed including information on the CNAPS sequencer (CSC), the CNAPS-1016, and the CNAPS-1064. The CNAPS-1016 contains 16 processing nodes in a linear array and the CNAPS-1064 contains 64 processing nodes in a linear array. The CSC and one or multiple CNAPS-1016 or CNAPS-1064 chips form a SIMD processor. In the CNAPS architecture, the CSC controls flow of instructions and data to and from a CNAPS array (CNAPS-1016 or CNAPS-1064) through three global buses: input bus (8-bit), command bus (32-bit), and output bus (8-bit). All active processing nodes receive the same data on the input bus, while the CSC reads output data from the output bus for one processing element at a time. The processing elements are connected by a 4-bit inter processing element bus. As with prior SIMD architectures, conditional instruction execution is inefficient using the CNAPS chip set. Also, the communication supported by the CNAPS chip set is less than that desirable for more efficient use of SIMD processor capabilities. As such, it would be preferable to more efficiently communicate to, from and between processing elements.

The TMS320C80x (MVP) Online Reference (Release 1.10, 1995) from Texas Instruments Incorporated discloses detailed information on the TMS320C80x multimedia video processor. The TMS320C80x is a parallel processor, which contains one 32-bit RISC master controller with an 80-Mflop IEEE floating point unit, four 32-bit advanced DSP (ADSP) processors, a transfer controller and a video controller. Each ADSP has a three-input ALU that supports all 256 Boolean combinations of three inputs and many combinations of arithmetic and Boolean functions. Such a MIMD processor exemplifies some advantages of MIMD processors. Unfortunately, these advantages, heretofore, have been achieved at significant expense in terms of circuit area, which in turn is reflected in additional cost.

The ADSP-2106x SHARC User's Manual (First edition, March 1995) of Analog Devices, Inc. discloses detailed information regarding the ADSP-2106x SHARC processor (SHARC). The SHARC is not an array processor; however, multiple SHARCs can be used to form an array processor organised as either a MIMD processor or a SIMD processor. The SHARC processor supports some non-conventional instructions such as a minimum value derived from two data input values, a maximum value derived from two data input values, etc. These instructions are implemented in a single cycle without conditional branching, but they are restricted to two data inputs and one data output. As such, they are not easily adapted to use in image processing and do not address most of the conditional instruction sequences encountered by SIMD processors.

In "A236 Parallel Digital Signal Processor Chip Reference Manual" (Version 3.1.2, October 24, 1995) from Oxford Computer, Inc., detailed information relating to the A236 processor architecture and instruction set is disclosed. The A236 processor comprises one 24-bit scalar processor, 4 SIMD processing elements, and associated communication channels. A crossbar switch through a data cache implements data sharing among the 4 processing elements, somewhat similar to the implementation in the TMS320C80x. Using a crossbar switch to implement data sharing among processing elements works well for a small number of processing elements and is therefore limited in application. For example in image processing, where a number of parallel processing elements can easily exceed 1,000, such a crossbar switch is impracticable.

In NIST Technical Note 1288, "Video Processing With the Engine at NIST" by Bruce F. Field and Charles Fenimore the architecture of the Princeton Engine is disclosed. The Princeton Engine is a massively parallel supercomputer using a SIMD architecture. Different channels are used for data input, output, and inter-processor communications. The first channel of the Princeton Engine is an input shift-register. A scan line of video data is serially moved into the input shift-register. Once a line is shifted into the input shift-register, the samples of the line are moved in parallel directly to processing elements. Each processing element operates on one pixel in a line. The second channel of the Princeton Engine comprises output registers, which are used to hold output results. An Output Timing Sequence (OTS) facility is used to select the output registers for data output. The inter-processor communication is supported by the inter-processor communication (IPC) bus, which allows any data within a processing element to be sent to any other processing element. To use the IPC, data in the processing element is loaded into the IPC bus register for that processing element and an IPC bus transmit command is executed to shift loaded data either left or right along the bus. Data at an end of the bus is either looped around to a processing element at the other end of the bus or lost, and a constant value may be provided at the other end of the bus. The input shift register and the output registers are clocked separately from program execution and as such function independently.

The Princeton Engine is well suited to the task of image processing. A single scan line is processed simultaneously and then another scan line follows. Unfortunately, in optimising the Princeton Engine for image processing, much flexibility was lost. For example, dividing a scan line into segments of a half scan line or a quarter scan line instead of a complete scan line is not easily implemented. Further, processing must occur at predetermined rates requiring completion of processing for each scan line in a predetermined time. It would be advantageous to achieve similar image processing efficiency without limiting processor flexibility.

Unfortunately, none of the above described architectures provides a simple, efficient communication architecture for use in a SIMD processor and for providing communication to, from and between processing elements in a flexible and practical fashion. Also, none of the above described architectures provides a simple and efficient design for implementing conditional instruction sequences in a SIMD processor without significantly reducing performance.

Object of the Invention

In an attempt to overcome some of these and other limitations of the prior art, it is an object of this invention to provide an efficient SIMD architecture for simultaneous execution of multiple-input multiple-output non-linear operations.

Moreover, yet another objective of the present invention is to provide an efficient SIMD architecture for linear and non-linear hybrid processing.

Summary of the Invention

In accordance with the invention, there is provided a single instruction multiple data processor architecture comprising:
a plurality of processing elements each processing element including a functional unit for performing data processing and a memory;
a shift register for performing at least one of providing data to the memories, receiving data from the memories, providing data to the functional units, and receiving data from the functional units and for supporting multiplexed data communication between processing elements; and,
a controller for controlling the shift register to provide simultaneous data communication and data processing.

In an embodiment, the shift register is for providing data to the memories, receiving data from the memories, providing data to the functional units, and receiving data from the functional units.

In a further embodiment, the functional unit comprises an arithmetic-logic-swap unit including:
a plurality of data input ports for receiving a plurality of data input values;
a number greater than one of data output ports for providing up to the number of data output values;
a processing unit for processing some of the data input values to provide a processed data value; and,
an output selection unit for switching up to the number of data values from the input data values and the processed data value for provision to at least one of the data output ports.

In accordance with the invention, there is also provided a single instruction multiple data processor architecture comprising:
a plurality of processing elements each processing element including a functional unit for performing data processing and a plurality of memory addresses for storing data values;
a shift register comprising a plurality of data stores, each data store can be multiplexed for receiving data values from a plurality of functional units, a plurality of memory addresses from a plurality of processing elements, and another data store within the shift register, the data store in communication with and for providing data values to a different data store within the shift register, a plurality of memory addresses and a functional unit; and,
a controller for controlling the shift register to provide simultaneous data communication and data processing.

In accordance with another aspect of the invention, there is provided a method of executing instructions within a class of linear and non-linear instructions within a single instruction execution cycle. The method comprises the steps of:
receiving at least three data input values;
providing a first pair of the data input values to a processing unit;
processing the first pair of data input values to produce a first result;
comparing two pairs of the data input values different from the first pair of data input values to produce comparison results; and
in dependence upon a provided instruction code, the comparison results, and the first result, directing some of the data input values and the first result to at least an output port.

Brief Description of the Drawings

Exemplary embodiments of the invention will now be described in conjunction with the following drawings, in which:

Fig. 1 shows a high-level block diagram of a SIMD processor according to the invention;
Fig. 2 shows connections between one of the north and south shift-register channels and the processing elements data-path;
Fig. 3 shows a processing element architecture according to the present invention;
Fig. 4 shows a novel architecture of an arithmetic-logic-swap unit (ALSU) for use in the processing element of Fig. 3;
Fig. 5 shows a data flow diagram indicating a shift-register channel according to the invention for providing a k-pixel delayed overlapped data band from each processing element;
Fig. 6 shows a data flow diagram indicating a use of a shift-register according to the invention for providing a one-pixel delayed data band;
Fig. 7 shows a data flow diagram indicating a use of a SIMD processor according to the invention for simultaneous multi-channel communication and parallel computations;
Fig. 8 shows a data flow diagram indicating a use of a SIMD processor according to the invention for redistributing a one-pixel delayed and interleaved data band for each processing element to form limited and continuous data bands;
Fig. 9 shows a data flow diagram indicating a use of a SIMD processor according to the invention for redistributing data input values of each processing element in a cyclic manner;
Fig. 10 shows a block diagram of an arithmetic logic swap unit for use in a single non-parallel processor; and,
Fig. 11 shows a block diagram of an arithmetic logic swap unit for use in a single non-parallel processor or in a parallel processor, the arithmetic logic swap unit having more than three data input ports.

Detailed Description of the Invention

Referring to Fig. 1, a high-level block diagram of a parallel processor having a SIMD architecture according to the present invention is shown. The processor 100 comprises a controller 120 and a plurality of processing elements 103 coupled to a plurality of communication channels 104, 110, and 111. The controller 120 controls operations of the plurality of processing elements 103 and multi-channel communications through a set of control signals 130. The controller comprises a modulo counter 170 for address generation. Generated addresses provided by the modulo counter 170 are multiplexed into the control signals 130 and provided to the processing elements 103. Operation of the modulo counter is based on four parameters: max, min, stride, and index. These values are stored in registers or, alternatively, supplied from micro-instructions. The output of the modulo counter is

    index = (index + stride) mod (max - min)
    output = index + min

where a mod b = a - b * [a / b], and [a / b] is an integer part of a/b.
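The counter update can be sketched in a few lines of Python. This is a minimal illustration rather than the controller's actual logic: the class name `ModuloCounter`, the helper `trunc_mod`, and the reading of "integer part" as truncation toward zero are assumptions.

```python
def trunc_mod(a, b):
    """a mod b = a - b*[a/b], with [a/b] taken as the integer part (truncated toward zero)."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q


class ModuloCounter:
    """Hypothetical model of an address generator driven by max, min, stride and index."""

    def __init__(self, minimum, maximum, stride, index=0):
        self.minimum = minimum
        self.maximum = maximum
        self.stride = stride
        self.index = index

    def step(self):
        # index = (index + stride) mod (max - min); output = index + min
        self.index = trunc_mod(self.index + self.stride, self.maximum - self.minimum)
        return self.index + self.minimum


if __name__ == "__main__":
    counter = ModuloCounter(minimum=16, maximum=24, stride=3)
    print([counter.step() for _ in range(4)])
```

With min = 16, max = 24 and stride = 3, the generated addresses stay inside the eight-entry window that starts at address 16, wrapping whenever the running index reaches 8.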

Modulo counters used in controllers are generally known and modification of a controller for use with the invention is straightforward for one of skill in the art with an understanding of the architecture described hereinbelow.

The communication channels include a north shift-register channel 110, a south shift-register channel 111, and an accumulate-chain channel 104. The north shift-register channel comprises a plurality of shift-registers 101a-d coupled together in a predetermined fashion. In Fig. 1 this is a linear shift-register channel with a further connection from an end thereof to the other end through multiplexers 105a and 105b. The use of multiplexers 105a and 105b allows the shift-register channel to act as a linear shift-register channel or as a circular shift-register channel. The south shift-register channel comprises a plurality of shift-registers 102a-d coupled together in a predetermined fashion. In Fig. 1 this is a linear shift-register channel with a further connection from an end thereof to the other end through multiplexers 105c and 105d. The use of multiplexers 105c and 105d allows the shift-register channel to act as a linear shift-register channel or as a circular shift-register channel. The accumulate-chain channel connects the processing elements 103 through the data-path indicated as connections 104a-e. The accumulate-chain is discussed below with reference to Fig. 3.
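The difference between the linear and circular modes can be shown with a small Python model of one clock of a shift-register channel. The function name `shift_once`, the shift direction, and the `circular` flag standing in for the end multiplexers are assumptions for illustration; the actual multiplexer wiring is that of Fig. 1.

```python
def shift_once(stages, circular, data_in=0):
    """Advance a shift-register channel by one position.

    stages   : current contents, one entry per shift-register stage
    circular : True  -> the value leaving the last stage re-enters the first stage
               False -> an external value (data_in) enters the first stage
    Returns (new_stages, data_out).
    """
    data_out = stages[-1]
    first = data_out if circular else data_in
    return [first] + stages[:-1], data_out


if __name__ == "__main__":
    print(shift_once([1, 2, 3, 4], circular=True))
    print(shift_once([1, 2, 3, 4], circular=False, data_in=9))
```

In circular mode no data leaves the channel, so repeated shifts rotate a data band among the processing elements; in linear mode each shift admits one new external value and emits one old value.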

Connections from shift-register channel outputs to inputs of a same shift-register channel permit data rotation in the shift-register channels. A connection between the output of the north shift-register channel 110 and the input of the south shift-register channel 111 allows inter channel communications. This special connection permits redistribution of an interleaved data band provided to each processing element into a continuous data band. Data organisation and re-organisation using the shift-register channels 110 and 111 is explained in detail below.

Operations of the processing elements 103 and the communication channels are controlled by a single controller 120, which retrieves instructions from the instruction memory 150, decodes the instructions, and provides operation commands (opcodes) to the processing elements 103 through the command bus 130. The command bus 130 communicates opcodes and necessary source/destination operands. These source/destination operands are constants, which are stored in global constant memory 160 and are broadcast through the command bus 130, in a fashion that provides for at least one of direct and indirect addressing. For indirect addressing, the addresses are generated by a modulo counter 170 in the controller 120. All addresses for direct and indirect addressing are broadcast through the command bus 130.

In addition to the commands provided to the processing elements 103 and the communication channels 104, 110, and 111, the controller 120 also generates two independent external addresses which are used to control external memory access via external address buses 140a and 140b. Depending on an application, the external memory data buses are connected with at least one of data ports 113a-b, 114a-b and 104a, 104e, in the communication channels. The connections between the processing elements 103 and the north shift-register channel 110 are symmetric to those connections between the processing elements 103 and the south shift-register channel 111.

Referring to Fig. 2, connections between one of the north and south shift-register channels and the processing elements data-path 104 are shown. Corresponding circuit elements are shown in a dashed box. According to an embodiment of the present invention, processing elements 103 each contain two local memories 210 and 211, which are one of single-ported and dual-ported. With dual-ported memory, read and write operations are performed simultaneously, while with single-ported memory only one of a read operation and a write operation is performed at a particular time. However, the use of two local memories permits simultaneous data transfer and computation even with single-ported memories.

Data values latched by the data-in shift-registers 201 are stored in a corresponding memory 210 or 211, through connections therebetween and multiplexors 215. The data stored in the memory 210 or 211 is capable of being sent directly to a shift-register 201 of a processing element to the right (right neighbour), through corresponding multiplexors 208 and 202. The shift-registers 201 are capable of use as latches of data received by way of the multiplexor 202. Also, the shift-registers 201 are sources of operands provided from corresponding multiplexor 215. Of course, the shift-registers 201 are capable of shifting data to a corresponding shift-register 201 of a right neighbour.

Referring to Fig. 3, an architecture of processing elements 103 according to the present invention is shown. Each processing element 103 is shown as an identical processing element but this need not be the case. The processing elements 103 each comprise an arithmetic-logic-swap unit (ALSU) 21, having 3 inputs and 3 outputs; a multiplier-adder unit 19; a barrel-shifter/rotator 20; an accumulator 14; two dual-ported registers 10a-b; two memories 11a and 11b, which are of one of the single-ported configuration and the dual-ported configuration; a source selection matrix unit 15; a destination selection matrix unit 24; four (4) source buses 17; and four (4) destination buses 23. The north and south shift-register channels and a global bus are connected to the source selection unit through connections 12a, 12b and 13. The destination buses are connected to the north and south shift-register channels through connections 12c and 12d. A sign-extended multiplexor 18c controls the selection of addends of the multiplier-adder unit 19. The data-path employs a four-stage pipeline for fast execution.

The accumulator 14 is in a final stage of the pipeline in the processing element 103. This permits accumulation of a result from any operations of the ALSU 21, the barrel-shifter 20, and the multiplier-adder unit 19. Such a configuration results in a SIMD processor element having enhanced flexibility and power.

Some typical single instruction operations implemented using a processing element according to the present invention include add-accumulate, subtract-accumulate, absolute-accumulate, square-accumulate, multiply-add-accumulate, shift-accumulate, medium-accumulate, maximum-accumulate, and minimum-accumulate. Many other combinations with an accumulate operation are possible. The above list is intended as exemplary and not as an exhaustive list. From a careful review of the diagram of Fig. 3, it is clear that any results from the ALSU 21, the barrel-shifter 20, and the multiplier-adder unit 19 may be accumulated in the accumulator 14 as desired.

Two of the four source buses 17 are connected to a processing element adjacent and to the right (right neighbour) through the accumulate-chain 104a-e (shown in Fig. 1). The multiplier-adder unit 19 receives an addend from a processing element adjacent and to the left (left neighbour) through the accumulate-chain 104a-e (shown in Fig. 1) and the connections 18b, 19c and a sign-extended multiplexor 18c (shown in Fig. 3). The said connections and multiplexors permit the multiplier-adder unit to receive two source multiplicands from a local or global data source, while receiving an addend from the left neighbour. An example of an application for this connectivity is directly adding a partial result in the left neighbour to the result of multiplying two local or global data values. By combining the said connections with a generalised accumulator 14 in the final stage of the pipeline, typical linear signal processing tasks such as filtering, transforms and neural network simulations are implemented efficiently. Application mappings are flexible, since each processing element is capable of producing partial or full results.
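The left-neighbour addend path described above can be illustrated with a small software model. The following is a minimal sketch only, not the disclosed hardware: it assumes a transposed-form FIR filter mapping in which each processing element holds one coefficient locally, the input sample arrives on the global bus, and the addend is the left neighbour's partial result from the previous cycle. The function name `fir_accumulate_chain` and this particular mapping are illustrative assumptions.

```python
def fir_accumulate_chain(samples, coeffs):
    """Model a row of PEs computing y[n] = sum_m coeffs[m] * x[n-m].

    Assumption (not from the source): PE m holds coeffs[M-1-m] locally and
    every PE sees the same input sample on the global bus each cycle.
    """
    num_pe = len(coeffs)
    local = list(reversed(coeffs))      # coefficient stored in each PE
    partial = [0] * num_pe              # result register of each PE
    outputs = []
    for x in samples:                   # x is broadcast to all PEs
        prev = partial[:]               # values visible on the chain this cycle
        for m in range(num_pe):
            addend = prev[m - 1] if m > 0 else 0   # from the left neighbour
            partial[m] = local[m] * x + addend     # multiply-add in one step
        outputs.append(partial[-1])     # rightmost PE holds the full result
    return outputs


print(fir_accumulate_chain([1, 2, 3, 4], [1, 10, 100]))
```

In this sketch every processing element executes the same multiply-add each cycle, so no processing element is idle and no conditional execution is involved; only the rightmost element's result is a complete filter output.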

Referring to Fig. 4, the architecture of the arithmetic-logic-swap unit (ALSU) 21 in each of the processing elements 103 is shown. An ALSU 21 as shown in Fig. 4 is heretofore unknown. Such an ALSU 21 is applicable to MIMD processors, non-parallel processors, and to SIMD processors according to the invention. The ALSU 21 comprises 3 input ports 21a, 21b, and 21c and 3 output ports 21w. The ALSU further comprises a conventional 2-point ALU 1; a comparator 2; a comparator 3; a mux control unit 6; three multiplexors 7, 8, and 9 for data selection; a plurality of pipeline registers 5e, 5f, and 5g; and a plurality of output registers 5a-d. Operation of the ALSU 21 is easily described in two stages. The first stage comprises a conventional ALU 1, which implements a traditional arithmetic and logic operation on two data input values, two comparators 2 and 3, which set two compare flags on the three data input values, four output registers and three pipeline registers 5a-g. The first stage permits comparison between each pair of the three data input values provided to the ALSU 21 or comparison between two of the pairs and an arithmetic operation on the third pair. The advantages to the operational flexibility of the first stage are discussed below.

The second stage comprises three multiplexors 7, 8, and 9, three output registers 7r, 8r, and 9r, and a mux control unit 6. The mux control unit 6 uses comparison results in the form of flags provided from the first stage to select the data input values for provision to the multiplexors 7, 8, and 9 based on an opcode provided to the ALSU 21 via input port 6d. The opcode provided to the input port 6d is determined from the controller command provided in signal 130 (shown in Fig. 1). Using an ALSU 21 according to the present invention, a class of linear and non-linear functions is implemented within a same time period without a need for execution of conditional instructions. Therefore, an architecture according to the present invention allows for implementation of a parallel processor using a SIMD architecture that overcomes known disadvantages of the prior art.
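The two-stage operation can be sketched in software for the rank case. This is a behavioural illustration under stated assumptions, not the circuit of Fig. 4: the first stage is modelled as three pairwise comparison flags (the ALU standing in for one comparison), and the mux control unit is modelled as a lookup table from flags to multiplexor selections. The names `alsu_rank` and `SELECT_TABLE` and the use of `>=` for the flags are hypothetical.

```python
# Flag triple (a>=b, b>=c, a>=c) -> input positions routed to the
# (maximum, medium, minimum) outputs. Two flag triples cannot occur
# because >= is transitive, so they are absent from the table.
SELECT_TABLE = {
    (True, True, True): (0, 1, 2),     # a >= b >= c
    (True, False, True): (0, 2, 1),    # a >= c > b
    (True, False, False): (2, 0, 1),   # c > a >= b
    (False, True, True): (1, 0, 2),    # b > a >= c
    (False, True, False): (1, 2, 0),   # b >= c > a
    (False, False, False): (2, 1, 0),  # c > b > a
}


def alsu_rank(a, b, c):
    """Return (maximum, medium, minimum) of three inputs."""
    inputs = (a, b, c)
    # Stage 1: set the comparison flags on the three input pairs.
    flags = (a >= b, b >= c, a >= c)
    # Stage 2: the mux control maps flags to one selection per output port.
    select = SELECT_TABLE[flags]
    return tuple(inputs[i] for i in select)


print(alsu_rank(4, 9, 2))
```

Because the selection depends only on flags computed from the data, the same instruction can be issued to every processing element while each one routes its own inputs differently.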

Examples of some functions that are easily implemented according to the invention without a need for conditional instruction execution are summarised below.

Operation          Description

                   Conventional two-point ALU operations, such as addition, subtract, absolute, compare, AND, OR, NOT, NAND, NOR, XOR, XNOR
RANK               ranks up to three data input values, i.e. maximum, medium, minimum. In case of two data input values, the output is in the order maximum, minimum
RANKI              ranks up to three data input values according to the previous RANK operation; this operation is used to produce a corresponding index of the ranked values
SWAPCond           swaps (or doesn't swap) two data values based on a given condition code
MAX, MED, MIN      these instructions are special cases of the RANK instruction; they only provide one output value, which is the maximum, medium or minimum value of a plurality of data input values
MAXI, MEDI, MINI   these instructions are special cases of the RANKI instruction; they take data input values and produce a single output, which is the associated index of the maximum, medium or minimum value
CLIP               clips the input data according to the given upper limit and the lower limit: CLIP(V,U,L) = MED(V,U,L)
COR                implements a core function defined by COR(V,U,L) = 0, if V<U and V>L; COR(V,U,L) = V, otherwise
QTZU               this is a quantisation function defined by QTZU(V,U,L) = U, if L<V<U; QTZU(V,U,L) = V, otherwise
QTZL               this is a quantisation function defined by QTZL(V,U,L) = L, if L<V<U; QTZL(V,U,L) = V, otherwise
THRSD              implements a threshold function defined by THRSD(V,U,L) = 0, if V<U; THRSD(V,U,L) = L, otherwise

Other than conventional two-point ALU operations, the above listed instructions are non-linear operations. Heretofore, these non-linear operations were implemented using conditional instruction execution. For example, to clip a value V according to the upper limit U and lower limit L, a program using a conventional processor is

    if (V>U) V=U;
    if (V<L) V=L;
    R=V;

where R is the result of the clip operation. It can be seen that a clip operation requires several instructions including two conditional operations.
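The relation between clipping and the medium of three values can be checked with a short sketch. It assumes L is not greater than U, and the helper names `med3`, `clip_conditional` and `clip_med` are illustrative only. The medium is written here with min and max so that no data-dependent branch appears; this is one possible formulation, not necessarily how the ALSU computes it.

```python
def med3(a, b, c):
    """Medium (middle) value of three inputs, written without branches."""
    return max(min(a, b), min(max(a, b), c))


def clip_conditional(v, u, l):
    """Clip as a conventional processor would, with two conditional steps."""
    if v > u:
        v = u
    if v < l:
        v = l
    return v


def clip_med(v, u, l):
    """Clip expressed as a single medium-of-three operation (assumes l <= u)."""
    return med3(v, u, l)


# Every element of a data band is clipped by the same operation.
band = [-3, 0, 2, 5, 9]
print([clip_med(v, 5, 0) for v in band])
```

Since the medium-of-three form has no condition to evaluate, each processing element can apply it to its own local V, U and L under a single broadcast instruction.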
On a SIMD processor where the value being clipped, V, is local to each processing element and the upper limit U and the lower limit L are also local, the conditional flags for each processing element are often different. Therefore, different processing elements may require different execution instructions, which the controller cannot supply, due to constraints of the SIMD architecture. The prior art solution to this problem is to disable some processing elements according to a predefined rule. For a two-state conditional execution, about half of the computational power of a parallel processor employing a SIMD architecture is wasted. The following is a sample of a program for performing the above clip operation on a SIMD processor using an approach taught in US patent 5,361,370.

    CMP V, U
    IF (GT): MOV V, U;
    CMP V, L
    IF (LT): MOV V, L;
    MOV R, V;

For other non-linear operations, similar conditional instruction sequences are necessary. The ALSU 21 according to the invention eliminates the conditional instruction execution problem for a broad class of non-linear processing instructions by providing a class of single instruction non-linear operations. For example, the rank-order instruction implemented according to the present invention can produce the rank-order of 3 data input values within a single instruction. Of course, an ALSU according to the invention, provided with four input ports and four output ports, 6 comparators, and 4 output muxes produces rank order for four such values. To implement the same rank-order of three values on conventional processors, more than 15 instructions are often used for conditional instruction execution. Thus the present invention improves the performance for non-linear signal processing even on a single processor system. Typical applications include median and rank-order filtering, motion estimation, pattern recognition, vector quantisation, neural network simulation, and threshold decomposition. For example, to find out the maximum data value of 9 data values and an associated index using a single processing element according to the present invention, a program such as the following is used:

    A1=max(d1,d2,d3);     A1 stores the maximal value of the 3 data d1, d2, and d3
    B1=maxI(i1,i2,i3);    B1 stores the index associated with the maximal value of d1, d2, and d3
    A2=max(d4,d5,d6);     A2 stores the maximal value of the 3 data d4, d5, d6
    B2=maxI(i4,i5,i6);    B2 stores the associated index of the maximal value of d4, d5, d6
    A3=max(d7,d8,d9);     A3 stores the maximal value of the 3 data d7, d8, d9
    B3=maxI(i7,i8,i9);    B3 stores the associated index of the maximal value of d7, d8, d9
    A=max(A1,A2,A3);      A stores the final maximal value of the 9 data elements
    B=maxI(B1,B2,B3);     B stores the index associated with A, the maximal data value

As can be seen from the above code, there is no conditional instruction execution and the total program requires 8 instructions. This is a significant improvement over the 15 instructions for sorting three data values according to the prior art.
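The eight-instruction sequence can be mirrored in a small behavioural model. The sketch below assumes that the index instruction reuses the selection made by the immediately preceding maximum instruction, as the description of RANKI suggests; the class `AlsuModel`, its method names, and the 1-based index values are hypothetical. The Python conditionals stand in for the comparator flags and are not instruction-level branches.

```python
class AlsuModel:
    """Behavioural stand-in for one processing element's ALSU."""

    def __init__(self):
        self._select = 0  # selection retained from the previous max3 call

    def max3(self, a, b, c):
        """Return the maximum and remember which input was selected."""
        if a >= b and a >= c:
            self._select = 0
        elif b >= c:
            self._select = 1
        else:
            self._select = 2
        return (a, b, c)[self._select]

    def max3_index(self, i, j, k):
        """Return the index routed by the previous max3 selection."""
        return (i, j, k)[self._select]


def max_of_nine(d):
    """Follow the eight-instruction program for nine data values d[0..8]."""
    pe = AlsuModel()
    a1 = pe.max3(d[0], d[1], d[2]); b1 = pe.max3_index(1, 2, 3)
    a2 = pe.max3(d[3], d[4], d[5]); b2 = pe.max3_index(4, 5, 6)
    a3 = pe.max3(d[6], d[7], d[8]); b3 = pe.max3_index(7, 8, 9)
    a = pe.max3(a1, a2, a3)
    b = pe.max3_index(b1, b2, b3)
    return a, b


print(max_of_nine([4, 9, 2, 7, 1, 8, 3, 6, 5]))
```

Each value and index pair costs two instructions, so three groups plus the final reduction give the eight instructions counted in the text.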

The above listed non-linear instructions and many other non-linear instructions supported by the ALSU design of the present invention support efficient implementation of a broad class of applications without conditional instruction sequences; however, in some cases, conditional instruction executions are necessary. The present invention of the ALSU 21 is useful for implementing prior art multi-conditional executions in a single cycle. Many prior art multi-conditional instruction sequences executed on a processor according to the present invention are more efficient than those in the prior art such as the masking techniques disclosed in "Efficient masking techniques for large scale SIMD architectures" (third Symposium on the Frontiers of Massively Parallel Computations, IEEE Computer Society Press, Oct. 1990, pp.259-64), W.G. Nation et al. and in US patent 5,361,370.

Referring now to Fig. 5, a diagram 400 is shown illustrating a use of a shift-register channel for providing a k-pixel delayed overlapped data band with each processing element. There are two pointers 406 and 407 for the processing elements 103. One of the pointers indicates a read address while the other pointer indicates a write address in memory. The relative addresses of memories in the processing elements PEm and PEm+1 are indicated by reference numerals 402a-d, and 403a-d. The controller 120 through the command bus 130 (shown in Fig. 1) supplies these two address pointers. Two modulo counters implement the pointers 406 and 407, thus providing circular addressing. If the memories are dual-ported, the read and write operations are performed simultaneously. Otherwise, they are performed in an interleaved manner. Alternatively, when dual ported memory is used, the read and write operations are performed as desired and not necessarily simultaneously. Initially, the read modulo counter is set to 0, and the write modulo counter is set to k-1. All the processing elements store data from the shift-register 401a-b in memory at a location indicated by the pointer 407. Then, all the processing elements provide a data value from memory at a location indicated by the read pointer 406 to a shift register 401b-c of the right neighbour. The left most processing element receives a data value from the data port and provides one to the shift-register. While data processing is occurring, the two pointers 406 and 407 are updated by the two modulo counters in the controller. The modulo counters are loaded with parameters prior to initial use; this ensures proper operation. By repeating the store, forward, and modulo counter update operations, a k-pixel delayed and overlapped data band results.
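The store, forward, and pointer-update cycle can be simulated to see where the k-pixel delay comes from. This is a sketch under assumptions that the text does not state: each memory is modelled as k words, both modulo counters wrap at k, and `None` marks a location that has not yet received valid data. The function name `simulate_delay_band` is illustrative.

```python
def simulate_delay_band(stream, num_pe, k):
    """Return, per PE, the sequence of values it stores on each step."""
    memory = [[None] * k for _ in range(num_pe)]
    stage = [None] * num_pe          # shift-register stage feeding each PE
    read_ptr, write_ptr = 0, k - 1   # initial counter values from the text
    stored = [[] for _ in range(num_pe)]
    for sample in stream:
        stage[0] = sample            # leftmost stage is fed by the data port
        # Store: every PE writes its stage value at the write pointer.
        for m in range(num_pe):
            memory[m][write_ptr] = stage[m]
            stored[m].append(stage[m])
        # Forward: every PE sends the word at the read pointer to the
        # shift-register stage of its right neighbour.
        for m in range(num_pe - 1):
            stage[m + 1] = memory[m][read_ptr]
        # Update: both modulo counters advance with circular addressing.
        read_ptr = (read_ptr + 1) % k
        write_ptr = (write_ptr + 1) % k
    return stored


bands = simulate_delay_band(list(range(12)), num_pe=3, k=3)
for m, band in enumerate(bands):
    print(m, band)
```

Under these assumptions the word read on a given step was written k-1 steps earlier, and one further step passes before the neighbour stores it, so each processing element holds the stream of its left neighbour delayed by k samples.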

Referring to Fig. 6, a diagram 500 for using a shift-register to provide a one-pixel delayed data band is shown. The one-pixel delayed data band is provided for each processing element and may be an interleaved or continuous data band. There is only one write pointer 505 needed to store the data values from the shift-registers 501a-c to the memory at a location indicated by the pointer 505. If the data values in the shift-register are stored in the memory every time the shift-registers 501a-c are shifted, then the resulting data band in the memory is continuous. However, each processing element is provided a one-pixel delayed data window of the previous processing element. If the data in the shift-register 501a-c are shifted J times, and stored once, then each processing element is provided a one-pixel delayed and J-pixel interleaved data band.
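The difference between the continuous and interleaved cases can be seen in a short simulation. The sketch assumes one shift-register stage per processing element and a store after every J-th shift; the function name `interleaved_band` and the use of `None` for stages not yet filled are illustrative assumptions.

```python
def interleaved_band(stream, num_pe, j):
    """Shift the channel once per sample and store after every j-th shift."""
    stage = [None] * num_pe
    bands = [[] for _ in range(num_pe)]
    for count, sample in enumerate(stream, start=1):
        stage = [sample] + stage[:-1]      # shift right by one stage
        if count % j == 0:                 # store once per j shifts
            for m in range(num_pe):
                bands[m].append(stage[m])
    return bands


print(interleaved_band(list(range(8)), num_pe=3, j=1))  # continuous bands
print(interleaved_band(list(range(8)), num_pe=3, j=2))  # 2-pixel interleaved
```

With J equal to 1 each processing element stores every sample, one sample behind its left neighbour; with larger J the stored samples are J apart while the one-sample offset between neighbours is kept.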

Referring now to Fig. 7, a diagram 600 is shown illustrating the use of the present invention 100 for simultaneous multi-channel communication and parallel computations. The north shift-register channel 601a-c provides data input values and distributes to each processing element the data values as a k-pixel delayed and overlapped data band in memory 604a-b. The processing elements process data values from the memory 605a-b, registers 606a-b and other sources 607a-b such as from global constant memory. Results of the data processing are placed into the south shift-register channel 611a-c. Simultaneously, the results previously stored in the south shift-register channel 611a-c are shifted out. Of course, the data organisation may also be a one-pixel delayed data band such as that shown in Fig. 6. It is therefore evident that the present invention as shown in Fig. 1 provides simultaneous communications and computations. The data input values are easily organised to the chosen data organisation through one shift-register channel while the results are shifted out through another shift-register channel.

Referring now to Fig. 8, a diagram 700 showing a use of the invention to redistribute a one-pixel delayed and interleaved data band for each processing element to form limited and continuous data bands by shifting both north and south shift-register channels is shown. Data bands for each processing element are interleaved, and each processing element has a one-pixel delayed version of a data window as compared to that of the left neighbour. Due to the SIMD rule, x(n), x(n-1), ..., x(n-k+1) are stored in a same location in the k processing elements, as indicated by 701a-d, and x(n-k), x(n-k-1), ..., x(n-2k+1) are also stored in the same location in the k processing elements, as indicated by 702a-d. First, x(n), x(n-1), ..., x(n-k+1) are loaded into the north shift-registers 101a-d and x(n-k), x(n-k-1), ..., x(n-2k+1) are loaded into the south shift-registers 102a-d. The input port of the south shift-register channel is connected to the output port of the north shift-register channel through connections 112a and 113. The south shift-register channel stores the data window x(n-k), x(n-k-1), ..., x(n-2k+1). When the north and south shift-register channels are shifted right once, the south shift-register is provided the data window x(n-k+1), x(n-k), ..., x(n-2k+2).
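The effect of coupling the north output to the south input can be checked with a small model in which each sample x(i) is represented by its index i. The function name `shift_channels` is hypothetical, and the value entering the north channel on each shift is not specified in the text, so it is modelled here as `None`.

```python
def shift_channels(north, south, new_sample=None):
    """Shift both channels right once; the north output feeds the south input."""
    south = [north[-1]] + south[:-1]
    north = [new_sample] + north[:-1]   # incoming north value is an assumption
    return north, south


n, k = 10, 4
north = [n - m for m in range(k)]        # x(n), x(n-1), ..., x(n-k+1)
south = [n - k - m for m in range(k)]    # x(n-k), ..., x(n-2k+1)

north, south = shift_channels(north, south)
print(south)  # indices of x(n-k+1), x(n-k), ..., x(n-2k+2)
```

Repeating the shift moves the window one sample at a time, so the stage seen by each processing element on the south channel steps through consecutive samples.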

Every shift updates the data window on the south shift-register in the form of a continuous data band. Each processing element receives a one-pixel delayed, continuous data band, thus allowing the computation on an organised data sequence from the south shift-register channel. This data redistribution mechanism is useful for two-dimensional image processing such as two-dimensional filtering applications, where a one-pixel delayed and overlapped continuous data band for each line on each processing element is desirable. However, this may require a large memory store. Interleaving the data band reduces memory requirements significantly over a brute force approach. Each processing element computes a one-pixel delayed and overlapped, continuous data window in an organised manner. According to the present invention, the data redistribution problem is solved without violating any SIMD rule. Since an architecture according to the present invention allows simultaneous communications and computations, the processing elements are available for computation during data redistribution.

Referring now to Fig. 9, a diagram 800 showing the use of the present invention for redistributing data input values of each processing element in a cyclic manner is provided. Though simple, such a technique is very useful for applications such as block transforms (DCT/DFT etc.). Redistributed data is original data or partial results. For example, to perform a two-dimensional DCT, a row DCT is computed. These row DCT partial results are redistributed to other processing elements in a cyclic manner for computing final results. The rotate operations of the shift-register channels are performed through connections 112a and 112b for north shift-register channels and south shift-register channels, respectively.
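Cyclic redistribution can be pictured as rotating a channel so that every processing element eventually sees every other element's partial result. The sketch below is a generic illustration, not the DCT mapping itself: it simply has each processing element add up the partial results as they rotate past. The names `rotate_right` and `gather_all_partials` are illustrative assumptions.

```python
def rotate_right(channel):
    """Rotate the channel one stage; the last stage wraps to the first."""
    return [channel[-1]] + channel[:-1]


def gather_all_partials(partials):
    """Each PE accumulates every partial result as the channel rotates."""
    channel = list(partials)
    totals = list(partials)          # each PE starts with its own result
    for _ in range(len(partials) - 1):
        channel = rotate_right(channel)
        # Same accumulate operation issued to all PEs in each step.
        totals = [t + c for t, c in zip(totals, channel)]
    return totals


print(gather_all_partials([1, 2, 3, 4]))
```

For N processing elements, N-1 rotations suffice for every element to have combined all N partial results, and the same accumulate instruction is issued to all elements at every step.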

Referring to Fig. 10, an ALSU according to the invention is shown provided with three data input values. The high level block diagram presented is applicable to ALSU implementations for MIMD, SIMD or non-parallel processor implementations where support of non-linear operations is desired. Each data input value and at least a processed value are provided to the output selection unit to allow for flexible data output based on the output of the arithmetic-logic and comparison unit.

Referring to Fig. 11, a high level block diagram of an ALSU according to the invention is shown. The ALSU is provided with any number of data input values. When four data input values are provided, additional comparators are required as are larger muxes and a more complicated mux control circuit. When a fifth data value is provided to the ALSU, the ALSU requires further additional comparators and more complicated circuitry. It will be evident to those of skill in the art that as the number of data input values increases, more unused circuitry for an average instruction cycle results. For example, with four data input values, all existing two input one output value operations waste all the comparators except that within the ALU; 5 comparators are wasted. With 5 data input values, 9 wasted comparators result. For image processing applications where a balance between cost and speed is desirable, a three-input value three-output value ALSU 21 is preferable. Such an ALSU balances convenience and circuit complexity.
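The comparator counts quoted above are consistent with one comparison per pair of inputs, with the ALU serving as one of them. The following sketch reproduces the arithmetic under that assumption; the function names are illustrative.

```python
def pairwise_comparisons(num_inputs):
    """Number of distinct input pairs, one comparison per pair."""
    return num_inputs * (num_inputs - 1) // 2


def idle_comparators(num_inputs):
    """Comparisons left unused by a two-input operation that uses only the ALU."""
    return pairwise_comparisons(num_inputs) - 1


for n in (3, 4, 5):
    print(n, pairwise_comparisons(n), idle_comparators(n))
```

The count grows quadratically with the number of inputs, which is the cost side of the trade-off that favours the three-input, three-output configuration.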

An ALSU such as that disclosed herein may be used with a number of different processor architectures. These architectures include RISC, CISC, MIMD, etc. Of course, some of the advantages may be different or may not be realised in these alternative implementations.

Numerous other embodiments may be envisaged without departing from the spirit and scope of the invention.

Claims

1. A single instruction multiple data processor architecture comprising a plurality of processing elements, each processing element including a memory and a functional unit for performing data processing;
a shift register for supporting multiplexed data communication between processing elements and for performing at least one of:
a) providing data to the memories,
b) receiving data from the memories,
c) providing data to the functional units, and
d) receiving data from the functional units, and,
a controller for controlling the shift register to provide data communication and data processing simultaneously.

2. A single instruction multiple data processor architecture according to claim 1 wherein the shift register is for providing data to the memories, receiving data from the memories, providing data to the functional units, and receiving data from the functional units.

3. A single instruction multiple data processor architecture according to claim 2 wherein the processing elements each comprise at least two memories.

4. A single instruction multiple data processor architecture according to claim 3 comprising a second shift register for providing data to the memories, receiving data from the memories, providing data to the functional units, and receiving data from the functional units and for supporting multiplexed data communication between processing elements.

5. A single instruction multiple data processor architecture according to claim 4 wherein the shift registers comprise multiplexing means for supporting multi-channel communications such that each of said shift registers provides data input, data output, inter processing element communication, and a combination of data input, data output, and inter processing element communications.

6. A single instruction multiple data processor architecture according to claim 1 wherein the shift register comprises synchronisation means for synchronising the shift register operation and the processing element operations.

7. A single instruction multiple data processor architecture according to claim 1 wherein the shift register forms a data buffer for supporting stretched data streams.

8. A single instruction multiple data processor architecture according to claim 1 wherein the shift register and the processor element memory are coupled using multiplexors.

9. A single instruction multiple data processor architecture according to claim 1 wherein a functional unit of a processing element comprises means for accessing data stored in a memory of the processing element and means for accessing data stored in a memory of a second other processing element.

10. A single instruction multiple data processor architecture according to claim 1 wherein the shift register comprises a first data store having a right neighbour data store and for providing data values thereto, a last data store having a left neighbour data store for receiving data values therefrom, and a plurality of middle data stores each having an associated right neighbour data store for receiving data values therefrom and an associated left neighbour data store for providing data values thereto, each middle data store further in communication with at least a memory from the memories and at least a functional unit forming part of a processing element from the processing elements.

11. A single instruction multiple data processor architecture according to claim 10 wherein the first data store and the last data store are coupled through a multiplexer.

12. A single instruction multiple data processor architecture according to claim 1 wherein the functional unit comprises an arithmetic-logic-swap unit including:
a plurality of data input ports for receiving a plurality of data input values;
a number greater than one of data output ports for providing up to the number of data output values;
a processing unit for processing some of the data input values to provide a processed data value; and, an output selection unit for switching up to the number of data values from the input data values and the processed data value for provision to at least one of the data output ports.

13. A single instruction multiple data processor architecture according to claim 12, the arithmetic logic swap unit including:
comparison means for receiving the data input values, comparing the received data input values, and providing a first signal in dependence upon the comparison;
a selection control unit for controlling the output selection unit in dependence upon the first signal received from the comparison means, wherein at least one of the data input values and the processed data value are selectively provided at the data output ports in dependence upon an instruction code and the data input values.

14. A single instruction multiple data processor architecture according to claim 13 wherein the plurality of data input ports comprises three data input ports;
wherein the number of data output ports comprises three data output ports;
wherein the processing unit comprises a two-input one-output arithmetic logic unit for performing linear operations and logic operations;
wherein the comparison means comprises two comparators each for receiving a data input value provided to the processing unit and another data input value and each providing a first signal in dependence upon the comparison; and
wherein the selection control unit is for controlling data output values provided to the data output ports in dependence upon the first signal received from the comparison means wherein a data input value is selectively provided to any of the plurality of data output ports.

15. A single instruction multiple data processor architecture according to claim 12 wherein the plurality of data input ports comprises three data input ports;
wherein the number of data output ports comprises three data output ports; and wherein the processing unit comprises a two-input one-output arithmetic logic unit for performing linear operations and logic operations.

16. A single instruction multiple data processor architecture according to claim 4 wherein the shift registers form a multistage pipelining register.

17. A single instruction multiple data processor architecture according to claim 4 wherein the plurality of processing elements are each coupled to two other processing elements and provide an output data value to a first shift-register and to a second shift-register, and wherein a signal from the first shift-register is selectively routable to the second shift-register.

18. A single instruction multiple data processor architecture according to claim 17 wherein the first shift-register and the second shift-register are coupled to the processing elements in a symmetric fashion and wherein a processing element coupled to the first shift-register is also coupled to the second shift-register.

19. A single instruction multiple data processor architecture comprising:
a plurality of processing elements each processing element including a functional unit for performing data processing and a plurality of memory addresses for storing data values;
a shift register comprising a plurality of data stores, each data store in multiplexed communication for receiving data values from a plurality of functional units, a plurality of memory addresses from a plurality of processing elements, and another data store within the shift register, the data store in communication with and for providing data values to a different data store within the shift register, a plurality of memory addresses and a functional unit; and, a controller for controlling the shift register to provide simultaneous data communication and data processing.

20. A method of executing instructions within a class of linear and non-linear instructions within a single instruction execution cycle comprising the steps of:
receiving at least three data input values;
providing a first pair of the data input values to a processing unit;
processing the first pair of data input values to produce a first result;
comparing two pairs of the data input values different from the first pair of data input values to produce comparison results; and in dependence upon a provided instruction code, the comparison results, and the first result, directing some of the data input values and the first result to at least an output port.

21. A method of executing instructions within a class of linear and non-linear instructions within a single instruction execution cycle as defined in claim 20 wherein the step of directing some of the data input values and the first result to at least an output port comprises directing at least two data values to each of two different data output ports.

22. A method of executing instructions within a class of linear and non-linear instructions within a single instruction execution cycle as defined in claim 21 comprising the step of accumulating results of an instruction in order to provide an accumulated result.