GB2286909A

GB2286909A - Pipelined SIMD-systolic array processor.

Info

Publication number: GB2286909A
Application number: GB9413501A
Authority: GB
Inventors: Chen-Mie Wu
Original assignee: WU CHEN MIE
Current assignee: WU CHEN MIE
Priority date: 1994-02-24
Filing date: 1994-07-05
Publication date: 1995-08-30
Also published as: DE19504089A1; GB9413501D0; CN1107597A

Description

TITLE: PIPELINED SIMD-SYSTOLIC ARRAY PROCESSOR AND METHODS THEREOF

BACKGROUND OF THE INVENTION

The present invention relates to a pipelined SIMD-Systolic array processor and its methods.

In particular, the present invention combines broadcasting and systolic structures to connect multiple pipelined processing elements. It provides an array processing architecture that processes multiple data streams with a single instruction stream, together with the related computing methods. The invention can be applied to the design of parallel computers, video image processors, and digital signal processors. It handles data transfer and shifting efficiently and can be implemented on a single VLSI chip, which makes it practical.

SUMMARY OF THE INVENTION

The primary object of the present invention is to provide an efficient means of data input/output, data shifting, and data transfer, so that data can be processed faster and more efficiently.

Through efficient handling of data input/output, the present invention reduces the number of data lines and the pin count of the VLSI chip.

A secondary object of the present invention is to avoid complex control and to use memory efficiently, so that the invention can be implemented on a single VLSI chip.

Another object of the present invention is that it can be designed as a one-dimensional or two-dimensional array processor.

A further object of the present invention is that it can be implemented on a VLSI chip and installed directly in computers or televisions to perform various image processing functions. The invention is therefore practical, convenient, and compact.

To achieve the objects described above, the present invention mainly comprises registers, multiplexers, and a number of processing elements, constructed as an array processing architecture. At its front and rear input/output ports, each processing element is connected to registers and multiplexers. By cascading these registers and multiplexers, the present invention can update the input data of each processing element by shifting. Reusable data therefore need not be reloaded from the multiport memory every cycle. This saves data loading time, reduces the number of data lines, and makes the present invention easier to implement on a VLSI chip.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a schematic block diagram for the pipelined SIMD-Systolic array processing architecture of the present invention.

Fig. 2 is a schematic circuit diagram for the processing elements of the present invention.

Fig. 3 is the input/output truth table for the mode-control ROM of the processing elements of the present invention.

Fig. 4 is the first operational mode for the processing elements of the present invention.

Fig. 5 is the second operational mode for the processing elements of the present invention.

Fig. 6 is the third operational mode for the processing elements of the present invention.

Fig. 7 is the fourth operational mode for the processing elements of the present invention.

Fig. 8 is the fifth operational mode for the processing elements of the present invention.

Fig. 9 is the sixth operational mode for the processing elements of the present invention.

Fig.10 is a schematic circuit diagram of the present invention for processing matrix multiplication computation.

Fig. 11 is a cycle-based data and control signal diagram of the present invention for loading constant data into the processing elements during processing matrix multiplication computation.

Fig. 12A & 12B are cycle-based data and control signal diagrams of the present invention for processing matrix computation.

Fig. 13 is a schematic circuit diagram of the present invention for processing finite-impulse-response filtering computation.

Fig.14 is a cycle-based data and control signal diagram of the present invention for processing finite-impulse-response filtering computation.

Fig.15 is a schematic circuit diagram of the present invention for processing infinite-impulse-response filtering computation.

Fig.16 is a cycle-based data and control signal diagram of the present invention for processing infinite-impulse-response filtering computation.

Fig.17 is a schematic circuit diagram of the present invention for processing edge-detection and smoothing computation.

Fig.18A, 18B & 19 represent cycle-based data and control signal diagrams of the present invention for processing edge-detection and smoothing computation.

Fig.20 is a schematic circuit diagram of the present invention for processing two-dimensional discrete cosine transform.

Fig.21 is a cycle-based data signal diagram of the present invention for loading constant data into the processing elements during processing two-dimensional discrete cosine transform.

Fig.22 & 23 represent a cycle-based data and control signal diagram of the present invention for processing the two-dimensional discrete cosine transform.

Fig. 24 is a schematic circuit diagram for two-dimensional array processing architecture of the present invention.

Fig.25 represents an implementation of two-dimensional array processing architecture of the present invention.

Fig.26 is a cycle-based data and control signal diagram of the invention for loading constant data into the elements of the two-dimensional array architecture shown as Fig. 25 for processing the two-dimensional discrete cosine transform.

Fig. 27 & 28 represent cycle-based data and control signal diagrams of the present invention for processing the two-dimensional discrete cosine transform by the two-dimensional array architecture shown as Fig. 25.

Fig.29 is a schematic circuit diagram for two-dimensional array processing architecture of the present invention for processing image template matching and motion estimation.

Fig.30 represents an implementation of two-dimensional array processing architecture of the present invention for processing image template matching and motion estimation.

Fig.31A, 31B & 32 represent cycle-based data and control signal diagrams of the present invention for processing image template matching and motion estimation by the two-dimensional array architecture shown as Fig. 30.

Fig.33 shows that the array processing architecture of the present invention can be cascaded to form stage-pipelined architectures.

Fig.34 shows how the array processing architectures of the present invention are cascaded to form a stage-pipelined architecture to compute 1008-point discrete Fourier transform.

Fig.35 shows how the array processing architectures of the present invention can be combined with systolic architectures.

Fig. 36 shows how the array processing architectures of the present invention can be applied to the implementation of image compression systems.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

As shown in Fig. 1, the present invention mainly comprises a number of processing elements PE1-PEn, which are constructed as an array processing architecture; a broadcasting register rb; shift register arrays rs11-rs1n, rs21-rs2n, and ro1-ron; multiplexers Mu11-Mu1n, Mu21-Mu2n, Mb, Mo1-Mon, and Mob; a multiport memory M; and a controller C. At the input ports, the processing elements PE1-PEn are connected to the registers rs11-rs1n, rs21-rs2n, and rb through the multiplexers Mu11-Mu1n, Mu21-Mu2n, and Mb.

At the output ports, the processing elements PE1-PEn are connected to the registers ro1-ron through the multiplexers Mo1-Mon and Mob. Moreover, the multiport memory M is connected to the registers rs21, rs11, rb, and ro1. All components of the present invention are controlled by the controller C. The control signals sent out from the controller C are as follows:

Control signal 1: the shift/load control signal for the shift register array rs21-rs2n.

Control signal 2: the clear control signal for the shift register array rs21-rs2n.

Control signal 3: the shift/load control signal for the shift register array rs11-rs1n.

Control signal 4: the clear control signal for the shift register array rs11-rs1n.

Control signal 5: the data-select control signal for multiplexers Mu11-Mu1n.

Control signal 6: the data-select control signal for multiplexers Mu21-Mu2n.

Control signal 7: the data-select control signal for the multiplexer Mb to select broadcasting data.

Control signal 8: the load control signal for the broadcasting register rb.

Control signal 9: the function control signals for the processing elements PE1-PEn.

Control signal 10: the reset control signal for the processing elements PE1-PEn.

Control signal 11: the shift/load control signal for the shift register array ro1-ron.

Control signal 12: the data-select control signal for the multiplexers Mo1-Mon.

Control signal 13: the data-select control signal for the multiplexer Mob.

Control signal 14: control signals for the multiport memory, which include addresses, Read/Write, Enable, etc.

Data and Control signal 15: data and control signals from an external processor to the multiport memory.

Data signal 16: data signals to other external functional units.

Control signal 17: control signals to other external functional units.

In the data processing operations of the present invention, input data are transferred to the processing elements PE1-PEn for processing under the control of control signals 1-8. The action of these control signals is described in the following.

If control signal 2 is logic one, the contents of registers rs21-rs2n are cleared to logic zero. If control signal 1 is logic one, register rs2n is loaded with the content of register rs2(n-1), where n>1, and register rs21 is loaded with the value ms2, which is read from the multiport memory M. If control signal 4 is logic one, the contents of registers rs11-rs1n are cleared to logic zero. If control signal 3 is logic one, register rs1n is loaded with the value of is(n-1), where n>1, and register rs11 is loaded with the value ms1, which is read from the multiport memory M. Multiplexers Mu11-Mu1n are controlled by control signal 5, and multiplexers Mu21-Mu2n are controlled by control signal 6. These multiplexers generate isn from rs2n, rs1n, and Oin in the following way.

If control signal 6 is logic zero, isn is equal to the content of rs2n; if control signal 6 is logic one and control signal 5 is logic one, isn is equal to the content of rs1n; if control signal 6 is logic one and control signal 5 is logic zero, isn is equal to the content of Oin.

Additionally, control signal 8 controls the loading of the broadcasting register rb with the value mb, which is read from the multiport memory M. If control signal 8 is logic one, register rb is loaded with mb. Control signal 7 controls the multiplexer Mb, which generates the broadcasting data for the processing elements PE1-PEn from rb and Ob, where Ob is the broadcasting output data from the processing elements PE1-PEn.

If control signal 7 is logic one, the broadcasting data signal ib is equal to the content of register rb; if control signal 7 is logic zero, the broadcasting data signal ib is equal to Ob. The output of the present invention is controlled by control signals 11-13, in a manner similar to the input control described above. If control signal 11 is logic one, each register ron, where n>1, is loaded with the data from multiplexer Mon, and register ro1 is loaded with the data from Mob and Mo1. If control signals 12 and 13 are both logic one, register ron is loaded with ro(n+1); if control signal 12 is logic zero and control signal 13 is logic one, register ron is loaded with On; if control signal 13 is logic zero, register ro1 is loaded with Ob.

Finally, control signal 14 is for the control of multiport memory M to read and write data.
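The input-side behaviour of control signals 1-7 described above can be summarised in a small software model. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that a clear signal (2 or 4) takes priority over a shift/load signal (1 or 3) when both are asserted, which the description does not specify.

```python
def shift_inputs(rs2, rs1, is_values, ms2, ms1, c1, c2, c3, c4):
    """Return the next contents of the rs2 and rs1 shift register arrays.

    rs2, rs1  : current register contents; index 0 holds rs21 / rs11
    is_values : current PE input values is1..isn (rs1n loads from is(n-1))
    ms2, ms1  : words read from the multiport memory M
    c1..c4    : control signals 1-4, each 0 or 1
    Assumption: clear (signals 2, 4) overrides shift/load (signals 1, 3).
    """
    if c2:
        new_rs2 = [0] * len(rs2)
    elif c1:
        new_rs2 = [ms2] + list(rs2[:-1])
    else:
        new_rs2 = list(rs2)

    if c4:
        new_rs1 = [0] * len(rs1)
    elif c3:
        new_rs1 = [ms1] + list(is_values[:-1])
    else:
        new_rs1 = list(rs1)
    return new_rs2, new_rs1


def select_input(c5, c6, rs2_n, rs1_n, oi_n):
    """Multiplexers Mu1n and Mu2n: choose the PE input isn."""
    if c6 == 0:
        return rs2_n
    return rs1_n if c5 == 1 else oi_n


def select_broadcast(c7, rb, ob):
    """Multiplexer Mb: choose the broadcasting data signal ib."""
    return rb if c7 == 1 else ob
```

The model treats one call to `shift_inputs` as one clock cycle, so a caller would alternate between computing the `is` values with `select_input` and advancing the registers.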

As shown in Fig. 2, the pipelined processing element PE of the present invention comprises a first-in first-out memory 100, a constant register file 101, multiplexers 102, 103, 108, and 114, registers 106, 107, and 110, a multiplier 104, an absolute-difference unit 105, an adder 109, a data register file 113, a tristate buffer 111, and a decoder 112. Control signal 9 from the controller C provides the function control of the processing element and can be further divided into the following subgroups: first-in first-out memory control 91, operational mode control 92, register-load control 93, adder control 94, identification control 95, constant register file control 96, and data register file control 97.

For operational mode control, a read-only memory 921 generates the control signals C0-C7 from the mode control 92.

As shown in Fig. 3, there are six operational modes for running the processing element.

Referring to Fig. 2, C0 and C1 control the multiplexer 102; C2, C3, and C4 control the multiplexer 103; C5 and C6 control the multiplexer 108; and C7 controls the multiplexer 114. Thus, by using mode control 92, the processing element can change its operational mode. By controlling the internal data flow paths in this way, each processing element of the present invention has six operational modes. Figs. 4, 5, 6, 7, 8, and 9 show the schematic block diagrams for each operational mode respectively. With these operational modes, the array processing architecture of the present invention can handle various operations more efficiently.
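The grouping of the ROM outputs into multiplexer select fields can be illustrated with a short Python sketch. The actual ROM contents are given only in Fig. 3 and are not reproduced here; the function name and the bit ordering (bit i of the word is Ci) are assumptions made for illustration.

```python
def split_mode_word(word):
    """Split an 8-bit mode word into the four multiplexer select fields.

    Assumption: bit i of `word` is Ci, so C0 is the least significant bit.
    """
    bits = [(word >> i) & 1 for i in range(8)]
    return {
        "mux102": (bits[0], bits[1]),           # C0, C1
        "mux103": (bits[2], bits[3], bits[4]),  # C2, C3, C4
        "mux108": (bits[5], bits[6]),           # C5, C6
        "mux114": (bits[7],),                   # C7
    }
```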

As for the other control signals, their functions are explained as follows:

911: the read control signal for the first-in first-out memory 100;

912: the write control signal for the first-in first-out memory 100;

913: the reset control signal for the first-in first-out memory 100;

931: the load control signal for the register 106;

932: the load control signal for the register 107;

933: the load control signal for the register 110;

94: the function control signal for the adder 109;

95: the identification control for the processing element and the input of the decoder 112;

951: the switch control of the tristate buffer 111;

961: the read control signal for the constant register file 101;

962: addresses for read operation for the constant register file 101;

963: the write control signal for the constant register file 101;

964: addresses for write operation for the constant register file 101;

971: the read control signal for the data register file 113;

972: addresses for read operation for the data register file 113;

973: the write control signal for the data register file 113;

974: addresses for write operation for the data register file 113.

As shown in Fig. 10, the array processing architecture is the embodiment of the present invention for processing the matrix computation. For explanation, only two processing elements are included. During the matrix computation, the processing elements of the present invention are all in the first operational mode, shown as Fig. 4, under the control of controller C. Also, control signals 5, 6, 7, and 13 are all logic one. Thus, multiplexers Mu11-Mu1n, Mu21-Mu2n, Mb, and Mob are in the data transferring state shown in Fig. 10. The following matrix computation is used as an example to explain how the present invention processes the matrix computation.

\[
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix}
\begin{bmatrix}
x_{00} & x_{01} \\
x_{10} & x_{11} \\
x_{20} & x_{21} \\
x_{30} & x_{31}
\end{bmatrix}
=
\begin{bmatrix}
y_{00} & y_{01} \\
y_{10} & y_{11} \\
y_{20} & y_{21} \\
y_{30} & y_{31}
\end{bmatrix}
\]

In order to process the matrix computation shown above, the present invention first loads the processing element PE1 with constant data a00, a01, a02, a03, a20, a21, a22, a23 and loads the processing element PE2 with constant data a10, a11, a12, a13, a30, a31, a32, a33. Referring to Fig. 11, the constant data are loaded into the processing elements through registers rs11 and rs12, and the loading operation is controlled by control signals 3, 963, and 964. Control signal 3 is always logic one, so registers rs11 and rs12 can shift and load data from the multiport memory M to the processing elements. In the first cycle, data a10 is loaded into register rs11. In the next cycle, data a00 is loaded into register rs11 and data a10 propagates to register rs12. Then, when data a11 arrives, data a00 and a10, which are now stored in registers rs11 and rs12 respectively, are transferred into processing elements PE1 and PE2 respectively. At this time, the write control signal 963 for the constant register file 101 is logic one. Continuing in this way, the processing element PE1 is eventually loaded with data a00, a01, a02, a03, a20, a21, a22, a23, and the processing element PE2 is loaded with data a10, a11, a12, a13, a30, a31, a32, a33. As to the processing of the matrix computation, Fig. 12 shows the internal operation of the processing elements PE1, PE2 and the broadcasting register rb cycle by cycle during the computation.
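The constant-loading scheme described above (row r of the coefficient matrix goes to processing element r mod 2, with the value for the farthest element sent first) can be sketched in Python. The function names are hypothetical, and the stream order beyond the first three values a10, a00, a11 is an extrapolation of "continuing in this way"; the sketch also assumes that the number of rows is a multiple of the number of processing elements.

```python
def distribute_rows(matrix, num_pes):
    """Return, per PE, the constants it ends up holding (row r -> PE r mod num_pes)."""
    files = [[] for _ in range(num_pes)]
    for r, row in enumerate(matrix):
        files[r % num_pes].extend(row)
    return files


def load_stream(matrix, num_pes):
    """Order in which constants leave memory M for the rs1 shift chain.

    The value destined for the last PE is sent first so that it has shifted
    to the far end of the chain when the group is written.
    Assumption: len(matrix) is a multiple of num_pes.
    """
    stream = []
    for base in range(0, len(matrix), num_pes):
        for col in range(len(matrix[0])):
            for pe in reversed(range(num_pes)):
                stream.append(matrix[base + pe][col])
    return stream
```

For the 4x4 example with two processing elements, `distribute_rows` gives PE1 rows 0 and 2 and PE2 rows 1 and 3, matching the text.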

According to the matrix computation shown above, the computational results are as follows:

\[
\begin{aligned}
y_{00} &= a_{00} x_{00} + a_{01} x_{10} + a_{02} x_{20} + a_{03} x_{30} \\
y_{10} &= a_{10} x_{00} + a_{11} x_{10} + a_{12} x_{20} + a_{13} x_{30} \\
y_{20} &= a_{20} x_{00} + a_{21} x_{10} + a_{22} x_{20} + a_{23} x_{30} \\
y_{30} &= a_{30} x_{00} + a_{31} x_{10} + a_{32} x_{20} + a_{33} x_{30} \\
y_{01} &= a_{00} x_{01} + a_{01} x_{11} + a_{02} x_{21} + a_{03} x_{31} \\
y_{11} &= a_{10} x_{01} + a_{11} x_{11} + a_{12} x_{21} + a_{13} x_{31} \\
y_{21} &= a_{20} x_{01} + a_{21} x_{11} + a_{22} x_{21} + a_{23} x_{31} \\
y_{31} &= a_{30} x_{01} + a_{31} x_{11} + a_{32} x_{21} + a_{33} x_{31}
\end{aligned}
\]

The data [aij] have been preloaded into the processing elements PE1, PE2. Therefore, during the matrix computation, data x00 is first transferred into register rb from the memory M. Meanwhile, data a00 and a10 are read from the constant register file 101 in the processing elements PE1 and PE2 respectively.

Therefore, through the operation of multiplier 104, the processing elements PE1 and PE2 load register 106 with a00 x00 and a10 x00 respectively. Then, in the next cycle, the output of adder 109 of PE1, PE2 is equal to a00 x00, a10 x00 respectively. At this time, the adder control signal 94 is logic one. Also, the output of the multiplier 104 of PE1, PE2 is equal to a01 x10 and a11 x10 respectively. Then, in the next cycle, the contents of registers 106, 110 of PE1, PE2 are a01 x10, a00 x00 and a11 x10, a10 x00 respectively. Continuing in this way, the output of adder 109 of PE1, PE2 eventually equals y00, y10. Meanwhile, control signal 12 is logic zero in order to load y00, y10 into registers ro1, ro2 respectively. Then, in the following cycles, while y20 and y30 are being computed, y00 and y10 are shifted into the memory M. Referring to Fig. 12, the present invention processes the rest of the matrix computation in a similar way.
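The broadcast-and-accumulate scheme described above can be mimicked in software. The following Python sketch is a functional model only, not a cycle-accurate one: it ignores the pipeline registers 106 and 110 and the output shift registers, and the function name is hypothetical. Each processing element holds its own rows of [aij], every x value is broadcast to all elements, and each element accumulates one y value at a time.

```python
def broadcast_matmul(a, x, num_pes):
    """Compute y = a * x the way the array does: x is broadcast, a is local.

    Results are produced in groups of num_pes rows per output column,
    e.g. (y00, y10), then (y20, y30), then (y01, y11), ...
    """
    rows, inner, cols = len(a), len(x), len(x[0])
    y = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        for base in range(0, rows, num_pes):
            group = range(base, min(base + num_pes, rows))
            acc = {r: 0 for r in group}          # one accumulator per PE
            for k in range(inner):
                ib = x[k][j]                     # value broadcast through rb
                for r in group:                  # parallel in hardware
                    acc[r] += a[r][k] * ib
            for r in group:
                y[r][j] = acc[r]
    return y
```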

As shown in Fig. 13, the array processing architecture is the embodiment of the present invention for processing the finite-impulse-response filtering computation. Under the control of controller C, the processing elements run in the second operational mode shown as Fig. 5. Meanwhile, control signals 5, 7, and 13 are logic one and control the multiplexers Mu11-Mu1n, Mb, and Mob. As an example, Fig. 13 shows the resulting architecture with two processing elements PE1, PE2. The computation

\[
y_i = a_0 x_i + a_1 x_{i-1} + a_2 x_{i-2} + a_3 x_{i-3}
\]

is used for explanation. Accordingly, the computational results are as follows:

\[
\begin{aligned}
y_0 &= a_0 x_0 + a_1 x_{-1} + a_2 x_{-2} + a_3 x_{-3} \\
y_1 &= a_0 x_1 + a_1 x_0 + a_2 x_{-1} + a_3 x_{-2} \\
y_2 &= a_0 x_2 + a_1 x_1 + a_2 x_0 + a_3 x_{-1} \\
y_3 &= a_0 x_3 + a_1 x_2 + a_2 x_1 + a_3 x_0 \\
y_4 &= a_0 x_4 + a_1 x_3 + a_2 x_2 + a_3 x_1 \\
y_5 &= a_0 x_5 + a_1 x_4 + a_2 x_3 + a_3 x_2
\end{aligned}
\]

and so forth. Referring to Fig. 14, while computing yi, the present invention uses registers rs21, rs22, rs11, rs12 and multiplexers Mu21, Mu22, which are controlled by control signal 6, to transfer input data {xm} to the processing elements PE1, PE2. Meanwhile, constant data {an} are broadcast through register rb to the processing elements PE1, PE2.

Also, the computational results yi are transferred to the memory M through registers ro1, ro2 and multiplexers Mo1, Mo2, which are controlled by control signal 12.

As to data transferring and processing, it would be explained as follows:

Initially, data x1 is loaded from the multiport memory M into register rs21. Then, in the next cycle, register rs21 is loaded with data x0 and register rs22 is loaded with data x1. At this time, control signal 6, which controls multiplexers Mu21, Mu22, is logic zero. Therefore, is1 and is2, which are the input ports of processing elements PE1 and PE2 respectively, have the values x0 and x1 respectively. Also, register rb is loaded with data a0, so the output of multiplier 104 is a0x0 for PE1 and a0x1 for PE2. One cycle later, control signal 6 changes to logic one, and input data xn are transferred to PE1, PE2 through rs11, rs12. Continuing in this way, the output of adder 109 becomes y0 for PE1 and y1 for PE2. At this time, control signal 12 is set to logic zero.

One cycle later, y0 and y1 are loaded into ro1 and ro2 respectively. Then, control signal 12 is set to logic one, and y0 and y1 are transferred to the multiport memory M or another functional unit through registers ro1, ro2. In this way the computational results for finite-impulse-response filtering are generated.
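As a functional illustration of the filtering scheme above, the following Python sketch assigns consecutive outputs to different processing elements and broadcasts one coefficient per step. It is not cycle-accurate, the function name is hypothetical, and it assumes that samples before x0 are zero, which the description leaves open.

```python
def fir_on_array(coeffs, samples, num_pes):
    """FIR filter y_i = sum_k a_k * x_(i-k), computed num_pes outputs at a time.

    coeffs  : [a0, a1, ...], broadcast one per step through register rb
    samples : [x0, x1, ...]; samples with a negative index are taken as 0
              (an assumption made for this sketch)
    """
    def sample(i):
        return samples[i] if 0 <= i < len(samples) else 0

    y = [0] * len(samples)
    for base in range(0, len(samples), num_pes):
        group = range(base, min(base + num_pes, len(samples)))
        acc = {i: 0 for i in group}              # one accumulator per PE
        for k, a_k in enumerate(coeffs):         # a_k is broadcast to all PEs
            for i in group:                      # parallel in hardware
                acc[i] += a_k * sample(i - k)    # each PE sees its own shifted x
        for i in group:
            y[i] = acc[i]
    return y
```

With two processing elements, the first group corresponds to PE1 computing y0 while PE2 computes y1, as in the description.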

As shown in Fig. 15, the array processing architecture is the embodiment of the present invention for processing the infinite-impulse-response filtering computation. Under the control of controller C, the processing elements run in the second operational mode shown as Fig. 5. Moreover, the data signal Ob is used for broadcasting the intermediate results to the processing elements through multiplexer Mb. Meanwhile, control signals 2, 6, 7, and 12 are used for clearing registers rs21, rs22, controlling multiplexers Mu21, Mu22, controlling multiplexer Mb, and controlling multiplexers Mo1, Mo2 respectively. Fig. 15 shows the resulting architecture with two processing elements PE1, PE2. Except for the circuits for the feedback signal Ob, the architecture shown in Fig. 15 is the same as that in Fig. 13 for the finite-impulse-response filtering computation. In the following, the computation

\[
y_i + b_1 y_{i-1} + b_2 y_{i-2} + b_3 y_{i-3} = a_0 x_i + a_1 x_{i-1} + a_2 x_{i-2} + a_3 x_{i-3}
\]

is used for explanation. The computational results are therefore as follows:

\[
\begin{aligned}
y_0 &= -b_1 y_{-1} - b_2 y_{-2} - b_3 y_{-3} + a_0 x_0 + a_1 x_{-1} + a_2 x_{-2} + a_3 x_{-3} \\
y_1 &= -b_1 y_0 - b_2 y_{-1} - b_3 y_{-2} + a_0 x_1 + a_1 x_0 + a_2 x_{-1} + a_3 x_{-2} \\
y_2 &= -b_1 y_1 - b_2 y_0 - b_3 y_{-1} + a_0 x_2 + a_1 x_1 + a_2 x_0 + a_3 x_{-1} \\
y_3 &= -b_1 y_2 - b_2 y_1 - b_3 y_0 + a_0 x_3 + a_1 x_2 + a_2 x_1 + a_3 x_0
\end{aligned}
\]

and so forth. Fig. 16 shows that the present invention uses the processing element PE1 to compute y0, y2, y4, ... and the processing element PE2 to compute y1, y3, y5, .... Data transfer and processing proceed as follows:

Initially, data x1 is loaded from the multiport memory M into register rs21. Then, in the next cycle, register rs21 is loaded with data x0 and data x1 is transferred from register rs21 to register rs22. At this time, control signal 6, which controls multiplexers Mu21, Mu22, is logic zero. Therefore, is1 and is2 have the values x0 and x1 respectively. Meanwhile, register rb holds the value a0, so the output of multiplier 104 is a0x0 for PE1 and a0x1 for PE2. In the next cycle, control signal 6 changes to logic one. Then, data xn are transferred to PE1, PE2 through rs11, rs12. During the computation, control signal 2 is set to logic one to clear registers rs21, rs22 when the data signals O1, O2 of PE1, PE2 are equal to a0x0 + a1x-1 and a0x1 + a1x0 respectively. Then, in the following cycles, data -bn are transferred to processing elements PE1, PE2 through the cooperation of registers rs21, rs22, rs11, rs12 and multiplexers Mu21, Mu22. On the other hand, ym are sent to PE1, PE2 by broadcasting. After y0 is computed, it is broadcast to PE1, PE2 to compute y1. Then, y0 and y1 are transferred to registers ro1, ro2 by setting control signal 12 to logic zero, and are shifted to the multiport memory M in the following cycles. Continuing in this way, the computational results for infinite-impulse-response filtering are generated.
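The recurrence above can be checked with a plain sequential Python sketch. It does not model the division of even and odd outputs between PE1 and PE2 or the feedback path through Ob; it only evaluates the same difference equation. The function name is hypothetical, and samples and outputs with negative indices are assumed to be zero.

```python
def iir_filter(a, b, x):
    """Evaluate y_i = sum_k a_k x_(i-k) - sum_m b_m y_(i-m).

    a : feed-forward coefficients [a0, a1, ...]
    b : feedback coefficients [b1, b2, ...] (b1 is the first entry)
    x : input samples [x0, x1, ...]; earlier samples and outputs are taken
        as zero (an assumption made for this sketch)
    """
    y = []
    for i in range(len(x)):
        acc = 0
        for k, a_k in enumerate(a):
            if i - k >= 0:
                acc += a_k * x[i - k]
        for m, b_m in enumerate(b, start=1):
            if i - m >= 0:
                acc -= b_m * y[i - m]    # previously computed outputs fed back
        y.append(acc)
    return y
```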

As shown in Fig. 17, the array processing architecture is the embodiment of the present invention for processing the computation of edge detection and smoothing. Under the control of controller C, the processing elements run in the second operational mode shown as Fig. 5. Moreover, the first-in first-out memory 100 is used as a data buffer. Fig. 17 shows the resulting architecture with four processing elements PE1, PE2, PE3, PE4. The following computation is used for explanation:

\[
\begin{aligned}
y_{30} &= x_{50} w_{20} + x_{51} w_{21} + x_{52} w_{22} + x_{40} w_{10} + x_{41} w_{11} + x_{42} w_{12} + x_{30} w_{00} + x_{31} w_{01} + x_{32} w_{02} \\
y_{20} &= x_{40} w_{20} + x_{41} w_{21} + x_{42} w_{22} + x_{30} w_{10} + x_{31} w_{11} + x_{32} w_{12} + x_{20} w_{00} + x_{21} w_{01} + x_{22} w_{02} \\
y_{10} &= x_{30} w_{20} + x_{31} w_{21} + x_{32} w_{22} + x_{20} w_{10} + x_{21} w_{11} + x_{22} w_{12} + x_{10} w_{00} + x_{11} w_{01} + x_{12} w_{02} \\
y_{00} &= x_{20} w_{20} + x_{21} w_{21} + x_{22} w_{22} + x_{10} w_{10} + x_{11} w_{11} + x_{12} w_{12} + x_{00} w_{00} + x_{01} w_{01} + x_{02} w_{02} \\
y_{31} &= x_{51} w_{20} + x_{52} w_{21} + x_{53} w_{22} + x_{41} w_{10} + x_{42} w_{11} + x_{43} w_{12} + x_{31} w_{00} + x_{32} w_{01} + x_{33} w_{02} \\
y_{21} &= x_{41} w_{20} + x_{42} w_{21} + x_{43} w_{22} + x_{31} w_{10} + x_{32} w_{11} + x_{33} w_{12} + x_{21} w_{00} + x_{22} w_{01} + x_{23} w_{02} \\
y_{11} &= x_{31} w_{20} + x_{32} w_{21} + x_{33} w_{22} + x_{21} w_{10} + x_{22} w_{11} + x_{23} w_{12} + x_{11} w_{00} + x_{12} w_{01} + x_{13} w_{02} \\
y_{01} &= x_{21} w_{20} + x_{22} w_{21} + x_{23} w_{22} + x_{11} w_{10} + x_{12} w_{11} + x_{13} w_{12} + x_{01} w_{00} + x_{02} w_{01} + x_{03} w_{02}
\end{aligned}
\]

During data processing, the processing element PE1 is used to compute y30, y31; PE2 computes y20, y21; PE3 computes y10, y11; and PE4 computes y00, y01. Referring to Fig. 18 and Fig. 19, data transfer and processing can be explained as follows:

Initially, data x30, x20, x10, x00 are loaded into registers rs21, rs22, rs23, rs24 from the multiport memory by shifting. At this time, control signal 6, which controls multiplexers Mu21, Mu22, Mu23, Mu24, is set to logic zero. Therefore, is1, is2, is3, is4 have the values x30, x20, x10, x00 respectively. Meanwhile, register rb holds the value w00, so the output of multiplier 104 is x30w00, x20w00, x10w00, x00w00 for processing elements PE1, PE2, PE3, PE4 respectively. During the following cycles, control signal 6 is set to logic one. Then x40, x50 are shifted through register rs11, and registers rs21, rs22, rs23, rs24 are used for preloading x01, x11, x21, x31. Continuing in this way, y30, y20, y10, y00 are computed by PE1, PE2, PE3, PE4. Also, while y30, y20, y10, y00 are being computed, data x31, x32 are stored in the first-in first-out memory 100 of PE1 under the control of write control signal 912. Similarly, data x21, x22, x11, x12, x01, x02 are stored in the first-in first-out memory 100 of PE2, PE3, PE4 respectively.

In this way, while y31, y21, y11, y01 are being computed, data x31, x21, x11, x01 are read from the first-in first-out memory 100 instead of registers rs21, rs22, rs23, rs24. Therefore, only data x33, x23, x13, x03 are loaded through registers rs21, rs22, rs23, rs24.

This saves considerable data loading time when y32, y22, y12, y02, y33, y23, y13, y03, etc. are also computed. While yij is being computed, constant data wkl, 0 ≤ k, l < 3, are sent to the processing elements through register rb by broadcasting. Also, yij are shifted to the multiport memory M or another functional unit through registers ro1, ro2, ro3, ro4 and multiplexers Mo1, Mo2, Mo3, Mo4 under the control of control signal 12.
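The equations of this embodiment all follow the pattern yij = sum over k, l of x(i+k)(j+l) wkl. A minimal Python sketch of that window sum is given below; it reproduces the arithmetic only and does not model the first-in first-out buffering or the shifting through rs11 and rs21-rs24. The function names are hypothetical.

```python
def window_sum(x, w, i, j):
    """Return y_ij = sum over k, l of x[i+k][j+l] * w[k][l]."""
    return sum(x[i + k][j + l] * w[k][l]
               for k in range(len(w))
               for l in range(len(w[0])))


def filter_image(x, w):
    """Apply the window w at every position where it fits inside x."""
    rows = len(x) - len(w) + 1
    cols = len(x[0]) - len(w[0]) + 1
    return [[window_sum(x, w, i, j) for j in range(cols)] for i in range(rows)]
```

In the array, each processing element evaluates `window_sum` for one value of i while the weights wkl are broadcast in turn, so neighbouring elements reuse most of the same x values.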

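As shown in Fig. 20, the array processing architecture is the embodiment of the present invention for processing the two-dimensional discrete cosine transform. Under the control of controller C, the processing elements run in the first operational mode shown as Fig. 4. Moreover, the constant register file 101, data register file 113, decoder 112, and tristate buffer 111 are also involved in this computation. The following computation is used as an example for explanation:

\[
\begin{bmatrix}
a_{00} & a_{01} & a_{02} \\
a_{10} & a_{11} & a_{12} \\
a_{20} & a_{21} & a_{22}
\end{bmatrix}
\left(
\begin{bmatrix}
a_{00} & a_{01} & a_{02} \\
a_{10} & a_{11} & a_{12} \\
a_{20} & a_{21} & a_{22}
\end{bmatrix}
\begin{bmatrix}
x_{00} & x_{01} & x_{02} \\
x_{10} & x_{11} & x_{12} \\
x_{20} & x_{21} & x_{22}
\end{bmatrix}
\right)^{T}
=
\begin{bmatrix}
z_{00} & z_{01} & z_{02} \\
z_{10} & z_{11} & z_{12} \\
z_{20} & z_{21} & z_{22}
\end{bmatrix}^{T}
\]

where T represents transposition.

This computes [zij], which is the two-dimensional discrete cosine transform of the 3x3 matrix [xij].

The first step is to compute the column transform,

\[
\begin{bmatrix}
a_{00} & a_{01} & a_{02} \\
a_{10} & a_{11} & a_{12} \\
a_{20} & a_{21} & a_{22}
\end{bmatrix}
\begin{bmatrix}
x_{00} & x_{01} & x_{02} \\
x_{10} & x_{11} & x_{12} \\
x_{20} & x_{21} & x_{22}
\end{bmatrix}
=
\begin{bmatrix}
y_{00} & y_{01} & y_{02} \\
y_{10} & y_{11} & y_{12} \\
y_{20} & y_{21} & y_{22}
\end{bmatrix}
\]

and then to compute the row transform,

\[
\begin{bmatrix}
a_{00} & a_{01} & a_{02} \\
a_{10} & a_{11} & a_{12} \\
a_{20} & a_{21} & a_{22}
\end{bmatrix}
\begin{bmatrix}
y_{00} & y_{01} & y_{02} \\
y_{10} & y_{11} & y_{12} \\
y_{20} & y_{21} & y_{22}
\end{bmatrix}^{T}
=
\begin{bmatrix}
z_{00} & z_{01} & z_{02} \\
z_{10} & z_{11} & z_{12} \\
z_{20} & z_{21} & z_{22}
\end{bmatrix}^{T}
\]

Referring to Fig. 21, Fig. 22, and Fig. 23, the loading of data, the data processing, and the operation of the control signals can be explained as follows: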

As shown in Fig. 21, first of all, data aij are loaded into the constant register file 101 in the processing elements PE1, PE2, PE3. Then, shown as Fig. 22, data xij are loaded from multiport memory M into register rb by the following sequence:

x00, x10, x20, x01, x11, x21, x02, x12, x22.

In this way, processing element PE1 computes y00, y01, y02, PE2 computes y10, y11, y12, and PE3 computes y20, y21, y22. Afterwards, by using decoder 112 to generate the control signal that controls tristate buffer 111, yij are sent back to the input ib of the processing elements through multiplexer Mb in the following sequence:

y00, y01, y02, y10, y11, y12, y20, y21, y22.

Finally, the two-dimensional discrete cosine transform would be computed.
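The two-pass scheme (a column transform followed by a row transform that reuses the same coefficient matrix) can be expressed in a few lines of Python. This is a sketch of the arithmetic only: the helper names are hypothetical, and the coefficient matrix is passed in by the caller because the values of aij are not listed in the text.

```python
def matmul(p, q):
    """Plain matrix product of two lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))]
            for i in range(len(p))]


def transpose(m):
    """Return the transpose of a list of rows."""
    return [list(col) for col in zip(*m)]


def two_pass_transform(a, x):
    """Return z with z^T = a * (a * x)^T, i.e. z = a * x * a^T."""
    y = matmul(a, x)                  # column transform: y = a x
    z_t = matmul(a, transpose(y))     # row transform: z^T = a y^T
    return transpose(z_t)
```

Feeding the y values back row by row, as in the sequence y00, y01, y02, ..., corresponds to multiplying the coefficient matrix by the transpose of [yij], which is why the same stored constants serve both passes.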

As shown in Fig. 24, the array processing architecture is the two-dimensional embodiment of the present invention. As an example, shown as Fig. 25, six processing elements PE11, PE12, PE21, PE22, PE31, PE32 are used to explain the process of computing the two-dimensional discrete cosine transform.

Referring to Fig. 26, Fig. 27, and Fig. 28, the data loading, the sequence of control signals, and the method of operation can be explained as follows. As shown in Fig. 26, data aij are first loaded into the constant register files 101 in the processing elements PE11, PE21, PE31, PE12, PE22, PE32. Then, as shown in Fig. 27, data xij are loaded from the multiport memory M into register rb in the following sequence:

x00, x10, x20, x01, x11, x21, x02, x12, x22.

In this way, processing element PE11 computes y00, y01, y02, PE21 computes y10, y11, y12, and PE31 computes y20, y21, y22. Afterwards, as shown in Fig. 28, by using decoder 112 to generate the control signal that controls tristate buffer 111, the yij computed by PE11, PE21, PE31 are sent to the input ib of the processing elements PE12, PE22, PE32 in the following sequence:

y00, y01, y02, y10, y11, y12, y20, y21, y22.

Then, processing element PE12 computes z00, z10, z20, PE22 computes z01, z11, z21, and PE32 computes z02, z12, z22. In this way, the two-dimensional array processing architecture can achieve the effect of processing the two-dimensional discrete cosine transform.

As shown in Fig. 29, the array processing architecture is a two-dimensional embodiment of the present invention, comprising n x m processing elements, for processing the operations of motion estimation and template matching. Here, P1, P2, ..., Pm represent programmable delays. As an example, shown as Fig. 30, a 3x3 processing array is used to explain the operation. Here, P1 and P2 are 3-clock-cycle delays. Moreover, the processing elements PE11, PE12, PE13, PE21, PE22, PE23, PE31, PE32, PE33 run in the sixth operational mode, which is shown as Fig. 9. For explanation, the following computation is used as an example:

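$$
\begin{aligned}
z_{20} &= |x_{20}-y_{40}| + |x_{21}-y_{41}| + |x_{22}-y_{42}| + |x_{10}-y_{30}| + |x_{11}-y_{31}| + |x_{12}-y_{32}| + |x_{00}-y_{20}| + |x_{01}-y_{21}| + |x_{02}-y_{22}|,\\
z_{10} &= |x_{20}-y_{30}| + |x_{21}-y_{31}| + |x_{22}-y_{32}| + |x_{10}-y_{20}| + |x_{11}-y_{21}| + |x_{12}-y_{22}| + |x_{00}-y_{10}| + |x_{01}-y_{11}| + |x_{02}-y_{12}|,\\
z_{00} &= |x_{20}-y_{20}| + |x_{21}-y_{21}| + |x_{22}-y_{22}| + |x_{10}-y_{10}| + |x_{11}-y_{11}| + |x_{12}-y_{12}| + |x_{00}-y_{00}| + |x_{01}-y_{01}| + |x_{02}-y_{02}|,\\
z_{21} &= |x_{20}-y_{41}| + |x_{21}-y_{42}| + |x_{22}-y_{43}| + |x_{10}-y_{31}| + |x_{11}-y_{32}| + |x_{12}-y_{33}| + |x_{00}-y_{21}| + |x_{01}-y_{22}| + |x_{02}-y_{23}|,\\
z_{11} &= |x_{20}-y_{31}| + |x_{21}-y_{32}| + |x_{22}-y_{33}| + |x_{10}-y_{21}| + |x_{11}-y_{22}| + |x_{12}-y_{23}| + |x_{00}-y_{11}| + |x_{01}-y_{12}| + |x_{02}-y_{13}|,\\
z_{01} &= |x_{20}-y_{21}| + |x_{21}-y_{22}| + |x_{22}-y_{23}| + |x_{10}-y_{11}| + |x_{11}-y_{12}| + |x_{12}-y_{13}| + |x_{00}-y_{01}| + |x_{01}-y_{02}| + |x_{02}-y_{03}|,\\
z_{22} &= |x_{20}-y_{42}| + |x_{21}-y_{43}| + |x_{22}-y_{44}| + |x_{10}-y_{32}| + |x_{11}-y_{33}| + |x_{12}-y_{34}| + |x_{00}-y_{22}| + |x_{01}-y_{23}| + |x_{02}-y_{24}|,\\
z_{12} &= |x_{20}-y_{32}| + |x_{21}-y_{33}| + |x_{22}-y_{34}| + |x_{10}-y_{22}| + |x_{11}-y_{23}| + |x_{12}-y_{24}| + |x_{00}-y_{12}| + |x_{01}-y_{13}| + |x_{02}-y_{14}|,\\
z_{02} &= |x_{20}-y_{22}| + |x_{21}-y_{23}| + |x_{22}-y_{24}| + |x_{10}-y_{12}| + |x_{11}-y_{13}| + |x_{12}-y_{14}| + |x_{00}-y_{02}| + |x_{01}-y_{03}| + |x_{02}-y_{04}|.
\end{aligned}
$$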

Referring to Fig. 31 and Fig. 32, processing element PE11 computes z20; PE21 and PE31 compute z10 and z00 respectively; PE12, PE22, PE32 compute z21, z11, z01 respectively; and PE13, PE23, PE33 compute z22, z12, z02 respectively. Overall, this array processing architecture can perform both motion estimation and template matching.
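The nine sums above all follow one pattern: zij is the sum of absolute differences between a 3×3 block x and the 3×3 window of y whose top-left corner is at row i, column j. The sketch below computes all of them in software; it illustrates the arithmetic only, not the timing of the array, and the names `sad_map` and `best_match` are hypothetical.

```python
def sad_map(x, y):
    """z[i][j] = sum over a, b of |x[a][b] - y[a + i][b + j]|."""
    t = len(x)                      # template size (3 in the example above)
    s = len(y) - t + 1              # number of offsets per direction (3 for a 5x5 search area)
    return [[sum(abs(x[a][b] - y[a + i][b + j])
                 for a in range(t) for b in range(t))
             for j in range(s)] for i in range(s)]

def best_match(z):
    """Offset (i, j) with the smallest sum of absolute differences."""
    return min(((z[i][j], i, j) for i in range(len(z)) for j in range(len(z[0]))))[1:]

# 5x5 search area with distinct values, and a template cut out at offset (1, 2).
y = [[10 * r + c for c in range(5)] for r in range(5)]
x = [[y[a + 1][b + 2] for b in range(3)] for a in range(3)]
z = sad_map(x, y)
```

In motion estimation the offset returned by `best_match` would be taken as the displacement of the block; in template matching it is the position of the best fit.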

As shown in Fig. 33, the array processing architecture is a stage-pipelined embodiment of the present invention. Such an array processing architecture comprises n pipelined SIMD-Systolic array processing architectures, which are cascaded in a pipelined manner, and is called a stage-pipelined architecture. Such an architecture can also be combined with a general-purpose processor 1001 to enhance its computational performance. As shown in Fig. 34, the computation of a 1008-point discrete Fourier transform is used as an example for explanation. A general-purpose processor 1001 is cascaded with three pipelined SIMD-Systolic array processing architectures 3000, 3001, 3002, which compute the 7-point, 9-point, and 16-point discrete Fourier transforms respectively. By using such an architecture, the 1008-point discrete Fourier transform can be computed with high computational performance.

As shown in Fig. 35, the array processing architecture is an embodiment combining the present invention with a systolic architecture which comprises multiple processing elements. Referring to Fig. 35, a group of processing elements PE1 - PEn, which form a systolic architecture 4002, is added between pipelined SIMD-Systolic array processing architectures 4000 and 4001. Such an architecture can also be combined with a general-purpose processor.

Referring to Fig. 36, the implementation of an image compression system is used as an example for explanation. Two pipelined SIMD-Systolic array processing architectures 5000, 5001, which compute the two-dimensional discrete cosine transform and the inverse discrete cosine transform respectively, are combined with a systolic architecture 5002 at one end and with a general-purpose processor 1001 at the other end. The systolic architecture 5002 comprises quantizer PE11, Zig-Zag scan processor PE21, coder PE31, dequantizer PE12, inverse Zig-Zag scan processor PE22, decoder PE32 and multiplexer Mul. All of the processing elements in the systolic architecture 5002 are cascaded systolically. Control signal 19 selects the operational mode. If control signal 19 is logic one, the data input of dequantizer PE12 comes from the output of quantizer PE11, and the whole system runs the encoding process. If control signal 19 is logic zero, the data input of dequantizer PE12 comes from the output of inverse Zig-Zag scan processor PE22, and the whole system runs the decoding process.

In this manner, the image compression function can be achieved.
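Returning to the 1008-point discrete Fourier transform of Fig. 34: the stage sizes multiply to the transform length (7 × 9 × 16 = 1008) and are pairwise coprime. The text does not state how the three stages are combined. One standard possibility, assumed here purely for illustration, is the Good-Thomas prime-factor index mapping, in which three passes of short transforms reproduce the long transform without twiddle factors between stages. The function names `dft` and `staged_dft` are hypothetical.

```python
import cmath
from itertools import product

def dft(x):
    """Direct O(n^2) discrete Fourier transform of a short sequence."""
    n = len(x)
    return [sum(x[p] * cmath.exp(-2j * cmath.pi * p * k / n) for p in range(n))
            for k in range(n)]

def staged_dft(x, factors):
    """DFT of a sequence whose length is the product of pairwise coprime factors.

    Assumed technique: Good-Thomas prime-factor mapping (requires Python 3.8+).
    """
    n_total = len(x)
    m = [n_total // f for f in factors]                 # M_i = N / N_i
    t = [pow(mi, -1, f) for mi, f in zip(m, factors)]   # inverse of M_i modulo N_i
    axes = [range(f) for f in factors]
    # Input index map: n = sum(n_i * M_i) mod N
    grid = {idx: x[sum(i * mi for i, mi in zip(idx, m)) % n_total]
            for idx in product(*axes)}
    # One stage per factor: short DFTs along that axis only
    for axis, f in enumerate(factors):
        out = {}
        for idx in grid:
            if idx[axis]:
                continue                                # visit each line once, at its first element
            line = [grid[idx[:axis] + (p,) + idx[axis + 1:]] for p in range(f)]
            for k, v in enumerate(dft(line)):
                out[idx[:axis] + (k,) + idx[axis + 1:]] = v
        grid = out
    # Output index map (Chinese remainder theorem): k = sum(k_i * M_i * t_i) mod N
    result = [0j] * n_total
    for idx, v in grid.items():
        result[sum(k * mi * ti for k, mi, ti in zip(idx, m, t)) % n_total] = v
    return result

signal = [complex((3 * i) % 7, (5 * i) % 11) for i in range(1008)]
spectrum = staged_dft(signal, [7, 9, 16])
```

Under this assumption each output costs 7 + 9 + 16 = 32 multiply-accumulate steps instead of 1008, and each stage only ever runs one short transform size, which suits a cascade of fixed-function arrays.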

As described above, the present invention relates to a pipelined SIMD-Systolic array processing architecture and its computing methods.

The present invention controls data processing, data transferring and data input/output in a concurrent manner.

Therefore, computational performance can be increased. Also, the present invention can save data lines and increase memory efficiency.

Therefore, it is possible to fabricate the present invention on a single VLSI chip. Overall, the present invention is of practical use to industry.

Claims

CLAIMS:

1. A pipelined SIMD-Systolic array processor, including: a controller; a number of processing elements constructed as an array architecture, wherein each processing element comprises a multiplier, an adder, a register, input ports and an output port, the output end of the multiplier is connected with an input end of the adder, input ends of the multiplier are connected with the input ports of the processing element for receiving input data, the output end of the adder is connected with the register of which the output end is connected to another end of the adder, the output end of the adder is connected to the output port of the processing element, and the adder and the register are controlled by said controller; a number of shift register arrays, respectively disposed at the input ports and output ports of the processing elements of said array architecture; a number of multiplexers, disposed at the transmitting ends of the shift register arrays; a multi-port memory, connected with the front ends of the shift register arrays; a set of broadcasting data lines, connected with the input ports of the processing elements of said array architecture, for receiving the feedback data from output of the processing elements of said array architecture and data from the multi-port memory; wherein the registers, multiplexers and multi-port memory are all controlled by said controller.

2. A pipelined SIMD-Systolic array processor, including:

a controller; a number of processing elements, constructed as an array architecture, wherein each processing element comprises an adder, a register, a multiplier, input ports and an output port, an input end of the adder and an input end of the multiplier are connected with the input port of the processing element for receiving input data, the register is connected between the output end of the adder and another input end of the multiplier, the output of the register is connected to another input end of the adder, the output of the multiplier is connected to the output port of the processing element, and the adder and the register are controlled by said controller; a number of shift register arrays, respectively disposed at the input ports and output ports of the processing elements of said array architecture; a number of multiplexers, disposed at the transmitting ends of the shift register arrays; a multi-port memory, connected with the front ends of the shift register arrays; a set of broadcasting data lines, connected with the input ports of the processing elements of said array architecture, for receiving the feedback data from output of the processing elements of said array architecture and data from the multi-port memory; wherein the registers, multiplexers and multi-port memory are all controlled by said controller.

3. A pipelined SIMD-Systolic array processor, including:

a controller; a number of processing elements, constructed as an array architecture, wherein each processing element comprises an absolute-difference operational element, a multiplier, an adder, a register, input ports and an output port, the output ends of the absolute-difference operational element are connected to the multiplier, the output end of the multiplier is connected to an input end of the adder, the output of the adder is connected with the register of which the output is connected to another input end of the adder, the output end of the adder is connected to the output port of the processing element, and the adder and the register are controlled by said controller; a number of shift register arrays, respectively disposed at the input ports and output ports of the processing elements of said array architecture; a number of multiplexers, disposed at the transmitting ends of the shift register arrays; a multi-port memory, connected with the front ends of the shift register arrays; a set of broadcasting data lines, connected with the input ports of the processing elements of said array architecture, for receiving the feedback data from output of the processing elements of said array architecture and data from the multi-port memory; wherein the registers, multiplexers and multi-port memory are all controlled by said controller.

4. A pipelined SIMD-Systolic array processor, including: a controller; a number of processing elements, constructed as an array architecture, wherein each processing element comprises an absolute-difference operational element, a register, an adder, input ports and an output port, the input ends of the absolute-difference operational element are connected with the input ports of the processing element, the output end of the absolute-difference operational element is connected to an input end of the adder, the output end of the adder is connected with the register of which the output end is connected to another input end of the adder, the output end of the adder is connected to the output port of the processing element, and the adder and the register are controlled by said controller; a number of shift register arrays, respectively disposed at the input ports and output ports of the processing elements of said array architecture; a number of multiplexers, disposed at the transmitting ends of the shift register arrays; a multi-port memory, connected with the front ends of the shift register arrays; a set of broadcasting data lines, connected with the input ports of the processing elements of said array architecture, for receiving the feedback data from output of the processing elements of said array architecture and data from the multi-port memory; wherein the registers, multiplexers and multi-port memory are all controlled by said controller.

5. A processor according to Claim 1, wherein in each processing element, a further register is connected between the output end of the multiplier and an input end of the adder.

6. A processor according to Claim 1, wherein in each processing element, a constant register file is connected between the input port of the processing element and one input end of the multiplier, and a further register is connected between the output end of the multiplier and an input end of the adder.

7. A processor according to Claim 1, wherein in each processing element, a first-in first-out memory is disposed for receiving input data of each processing element and provided as another output of the processing element, and a further register is connected between the output end of the multiplier and an input end of the adder.

8. A processor according to Claim 1, wherein each processing element further includes: a constant register file, connected between an input end of the processing element and an input end of the multiplier; a further register, connected between an output end of the multiplier and an input end of the adder; a data register file, connected to the output end of the adder; and a tristate buffer and a decoder for connecting the output end of the data register file and providing another output end of the processing element; wherein said constant register file, further register, data register file, tristate buffer and decoder are all controlled by said controller.

9. A processor according to claim 2, wherein in each processing element, a constant register file is connected between the input port of the processing element and said input end of the multiplier, and the constant register file is also controlled by said controller.

10. A processor according to Claim 2, wherein in each processing element, a constant register file is connected between the input port of the processing element and said input end of the multiplier, the output end of the multiplier is connected with a data register file which is in turn connected with a tristate buffer and a decoder which provides another output of the processing element, and the constant register file, data register file, tristate buffer and decoder are all also controlled by said controller.

11. A processor according to claim 2, wherein in each processing element, a first-in first-out memory is disposed for receiving input data of the processing element and provided as another output of the processing element, and the first-in first-out memory is also controlled by said controller.

12. A processor according to claim 3, wherein in each processing element, a second register is connected between the output end of the absolute-difference operational element and the input end of the multiplier, a third register is connected between the output end of the multiplier and an input end of the adder, and the second and third registers are also controlled by said controller.

13. A processor according to claim 3, wherein in each processing element, a first-in first-out memory is disposed for receiving input data of the processing element and provided as another output of the processing element, and the first-in first-out memory is also controlled by said controller.

14. A processor according to claim 3, wherein in each processing element, a data register file is connected with the output end of the adder, a tristate buffer and a decoder are in turn connected with the data register file, and the data register file, tristate buffer, and the decoder are also controlled by said controller.

15. A processor according to claim 4, wherein in each processing element, a further register is connected between the output end of the absolute-difference operational element and an input end of the adder, and the further register is also controlled by said controller.

16. A processor according to claim 4, wherein in each processing element, a first-in first-out memory is disposed for receiving input data of the processing element and provided as another output of the processing element, and the first-in first-out memory is also controlled by said controller.

17. A processor according to Claim 4, wherein in each processing element, a data register file is connected with the output end of the adder, a tristate buffer and a decoder are in turn connected with the output end of the data register file, and the data register file, tristate buffer and decoder are all controlled by said controller.

18. A processor according to Claim 1, wherein said array architecture is constructed as a two-dimensional array.

19. A processor according to Claim 2, wherein said array architecture is constructed as a two-dimensional array.

20. A processor according to Claim 3, wherein said array architecture is constructed as a two-dimensional array.

21. A processor according to Claim 4, wherein said array architecture is constructed as a two-dimensional array.

22. A processor according to Claim 1, wherein the array architecture is constructed as a stage-pipelined array architecture, connected with a general-purpose processor.

23. A processor according to Claim 2, wherein the array architecture is constructed as a stage-pipelined array architecture, connected with a general-purpose processor.

24. A processor according to Claim 3, wherein the array architecture is constructed as a stage-pipelined array architecture, connected with a general-purpose processor.

25. A processor according to Claim 4, wherein the array architecture is constructed as a stage-pipelined array architecture, connected with a general-purpose processor.

26. A processor according to Claim 1, wherein one end of the array architecture is connected with a systolic architecture constructed by said processing elements, and the whole is connected and controlled by a general-purpose processor.

27. A processor according to Claim 2, wherein one end of the array architecture is connected with a systolic architecture constructed by said processing elements, and the whole is connected and controlled by a general-purpose processor.

28. A processor according to Claim 3, wherein one end of the array architecture is connected with a systolic architecture constructed by said processing elements, and the whole is connected and controlled by a general-purpose processor.

29. A processor according to Claim 4, wherein one end of the array architecture is connected with a systolic architecture constructed by said processing elements, and the whole is connected and controlled by a general-purpose processor.

30. A pipelined SIMD-Systolic array processor, including:

A number of cascaded pipelined processing elements, wherein each pipelined processing element comprises: a first register having an input connected to the output of a multiplier and an output; a second register having an input connected to the output of an adder and an output; a third register having an input connected to the output of an absolute-difference operational element and an output; a first multiplexer selecting data from a constant register file, a first input port, or the third register and having an output connected to the input of said multiplier; a second multiplexer selecting data from the first input port, a second input port, the first register, the third register, or the second register and having an output connected to the input of said multiplier; a third multiplexer selecting data from the first register, the third register, a data register file or the second input port and having an output connected to the input of said adder; a fourth multiplexer selecting data from said adder or multiplier and having an output; the first input port for receiving data from an input broadcasting circuit being connected to the inputs of said first multiplexer, said second multiplexer, and said absolute-difference operational element; the second input port for receiving systolic data from a first input shift-register array being connected to the inputs of said second multiplexer, said third multiplexer, a first-in first-out memory, the constant register file, and said absolute-difference operational element; the first-in first-out memory having an input connected to said second input port and an output connected to a first output port of the processing element; the constant register file having an input connected to said second input port and an output connected to the input of said first multiplexer; the multiplier having a first input connected to the output of said first multiplexer, a second input connected to the output of said second multiplexer, and an output connected to the inputs of said first register and said fourth multiplexer; the adder having a first input connected to the output of said third multiplexer, a second input connected to the output of said second register, and an output connected to the inputs of said second register and said fourth multiplexer; the absolute-difference operational element having a first input connected to said first input port, a second input connected to said second input port, and an output connected to the input of said third register; the data register file having an input connected to the output of said adder and an output connected to the inputs of said third multiplexer and a tristate buffer; the tristate buffer having a first input connected to the output of the data register file, a second input connected to the output of a decoder, and an output connected to a third output port of the processing element; the first output port for sending feedback data being connected to the output of the first-in first-out memory; the second output port for sending output data being connected to the output of said fourth multiplexer; the third output port for sending wired-or feedback data being connected to the output of the tristate buffer; the multiplexers, registers, first-in first-out memory, constant register file, adder, data register file, and decoder being all connected to control lines issued from a controller and a mode-control ROM for organizing various data transferring structures; and, by using control signals to control said multiplexers, each pipelined processing element being able to have various operational modes;

An input broadcasting circuit, comprising a register and a multiplexer, having a first input connected to the output of a multiport memory, a second input connected to the wired-or outputs of all the processing elements, and an output connected to said first input ports of all the processing elements;

A first input shift-register array, comprising registers and multiplexers, having an input connected to the output of said multiport memory, a group of inputs connected to said first output ports of all the processing elements, and another group of inputs connected to the outputs of a second input shift-register array;

The second input shift-register array, comprising registers, having an input connected to the output of the multiport memory and a group of outputs connected to the inputs of said first input shift-register array;

An output shift-register array, comprising registers and multiplexers, having an output connected to the inputs of the multiport memory and an external functional unit, a group of inputs connected to said second output ports of all the processing elements, and an input connected to said wired-or output ports of all the processing elements;

An output wired-or circuit having a group of inputs connected to said third output ports of all the processing elements, an output connected to said second input of said input broadcasting circuit, and an output connected to an input of said output shift-register array;

The multiport memory having a first input connected to an external host machine, a second input connected to the output of said output shift-register array, a first output connected to the input of said input broadcasting circuit, a second output connected to the input of said first input shift-register array, and a third output connected to the input of said second input shift-register array; and

The controller generating control signals which are broadcast to and control said pipelined processing elements, said input broadcasting circuit, said input shift-register arrays, said output shift-register array, said output wired-or circuit, said multiport memory, and said external functional unit.

31. A pipelined SIMD-Systolic array processing method, comprising the following procedures:

transferring data from a multiport memory systolically into a first input shift-register array by controlling multiplexers appropriately; transferring data from the multiport memory systolically into a second input shift-register array; transferring data in parallel from said second input shift-register array into said first input shift-register array and pipelined processing elements by controlling multiplexers appropriately; transferring data from said multiport memory into said pipelined processing elements through an input broadcasting circuit; transferring data from wired-or outputs into said pipelined processing elements through said input broadcasting circuit by controlling a multiplexer appropriately; transferring data in parallel from first output ports of the pipelined processing elements into said first input shift-register array by controlling multiplexers appropriately; transferring data in parallel from said first input shift-register array into second input ports of said pipelined processing elements; transferring data from said input broadcasting circuit into first input ports of said pipelined processing elements; performing computation in said pipelined processing elements under a certain operational mode; transferring the computational results in parallel from second output ports of said pipelined processing elements into an output shift-register array by controlling multiplexers appropriately; transferring the computational results systolically from said output shift-register array into said multiport memory or an external functional unit; transferring the computational results from third output ports of said pipelined processing elements into said input broadcasting circuit, multiport memory or external functional unit; and under the control of a controller, said input shift-register arrays, said input broadcasting circuit, said pipelined processing elements, said output shift-register array, said wired-or output circuit, and said multiport memory running concurrently to perform data transferring and computation.

32. A two-dimensional pipelined SIMD-Systolic array processor, comprising: pipelined processing elements, registers, multiplexers, data switches, multiport memory and controller, and further including:

a two-dimensional processing array, comprising columns and rows of said pipelined processing elements, wherein each pipelined processing element has a first input port for receiving vertical broadcasting data, a second input port for receiving horizontal broadcasting data, a first output port for sending computational results into an output shift-register array, and a second output port for sending computational results into a wired-or output circuit; a first input shift-register array, comprising registers and multiplexers, having an input connected to an output of the multiport memory and outputs connected to said two-dimensional processing array, wherein each output is connected to said second input ports of all the processing elements in the same row for broadcasting data horizontally; a second input shift-register array, comprising registers, having an input connected to an output of the multiport memory and outputs connected to inputs of the multiplexers of said first input shift-register array; a broadcasting register having an input connected to an output of the multiport memory and an output connected to said first input ports of all the processing elements in the leftmost column of said two-dimensional processing array for broadcasting data vertically; a third input shift-register array, comprising register delays, having an input connected to the output of said broadcasting register and outputs connected to said two-dimensional processing array, wherein each output is connected to said first input ports of all the processing elements in the same column for broadcasting data vertically; each column of processing elements of said two-dimensional processing array may have an output shift-register array connected to said first output ports, comprising registers and multiplexers for transferring computational results systolically into the multiport memory or its right neighboring column, the data switches may be used for controlling data transferring between output shift-register arrays and said multiport memory when more than one output shift-register array is used, and each column of processing elements of said two-dimensional processing array may have a wired-or output circuit connected to said first input ports of all the processing elements in its right neighboring column for transferring intermediate computational results; and a controller generating control signals which are connected to and control said two-dimensional processing array, said input shift-register arrays, said input broadcasting register, said output shift-register arrays, said data switches, said wired-or output circuits, and said multiport memory.

33. A two-dimensional pipelined SIMD-Systolic array processor comprising a number of pipelined processing elements, registers, multiplexers, multiport memory and controller, and the processor further including: a two-dimensional processing array, comprising columns and rows of said pipelined processing elements; an input shift-register array, comprising registers, having an input connected to an output of the multiport memory and outputs connected to said two-dimensional processing array, wherein each output is connected to said second input ports of all the processing elements in the same row for broadcasting data horizontally; a broadcasting register having an input connected to an output of the multiport memory and an output connected to said first input ports of all the processing elements in the leftmost column of said two-dimensional processing array for broadcasting data vertically, wherein, except the rightmost column, each column of processing elements of said two-dimensional processing array has a wired-or output circuit connected to said first input ports of all the processing elements in its right neighboring column for transferring intermediate computational results; an output shift-register array, comprising registers and multiplexers, having inputs connected to said first output ports of all the processing elements in the rightmost column of said two-dimensional processing array for transferring computational results systolically into the multiport memory; and a controller generating control signals for controlling said two-dimensional processing array, said input shift-register array, said broadcasting register, said wired-or output circuits, said output shift-register array, and said multiport memory.

34. A two-dimensional pipelined SIMD-Systolic array processor comprising:

pipelined processing elements, registers, multiplexers, data switches, multiport memory and controller, and the processor further including:

a two-dimensional processing array, comprising columns and rows of said pipelined processing elements; a first input shift-register array, comprising registers and multiplexers, having an input connected to an output of the multiport memory and outputs connected to said two-dimensional processing array, wherein each output is connected to said second input ports of all the processing elements in the same row for broadcasting data horizontally; a second input shift-register array, comprising registers, having an input connected to an output of the multiport memory and outputs connected to inputs of the multiplexers of said first input shift-register array; a broadcasting register having an input connected to an output of the multiport memory and an output connected to said first input ports of all the processing elements in the leftmost column of said two-dimensional processing array for broadcasting data vertically; a third input shift-register array, comprising register delays, having an input connected to the output of said broadcasting register and outputs connected to said two-dimensional processing array, wherein each output is connected to said first input ports of all the processing elements in the same column for broadcasting data vertically, wherein each column of processing elements of said two-dimensional processing array has an output shift-register array connected to said first output ports, comprising registers and multiplexers for transferring computational results systolically into said multiport memory; data switches being used for controlling data transferring between output shift-register arrays and said multiport memory; and a controller generating control signals for controlling said two-dimensional processing array, said input shift-register arrays, said broadcasting register, said data switches, said output shift-register arrays, and said multiport memory.

35. A pipelined SIMD-Systolic array processor substantially as described herein with reference to the drawings.

9 Amendments to the claims have been filed as follows 1. A pipelined SIMD-Systolic array processor, including:

a controller;
a plurality of processing elements constructed as an array architecture, wherein each processing element comprises an adder, a register, input ports and an output port, the adder and the register being controlled by said controller;
a plurality of shift register arrays, respectively disposed at the input ports and output ports of the processing elements of said array architecture;
a plurality of multiplexers, disposed at the transmitting ends of the shift register arrays;
a multi-port memory, connected with the front ends of the shift register arrays; and
a set of broadcasting data lines, connected with the input ports of the processing elements of said array architecture, for receiving the feedback data from output of the processing elements of said array architecture and data from the multi-port memory;
wherein the registers, multiplexers and multi-port memory are all controlled by said controller.

2. A processor according to claim 1, wherein each processing element includes a multiplier, the output end of the multiplier is connected with an input end of the adder, input ends of the multiplier are connected with the input ports of a processing element for receiving input data, the output end of the adder is connected with the register of which the output end is connected to another input end of the adder and the output end of the adder is connected to the output port of the processing element.
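
The datapath recited in claim 2 amounts to a multiply-accumulate loop. The following is a minimal behavioural sketch, assuming integer data and one operation per clock; the class name `MacElement` and the method `step` are hypothetical and do not appear in the specification.

```python
class MacElement:
    """Hypothetical behavioural model of the claim 2 datapath (not the patented circuit)."""

    def __init__(self):
        # Register whose output is fed back to the second input of the adder.
        self.register = 0

    def step(self, a, b):
        product = a * b                    # multiplier: both inputs come from the input ports
        total = product + self.register    # adder: product plus the fed-back register value
        self.register = total              # register latches the adder output
        return total                       # adder output also drives the output port


pe = MacElement()
outputs = [pe.step(a, b) for a, b in [(1, 2), (3, 4), (5, 6)]]
print(outputs)  # [2, 14, 44]
```

Each call to `step` models one clock in which the register is updated, so the output port carries the running sum of products.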

3. A processor according to claim 2, wherein each processing element further includes an absolute-difference operational element having output ends which are connected to the multiplier.

4. A processor according to claim 1, wherein each processing element includes a multiplier, an input end of the adder and an input end of the multiplier are connected with the input port of a processing element for receiving input data, the register is connected between the output end of the adder and another input end of the multiplier, the output of the register is connected to another input end of the adder and the output of the multiplier is connected to the output port of the processing element.

5. A processor according to any one of claims 2 to 4, wherein in each processing element, a further register is connected between the output end of the multiplier and an input end of the adder.

6. A processor according to any one of claims 2 to 5, wherein in each processing element, a constant register file is connected between the input port of the processing element and one input end of the multiplier.

7. A processor according to claim 3, wherein in each processing element, a second register is connected between the output end of the absolute-difference operational element and the input end of the multiplier, a third register is connected between the output end of the multiplier and an input end of the adder, and the second and third registers are also controlled by said controller.

8. A processor according to claim 1, wherein each processing element further includes an absolute-difference operational element having input ends connected with the input ports of the processing element, the output end of the absolute-difference operational element is connected to an input end of the adder, the output end of the adder is connected with the register of which the output end is connected to another input end of the adder and the output end of the adder is connected to the output port of the processing element.
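
The claim 8 datapath accumulates absolute differences of the two inputs. A minimal sketch under the same assumptions as before (integer data, one operation per clock) is given below; `SadElement`, `step` and `sum_abs_diff` are hypothetical names introduced only for illustration.

```python
class SadElement:
    """Hypothetical behavioural model of the claim 8 datapath (not the patented circuit)."""

    def __init__(self):
        # Register whose output is fed back to the second input of the adder.
        self.register = 0

    def step(self, a, b):
        diff = abs(a - b)               # absolute-difference operational element
        total = diff + self.register    # adder: difference plus the fed-back register value
        self.register = total           # register latches the adder output
        return total                    # adder output also drives the output port


def sum_abs_diff(block_a, block_b):
    """Feed two equal-length sequences through one element and return the final sum."""
    pe = SadElement()
    result = 0
    for a, b in zip(block_a, block_b):
        result = pe.step(a, b)
    return result


sad = sum_abs_diff([10, 20, 30, 40], [12, 18, 33, 40])
print(sad)  # 7
```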

9. A processor according to claim 8, wherein in each processing element, a further register is connected between the output end of the absolute-difference operational element and an input end of the adder, and the further register is also controlled by said controller.

10. A processor according to any preceding claim, wherein in each processing element, a first-in first-out memory is disposed for receiving input data of the processing element and provided as another output of the processing element, and the first-in first-out memory is also controlled by said controller.

11. A processor according to any preceding claim, wherein in each processing element, a data register file is connected with the output end of the adder, a tristate buffer and a decoder are in turn connected with the output end of the data register file, and the data register file, tristate buffer and decoder are all controlled by said controller.

12. A processor according to any preceding claim, wherein said array architecture is constructed as a two-dimensional array.

13. A processor according to any preceding claim, wherein the array architecture is constructed as a stage-pipelined array architecture, connected with a general-purpose processor.

14. A processor according to any preceding claim, wherein one end of the array architecture is connected with a systolic architecture constructed by said processing elements, and the whole is connected and controlled by a general purpose processor.

15. A processor according to claim 1, wherein the first register has an input connected to the output of a multiplier and an output and wherein the processor further includes:
a second register having an input connected to the output of the adder and an output;
a third register having an input connected to the output of an absolute-difference operational element and an output;
a first multiplexer selecting data from a constant register file, a first input port or a third register and having an output connected to the input of said multiplier;
a second multiplexer selecting data from the first input port, a second input port, the first register, the third register or the second register and having an output connected to the input of a multiplier;
a third multiplexer selecting data from the first register, the third register, a data register file or the second input port and having an output connected to the input of said adder;
a fourth multiplexer selecting data from said adder or multiplier and having an output;
a first input port for receiving data from an input broadcasting circuit being connected to the inputs of said first multiplexer, said second multiplexer and an absolute-difference operational element;
a second input port for receiving systolic data from a first input shift-register array being connected to the inputs of said second multiplexer, said third multiplexer, a first-in first-out memory, a constant register file and said absolute-difference operational element;
the first-in first-out memory having an input connected to said second input port and an output connected to a first output port of the processing element;
the constant register file having an input connected to said second input port and an output connected to the input of said first multiplexer;
the multiplier having a first input connected to the output of said first multiplexer, a second input connected to the output of said second multiplexer and an output connected to the inputs of said first register and said fourth multiplexer;
the adder having a first input connected to the output of said third multiplexer, a second input connected to the output of said second register and an output connected to the inputs of said second register and said fourth multiplexer;
the absolute-difference operational element having a first input connected to said first input port, a second input connected to said second input port and an output connected to the input of said third register;
a data register file having an input connected to the output of said adder and an output connected to the inputs of said third multiplexer and a tristate buffer;
the tristate buffer having a first input connected to the output of the data register file, a second input connected to the output of a decoder and an output connected to a third output port of the processing element;
the first output port for sending feedback data being connected to the output of said first-in first-out memory;
the second output port for sending output data being connected to the output of said fourth multiplexer;
the third output port for sending wired-or feedback data being connected to the output of the tristate buffer;
the multiplexers, registers, first-in first-out memory, constant register file, adder, data register file, and decoder being all connected to control lines issued from the controller and a mode-control ROM for organising various data transferring structures; and,
by using control signals to so control the said multiplexers that each pipelined processing element is able to have various operational modes.

16. A pipelined SIMD-Systolic array processing method utilising the processor claimed in any one of the preceding claims, said method comprising the following procedures:
transferring data from a multiport memory systolically into a first input shift-register array by controlling multiplexers appropriately;
transferring data from the multiport memory systolically into a second input shift-register array;
transferring data parallelly from said second input shift-register array into said first input shift-register array and pipelined processing elements by controlling multiplexers appropriately;
transferring data from said multiport memory into said pipelined processing elements through an input broadcasting circuit;
transferring data from wired-or outputs into said pipelined processing elements through said input broadcasting circuit by controlling multiplexers appropriately;
transferring data parallelly from first output ports of pipelined processing elements into said first input shift-register array by controlling multiplexers appropriately;
transferring data parallelly from said first input shift-register array into second input ports of said pipelined processing elements;
transferring data from said input broadcasting circuit into first input ports of said pipelined processing elements;
computation being performed in said pipelined processing elements under certain operational modes;
transferring the computational results parallelly from second output ports of said pipelined processing elements into an output shift-register array by controlling multiplexers appropriately;
transferring the computational results systolically from said output shift-register array into said multiport memory or external functional unit;
transferring the computational results from third output ports of said pipelined processing elements into said input broadcasting circuit, multiport memory or external functional unit; and
under the control of the controller, said input shift-register arrays, said input broadcasting circuit, said pipelined processing elements, said output shift-register array, said wired-or output circuit, and said multiport memory running concurrently to perform data transferring and computation.
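
A subset of the procedures of claim 16 can be illustrated with a one-dimensional array computing a matrix-vector product. The sketch below is only an assumption-laden illustration: it runs the steps one after another, whereas the claim recites concurrent operation, and it omits the second input shift-register array, the wired-or outputs and the multiplexer control. The function names `shift_in`, `shift_out` and `matrix_vector` are hypothetical.

```python
def shift_in(values, length):
    """Load values one per clock into a shift register with `length` stages."""
    register = [0] * length
    for v in values:
        register = [v] + register[:-1]   # new value enters stage 0, the rest move along
    return register


def shift_out(register):
    """Drain a shift register one value per clock, last stage first."""
    register = list(register)
    drained = []
    while register:
        drained.append(register.pop())
    return drained


def matrix_vector(matrix_columns, vector):
    """Multiply a matrix (given as a list of columns) by a vector, one column per pass."""
    n = len(matrix_columns[0])
    accumulators = [0] * n               # one accumulating register per processing element
    for column, x in zip(matrix_columns, vector):
        # Systolic load: feeding the column in reverse leaves column[i] in stage i.
        stages = shift_in(reversed(column), n)
        # Parallel transfer to the second input ports, broadcast of x to the first
        # input ports, then a multiply-accumulate in every element.
        accumulators = [acc + s * x for acc, s in zip(accumulators, stages)]
    # Parallel transfer into the output shift register, then systolic drain.
    drained = shift_out(accumulators)
    drained.reverse()                    # restore element order for the caller
    return drained


# Matrix [[1, 2], [3, 4]] stored by columns, multiplied by the vector [5, 6].
result = matrix_vector([[1, 3], [2, 4]], [5, 6])
print(result)  # [17, 39]
```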

17. A processor according to any one of claims 1 to 11 and further including:
a two-dimensional processing array, comprising columns and rows of said pipelined processing elements;
an input shift-register array, comprising registers, having an input connected to an output of a multiport memory and outputs connected to said two-dimensional processing array, wherein each output is connected to said second input ports of all the processing elements in the same row for broadcasting data horizontally;
a broadcasting register having an input connected to an output of the multiport memory and an output connected to said first input ports of all the processing elements in the leftmost column of said two-dimensional processing array for broadcasting data vertically, wherein, except the rightmost column, each column of processing elements of said two-dimensional processing array has a wired-or output circuit connected to said first input ports of all the processing elements in its right neighbouring column for transferring intermediate computational results;
an output shift-register array, comprising registers and multiplexers, having inputs connected to said first output ports of all the processing elements in the rightmost column of said two-dimensional processing array for transferring computational results systolically into the multiport memory; and
the controller generating control signals for controlling said two-dimensional processing array, said input shift-register array, said broadcasting register, said wired-or output circuits, said output shift-register array, and said multiport memory.

18. A processor according to any one of claims 1 to 11 and further including:

a two-dimensional processing array, comprising columns and rows of said processing elements, wherein each processing element has a first input port for receiving vertical broadcasting data, a second input port for receiving horizontal broadcasting data, a first output port for sending computational results into an output shift-register array, and a second output port for sending computational results into a wired-or output circuit;
a second input shift-register array, comprising registers, having an input connected to an output of multiport memory and outputs connected to inputs of multiplexers of said first-mentioned input shift-register array;
a third input shift-register array, comprising register delays, having an input connected to the output of said broadcasting register and outputs connected to said two-dimensional processing array, wherein each output is connected to said first input ports of all the processing elements in the same column for broadcasting data vertically,
each column of processing elements of said two-dimensional processing array having an output shift-register array connected to said first output ports, comprising registers and multiplexers for transferring computational results systolically into the multiport memory or its right neighbouring column, data switches being used for controlling data transferring between output shift-register arrays and said multiport memory when more than one output shift-register arrays are used.
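
One computation that fits the row and column broadcasting recited above is matrix multiplication by accumulated outer products. The sketch below is a hypothetical illustration only: it assumes each processing element performs a multiply-accumulate, it ignores the register delays of the third input shift-register array, and it leaves out the output shift-register arrays and the wired-or circuit. The function name `array_matmul` is not taken from the specification.

```python
def array_matmul(a, b):
    """Multiply matrices a (rows x inner) and b (inner x cols) on a modelled 2-D array."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # One accumulating register per processing element at position (i, j).
    acc = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        row_lines = [a[i][k] for i in range(rows)]   # horizontal broadcast: one value per row
        col_lines = b[k]                             # vertical broadcast: one value per column
        for i in range(rows):
            for j in range(cols):
                # Every element combines its row value with its column value.
                acc[i][j] += row_lines[i] * col_lines[j]
    return acc


product = array_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19, 22], [43, 50]]
```

In this model the two inner loops stand in for work that all elements of the array would perform in the same clock.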

19. A processor according to claim 18 and further including a two-dimensional processing array, comprising columns and rows of said pipelined processing elements; and data switches for controlling data transferring between output shift-register arrays and said multiport memory.

20. A pipelined SIMD-Systolic array processor substantially as described herein with reference to the drawings.