CA1165005A - Processing element for parallel array processors - Google Patents

Processing element for parallel array processors

Info

Publication number
CA1165005A
Authority
CA
Canada
Prior art keywords
register
data
processing element
array
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000425731A
Other languages
French (fr)
Inventor
Kenneth E. Batcher
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goodyear Aerospace Corp
Original Assignee
Goodyear Aerospace Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US06/108,883 (US4314349A)
Application filed by Goodyear Aerospace Corp
Priority to CA000425731A
Application granted
Publication of CA1165005A
Expired

Landscapes

  • Multi Processors (AREA)

Abstract

PROCESSING ELEMENT FOR
PARALLEL ARRAY PROCESSORS

ABSTRACT OF THE DISCLOSURE
A processing element constituting the basic building block of a massively-parallel processor.
Fundamentally, the processing element includes an arithmetic sub-unit comprising registers for operands, a sum-bit register, a carry-bit register, a shift register of selectively variable length, and a full adder. A logic network is included with each processing element for performing the basic Boolean logic functions between two bits of data. There is also included a multiplexer for intercommunicating with neighboring processing elements and a register for receiving data from and transferring data to neighboring processing elements. Each such processing element includes its own random access memory which communicates with the arithmetic sub-unit and the logic network of the processing element.

Description

PROCESSING ELEMENT FOR
PARALLEL ARRAY PROCESSORS
BACKGROUND OF THE INVENTION
The instant invention resides in the art of data processors and, more particularly, with large scale parallel processors capable of handling large volumes of data in a rapid and cost-effective manner.
Presently, the demands on data processors are such that large pluralities of data must be arithmetically and logically processed in short periods of time for purposes of constantly updating previously obtained results or, alternatively, for monitoring large fields from which data may be acquired and in which correlations must be made. For example, this country is presently intending to orbit imaging sensors which can generate data at rates up to $10^{13}$ bits per day. For such an imaging system, a variety of image processing tasks such as geometric correction, correlation, image registration, feature selection, multi-spectral classification, and area measurement are required to extract useful information from the mass of data obtained. Indeed, it is expected that the work load for a data processing system utilized in association with such orbiting image sensors would fall somewhere between $10^{9}$ and $10^{10}$ operations per second.
High speed processing systems and sophisticated parallel processors, capable of simultaneously operating on a plurality of data, have been known for a number of years. Indeed, applicant's prior U.S. Patents 3,800,289; 3,812,467; and 3,936,806, all relate to a structure for vastly increasing the data processing capability of digital computers. Similarly, U.S. Patent 3,863,233, assigned to Goodyear Aerospace Corporation, the assignee of the instant application, relates specifically to a data processing element for an associative or parallel processor which also increases data processing speed by including a plurality of arithmetic units, one for each word in the memory array. However, even the great advancements of these prior art teachings do not possess the capability of cost effectively handling the large volume of data previously described.
A system of the required nature includes thousands of processing elements, each including its own arithmetic and logic network operating in conjunction with its own memory, while possessing the capability of communicating with other similar processing elements within the system. With thousands of such processing elements operating simultaneously (massive-parallelism), the requisite speed may be achieved. Further, since typical satellite images include millions of picture elements or pixels that can generally be processed at the same time, such a structure lends itself well to the solution of the aforementioned problem.
In a system capable of processing a large volume of data in a massively-parallel manner, it is most desirable that the system be capable of performing bit-serial mathematics for cost effectiveness. However, in order to increase speed in the bit-serial computation, it is most desirable that a variable length shift register be included such that various word lengths may be accommodated. Further, it is desirable that the massive array of processing elements be capable of intercommunication such that data may be moved between and among at least neighboring processing elements. Further, it is desirable that each processing element be capable of performing all of the Boolean operations possible between two bits of data, and that each such processing element include its own random access memory. Yet further, for such a system to be efficient, it should include means for bypassing inoperative or malfunctioning processing elements without diminishing system integrity.
OBJECTS OF THE INVENTION
In light of the foregoing, it is an object of an aspect of the invention to provide a plurality of processing elements for a parallel array processor wherein each such element includes a variable length shift register for at least assisting in arithmetic computations.
An object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element is capable of intercommunicating with at least certain neighboring processing elements.
An object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element is capable of performing bit-serial mathematical computations.
An object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element is capable of performing all of the Boolean functions capable of being performed between two bits of binary data.
An object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein each such processing element includes its own memory and data bus.
An object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor wherein certain of said processing elements may be bypassed should they be found to be inoperative or malfunctioning, such bypassing not diminishing the system integrity.
An object of an aspect of the invention is to provide a plurality of processing elements for a parallel array processor which achieves cost-effective processing of a large plurality of data in a time-efficient manner.
SUMMARY OF THE INVENTION
An aspect of the invention is as follows:
An array of a plurality of processing elements interconnected with each other and wherein each processing element comprises:
an adder;
first and second data registers connected with and supplying data bits to said adder;
a carry register connected to said adder and receiving therefrom data bits resulting from arithmetic operations and which functions according to the rule:
$C \leftarrow AP \lor PC \lor AC$
where A is the state of said first register, P is the state of said second register, and C is the state of said carry register;
a memory; and
a data bus interconnecting said first, second, and carry registers and said memory for the transfer of data thereamong.
DESCRIPTION OF DRAWINGS
For a complete understanding of the objects, techniques, and structure of the invention, reference should be had to the following detailed description and accompanying drawings wherein:
Fig. 1 is a block diagram of a massively-parallel processing system according to the invention, showing the interconnection of the array unit incorporating a plurality of processing elements;
Fig. 2 is a block diagram of a single processing element, comprising the basic building block of the array unit of Fig. 1;
Fig. 3, consisting of Figs. 3A-3C, constitutes a circuit schematic of the control signal generating circuitry of the processing elements maintained upon a chip and including the sum-or and parity trees;
Fig. 4 is a detailed circuit schematic of the fundamental circuitry of a processing element of the invention;
Fig. 5, comprising Figs. 5A and 5B, presents circuit schematics of the switching circuitry utilized in removing an inoperative or malfunctioning processing element from the array unit.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENT
Referring now to the drawings and more particularly Fig. 1, it can be seen that a massively-parallel processor is designated generally by the numeral 10. A key element of the processor 10 is the array unit 12 which, in a preferred embodiment of the invention, includes a matrix of 128 x 128 processing elements, for a total of 16,384 processing elements, to be described in detail hereinafter.
The array unit 12 inputs data on its left side and outputs data on its right side over 128 parallel lines. The maximum transfer rate of 128-bit columns of data is 10 MHz for a maximum bandwidth of 1.28 billion bits per second. Input, output, or both, can occur simultaneously with processing.
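The figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The constant names are illustrative assumptions and do not appear in the disclosure.

```python
# Array dimensions and I/O rate as stated in the text; names are hypothetical.
ROWS = 128                    # processing elements per column
COLS = 128                    # columns in the array unit 12
COLUMN_RATE_HZ = 10_000_000   # 128-bit columns transferred per second (10 MHz)

pe_count = ROWS * COLS                   # total processing elements
bandwidth_bps = ROWS * COLUMN_RATE_HZ    # one bit per line per transfer

print(pe_count, bandwidth_bps)
```

Each of the 128 parallel lines carries one bit per transfer, so the bandwidth is the line count multiplied by the column rate.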
Electronic switches 24 select the input of the array unit 12 from the 128-bit interface of the processor 10, or from the input register 16. Similarly, the array 12 output may be steered to the 128-bit output interface of the processor 10 or to the output register 14 via switches 26. These switches 24,26 are controlled by the program and data management unit 18 under suitable program control. Control signals to the array unit 12 and status bits from the array unit may be connected to the external control interface of the processor 10 or to the array control unit 20. Again, this transfer is achieved by electronic switches 22, which are under program control of the unit 18.
The array control unit 20 broadcasts control signals and memory addresses to all processing elements of the array unit 12 and receives status bits therefrom. It is designed to perform bookkeeping operations such as address calculation, loop control, branching, subroutine calling, and the like. It operates simultaneously with the processing element control such that full processing power of the processing elements of the array unit 12 can be applied to the data to be handled. The control unit 20 includes three separate control units: the processing element control unit executes micro-coded vector processing routines and controls the processing elements and their associated memories; the input-output control unit controls the shifting of data through the array unit 12; and the main control unit executes the application programs, performs the scalar processing internally, and makes calls to the processing element control unit for all vector processing.
The program and data management unit 18 manages data flow between the units of the processor 10, loads programs into the control unit 20, executes system tests and diagnostic routines, and provides program development facilities. The details of such structure are not important for an understanding of the instant invention, but it should be noted that the unit 18 may readily comprise a mini-computer such as the Digital Equipment Corporation (DEC) PDP-11/34 with interfaces to the control unit 20, array unit 12 (registers 14,16), and the external computer interface. As is well known in the art, the unit 18 may also include peripheral equipment such as magnetic tape drive 28, disks 30, a line printer 32, and an alphanumeric terminal 34.
While the structure of Fig. 1 is of some significance for an appreciation of the overall system incorporating the invention, it is to be understood that the details thereof are not necessary for an appreciation of the scope and breadth of applicant's inventive concept. Suffice it to say at this time that the array unit 12 comprises the inventive concept to be described in detail herein and that such array includes a large plurality of interconnected processing elements, each of which has its own local memory, is capable of performing arithmetic computations, is capable of performing a full complement of Boolean functions, and is further capable of communicating with at least the processing elements orthogonally neighboring it on each side, hereinafter referenced as north, south, east, and west.
With specific reference now to Fig. 2, it can be seen that a single processing element is designated generally by the numeral 36. The processing element itself includes a P register 38 which, together with its input logic 40, performs all logic and routing functions for the processing element 36. The A, B, and C registers 42-46, the variable length shift register 48 and the associated logic of the full adder 50 comprise the arithmetic unit of the processing element 36. The G register 52 is provided to control masking of both arithmetic and logical operations, while the S register 54 is used to shift data into and out of the processing element 36 without disturbing operations thereof. Finally, the aforementioned elements of the processing element 36 are connected to a uniquely associated random access memory 56 by means of a bi-directional data bus 58.
As presently designed, the processing element 36 is reduced by large scale integration to such a size that a single chip may include eight such processing elements along with a parity tree, a sum-or circuit, and associated control decode. In the preferred embodiment of the invention, the eight processing elements on a chip are provided in a two row by four column arrangement. Since the size of random access memories presently available through large scale integration is rapidly changing, it is preferred that the memory 56, while comprising a portion of the processing element 36, be maintained separate from the integrated circuitry of the remaining structure of the processing elements such that, when technology allows, larger memories may be incorporated with the processing elements without altering the total system design.
The data bus 58 is the main data path for the processing element 36. During each machine cycle it can transfer one bit of data from any one of six sources to one or more destinations. The sources include a bit read from the addressed location in the random access memory 56, the state of the B, C, P, or S registers, or the state of the equivalence function generated by the element 60 and indicating the state of equivalence existing between the outputs of the P and G registers. The equivalence function is used as a source during a masked-negate operation.
The destinations of a data bit on the data bus 58 are the addressed location of the random access memory 56, the A, G, or S registers, the logic associated with the P register, the input to the sum-or tree, and the input to the parity tree.
Before considering the detailed circuitry of the processing element 36, attention should be given to Fig. 3 wherein the circuitry 62 for generating the control signals for operating the processing elements is shown. The circuitry of Fig. 3 is included in a large scale integrated chip which includes eight processing elements, and is responsible for controlling those associated elements. Fundamentally, the circuitry of Fig. 3 includes decode logic receiving control signals on lines L0-LF under program control and converts those signals into the control signals K1-K27 for application to the processing elements 36, sum-or tree, and parity tree. Additionally, the circuitry of Fig. 3 generates from the main clock of the system all other clock pulses necessary for control of the processing element 36.
One skilled in the art may readily deduce from the circuitry of Fig. 3 the relationship between the programmed input function on the lines L0-LF and the control signals K1-K27. For example, the inverters 64,66 result in K1 = LC. Similarly, inverters 68-72 and NAND gate 74 result in K16 = L0·L1. By the same token, K18 = L2·L3·L4·L6.
Clock pulses for controlling the processing elements 36 are generated in substantially the same manner as the control signals. The same would be readily apparent to those skilled in the art from a review of the circuitry 62 of Fig. 3. For example, the clock S-CLK = S-CLK-ENABLE · MAIN CLK by virtue of inverters 76,78 and NAND gate 80. Similarly, clock G-CLK = L8 · MAIN CLK by virtue of inverters 76,82 and NAND gate 84.
With further respect to the circuitry 62 of Fig. 3, it can be seen that there is provided means for determining parity error and the sum-or of the data on the data bus of all processing elements. The data bit on the data bus may be presented to the sum-or tree, which is a tree of inclusive-or logic elements which forms the inclusive-or of all processing element data bus states and presents the results to the array control unit 20.
In order to detect the presence of processing elements in certain states, groups of eight processing elements are ORed together in an eight-input sum-or tree whose output is then fed to a 2048-input or-tree external to the chip to achieve a sum-or of all 16,384 processing elements.
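The two-level reduction just described can be sketched in Python as follows. This is a behavioural model only, under the assumption that the data bus states are held in a flat list ordered chip by chip; the function names are hypothetical.

```python
from typing import List, Sequence

PES_PER_CHIP = 8  # eight processing elements share one on-chip sum-or tree

def chip_sum_or(bits: Sequence[int]) -> int:
    """Inclusive-OR of the eight data bus bits on one chip."""
    assert len(bits) == PES_PER_CHIP
    return int(any(bits))

def array_sum_or(bus_bits: Sequence[int]) -> int:
    """OR the per-chip results in an external tree (2048 inputs for 16,384 PEs)."""
    chip_outputs: List[int] = [
        chip_sum_or(bus_bits[i:i + PES_PER_CHIP])
        for i in range(0, len(bus_bits), PES_PER_CHIP)
    ]
    return int(any(chip_outputs))
```

With 16,384 bus bits the list comprehension yields 2048 chip outputs, matching the width of the external or-tree named in the text.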
Errors in the random access memory 56 may be determined in standard fashion by parity-generation and checking circuitry. With each group of eight processing elements 36 there is a parity-error flip-flop 86 which is set to a logic 1 whenever a parity error is detected in an associated random access memory 56. As shown in the circuitry 62, the sum-or tree comprises the three gates designated by the numeral 88 while the parity error tree consists of the seven exclusive-OR gates designated by the numeral 90.
During read operations, the parity output is latched in the flip-flop 86 at the end of the cycle by the M-clock. During write operations, parity is outputted to a parity memory through the parity-bit pin of the chip. The parity memory comprises a ninth random access memory similar to the elements 56. The parity state stored at the parity bit during write operations is exclusive-ORed with the output of the parity tree 90 during read operations to affect the latch 86.
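A minimal Python sketch of this write/read parity scheme follows. It assumes even parity and a latch that stays set until explicitly cleared; both the polarity and the function names are assumptions made for illustration, not details taken from Fig. 3.

```python
from functools import reduce
from operator import xor
from typing import Sequence

def parity_tree(bits: Sequence[int]) -> int:
    """XOR of eight data bits; seven two-input XOR gates suffice for eight inputs."""
    return reduce(xor, bits)

def write_parity(bits: Sequence[int]) -> int:
    """Parity bit sent to the ninth (parity) memory during a write."""
    return parity_tree(bits)

def read_check(bits: Sequence[int], stored_parity: int, latch: int) -> int:
    """On a read, XOR the stored parity with the recomputed parity.

    A mismatch sets the error latch; the latch stays set until cleared
    (the text assigns clearing to control signal K24).
    """
    return latch | (parity_tree(bits) ^ stored_parity)
```

For example, writing `[1, 0, 1, 1, 0, 0, 1, 0]` stores a parity bit that matches on a clean read, while a single flipped bit on read-back sets the latch.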
As shown, control signal K23 determines whether a read or write operation is being performed, while K24 is used for clearing the parity-error flip-flop 86. The sum-or tree 88 ORs all of the data bits D0-D7 on the associated data bus lines of the eight processing elements 36 of the chip. As can be seen, both the parity outputs and the sum-or outputs are transferred via the same gating matrix 92, which is controlled by K27 to determine whether parity or sum-or will be transferred from the chip to the array control unit 20. The outputs of the flip-flops 86 of each of the processing elements are connected to the 2048-input sum-or tree such that the presence of any set flip-flop 86 might be sensed. By using a flip-flop which latches upon an error, the array control unit 20 can sequentially disable columns of processing elements until that column containing the faulty element is found.
Finally, and as will be discussed further hereinafter, control signal K25 is used to disable the parity and sum-or outputs from the chip when the chip is disabled and no longer used in the system.
While the utilization of sum-or and parity functions is known in the art, their utilization in the instant invention is important to assist in locating faulty processing elements such that those elements may be removed from the operative system. The trees 88,90, mutually exclusively gated via the network 92, provide the capability for columns of processing elements 36 to be checked for parity and further provide the sum-or network to determine the presence of processing elements in particular logic states, such as to determine the responder to a search operation. The number of circuit elements necessary for this technique has been kept to a minimum by utilizing a single output for the two trees, with that output being multiplexed under program control.
With final attention to Fig. 3, it can be seen that the disable signal, utilized for removing an entire column of processing element chips from the array unit 12, generates the signals K25,K26 for this purpose. As mentioned above, the control signal K25 disables the sum-or and parity outputs for associated processing elements. Further functions of the signals K25,K26 with respect to removing selected processing elements will be discussed with respect to Fig. 5 hereinafter.
With reference now to Fig. 4, and correlating the same to Fig. 2, it can be seen that the full adder of the invention comprises logic gates 94-100. This full adder communicates with the B register comprising flip-flop 102 which receives the sum bit, the C register which comprises flip-flop 104 which receives the carry bit, and further communicates with the variable length shift register 48 which comprises 16, 8, and 4 bit shift registers 106-110, flip-flops 112,114, and multiplexers 116-120.
The adder receives an input from the shift register, the output of the A register 122, and an input from the logic and routing sub-unit, the output of the P register 124. Whenever control line K21 is a logic 1 and BC-CLK is clocked, the adder adds the two input bits from registers A and P to the carry bit stored in the C register 104 to form a two-bit sum. The least significant bit of the sum is clocked into the B register 102 and the most significant bit of the sum is clocked into the C register 104 so that it becomes the carry bit for the next machine cycle. If K21 is at a logic 0, a 0 is substituted for the P bit.
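One clock of the adder, as described in the preceding paragraph, can be modelled in a few lines of Python. The function name and argument order are hypothetical; only the behaviour (two-bit sum, K21 gating of the P bit) comes from the text.

```python
from typing import Tuple

def adder_step(a: int, p: int, c: int, k21: int = 1) -> Tuple[int, int]:
    """One BC-CLK cycle: returns (new B, new C).

    a, p, c are the current A, P and C register bits. When k21 is 0 a
    logic 0 is substituted for the P bit, as stated in the text.
    """
    p_eff = p if k21 else 0
    total = a + p_eff + c          # two-bit sum, 0..3
    return total & 1, total >> 1   # LSB -> B register, MSB -> C register
```

For instance, `adder_step(1, 1, 1)` yields a sum bit of 1 and a carry of 1, while the same inputs with `k21=0` add only A and C.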
As shown, control line K12 sets the C register 104 to the logic 1 state while control line K13 resets the C register to the logic 0 state. Control line K16 passes the state of the B register 102 onto the bi-directional data bus 58, while control line K22 transfers the output of the C register to the data bus.
In operation, the full adder of Fig. 4 incorporates a carry function expressed as follows:
$C \leftarrow AP \lor PC \lor AC$.
The new state of the carry register C, flip-flop 104, is equivalent to the states of the A and P registers ANDed together, or the states of the P and C registers ANDed together, or the states of the A and C registers ANDed together. This carry function is achieved, notwithstanding the fact that there is no feedback of C register outputs to C register inputs, because the JK flip-flop 104 follows the rule:
$C \leftarrow J\bar{C} \lor \bar{K}C$.
The new state of the C register is the complement of the present state of the C register ANDed with the J input or the complement of the K input ANDed with the present state of the C register. Accordingly, in the circuit of Fig. 4, the flip-flop 104 follows the rule:
$C \leftarrow AP\bar{C} \lor (A \lor P)C$.
The expression immediately above is equivalent to the carry function first given.
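The claimed equivalence can be checked exhaustively. The Python sketch below compares the majority form of the carry with the JK flip-flop form over all eight input combinations; it assumes J = AP and K equal to the complement of (A or P), which is what the second rule above implies, and the function names are illustrative.

```python
from itertools import product

def carry_majority(a: int, p: int, c: int) -> int:
    """C <- AP v PC v AC."""
    return (a & p) | (p & c) | (a & c)

def jk_next(j: int, k: int, q: int) -> int:
    """JK flip-flop rule: Q <- J.not(Q) v not(K).Q."""
    return (j & (1 - q)) | ((1 - k) & q)

def carry_via_jk(a: int, p: int, c: int) -> int:
    """Assumed wiring: J = A.P and K = not(A v P), with no feedback from C."""
    j = a & p
    k = 1 - (a | p)
    return jk_next(j, k, c)

ALL_EQUAL = all(
    carry_majority(a, p, c) == carry_via_jk(a, p, c)
    for a, p, c in product((0, 1), repeat=3)
)
```

Algebraically, AP.not(C) v AC v PC reduces to AP v AC v PC because AP.not(C) v AC = A(P v C), which is why the flip-flop needs no explicit feedback path.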
With respect to the sum expression, the B register, flip-flop 102, receives a sum bit which is an exclusive OR function of the states of the A, P, and C registers according to the expression:
$B \leftarrow A \oplus P \oplus C$.
The gate 98 generates $A \oplus P$ from gates 94 and 96, and gate 100 exclusive-ORs that result with C to achieve the sum expression.
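Combining the sum and carry expressions gives bit-serial addition of whole words, one bit per machine cycle. The following Python sketch is an illustrative model that feeds operands least significant bit first; the function name and the integer packing of operands are assumptions.

```python
from typing import Tuple

def bit_serial_add(x: int, y: int, width: int) -> Tuple[int, int]:
    """Add two width-bit words serially, returning (sum mod 2**width, final carry)."""
    c = 0          # C register cleared before the operation
    result = 0
    for i in range(width):
        a = (x >> i) & 1                    # bit presented by the A register
        p = (y >> i) & 1                    # bit presented by the P register
        b = a ^ p ^ c                       # B <- A xor P xor C
        c = (a & p) | (p & c) | (a & c)     # C <- AP v PC v AC
        result |= b << i
    return result, c
```

Adding 13 and 11 over 8 bits gives 24 with no final carry; adding 200 and 100 over 8 bits wraps to 44 and leaves the carry set.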
The shift register of the arithmetic unit of the processing element 36 has 30 stages. These stages allow for the shift register to have varying lengths so as to accommodate various word sizes, substantially reducing the time for arithmetic operations in serial-by-bit calculations, such as occur in multiplication. Control lines K1-K4 control multiplexers 116-120 so that certain parts of the shift register may be bypassed, causing the length of the shift register to be selectively set at either 2, 6, 10, 14, 18, 22, 26, or 30 stages. Data bits are entered into the shift register through the B register 102, these being the sum bits from the adder. The data bits leave the shift register through the A register 122 and recirculate back through the adder. The A and B registers add two stages of delay to the round-trip path. Accordingly, the round-trip length of an arithmetic process is either 4, 8, 12, 16, 20, 24, 28, or 32 stages, depending upon the states of the control lines K1-K4 as they regulate the multiplexers 116-120.
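The eight selectable lengths can be reproduced by treating the 16-, 8-, and 4-bit segments as individually bypassable and the two flip-flops as always in the path. That partition is an assumption consistent with the stated component list; the mapping of K1-K4 to individual segments is not given in the text and is not modelled here.

```python
from itertools import product

SEGMENTS = (16, 8, 4)   # shift registers 106-110
FIXED_STAGES = 2        # flip-flops 112,114, assumed never bypassed
AB_DELAY = 2            # the A and B registers in the round-trip path

def selectable_lengths():
    """All shift register lengths reachable by bypassing any subset of segments."""
    return sorted({
        FIXED_STAGES + sum(s for s, used in zip(SEGMENTS, sel) if used)
        for sel in product((0, 1), repeat=len(SEGMENTS))
    })

LENGTHS = selectable_lengths()
ROUND_TRIP = [n + AB_DELAY for n in LENGTHS]
```

Three independent bypass choices give exactly eight lengths, and adding the two-stage A/B delay yields round trips that are all multiples of four.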
The shift register outputs data to the A register 122 which has two other inputs selectable via control lines K1,K2, and multiplexer 120. One input is a logic 0. This is used to clear the shift register to an all-zero state. The other input is the bi-directional data bus 58. This may be used to enter data directly into the adder.
The A register 122 is clocked by A-CLK, and the other thirty stages of the shift register are clocked by SR-CLK. Since the last stage of the shift register has a separate clock, data from the bi-directional data bus 58 or logic 0 may be entered into the adder without disturbing data in the shift register.
As discussed above, the P register 124 provides an input to the adder 50 with such input being supplied from one of the orthogonally contiguous processing elements 36, or from the data bus 58. Data is received by the P register 124 from the P register of neighboring processing elements 36 by means of the multiplexer 126 under control of control signals K5,K6. In transferring data to the P register 124 from the multiplexer 126, transfer is made via inverter 128 and AND gates 130,132. The transfer is effectuated under control of the control signal K7 to apply the true and complement of the data to the J and K inputs respectively of the flip-flop 124. The data is latched under control of the clock P-CLK. As noted, the true and complement outputs of the P flip-flop 124 are also adapted to be passed to the P flip-flops of neighboring processing elements 36. The complement is passed off of the chip containing the immediate processing element, but is inverted by a driver at the destination to supply the true state of the P flip-flop. The true state is not inverted and is applied to neighboring processing elements on the same chip.
The logic circuitry 40 is shown in more detail in Fig. 4 to be under control of control lines K8-K11. This logic receives data from the data bus 58 either in the true state or complementary through the inverter 130. The logic network 40, under control of the control signals K8-K11, is then capable of performing all sixteen Boolean logic functions which may be performed between the data from the data bus and that maintained in the P register 124. The result is then stored in the P register 124.
It will be understood that with K7=0, gates 130,132 are disabled. Control lines K8 and K9 then allow either 0, 1, D, or $\bar{D}$ to be gated to the J input of the P register, flip-flop 124. D is the state of the data bus 58. Independently, control lines K10 and K11 allow 0, 1, D, or $\bar{D}$ to be sent to the K input. Following the rule of J-K flip-flop operation, the new state of the P register is defined as follows:
$P \leftarrow J\bar{P} \lor \bar{K}P$.
As can be seen, in selecting all four states of J and all four states of K, all sixteen logic functions of P and D can be obtained.
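That the four choices for J and the four for K really cover every two-input Boolean function can be confirmed by enumeration. In the Python sketch below, the numeric selector codes 0 to 3 are hypothetical stand-ins for the K8/K9 and K10/K11 settings, whose actual encoding the text does not give.

```python
from itertools import product

def select(code: int, d: int) -> int:
    """Selector output: 0, 1, D, or not(D) for codes 0..3 (encoding assumed)."""
    return (0, 1, d, 1 - d)[code]

def p_next(j_code: int, k_code: int, p: int, d: int) -> int:
    """P <- J.not(P) v not(K).P with J and K chosen independently."""
    j = select(j_code, d)
    k = select(k_code, d)
    return (j & (1 - p)) | ((1 - k) & p)

def truth_table(j_code: int, k_code: int):
    """Next P for (P, D) = (0,0), (0,1), (1,0), (1,1)."""
    return tuple(p_next(j_code, k_code, p, d) for p, d in product((0, 1), repeat=2))

TABLES = {truth_table(jc, kc) for jc, kc in product(range(4), repeat=2)}
```

When P is 0 the result equals J, and when P is 1 it equals the complement of K, so the two halves of the truth table are set independently and all 4 x 4 = 16 tables are distinct.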
As discussed above, the output of the P register may be used in the arithmetic calculations of the processing elements 36, or may be passed to the data bus 58. If K21 is at a logic 1, the current state of the P register is enabled to the adder logic. If K14 is a logic 0, the output of the P register is enabled to the data bus. If K15 is at a logic 0, the output of the P register is exclusive-ORed with the complement of the G register 132, and the result is enabled to the data bus. It will be noted that certain transfers to the data bus are achieved via bi-directional transmission gates 134,136, respectively enabled by control signals K14 and K15. These types of gates are well known to those skilled in the art.
The mask register G, designated by the numeral 132, comprises a simple D-type flip-flop. The G register reads the state of the bi-directional data bus on the positive transition of G-CLK. Control line K19 controls the masking of the arithmetic sub-unit clocks (A-CLK, SR-CLK, and BC-CLK). When K19 equals 1, these clocks will only be sent to the arithmetic sub-units of those processing elements where G=1. The arithmetic sub-units of those processing elements where G=0 will not be clocked and no register in those sub-units will change state. When K19 = 0, the arithmetic sub-units of all processing elements will participate in the operation.
Control line K20 controls the masking of the logic and routing sub-unit. When K20 = 1, the clock P-CLK is only sent to the logic and routing sub-units of those processing elements where G=1. The logic and routing sub-units of those processing elements where G=0 will not be clocked and their P registers will not change state.
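The masking behaviour amounts to clock gating per processing element. A minimal Python sketch, assuming one register bit per element and a single mask-enable flag standing in for K19 or K20, might look like this; the function and parameter names are hypothetical.

```python
from typing import List, Sequence

def masked_update(state: Sequence[int], proposed: Sequence[int],
                  g: Sequence[int], mask_enable: int) -> List[int]:
    """Apply a broadcast operation to every element whose clock is not masked.

    state       current register bit in each processing element
    proposed    value each register would take if clocked
    g           mask register G of each processing element
    mask_enable 1 = only elements with G=1 are clocked; 0 = all are clocked
    """
    clocked = [(mask_enable == 0) or (gi == 1) for gi in g]
    return [new if clk else old
            for old, new, clk in zip(state, proposed, clocked)]
```

With the mask enabled, `masked_update([0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 1, 0], 1)` updates only the first and third elements; with it disabled, all four change.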
Translation operations are masked when control line K20 = 1. In those processing elements where G=1, the P register is clocked by P-CLK and receives the state of its neighbor. In those where G=0, the P register is not clocked and does not change state. Regardless of whether G=0 or G=1, each processing element sends the state of its P register to its neighbors.
Brief attention is now given to the equivalence function provided for by the exclusive OR gate 138, which provides a logic 1 output when the inputs thereof from the P and G registers are of common logic states. In other words, the gate 138 provides the output function of $P \oplus \bar{G}$. This result is then supplied to the data bus.
The S register comprises a D-type flip-flop 140 with the input thereto under control of the multiplexer 142. The output from the S register is transmitted to the data bus 58 by means of the bi-directional transmission gate 144. The flip-flop 140 reads the state of its input on the transition of the clock pulse S-CLK-IN. When control line K17 is at a logic 0, the multiplexer 142 receives the state of the S register of the processing element immediately to the west. In such case, each S-CLK-IN pulse will shift the data in the S registers one place to the east. To store the state of the S register 140 in local memory, control line K18 is set to a logic 0 to enable the bi-directional transmission gate 144 to pass the complementary output of the S register 140 through the inverter 146 and to the data bus 58. The S register 140 may be loaded with a data bit from the local memory 56 by setting K17 to a logic 1, and thus enabling the data bus 58 to the input of the flip-flop 140.
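The eastward shift of the S registers along one row can be pictured with a short Python model. It assumes index 0 is the west-most element of the row and that every element shifts on the same S-CLK-IN pulse; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

def shift_east(s_row: Sequence[int], west_in: int) -> Tuple[List[int], int]:
    """One S-CLK-IN pulse with the west neighbor selected as each register's input.

    Returns the new row contents and the bit that leaves on the east edge.
    """
    east_out = s_row[-1]
    return [west_in] + list(s_row[:-1]), east_out
```

Repeating the call once per column moves a full row of input data into place while the arithmetic and logic sub-units continue to operate, since only the S registers are clocked.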
As mentioned hereinabove, a particular attribute of the massively-parallel processor 10 is that the array unit 12 is capable of bypassing a set of columns of processing elements 36 should an error or fault appear in that set. As discussed earlier herein, each chip has two processing elements 36 in each of four columns of the array unit matrix. The instant invention disables columns of chips and, accordingly, sets of columns of processing elements. Fundamentally, the columns are dropped out of operation by merely jumping the set of columns by interconnecting the inputs and outputs of the east-most and west-most processing elements on the chips establishing the set of columns. The method of inhibiting the outputs of the sum-or tree and the parity tree of the chips has previously been described. However, it is also necessary to bypass the outputs of the P and S registers which intercommunicate between the east and west neighboring chips.
As shown in Fig. 5A, a chip includes eight processing elements, PE0-PE7, arranged as earlier described. The S register of each processing element may receive data from the S register of the processing element immediately to the west and may transfer data to the S register of the processing element immediately to the east. When enabled, the chip allows data to flow from S-IN0, through the S registers of PE0-PE3, and then out S-OUT3 to the neighboring chip. Similar data flow occurs from S-IN7 to S-OUT4. When it is desired to disable a column of chips, the output gates of the column of chips which pass the S register data to the neighboring east chip are disabled.
That is, control signal K25 may inhibit output gates 148,150 while concurrently enabling the bypass gates 152,154. This interconnects S-IN0 with S-OUT3 and S-IN7 with S-OUT4, for all chips in the column.
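The effect of the bypass gates on the S-register data path can be sketched for one row of chips. This is a simplified model under stated assumptions: only the PE0-PE3 row of each chip is represented, and the function name, the `enabled` flags standing in for the state of K25, and the `s_in` parameter are hypothetical.

```python
def shift_row_of_chips(chips, enabled, s_in=0):
    """One S-CLK-IN pulse over a west-to-east row of chips.

    chips   : list of 4-bit lists, the S registers of PE0-PE3 on each chip
    enabled : list of booleans, False where the bypass gates are selected
    s_in    : bit entering the west-most chip (hypothetical parameter)
    """
    new_chips = []
    pin_in = s_in  # value on this chip's S-IN0 pin before the clock edge
    for regs, on in zip(chips, enabled):
        # The internal registers keep shifting even on a bypassed chip.
        new_chips.append([pin_in] + regs[:-1])
        # An enabled chip drives S-OUT3 from PE3; a bypassed chip connects
        # S-IN0 straight through to S-OUT3.
        pin_in = regs[3] if on else pin_in
    return new_chips
```

With the middle chip of three bypassed, a bit leaving PE3 of the west chip arrives at PE0 of the east chip on the same clock pulse, as if the bypassed column were absent.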
In Fig. 5B it can be seen that communications between the P registers of east-west neighboring chips may also be bypassed. P register data is received from the chip to the west via inverters 156,158 and is transmitted thereto by gates 160,162.
Similarly, P register data is received from the chip to the east via inverters 164,166 and is transmitted thereto via gates 168,170. If the chip is enabled and P register data is to be routed to the west, then control line K6 is set to a logic 1 and K26 to a logic 0 so gates 160,162 are enabled and gates 168,170 are disabled. When routing to the east, K6 is set to zero and K26 to one. To disable the chip, K6 and K26 are both set to a logic 0 to disable all P register east-west outputs from the chip and K25 is set to allow the bi-directional bypass gates 172,174 to interconnect WEST-0 with EAST-3 and WEST-7 with EAST-4. This connects the P registers of PE3 of the west chip with PE0 of the east chip and PE4 of the west chip with PE7 of the east chip.
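The control-line combinations described above can be summarized in a small decoder. This is a sketch only: the function name and the returned labels are hypothetical, and because the text does not state the logic level of K25 that enables the bypass gates, a boolean `bypass` argument is assumed in its place. The combination K6 = K26 = 1 is not described in the text and is rejected here.

```python
def p_east_west_mode(k6, k26, bypass):
    """Classify the east-west P-register gating of one chip.

    k6, k26 : logic levels (0 or 1) of the control lines named in the text
    bypass  : True when K25 enables bypass gates 172,174 (assumed boolean,
              since the K25 polarity is not given)
    """
    if k6 == 1 and k26 == 1:
        raise ValueError("K6 = K26 = 1 is not described in the text")
    if bypass:
        if k6 == 1 or k26 == 1:
            raise ValueError("a bypassed chip must have K6 = K26 = 0")
        return "bypassed"            # WEST-0 tied to EAST-3, WEST-7 to EAST-4
    if k6 == 1:
        return "route west"          # gates 160,162 enabled, 168,170 disabled
    if k26 == 1:
        return "route east"          # gates 168,170 enabled, 160,162 disabled
    return "no east-west output"     # both sets of output gates disabled
```

Keeping the bypass condition separate from K6 and K26 mirrors the text, in which the output gates are first disabled and the bypass gates are then enabled.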
By disabling the parity and sum-or trees and by jumping the inputs and outputs of bordering P and S registers of the chips in a column, an entire column of chips may be removed from service if a fault is detected. It will be understood that while the processing elements of the disabled chips do not cease functioning when disabled, the outputs thereof are simply removed from affecting the system as a whole.
Further, it will be appreciated that, by removing columns, no action need be taken with respect to intercommunication between north and south neighbors.
Finally, by removing entire chips rather than columns of processing elements, the amount of bypass gating is greatly reduced.
In the preferred embodiment of the invention, the array unit 12 has 128 rows and 132 columns of processing elements 36. In other words, there are 64 rows and 33 columns of chips. Accordingly, there is an extra column of chips beyond those necessary for achieving the desired square array. This allows for the maintenance of a square array even when a faulty chip is found and a column of chips is to be removed from service.
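The array dimensions quoted above follow from the chip layout of two processing elements in each of four columns. A short check, in which the function and constant names are hypothetical:

```python
PE_ROWS_PER_CHIP = 2  # each chip holds two processing elements per column
PE_COLS_PER_CHIP = 4  # each chip spans four columns of the array

def usable_array(chip_rows, chip_cols, bypassed_chip_cols=0):
    """Rows and columns of processing elements left in service."""
    rows = chip_rows * PE_ROWS_PER_CHIP
    cols = (chip_cols - bypassed_chip_cols) * PE_COLS_PER_CHIP
    return rows, cols
```

With 64 rows and 33 columns of chips, `usable_array(64, 33)` gives 128 by 132 processing elements, and bypassing one column of chips leaves the square 128 by 128 array.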
Thus it can be seen that the objects of the invention have been satisfied by the structure presented hereinabove. A massively-parallel processor, having a unique array unit of a large plurality of interconnected and intercommunicating processing elements, achieves rapid parallel processing. A variable length shift register allows serial-by-bit arithmetic computations in a rapid fashion, while reducing system cost. Each processing element is capable of performing all requisite mathematical computations and logic functions and is further capable of intercommunicating not only with neighboring processing elements, but also with its own uniquely associated random access memory. Provisions are made for removing an entire column of processing chips wherein at least one processing element has been found to be faulty. All of this structure leads to a highly reliable data processor which is capable of handling large magnitudes of data in rapid fashion.
While in accordance with the patent statutes, only the best mode and preferred embodiment of the invention have been presented and described in detail, it is to be understood that the invention is not limited thereto or thereby. Consequently, for an appreciation of the true scope and breadth of the invention, reference should be had to the following claims.

Claims (6)

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. An array of a plurality of processing elements interconnected with each other and wherein each processing element comprises:
an adder;
first and second data registers connected with and supplying data bits to said adder;
a carry register connected to said adder and receiving therefrom data bits resulting from arithmetic operations and which functions according to the rule:
C ← AP ∨ PC ∨ AC where A is the state of said first register, P is the state of said second register, and C is the state of said carry register;
a memory; and
a data bus interconnecting said first, second, and carry registers and said memory for the transfer of data thereamong.
2. The array as recited in claim 1 wherein each processing element further includes a shift register of selectably variable length interconnected between said first data register and said adder.
3. The array as recited in claim 1 wherein said carry register comprises a J-K flip-flop.
4. The array as recited in claim 3 wherein each processing element further includes a sum register interconnected between said shift register and said adder, said sum register functioning according to the rule B ← A ⊕ P ⊕ C, where B, A, P, and C are respectively the states of said sum, first, second, and carry registers.
5. The array as recited in claim 1 wherein each said processing element includes logic means interconnected with said second register for performing the sixteen logic functions possible between the data of said second register and a data bit from said data bus.
6. The array as recited in claim 1 wherein said second register of each said processing element is communicatingly interconnected with said second register of orthogonally neighboring processing elements within the array.
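As an illustration of the carry and sum rules recited in claims 1 and 4, the following Python sketch performs serial-by-bit addition with a full adder. It is a minimal example under stated assumptions, not the claimed hardware: the function names, the use of Python integers to hold the operands, and the least-significant-bit-first order are choices made for the example.

```python
def full_adder_step(a, p, c):
    """Return (b, c_next) for one bit position.

    b      follows the sum rule   B <- A xor P xor C
    c_next follows the carry rule C <- AP v PC v AC
    """
    b = a ^ p ^ c
    c_next = (a & p) | (p & c) | (a & c)
    return b, c_next


def bit_serial_add(x, y, width):
    """Add two non-negative integers one bit per step, least significant first.

    Returns the width-bit sum and the final state of the carry.
    """
    carry = 0
    total = 0
    for i in range(width):
        a = (x >> i) & 1          # bit supplied by the first register
        p = (y >> i) & 1          # bit supplied by the second register
        b, carry = full_adder_step(a, p, carry)
        total |= b << i           # collect the sum bits in order
    return total, carry
```

For instance, `bit_serial_add(5, 3, 4)` yields a sum of 8 with no carry out, while `bit_serial_add(15, 1, 4)` yields a 4-bit sum of 0 with the carry left at 1.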
CA000425731A 1979-12-31 1983-04-12 Processing element for parallel array processors Expired CA1165005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA000425731A CA1165005A (en) 1979-12-31 1983-04-12 Processing element for parallel array processors

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US108,883 1979-12-31
US06/108,883 US4314349A (en) 1979-12-31 1979-12-31 Processing element for parallel array processors
CA000361873A CA1154168A (en) 1979-12-31 1980-09-26 Processing element for parallel array processors
CA000425731A CA1165005A (en) 1979-12-31 1983-04-12 Processing element for parallel array processors

Publications (1)

Publication Number Publication Date
CA1165005A true CA1165005A (en) 1984-04-03

Family

ID=27166840

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000425731A Expired CA1165005A (en) 1979-12-31 1983-04-12 Processing element for parallel array processors

Country Status (1)

Country Link
CA (1) CA1165005A (en)


Legal Events

Date Code Title Description
MKEX Expiry