EP0485594A1 - Mechanism providing concurrent computational/communications in simd architecture - Google Patents

Mechanism providing concurrent computational/communications in simd architecture

Info

Publication number
EP0485594A1
Authority
EP
European Patent Office
Prior art keywords
output
processor
architecture
data
processor node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP91921011A
Other languages
German (de)
French (fr)
Other versions
EP0485594A4 (en)
Inventor
Daniel W. Hammerstrom
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Adaptive Solutions Inc
Original Assignee
Adaptive Solutions Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Adaptive Solutions Inc filed Critical Adaptive Solutions Inc
Publication of EP0485594A1 publication Critical patent/EP0485594A1/en
Publication of EP0485594A4 publication Critical patent/EP0485594A4/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/17Interprocessor communication using an input/output type connection, e.g. channel, I/O port
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8015One dimensional arrays, e.g. rings, linear arrays, buses


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multi Processors (AREA)

Abstract

A concurrent computational/communications architecture is intended for use in a single instruction stream, multiple data stream (SIMD) processor node, which node includes an input bus (20), an input unit (54), manipulation units (58, 60, 62, 64, 66) and an output bus (22). The processor nodes (12, 14, 16, 18) include an output unit (68) which receives data from the input unit (54) and the various manipulation units. The processor nodes (12, 14, 16, 18) store data in, and transmit it on the output bus (22) from, the output unit (68) at a selected time. A processor node control unit (56) controls the exchange of data between the processor nodes (12, 14, 16, 18), their associated output buffers (38, 40, 42, 44), and the output bus (22).

Description

MECHANISM PROVIDING CONCURRENT COMPUTATIONAL/COMMUNICATIONS IN SIMD ARCHITECTURE
Technical Field
The instant invention relates to a computer processor architecture, and specifically to an architecture which provides concurrent computational/communications actions in a SIMD architecture. The architecture includes an output processor node which allows simultaneous data processing and communications in a single processor node.
Background Art
Neural networks are a form of architecture which enables a computer to closely approximate human thought processes. One form of neural network architecture enables single instruction stream, multiple data stream (SIMD) operations, which allow a single command to direct a number of processors, and hence data sets, simultaneously. There are several important practical problems that cannot be solved using existing, conventional algorithms executed by traditional, conventional computers. These problems are often incompletely specified and are characterized by many weak constraints requiring large search spaces.
The processing of primary cognitive information by computers, such as computer speech recognition, computer vision, and robotic control, falls into this category. Traditional computational models bog down to the point of failure under the computational load if they are tasked to solve these types of problems. Yet animals perform these tasks using neurons that are millions of times slower than transistors. A neuron performs a weighted sum of inputs, which may be described as Σ W_ij O_j, where W_ij is a value drawn from "memory" and O_j is an input value. A SIMD multiply/accumulate function performs this operation.
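By way of illustration only (the patent supplies no such listing), the weighted sum can be sketched in C; the function name weighted_sum and the array arguments are assumptions:

    /* Sketch of the per-neuron weighted sum: acc = Σ W_ij * O_j. */
    static long weighted_sum(const int w_row[], const int o[], int n)
    {
        long acc = 0;                      /* multiply/accumulate register */
        for (int j = 0; j < n; j++)
            acc += (long)w_row[j] * o[j];  /* one multiply/accumulate per input */
        return acc;
    }

Each SIMD node evaluates one such sum, with every node executing the same multiply/accumulate instruction on its own weights and the shared inputs.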
Feldman's 100-step rule, which is an argument for massively parallel computing, states that a "human" cognitive process taking 500 msec must be accomplished in about 100 sequential steps, given a 5 msec neuron switching time (500 msec / 5 msec = 100 steps). If the "switching" time is slow, a fast system may nevertheless be constructed with a large number of "switches". This implies that there are two vastly different computational models at work. It also suggests that in order to build computers that will do what nervous systems do, the computers should be structured to emulate animal nervous systems. A SIMD system is a computing system which is designed to emulate a massively parallel neural network.
A nervous system, and a neurocomputational computer, is characterized by continuous, non-symbolic, and massively parallel structure that is fault-tolerant of input noise and hardware failure. Representations, i.e., the inputs, are distributed among groups of computing elements, which independently reach a result or conclusion, and which then generalize and interpolate information to reach a final output conclusion. Put another way, connectionist/neural networks search for "good" solutions using massively parallel computations of many small computing elements. The model is one of parallel hypothesis generation and relaxation to the dominant, or "most-likely", hypothesis. The search speed is more or less independent of the size of the search space. Learning is a process of incrementally changing the connection (synaptic) strengths, as opposed to allocating data structures. "Programming" in such a neural network is by example.
In the case where total, or near total, connectivity (30% or higher) of communication node units is constructed, a great deal of time is spent waiting for the individual processors to communicate with one another after the data set has been acted upon. Known processing units do not provide a mechanism for allowing computation to occur on a subsequent data set until all processor nodes have transmitted their previously acted upon data.
Disclosure of the Invention
An object of the invention is to provide a concurrent use architecture which will provide concurrent computational and communications functions in a SIMD structure.
Another object of the invention is to provide an output processor architecture which retains processed data while like output processors are communicating processed data on a transmission network.
A further object of the invention is to provide an output processor which stores processed data while the processor acts upon a subsequent data set. Still another object of the invention is to provide a mechanism in an output processor for arbitrating communication protocol on a communications network.
The concurrent use architecture of the invention is intended for use in a single instruction stream, multiple data stream processor node which includes an input bus, an input unit, manipulation units and an output bus.
The architecture includes an output processor having an output buffer unit which receives data from the input unit and various manipulation units. The processor node stores and transmits data from the output buffer unit at a selected time over the output bus. A control unit is provided for directing the
SIMD computation and controlling the exchange of data between the processor node, an associated output processor node, and the output buffer unit.
The output processor has an internal controller to direct storage and transmission of data on the output bus. These and other objects and advantages of the invention will become more fully apparent as the description which follows is read in conjunction with the drawings.
Brief Description of the Drawings
Fig. 1 is a schematic representation of an array of processor nodes constructed according to the invention.
Fig. 2 is a schematic diagram of a broadcast communication pattern of communication nodes contained within the processor nodes of Fig. 1.
Fig. 3 is a block diagram of a single processor node of the invention.
Best Mode for Carrying Out the Invention
Turning now to the drawings, and initially to Fig. 1, an array of single instruction stream, multiple data stream (SIMD) processor nodes is shown generally at 10. Array 10 includes processor nodes 12, 14, 16 and 18, and may contain as many more processor nodes as are desired for a particular application. Fig. 1 contains only four such processor nodes for the sake of brevity and to allow an explanation of the interaction between multiple processor nodes. Each processor node is connected to an input bus 20 and an output bus 22. Buses 20 and 22 are depicted as single entity structures in Fig. 1; however, in some circumstances, the input and/or output bus may be multiple bus structures.
A single controller 24 is connected to a control bus 26, which in turn is connected to each of the processor nodes. In some instances, output bus 22 may be connected directly to input bus 20 by means of a connection 28, or, output bus 22 may be connected to the input bus of another array of processor nodes, while the output bus of another array of processor nodes may be connected to input bus 20. As depicted in Fig. 1, each processor node (PN) includes a pair, or more, of communication nodes, such as the connection node (CN) depicted at 30-37 in processor nodes 12-18. A CN is a state associated with an emulated node in a neural network located in a PN. Each PN may have several CNs located therein. Following conventional computer architecture nomenclature, processor node 12 contains the state information for connection node 0 (CN0) and the state information for connection node 4 (CN4), processor node 14 contains the state information for connection node 1 (CN1) and the state information for connection node 5 (CN5), processor node 16 contains the state information for connection node 2 (CN2) and the state information for connection node 6 (CN6), and processor node 18 contains the state information for connection node 3 (CN3) and the state information for connection node 7 (CN7).
Referring momentarily to Fig. 2, the broadcast patterns that are set up between connection nodes 0-7 are depicted. The CNs are arranged in "layers", with CN0 - CN3 comprising one layer, while CN4 - CN7 comprise a second layer. As previously noted, there may be more than two layers on connection nodes in any one processor node or in any array of processor nodes. The connection nodes operate in what is referred to as a broadcast hierarchy, wherein each of connection nodes 0-3 broadcast to each of connection nodes 4-7. An illustrative technique for arranging such a broadcast hierarchy is disclosed in U.S. Patent No. 4,796,199, NEURAL-MODEL
INFORMATION-HANDLING ARCHITECTURE AND METHOD, to
Hammerstrom et al., January 3, 1989, which is incorporated herein by reference. Conceptually, the available processor nodes may be thought of as a "layer" of processors, each executing its function (multiply, accumulate, and increment weight index) for each input, on each clock, wherein one processor node broadcasts its output to all other processor nodes. By using the output processor node arrangement, it is possible to provide n² connections in n clocks using only a two-layer arrangement. Known, conventional SIMD structures may accomplish n² connections in n clocks, but require a three-layer configuration, or 50% more structure.
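The n² connections in n clocks claim can be illustrated with a small simulation: on each of n clocks, one node drives the bus while all n nodes multiply/accumulate the broadcast value, so n clocks service n x n connections. The sketch below is a hedged illustration; N, broadcast_layer and the array layout are assumptions, not the patent's design:

    #define N 4
    /* One broadcast layer: N clocks, N*N connections serviced. */
    void broadcast_layer(const long w[N][N], const long out[N], long acc[N])
    {
        for (int clk = 0; clk < N; clk++) {   /* clock clk: PN clk broadcasts */
            long bus = out[clk];              /* value driven onto output bus */
            for (int pn = 0; pn < N; pn++)    /* every PN reads it in parallel */
                acc[pn] += w[pn][clk] * bus;  /* weight fetch + MAC */
        }
    }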
A normal sequence of events in the SIMD array depicted at 10 begins with a specific piece of data being transmitted to each processor node. The data is acted upon by an instruction which is transmitted over input bus 20. Each processor node performs whatever operation is required on the data, and then attempts to transmit the information on output bus 22. Obviously, not every processor node can transmit on the output bus simultaneously, and under normal conditions, the processor nodes have to wait for a number of clocks, or cycles, until each processor node in the array has transmitted its information on the output bus.
During this wait time, the processor node is essentially operating at idle, as it cannot perform any additional functions until it has transmitted. To resolve this problem, output processors, or output buffers, such as those depicted at 38, 40, 42 and 44 are included in the architecture of each processor node. The output buffers receive the information from an associated connection node and hold the information or data until such time as each processor node receives clearance to transmit on output bus 22. Because the data is held in the output buffer, the remainder of the processor node can perform other functions while the output processor node is waiting to transmit.
In order to control arbitration between the various processor nodes, so that the nodes will transmit on output bus 22 at the proper time, each processor node contains a flip flop, such as flip flops 46, 48, 50 and 52. Flip flops are also referred to herein as arbitration signal generators.
Referring now to Fig. 3, the remaining components of a single processor node, such as processor node 12, are depicted in greater detail. Node 12 includes an input unit 54 which is connected to input bus 20 and output bus 22. Again, only single input and output buses are depicted for the sake of simplicity. Where multiple input and/or output buses are provided, input unit 54, and the other units which will be described herein, have connections to each input and output bus. A processor node controller 56 is provided to establish operational parameters for each processor node. An addition unit 58 provides addition operations, receives input from input unit 54 and a multiplier 60, and is connected to the input and output buses.
A register unit 62 contains an array of registers, which in the preferred embodiment of the architecture includes an array of 32 16-bit registers. A number of other arrangements may be utilized. A weight address generation unit 64 computes the next address for a weight memory unit 66. In the preferred embodiment, the memory address may be set in one of two ways: (1) by a direct write to a weight address register, or (2) by asserting a command which causes the contents of a weight offset register to be added to the current contents of the memory address register, thereby producing a new address. Node controller 56, addition unit 58, multiplier unit 60, register unit 62 and weight address generation unit 64 comprise what is referred to herein as a manipulation unit.
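A minimal sketch of the two addressing modes, assuming register and function names that the patent does not itself give:

    /* Sketch of weight address generation unit 64; names are illustrative. */
    struct wagu {
        unsigned waddr;    /* weight memory address register */
        unsigned woffset;  /* weight offset register */
    };
    /* mode (1): direct write to the weight address register */
    void wagu_write(struct wagu *u, unsigned addr) { u->waddr = addr; }
    /* mode (2): add the weight offset register to the current address */
    void wagu_step(struct wagu *u) { u->waddr += u->woffset; }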
An output unit 68 is provided to store data prior to the data being transmitted on output bus 22. Output unit 68 includes output processor node 38, which receives data from the remainder of the output unit prior to the data being transmitted. Data is transmitted to output bus 22 by means of one or more connection nodes, such as connection node 30 or 34, which are part of the output unit. Output unit 68 includes an output buffer register which initially receives processed data. Once this data is loaded into the register, the output buffer unit becomes "armed". Once armed, the output buffer operates independently of, but synchronously with, the rest of the processing node.
An arbitration process is provided between the output buffers of the different processor nodes in order to determine which output buffer, or output processor node, will transmit first over output bus 22, since only one PN can use the output bus at any one time. Once the processor node has transmitted, a handshake arrangement, extending from a flip flop, such as flip flop 46 in processor node 12, and indicated by arrow 70, signals the next processor node to transmit. Although the arrangement in Fig. 1 indicates that the transmission occurs from immediately adjacent nodes, this is not necessarily representative of what may happen in the actual processor node array. It may be that some other order of transmission is determined by the arbitration process.
Arbitration and transmission occur only when a transmit signal is asserted by controller 24, to allow synchronization of the transmission with other operations being conducted in the array. Several modes of arbitration/data transfer are provided in the architecture: sequential, global and direct. The arbitration mode is selected by controller 24. Controller 24 and flip flops 46, 48, 50 and 52 comprise what is referred to herein as arbitration means, which is operable to determine at which point in processor operation a signal will be transmitted from an output processor node. Sequential mode, as the term indicates, requires a control signal over control bus 26 to travel from one processor node to the next.
Global arbitration uses a signal that travels on the control bus which enables transmission from all processor nodes, but which operates the processor nodes in a daisy-chain, allowing certain processor nodes to transmit while others may pass, i.e., thereby not transmitting any data for a particular cycle, or clock.
Direct arbitration is used in a situation where data is written directly into the output buffer and immediately transmitted over the output bus on the next transmit cycle. In a SIMD structure, a single instruction passes over control bus
26 to all of the processor nodes. The instruction is carried out, simultaneously, on the values which are present in the individual processor nodes, which have been input over input bus 20. Each output buffer has its own internal controller, depicted in Fig. 1 at 38a, 40a, 42a and 44a, and described in the code and structure which follow herein; this internal controller is how each output processor operates separately from SIMD controller 24.
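The three arbitration modes described above might be summarized as follows; the enum, the flags and the gating function are assumptions made for illustration, not the chip's actual control encoding:

    enum arb_mode { ARB_SEQUENTIAL, ARB_GLOBAL, ARB_DIRECT };

    /* Returns nonzero if this PN may drive the output bus this cycle. */
    int may_transmit(enum arb_mode mode, int xmit, int obarmd,
                     int left_done, int pass)
    {
        if (!xmit || !obarmd)            /* controller 24 gates every mode */
            return 0;
        switch (mode) {
        case ARB_SEQUENTIAL:             /* wait for the left neighbor's signal */
            return left_done;
        case ARB_GLOBAL:                 /* daisy-chain; a PN may pass its turn */
            return left_done && !pass;
        case ARB_DIRECT:                 /* transmit on the next transmit cycle */
            return 1;
        }
        return 0;
    }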
In conventional architectures, the output is transmitted over output bus 22 in a predetermined sequencing of transmissions from the individual processor nodes. This, of course, requires that the processor node wait until the other processor nodes in the array are finished transmitting before a new instruction set or new data can be received in the processor node. The provision of output buffers in each PN provides a location where the processed data may be stored while the processor node waits its turn at the output bus. Additional operations may be occurring in the processor nodes, which may have new data loaded, or which may have existing data acted upon by a new instruction set.
Connection 28 may be enabled if it is desired that the output data from a processor node become the input data for a processor node, or other processor nodes. Such enablement is accomplished by conventional micro-circuitry mechanisms, and is depicted schematically in Fig. 2.
The actual operation of an individual processor in array 10 may be described by the following instruction set, which, while presented in software form, would be incorporated into the physical design of the integrated circuit containing the output processor nodes of the invention.
The following code is a simplification of the code that describes the actual CMOS implementation of the PN structure in a neurocomputer chip. The code shown below is in the C programming language embellished by certain predefined macros. The code is used as a register transfer level description language in the actual implementation of the circuitry described here. Bolded text indicates a signal, hardware or firmware component, or a phase or clock cycle.
The phi and ph2 variables simulate the two phases in the two-phase, non-overlapping clock used to implement dynamic MOS devices.
The post-fix "_D" associated with some signal names means a delayed version of the signal, "_B" means a bus (more than one signal line), and "_1" means a dynamic signal that is only valid during phi. These can be combined arbitrarily. Output unit (obUNIT) 68 contains the output buffer. The PN Output Buffer Interface is used for output, allows the output to run independently from the input, and is used during various recursive and feedback operations.

    if (reset) {
        outst = 0;
        seqrght = 0;
        obarmd = 0;
        seqgo_lD = 0;
    }

This step initializes the variables which are used during the processing. During output buffer arbitration for the output bus, the arbitration signal from a PN's left-most neighbor will set the sequential arbitration ok signal, seqgo_lD:

    if (phi) seqgo_lD = leftff;

leftff is the signal from flip flop 46, also referred to herein as arbitration signal generator means, in PN 12, on the left, and indicates that PN 12 has transmitted and it is now PN 14's turn.
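The predefined macros and clock variables are not reproduced in this excerpt. A minimal sketch of definitions that would make the fragments compile, offered purely as an assumption about the macro package rather than the patent's actual definitions:

    /* Assumed definitions; the patent's actual macro package is not shown. */
    #define ANDb &&              /* boolean AND between signal tests */
    #define AND  &               /* bitwise AND used for bus masking */
    static int phi, ph2;         /* two-phase, non-overlapping clock signals */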
When the output buffer is written, the obarmd flip flop is set. obarmd indicates that the output buffer is armed and ready to transmit. This transmission operates independently of the remaining computation inside the PN and forms the essence of the output buffer architecture of the invention.
If transmission is in two bytes, the low byte is sent first. This occurs if the transmit control is asserted and if the PN's output buffer is armed. The outst flip flop is used to indicate that this is the second byte of a two-byte transmission. All of the arbitration decisions discussed at this point operate in the sequential mode, where a PN signals its neighbor to the right when it has completed its transmission. xmit is the SIMD command that tells the output buffer of a particular PN to transmit (if it has received an arbitration signal). The seqarb flip flop indicates that this arbitration mode is enabled, and outmd2 indicates that this is a two-byte transmission:
    if ((phi) ANDb (obarmd) ANDb (xmit)) {
        if (seqarb) {
            if (outst==1) {
                outbus_B2 = (outbuf_B AND 0xFF00L) >> 8;
                outst=0;
                obarmd=0;
                seqrght=1;
            }
            else if (seqgo_lD) {
                if (outmd2) {
                    if (outst==0) {
                        outbus_B2 = outbuf_B AND 0xFFL;
                        outst=1;
                    }
                }
                else {
                    outbus_B2 = outbuf_B AND 0xFFL;
                    obarmd=0;
                    seqrght=1;
                }
            }
        }
    }

The foregoing sequence executes the outbuf state machine for sequential arbitration; seqrght signals the next PN to transmit on the next clock.
After all of the PNs have transmitted over output bus 22, the arbitration protocol is reset and the system is queried to determine if there is more output to follow:
    if (reset) obarmd=0;

    if ((ph2) ANDb (vcval) ANDb (rgctl_B2==F_RGABUS) ANDb (r_B==F_OUTBUF)) {
        obarmd=1;
    }
    if ((ph2) ANDb (vcval) ANDb (rgctl_B2==F_RGBBUS) ANDb (r_B==F_OUTBUF)) {
        obarmd=1;
    }
If an output buffer contains information to be transmitted, obarmd is set and the cycle repeats. vcval indicates a valid output buffer control signal, rgctl_B2 indicates that a write to the output buffer is now occurring, and r_B indicates that the output buffer, F_OUTBUF, is being addressed. Writing to the register OUTBUF sets obarmd, which turns on (enables) the PN output buffer.
Operation of the architecture of array 10 may be summarized as follows:
DURING CLOCK ZERO:
Initially, CN0, CN1, CN2 and CN3 have values which are located in register unit 62 of processor nodes 12, 14, 16 and 18, respectively. The values are written to the respective output buffer in each PN over an appropriate connection.
DURING CLOCK ONE:
Output buffer 38 in processor node 12 transmits the output of CN0 onto output bus 22. This value is read by processor nodes 12, 14, 16 and 18 over connection 28 and input bus 20. Each processor node fetches a weight from weight memory 66 and multiplies, for instance, the output of CN0 times the various weights, such as w40, w50, w60 and w70. Output buffer 38 and processor node 12 then transmit an arbitration signal, through flip flop 46 and connection 70, to output buffer 40 in processor node 14, the next-in-time processor node, indicating that it may now transmit on output bus 22.
DURING CLOCK TWO:
Output buffer 40 in processor node 14 transmits the output of CN1 onto output bus 22. This value is read by processor nodes 12, 14, 16 and 18 over connection 28 and input bus 20. Each processor node fetches a weight from weight memory 66 and multiplies, for instance, the output of CN1 times the various weights, such as w41, w51, w61 and w71. Output buffer 40 and processor node 14 then signal, by flip flop 48 and connection 70, output buffer 42 in processor node 16 that it may now transmit. Similar action occurs during successive clock cycles until the data has been processed.
Another way of looking at the processor node function is to consider that each processor node has a multiplier 60, an adder 58 and two lookup tables, weight address generation unit 64 and weight memory 66. Each node receives an input and does a multiply-accumulate on each clock. After the accumulation loop, each processor node moves its output into its output buffer and waits for its turn to broadcast. The steps may be represented as
    O_i = f( Σ (j=1 to n) W_ij O_j )
The output, O_i, is therefore equal to the summation of the values drawn from weight memory 66, W_ij, times the input value, O_j, wherein the entire function is stored in the weight address generator 64.
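Gathering the clock-by-clock description into one place, an end-to-end accumulation pass might be sketched as below; the output function f, the name layer_pass and the array layout are assumptions (the patent leaves f unspecified):

    #define NODES 4
    static long f(long x) { return x > 0 ? x : 0; }   /* placeholder output fn */

    void layer_pass(const long w[NODES][NODES], const long o[NODES],
                    long outbuf[NODES])
    {
        long acc[NODES] = {0};
        for (int j = 0; j < NODES; j++)       /* clock j: PN j broadcasts O_j */
            for (int i = 0; i < NODES; i++)
                acc[i] += w[i][j] * o[j];     /* W_ij fetched from weight memory 66 */
        for (int i = 0; i < NODES; i++)
            outbuf[i] = f(acc[i]);            /* result armed in the output buffer */
    }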
Although a preferred embodiment of the invention has been disclosed herein, it should be appreciated that variations and modifications may be made thereto without departing from the scope of the invention as defined in the appended claims.
Industrial Application
Processors constructed according to the invention are useful in neural network systems which may be used to simulate human brain functions in analysis and decision-making applications.

Claims

WHAT I CLAIM IS:
1. In a single instruction stream, multiple data stream (SIMD) processor node (12, 14, 16, 18), having an input bus (20), an input unit (54), manipulation units (56 - 66) and an output bus (22), an output processor architecture comprising: an output unit (68) which receives data from the input unit (54) and manipulation units (56 - 66); an output buffer (38) located on select processor nodes (12) for storing and transmitting data from the output unit (68) at a selected time; and a control unit (56) for controlling the exchange of data between the processor node (12), an associated output buffer (38) and the output unit (68).
2. The architecture of claim 1 which further includes arbitration means (46) for determining at what point in processor operation a signal will be transmitted from an output buffer (38).
3. The architecture of claim 2 wherein said arbitration means includes an arbitration signal generator (46) located in each processor node (12) for generating a control signal to the next processor node (14).
4. The architecture of claim 1 wherein each output buffer includes a discrete, internal controller (38a) for controlling the sequencing of data storage and subsequent transmission to the output bus (22).
5. The architecture of claim 1 which includes plural connection nodes (30, 34) located within each processor node, said connection nodes being constructed and arranged to provide n² connections in n clocks.
6. The architecture of claim 5 wherein, during each clock, selected connection nodes (30) broadcast, through their associated processor nodes, to all of a selected set of connection nodes (34, 35, 36, 37).
7. In a single instruction stream, multiple data stream (SIMD) processor node (12, 14, 16, 18), having an input bus (20), an input unit (54), manipulation units (58 - 66) and an output bus (22), an output processor architecture comprising: an output unit (68) which receives data from the input unit (54) and manipulation units (58 - 66); an output buffer (38) located on select processor nodes (12) for storing and transmitting data from the output unit (68) at a selected time; a control unit (56) for controlling the exchange of data between the processor node (12), an associated output buffer (38) and the output unit (68); and an arbitration signal generator (46) located in each processor node (12) for generating a signal to the next-in-time processor node (14) instructing the next-in-time processor node to transmit.
8. The architecture of claim 7 wherein each output buffer (38) includes a discrete, internal controller (38a) for controlling the sequencing of data storage and subsequent transmission to the output bus (22).
9. The architecture of claim 7 which includes plural connection nodes (30, 34) located within each processor node (12), said connection nodes being constructed and arranged to provide n² connections in n clocks.
10. The architecture of claim 9 wherein, during each clock, selected connection nodes (30) broadcast, through their associated processor nodes, to all of a selected set of connection nodes (34, 35, 36, 37).
EP91921011A 1990-05-30 1990-05-30 Mechanism providing concurrent computational/communications in simd architecture Withdrawn EP0485594A4 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US1990/003066 WO1991019256A1 (en) 1990-05-30 1990-05-30 Mechanism providing concurrent computational/communications in simd architecture

Publications (2)

Publication Number Publication Date
EP0485594A1 true EP0485594A1 (en) 1992-05-20
EP0485594A4 EP0485594A4 (en) 1995-02-01

Family

ID=22220890

Family Applications (1)

Application Number Title Priority Date Filing Date
EP91921011A Withdrawn EP0485594A4 (en) 1990-05-30 1990-05-30 Mechanism providing concurrent computational/communications in simd architecture

Country Status (3)

Country Link
EP (1) EP0485594A4 (en)
JP (1) JPH05500124A (en)
WO (1) WO1991019256A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5978831A (en) * 1991-03-07 1999-11-02 Lucent Technologies Inc. Synchronous multiprocessor using tasks directly proportional in size to the individual processors rates

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1982002102A1 (en) * 1980-12-12 1982-06-24 Ncr Co Chip topography for integrated circuit communication controller
US4428048A (en) * 1981-01-28 1984-01-24 Grumman Aerospace Corporation Multiprocessor with staggered processing
EP0293700A2 (en) * 1987-06-01 1988-12-07 Applied Intelligent Systems, Inc. Linear chain of parallel processors and method of using same

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4380046A (en) * 1979-05-21 1983-04-12 Nasa Massively parallel processor computer
US4481580A (en) * 1979-11-19 1984-11-06 Sperry Corporation Distributed data transfer control for parallel processor architectures
US4412303A (en) * 1979-11-26 1983-10-25 Burroughs Corporation Array processor architecture
US4384273A (en) * 1981-03-20 1983-05-17 Bell Telephone Laboratories, Incorporated Time warp signal recognition processor for matching signal patterns
US4574394A (en) * 1981-06-01 1986-03-04 Environmental Research Institute Of Mi Pipeline processor
US4498134A (en) * 1982-01-26 1985-02-05 Hughes Aircraft Company Segregator functional plane for use in a modular array processor
BG35575A1 (en) * 1982-04-26 1984-05-15 Kasabov Multimicroprocessor system
US4580215A (en) * 1983-03-08 1986-04-01 Itt Corporation Associative array with five arithmetic paths
US4621339A (en) * 1983-06-13 1986-11-04 Duke University SIMD machine using cube connected cycles network architecture for vector processing
US4720780A (en) * 1985-09-17 1988-01-19 The Johns Hopkins University Memory-linked wavefront array processor
US4852048A (en) * 1985-12-12 1989-07-25 Itt Corporation Single instruction multiple data (SIMD) cellular array processing apparatus employing a common bus where a first number of bits manifest a first bus portion and a second number of bits manifest a second bus portion
US4773038A (en) * 1986-02-24 1988-09-20 Thinking Machines Corporation Method of simulating additional processors in a SIMD parallel processor array
US4873626A (en) * 1986-12-17 1989-10-10 Massachusetts Institute Of Technology Parallel processing system with processor array having memory system included in system memory
US4792894A (en) * 1987-03-17 1988-12-20 Unisys Corporation Arithmetic computation modifier based upon data dependent operations for SIMD architectures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1982002102A1 (en) * 1980-12-12 1982-06-24 Ncr Co Chip topography for integrated circuit communication controller
US4428048A (en) * 1981-01-28 1984-01-24 Grumman Aerospace Corporation Multiprocessor with staggered processing
EP0293700A2 (en) * 1987-06-01 1988-12-07 Applied Intelligent Systems, Inc. Linear chain of parallel processors and method of using same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of WO9119256A1 *

Also Published As

Publication number Publication date
EP0485594A4 (en) 1995-02-01
WO1991019256A1 (en) 1991-12-12
JPH05500124A (en) 1993-01-14

Similar Documents

Publication Publication Date Title
US5175858A (en) Mechanism providing concurrent computational/communications in SIMD architecture
US5369773A (en) Neural network using virtual-zero
Uhr Algorithm-structured computer arrays and networks: architectures and processes for images, percepts, models, information
Ramacher SYNAPSE—A neurocomputer that synthesizes neural algorithms on a parallel systolic engine
US5204938A (en) Method of implementing a neural network on a digital computer
US20020184291A1 (en) Method and system for scheduling in an adaptable computing engine
US5586289A (en) Method and apparatus for accessing local storage within a parallel processing computer
Eckert et al. Eidetic: An in-memory matrix multiplication accelerator for neural networks
Card et al. Competitive learning algorithms and neurocomputer architecture
EP0485594A1 (en) Mechanism providing concurrent computational/communications in simd architecture
EP0223849B1 (en) Super-computer system architectures
EP0485466A1 (en) Distributive, digital maximization function architecture and method
JPH10124473A (en) Neural module
Kerckhoffs et al. Speeding up backpropagation training on a hypercube computer
Linde et al. Using FPGAs to implement a reconfigurable highly parallel computer
James et al. Design of low-cost, real-time simulation systems for large neural networks
CN111767133A (en) System and method for reconfigurable systolic array
AsanoviĆ et al. Designing a connectionist network supercomputer
Molendijk et al. Benchmarking the epiphany processor as a reference neuromorphic architecture
Wilson Neural Computing on a One Dimensional SIMD Array.
Hoshino et al. Transputer network for dynamic control of multiple manipulators
WO1991019248A1 (en) Neural network using virtual-zero
Pacheco A" neural-RISC" processor and parallel architecture for neural networks
Svensson et al. Towards modular, massively parallel neural computers
Bogdan et al. Kobold: a neural coprocessor for backpropagation with online learning

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB IT LI LU NL SE

17P Request for examination filed

Effective date: 19920610

RBV Designated contracting states (corrected)

Designated state(s): DE FR GB NL

A4 Supplementary search report drawn up and despatched

Effective date: 19941215

AK Designated contracting states

Kind code of ref document: A4

Designated state(s): AT BE CH DE DK ES FR GB IT LI LU NL SE

Kind code of ref document: A4

Designated state(s): DE FR GB NL

17Q First examination report despatched

Effective date: 19970422

GRAG Despatch of communication of intention to grant

Free format text: ORIGINAL CODE: EPIDOS AGRA

GRAG Despatch of communication of intention to grant

Free format text: ORIGINAL CODE: EPIDOS AGRA

GRAH Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOS IGRA

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19981028