WO1993025976A1

WO1993025976A1 - Parallel processor for processing multiple data with a series of repetitive instructions

Info

Publication number: WO1993025976A1
Application number: PCT/FR1993/000596
Authority: WO
Inventors: Remi Eugene; Cemal Draman
Original assignee: Technium (Societe Civile D'etudes Et De Recherches)
Priority date: 1992-06-16
Filing date: 1993-06-16
Publication date: 1993-12-23
Also published as: FR2692382A1

Abstract

A parallel processor comprising: an assembly (21) of α series of K elemental arithmetic units (UCEi,j), wherein each of the elemental arithmetic units (UCEi,j) has M data input lines (Lce) and N data output lines (Lcs), and each of said lines (Lce, Lcs) is connected to one of the input/output lines (Lces) of the assembly (21); a bulk memory (22) having O input/output lines (Lmes) for writing in and reading out words of O bits, said assembly (21) transferring data to and from said bulk memory; a transposing structure performing the 'direct' conversion of a set of Q data items having P bits into P words having Q bits, said transposing structure further performing the 'reciprocal' conversion of P' words having Q' bits into Q' data items having P' bits; and a set of ports (24) for transferring data to and from the assembly (21), the memory (22) and the transposing structure (23). Said processor may be used to perform repetitive processing, particularly image processing.

Description

PARALLEL PROCESSOR FOR PROCESSING MULTIPLE DATA WITH A SERIES OF REPETITIVE INSTRUCTIONS

The present invention relates to a parallel processor for the rapid processing of a large number of data according to an algorithm that is identical for each of the data. Such a processor is intended in particular for pixel-by-pixel image processing in real time.

Different types of processors with parallel architecture are known in the state of the art, in particular processors operating in the "single instruction stream, multiple data streams" mode. In processors of the prior art, a single control unit manages the operation of a plurality of calculation units, and several data are processed simultaneously. The association of a large number of interconnected calculation units, each integrating random access memories for storing the data, the information relating to the processing and the result of the processing, forms a matrix network. Parallelism of the "single instruction stream, multiple data streams" type is particularly suitable for low-level processing, that is to say for point operations or neighborhood operations that are independent of the position of the pixel in the image.

American patent US 4 144 566 describes in particular a parallel processor comprising a large number of elementary processors connected in parallel to an address bus and a control bus, each comprising a memory and control and calculation means for performing calculations on the bits addressed in the memory and on bits coming either from the internal memory or from external peripherals. The mass memory and said control and calculation means are connected to a fast, low-capacity memory, the control means comprising a single storage flip-flop making it possible to perform serial calculations on bits extracted from the memory and/or coming from said peripherals.

By "control means" is meant in particular structures making it possible to direct information from one or more sources to one or more destinations, and in particular multiplexers.

In order to improve computing power and speed, US Pat. No. 4,215,401 describes a massively parallel processor comprising a plurality of elementary computing units UCE, each of the elementary computing units being connected to the neighboring elementary computing units so as to form a matrix of interconnected elementary computing units of dimension N x M. The elementary computing units are also connected to a central controller.

Each of the elementary computing units comprises a random access memory, an accumulator with one input bit, an accumulator with one output bit and a NAND gate. The use of NAND gates as the processing unit, and the organization of memories imply a relative complexity of the instruction sequences and a limitation of the processing speed.

European patent 0 122 048 discloses a computer for parallel processing of data, comprising a plurality of processing cells (P) and a control device for generating control signals in response to program instructions, each cell comprising an arithmetic processing element having three input terminals (D1, D2, C) and a plurality of output terminals (PLUS, RETENUE; +, CY, BW). Each cell further includes a plurality of memories operable to provide data input signals to the input terminals of the processing element in response to control signals from the controller.

The memories are connected to the controller and to the arithmetic processing element such that both arithmetic and logical operations, including logical operations involving a plurality of data input signals, can be executed by the arithmetic processing element in response to the selective application of a logic level of binary ONE or binary ZERO from one of the memories to one of the input terminals of the processing element.

The aim of the present invention is to optimize the processor in order to balance the resources devoted to managing the formatting of the data and to supplying the elementary computing units with data on the one hand, and the resources devoted to the execution of logical and/or arithmetic processing on the other hand.

To this end, the invention relates more particularly to a parallel processor constituted, in addition to a control part generating all of the commands necessary for the operation of the part assigned to the calculation and to the management of the data, by:

- a set UCE of α series of K elementary calculation units UCEi,j, i designating an integer between 1 and α and j designating an integer between 1 and K, said set comprising L input-output lines Lces, each of the elementary calculation units UCEi,j having M lines Lce for data input and N lines Lcs for data output, each of these lines Lce, Lcs being connected to one of the input-output lines Lces of the set;

- a large-capacity mass memory having O input-output lines Lmes for writing and reading words of O bits, the set UCE exchanging data with said mass memory;

- a transposition structure performing the so-called "direct" transformation of a set of Q data of P bits into a set of P words of Q bits, a word being constituted by the association of the Q bits of equivalent weight of the Q data, said transposition structure also ensuring the so-called "reciprocal" transformation of P' words of Q' bits into Q' data of P' bits. Such a structure communicates with the input-output ports by "data" having respectively P and P' bits, and with the mass memory by "words" having respectively Q and Q' bits. The lines allowing these exchanges can be monodirectional or bidirectional, and distinct or common for transfers of "data" and "words";

- a set of ports allowing the exchange of data between the transposition structure or the mass memory on the one hand and the outside on the other hand. The different ports do not necessarily have the same number of lines, and can individually be monodirectional or bidirectional;

the set UCE exchanging data with the memory, the memory exchanging data with the transposition structure and with the set of ports, and the transposition structure exchanging data with the set of ports.

In the present description, a "data" item gathers all the bits of one piece of information.

A "word" groups together all the bits of the same weight of a set of data.

The transposition of "data" into "words" is designated "direct transposition", while the transposition of "words" into "data" is designated "reciprocal transposition".
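The two transformations defined above can be illustrated by a minimal Python sketch. The function names and the bit ordering inside a word (bit q of a word comes from data item q) are assumptions made for illustration; the description does not fix them.

```python
def direct_transpose(data, p_bits):
    """Direct transposition: Q data items of P bits -> P words of Q bits.

    Word k gathers the bit of weight 2**k of every data item.
    Assumption: bit q of a word comes from data item q.
    """
    words = []
    for k in range(p_bits):
        word = 0
        for q, item in enumerate(data):
            word |= ((item >> k) & 1) << q
        words.append(word)
    return words


def reciprocal_transpose(words, q_bits):
    """Reciprocal transposition: P' words of Q' bits -> Q' data items of P' bits."""
    data = []
    for q in range(q_bits):
        item = 0
        for k, word in enumerate(words):
            item |= ((word >> q) & 1) << k
        data.append(item)
    return data
```

For example, with Q = 4 data items of P = 3 bits, `direct_transpose([0b101, 0b011, 0b110, 0b000], 3)` yields three 4-bit words, and `reciprocal_transpose` applied to those words restores the original four items.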

In the processor according to the invention, each of the elementary computing units UCEi,j performs identical processing on the data originating from the mass memory, according to instructions supplied identically by the central controller to each elementary computing unit UCEi,j. This mode of implementation makes it possible to limit the flow of data and therefore to speed up the processing of a large number of data.

The number α·K of elementary computing units is typically between a few tens and several thousand, and depends essentially on technological progress in terms of circuit integration.

The number L of input-output lines is equal to the number of calculation units, to a submultiple or to a multiple of this number.

The number M of inputs of each of the elementary computing units is typically between 1 and 32. The number N of outputs of each of the elementary computing units is typically between 1 and 32. The number O of input-output lines Lmes of the large-capacity mass memory is equal to L or to a submultiple of L.

P and P' are equal to the number of lines of the external ports to which they are connected.

Q and Q' are equal to the number O of input-output lines Lmes of the mass memory or to submultiples of O.

The transfers between the set of elementary computing units and the large-capacity memory depend on the values of O and L:

* if O = L, all the information present at the output of one of the elements is transferred in one cycle to the input of the other element;

* if O < L, the information present at the output of the set of elementary calculation units is transmitted in whole or in part in one or more cycles, by connecting, for each cycle, O of the L lines of the set of elementary calculation units to the O lines of the large-capacity memory. The information present at the output of the large-capacity memory is transmitted in one cycle to O of the L lines of the set of elementary calculation units; the elementary calculation units that are not addressed are blocked in writing.

The transfers between the large-capacity memory and the transposition structure depend on the values of O, Q and Q':

* if Q = O, all the information present at the output of the transposition structure is transferred in one cycle to the input of the large-capacity memory;

* if Q < O, the information present at the output of the transposition structure is transmitted in one cycle to Q of the O lines of the large-capacity memory; the part of the large-capacity memory that is not addressed is blocked in writing;

* if Q' = O, all the information present at the output of the large-capacity memory is transferred in one cycle to the input of the transposition structure;

* if Q' < O, the information present at the output of the large-capacity memory is transmitted in whole or in part in one or more cycles, by connecting, for each cycle, Q' of the O lines of the large-capacity memory to the Q' lines of the transposition structure.
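The multi-cycle cases above (a wide source feeding a narrower destination) can be sketched as a simple schedule. The helper name `transfer_schedule` is hypothetical, and the choice of contiguous groups of lines taken in increasing order is an assumption; the description only states that a subset of the source lines is connected on each cycle.

```python
def transfer_schedule(source_lines, dest_lines):
    """Return, for each cycle, the source line indices connected to the destination.

    Assumptions: lines are grouped contiguously and transferred in
    increasing order, and the whole of the source is transferred.
    """
    if source_lines <= 0 or dest_lines <= 0:
        raise ValueError("line counts must be positive")
    return [
        range(start, min(start + dest_lines, source_lines))
        for start in range(0, source_lines, dest_lines)
    ]
```

With equal widths (for instance O = L = 512) the schedule contains a single cycle; with a 512-line source and a 128-line destination it contains four cycles of 128 lines each.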

The transfers from the large-capacity memory to the set of ports, as well as the transfers from the set of ports to the large-capacity memory, depend on the value of R, the number of lines of the port involved in the transfer, which is less than O.

For transfers from the large-capacity memory to a port configured as output, the information present at the output of the large-capacity memory is transmitted in whole or in part in one or more cycles, by connecting, for each cycle, R of the O lines of the large-capacity memory to the R lines of the output port involved.

For transfers from a port configured as input to the large-capacity memory, the information present on the input port is transmitted in one cycle to R of the O lines of the large-capacity memory. The part of the memory that is not addressed is blocked in writing.
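The write blocking described for port-to-memory transfers can be pictured as a masked write. In the sketch below, the function name and the `offset` parameter (which R of the O lines receive the port value) are assumptions for illustration; the description does not say how the R lines are selected.

```python
def masked_write(memory_word, port_value, r_lines, offset, o_lines):
    """Write an R-bit port value onto R of the O lines of a memory word.

    The bits outside the selected R lines are blocked in writing,
    that is, they keep their previous value.
    `offset` (position of the first selected line) is an assumption.
    """
    if offset < 0 or r_lines <= 0 or offset + r_lines > o_lines:
        raise ValueError("selected lines must lie inside the memory word")
    mask = ((1 << r_lines) - 1) << offset
    return (memory_word & ~mask) | ((port_value << offset) & mask)
```

For instance, writing a 4-bit value at offset 8 of a 16-bit word changes only bits 8 to 11 and leaves the other twelve bits untouched.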

The transfers between a port and the transposition structure depend on the values of P, P 'and R.

The information present at the "data" output of the transposition structure is transmitted in one cycle to the port configured as output (R = P).

The information present on a port configured as input is transmitted in one cycle to the "data" input of the transposition structure (R = P').

According to a simplified embodiment, each of the K elementary calculation units UCEj of a series i of elementary calculation units UCEi,j accesses the memory elements Mj-r to Mj+r and writes to the memory elements Mj-s to Mj+s without exchanging data with an elementary calculation unit UCEj' where j ≠ j'.

Advantageously, each of the K elementary calculation units UCEi,j accesses the memory elements Mi-t,j-r to Mi+t,j+r and writes to the memory elements Mi-u,j-s to Mi+u,j+s without exchanging data with an elementary calculation unit UCEi',j' where i ≠ i' and j ≠ j'.
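The simplified one-dimensional embodiment can be sketched as follows: every unit applies the same operation to its own window of memory elements and no unit reads another unit's result. The function name, the 0-based indexing, the clipping of windows at the ends of the series and the restriction to s = 0 (each unit writes only its own element) are assumptions made for this sketch.

```python
def neighbourhood_step(memory, r, op):
    """Apply the same operation `op` in every unit j to elements M[j-r]..M[j+r].

    All units read the old memory contents, so no data is exchanged
    between units. Assumptions: indices are 0-based, windows are clipped
    at both ends of the series, and each unit writes only element j (s = 0).
    """
    k_units = len(memory)
    result = []
    for j in range(k_units):
        window = memory[max(0, j - r):min(k_units, j + r + 1)]
        result.append(op(window))
    return result
```

As a usage example, `neighbourhood_step([1, 5, 2, 7, 3], 1, max)` computes a sliding maximum over a window of three elements, one of the neighborhood operations typical of low-level image processing.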

The present invention will be better understood on reading the description which follows, referring to the appended drawings in which:

- Figure 1 schematically represents the environment of a processor according to the invention;

- Figure 2 represents the general structure of the processor according to the invention;

- Figure 3 shows schematically the architecture of a memory element of the transposition structure, according to a first embodiment;

- Figure 4 shows the memory network and the connections with the various buses;

- Figure 5 shows the block diagram of a cell of the memory network;

- Figure 6 schematically shows the architecture of a second embodiment of a memory element of the transposition structure;

- Figure 7 shows the memory network and the connections with the various buses;

- Figure 8 shows the block diagram of a cell of the memory network;

- Figure 9 schematically represents the architecture of a third embodiment of a memory element of the transposition structure;

- Figure 10 shows the memory network and the connections with the various buses;

- Figure 11 shows the block diagram of a cell of the memory network;

- Figure 12 shows the architecture of an elementary calculation unit.

FIG. 1 represents the overall architecture of an example of application of a processor according to the invention. The image processing chain essentially comprises three modules (1 to 3).

The first module consists of an imaging workstation (1) capable of acquiring and reproducing images in real time, connected to a digital camera (5) and to a monitor (6). This workstation is generally equipped with a video processor (7) provided with a local program memory (8), a data memory (9) making it possible to store the information coming from a camera or intended for transfer to a monitor, and several hardwired operators performing specific functions.

The second module (3) can be composed of a single processor or of a processor of the "multiple instruction, multiple data" type, the elementary processor of which can be a conventional processor, a vector processor, a transputer, etc.

This second module (3) provides medium and high level transformations such as the transition from image to list and the processing of these lists.

The third module (4) consists of a processor according to the invention, comprising a calculation structure (101) intended for real-time video processing on the basis of low-level image-processing algorithms (calculations on neighborhoods). This calculation structure (101) receives commands from a central controller (102) generating the instructions necessary for the operation of the calculation structure (101).

The entire architecture is connected on the one hand to a standard industrial bus (referenced "Multibus" in FIG. 1) for commands, and on the other hand to a set of digital video buses (referenced "BUS images" in FIG. 1) for data exchanges, which allow exchanges of the pipeline type or, by means of a switching system, more complex communications.

FIG. 2 represents the general structure of the processor according to the invention.

In the example described, the processor comprises:

- a set UCE (21) of 512 identical elementary calculation units. This set forms a one-dimensional network in which each elementary processor has one output line and 7 input lines. Each line of index i among the L = 512 input-output lines of the set UCE (21) is connected to the output of the elementary processor of index i and to the inputs of the 7 elementary processors of index i-3 to i+3;

- a mass memory (22) having 512 data input-output lines, which allows the writing of 512-bit words;

- a transposition structure (23) composed of two identical and independent 32 x 512 transposition memories, each having an addressing scheme allowing reading or writing in the "data" format (32 bits) and reading or writing in the "word" format (512 bits). Each of the transposition memories allows either a data-to-word transposition or a word-to-data transposition at any time;

- a set of ports (24) composed, in the example described, of two bidirectional 32-bit ports.
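The wiring of the example (each unit of index i sees the outputs of units i-3 to i+3) lends itself to operations on a whole 512-bit word at once. The sketch below computes, for every unit in parallel, the logical OR of its 7 inputs, using a Python integer as the 512-bit word. The function name is hypothetical, and treating the inputs beyond either end of the network as zero is an assumption; the description does not specify the boundary behavior.

```python
def neighbourhood_or(word, reach=3, lines=512):
    """Bit i of the result is the OR of bits i-reach..i+reach of `word`.

    Models the 7-input wiring of the example (reach = 3) on one bit-plane.
    Assumption: inputs outside the range 0..lines-1 read as zero.
    """
    mask = (1 << lines) - 1
    out = 0
    for d in range(-reach, reach + 1):
        # Bit i of (word << d) is bit i-d of word, so d spans the 7 neighbours.
        shifted = (word << d) if d >= 0 else (word >> -d)
        out |= shifted & mask
    return out
```

Applied to a bit-plane of a binary image line, this is a one-dimensional dilation with a 7-pixel structuring element: a single set bit at index 10 produces set bits at indices 7 to 13.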

The processor according to the invention communicates with its external environment by 32-bit data (P = P' = 32). On the other hand, the communication between the set of calculation units (21) and the mass memory (22), as well as the communication between the mass memory and the transposition structure, takes place with 512-bit words (K = L = O = Q = Q' = 512).

The following description relates more particularly to various embodiments of the transposition structure. The transposition of 32-bit "data" into 512-bit "words" is ensured by one or more memories called transposition memories or "orthogonal memories". Such elements make it possible to carry out either the direct transformations (writing of "data", reading of "words"), or the reciprocal transformations (writing of "words", reading of "data"), or both transformations. The writing or reading of information in a given format can be done either by shifting or by addressing. Three types of transposition elements can be distinguished:

- a first type with double orthogonal shift, where the shift in one direction corresponds to access in one format while the shift in the orthogonal direction corresponds to access in the other format;

- a second type with double cross-addressing, one addressing corresponding to access in one format and the other addressing corresponding to access in the other format;

- a third, hybrid type allowing access to one format by shifting and access to the other format by addressing.

FIGS. 3 to 11 represent transposition memories having "data" and "word" accesses by addressing.

In the case where the transposition structure (23) is composed of a transposition memory with reading and writing of "words" and reading and writing of "data", such a memory makes it possible to carry out both direct and reciprocal transformations, as illustrated by FIGS. 3 to 5.

FIG. 3 schematically represents the architecture of the transposition structure, constituted by a memory with reading and writing of "data" and reading and writing of "words". FIG. 4 represents the memory network and the links with the various buses.
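A transposition memory with both accesses by addressing can be modelled as a grid of one-bit cells that is selected either by a "data" address or by a "word" address. The class below is only a behavioral sketch under that assumption: the class and method names are hypothetical, and it does not model the gates, command lines or timing of the cells shown in the figures.

```python
class TranspositionMemory:
    """Grid of one-bit cells with cross-addressed 'data' and 'word' accesses."""

    def __init__(self, p_bits=32, q_bits=512):
        self.p_bits = p_bits  # bits per data item (number of words)
        self.q_bits = q_bits  # bits per word (number of data items)
        # cells[q][k] holds the bit of weight 2**k of data item q.
        self.cells = [[0] * p_bits for _ in range(q_bits)]

    def write_data(self, q, value):
        """'Data' write: store the P bits of data item q."""
        for k in range(self.p_bits):
            self.cells[q][k] = (value >> k) & 1

    def read_data(self, q):
        """'Data' read: return the P bits of data item q."""
        return sum(bit << k for k, bit in enumerate(self.cells[q]))

    def write_word(self, k, value):
        """'Word' write: store the bits of weight 2**k of all Q data items."""
        for q in range(self.q_bits):
            self.cells[q][k] = (value >> q) & 1

    def read_word(self, k):
        """'Word' read: return the bits of weight 2**k of all Q data items."""
        return sum(self.cells[q][k] << q for q in range(self.q_bits))
```

Writing Q data items and then reading the P words performs the direct transformation; writing P words and then reading the Q data items performs the reciprocal one.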

Each cell (28 to 31), the block diagram of which is shown in figure 5, communicates with:

- "words" write command lines (32);

- "words" read command lines (33);

- "data" lines in writing (34);

- "data" lines in reading (35);

- "words" lines in writing (36);

- "words" lines in reading (37);

- "data" write command lines (38);

- "data" read command lines (40).

Each elementary cell (28 to 31) comprises:

- an OR gate (41), the inputs of which receive the "data" write command and "words" write command signals via the lines (32) and (38);

- an OR gate, the inputs of which are connected to the "words" lines in writing (36) and to the "data" lines in writing (34);

- a memory (45);

- a gate (43) controlled by the signal from the "words" read command lines (33) and the output of which is connected to the "words" lines in reading (37);

- a gate (44) controlled by the signal from the "data" read command lines (40) and the output of which is connected to the "data" lines in reading (35).

In the case where the memory element of the transposition structure (23) does not ensure both conversions, a distinction can be made between a first memory architecture, which ensures the direct transformation of data into words (Figures 6 to 8), and a second memory architecture, which ensures the reciprocal transformation of words into data (Figures 9 to 11).

FIG. 6 represents the general architecture of the memory with "data" writing and "words" reading. FIG. 7 represents the memory network and the links with the various buses.

Each cell (50 to 53), the block diagram of which is shown in figure 8, communicates with:

- "words" read command lines (57);

- "data" lines in writing (55);

- "words" lines in reading (56);

- "data" write command lines (54).

Each elementary cell (50 to 53) includes:

- a memory (58) receiving the signals of the "data" write command lines (54) and of the "data" lines in writing (55);

- a gate (59) controlled by the signal from the read command lines (57) and the output of which is connected to the "words" lines in reading (56).

FIG. 9 represents the general architecture of the second memory, with "data" reading and "words" writing.

FIG. 10 represents the memory network and the connections with the various buses.

Each cell (60 to 63), the block diagram of which is shown in figure 11, communicates with:

- "words" write command lines (67);

- "data" lines in reading (66);

- "words" lines in writing (64);

- "data" read command lines (65).

Each elementary cell (60 to 63) comprises:

- a memory (68) receiving the signals of the write command lines (67) and of the "words" lines in writing (64);

- a gate (69) controlled by the signal from the read command lines (65) and the output of which is connected to the "data" lines in reading (66).

Figure 12 represents the architecture of an elementary calculation unit. Each elementary calculation unit consists of two random access memories, RAM I (70) and RAM II (71), of the dual-access type so as to allow simultaneous reading and writing at different addresses, each composed of r words of one bit, and of two calculation structures, the first calculation structure (72) being a bit-serial logic and arithmetic unit and the second calculation structure (73) being a word-parallel arithmetic unit. The elementary calculation unit further comprises a set (74) of multiplexers and registers connecting the outputs and the inputs of the elements.

In the example described, there is an output line and 7 data input lines.
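To suggest how a bit-serial arithmetic unit combines with data stored as words of equal-weight bits, the sketch below adds two sets of numbers one bit-plane per step, each lane keeping its own carry bit. It is a generic bit-serial adder written under stated assumptions (planes ordered least significant first, one lane per elementary unit, Python integers standing in for the words); it is not taken from the description and does not model structures (72) to (74).

```python
def bit_serial_add(planes_a, planes_b, lanes):
    """Add two operands held as bit-planes, least significant plane first.

    Bit q of each plane belongs to lane q (one lane per elementary unit,
    an assumption of this sketch). Returns the sum as bit-planes, with
    one extra plane for the final carry.
    """
    mask = (1 << lanes) - 1
    carry = 0  # one carry bit per lane, packed into an integer
    result = []
    for a, b in zip(planes_a, planes_b):
        result.append((a ^ b ^ carry) & mask)
        carry = ((a & b) | (carry & (a ^ b))) & mask
    result.append(carry)
    return result
```

With four lanes holding the 3-bit values 5, 3, 6, 0 and 1, 2, 3, 7, three steps plus the final carry plane give the sums 6, 5, 9 and 7, whatever the number of lanes: the cost depends on the number of bits per data item, not on the number of data items.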

The present invention is described in the foregoing with reference to a nonlimiting exemplary embodiment.

Claims

1 - Parallel processor comprising a plurality of identical elementary calculation units, memories for storing the data and at least one controller generating all the commands necessary for the operation of the part assigned to the calculation and to the management of the data, characterized in that it comprises:

- a set (21) of α series of K elementary calculation units UCEi,j, i designating an integer between 1 and α and j designating an integer between 1 and K, said set comprising L input-output lines, each of the elementary calculation units UCEi,j having M lines Lce for data input and N lines Lcs for data output, each of these lines Lce, Lcs being connected to one of the input-output lines Lces of the set (21);

- a large-capacity mass memory (22) having O input-output lines Lmes for writing and reading words of O bits, the set (21) exchanging data with said mass memory (22);

- a transposition structure (23) performing the so-called "direct" transformation of a set of Q data of P bits into a set of P words of Q bits, a word being constituted by the association of the Q bits of equivalent weight of the Q data, said transposition structure also ensuring the so-called "reciprocal" transformation of P' words of Q' bits into Q' data of P' bits. Such a structure communicates with the input-output ports by "data" having respectively P and P' bits, and with the mass memory by "words" having respectively Q and Q' bits. The lines allowing these exchanges can be monodirectional or bidirectional, and separate or common for transfers of "data" and "words";

- a set of ports (24) allowing the exchange of data between the set (21), the memory (22) and the transposition structure (23), the set (21) exchanging data with the memory (22), the memory (22) exchanging data with the transposition structure (23) and with the set of ports (24), and the transposition structure (23) exchanging data with the set of ports (24).

2 - Parallel processor according to claim 1, characterized in that each of the K elementary calculation units UCEj of a series i of elementary calculation units UCEi,j accesses the memory elements Mj-r to Mj+r and writes to the memory elements Mj-s to Mj+s without exchanging data with an elementary calculation unit UCEj' where j ≠ j'.

3 - Parallel processor according to claim 1, characterized in that each of the K elementary calculation units UCEi,j accesses the memory elements Mi-t,j-r to Mi+t,j+r and writes to the memory elements Mi-u,j-s to Mi+u,j+s without exchanging data with an elementary calculation unit UCEi',j' where i ≠ i' and j ≠ j'.

4 - Parallel processor according to any one of the preceding claims, characterized in that the transposition structure (23) consists of at least one memory with reading and writing of "data" and reading and writing of "words".

5 - Parallel processor according to any one of claims 1 to 3 characterized in that the transposition structure (23) is composed of two types of memories, one of the types corresponding to "Data" write and "Word" read memories, the other corresponding to "Word" write and "Data" read memories.

6 - Parallel processor according to any one of the preceding claims, characterized in that each elementary calculation unit is constituted by two random access memories (70, 71), of the dual-access type so as to allow simultaneous reading and writing at different addresses, each composed of r words of one bit, and by two calculation structures, the first calculation structure (72) being a bit-serial logic and arithmetic unit and the second calculation structure (73) being a word-parallel arithmetic unit, and in that it further comprises a set (74) of multiplexers and registers connecting the outputs and the inputs of the elements.