EP3782036A1

EP3782036A1 - Mimd processor emulated on simd architecture

Info

Publication number: EP3782036A1
Application number: EP19742845.1A
Authority: EP
Inventors: Stéphane CHEVOBBE; Marc Duranton
Original assignee: Commissariat a lEnergie Atomique CEA; Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Current assignee: Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Priority date: 2018-06-08
Filing date: 2019-06-06
Publication date: 2021-02-24
Also published as: WO2019234359A1; FR3082331A1; US20210240482A1; US11182170B2; FR3082331B1

Abstract

The present invention relates to a processor having a SIMD architecture, comprising an array (120) of elementary processors (150), each elementary processor (150) being associated with an elementary memory cell (155), and a central controller (110) connected to the elementary processors by an instruction bus and a status bus. The central controller transmits a sequence of instructions in a loop, each instruction comprising a calculation flow identifier. Each elementary processor has an instruction filter that makes it possible to reject or take into account an instruction depending on the identifier it contains. This operating mode makes it possible to emulate a MIMD processor on a SIMD architecture.

Description

MIMD PROCESSOR EMULATED ON SIMD ARCHITECTURE

DESCRIPTION

TECHNICAL FIELD

The present invention generally relates to the field of Multiple Instruction Multiple Data (MIMD) processors, in particular for performing image processing in a vision system such as an intelligent retina.

STATE OF THE PRIOR ART

Intelligent retinas are integrated circuits combining a matrix of sensors and a processor consisting of a matrix of elementary processors, the elementary processors, also called PEs (Processing Elements) performing processing on the signals provided by these sensors. In general, there is a correspondence between the sensors (or pixels) and the elementary processors: an elementary processor is in charge of processing the signals from one or more pixels.

The processor can perform elementary image processing (spatial filtering for example) or more complex operations, such as the detection of points of interest (POIs) or of objects. Generally, the processor architecture is of the SIMD (Single Instruction Multiple Data) type, i.e. the same instruction is performed in parallel by all the elementary processors, each of which processes a different datum because it is connected to different pixels. Each elementary processor has its own arithmetic and logic unit (ALU), registers and, if applicable, a local memory, and receives the same instruction as all the other elementary processors.

An example of a vision system using a SIMD architecture processor is described in the chapter by P. Dudek, "SCAMP-3: a vision chip with SIMD current-mode analogue processor array", of the book "Focal-plane sensor-processor chips", edited by A. Zarandy, Springer, 2011.

This type of architecture is suitable for massively parallel computations but is not optimal when different processes must be executed on different parts of the image. The nature of the SIMD architecture requires that these separate processes be performed sequentially, which penalizes the execution time.

More recently, a SIMD processor architecture has been proposed whose elementary processors operate in parallel on the respective columns of the sensor array. This architecture is described in the article by T. Yamazaki et al. entitled "A 1 ms high-speed vision chip with 3D-stacked 140 GOPS column-parallel PEs for spatio-temporal image processing", published in ISSCC 2017 Conf. Proc., Session 4, Imagers, 4.9, pages 82-84. This architecture allows a certain flexibility in that one of four processing operations can be chosen independently and simultaneously for different vertical regions of the image.

The object of the present invention is therefore to provide a simple processor architecture that allows separate processing operations to be performed flexibly and in parallel, in particular on different areas, of any shape, of an image captured by a sensor array.

STATEMENT OF THE INVENTION

The present invention is defined by a SIMD architecture processor comprising a matrix of elementary processors, each elementary processor being associated with a memory cell intended to store data to be processed by said elementary processor, the processor further comprising a central controller, the elementary processors being connected to the central controller by a first bus, called the instruction bus, allowing the central controller to transmit instructions in parallel to the elementary processors, and by a second bus, called the status bus, allowing the central controller to receive the statuses of the different elementary processors, said processor being notable in that:

the central controller comprises a memory in which the tasks to be performed by the various elementary processors are stored in the form of a sequence of instructions, the central controller looping the sequence of instructions on the instruction bus, each instruction comprising a calculation flow identifier, a calculation flow being defined as an ordered list of tasks, each calculation flow relating to one or more elementary processor(s);

each elementary processor comprises an instruction filter and an identifier table, the instruction filter being adapted to extract the calculation flow identifier of each instruction received by the elementary processor and to determine whether the identifier is present in said table, the instruction being stored in a FIFO buffer to be executed by the elementary processor in the affirmative and rejected by the elementary processor in the negative.

The FIFO buffer is typically popped at each instruction executed by said elementary processor.

Advantageously, each instruction of a task comprises a sequence number indicating its order of execution in the task, the instruction filter of the elementary processor comprising a counter incremented each time the FIFO buffer is popped, an instruction being stored in the FIFO buffer only if its calculation flow identifier is present in the table of the elementary processor and if its sequence number is equal to the output value of said counter.

The transmission frequency of the instructions on the instruction bus may notably be substantially greater than the frequency of execution of these instructions by the elementary processors.

Each instruction advantageously comprises an instruction pointer and the elementary processor comprises a micro-sequencer connected to a storage memory of a microcode library, the micro-sequencer sequencing micro-instructions of the microcode pointed by said instruction pointer.

In addition, each elementary processor can be connected to its neighbors by means of communication links, a communication link between a first elementary processor and a second elementary processor connecting a first transmission register of the first elementary processor to a second reception register of the second elementary processor, and a second transmission register of the second elementary processor to a reception register of the first elementary processor. The execution of the micro-instructions by the first elementary processor is then stopped as long as the first transmission register is not empty.

Alternatively, the execution of the micro-instructions by the second elementary processor is stopped as long as the second reception register is not full.

In the first case, the first elementary processor having completed the execution of a task informs the central controller by a notification of its status and the second elementary processor is informed of this status by the central controller.

The present invention also relates to an intelligent optical sensor characterized in that it comprises a matrix of elementary sensors and a SIMD architecture processor according to one of the preceding claims, each elementary processor being associated with a plurality of sensors of said matrix and being adapted to process the signals from these sensors. Each elementary processor may itself have a SIMD architecture.

BRIEF DESCRIPTION OF THE DRAWINGS

Other characteristics and advantages of the invention will appear on reading a preferred embodiment of the invention, described with reference to the appended figures among which:

Fig. 1 schematically represents the general architecture of a SIMD processor according to one embodiment of the invention;

Fig. 2 schematically represents the architecture of an elementary processor of the processor of FIG. 1;

Fig. 3 schematically shows a mode of synchronization between two elementary processors of the processor of FIG. 1;

Fig. 4 schematically represents a task delegation between two elementary processors of the processor of FIG. 1.

DETAILED PRESENTATION OF PARTICULAR EMBODIMENTS

In the following, we consider a SIMD processor as defined in the introductory part. Recall that such a processor consists of a matrix of elementary processors (PEs) sharing the same instruction bus and intended to execute the same instruction in parallel during the same time interval. In a particular mode of use, this processor is integrated with a matrix of sensors (photodiodes for example) within an intelligent optical sensor (intelligent retina). More precisely, in this case, each elementary processor is associated with a sub-matrix of the sensor matrix, the signals of the various sensors of the sub-matrix being stored in a storage sub-matrix, also called a macropixel. The structure of such a storage sub-matrix has been described in application FR-A-2984556. The elementary processors themselves advantageously have a SIMD architecture (each elementary processor then comprising a plurality of calculation units operating in parallel) and can therefore process in parallel several data stored in the storage sub-matrix.

The idea underlying the present invention is to emulate a multiple instructions multiple data processor (MIMD), such as a multi-core processor, from a SIMD architecture processor, so as not to multiply the resources necessary to ensure the storage and sequencing of instructions, necessary for each MIMD processor instance.

Fig. 1 schematically shows the architecture of a SIMD architecture processor according to one embodiment of the invention.

This processor comprises a matrix 120 of elementary processors 150 (PE), each elementary processor being able to access the memory cell with which it is associated. More precisely, the memory 125 is divided into memory cells 155 (CE), each containing the data to be processed by the associated elementary processor. The memory cell has, for example, the structure of the aforementioned storage sub-matrix, and each elementary processor processes the data of the corresponding macropixel. The elementary processors are connected in parallel to a central controller 110 by means of a first common bus, called the instruction bus. Thus, when an instruction is transmitted by the controller, each of the elementary processors receives it and can execute it in parallel.

The elementary processors are also connected to the central controller via a second common bus, called the status bus, on which they can transmit their respective statuses. By status, we mean here for example the state of a task (in particular the end of a task), the occurrence of an error in the execution of a task (division by zero, overflow) or a software interrupt. The statuses of the various elementary processors are grouped together in a status table 130. Thus, the central controller knows at any time the completion status of the tasks performed by the various elementary processors and can transmit instructions accordingly.

The central controller also comprises a memory 140 in which is stored the program to be executed by the processor, said program consisting of a sequence of tasks task_0, task_1, ..., task_N, each task being itself composed of a series of instructions. Advantageously, as will be discussed in more detail below, the instructions of the task or sequence of tasks are looped over the instruction bus. A calculation flow is defined as an ordered sub-sequence of tasks in the sequence task_0, task_1, ..., task_N. A calculation flow may concern a subset of the elementary processors or, in some cases, all the elementary processors.

An instruction includes a header followed by a calculation flow identifier and, if applicable, the order index of the instruction in the task, and then a number of words defining the instruction to be performed and, where appropriate, the arguments of this instruction. Advantageously, the instruction may be coded in compressed form, for example in the form of an instruction index pointing into an instruction library. In the case of an intelligent optical sensor, an example of such an instruction may be the convolution with a kernel for filtering the pixels of the macropixel, the kernel being provided as an argument of the instruction. Alternatively, the instruction can be directly executable by the elementary processor without needing to be decoded. The two types of instructions mentioned above can generally coexist.
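The instruction layout described above can be illustrated by a minimal sketch. The field widths (one byte each), the header value and the names `HEADER`, `Instruction`, `encode` and `decode` are assumptions made for the illustration; the description does not specify a binary format.

```python
from dataclasses import dataclass
from typing import Tuple

HEADER = 0xA5  # hypothetical start-of-instruction marker, not given in the text


@dataclass(frozen=True)
class Instruction:
    flow_id: int            # calculation flow identifier
    seq: int                # order index of the instruction within its task
    words: Tuple[int, ...]  # instruction index (or opcode) followed by arguments


def encode(inst: Instruction) -> bytes:
    """Pack as: header, flow identifier, order index, word count, words."""
    return bytes([HEADER, inst.flow_id, inst.seq, len(inst.words), *inst.words])


def decode(frame: bytes) -> Instruction:
    """Inverse of encode(); raises ValueError on a bad header or length."""
    if len(frame) < 4 or frame[0] != HEADER:
        raise ValueError("not an instruction frame")
    count = frame[3]
    if len(frame) != 4 + count:
        raise ValueError("truncated or oversized frame")
    return Instruction(frame[1], frame[2], tuple(frame[4:4 + count]))
```

In this sketch the first word could be the index of a convolution microcode and the following words the kernel coefficients, which corresponds to the compressed coding mentioned above.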

Fig. 2 schematically shows the architecture of an elementary processor of FIG. 1.

On the left of the figure, it is recalled that the central controller loops a sequence of instructions inst_0, ..., inst_K on the instruction bus. These instructions can relate to different tasks, a task belonging to a calculation flow that one or more elementary processor(s) must execute.

Each instruction is read on the bus by the elementary processor 200. The header of the instruction is analyzed by a filtering module 210. This module detects the beginning of the instruction by means of the header, retrieves the calculation flow identifier and determines whether or not the calculation flow is relevant to the elementary processor. To do this, it compares the identifier received with the identifier stored in a current stream register 220. This register contains the identifier of the current flow to be executed by the elementary processor, i.e. of the calculation flow whose tasks this elementary processor must perform. The contents of the register 220 are loaded during the initialization phase of the processor or by a specific microcode.

Advantageously, the instruction may be coded in compressed form, for example in the form of an instruction index pointing in an instruction library.

When the instruction belongs to a calculation flow concerning the elementary processor, the instruction pointer is stored in a FIFO buffer 230. If the FIFO buffer is full, the instruction in question is not recorded. The instruction pointer may, however, be stored during a subsequent iteration of the instruction loop if a slot has been freed in the meantime at the input of the buffer.
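As an illustration of this filtering step, the following minimal sketch models the current stream register and a bounded FIFO buffer. The class name `FlowFilter`, the method `offer` and the capacity parameter are hypothetical names chosen for the example.

```python
from collections import deque


class FlowFilter:
    """Toy model of the filtering module 210 feeding the FIFO buffer 230."""

    def __init__(self, current_flow: int, capacity: int):
        self.current_flow = current_flow  # contents of the current stream register 220
        self.capacity = capacity          # assumed FIFO depth
        self.fifo = deque()

    def offer(self, flow_id: int, pointer: int) -> bool:
        """Return True if the instruction pointer was stored, False if rejected."""
        if flow_id != self.current_flow:
            return False                  # instruction belongs to another flow
        if len(self.fifo) >= self.capacity:
            return False                  # buffer full: wait for a later loop pass
        self.fifo.append(pointer)
        return True
```

A pointer rejected because the buffer was full is simply offered again when the controller replays the loop, which is why the sketch needs no retry logic of its own.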

Different variant embodiments are possible depending on the nature of the instruction sequence comprising a task.

According to a first variant, the sequence can be resumed from any instruction, in particular because the different instructions of the sequence can be executed independently. In this case, the elementary processor ensures that the FIFO buffer has enough free space to record a complete sequence, which can then be started again. For example, the FIFO buffer can be purged when a sequence has been interrupted or an overflow has occurred.

According to a second variant, all the instructions of a task must be carried out according to the order in which they appear in the sequence. It should then be ensured that all the instructions for this task are carried out in this order by the elementary processor, even if the FIFO buffer overflows. In this case, each instruction has an additional field indicating the sequence number of the instruction in the task. In addition, the filtering module 210 comprises a counter incremented each time an instruction is stored in the FIFO buffer and is reset at the end of the task. This value is used for filtering instructions and ensures that they are entered in sequence in the FIFO. Thus only the next instruction in the task, whose sequence number is equal to the output of the counter and whose stream identifier corresponds to the one stored in the register 220, can be stored in the FIFO buffer.
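The second variant can be simulated with a short sketch. It assumes a deliberately slow elementary processor that executes one instruction every `drain_every` bus slots; the function name `run_ordered` and its parameters are hypothetical.

```python
from collections import deque


def run_ordered(loop, flow, capacity, drain_every, max_passes=100):
    """Replay `loop` (a list of (flow_id, seq, pointer)) and return the
    pointers executed for `flow`, in execution order."""
    task_len = sum(1 for flow_id, _, _ in loop if flow_id == flow)
    fifo, executed = deque(), []
    counter = 0  # sequence number the filter expects next
    tick = 0
    for _ in range(max_passes):
        for flow_id, seq, pointer in loop:
            tick += 1
            # Accept only the next instruction of our own flow, if there is room.
            if flow_id == flow and seq == counter and len(fifo) < capacity:
                fifo.append(pointer)
                counter += 1
            # The slow processor pops one instruction every `drain_every` slots.
            if tick % drain_every == 0 and fifo:
                executed.append(fifo.popleft())
            if len(executed) == task_len:
                return executed
    return executed
```

With a one-slot FIFO, instructions that arrive while the buffer is full are dropped and picked up on a later pass of the loop, yet the execution order of the task is preserved because the counter only advances when an instruction is actually stored.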

In general, the frequency of transmission of the instructions by the central controller is substantially higher than the instruction processing frequency of the elementary processors, which makes it possible to transmit different instruction streams to the different elementary processors without forcing them to wait for an instruction.

An advantageous solution is to interleave the instructions of the different calculation flows, allowing a regular supply of instructions for the different streams.

If a sequence of instructions constituting a task is carried out more quickly than the others, it can advantageously be repeated several times in a repetitive cycle of tasks. Those skilled in the art can define an order of the instructions of the different tasks and the number of repetitions of these tasks for optimal operation of the elementary processor, that is to say so as to limit the periods during which the FIFO buffer is empty (the elementary processor is then waiting for an instruction) or saturated.
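One possible way to build such a loop is a round-robin interleaving with per-task repetition, sketched below. The function name `build_loop` and the `repeats` argument are assumptions for the illustration; the description leaves the actual ordering to the person skilled in the art.

```python
from itertools import zip_longest

_GAP = object()  # filler for flows shorter than the longest one


def build_loop(tasks, repeats=None):
    """tasks: {flow_id: [instruction pointers]}; repeats: {flow_id: count}.
    Returns the interleaved loop as a list of (flow_id, pointer)."""
    repeats = repeats or {}
    streams = []
    for flow_id, pointers in tasks.items():
        count = repeats.get(flow_id, 1)
        # A short task may be repeated so that its flow is supplied regularly.
        streams.append([(flow_id, p) for _ in range(count) for p in pointers])
    # Round-robin: one instruction of each flow in turn.
    return [item for group in zip_longest(*streams, fillvalue=_GAP)
            for item in group if item is not _GAP]
```

For example, a two-instruction task repeated twice interleaves evenly with a four-instruction task, so neither flow has to wait a full task length between two of its instructions.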

When an instruction is taken into account to be executed by the elementary processor, the instruction pointer is popped from the FIFO buffer and supplied to the finite state machine (FSM) 240. The latter plays the role of micro-sequencer: it extracts and sequences the microcode pointed to by the instruction pointer in the microcode library 250. This microcode library is loaded by the central controller 110 during initialization (or during a specific phase of operation, namely a reconfiguration of the system). The micro-instructions contained in the microcode are transferred sequentially, one by one, into the micro-instruction register 260. The arithmetic and logic unit (ALU) 280 receives these micro-instructions sequenced by the state machine 240, the arguments, as well as the data covered by the instruction. The data will have been previously read from the memory cell associated with the elementary processor and stored in the data register 270.
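The role of the micro-sequencer can be sketched as follows. The microcode library contents, the two micro-operations and the name `run_pe` are invented for the example; a real library would hold image-processing microcode.

```python
from collections import deque

# Hypothetical microcode library 250: pointer -> list of (micro-op, argument).
MICROCODE_LIBRARY = {
    "scale_add": [("mul", 2), ("add", 3)],
    "negate": [("mul", -1)],
}

# Hypothetical ALU operations acting on an accumulator.
ALU_OPS = {
    "add": lambda acc, arg: acc + arg,
    "mul": lambda acc, arg: acc * arg,
}


def run_pe(fifo: deque, data: int) -> int:
    """Pop instruction pointers and sequence their micro-instructions."""
    acc = data                              # value held in the data register 270
    while fifo:
        pointer = fifo.popleft()            # pop the FIFO buffer 230
        for op, arg in MICROCODE_LIBRARY[pointer]:
            acc = ALU_OPS[op](acc, arg)     # one micro-instruction per step
    return acc
```

Because instructions on the bus carry only a pointer, the bus traffic stays small while each pointer can expand into an arbitrarily long micro-instruction sequence inside the elementary processor.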

It will thus be understood that the program to be executed by the processor may comprise different tasks to be executed in parallel by the different elementary processors, which makes it possible to emulate an MIMD architecture.

For example, in the case of an intelligent optical sensor, elementary processors associated with macropixels in the center of the image can search for points of interest (POIs) while elementary processors associated with macropixels at the periphery of the image perform motion detection. The instructions for these two tasks are transmitted at high frequency and looped (repetitively) on the instruction bus, the elementary processors in the central area selecting the instruction flow for the first task (POI search) and those in the peripheral zone selecting the instruction flow for the second task. It will be noted that it is not necessary for the instruction streams of the first task and the second task to be successive. The instructions for these two tasks can be interleaved, for example.
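The center and periphery assignment of this example amounts to initializing the current flow identifier of each elementary processor according to its position. A minimal sketch, assuming a rectangular grid and a border of `margin` macropixels (the names `POI_FLOW`, `MOTION_FLOW` and `assign_flows` are hypothetical):

```python
POI_FLOW, MOTION_FLOW = 0, 1  # hypothetical calculation flow identifiers


def assign_flows(rows: int, cols: int, margin: int):
    """Return a rows x cols grid of flow identifiers: motion detection on a
    border of `margin` macropixels, point-of-interest search elsewhere."""
    def on_border(r: int, c: int) -> bool:
        return (r < margin or r >= rows - margin
                or c < margin or c >= cols - margin)

    return [[MOTION_FLOW if on_border(r, c) else POI_FLOW for c in range(cols)]
            for r in range(rows)]
```

Since the assignment is just a register value per elementary processor, the regions need not be rectangular; any predicate on the position could replace `on_border`.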

The looping of the instruction sequence on the one hand, and the filtering of the instructions at the level of the elementary processors on the other hand, make it possible to differentiate the processing performed by the latter.

It should be noted that the different tasks are executed asynchronously by the various elementary processors. This also makes it possible to have different processing frequencies for the elementary processors and thus to optimize the power consumption according to the tasks to be performed. In particular, two elementary processors assigned the same task may complete it at different times because of the respective occupancy states of their FIFO buffers. When an elementary processor has completed the execution of a flow of instructions, it informs the central controller via the status bus.

The asynchronous character of the execution of the tasks can be exploited to distribute the computing load between the elementary processors.

Alternatively, it is possible to synchronize the execution of tasks between neighboring elementary processors.

Fig. 3 schematically shows a synchronization mode between two neighboring elementary processors.

In this embodiment, neighboring processors can exchange data by means of duplex communication links, each communication link implementing two registers, namely a transmission register and a reception register.

Advantageously, four communication links are provided by elementary processor, connecting it to its four neighbors (in the North, South, East, West directions). Alternatively, eight communication links can be provided linking it to its eight neighbors (the neighbors in the previous sense and those in the diagonal directions). The association of a transmission register and a reception register by link makes it possible to carry out asynchronous communication between neighboring elementary processors.

FIG. 3 shows a first elementary processor 310 and a second elementary processor 320, adjacent to the first. The duplex communication link 350 connects, on the one hand, a first transmission register 311 of the first elementary processor to a second reception register 322 of the second elementary processor and, on the other hand, a second transmission register 321 of the second elementary processor to a first reception register 312 of the first elementary processor.

A send microcode of the elementary processor makes it possible to transmit data to a neighboring elementary processor via a communication link. Similarly, a receive microcode makes it possible to receive data from a neighboring elementary processor via this same link. However, it is necessary to ensure that the codes of the transmitting and receiving elementary processors are written so that the data transfer is carried out correctly (a send microcode on one side corresponding to a receive microcode on the other, and vice versa) and in the expected order.

Different variants of the send and receive microcode are possible depending on whether the transfers in the communication registers block the sequence of microinstructions in the elementary processor or not.

For example, taking into account a transmission or reception of data can use the semaphore principle. To do this, each communication register includes a status bit that indicates whether the register in question is empty or full.

The execution of the send microcode transfers data from the ALU to a transmission register of the elementary processor, to be transmitted on the corresponding communication link. Two situations are possible: either the send microcode is blocking, in which case it stops the execution of the micro-instruction sequence as long as the transmission register is not empty, or it is non-blocking, in which case the microcode simply writes the data into the transmission register and sets the register status bit to "full" without affecting the execution of the micro-instruction sequence.

Conversely, the elementary processor receiving the data executes the receive microcode, which can in turn be blocking or non-blocking. If it is blocking, the receiving elementary processor waits for the status bit of the transmission register of the sending elementary processor to be "full". When this condition is fulfilled, the data contained in the transmission register of the sending elementary processor is stored in the reception register of the receiving elementary processor. The receive microcode then sets the status bit of the transmission register (of the sending elementary processor) to the value "empty" and the status bit of the reception register (of the receiving elementary processor) to the value "full". An additional read microcode can then read the data from the reception register and supply it as input to the ALU of the receiving elementary processor. After reading the reception register, the read microcode sets the status bit of the reception register to the value "empty". The skilled person may consider different combinations of send, receive and read instructions (blocking or non-blocking) without departing from the scope of the present invention.
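The semaphore principle described above can be sketched with one status bit per communication register. Blocking is modeled here by a function that returns False when the microcode would stall and must be retried; the names `Register`, `send`, `receive` and `read` are chosen for the example only.

```python
class Register:
    """Communication register with its empty/full status bit."""

    def __init__(self):
        self.value = None
        self.full = False


def send(tx: Register, value) -> bool:
    """Blocking send step: stalls (returns False) while tx is not empty."""
    if tx.full:
        return False
    tx.value, tx.full = value, True
    return True


def receive(tx: Register, rx: Register) -> bool:
    """Blocking receive step: stalls until the sender's tx is full, then
    copies the data, marks tx empty and rx full."""
    if not tx.full:
        return False
    rx.value, rx.full = tx.value, True
    tx.full = False
    return True


def read(rx: Register):
    """Read step: returns the received data and marks rx empty,
    or returns None if nothing has been received."""
    if not rx.full:
        return None
    rx.full = False
    return rx.value
```

A non-blocking send would differ only in overwriting the transmission register regardless of its status bit, which is why the pairing of send and receive microcodes has to be checked when the tasks are written.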

The synchronization between elementary processors for sending and receiving data can also be performed via the central controller which then explicitly orders the data exchanges in synchronous mode.

Fig. 4 represents a task delegation between two elementary processors under the supervision of the central controller.

When an elementary processor 430 has completed its task and has signaled it to the central controller on the status bus, it becomes available to perform a new process. A neighboring elementary processor 420 can then delegate to it part of its own task while that task is running. The elementary processor 420 is informed of the availability of the elementary processor 430 by the central controller, which maintains the status table. The central controller can then indicate the task to be performed (by means of a new code to be loaded in the register 220) and trigger at 425 a transfer of data via the communication link that connects the two elementary processors.

This indication may also take the form of a start address and an end address in the calculation flow. The elementary processor 430 then determines, by means of its selection module, the instructions that are intended for the elementary processor 420 and whose addresses lie between said start and end addresses of the delegated task. At the end of the execution of the delegated task, the elementary processor 430 informs the central controller, which updates its status table. The elementary processor 420 is thus informed of the end of the delegated task and triggers at 435 the transfer of the data in order to receive them in its reception register (or buffer).

In the case of an optical sensor, the task delegation may for example concern a part of the data of the macropixel and/or a particular operation. For example, if a search for points of interest and a motion detection must be performed by the elementary processors on one area of the image (shaded area) and only motion detection must be performed in the rest of the image, the elementary processor 430 may be tasked with a point of interest search on behalf of the elementary processor 420 once it has completed its own motion detection task. The task delegation process can be repeated over time until the end of the program.
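The delegation mechanism can be sketched in two steps: the controller looks up an idle neighbor in its status table, and the helper then selects the delegating processor's instructions within the given address range. The sketch assumes that an address is simply the position of the instruction in the loop; the names `IDLE`, `BUSY`, `find_idle_neighbour` and `delegated_slice` are hypothetical.

```python
IDLE, BUSY = "idle", "busy"  # hypothetical status values in the status table


def find_idle_neighbour(status_table: dict, neighbours: list):
    """Return the first neighbour reported idle by the status table, or None."""
    for pe in neighbours:
        if status_table.get(pe) == IDLE:
            return pe
    return None


def delegated_slice(loop: list, flow_id: int, start: int, end: int) -> list:
    """Select the pointers of `flow_id` whose address in the loop lies in
    [start, end]. `loop` is a list of (flow_id, pointer)."""
    return [pointer for address, (flow, pointer) in enumerate(loop)
            if flow == flow_id and start <= address <= end]
```

For instance, with a status table reporting processor 430 as idle and 420 as busy, the controller would pick 430 and hand it the part of 420's flow that falls between the start and end addresses.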

Claims

1. A SIMD architecture processor comprising a matrix of elementary processors (150), each elementary processor being associated with a memory cell (155) for storing data to be processed by said elementary processor, the processor further comprising a central controller (110), the elementary processors being connected to the central controller by a first bus, called the instruction bus, allowing the central controller to transmit instructions in parallel to the elementary processors, and by a second bus, called the status bus, allowing the central controller to receive the statuses of the various elementary processors, characterized in that:

the central controller comprises a memory (140) in which the tasks to be performed by the various elementary processors are stored in the form of an instruction sequence, the central controller looping the sequence of instructions on the instruction bus, each instruction comprising a calculation flow identifier, a calculation flow being defined as an ordered list of tasks, each calculation flow relating to one or more elementary processor(s);

each elementary processor comprises an instruction filter (210) and an identifier table (220), the instruction filter being adapted to extract the calculation flow identifier of each instruction received by the elementary processor and to determine whether the identifier is present in said table, the instruction being stored in a FIFO buffer (230) to be executed by the elementary processor in the affirmative and rejected by the elementary processor in the negative.

2. SIMD architecture processor according to claim 1, characterized in that the FIFO buffer (230) is popped at each instruction executed by said elementary processor.

3. SIMD architecture processor according to claim 2, characterized in that each instruction of a task comprises a sequence number indicating its execution order in the task, the instruction filter of the elementary processor comprising a counter incremented each time the FIFO buffer is popped, an instruction being stored in the FIFO buffer only if its calculation flow identifier is present in the table of the elementary processor and if its sequence number is equal to the output value of said counter.

4. SIMD architecture processor according to one of the preceding claims, characterized in that the transmission frequency of the instructions on the instruction bus is substantially greater than the frequency of execution of these instructions by the elementary processors.

5. SIMD architecture processor according to one of the preceding claims, characterized in that each instruction comprises an instruction pointer and the elementary processor comprises a micro-sequencer (240) connected to a storage memory of a microcode library (250), the micro-sequencer sequencing the micro-instructions of the microcode pointed to by said instruction pointer.

6. SIMD architecture processor according to claim 5, characterized in that each elementary processor is connected to its neighbors by means of communication links, a communication link (350) between a first elementary processor (310) and a second elementary processor (320) connecting a first transmission register (311) of the first elementary processor to a second reception register (322) of the second elementary processor and a second transmission register (321) of the second elementary processor to a reception register (312) of the first elementary processor.

7. SIMD architecture processor according to claim 6, characterized in that the execution of the micro-instructions by the first elementary processor is stopped as long as the first transmission register is not empty.

8. SIMD architecture processor according to claim 6, characterized in that the execution of the micro-instructions by the second elementary processor is stopped as long as the second reception register is not full.

9. SIMD architecture processor according to claim 6, characterized in that the first elementary processor, having completed the execution of a task, informs the central controller by a notification of its status and the second elementary processor is informed of this status by the central controller.

10. Intelligent optical sensor characterized in that it comprises a matrix of elementary sensors and a SIMD architecture processor according to one of the preceding claims, each elementary processor being associated with a plurality of sensors of said matrix and being adapted to process the signals from these sensors.

11. Intelligent optical sensor according to claim 10, characterized in that each elementary processor itself has a SIMD architecture.