EP2332067A1 - Device for the parallel processing of a data stream

Info

Publication number: EP2332067A1
Application number: EP09779672A
Authority: EP
Inventors: Laurent Letellier; Mathieu Thevenin
Original assignee: Commissariat a lEnergie Atomique CEA; Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Current assignee: Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Priority date: 2008-09-30
Filing date: 2009-06-08
Publication date: 2011-06-15
Also published as: WO2010037570A1; US20110273459A1; JP2012504264A; US8836708B2; FR2936626B1; FR2936626A1

Abstract

The present invention relates to a device for processing a data stream from a device generating data matrices of Nl-lines by Nc-columns. The processing device includes K calculation tiles (TC) and an interconnection means (4) for transferring the data stream between the calculation tiles (TC). According to the invention, at least one calculation tile (TC) comprises: one or more control units (UC) for providing instructions; n processing units (UT), each processing unit (UT) implementing the instructions received from a control unit (UC) on a region of Vl lines by Vc columns of data; a storage unit (UM) for converting the streamed data into regions of Vl lines by (n+Vc-1) columns of data, the storage unit (UM) including a block (31) of converting memories having a size of VlxNc and a block of region registers having a size of Vlx(n+Vc-1); an input/output unit (UES) for conveying the data stream between the interconnection means (4) and the storage unit (UM) as well as between the processing units (UT) and the interconnection means (4). The invention enables a high modularity in the processing of a data stream while limiting power consumption.

Description

Device for parallel processing of a data stream

The invention relates to a device for processing a data stream. It lies in the field of computing architectures and is particularly useful in embedded multimedia applications incorporating a video sensor, such as mobile telephones, portable media players, digital still cameras and digital camcorders. The invention is also useful in telecommunications applications and, more generally, in any signal processing chain that handles high-rate digital data. Signal processing in general, and image processing in particular, require significant computing power, especially in recent years with the rapid increase in the resolution of image sensors. In embedded applications intended for the general public, strong manufacturing cost constraints are added to the power consumption constraints (of the order of a few hundred milliwatts). To meet these constraints, image processing is commonly performed by dedicated computing modules operating in data flow mode. The "data flow" mode is understood as a data processing mode in which the data entering the calculation module are processed as they arrive, at the rate of their arrival, a result being provided at the output of the calculation module at the same rate, possibly after a latency time. Dedicated calculation modules make it possible to comply with manufacturing cost constraints, owing to their small silicon area, and with performance constraints, particularly as regards computing power and power consumption. However, such modules lack flexibility: the supported processing cannot be modified once the modules have been fabricated. At best, these modules are configurable. In other words, a number of processing-related parameters may be modified after fabrication.

One solution to this lack of flexibility is to use fully programmable processors. The most commonly used are signal processing processors, well known under the acronym "DSP" for "Digital Signal Processor". The disadvantages of these processors are their large silicon footprint and their power consumption, which often make them unsuitable for highly constrained embedded applications. Compromises between dedicated computing modules and fully programmable processors are currently under development. According to a first compromise, a circuit includes a data processing unit with very long instruction words, called a VLIW unit for "Very Long Instruction Word", and a unit for executing one instruction on several calculation units, called a SIMD unit for "Single Instruction Multiple Data". In some current embodiments, VLIW and/or SIMD calculation units are placed in the circuit according to the computing power required. The choice of the type of units to be included in the circuit, their number and their chaining is decided before the circuit is fabricated, by an analysis of the application code and the necessary resources. The order in which the units are chained is fixed, so the sequence of processing operations cannot be changed afterwards. In addition, the units are generally quite complex because the control code of the application is not separated from the processing code. The processing operators of these units are therefore large, which leads to an architecture whose silicon area and power consumption are greater for equal computing power.

According to a second compromise, code in the C language can be transformed into a set of elementary instructions by a specific compiler. The set of instructions is then implemented on a configurable matrix of predefined operators. This technology can be compared to that of user-programmable gate arrays, better known by the acronym FPGA for "Field Programmable Gate Array", but with a coarser computing grain. It therefore does not yield programmable circuits, but only circuits that are configurable by compiling the code. If portions of program code that were not initially planned are to be integrated, computing resources that are not present in the circuit are required, and it becomes difficult or impossible to implement this code. According to a third compromise, the data is processed by a so-called parallel architecture. Such an architecture comprises several calculation tiles interconnected by an interconnection bus. Each calculation tile includes a storage unit for locally storing the data, a control unit providing instructions for performing processing on the stored data, processing units executing the instructions received from the control unit on the stored data, and an input/output unit carrying the data either between the interconnection bus and the storage unit, or between the processing units and the interconnection bus. This architecture has several advantages. A first advantage is the ability to modify the code to be executed by the processing units, even after the architecture has been fabricated. In addition, the code to be executed by the processing units generally comprises only calculation instructions, with no control or address calculation instructions. A second advantage is the possibility of performing in parallel either an identical processing operation on several data, or more complex processing operations in the same number of clock cycles, by taking advantage of the parallel arrangement of the processing units.
A third advantage is that the calculation tiles can be chained according to the processing to be performed on the data, the interconnection bus carrying the data between the calculation tiles in a configurable order. In addition, the parallel architecture can be extended by adding calculation tiles, so as to adapt its processing capacity to the processing to be performed. However, the management of data in the calculation tiles is complex and usually requires significant memory resources. In particular, when a calculation tile performs processing on a data neighborhood, it must have all the data of that neighborhood available at the same time, whereas the data arrive as a continuous stream. The storage unit of the calculation tile must then store a large portion of the data of the stream before a neighborhood can be processed. This storage and the management of the stored data must be optimized in order to limit the silicon area and the power consumption of the parallel architecture while offering computing performance suited to the processing of a data stream. An object of the invention is to provide a calculation structure that is programmable and suited to the processing of a data stream, especially when processing must be performed on data neighborhoods. For this purpose, the subject of the invention is a device for processing a stream of data coming from a device generating matrices of Nl lines by Nc columns of data. The processing device includes K calculation tiles and interconnection means for transferring the data stream between the calculation tiles. At least one calculation tile comprises:

- one or more control units to provide instructions,

- n processing units, each processing unit executing the instructions received from a control unit on a neighborhood of Vl lines by Vc columns of data,

- a storage unit making it possible to put the data of the stream in the form of neighborhoods of Vl lines by (n+Vc-1) columns of data, the storage unit comprising a block of formatting memories of dimension VlxNc and a block of neighborhood registers of dimension Vlx(n+Vc-1),

- an input/output unit for routing the data stream between the interconnection means and the storage unit on the one hand, and between the processing units and the interconnection means on the other hand.

An advantage of the invention is that the storage unit of a calculation tile in which processing is performed on a data neighborhood is particularly well suited to such processing, particularly as regards the sizing of the memory registers and the management of access to the memory registers by the processing units.

The invention will be better understood and other advantages will become apparent on reading the detailed description of an embodiment given by way of example, a description given with regard to the appended drawings which represent:

- FIG. 1, an example of a device for processing a data stream according to the invention,

- FIG. 2, an exemplary processing unit comprising a processor with very long instruction words,

- FIG. 3, an example of management of a block of formatting memories,

- FIG. 4, an example of management of a block of neighborhood registers in the case where the data of the formatting memory block are in order,

- FIG. 5, an example of management of the neighborhood register block in the case where the data of the formatting memory block are not in order,

- FIG. 6, a set of timing diagrams illustrating the temporal management of a block of neighborhood registers,

- FIG. 7, an exemplary embodiment of a calculation tile comprising several processing units in parallel,

- FIG. 8, a set of timing diagrams illustrating the temporal management of a storage unit of a calculation tile comprising two parallel processing units,

- FIG. 9, an exemplary embodiment of an input/output unit,

- FIG. 10, an exemplary implementation of the device according to the invention for video images,

- FIG. 11, a schematic representation of a Bayer filter,

- FIG. 12, an example of format registers for splitting the data of the stream,

- FIG. 13, an example of a mechanism allowing access to a register containing metadata,

- FIG. 14, an exemplary embodiment of a calculation tile comprising several processing units, the processing units receiving specific instructions as a function of metadata,

- FIG. 15, an embodiment of an insertion operator.

The following description is given in connection with a chain for processing a video data stream coming from a video sensor such as a CMOS sensor. The processing chain makes it possible, for example, to reconstruct color images from a monochrome video sensor to which a color filter, for example a Bayer filter, is applied, to improve the quality of the rendered images, or to carry out morphological operations such as erosion/dilation or the low-level pixel processing of advanced applications such as image stabilization, red-eye correction or face detection. However, the device according to the invention is equally suitable for processing data streams other than those originating from a video sensor. The device can, for example, process a stream of audio data or data in Fourier space. In general, the device is of particular interest for processing data which, although conveyed as a stream, have coherence in a two-dimensional space.

FIG. 1 schematically represents a device 1 for processing a data stream according to the invention. A video sensor 2 generates a digital data stream directed to the processing device 1, via a data bus 3. The data from the video sensor 2 is referred to as raw data. The device 1 processes this raw data in order to generate data qualified as final data. For this purpose, the device 1 according to the invention comprises processing units UT, control units UC, storage units UM and input / output units UES grouped into K calculation tiles TC. The device 1 also comprises interconnection means 4 such as data buses 41, 42. These interconnection means 4 make it possible to transfer the flow of data between the different calculation tiles TC. Each calculation tile TC comprises a storage unit UM, one or more control units UC, at least one processing unit UT per control unit UC and an input / output unit UES. The storage units UM make it possible to format the data of the stream so that it can be processed by the processing units UT according to code instructions issued by the control units UC. The input / output units UES make it possible to route the data flow between the interconnection means 4 and the storage units UM on the one hand, and between the processing units UT and the interconnection means 4 on the other hand. In the example of FIG. 1, the device 1 comprises 4 calculation tiles TC, the first and the fourth calculation tiles TC1 and TC4 each comprising a storage unit UM, a control unit UC, a processing unit UT, and an input / output unit UES, the second calculation tile TC2 comprising a storage unit UM, a control unit UC, two processing units UT and an input / output unit UES, and the third calculation tile TC3 comprising a storage unit UM, two control units UC, two processing units UT per control unit UC and an input / output unit UES. Each calculation tile TC makes it possible to perform a function or a series of functions from code instructions. 
As part of a video processing chain, each calculation tile TC performs, for example, one of the following functions: white balance correction, demosaicing, noise reduction, edge sharpening. The composition of a calculation tile TC depends in particular on the function or functions it has to perform. In particular, the number of control units UC in a calculation tile TC depends on the number of different processing operations to be performed simultaneously by the calculation tile TC. Since each control unit UC within the calculation tile TC may hold its own code, a calculation tile TC comprises, for example, as many control units UC as there are separate processing operations to be performed in parallel on the data.

The processing units UT can be more or less complex. In particular, they may comprise either simple dedicated operators, for example composed of logic blocks, or processors. Each processing unit UT is independent of the others and may have different operators or processors. Dedicated operators are, for example, multipliers, adders/subtracters, assignment operators or shift operators. Advantageously, the processing units UT contain only the dedicated operators commonly used for the processing envisaged.

A processing unit UT may also include a processor. In a first embodiment, the processor comprises a single arithmetic and logic unit. In a second embodiment, the processor is a very long instruction word processor, commonly referred to as a VLIW processor for "Very Long Instruction Word". Such a processor may comprise several arithmetic and logic units. In a preferred variant, a VLIW processor comprises, for example, instruction decoders, calculation operators in place of arithmetic and logic units, a local memory and data registers. Advantageously, only the calculation operators necessary for executing the intended calculation codes are implemented in the processor when it is designed. Two or more of them can then be used in the same cycle to perform separate operations in parallel. Unused operators do not receive clock signals. The power consumption of the processing units UT is thereby reduced. These advantageous characteristics have led to a particular embodiment, shown in FIG. 2. In this figure, the VLIW processor comprises two channels. In other words, it can execute up to two instructions in the same clock cycle. The processor comprises a first instruction decoder 21, a second instruction decoder 22, a first set of multiplexers 23, a set of calculation operators 24, a second set of multiplexers 25, a set of data registers 26 and a local memory 27. The instruction decoders 21 and 22 receive instructions from a control unit UC. According to the instructions received, the multiplexers 23 direct the data to be processed to an input of one of the calculation operators 24 and the multiplexers 25 direct the processed data to the data registers 26. The data registers 26 containing the processed data can be linked to outputs of the processor. The size of the very long instruction words is, for example, 48 bits, i.e. 24 bits per channel. The calculation operators 24 thus work in 24-bit precision.
In the context of video processing, and more particularly of image reconstruction from the data of a video sensor, the calculation operators 24 are advantageously two adders/subtracters, a multiplier, an assignment operator, an operator for writing to the local memory and a shift operator.

Still according to a particular embodiment, the execution of an instruction may be conditioned on the state of a flag. The instruction can then be completed with a prefix indicating the execution condition. The flag is, for example, a bit of a register containing the result of an instruction executed during the preceding clock cycle. This bit may correspond to the zero, sign or carry indicator of the register. At each instruction, the instruction decoders 21 and 22 test the state of the flag linked to this instruction. If this state satisfies the execution condition, the operation is executed; otherwise it is replaced by a no-operation instruction, called NOP. At the end of the cycle of each instruction, the flag value is sent to the two instruction decoders 21 and 22 so that the possible condition of the next instruction can be tested.

According to a particular embodiment, each instruction word is coded on 24 bits. The first three bits (bits 0 to 2) can contain the instruction condition, the next two bits (bits 3 and 4) can encode the data access mode, the next three bits (bits 5 to 7) can encode the identifier of the operation, the next four bits (bits 8 to 11) can designate the destination register, the next four bits (bits 12 to 15) can designate the source register and the last eight bits (bits 16 to 23) can contain a constant. An example of programming using such a coding is given in the appendix.
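The field layout described above can be illustrated by a minimal Python sketch. It assumes that bit 0 is the least significant bit of the 24-bit word, which the text does not specify; the names Instruction, encode and decode are hypothetical and serve only the illustration.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    """Fields of one 24-bit instruction word (names are illustrative)."""
    condition: int    # bits 0 to 2: execution condition
    access_mode: int  # bits 3 and 4: data access mode
    opcode: int       # bits 5 to 7: identifier of the operation
    dest: int         # bits 8 to 11: destination register
    src: int          # bits 12 to 15: source register
    constant: int     # bits 16 to 23: constant


# (low bit position, width) of each field, in the order of the tuple above.
# Assumption: bit 0 is the least significant bit of the word.
FIELDS = ((0, 3), (3, 2), (5, 3), (8, 4), (12, 4), (16, 8))


def decode(word: int) -> Instruction:
    """Split a 24-bit word into its six fields."""
    if not 0 <= word < (1 << 24):
        raise ValueError("instruction word must fit in 24 bits")
    return Instruction(*((word >> low) & ((1 << width) - 1)
                         for low, width in FIELDS))


def encode(ins: Instruction) -> int:
    """Pack six fields into a 24-bit word, checking each field width."""
    word = 0
    for value, (low, width) in zip(ins, FIELDS):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its width")
        word |= value << low
    return word
```

The six field widths add up to 3 + 2 + 3 + 4 + 4 + 8 = 24 bits, so two such words fill one 48-bit very long instruction word of the two-channel processor.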

The device 1 for processing a data stream comprises M control units UC, M being between 1 and N, where N is the number of processing units UT. In the case where the number M of control units UC is equal to the number N of processing units UT, each processing unit UT can have its own control unit UC. In the case where the number M of control units UC is smaller than the number N of processing units UT, at least one calculation tile TC comprises several processing units UT, as in the example of FIG. 1 (TC2, TC3). A control unit UC of this calculation tile TC then provides instructions to several processing units UT, these processing units UT being said to be in parallel. A control unit UC may include a memory for storing the code instructions for the processing unit or units UT that it serves. A control unit UC can also include an ordinal counter (that is, a program counter), an instruction decoder and an address manager.

In the context of processing raw images obtained through a color filter, the address manager and the ordinal counter make it possible to apply different processing according to the color of the current pixel. In particular, the code may be divided into code segments, each code segment including the instructions for one of the colors of the filter. The address manager may indicate to the ordinal counter the color of the current pixel, for example red, green or blue. According to a particular embodiment, the address manager comprises a two-bit word for encoding up to four colors or different types of pixels in a pixel neighborhood of size two by two. At each clock cycle, the ordinal counter is incremented by an offset value that depends on the value of the word. The ordinal counter then points to the code segment corresponding to the color of the current pixel. The four offset values are determined when the code is compiled, based on the number of instructions in each code segment. The use of an address manager and an ordinal counter relieves the programmer of having to determine the nature of the current pixel in software. This management becomes automatic and allows a shorter execution time and simpler programming. In the particular case where the processed images are monochrome, the same instructions are applied to all the pixels. The offset values are then equal and are determined so that the ordinal counter points to the first instruction after the initialization code.
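One possible reading of this mechanism is sketched below in Python. The sketch assumes that the code segments are stored one after another directly behind the initialization code, and that the two-bit word is built from the parity of the line and column indices of the current pixel; neither detail is stated in the text, and the function names are hypothetical.

```python
def segment_offsets(init_len, segment_lens):
    """Offsets computed at compile time: start address of each code segment.

    Assumption: the segments follow the initialization code back to back.
    """
    offsets = [init_len]
    for length in segment_lens[:-1]:
        offsets.append(offsets[-1] + length)
    return offsets


def pixel_type(line, column):
    """Two-bit word identifying the pixel within a 2x2 neighborhood.

    Assumption: the high bit is the line parity, the low bit the column parity.
    """
    return ((line & 1) << 1) | (column & 1)


def segment_start(line, column, offsets):
    """Address the ordinal counter points to for the current pixel."""
    return offsets[pixel_type(line, column)]


def monochrome_offsets(init_len):
    """Monochrome case: four equal offsets, just after the initialization code."""
    return [init_len] * 4
```

For instance, with 4 initialization instructions and segments of 10, 12, 12 and 8 instructions, the four offsets are 4, 14, 26 and 38, and the pixel at line 1, column 0 selects the segment starting at address 26.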

The device 1 for processing a data stream also comprises K storage units UM, K being between 1 and M. A calculation tile TC may comprise several control units UC, as in the example of FIG. 1 (TC3). The same data of the stream, or neighboring data, present in the storage unit UM can then be processed differently by the processing units UT of the calculation tile, each control unit UC providing instructions to at least one processing unit UT. The main function of the storage units UM is to format the data of the stream in order to facilitate access to these data by the processing units UT.

According to a first embodiment, a storage unit UM comprises a number of data registers equal to the number of processing units UT located in the calculation tile TC of the storage unit UM considered.

According to a second embodiment, particularly suited to the processing of video images, a storage unit UM formats the data in the form of neighborhoods and manages access to the data when processing units UT are in parallel. Such a storage unit UM may comprise a first memory block, called the formatting memory block, and a second memory block, called the neighborhood register block. Since the storage units UM of the different calculation tiles TC are independent of one another, the data stream processing device 1 may comprise both storage units UM according to the first embodiment and storage units UM according to the second embodiment. The second embodiment allows processing on data neighborhoods. For a video image, a neighborhood can be defined as a mesh of adjacent pixels, this mesh being generally square or at least rectangular. A rectangular mesh may be defined by its dimension VlxVc, where Vl is the number of lines of pixels in the neighborhood and Vc is the number of columns of pixels in the neighborhood. The block of formatting memories stores the data of the stream so that they can be copied systematically each time a new datum arrives. The neighborhood register block allows the processing unit or units UT of the calculation tile considered to access the pixels of the current neighborhood.

FIG. 3 illustrates, by a block 31 of formatting memories represented at different time steps T, an example of management of the block 31 for data corresponding to a stream of pixel values coming from a device generating matrices of Nl lines by Nc columns of data, such as a video sensor 32. The video sensor 32 has a resolution of Nc columns by Nl lines of pixels. The resolution is, for example, VGA (640x480), "HD Ready" (1080x720) or "Full HD" (1920x1080). Pixels are sent to the block 31 of formatting memories and stored there as they arrive. This block 31 is advantageously of dimension VlxNc, to allow the generation of neighborhoods of dimension VlxVc. In other words, the block 31 comprises VlxNc memory cells arranged in a mesh of Vl lines and Nc columns. Typical values for Vl are three, four, five, six or seven. Physically, the block 31 may consist of one or more memory modules. The block 31 can be managed as a shift register. In other words, at each time step or clock cycle, the data are shifted to make room for the new incoming data. Advantageously, the block 31 is instead managed as a conventional memory, so that the pixels are copied in their order of arrival.

In the latter case, and in a first embodiment, a counter CPT is incremented at each incoming data item. Each new pixel coming from the data stream is then copied into a cell 33 of the formatting memory block 31 located at the line corresponding to E(CPT/Nc), where E(x) is the function returning the integer part of a number x, and at the column corresponding to the remainder of CPT/Nc. The counter CPT is reset each time it reaches the value VlxNc. In a second embodiment, a counter CPTC is incremented after each incoming data item and a counter CPTL is incremented each time the counter CPTC reaches the value Nc. The counter CPTC is reset each time it reaches the value Nc and the counter CPTL is reset each time it reaches the value Vl. Each new pixel coming from the data stream is then copied into the cell 33 whose line number corresponds to the value of CPTL and whose column number corresponds to the value of CPTC.
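The two addressing schemes can be compared with a short Python sketch. It assumes zero-based line and column numbers and counters that start at zero; the function names are hypothetical. Both functions return the sequence of cells (line, column) written by successive incoming pixels.

```python
def cells_single_counter(vl, nc, pixel_count):
    """First embodiment: one counter CPT, reset when it reaches Vl x Nc."""
    cells = []
    cpt = 0
    for _ in range(pixel_count):
        # line = integer part of CPT / Nc, column = remainder of CPT / Nc
        cells.append((cpt // nc, cpt % nc))
        cpt += 1
        if cpt == vl * nc:
            cpt = 0
    return cells


def cells_two_counters(vl, nc, pixel_count):
    """Second embodiment: a column counter CPTC and a line counter CPTL."""
    cells = []
    cptl = cptc = 0
    for _ in range(pixel_count):
        cells.append((cptl, cptc))
        cptc += 1
        if cptc == nc:        # end of a line: reset CPTC, increment CPTL
            cptc = 0
            cptl += 1
            if cptl == vl:    # block full: wrap around to the first line
                cptl = 0
    return cells
```

Under these assumptions the two schemes address the same cells in the same order; the second one avoids the division and remainder operations, at the cost of a second counter.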

FIG. 4 illustrates an exemplary management of the neighborhood register block for data from the block 31 of formatting memories. The block 34 of neighborhood registers comprises, for example, a number of neighborhood registers equal to VlxVc. These neighborhood registers are arranged in the same way as the neighborhood of pixels, that is to say they form a mesh of Vl lines and Vc columns of registers. The copying of the data from the block 31 of formatting memories to the neighborhood registers starts as soon as the number of data in the block 31 is equal to (Vl-1)xNc+1. In the case of a neighborhood of dimension 3x3, represented in FIG. 4, the copying of the data thus begins when two lines of data plus one datum are present in the block 31. In one embodiment, the data are copied at each clock cycle in groups of data items belonging to the same column. At a given time step, the number of the column to be copied is given by the value of CPTC. This column indeed includes the last pixel to have arrived in the block 31. Advantageously, a column 35 of Vl data registers is added to the neighborhood registers. This column 35 makes it possible to block access to the registers of the block 34 by the processing units UT during a single clock cycle only, that of the shifting of the values in the block 34. Otherwise, access is blocked both during the shifting of the values and during the copying of the data from the block 31. During a first clock cycle, the data of the column of the block 31 indicated by the counter CPTC are copied into the registers of the column 35. During a second clock cycle, all the data of the block 34 and of the column 35 are shifted by one column. Thus, for a neighborhood of dimension 3x3, in the same clock cycle, the data of a first column 341 are shifted to a second column 342, the data of this column 342 are shifted to a third column 343 and the data of the column 35 are shifted to the column 341.
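The start condition and the shift step can be sketched in Python as follows. The sketch models the block 34 as a list of lines in which index 0 stands for the column 341, index 1 for the column 342 and so on; this list representation and the function names are assumptions made for the illustration.

```python
def copy_can_start(data_count, vl, nc):
    """True once block 31 holds at least (Vl - 1) x Nc + 1 data."""
    return data_count >= (vl - 1) * nc + 1


def shift_block(block34, column35):
    """Shift step: every column moves one place, column 35 enters at 341.

    block34 is a list of Vl lines; within a line, index 0 is column 341.
    The oldest column (the last index) is dropped.
    """
    return [[staged] + line[:-1] for staged, line in zip(column35, block34)]


# Usage for a 3x3 neighborhood.
block34 = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
column35 = [10, 11, 12]
shifted = shift_block(block34, column35)
```

After the shift, the values previously held in the column 35 occupy the column 341, and the former contents of the column 343 have left the block.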

Because of the cyclic management of the block 31, the data are not always stored in the block 31 in the order of the lines of the video sensor 32. In this case, the pixels must be copied into the column 35 or, where appropriate, into the column 341 of the block 34 in a different order. FIG. 5 illustrates such a case, where the last data of the stream are stored on the first line of the block 31. In the case of a neighborhood of dimension 3x3, the copying of the pixels into the column 35 can be managed by the following placement steps:

- the last pixel to arrive always goes to the third line 347 of the register column 35;

- if the counter CPTL is equal to zero, in other words if the last pixel has arrived at the first line 311 of the block 31, then the pixel of the second line 312 of the block 31 is copied to the first line 345 of the column 35 and the pixel of the third line 313 of the block 31 is copied to the second line 346 of the column 35;

- if the counter CPTL is equal to one, in other words if the last pixel has arrived at the second line 312 of the block 31, then the pixel of the first line 311 of the block 31 is copied to the second line 346 of the column 35 and the pixel of the third line 313 of the block 31 is copied to the first line 345 of the column 35;

- if the counter CPTL is equal to two, in other words if the last pixel has arrived at the third line 313 of the block 31, then the pixel of the first line 311 of the block 31 is copied to the first line 345 of the column 35 and the pixel of the second line 312 of the block 31 is copied to the second line 346 of the column 35.

More generally, in the case of a neighborhood of dimension VlxVc, the pixel of the block 31 of formatting memories located on the line NoLigne and in the column indicated by CPTC is copied into the column 35 or, where appropriate, into the first column 341 of the block 34, at the line defined by (CPTL + NoLigne + 1) modulo Vl. NoLigne takes all the positive integer values between 1 and Vl so as to allow the copying of the pixels for all the lines of the neighborhood.
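The placement steps listed for the 3x3 case can be checked with a small Python sketch. The sketch uses zero-based line indices, which is an assumption: with that convention the three enumerated cases are reproduced by sending source line s of the block 31 to destination line (s - CPTL - 1) modulo Vl of the column 35, which is equivalent to filling destination line d from source line (CPTL + d + 1) modulo Vl. The function names are hypothetical.

```python
def destination_line(source_line, cptl, vl):
    """Zero-based line of column 35 that receives a given line of block 31.

    cptl is the zero-based line of block 31 holding the newest pixel.
    """
    return (source_line - cptl - 1) % vl


def load_column35(block31_column, cptl):
    """Reorder one column of block 31 so that the newest pixel ends up last."""
    vl = len(block31_column)
    column35 = [None] * vl
    for source_line, pixel in enumerate(block31_column):
        column35[destination_line(source_line, cptl, vl)] = pixel
    return column35


# One column of block 31, named after its lines 311, 312 and 313.
column = ["p311", "p312", "p313"]
```

With CPTL equal to 0, 1 and 2, the reordered column is respectively (p312, p313, p311), (p313, p311, p312) and (p311, p312, p313), in agreement with the three cases of the list; in each case the newest pixel lands on the last line of the column 35.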

According to a particular embodiment, the copying of the pixels of the block 31 into the register column 35 is not performed simultaneously with the shifting of the pixels in the block 34. This embodiment allows the processing units UT to access the data present in the block 34 of neighborhood registers for a longer period. FIG. 6 shows a set of timing diagrams for implementing this embodiment. The time offset between the copying of the pixels and the shifting of the pixels in the block 34 can be achieved by introducing, in addition to a first clock, called the pixel clock 61, which clocks the data stream and the copying of the pixels, a second clock, called the offset pixel clock 62. This offset pixel clock 62 may be at the same frequency as the pixel clock 61 but shifted in time. This offset corresponds, for example, to one period of the clock 63 of the processing units UT. The data present in the block 34 are then accessible during the entire period separating two ticks of the offset pixel clock 62. Access to the neighborhood registers by the processing units UT can be achieved through an input/output port, for example integrated into each processing unit UT, whose number of connections is equal to the number of neighborhood registers multiplied by the size of the data. Each neighborhood register is connected to the input/output port. Advantageously, each storage unit UM comprises a multiplexer whose number of inputs is equal to the number of neighborhood registers of the block 34 and whose number of outputs is equal to the number of data that can be processed simultaneously by the processing unit UT of the calculation tile TC considered. The processing unit UT may then comprise an input/output port whose number of connections is equal to the number of data that can be processed simultaneously multiplied by the size of the data.
In this case, a processing unit UT comprising a two-way VLIW processor processing 12-bit data may comprise an input / output port with 24 (2x12) connections.
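The two port sizings described above amount to simple products, sketched here in Python. The function names are hypothetical, and the 3x3 neighborhood used for the direct-port figure is only an illustrative choice.

```python
def direct_port_connections(vl, vc, data_bits):
    """Port wired to every neighborhood register: registers x data size."""
    return vl * vc * data_bits


def multiplexed_port_connections(simultaneous_data, data_bits):
    """Port behind the multiplexer: simultaneous data x data size."""
    return simultaneous_data * data_bits


# Two-way VLIW processor on 12-bit data, as in the text: 2 x 12 connections.
with_multiplexer = multiplexed_port_connections(2, 12)

# Same data size with a direct port on a 3x3 neighborhood (illustrative).
without_multiplexer = direct_port_connections(3, 3, 12)
```

For a 3x3 neighborhood of 12-bit data, the direct port would need 9 x 12 = 108 connections, against 24 with the multiplexer, which shows why the multiplexed variant is presented as advantageous.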

According to a particular embodiment, the same storage unit UM supplies data to several processing units UT in parallel. In other words, the processing device 1 comprises a calculation tile TC comprising several processing units UT. This embodiment advantageously uses storage units UM comprising a block 31 of formatting memories and a block 34 of neighborhood registers. However, the size of the block 34 of neighborhood registers needs to be adapted. FIG. 7 illustrates an example of a calculation tile TC in which a storage unit UM supplies data to n processing units UT in parallel, n being less than or equal to the number N of processing units UT of the device 1. The instructions are supplied to the n processing units UT by one or more control units UC. According to this embodiment, the block 34 of neighborhood registers is of dimension Vl × (n + Vc - 1). In other words, the block 34 comprises Vl × (n + Vc - 1) data registers arranged in a mesh of Vl lines and n + Vc - 1 columns. For example, for three processing units UT in parallel and a neighborhood of size 5 × 5, a mesh of 7 (= 3 + 5 - 1) columns and 5 lines of registers is required. In addition, a column 35 of Vl data registers can be added to the block 34. Access to the neighborhood registers by the processing units UT is then blocked only during a single cycle of the processing units UT. The copying of the data from the block 31 to the register column 35 then begins when the block 31 of formatting memories contains (Vl - 1) × Nc + 1 data. Furthermore, for n processing units UT in parallel, the data processing is performed when n new data have arrived in the block 31. Access to the neighborhood registers by the n processing units UT can also be provided by an input/output port integrated into each processing unit UT. The number of connections of the input/output port of each processing unit UT is then equal to the number of neighborhood registers to which the processing unit UT requires access multiplied by the size of the data.
Similarly, the storage unit UM may comprise a multiplexer whose number of inputs is equal to the number of neighborhood registers of the block 34 and whose number of outputs is equal to the number of data that can be processed simultaneously by the n processing units UT, each processing unit UT comprising an input/output port whose number of connections is equal to the number of data that can be processed simultaneously by said processing unit UT multiplied by the size of the data.
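By way of illustration only, the sharing of the block 34 of dimension Vl × (n + Vc - 1) between n processing units UT can be sketched in Python as follows. The function name ut_windows and the rule that unit k reads columns k to k + Vc - 1 are assumptions made for this sketch; the description does not specify how the columns are assigned to the units.

```python
def ut_windows(block34, n, vc):
    """Return, for each of the n processing units, its Vl x Vc neighborhood.

    block34 is a list of Vl rows, each holding n + vc - 1 register values.
    Assumption (not stated in the text): unit k reads columns k .. k + vc - 1.
    """
    width = n + vc - 1
    if any(len(row) != width for row in block34):
        raise ValueError("each row of block 34 must hold n + Vc - 1 registers")
    return [[row[k:k + vc] for row in block34] for k in range(n)]


# Example from the text: three units and a 5 x 5 neighborhood need 7 columns.
vl, vc, n = 5, 5, 3
block34 = [[line * 10 + col for col in range(n + vc - 1)] for line in range(vl)]
windows = ut_windows(block34, n, vc)
```

Under this assumption, adjacent units share Vc - 1 columns, which is why n + Vc - 1 columns are enough instead of n × Vc.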

FIG. 8 illustrates, by a set of timing diagrams, an example of management of a calculation tile TC comprising two processing units UT in parallel. A first timing diagram 81 represents the clock of the processing units UT, of rate F_archi. A second timing diagram 82 represents the pixel clock, of rate F_pixel. The pixel clock sets the rate at which the data of the stream arrive and are sent to the block 31 of formatting memories. The rate F_archi may be equal to p × F_pixel, with p a positive integer. According to FIG. 8, the rate F_archi is four times greater than the rate F_pixel. Each processing unit UT thus has four clock cycles per data to be processed. A third timing diagram 83 represents a shift clock. This clock generates two successive ticks 831, 832 after every second tick of the pixel clock. At each tick of the shift clock, the data of the block 34 are shifted by one column. A fourth timing diagram 84 represents the offset pixel clock. The rate of this clock is substantially equal to half the rate F_pixel, a tick 840 being generated after the two ticks 831, 832 of the shift clock. In general, the rate of the offset pixel clock is equal to 1/n times the rate F_pixel of the pixel clock. At each tick 840 of the offset pixel clock, the data are copied from the block 31 to the register column 35. Access to the neighborhood registers by the processing units UT is possible between two ticks 840 of the offset pixel clock.

According to a particular embodiment, the interconnection means 4 comprise a number Nb_bus of data buses. Nb_bus can be defined by the following relation:

Nb_bus = K × (F_pixel / F_archi) + 1.

This embodiment makes it possible to connect the K calculation tiles TC to each other by performing a spatiotemporal multiplexing whose time-division multiplexing ratio Mux_t is defined by the relation: Mux_t = F_archi / F_pixel.

The time-division multiplexing ratio Mux_t defines an equal number of time slots, and the read and write access permissions can be set for each time slot. For example, for a rate F_pixel equal to 50 MHz and a rate F_archi equal to 200 MHz, the four calculation tiles TC of FIG. 1 can be chained in any order if the interconnection means 4 comprise at least two (4 × (50/200) + 1 = 2) data buses, the calculation tiles TC being addressed with a time-division multiplexing ratio of four (= 200/50). According to this embodiment, each input/output unit UES can manage the read and write access permissions according to the number of buses Nb_bus and the time-division multiplexing ratio Mux_t. In particular, each input/output unit UES may comprise registers for determining the time slots during which the calculation tile TC considered has a read or write access permission on one of the data buses and, for each of these time slots, the data bus for which read or write access is permitted. An input/output unit UES comprises, for example, for the management of write access permissions, Nb_bus registers of size log2(Mux_t) bits, where log2(x) is the function returning the base-2 logarithm of the number x and, for the management of read access permissions, a register of size log2(Nb_bus) bits specifying the number of the bus to read and a register of size log2(Mux_t) bits specifying the time slot. An exemplary embodiment of such an input/output unit UES is shown in FIG. 9. The input/output unit UES comprises two registers 91 and 92 of 2 bits each, the register 91 managing the write access permission on the bus 41 and the register 92 managing the write access permission on the bus 42. The contents of the registers 91 and 92 are compared with the value of the current time slot, for example by means of comparators 93 and 94 and, in case of equality, the writing of data is permitted on the bus 41 or 42 concerned.
The input/output unit UES also comprises a register 95 of 1 bit specifying the number of the bus 41 or 42 to be read and a register 96 of 2 bits specifying the time slot for the reading. The content of the register 96 is also compared with the current time slot, for example by a comparator 97 and, in the case of equality, the reading of the data is permitted on the bus 41 or 42 concerned. This embodiment has the advantage that each input/output unit UES individually manages the access permissions between the calculation tiles TC and the buses 41 and 42. Consequently, no centralized control unit is necessary. The value of the registers of each input/output unit UES is set at system start-up according to the desired chaining of the calculation tiles TC. An unused calculation tile TC may have the registers of its input/output unit UES initialized so as to have no read or write rights on the buses 41 and 42.
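The sizing relations and the decentralized arbitration described above can be sketched as follows. This is a minimal Python model under stated assumptions, not the circuit of FIG. 9: the names bus_parameters, register_bits and UES are hypothetical, the integer division used when K is not a multiple of Mux_t is an assumption, and the use of None to encode "no access right" is an assumption, since the text does not say how an unused tile is encoded.

```python
def bus_parameters(k, f_pixel, f_archi):
    """Return (Nb_bus, Mux_t) from Nb_bus = K x (F_pixel / F_archi) + 1.

    Assumption: F_archi is an integer multiple of F_pixel, and the division
    K / Mux_t is rounded down when K is not a multiple of Mux_t.
    """
    if f_archi % f_pixel != 0:
        raise ValueError("F_archi must be p x F_pixel with p a positive integer")
    mux_t = f_archi // f_pixel
    nb_bus = k // mux_t + 1
    return nb_bus, mux_t


def register_bits(count):
    """Number of bits needed to encode `count` distinct values (log2 rounded up)."""
    return (count - 1).bit_length()


class UES:
    """Hypothetical model of the access registers of one input/output unit."""

    def __init__(self, write_slots, read_bus, read_slot):
        self.write_slots = write_slots  # one slot per bus; None means no write right
        self.read_bus = read_bus        # number of the bus to read
        self.read_slot = read_slot      # slot for reading; None means no read right

    def may_write(self, bus, slot):
        # Mirrors the comparators 93 and 94: write when the register equals the slot.
        return self.write_slots[bus] == slot

    def bus_to_read(self, slot):
        # Mirrors the comparator 97: read the selected bus during the stored slot.
        return self.read_bus if self.read_slot == slot else None


nb_bus, mux_t = bus_parameters(4, 50, 200)   # example of the text
write_bits = register_bits(mux_t)            # size of registers 91, 92 and 96
read_bus_bits = register_bits(nb_bus)        # size of register 95
tile = UES(write_slots=[1, None], read_bus=0, read_slot=0)
```

With the figures of the example (four tiles, 50 MHz and 200 MHz), the sketch yields two buses, four time slots, 2-bit slot registers and a 1-bit bus register, which matches the register sizes given for FIG. 9.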

According to a particular embodiment, represented in FIG. 1, each calculation tile TC furthermore comprises a serial block BS comprising as many data registers as there are processing units UT in the tile considered, the size of the registers being at least equal to the size of the stream data. The serial block BS of a calculation tile TC receives as input the data coming from the processing unit or units UT and is connected at its output to the input/output unit UES. When write access is permitted on one of the buses 41 or 42, the data present in the serial block BS are sent sequentially on this bus 41 or 42.

FIG. 10 illustrates an exemplary implementation of the device 1 for processing a data stream, for processing to be carried out on raw images. The raw images come for example from a Bayer filter 110, shown for example in FIG. 11. With such a filter, a color image is constituted by a mosaic of red, green and blue pixels. In particular, the mosaic consists of an alternation of blue and green pixels on a first type of line and an alternation of green and red pixels on a second type of line, the types of lines being alternated so as to form diagonals of green pixels. The device 1 according to the invention is particularly suitable for such data. Indeed, for each type of line, it is possible to constitute a calculation tile TC capable of simultaneously processing several pixels although they are of different colors. In one embodiment, shown in FIG. 10, the calculation tile TC comprises, on the one hand, a first control unit UC1 supplying a first code to a first and a third processing unit UT1 and UT3 and, on the other hand, a second control unit UC2 supplying a second code to a second and a fourth processing unit UT2 and UT4. The first code is specific to a first pixel color, for example red, and the second code is specific to a second pixel color, for example green. The code can also be divided into code segments, an address manager then indicating to the control units UC1 and UC2 the color of the processed pixel. The processing units UT1, UT2, UT3 and UT4 then act on the data present in the block 34 of neighborhood registers according to the instructions they receive. In this case, the first and third processing units UT1 and UT3 act on red pixels and the second and fourth processing units UT2 and UT4 act on green pixels. The calculation tile thus makes it possible to process simultaneously, but distinctly, four pixels of the block 34 of neighborhood registers.
In the case of the Bayer filter 110, two control units UC1 and UC2 per line suffice because a line has only two different colors. Of course, the calculation tiles can be adapted according to the color filter applied.
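The reason why two control units per line suffice can be illustrated by the following Python sketch of the mosaic. The functions bayer_color and control_unit_for are hypothetical, and the choice that even lines are of the blue/green type starting with blue is an assumption; the text only states that the two types of lines alternate so that the green pixels form diagonals.

```python
def bayer_color(line, column):
    """Color of one pixel of the mosaic described above.

    Assumption: even lines are the blue/green type and start with blue,
    odd lines are the green/red type and start with green.
    """
    if line % 2 == 0:
        return "B" if column % 2 == 0 else "G"
    return "G" if column % 2 == 0 else "R"


def control_unit_for(unit_index):
    """Hypothetical dispatch: units 0 and 2 (UT1, UT3) follow UC1, the others UC2."""
    return "UC1" if unit_index % 2 == 0 else "UC2"


# Two consecutive lines of four pixels each.
rows = ["".join(bayer_color(line, col) for col in range(4)) for line in range(2)]
units = [control_unit_for(k) for k in range(4)]
```

Because the colors alternate along a line, every other processing unit sees the same color, so one control unit can drive UT1 and UT3 and a second one UT2 and UT4.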

According to a particular embodiment, the data in transit on the interconnection means 4 may contain the data to be processed, but also additional information, called metadata. These metadata can be used to transport various information associated with the data. In the context of video image processing, where the data relate to pixels, a metadata item contains for example a value representative of a noise correction or a gain to be applied to the pixels. The same correction can thus be applied to all the pixels of the image. The metadata can also relate to the three values R, G and B (Red, Green and Blue), intermediate results to be associated with a pixel, or information for controlling the program according to the characteristics of the pixel. The use of these metadata makes it easy to split the algorithms into parts and run them on different calculation tiles TC in multi-SIMD mode. The data in transit can be split according to different formats. The splitting format is specified in format registers Mr_i, as represented in FIG. 12 (four in this particular case: Mr_0, Mr_1, Mr_2, Mr_3). In the case shown in FIG. 12, a 24-bit data word is split into three parts 121, 122, 123, although a split into up to four parts was possible. These format registers are defined for a calculation tile TC and their respective values are fixed when the program is loaded. Accesses to the data are then systematically made by a mechanism as shown in FIG. 13, composed of multiplexers 131a, 131b, 131c, 131d, shift registers 132a, 132b and logic gates 133a, 133b, 133c, 133d, 137. FIG. 13 also shows the format registers Mr_i 134, associated with position registers Dr_i 135. These position registers Dr_i 135 are deduced from the format registers Mr_i 134 by the software that loads the program parameters. They make it possible to obtain the starting position of the metadata item considered.
Only the format registers Mr_i 134 are given by the programmer, the position registers Dr_i 135 being obtained automatically from the format registers Mr_i 134. In the example of FIG. 13, Mr_0 is associated with Dr_0 = 0, Mr_1 with Dr_1 = 8, Mr_2 with Dr_2 = 16 and Mr_3 with Dr_3 = 24. A possible embodiment of a mechanism for reading and writing metadata in a 24-bit register 136, allowing access to the communication network, is given in FIG. 13. For the reading part, the format registers 134 are connected, through a multiplexer 131b controlled by the current position to be read CMr, to a logical AND cell 133d that sets to zero the bits of the register 136 that do not belong to the selected position; the position registers 135 are then connected to a shift register 132b, which shifts the preceding result by the required number of positions so as to obtain a right-aligned value, which is the final value to be recovered for the position considered. For the writing part, the data to be written Val is shifted to the correct position by means of the position register 135 connected to a multiplexer 131c controlled by the current position to be written CMw, the multiplexer 131c itself being connected to a shift register 132a. Only the bits concerned are kept non-zero by a logical AND cell 133a connected to the multiplexer 131a, which supplies the mask taken from the format register 134. Finally, these bits are combined with those already present in the destination register 136. For this purpose, the format register 134 is inverted by a logical NOT cell 137, and a logical AND is then performed by a cell 133b between the inverted format register and the value of the register 136, producing a new masked value that feeds a logical OR cell 133c, which merges the new data into the register 136 without touching the bits not concerned.
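The mask-and-shift mechanism described above can be sketched in Python as follows. The format masks in MR (three 8-bit fields in a 24-bit word) are an illustrative assumption, as are the names position, read_field and write_field; the sketch models the logic of the AND, NOT, OR and shift cells, not the circuit of FIG. 13 itself.

```python
WORD_BITS = 24
WORD_MASK = (1 << WORD_BITS) - 1

# Hypothetical format registers Mr_i: three 8-bit fields of a 24-bit word.
MR = [0x0000FF, 0x00FF00, 0xFF0000]


def position(mask):
    """Start position Dr_i deduced from a non-zero format mask Mr_i."""
    return (mask & -mask).bit_length() - 1


DR = [position(m) for m in MR]


def read_field(reg, i):
    """Read: AND with the format mask, then shift right to align the value."""
    return (reg & MR[i]) >> DR[i]


def write_field(reg, i, val):
    """Write: shift the value into place, mask it, and merge it into the register."""
    new_bits = (val << DR[i]) & MR[i]       # shift, then AND with the format mask
    kept_bits = reg & ~MR[i] & WORD_MASK    # NOT of the mask, AND with the old value
    return kept_bits | new_bits             # OR merges without touching other fields


reg = 0
reg = write_field(reg, 1, 0xAB)
reg = write_field(reg, 2, 0x12)
fields = [read_field(reg, i) for i in range(3)]
```

Masking the shifted value before the OR means that a value wider than its field cannot overflow into the neighboring field.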

With this metadata management, the calculation of complex or fixed-point numbers is greatly facilitated. The processing device according to the invention can thus advantageously be used for processing in the Fourier (frequency) domain, for example. The processing device according to the invention can also be adapted to accelerate the emulation of floating-point numbers using fixed-point operators. In a multi-SIMD architecture, metadata can be used to determine the instructions to transmit to the processors. Indeed, the additional information (metadata) can indicate the specific processing to be performed on the data with which it is associated. It is sufficient to extract the necessary information from the data word as soon as it enters the calculation tile TC, as illustrated by the example of FIG. , and to transmit it to the control unit UC that manages the processing units UT in multi-SIMD mode. In this case, the metadata can be extracted when the neighborhood manager organizes the data for transmission to the processing units UT via multiplexers 141. An additional link 142 between the input/output unit UES and the control unit UC allows the transfer of the metadata, as shown in Figure 15.

In order to allow the transfer of the metadata, the interconnection means 4 can be adapted in terms of capacity. In particular, the size of the data buses 41, 42 can be increased depending on the size of the metadata. In addition, the device 1 according to the invention may include an insertion operator for concatenating each data item of the stream with a metadata item. FIG. 15 shows such an insertion operator 150. The insertion operator 150 comprises an input bus 151 connected to an input of an insertion block 152 whose output is connected to an output bus 153. The insertion operator 150 may also include a memory 154 for storing the metadata. The memory 154 is linked to the insertion block 152 to allow the transfer of the metadata. The size of the metadata must be less than or equal to the difference between the maximum size of the data that can be transferred by the interconnection means 4 and the size of the data of the stream. The size of the input bus 151 must be adapted to the size of the data of the stream, while the size of the output bus 153 must be adapted to the size of the stream data concatenated with the metadata. The insertion operator 150 may be inserted on one of the data buses 41, 42, for example between the video sensor 2 and the calculation tiles TC or between two calculation tiles TC. In one embodiment, the insertion operator 150 is implemented by a calculation tile TC. The calculation tile TC then comprises a storage unit UM containing the complementary data and a processing unit UT making it possible to concatenate the data of the stream with the complementary data. The complementary data are for example stored in a data register of the storage unit UM. This embodiment has the advantage of avoiding the insertion of an additional component in the processing chain of the data stream. It is made possible by the modularity of the device 1 according to the invention.
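The size constraint on the metadata and the concatenation performed by the insertion operator can be sketched as follows. The functions insert_metadata and split_word are hypothetical, and the placement of the metadata in the bits above the stream data is an assumption, since the text does not specify the order of concatenation.

```python
def insert_metadata(data, metadata, data_bits, meta_bits, bus_bits):
    """Concatenate one stream word with one metadata word.

    Assumption: the metadata occupies the bits above the stream data.
    """
    if meta_bits > bus_bits - data_bits:
        raise ValueError("metadata larger than the room left on the bus")
    if data >> data_bits or metadata >> meta_bits:
        raise ValueError("value wider than its declared size")
    return (metadata << data_bits) | data


def split_word(word, data_bits):
    """Dissociate a concatenated word into (stream data, metadata)."""
    return word & ((1 << data_bits) - 1), word >> data_bits


# A 12-bit pixel and an 8-bit metadata item on a 24-bit bus.
word = insert_metadata(0xABC, 0x05, data_bits=12, meta_bits=8, bus_bits=24)
pixel, meta = split_word(word, data_bits=12)
```

The first check expresses the rule of the text: the metadata cannot exceed the difference between the bus capacity and the size of the stream data.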

ANNEX

Instruction set (on 48 bits: 24 bits per channel)

Composition of the 24-bit instruction word:

0..2 -> 3 bits: condition
3..4 -> 2 bits: data access mode
5..7 -> 3 bits: operation identifier
8..11 -> 4 bits: destination register
12..15 -> 4 bits: source register
16..23 -> 8 bits: constant
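The field layout above can be sketched as a Python encoder and decoder. The names Instruction, decode and encode are hypothetical, and the convention that bit 0 is the least significant bit is an assumption, since the annex does not state the bit ordering. The numeric codes of the conditions and operations are not given in the annex, so the fields are left as plain integers.

```python
from collections import namedtuple

Instruction = namedtuple(
    "Instruction", "condition access_mode opcode dest src constant"
)


def decode(word):
    """Split a 24-bit instruction word into its six fields.

    Assumption: bit 0 is the least significant bit of the word.
    """
    if not 0 <= word < (1 << 24):
        raise ValueError("instruction word must fit in 24 bits")
    return Instruction(
        condition=word & 0x7,            # bits 0..2
        access_mode=(word >> 3) & 0x3,   # bits 3..4
        opcode=(word >> 5) & 0x7,        # bits 5..7
        dest=(word >> 8) & 0xF,          # bits 8..11
        src=(word >> 12) & 0xF,          # bits 12..15
        constant=(word >> 16) & 0xFF,    # bits 16..23
    )


def encode(ins):
    """Inverse of decode: pack the six fields into one 24-bit word."""
    return (
        (ins.condition & 0x7)
        | (ins.access_mode & 0x3) << 3
        | (ins.opcode & 0x7) << 5
        | (ins.dest & 0xF) << 8
        | (ins.src & 0xF) << 12
        | (ins.constant & 0xFF) << 16
    )


example = Instruction(condition=1, access_mode=2, opcode=3, dest=4, src=5, constant=0x80)
round_trip = decode(encode(example))
```

The field widths add up to 24 bits (3 + 2 + 3 + 4 + 4 + 8), so two such words fill the 48-bit VLIW instruction.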

Instruction prefixes:

F_: execution if flag = 1
NF_: execution if flag = 0
fC_: update of the flag on carry
fZ_: update of the flag on zero result
fS_: update of the flag on sign (1 if > 0, 0 if < 0)

Instruction postfixes (select the source):

r (D, A, B): R[D] destination register; R[A] source register; R[B] source register
c (D, A, C): R[D] destination register; R[A] source register; C constant
v (D, A, V): R[D] destination register; R[A] source register; Neighbor[V]
If the 8 bits of the argument are formed as "10...0<V>" instead of "0...<V>", the neighbor is taken at the index stored in the register V, that is Vois[R[V]].
m (D, A, M): R[D] destination register; R[A] source register; M address in the local memory

Use: the VLIW allows two channels to be used on the same line:

OPi (Dm, ...) / OPj (Dn, ...)

provided that j != i and m != n,

except in the case of complementary conditional instructions:

F_OP: no operation if flag = 1
NF_OP: no operation if flag = 0

We can write on the same line F_OP (Dm, ...) / NF_OP (Dn, ...), since whatever the value of the flag, only one of the two will actually be executed.

List of operations

- NOP

PREFIX: F_ NF_

NOP
F_NOP
NF_NOP

- LD

PREFIX: F_ NF_ fZ_ fC_ fS_

LDr (D, A): R[D] = R[A]
LDc (D, C): R[D] = C; C signed constant
LDv (D, V): R[D] = Vois[V]
LDv (D, -V): R[D] = Vois[R[V]]
LDm (D, M): R[D] = SP[M]

- ADD; SUB; MUL

PREFIX: F_ NF_ fZ_ fC_ fS_

Two signed adders are available, so ADD0 and ADD1 can be used simultaneously without restriction on the channel.

ADD0r (D, A, B): R[D] = R[A] + R[B]
ADD0c (D, A, C): R[D] = R[A] + C; C signed constant
ADD0v (D, A, V): R[D] = R[A] + Vois[V]
ADD0v (D, A, -V): R[D] = R[A] + Vois[R[V]]
ADD0m (D, A, M): R[D] = R[A] + SP[M]

IDEM for ADD1, SUB0, SUB1 and MUL; all these operations are signed.

- SHIFT

PREFIX: F_ NF_ fZ_ fC_ fS_

A signed shift operator shifts values to the left or to the right according to the sign of the shift amount. A shift of 0 is equivalent to an assignment.

SHIFTc (D, A, C):
if (C > 0) R[D] = R[A] << C
if (C < 0) R[D] = R[A] >> (-C)

- INV

INV (D, A): R[D] = -R[A]
INVc (D, C): R[D] = -C
INVv (D, V): R[D] = -Vois[V]
INVv (D, -V): R[D] = -Vois[R[V]]
INVm (D, M): R[D] = -SP[M]

Program code example

This code performs the following operation: for a 2x2 neighborhood, set R1 to the average of the neighborhood pixels and set R0 to 255 if the average value is > 128; increment R2 if the pixel is 255 (to count the pixels > 128 at the end of processing).

#include "macros.h"

.initcode
LDc (R0, 0) / NOP
LDc (R1, 0) / NOP
LDc (R2, 0) / NOP
LDc (R3, 0) / NOP
LDc (R4, 0) / NOP
LDc (R5, 0) / NOP
NOP / NOP

.pixelcode0
LDv (R1, V0) / NOP
LDv (R2, V1) / ADD0v (R1, R1, V0)
ADD0v (R2, R2, V2) / ADD1v (R1, R1, V3)
ADD0 (R1, R1, R2) / NOP
SHIFTc (R1, R1, -2) / NOP
fS_SUBc (R7, R1, 128) / NOP
F_LDc (R0, 0) / NF_LDc (R0, 255) ; in this exceptional case LD can be issued twice, because only one of the two is executed
NF_ADD0v (R2, R2, 1) / NOP
NOP / NOP

.pixelcode1
.pixelcode2
.pixelcode3

Taking advantage of the two channels of the VLIW, .pixelcode0 could be written:

.pixelcode0
LDv (R1, V0) / NF_ADD0v (R2, R2, 1)
LDv (R2, V1) / ADD0v (R1, R1, V0)
ADD0v (R2, R2, V2) / ADD1v (R1, R1, V3)
ADD0 (R1, R1, R2) / NOP
SHIFTc (R1, R1, -2) / NOP
fS_SUBc (R1, R1, 128) / NOP
F_LDc (R0, 0) / NF_LDc (R0, 255)
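For comparison, the operation that the program example is stated to perform (average of a 2x2 neighborhood, thresholding at 128 and counting) can be written as the following Python reference. This sketch follows the textual description of the operation only; it does not emulate the instruction set or the flag semantics, and the function name process_image, the layout of the neighborhood and the handling of image borders are assumptions.

```python
def process_image(pixels, threshold=128):
    """Threshold the 2x2 average of each neighborhood and count the hits.

    pixels is a list of equally long rows of integers. Assumption: the
    neighborhood of position (y, x) is made of the pixels (y, x), (y, x + 1),
    (y + 1, x) and (y + 1, x + 1); the last row and column have no result.
    """
    output = []
    count = 0
    for y in range(len(pixels) - 1):
        row = []
        for x in range(len(pixels[y]) - 1):
            total = (pixels[y][x] + pixels[y][x + 1]
                     + pixels[y + 1][x] + pixels[y + 1][x + 1])
            average = total >> 2            # right shift by 2, as SHIFTc (.., -2)
            value = 255 if average > threshold else 0
            if value == 255:
                count += 1                  # counter of pixels above the threshold
            row.append(value)
        output.append(row)
    return output, count


image = [
    [200, 200, 10],
    [200, 200, 10],
    [10, 10, 10],
]
result, above = process_image(image)
```

The division by four is done with a right shift by two positions, which is the role of the SHIFTc instruction with a constant of -2 in the listing.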

Claims

1. A device for processing a data stream coming from a device generating matrices of Nl lines by Nc columns of data, characterized in that it comprises K calculation tiles (TC) and interconnection means (4) for transferring the data stream between the calculation tiles (TC), at least one calculation tile (TC) comprising:

- one or more control units (UC) for providing instructions,

- n processing units (UT), each processing unit (UT) carrying out instructions received from a control unit (UC) on a neighborhood of Vl lines by Vc columns of data,

- a storage unit (UM) for putting the data of the stream in the form of neighborhoods of Vl lines by (n + Vc - 1) columns of data, the storage unit (UM) comprising a block (31) of formatting memories of dimension Vl × Nc and a block (34) of neighborhood registers of dimension Vl × (n + Vc - 1),

- an input/output unit (UES) for routing the data stream between the interconnection means (4) and the storage unit (UM) on the one hand, and between the processing units (UT) and the interconnection means (4) on the other hand.

2. Device according to claim 1, characterized in that the block (34) of neighborhood registers further comprises a column (35) of Vl data registers.

3. Device according to one of claims 1 and 2, characterized in that it comprises a counter CPTC and a counter CPTL, the counter CPTC being incremented after each incoming data item and reset to zero each time it reaches the value Nc, the counter CPTL being incremented each time the counter CPTC reaches the value Nc and reset to zero each time it reaches the value Vl, the incoming data being stored in a cell (33) of the block (31) of formatting memories whose line number corresponds to the value CPTL and whose column number corresponds to the value CPTC.

4. Device according to claims 2 and 3, characterized in that the data column of the block (31) of formatting memories designated by the value CPTC is copied, at each clock tick of the data stream, into the column (35) of Vl data registers and then shifted to a first column (341) of the block (34) of neighborhood registers.

5. Device according to claim 4, characterized in that each data item of the block (31) of formatting memories located in the column designated by the value CPTC and the line designated by NoLine is copied into the column (35) of Vl data registers at the line designated by (CPTL + NoLine + 1) modulo Vl, where NoLine takes all positive integer values between 1 and Vl.

6. Device according to any one of the preceding claims, characterized in that control units (UC) each comprise a memory in which a program is stored, the program being divisible into code segments, each code segment comprising different instructions so as to allow different processing of the data of the stream according to their nature.

7. Device according to claim 6, characterized in that the data stream comes from a video sensor (2) delivering images of Nc columns by Nl lines of pixels, each image comprising an alternation of two types of lines, each type of line having two types of pixels, at least one calculation tile (TC) comprising two control units (UC), the program of each of these control units (UC) being divided into four code segments, each code segment corresponding to the different types of pixels to be taken into account.

8. Device according to claim 7, characterized in that the pixel type depends on its color.

9. Device according to any one of the preceding claims, characterized in that at least one calculation tile (TC) comprises:

- a storage unit (UM) for storing data of the stream,

- a control unit (UC) for providing instructions for performing processing on the stored data,

- a plurality of processing units (UT), each processing unit (UT) executing the instructions received from the control unit (UC) on part of the stored data,

- an input/output unit (UES) for routing the data stream between the interconnection means (4) and the storage unit (UM) on the one hand, and between the processing units (UT) and the interconnection means (4) on the other hand,

the storage unit (UM) comprising a number of data registers equal to the number of processing units (UT).

10. Device according to any one of the preceding claims, characterized in that the interconnection means (4) comprise a number Nb_bus of data buses (41, 42) defined by the relation: Nb_bus = K × (F_pixel / F_archi) + 1, where F_pixel is a frequency of the data stream and F_archi is an operating frequency of the processing units (UT), the frequency F_archi being equal to p × F_pixel, with p a positive integer.

11. Device according to claim 10, characterized in that input/output units (UES) each comprise Nb_bus registers of size log2(Mux_t) bits for managing the write access permissions, and a register of size log2(Nb_bus) bits and a register of size log2(Mux_t) bits for managing the read access permissions, Mux_t being defined by the relation Mux_t = F_archi / F_pixel.

12. Device according to claim 11, characterized in that the calculation tiles (TC) are addressed by a time-division multiplexing whose ratio is defined by the relation: Mux_t = F_archi / F_pixel.

13. Device according to any one of the preceding claims, characterized in that calculation tiles (TC) each comprise a serial block (BS) having a number of data registers equal to the number of processing units (UT) of the calculation tile (TC) considered, each serial block (BS) of a calculation tile (TC) receiving as input data from the processing units (UT) of the calculation tile (TC) considered and being connected at its output to the input/output unit (UES) of said calculation tile (TC).

14. Device according to any one of the preceding claims, characterized in that processing units (UT) each comprise a processor comprising two instruction decoders (21, 22), a first set of multiplexers (23), computing operators (24), a second set of multiplexers (25), data registers (26) and a local memory (27), the instruction decoders (21, 22) receiving instructions from a control unit (UC), the first set of multiplexers (23) directing the data to be processed to one or two computing operators (24) among all the available operators, the second set of multiplexers (25) directing the processed data to the data registers (26), and the processor being able to execute up to two instructions per processor clock cycle.

15. Device according to any one of the preceding claims, characterized in that data of the stream are concatenated with complementary data, calculation tiles (TC) being able to dissociate the data of the stream from the complementary data so that the processing units (UT) of said calculation tiles (TC) execute instructions on the data of the stream according to the complementary data.

16. Device according to claim 15, characterized in that the concatenation of data of the stream with a complementary data item is performed by an insertion operator (120) inserted on the interconnection means (4).