WO2022078982A1

WO2022078982A1 - Method and device for processing data to be supplied as input for a first shift register of a systolic neural electronic circuit

Info

Publication number: WO2022078982A1
Application number: PCT/EP2021/078105
Authority: WO
Inventors: Eric Lenormand; Hadi SAOUD
Original assignee: Thales
Priority date: 2020-10-12
Filing date: 2021-10-12
Publication date: 2022-04-21
Also published as: FR3115136A1; EP4226256A1

Abstract

Disclosed is a method for processing, in an electronic processing device, data to be supplied as input to a first shift register of a systolic neural electronic circuit comprising an array of elementary processors, the first shift register conveying a data vector of size K at each clock cycle intended for K columns of elementary processors of the array, the processing method comprising the following steps implemented by the electronic processing device: - obtaining a series of successive data D(0), D(1), D(2) … D(d) extracted by reading data stored consecutively in a first storage memory, and storing said data in a second memory of the processing device; - generating vectors W(u) according to the stored data, each vector W(u) comprising K components (W(u)_0, …, W(u)_{K-1}) such that W(u)_k = D(K0 + k*S2 + u*S1) for k = 0 to K-1, where K is an integer strictly greater than 1, K0 is a predefined constant, and the steps S1 and S2 are predefined integer constants; - providing, at the input of the shift register, the vector W(u), such that successive values W(u) are provided during successive clock cycles, with u being an integer between 0 and L1-1, and L1 an integer constant strictly greater than one.

Description

TITLE: Process and device for processing data to be supplied as input to a first shift register of a systolic neural electronic circuit

The invention lies in the field of the implementation of convolutional-type neural processing (CNN) on reconfigurable calculation circuits in on-board equipment. The fields of application of embedded neural networks are numerous, for example in image processing, and/or to classify data or detect specific elements on signals, etc.

This involves so-called inference processing, requiring very high computing power (typically hundreds or thousands of Giga-Operations/s), but very repetitive and well suited to parallelization.

An effective way of implementing these processing operations is to use a systolic architecture, which can take the form of an at least two-dimensional array of elementary processors (PEs), reduced to their purely arithmetic part, which are fed column by column by an input data stream propagating progressively downward (in a shift register). This organization, on an FPGA ("field-programmable gate array") or an ASIC ("application-specific integrated circuit"), simplifies place-and-route operations and makes it possible to make the best use of circuit resources and to integrate a large number (hundreds or thousands) of PEs.

Systolic architectures make it possible to obtain very compact implementations of PEs reduced to their simplest expression, which in particular process the data which passes within their reach over time.

There are a number of systolic implementations of CNNs (Ref: https://www.sigarch.org/dnn-accelerator-architecture-simd-or-systolic), for example in CN107578098.

The range of functionalities offered by the systolic grid largely depends on the composition of the data streams sent to it. However, the data flows entering or leaving the circuit are most of the time connected to one or more memories of the DDR (Double Data Rate) type, which must be read or written in sequences of consecutive addresses; this greatly reduces the choice of how these data can be distributed over the calculation grid.

FIG. 1 shows an external high-capacity DDR memory 10 and a systolic neural electronic circuit 20. The circuit 20 is implemented by a programmable logic circuit, for example an FPGA, and the DDR memory 10 is external to the FPGA.

The circuit 20 comprises a grid 22, here two-dimensional (in other embodiments, the dimension is greater than two), of elementary processors PEs arranged in columns and rows, a line 23 of memories m, and a data distribution system 27.

Each memory m is associated with a respective column of processors PEs. A memory m receives, from the data distribution system 27, the data that the PEs of the column which is associated with this memory will process and returns their calculation results to the distribution system 27.

This data is initially supplied to the data distribution system 27 by the DDR memory 10 and the results are delivered by the data distribution system 27 to the DDR memory 10.

The distribution system 27 comprises two buses 28, 29 (which can in some cases be combined into a circular ring bus) adapted to each convey K data (K integer, here equal to 4, K can take any value greater than or equal to 1) every clock cycle.

Bus 28 is a data input bus supplied by the DDR 10 memory and bus 29 is a results output bus intended for the DDR 10 memory.

In the same spirit as the systolic architecture of the processors, to promote compactness and speed, each of these buses 28, 29 operates as a shift register. A stage of the register corresponding to the input bus 28 supplies, in parallel, K (here therefore 4) column memories with data and with the addresses in the memories for which these data are intended. A stage of the register corresponding to the output bus 29 collects, in parallel, K (here therefore 4) data coming from the memories m to which it is connected, and supplies the addresses where these data are written in the memories m. At the following clock cycle, this data is part of the next stage of the register and is thus presented to, or collected from, the next set of K memories m. The other line of memories, vertical, which stores the CNN processing coefficients, is not shown in figure 1.

The DDR 10 memory contains the processing input data, for example groups of images called feature maps. This 3D data is stored dimension by dimension in the memory, typically line by line. Data of one dimension (typically a row) is stored at consecutive addresses in DDR 10 memory. DDR 10 is read in sequences of consecutive addresses (bursts), with an initiation delay at the start of each burst that reduces the useful data rate all the more as the burst is short.

Therefore, unless a significant loss in performance is accepted, DDR 10 reads produce long sequences of adjacent pixels of the same line. The prior art way of connecting the output port of the DDR memory 10 to the distribution system 27 is to use a counter that generates consecutive addresses. Manufacturers like Xilinx provide IPs optimized for this type of function (IP CDMA for example).

Thus, according to the prior art, the dimension (the line) of the input table (or similarly of a result) that is contiguous in DDR is distributed identically over the column memories: once this DDR stream is connected to the data distribution system internal to the FPGA, each of the internal memories receives the pixel following that of its neighbor in the read line.
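The effect of burst length on the useful data rate can be illustrated with a simple fixed-overhead model. The sketch below is only an assumption made for illustration: the function name, the model (one datum per cycle after a fixed initiation delay) and the delay of 20 cycles are hypothetical and are not taken from the present description or from any DDR specification.

```python
def burst_efficiency(burst_length, initiation_delay):
    """Fraction of cycles that carry useful data in a fixed-overhead model.

    Assumes (hypothetically) one datum per cycle once the burst has started,
    preceded by `initiation_delay` idle cycles.
    """
    return burst_length / (burst_length + initiation_delay)


if __name__ == "__main__":
    # The delay of 20 cycles is an arbitrary illustrative value.
    for length in (4, 16, 64, 256):
        print(length, round(burst_efficiency(length, initiation_delay=20), 2))
```

Under this model, the efficiency increases with the burst length, which is why long runs of consecutive addresses are favored.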

The dominant operation in CNNs is convolution, which is the weighted sum of consecutive or fixed-stepped input data (called dilation). Convolution can be applied to input data with several dimensions (row, column, channel, ...). In the systolic architecture of the neural circuit 20, the calculations are parallelized by distributing one of the dimensions (generally the line) on the columns of the grid.

A PE of one column then needs to access data that has been placed in neighboring columns. To achieve this, physical paths that give access to a few neighboring memories at the head of the column are generally added to the architecture; beyond a certain width, these extensions come at the cost of increased complexity and a risk of limiting the clock frequency. In this extended configuration, however, the architecture can only execute convolutions with a width at most equal to the number of memories accessible from a column.

FIG. 2 shows the data distribution of a row of a data structure (consecutive data of a row named D1, D2, D3, etc.), possibly multi-dimensional, presented as input to the systolic grid 22 for a convolution of the same dimension as the array of data stored in DDR 10. Only the row dimension can pose a problem, the other dimensions of the array being local to each memory m, and directly accessible to the PEs of the column associated with the memory m. In the figures, the data above the columns are those stored in the columns by the distribution system 27, before the start of the calculations by the columns of PEs.

In figure 2, Di is thus supplied to the memory mi, for i = 1 to 5. In this corresponding example, a column sees 3 memories, which makes it possible to make convolutions of maximum width equal to 3.

A frequent case in CNNs is that of so-called convolution stages with a “stride” of value Str, where the calculations to be made are no longer, as in figure 2, convolution of (D1, D2, D3), convolution of (D2, D3, D4), convolution of (D3, D4, D5), convolution of (D4, D5, D6), but, in the case where the stride Str is equal to 2, convolution of (D1, D2, D3), (D3, D4, D5), (D5, D6, D7), ...

In these cases, in the prior art, even though the PEs perform the calculations, one out of every 2 columns of PEs (more generally, (Str-1) columns out of every Str columns) calculates results which will not be used, which translates into an underutilization of resources.
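This underutilization can be made concrete with a small count. The sketch below assumes the prior-art layout in which column c computes the convolution window starting at datum c; the function name and the 0-based column numbering are illustrative choices, not taken from the description.

```python
def useful_columns(num_columns, stride):
    """Columns whose result is kept when column c computes the window starting at datum c.

    With a stride Str, only the windows starting at 0, Str, 2*Str, ... are needed.
    """
    return [c for c in range(num_columns) if c % stride == 0]


if __name__ == "__main__":
    for stride in (1, 2, 3):
        kept = useful_columns(12, stride)
        print(f"Str={stride}: {len(kept)}/12 columns produce a used result")
```

With Str equal to 2, half of the columns compute a result that is discarded; with Str equal to 3, two thirds do.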

Other types of operations are not feasible with the supply of data to the systolic neural circuit 20 by the DDR memory 10 as performed by the prior art, or lead to a loss of calculation resources.

To this end, according to a first aspect, the invention proposes a method for processing, in an electronic processing device, data to be supplied as input to a first shift register of a systolic neural electronic circuit comprising a grid of elementary processors, said first shift register conveying a data vector of size K at each clock cycle to K columns of elementary processors of said grid, said processing method comprising the following steps implemented by said electronic processing device:

- obtaining a sequence of successive data D(0), D(1), D(2) ... D(d) extracted by reading said data classified consecutively in a first storage memory and storing said data in a second memory of the processing device;

- generation of vectors W(u) as a function of said stored data, each vector W(u) comprising K components (W(u)_0, ..., W(u)_{K-1}) such that W(u)_k = D(K0 + k*S2 + u*S1) for k = 0 to K-1, where K is an integer strictly greater than 1, K0 is a predefined constant, and the steps S1 and S2 are predefined integer constants;

- provision, at the input of said shift register, of the vector W(u), such that successive values W(u) are provided during successive clock cycles, with u an integer ranging from 0 to L1-1, and L1 a constant integer strictly greater than one.

The invention thus makes it possible, through this input/output data management solution, to implement adequate data flows and thus to reduce the limitations of the prior art in the performance of a systolic neural circuit, such as convolutions restricted to widths 1 to 3, loss of performance in the case of a stride greater than 1, etc.

In embodiments, a processing method according to the invention further comprises one or more of the following characteristics:

- the following steps are implemented by said electronic processing device:

o obtaining a data vector w of size K provided by a second shift register of the systolic neural circuit, said second shift register conveying at each clock cycle a data vector of size K corresponding to the results of K respective columns of elementary processors of said grid;

o the K components of w being (w_0, ..., w_{K-1}), storage in the second memory of w_k, k = 0 to K-1, at the address ind0 + K*C*S2 + x*S1 + k*S2, where ind0 is a predefined constant, C is the column number and x is an index ranging from 0 to L1-1;

o supply to the first storage memory of a sequence of successive data extracted by reading said data classified consecutively in the second memory;

- for k = 0 to K-1, W(u)_k = D(K0 + Σ_{i=1 to n} u_i*S1_i + k*S2), where the steps S1_i and S2 are predefined constants, the vector being provided at the input of said shift register, with u_i an integer ranging from 0 to L1_i-1, and L1_i a constant strictly greater than one;

- the following steps are implemented by said electronic processing device:

o obtaining a vector of data w of size K provided by a second shift register of the systolic neural circuit, said second shift register conveying at each clock cycle a data vector of size K corresponding to the results of K respective columns of elementary processors of said grid;

o the K components of w being (w_0, ..., w_{K-1}), the storage in the second memory of w_k, k = 0 to K-1, is carried out at the address ind0 + 4*C*S2 + Σ_{i=1 to n} (x_i*S1_i) + k*S2, where ind0 is a predefined constant, C is the column number and x_i is an integer index ranging from 0 to L1_i-1;

o supply to the first storage memory of a sequence of successive data extracted by reading said data classified consecutively in the second memory;

- the second memory comprises at least K distinct memory banks, and the storage of the data in the second memory is carried out in such a way as to ensure that data which are components of the same vector W(u) cannot be stored in the same bank.

According to a second aspect, the present invention proposes a device for processing data to be supplied as input to a first shift register of a systolic neural electronic circuit comprising a grid of elementary processors, said first shift register being adapted to convey a vector of data of size K at each clock cycle intended for K columns of elementary processors of said grid, said electronic processing device being adapted to obtain a sequence of successive data D(0), D(1), D(2)...D(d) extracted by reading said data classified consecutively in a first storage memory and to store said data in a second memory of the processing device; said electronic processing device being adapted to generate vectors W(u) as a function of said stored data, each vector W(u) comprising K components (W(u)_0, ..., W(u)_{K-1}) such that W(u)_k = D(K0 + k*S2 + u*S1) for k = 0 to K-1, where K is an integer strictly greater than 1, K0 is a predefined constant, and the steps S1 and S2 are predefined integer constants; said electronic processing device being adapted to supply, at the input of said shift register, the vector W(u), such that successive values W(u) are supplied during successive clock cycles, with u an integer ranging from 0 to L1-1, and L1 an integer constant strictly greater than one.

These characteristics and advantages of the invention will appear on reading the following description, given solely by way of example, and made with reference to the appended drawings, in which:

[Fig 1] Figure 1 shows a schematic view of a 2D grid for calculating a systolic neural electronic circuit;

[Fig 2] Figure 2 illustrates an input data distribution for a systolic neural circuit;

[Fig 3] Figure 3 illustrates another distribution of input data for a systolic neural circuit, which may be implemented in one embodiment of the invention;

[Fig 4] Figure 4 illustrates another distribution of input data for a systolic neural circuit, which may be implemented in one embodiment of the invention;

[Fig 5] Figure 5 illustrates another distribution of input data for a systolic neural circuit, which may be implemented in one embodiment of the invention;

[Fig 6] Figure 6 illustrates another distribution of input data for a systolic neural circuit, which may be implemented in one embodiment of the invention;

[Fig 7] Figure 7 illustrates another distribution of input data for a systolic neural circuit, which may be implemented in one embodiment of the invention;

[Fig 8] Figure 8 shows the steps of a method in one embodiment of the invention;

[Fig 9] Figure 9 depicts a systolic neural system in one embodiment of the invention;

Figure 9 shows a systolic neural system 1 in one embodiment of the invention. This system is for example embedded in an aircraft.

The system 1 comprises a memory, for example mass storage, an electronic circuit physically implementing a neural network and a processing device 40.

The memory is for example the DDR memory 10 as mentioned previously with reference to FIG. 1, the electronic circuit is the systolic neural circuit 20 on FPGA as described in relation to FIG. 1, and the processing device 40 according to the invention is interposed between the DDR memory 10 and the circuit 20, for example in the FPGA.

The processing device 40 comprises a memory MP 41 and a control block 42 comprising in particular a unit 420 for timing the operations.

The control block 42 comprises for example a memory and a microprocessor (not shown), the memory comprising software instructions which, when executed on the microprocessor, implement the following operations, described in Figure 8.

Thus in a step 101, the control block 42 obtains a sequence of successive data D(0), D(1), D(2)...D(d) extracted by reading, for example in bursts, said data classified consecutively in the memory 10 (for example the data of an image line, d being for example a value lying in the range from 16 to 256), and stores said data in the memory MP 41.

For example, the data D(n) are read in bursts from the DDR 10 in the form of vectors V(t), each vector containing, for K=4, the data D(4t), D(4t+1), D(4t+2), D(4t+3), which are written consecutively in memory MP 41: data D(n) is written at address ADO+n.

In a step 102, the control block 42 generates vectors W(u) as a function of said stored data, according to formula (1) described below.

Each vector W(u) is made up of 4 components (W(u)_0, ..., W(u)_{K-1}) (remember that in the example K is equal to 4, as explained above with reference to Figure 1, but K can take any value).

In a step 103, the control block 42 successively supplies, at the input of the shift register 28, a vector W(u), for u an integer ranging from 0 to L1 -1 , and L1 a predefined constant strictly greater than one (for example strictly greater than 2), at each clock stroke, accompanied by the indication of the addresses of the K consecutive columns for which the K components of W(u) are intended.

These vectors, applied as input to the internal distribution system 27, produce the expected data distribution.

In one embodiment, W(u)_k = D(K0 + u*S1 + k*S2), k = 0 to K-1,

where K0 is a predefined constant and the steps S1 and S2 are predefined constants, non-zero or zero depending on the case (in the case of an implementation based on a memory with a prime number of banks, S2 will not be a multiple of the number of banks).

The memory mk thus contains the datum D(K0 + u*S1 + k*S2), for each u from 0 to L1-1, after the input shift register 28 has presented this datum W(u)_k to it, accompanied by the indication of the memory mk.

The values of S1, S2, L1 and K0 are initially chosen so as to give rise to the desired distribution.
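Steps 102 and 103 in this embodiment can be sketched as follows. This is a minimal illustration, assuming that the memory MP 41 is modeled as a Python list D indexed from 0; the function names and the parameter values in the example (K0 = 0, S1 = 1, S2 = 2, L1 = 3) are chosen for illustration only.

```python
def generate_vectors(D, K, K0, S1, S2, L1):
    """Return the L1 vectors W(u), with W(u)[k] = D[K0 + u*S1 + k*S2]."""
    return [[D[K0 + u * S1 + k * S2] for k in range(K)] for u in range(L1)]


def column_memories(vectors, K):
    """Contents of the K column memories once every W(u) has been delivered."""
    return [[w[k] for w in vectors] for k in range(K)]


if __name__ == "__main__":
    D = [f"D{n}" for n in range(16)]   # stand-in for the data stored in MP 41
    W = generate_vectors(D, K=4, K0=0, S1=1, S2=2, L1=3)
    for u, w in enumerate(W):
        print(f"W({u}) =", w)
    for k, m in enumerate(column_memories(W, 4)):
        print(f"column {k}:", m)
```

With these values, column 0 receives D0, D1, D2 and column 1 receives D2, D3, D4, so that D2 is held in two column memories.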

In a more general embodiment, with k = 0 to K-1:

W(u)_k = D(K0 + Σ_{i=1 to n} u_i*S1_i + k*S2)    formula (1)

with u_i ranging from 0 to L1_i-1, i = 1 to n (n an integer greater than or equal to 2), and L1_i a non-zero predefined constant, for example strictly greater than one (for example strictly greater than 2) (L1_i = 1 amounts to adding the sum of the S1_i to K0 and removing the sum from the formula); the steps S1_i and S2 are predefined constants.

For the supply of data to the circuit 20, the u_i are incremented in order of increasing i: for i<j, for each value of u_j starting from 0, u_i is incremented from 0 to L1_i-1 and, for each value of u_i, the vector W(u) is supplied; then u_j is incremented by 1, u_i is reset to 0, and the process starts again.

The memory mk thus contains the data D(K0 + Σ_{i=1 to n} u_i*S1_i + k*S2), for each u_i from 0 to L1_i-1, i = 1 to n, after the input shift register 28 has presented this data W(u)_k to it, accompanied by the indication of the memory mk.

The values of S1_i, S2, L1_i and K0 are initially chosen so as to give rise to the desired distribution.

If we consider all the columns of the grid (for example C_tot columns, with C_tot ranging from 1 to a few hundred), each column memory mC thus contains the data D(K0 + Σ_{i=1 to n} u_i*S1_i + C*S2), C = 0 to C_tot-1.

All of these operations performed by the control block 42 are clocked by the operations sequencing unit 420 (all the operations are clocked at the rate of the clock 20, which is the clock, or one of the clocks, used by the FPGA).
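The general formula and the order of supply can be sketched as follows. This is only an illustration, assuming that the index of lowest rank varies fastest, as described above; the function name and the example values (two dimensions with steps 2 and 50, lengths 3 and 2, S2 = 1) are hypothetical.

```python
from itertools import product


def generate_vectors_nd(D, K, K0, S1, L1, S2):
    """Yield (u, W(u)) with W(u)[k] = D[K0 + sum_i u[i]*S1[i] + k*S2].

    S1 and L1 are sequences of equal length n; u[0] varies fastest.
    """
    # product() varies its last argument fastest, so the dimensions are
    # iterated in reverse and each tuple is flipped back to (u_1, ..., u_n).
    for rev in product(*(range(length) for length in reversed(L1))):
        u = rev[::-1]
        base = K0 + sum(ui * si for ui, si in zip(u, S1))
        yield u, [D[base + k * S2] for k in range(K)]


if __name__ == "__main__":
    D = list(range(200))   # the datum D(n) is represented by the integer n
    for u, w in generate_vectors_nd(D, K=4, K0=0, S1=(2, 50), L1=(3, 2), S2=1):
        print(u, w)
```

The first index runs through 0, 1, 2 before the second index is incremented, so the vectors starting at D(0), D(2), D(4) are supplied before those starting at D(50), D(52), D(54).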

The formula (1) is defined so as to obtain the desired distribution of the data between the columns of processors, according to the definition of the convolutions to be calculated.

By way of example, FIG. 7 shows, vertically, the data recorded successively in each of the memories located beneath them, K0 having been fixed at 0, for the following formula taken from general formula (1): W(u)_k = D(K0 + u*S1 + v*S3 + k*S2), with u ranging from 0 to L1-1 and v ranging from 0 to L3-1, L1 = 3, S1 = 2, L3 = 2 and S3 = 50, memory m1 being associated with column No. 0 of the PEs, memory m2 with column No. 1, ... memory mK with column No. K-1.

This type of distribution introduces copies, and cannot be done just by loading the internal distribution system 27 in the order in which the data comes out of the DDR 10. The processing device 40 according to the invention is suitable for writing a sequence of data in the column memories in a different order from that in which they were read in DDR 10 memory. Similarly, the processing device 40 according to the invention is suitable for reading a data sequence in the column memories in an order different from that in which they are then written in DDR 10.

In a step 201, the processing device 40 reads the results delivered by the shift register 29.

The order in which the data in memories m can be read is constrained by the operation of bus 29, which reads the K (=4 here) data which are located at the same address in 4 consecutive columns.

The data to be exported by the processing device 40 to the DDR 10 are distributed in the memories m with a pitch equal to S2 from one column to another and a pitch S1 within each column. The read will produce vectors W(C,x) containing the data of index 4*C*S2 + x*S1 + k*S2 with 0<=k<K=4, assuming here that the index starts at 0 (x being an index ranging from 0 to L1-1); if we consider that the index does not start at zero but at ind0, then the formula giving the index is ind0 + 4*C*S2 + x*S1 + k*S2.

Note: for a generic formula, x, S1 and L1 become vectors of the same size n; letting x_i be the components of the vector index x (x_i an integer ranging from 0 to L1_i-1), x*S1 is replaced by Σ_{i=1 to n} (x_i*S1_i) (in the above reference to Figure 7, where S1 and S3 are used as the steps within the column, S1 and S3 are then the two components of the vector S1).

In a step 202, the processing device 40 writes the K (=4) data of the vector W(C,x) to the addresses 4*C*S2 + x*S1 + k*S2, with 0<=k<K, in memory MP 41; a sequence of consecutive-index data is thus written, with data of consecutive index written to consecutive addresses.

In a step 203, the processing device 40 writes these sequences extracted from the MP memory 41 in the form of long bursts in DDR 10 at consecutive addresses.
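Steps 201 to 203 can be sketched as follows. This is a minimal illustration under several assumptions: the memory MP 41 is modeled as a flat Python list, C is treated as the index of a group of K consecutive columns, and the parameter values (S1 = 1, S2 = 3, L1 = 3, two groups of columns) are hypothetical values chosen so that the addresses cover a contiguous range exactly once.

```python
def scatter_results(vectors, K, S1, S2, ind0, size):
    """Write result vectors into a flat model of MP 41.

    `vectors` maps (C, x) to the K results read from the output register.
    Component k goes to address ind0 + K*C*S2 + x*S1 + k*S2.
    """
    mp = [None] * size
    for (C, x), w in vectors.items():
        for k in range(K):
            mp[ind0 + K * C * S2 + x * S1 + k * S2] = w[k]
    return mp


if __name__ == "__main__":
    K, L1, S1, S2, groups = 4, 3, 1, 3, 2   # illustrative values only
    # Model of the column memories: the datum of index i is the integer i.
    vectors = {
        (C, x): [K * C * S2 + x * S1 + k * S2 for k in range(K)]
        for C in range(groups) for x in range(L1)
    }
    mp = scatter_results(vectors, K, S1, S2, ind0=0, size=K * groups * L1)
    print(mp)   # data of consecutive index sit at consecutive addresses
```

Once the scatter is complete, the content of the list can be read out at consecutive addresses, which corresponds to the long bursts written to the DDR 10.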

The processing device 40 according to the invention is suitable for carrying out data distributions according to a predefined formula, in the memories m while remaining compatible with the constraints of the DDR accesses and of the internal distribution system.

On the DDR 10 side, the data stream read by the processing device 40 is the series of consecutive pixels on a row of the input image table.

On the side of the internal distribution system 27, the sequence conveyed is a sequence of vectors of K (here 4) data, intended for 4 consecutive column memories, which makes it possible to use the buses 28, 29 as a shift register which reduces the interconnections and simplifies the clock distributions in the FPGA.

The processing device 40 is of a size that is independent of the size of the computing grid that it supplies and is moreover inexpensive in terms of resources. It is possible to load or read the column memories of the grid with data patterns conforming to formula (1) and this considerably widens the field of action of the architecture.

The functions of storage, then of re-reading in a different order are easily carried out in the processing device 40 with a dual-port MP memory 41 if it only produces one piece of data per cycle. The objective in one embodiment being to produce a vector with several (K) components (4 in the example), this requires a parallel memory functionality described below.

During one cycle, one vector is written and another vector is read from MP 41, clocked from timing unit 420.

In MP 41:

- Concerning the exchanges of data in read or write mode between the processing device 40 and the circuit 20, the vector is read or written in MP 41 at addresses separated by a step S, whereas

- this step is fixed at 1 (consecutive addresses) in MP 41 for read or write exchanges between the processing device 40 and the DDR 10.

In one embodiment, the processing device 40 contains two pairs of address generators respectively for transfers between DDR 10 and pivot memory MP 41 (address generators 421, 422) and transfers between pivot memory MP 41 and circuit 20 (address generators 423, 424).

An address generator is a multi-dimensional counter that provides the sequence of addresses AdO + n1*S1 + n2*S2 + ... with 0<=ni<Li, the ni incrementing in lexicographic order, and the Si being integers giving the incrementation steps in each dimension of the counter.

The memory MP 41 being of limited capacity, the transfer of data blocks between DDR 10 and circuit 20 takes place for example in several consecutive phases. For reasons of efficiency, these transfers are pipelined: sub-block N is read at the same time as sub-block N+1 is written. Both sides of MP memory 41 are mostly active simultaneously. The sequencing of these operations is controlled by the sequencing unit 420, which sequences the inputs/outputs and executes firmware containing in particular the parameters of the address generators.

The invention makes it possible to significantly extend the range of CNN operations of the systolic circuit by breaking out of the prior art data distribution scheme presented above, where adjacent pixels are distributed in the same order in the column memories m.
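Such a counter can be sketched as a Python generator. This is only an illustration: the function and parameter names are hypothetical, and "lexicographic order" is taken literally here, so that the last index varies fastest; the order of the arguments can be reversed if the first index should vary fastest instead.

```python
from itertools import product


def address_generator(ad0, steps, limits):
    """Yield ad0 + n1*S1 + n2*S2 + ... for 0 <= ni < Li.

    `steps` holds the Si and `limits` the Li. The tuples (n1, n2, ...) are
    visited in lexicographic order, i.e. the last index varies fastest.
    """
    for n in product(*(range(limit) for limit in limits)):
        yield ad0 + sum(ni * si for ni, si in zip(n, steps))


if __name__ == "__main__":
    # Two dimensions: 2 blocks 50 addresses apart, 3 addresses 2 apart in each.
    print(list(address_generator(100, steps=(50, 2), limits=(2, 3))))
```

With a single dimension and a step of 1, the generator reduces to the consecutive-address counter used on the DDR side.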

The functionality limitations of the architecture (convolution of width 1 to 3, loss of performance in the event of a stride greater than 1) are thus lifted thanks to an adapted distribution of the row dimension data.

FIGS. 3 to 5 give examples of data distributions in a system implementing the invention, for which the matrix of PEs is used at full capacity. They make possible processing that was not possible with the distribution of the prior art, where the datum of index i0+i on the line is written in the memory of column i and only in this memory.

Figure 3 illustrates an example distribution in one embodiment (where S1 = 1, S2 = 2), in the case of a convolution of width 5 and stride 1: the result of the column associated with memory m2 must be the convolution of (D1, D2, D3, D4, D5), the result of the column associated with the memory m3 must be the convolution of (D2, D3, D4, D5, D6), the result of the column associated with the memory m4 must be the convolution of (D3, D4, D5, D6, D7), etc. In this case of implementation of the invention, a datum (for example here the datum D2 or D3 or D4 or D5) can thus be stored in several memories while limiting the non-useful calculations of the PEs.

The distribution does not include a copy in the case of figure 4, also corresponding to a convolution of width 5, but with a stride Str equal to 2. The result of the column associated with the memory m2 must be the convolution of (D1, D2, D3, D4, D5), the result of the column associated with memory m3 must be the convolution of (D3, D4, D5, D6, D7), the result of the column associated with memory m4 must be the convolution of (D5, D6, D7, D8, D9), etc. In this case of implementation of the invention (where S1=1, S2=2), the convolution carried out by the column of PEs associated with the memory m3 will use the data inside the surface delimited in dotted lines: the data D3 to D6 are in memory m3 and D7 is in m4.

FIG. 5 illustrates an example of distribution in one embodiment (where L1=1 and S2=2, and therefore S1 is arbitrary), in the case of so-called “dilation” stages where the difference, i.e. the dilation value, between the successive data of the same convolution is not 1. In the embodiment of the invention illustrated here, the convolution width is 3 and the dilation value is 2; the result of the column associated with memory m2 must be the convolution of (D1, D3, D5), the result of the column associated with memory m3 must be the convolution of (D3, D5, D7), the result of the column associated with memory m4 must be the convolution of (D5, D7, D9).
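The dilation case can be checked with a short sketch. It assumes, as in the example of figure 2, that a column sees its own memory and its two immediate neighbors, and it takes K0 = 1 so that memory m1 holds D1; the function names and the value of K0 are illustrative assumptions.

```python
def column_contents(num_columns, K0, S2):
    """Datum index held by each column memory when L1 = 1: D(K0 + k*S2)."""
    return [K0 + k * S2 for k in range(num_columns)]


def visible_window(contents, k):
    """Data reachable by column k, assuming it sees its left and right neighbors."""
    return contents[k - 1:k + 2]


if __name__ == "__main__":
    mem = column_contents(6, K0=1, S2=2)   # m1..m6 hold D1, D3, D5, ...
    for k in range(1, 5):
        print(f"m{k + 1}:", [f"D{i}" for i in visible_window(mem, k)])
```

With S2 = 2, the column of m2 reaches D1, D3, D5 and the column of m3 reaches D3, D5, D7, which are the operands listed above for a width of 3 and a dilation value of 2.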

Finally, FIG. 6 illustrates a case where several stages of the CNN graph are chained on the grid 22, from an initial data distribution in one embodiment of the invention. This mode of operation, called fusion, has the advantage of avoiding external storage of intermediate results and drastically reducing the need for input-output bandwidth. Fusion is an important factor in improving the performance of circuit 20: since the systolic architecture allows the integration of high computing power, this is accompanied by a need of the same order in terms of input-output rate, which usually becomes the limiting factor for at least part of the CNN processing. There may be, among the merged stages, stages with a stride value Str greater than 1, as illustrated in figure 6 (a convolution stage of width K = 5 and stride value Str equal to 1 and a convolution stage of width K = 3 and stride value Str equal to 2). If the data is distributed as shown in Figure 6 (where L1 = 3, S1 = 1, S2 = 2), the PEs can chain the calculations for the two stages without the need for data redistribution at the end of the first stage. Fusion applies to any number of stages, the limitation in practice coming from other criteria, including the capacity of the column memories m.

In summary, the invention makes it possible to significantly broaden the field of operations of the systolic neural circuit 20 while maintaining the compactness of the systolic grid architecture and of its internal distribution system 27. The intermediate input/output management processing implemented by the processing device 40 between the DDR memory 10 and the distribution system 27 makes it possible to carry out particular data distributions in the grid 22 and to process convolutions with arbitrary characteristics (kernel sizes, stride, dilation), making it possible to cover a wide range of CNNs, while optimizing the performance of the grid.

In addition, the type of data distribution according to the invention opens the door to CNN processing by stage merging, which divides the I/O throughput requirement from or to the DDR by the number of merged stages. This artifice avoids the capping of the performance of the circuit on a part of the graph (in general the upper part of the graph, where the number of feature maps is lower).

The resource cost of the processing device 40 of the invention remains low compared to the density of current FPGAs: less than 4% for an FPGA of size medium (7Z045 from Xilinx). The size of the device is moreover the same whatever the size of the grid of PEs.

Furthermore, since an individual memory is generally only capable of storing or producing a single piece of data per clock cycle (one for each port if it is a dual-port memory), the device processing 40 using a parallel memory MP 41, i.e. capable of manipulating a multi-component vector at each clock cycle, and taking into account the data patterns to be manipulated (multi-dimensional affine addressing laws). The parallel memory 41 comprises several memory banks M 410 as illustrated in FIG. 9. To access a vector in such a set of banks M 410 at each cycle, it must be ensured that the components of the vectors are always found in distinct banks. .

One possible implementation is a parallel memory composed of a prime number of banks. This parallel memory technique is known (described in particular by D. H. Lawrie and C. R. Vora, "The prime memory system for array access", in 1982).

The pivot memory MP 41 is therefore a set of K identical dual-port memory banks M 410 (allowing read or write access on each of its ports at each cycle), numbered from 1 to K. These memories M 410 are seen collectively as a larger memory (MP), the capacity of which is the sum of the individual capacities. Address A in memory MP 41 corresponds to address A/K (integer division) in the memory bank M 410 with number A%K (i.e. A modulo K).
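The address mapping described above can be sketched as follows. This is a minimal illustration, not the circuit's implementation: the helper names `split_address` and `join_address` are hypothetical, and banks are identified by the zero-based value A % K exactly as the modulo produces it.

```python
def split_address(a, num_banks):
    """Map a virtual address A of the memory MP to (bank number, address inside that bank)."""
    bank = a % num_banks    # bank number A % K
    local = a // num_banks  # address A / K (integer division) inside the bank
    return bank, local


def join_address(bank, local, num_banks):
    """Inverse mapping: rebuild the virtual address from (bank, local address)."""
    return local * num_banks + bank


# Example with 5 banks: virtual address 13 lies in bank 3 at local address 2.
print(split_address(13, 5))
```

Consecutive virtual addresses therefore fall in consecutive banks, which is what allows several components to be read in the same cycle.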

In the direction from the DDR memory 10 to the grid 22, the objective of the processing device 40 is to be able to read in one clock cycle a vector of K (here 4) components y(k) = MP(ADR_A + k*S) with 0 <= k <= K-1 (this corresponds to a case where N0 = 0, S1 = 0, S = S2 and S3 = 0). This is only possible if the K components are in separate banks, since a bank can only provide one value per cycle.

So (ADR_A+k1*S)%K must be different from (ADR_A+k2*S)%K whenever k2 is different from k1, for k1 and k2 between 0 and K-1.

The condition is satisfied if ((k1-k2)*S)%K is zero only when k2 = k1, which is always the case if S and K are coprime. In the current implementation, K = 5, a prime number, which guarantees that all the address vectors ADR_A+k*S in the virtual memory MP 41 can be accessed as long as S is not a multiple of 5, a pathological case that can always be avoided or worked around.
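The bank-conflict condition can be checked numerically with a short sketch. The function names `banks_hit` and `is_conflict_free` are hypothetical, and the ranges of tested addresses and steps are arbitrary choices for the example.

```python
from math import gcd


def banks_hit(adr_a, step, num_components, num_banks):
    """Bank number reached by each component k of the vector MP(adr_a + k*step)."""
    return [(adr_a + k * step) % num_banks for k in range(num_components)]


def is_conflict_free(adr_a, step, num_components, num_banks):
    """True when every component of the vector falls in a different bank."""
    banks = banks_hit(adr_a, step, num_components, num_banks)
    return len(set(banks)) == len(banks)


NUM_BANKS = 5  # prime number of banks, as in the implementation described above

# Every step that is not a multiple of 5 is coprime with 5 and gives a conflict-free access.
for step in range(1, 21):
    ok = all(is_conflict_free(adr, step, NUM_BANKS, NUM_BANKS) for adr in range(50))
    assert ok == (gcd(step, NUM_BANKS) == 1)

print(banks_hit(7, 3, 5, NUM_BANKS))
```

With a step that is a multiple of 5, all components land in the same bank, which is the pathological case mentioned in the text.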

In addition to the memory banks themselves, the processing device 40 further comprises, on each side (i.e. DDR memory 10 side and circuit 20 side) and in each direction (read and write), logic blocks named Shufflers 430, 431, 432, 433, which connect each memory bank to the component of the vector that it must read or write: if the vector targets the addresses AD0 + n*S, 0<=n<4, component n is handled by the bank numbered (AD0 + n*S)%5. It is recalled that there is also, on each side, an address converter 421, 422, 423, 424 which transforms a virtual address A into 5 addresses, one for each of the banks: if component n of a vector is intended for bank b = (AD0 + n*S)%5, the address in this bank is equal to (AD0 + n*S - b)/5.
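The combined role of the Shuffler and the address converter can be pictured with a small software model. This is a sketch under assumptions, not the hardware design: the names `route_vector` and `read_vector`, the list-of-lists model of the banks and the sample data are all hypothetical.

```python
NUM_BANKS = 5


def route_vector(ad0, step, num_components=4, num_banks=NUM_BANKS):
    """For each component n, return (bank b, address inside bank b).

    b = (AD0 + n*S) % 5 plays the role of the Shuffler routing,
    (AD0 + n*S - b) / 5 plays the role of the address converter.
    """
    routes = []
    for n in range(num_components):
        virtual = ad0 + n * step
        b = virtual % num_banks
        local = (virtual - b) // num_banks
        routes.append((b, local))
    return routes


def read_vector(banks, ad0, step, num_components=4):
    """Gather one vector, taking each component from the bank that holds it."""
    return [banks[b][local] for b, local in route_vector(ad0, step, num_components, len(banks))]


# Model of the memory MP: virtual address a is stored in bank a % 5 at address a // 5.
flat = list(range(100, 140))
banks = [[flat[j * NUM_BANKS + b] for j in range(len(flat) // NUM_BANKS)] for b in range(NUM_BANKS)]

print(route_vector(3, 2))
print(read_vector(banks, 3, 2))
```

In this model the four components of the vector at AD0 = 3 with S = 2 come from four different banks, so they could be fetched in a single cycle.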

The invention can be implemented in embedded systems of a very varied nature, for example in the processing of images by neural network on board a satellite, and/or configurable on a radiation-hardened FPGA, and/or more generally relating to classification, detection or recognition on real-time images or signals.

Claims

1. Method for processing, in an electronic processing device (40), data to be supplied as input to a first shift register (28) of a systolic neural electronic circuit (1) comprising a grid of elementary processors (22), said first shift register conveying a data vector of size K at each clock cycle to K columns of elementary processors (PE) of said grid, said processing method comprising the following steps implemented by said electronic processing device: obtaining a sequence of successive data D(0), D(1), D(2) ... D(d) extracted by reading said data classified consecutively in a first storage memory and storing said data in a second memory (m) of the processing device;

- generation of vectors W(u) as a function of said stored data, each vector W(u) comprising K components (W(u)_0, ..., W(u)_{K-1}) such that W(u)_k = D(K0 + k*S2 + u*S1) for k = 0 to K-1, where K is an integer strictly greater than 1, K0 is a predefined constant, and the steps S1 and S2 are predefined integer constants;

- provision, at the input of said shift register, of the vector W(u), such that successive values W(u) are provided during successive clock cycles, with u an integer ranging from 0 to L1-1, and L1 an integer constant strictly greater than one.

2. Processing method according to claim 1, further comprising the following steps implemented by said electronic processing device (40): obtaining a data vector w of size K supplied by a second shift register (29) of the systolic neural circuit (1), said second shift register conveying at each clock cycle a data vector of size K corresponding to the results of K respective columns of elementary processors of said grid;

- the K components of w being (w_0, ..., w_{K-1}), storage in the second memory (m) of w_k, k = 0 to K-1, at the address ind0 + K*C*S2 + x*S1 + k*S2, where ind0 is a predefined constant, C is the column number and x is an index ranging from 0 to L1-1; supplying the first storage memory (10) with a sequence of successive data extracted by reading said data classified consecutively in the second memory (m).

3. Processing method according to claim 1 or 2, according to which, for k = 0 to K-1, W(u)_k = D(K0 + Σ_{i=1 to n} u_i*S1_i + k*S2), the steps S1_i and S2 being predefined constants; the vector being provided at the input of said shift register, with u_i an integer ranging from 0 to L1_i-1, and L1_i a constant strictly greater than one.

4. Processing method according to claim 3, further comprising the following steps implemented by said electronic processing device (40): obtaining a data vector w of size K supplied by a second shift register (29) of the systolic neural circuit (1), said second shift register conveying at each clock cycle a data vector of size K corresponding to the results of K respective columns of elementary processors of said grid;

- the K components of w being (w_0, ..., w_{K-1}), the storage in the second memory of w_k, k = 0 to K-1, is carried out at the address ind0 + 4*C*S2 + Σ_{x_i = 0 to L1_i-1} (x_i * S1_i) + k*S2, where ind0 is a predefined constant, C is the column number and x_i is an integer index ranging from 0 to L1_i-1; supplying the first storage memory (10) with a sequence of successive data extracted by reading said data classified consecutively in the second memory (m).

5. Processing method according to one of the preceding claims, according to which the second memory (m) comprises at least K distinct memory banks, and the storage of the data in the second memory is carried out in such a way as to ensure that data which are components of the same vector W(u) cannot be stored in the same bank.

6. Device (40) for processing data to be supplied as input to a first shift register (28) of a systolic neural electronic circuit (1) comprising a grid of elementary processors (22), said first shift register being adapted to convey a vector of data of size K at each clock cycle to K columns of elementary processors (PE) of said grid, said electronic processing device (40) being adapted to obtain a sequence of successive data D(0), D(1), D(2) ... D(d) extracted by reading said data classified consecutively in a first storage memory (10) and to store said data in a second memory (m) of the processing device; said electronic processing device being adapted to generate vectors W(u) as a function of said stored data, each vector W(u) comprising K components (W(u)_0, ..., W(u)_{K-1}) such that W(u)_k = D(K0 + k*S2 + u*S1) for k = 0 to K-1, where K is an integer strictly greater than 1, K0 is a predefined constant, and the steps S1 and S2 are predefined integer constants; said electronic processing device (40) being adapted to supply, as input to said shift register (28), the vector W(u), such that successive values W(u) are supplied during successive clock cycles, with u an integer ranging from 0 to L1-1, and L1 an integer constant strictly greater than one.

7. Data processing device (40) according to claim 6, further adapted to obtain a data vector w of size K provided by a second shift register (29) of the systolic neural circuit (1), said second shift register being adapted to convey at each clock cycle a data vector of size K corresponding to the results of K respective columns of elementary processors of said grid, the K components of w being (w_0, ..., w_{K-1}); said electronic processing device being adapted to store in the second memory (m) w_k, k = 0 to K-1, at the address ind0 + K*C*S2 + x*S1 + k*S2, where ind0 is a predefined constant, C is the column number and x is an index ranging from 0 to L1-1, and to provide the first storage memory with a sequence of successive data extracted by reading said data classified consecutively in the second memory.

8. Processing device (40) according to claim 6 or 7, in which, for k = 0 to K-1, W(u)_k = D(K0 + Σ_{i=1 to n} u_i*S1_i + k*S2), the steps S1_i and S2 being predefined constants; the vector being provided at the input of said shift register, with u_i an integer ranging from 0 to L1_i-1, and L1_i a constant strictly greater than one.

9. Processing device (40) according to claim 8, adapted to obtain a data vector w of size K provided by a second shift register (29) of the systolic neural circuit (1), said second shift register conveying at each clock cycle a data vector of size K corresponding to the results of K respective columns of elementary processors of said grid, the K components of w being (w_0, ..., w_{K-1}), said electronic processing device (40) being adapted to store in the second memory w_k, k = 0 to K-1, at the address ind0 + 4*C*S2 + Σ_{x_i = 0 to L1_i-1} (x_i * S1_i) + k*S2, where ind0 is a predefined constant, C is the column number and x_i is an integer index ranging from 0 to L1_i-1; said electronic processing device being suitable for supplying the first storage memory with a sequence of successive data extracted by reading said data classified consecutively in the second memory.

10. Processing device (40) according to one of claims 6 to 9, the second memory comprising at least K separate memory banks, and said processing device being adapted to perform the storage of data in the second memory so as to ensure that data which are components of the same vector W(u) cannot be stored in the same bank.