EP4264497A1

EP4264497A1 - Exploitation of low data density or nonzero weights in a weighted sum computer

Info

Publication number: EP4264497A1
Application number: EP21839468.2A
Authority: EP
Inventors: Michel Harrand
Original assignee: Commissariat a lEnergie Atomique CEA; Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Current assignee: Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Priority date: 2020-12-16
Filing date: 2021-12-15
Publication date: 2023-10-25
Also published as: FR3117645A1; WO2022129156A1; US20240054330A1; FR3117645B1

Abstract

Computing circuit for computing a weighted sum of a set of first data by a set of second data, by way of at least one parsimony management circuit comprising a first buffer memory for storing all or some of the first data (MEM_A) delivered sequentially (SEQ1) and a second buffer memory for storing all or some of the second data (MEM_B) delivered sequentially (SEQ2). The parsimony management circuit (CGF) furthermore comprises a first processing circuit (CT_A) able to analyse the first data in order to look for the first nonzero data (MNULL1- MNULL4) and define a first jump indicator (isl) indicating a jump between two successive nonzero data, and to command the transfer, to the distribution circuit (DIST), of a first datum (Xi) read from the first data buffer memory based on said first jump indicator. The parsimony management circuit (CGF) furthermore comprises a second processing circuit (CT_B) able to command the transfer, to the distribution circuit (DIST), of a second datum (Wi) read from the second data buffer memory based on said first jump indicator.

Description

Title of the invention: Taking advantage of low data density or non-zero weights in a weighted sum calculator

Scope

[0001] The invention generally relates to circuits for calculating weighted sums of data having a low density of non-zero data or non-zero weights, and more particularly to digital neuromorphic network calculators for computing artificial neural networks based on convolutional or fully connected layers.

Issue Raised

[0002] Artificial neural networks constitute calculation models that imitate the operation of biological neural networks. Artificial neural networks include neurons interconnected by synapses, which are for example implemented by digital memories. Artificial neural networks are used in different fields of signal processing (visual, sound, or other) such as in the field of image classification or image recognition.

[0003] Convolutional neural networks correspond to a particular model of artificial neural network. Convolutional neural networks were initially described in the article by K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”, Biological Cybernetics, 36(4): 193-202, 1980. ISSN 0340-1200. doi:10.1007/BF00344251.

[0004] Convolutional neural networks (also referred to as “deep (convolutional) neural networks” or “ConvNets”) are neural networks inspired by biological visual systems.

[0005] Convolutional neural networks (CNN) are used in particular in image classification systems to improve classification. Applied to image recognition, these networks make it possible to learn intermediate representations of objects in images that are smaller and generalizable to similar objects, which facilitates their recognition. However, the inherently parallel operation and the complexity of convolutional neural network classifiers make their implementation difficult in embedded systems with limited resources. Indeed, embedded systems impose strong constraints on circuit area and power consumption.

[0006] The convolutional neural network is based on a succession of layers of neurons, which may be convolutional layers or fully connected layers (generally at the end of the network). In convolutional layers, only a subset of neurons from one layer are connected to a subset of neurons from another layer. On the other hand, convolutional neural networks can process multiple input channels to generate multiple output channels. Each input channel corresponds, for example, to a different data matrix.

[0007] On the input channels, input images are presented in matrix form, thus forming an input matrix; an output raster image is obtained on the output channels.

[0008] The matrices of synaptic coefficients for a convolutional layer are also called “convolution kernels”.

[0009] In particular, convolutional neural networks comprise one or more convolution layers which are particularly costly in terms of the number of operations. The operations performed are mainly multiplication and accumulation (MAC) operations to calculate a sum of the data weighted by the synaptic coefficients. Furthermore, to respect the latency and processing time constraints specific to the targeted applications, it is necessary to minimize the number of calculation cycles necessary during an inference phase or during a back-propagation phase during network learning.

[0010] More particularly, when convolutional neural networks are implemented in an embedded system with limited resources (as opposed to an implementation in data center infrastructures), reducing power consumption and the number of calculation operations required becomes an essential criterion for implementing the neural network. The basic operation implemented by an artificial neuron is a multiply-accumulate (MAC) operation. Depending on the number of neurons per layer and the number of layers that the network comprises, the number of MAC operations per unit of time required for real-time operation becomes a limiting constraint.

There is therefore a need to develop calculation architectures optimized for neural networks which make it possible to limit the number of MAC operations without degrading either the performance of the algorithms implemented by the network or the precision of the calculations. More particularly, there is a need to develop calculation architectures optimized for neural networks which carry out the calculations of the weighted sums while avoiding the operations of multiplication by a zero datum received by a neuron and/or by a zero synaptic coefficient.
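The need stated above can be illustrated by a minimal sketch. It is written in Python for readability only; the function name `sparse_weighted_sum` and the sample values are illustrative assumptions and are not taken from the invention. The sketch computes a weighted sum while skipping every pair in which the datum or the coefficient is zero, and counts the MAC operations actually performed.

```python
def sparse_weighted_sum(data, weights):
    """Return (weighted_sum, mac_count), skipping pairs with a zero operand."""
    acc = 0
    mac_count = 0
    for x, w in zip(data, weights):
        if x == 0 or w == 0:
            continue  # the product would be zero, so no MAC is issued
        acc += x * w
        mac_count += 1
    return acc, mac_count


# Hypothetical input data and synaptic coefficients (8 pairs, 3 useful).
data = [0, 3, 0, 0, 5, 2, 0, 1]
weights = [4, 0, 7, 1, 2, 6, 0, 3]

total, macs = sparse_weighted_sum(data, weights)
dense_total = sum(x * w for x, w in zip(data, weights))
```

In this example only 3 of the 8 pairs require a MAC operation, and the result is identical to the dense computation, which is the property that the precision requirement above demands.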

Prior Art / Limitations of the State of the Art

[0013] The publication “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices” by Chen et al. presents a convolutional neural network calculator implementing a Network on Chip (NoC) circuit to take advantage of the low density of non-zero data and non-zero weights. However, the disadvantage of this solution is its use of a very large network-on-chip circuit occupying a large area in an integrated circuit.

[0014] The publication “An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC” by Jinook Song et al. presents a convolutional neural network calculator that selects the data to be read based on the low density of non-zero weights, using a pre-calculation of the indices of the non-zero weights. The disadvantage of this solution is that it takes advantage of the low density of non-zero weights only. In addition, the solution proposed by this publication is limited to inference and is not suited to carrying out a learning phase.

Answer to the problem and provide solution

[0015] The invention proposes a computer architecture making it possible to reduce the electrical consumption and improve the performance of a neural network implemented on a chip by using a flow management circuit integrated in the same chip as the calculation network. The flow management circuit according to the invention takes advantage of the low density of non-zero data and/or non-zero weights.

The invention proposes an artificial neural network accelerator computer architecture comprising a plurality of MAC-type calculation units, each receiving at each calculation cycle a non-zero datum and the synaptic coefficient associated with said datum, or vice versa. The improvement in calculation performance is obtained by means of at least one flow management circuit making it possible to identify the null data during their loading from a data memory and to synchronize the reading of the weights from a weight memory using jump information. The solution according to the invention further proposes a processing of the low density of non-zero data jointly for the data and the synaptic coefficients (or weights). It is possible in the solution according to the invention to carry out this joint processing on vectors each comprising a plurality of pairs of type [data, weight].

In the description of the invention, we will present the technical solution proposed in the context of a neuromorphic network calculator. However, the proposed solution is suitable, in a more general way, for any calculation architecture intended to carry out multiplication and accumulation operations (MAC) to calculate sums of a first type of data A weighted with a second type of data B.

The solution proposed according to the invention is optimal when the first type of data A and/or the second type of data B have a low density of non-zero values. The solution proposed according to the invention is symmetric with respect to the type of data A or B.

[0019] More generally, the solution according to the invention makes it possible to take into account the “low density of non-zero data” of at least one of the input data of a series of MAC type operations. This makes it possible to optimize the operation of the computer carrying out the MAC operations via a parsimonious calculation limiting the energy consumption and the necessary calculation time.

[0020] In the following parts, the English term “sparsity” has been translated by parsimony, also referring to the notion of “low density of non-zero data”.

Summary

The subject of the invention is a calculation circuit for calculating a weighted sum of a set of first data by a set of second data comprising:

At least a first data memory for storing the first data;

At least one second data memory for storing the second data;

At least one calculation unit configured to perform the weighted sum calculation;

At least one first sequencer circuit capable of controlling reading in the first data memory according to a first predefined addressing sequence;

At least one second sequencer circuit capable of controlling reading in the second data memory according to a second predefined addressing sequence;

At least one distribution circuit (DIST) associated with a calculation unit to successively deliver to it a new pair of first and second associated data;

At least one flow management circuit comprising:

o A first buffer memory for storing all or part of the first data delivered sequentially by said first sequencer circuit;

o A second buffer memory for storing all or part of the second data delivered sequentially by said second sequencer circuit;

o A first processing circuit comprising a first circuit for controlling the read and write pointers of the first buffer memory and being capable of:

■ analyzing the first data delivered by said first sequencer circuit to search for the first non-zero data and define a first jump indicator between two successive non-zero data, and

■ controlling the transfer to the distribution circuit of a first data item read from the first data buffer memory as a function of said first jump indicator;

o A second processing circuit comprising a second circuit for controlling the read and write pointers of the second buffer memory and being capable of controlling the transfer to the distribution circuit of a second data item read from the second data buffer memory as a function of said first jump indicator.
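The cooperation between the two processing circuits can be sketched in software as follows. This is a behavioural illustration only: encoding the jump indicator as a number of positions, the function names `jump_indicators` and `distribute_pairs`, and the sample values are assumptions made for this example, and the buffer memories and sequencers are not modelled.

```python
def jump_indicators(first_data):
    """Return, for each non-zero datum, the jump from the previous read position."""
    jumps = []
    previous = -1  # position before the first element
    for index, value in enumerate(first_data):
        if value != 0:
            jumps.append(index - previous)
            previous = index
    return jumps


def distribute_pairs(first_data, second_data):
    """Advance both read pointers by the same jump and return the delivered pairs."""
    pointer_a = -1
    pointer_b = -1
    pairs = []
    for jump in jump_indicators(first_data):
        pointer_a += jump  # first processing circuit moves its read pointer
        pointer_b += jump  # second processing circuit follows the same indicator
        pairs.append((first_data[pointer_a], second_data[pointer_b]))
    return pairs


# Hypothetical first data (sparse) and second data (dense).
x = [0, 0, 7, 0, 4, 0, 0, 0, 9]
w = [1, 2, 3, 4, 5, 6, 7, 8, 9]

jumps = jump_indicators(x)
pairs = distribute_pairs(x, w)
weighted_sum = sum(a * b for a, b in pairs)
```

Because both pointers move by the same jump, each non-zero first datum stays associated with the second datum of the same position, so the weighted sum is unchanged while only three pairs reach the calculation unit.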

According to a particular aspect of the invention, the calculation circuit further comprises a plurality of null data detection circuits, each configured to associate with each first input datum a null data indicator having a first state corresponding to a zero datum and a second state corresponding to a non-zero datum.

According to a particular aspect of the invention, the first sequencer circuit is configured to deliver the first data by vectors of N successive data, N being a non-zero natural integer. The first buffer memory is a memory capable of storing vectors of N data according to a “first in, first out” principle. The first processing circuit comprises a parsimony management stage per data intended to receive the vectors coming from the first buffer memory and configured to generate a jump signal per word between two successive non-zero data intended for the two pointer control circuits. Said word jump signal forms a first component of the first jump indicator.

According to a particular aspect of the invention, the first processing circuit comprises, upstream of the first buffer memory, a parsimony management stage per vector configured to generate a jump signal per vector intended for the two control circuits of the pointers when a vector is zero. Said vector jump signal forms a second component of the first jump indicator.

According to a particular aspect of the invention, the vector parsimony management stage comprises a first null vector detection logic circuit configured to generate, from the null data flags, the jump signal per vector when a vector comprises only null data.

According to a particular aspect of the invention, the parsimony management stage by data comprises:

A register for receiving a non-zero vector at the output of the first buffer memory;

A priority encoder stage configured to carry out the following operation in an iterative manner:

o the generation of a distribution control signal corresponding to the index of the first non-zero input data of the vector.

The first pointer control circuit is configured to perform in the same iteration loop:

o the generation of the skip signal per word from the distribution control signal;

o and the setting to the first state of the null data indicator of the input data distributed from the vector stored in the register, following its distribution.

The data parsimony management stage comprises a second null vector detection logic circuit for generating a read trigger signal for the next vector when all the null data indicators of the data belonging to said vector are in the first state.
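The iteration loop described above can be sketched as follows. The sketch models the null data indicators as booleans (True standing for the first state) and uses the hypothetical names `priority_encode` and `drain_vector`; it is an assumption-laden software analogy, not a description of the actual logic circuits.

```python
def priority_encode(null_flags):
    """Return the index of the first flag marking a non-zero datum, or None."""
    for index, is_null in enumerate(null_flags):
        if not is_null:
            return index
    return None


def drain_vector(vector):
    """Distribute the non-zero data of one vector, one datum per iteration."""
    null_flags = [value == 0 for value in vector]  # True models the first state
    distributed = []
    while True:
        index = priority_encode(null_flags)  # distribution control signal
        if index is None:
            break  # all flags in the first state: the next vector can be read
        distributed.append((index, vector[index]))
        null_flags[index] = True  # indicator set to the first state after distribution
    return distributed


result = drain_vector([0, 5, 0, 8])
empty = drain_vector([0, 0, 0, 0])
```

A vector of four data containing two non-zero values is thus consumed in two iterations instead of four, and a vector containing only null data produces no iteration at all.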

According to a particular aspect of the invention, the second processing circuit is capable of:

- analyzing the second data delivered by said second sequencer circuit to search for the second non-zero data and define a second jump indicator between two successive non-zero data, and

- to control the transfer to the distribution circuit of a second data item read in the second data buffer memory according to said first and second jump indicators.

According to a particular aspect of the invention, the flow management circuit is configured to read the data from the two memories by vectors of N successive pairs according to the first and the second predefined addressing sequence of said data, N being a non-zero natural integer. The first and second jump indicators are obtained by an analysis of said vectors such that the two data forming a distributed pair are non-zero.

According to a particular aspect of the invention, the calculation circuit further comprises a plurality of null pair detection circuits, each configured to associate with each pair of first and second input data a null pair indicator having a first state corresponding to a pair comprising at least one zero datum and a second state corresponding to a pair comprising only non-zero data.

According to a particular aspect of the invention, the assembly formed by the first and the second buffer memory is a memory capable of storing vectors of N pairs according to a “first in first out” principle. The assembly formed by the first and the second processing circuit comprises a parsimony management stage per data intended to receive the vectors originating from the assembly of the first and the second buffer memory and configured to generate a jump signal per word between two successive pairs having two non-zero data intended for the two pointer control circuits. Said word jump signal forms a first component of the first jump indicator and the second jump indicator.

According to a particular aspect of the invention, the calculation circuit comprises a parsimony vector management stage upstream of the first and the second buffer memory. The vector parsimony management stage includes a first null vector detection logic circuit configured to generate a vector skip signal when a vector comprises only null data flags in the first state. Said vector jump signal forms a second component of the first jump indicator and the second jump indicator.

According to a particular aspect of the invention, the parsimony management stage per pair comprises:

A register to receive a non-zero vector at the output of the buffer memory;

A priority encoder stage configured to carry out the following operation in an iterative manner:

o the generation of a distribution control signal corresponding to the index of the first pair of the vector having two non-zero data;

At least the first or the second pointer control circuit being configured to carry out in the same iteration loop:

o the setting to the first state of the null data indicator of the pair of input data distributed from the vector stored in the register, following its distribution;

A distribution member controlled by the distribution control signal;

A second null vector detection logic circuit for generating a read trigger signal for the next vector when all the null data indicators of the pairs belonging to said vector are in the first state.
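The joint analysis on pairs can be illustrated by the following sketch, which combines the skipping of whole vectors with the skipping of individual pairs. The names `null_pair_flags` and `joint_stream`, the vector length of 4 and the sample values are assumptions chosen for the example.

```python
def null_pair_flags(first_vector, second_vector):
    """True (first state) when a pair contains at least one zero datum."""
    return [a == 0 or b == 0 for a, b in zip(first_vector, second_vector)]


def joint_stream(first_vectors, second_vectors):
    """Return the pairs delivered to the calculation unit and the skipped vector count."""
    delivered = []
    skipped_vectors = 0
    for vec_a, vec_b in zip(first_vectors, second_vectors):
        flags = null_pair_flags(vec_a, vec_b)
        if all(flags):
            skipped_vectors += 1  # vector-level skip: no useful pair in this vector
            continue
        for a, b, is_null in zip(vec_a, vec_b, flags):
            if not is_null:
                delivered.append((a, b))  # pair-level distribution
    return delivered, skipped_vectors


# Hypothetical vectors of N = 4 pairs.
a_vectors = [[0, 2, 0, 0], [0, 0, 0, 3], [1, 0, 4, 0]]
b_vectors = [[5, 0, 0, 1], [0, 0, 0, 6], [0, 7, 2, 0]]

delivered, skipped = joint_stream(a_vectors, b_vectors)
```

Here the first vector is skipped even though neither of its two halves is entirely null, because no position holds two non-zero data at once; this is the benefit of analysing the pairs jointly rather than each data set separately.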

According to a particular aspect of the invention, the calculation circuit is intended to calculate output data of a layer of an artificial neural network from input data. The neural network is composed of a succession of layers, each consisting of a set of neurons. Each layer is connected to an adjacent layer via a plurality of synapses associated with a set of synaptic coefficients forming at least one weight matrix.

The first set of data corresponds to the input data to a neuron of the layer being calculated. The second set of data corresponds to the synaptic coefficients connected to said neuron of the layer being calculated. The calculation circuit comprises: at least one group of calculation units, each group of calculation units comprising a plurality of calculation units of rank k=1 to K with K a strictly positive integer; and a plurality of second data memories of rank k=1 to K for storing the second set of data. Each group of calculation units is connected to a dedicated flow management circuit further comprising a plurality of second buffer memories of rank k=1 to K such that each second buffer memory distributes to the calculation unit of the same rank k the input data coming from the second data memory of the same rank k according to at least the first jump indicator.

The first set of data corresponds to the synaptic coefficients connected to said neuron of the layer being calculated. The second set of data corresponds to the input data of a neuron of the layer being calculated. The calculation circuit comprises: at least one group of calculation units, each group of calculation units comprising a plurality of calculation units of rank k=1 to K with K a strictly positive integer; a plurality of first data memories of rank k=1 to K for storing the first set of data; and a plurality of flow management circuits of rank k=1 to K, configured such that, for each calculation unit of rank k belonging to a group of calculation units, the flow management circuit of rank k distributes the non-zero synaptic coefficients coming from the first data memory of the same rank k to the calculation unit of the same rank k.

Other characteristics and advantages of the present invention will appear better on reading the following description in relation to the following appended drawings.

[0040] [Fig.1] represents an example of a convolutional neural network containing convolutional layers and fully connected layers.

[0041] [Fig.2a] represents a first illustration of the operation of a convolution layer of a convolutional neural network with an input channel and an output channel.

[0042] [Fig.2b] represents a second illustration of the operation of a convolution layer of a convolutional neural network with an input channel and an output channel.

[0043] [Fig.2c] represents an illustration of the operation of a convolution layer of a convolutional neural network with several input channels and several output channels.

[0044] [Fig.3] illustrates an example of a block diagram of the general architecture of a calculation circuit of a convolutional neural network.

[0045] [Fig.4] illustrates a block diagram of an example of a computing network implemented on a system on chip according to the invention.

[0046] [Fig.5a] illustrates a block diagram of a CGF flow management circuit taking advantage of the low density in non-zero values according to the invention.

[0047] [Fig.5b] illustrates a functional diagram of a data flow management circuit according to a first embodiment in which the parsimony analysis is carried out only on a set of data.

[0048] [Fig.5c] illustrates an example of implementation of the first embodiment of the invention.

[0049] [Fig.6a] illustrates a block diagram of a data flow management circuit according to a second embodiment in which the parsimony analysis is carried out jointly on the two sets of data.

[0050] [Fig.6b] illustrates a functional diagram of a data flow management circuit according to a third embodiment in which the parsimony analysis is carried out jointly on the two sets of data.

[0051] It is recalled that the solution described by the invention applies to any calculation circuit performing multiplication and accumulation operations to calculate a sum of a first set of data A weighted by a second set of data B. By way of illustration and without loss of generality, we will describe the technical solution according to the invention implemented in a circuit configured for a convolutional artificial neural network calculation application.

First, we start by describing an example of the overall structure of a convolutional neural network containing convolutional layers and fully connected layers.

Figure 1 shows the overall architecture of an example of a convolutional network for image classification. The images at the bottom of figure 1 represent an extract of the convolution kernels of the first layer. An artificial neural network (also called a “formal” neural network or simply referred to by the expression “neural network” below) consists of one or more layers of neurons, interconnected with one another.

Each layer is made up of a set of neurons, which are connected to one or more previous layers. Each neuron of a layer can be connected to one or more neurons of one or more previous layers. The last layer of the network is called the “output layer”. Neurons are connected to each other by synapses associated with synaptic weights, which weight the efficiency of the connection between neurons, and constitute the adjustable parameters of a network. Synaptic weights can be positive or negative.

Neural networks called "convolutional" (or "deep convolutional", "convnets") are also composed of layers of particular types such as convolution layers, pooling layers and fully connected layers. By definition, a convolutional neural network includes at least one convolution or “pooling” layer.

The architecture of the accelerator computer circuit according to the invention is compatible for performing the calculations of the convolutional layers. We will start first by detailing the calculations made for a convolutional layer.

Figures 2a-2c illustrate the general operation of a convolution layer. FIG. 2a represents an input matrix [I] of size $(I_x, I_y)$ connected to an output matrix [O] of size $(O_x, O_y)$ via a convolution layer performing a convolution operation using a filter [W] of size $(K_x, K_y)$.

A value $O_{i,j}$ of the output matrix [O] (corresponding to the output value of an output neuron) is obtained by applying the filter [W] to the corresponding sub-matrix of the input matrix [I].

In general, we define the convolution operation, denoted by the symbol ⊗, between two matrices [X] composed of the elements $x_{i,j}$ and [Y] composed of the elements $y_{i,j}$, of equal dimensions. The result is the sum of the products of the coefficients $x_{i,j} \cdot y_{i,j}$ each having the same position in the two matrices.

FIG. 2a shows the first value $O_{0,0}$ of the output matrix [O], obtained by applying the filter [W] to the first input sub-matrix, denoted [X1], whose dimensions are equal to those of the filter [W]. The detail of the convolution operation is described by the following equation:

[0062]

$$O_{0,0} = [X1] \otimes [W]$$

Hence

[0063]

$$O_{0,0} = x_{0,0}\,w_{0,0} + x_{0,1}\,w_{0,1} + x_{0,2}\,w_{0,2} + x_{1,0}\,w_{1,0} + x_{1,1}\,w_{1,1} + x_{1,2}\,w_{1,2} + x_{2,0}\,w_{2,0} + x_{2,1}\,w_{2,1} + x_{2,2}\,w_{2,2}$$

FIG. 2b represents a general case of calculation of an arbitrary value $O_{3,2}$ of the output matrix.

In general, the output matrix [O] is connected to the input matrix [I] by a convolution operation, via a convolution kernel or filter denoted [W]. Each neuron of the output matrix [O] is connected to a part of the input matrix [I]; this part is called "input sub-matrix" or "receptive field of the neuron" and it has the same dimensions as the filter [W]. The filter [W] is common to all the neurons of an output matrix [O].

The values of the output neurons $O_{i,j}$ are given by the following relationship:

In the above formula, g() denotes the activation function of the neuron, while $s_i$ and $s_j$ denote the vertical and horizontal shift parameters respectively. Such a “stride” offset corresponds to the offset between each application of the convolution kernel on the input matrix. For example, if the stride is greater than or equal to the kernel size, then there is no overlap between each kernel application. We recall that this formula is valid in the case where the input matrix has been processed to add additional rows and columns (padding). The filter matrix [W] is composed of the synaptic coefficients $w_{t,l}$ of ranks t=0 to $K_x-1$ and l=0 to $K_y-1$.
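The calculation of the output values can be illustrated by a short sketch. It assumes that the relationship takes the usual form $O_{i,j} = g\left(\sum_{t}\sum_{l} x_{i s_i + t,\; j s_j + l} \cdot w_{t,l}\right)$, with $s_i$ and $s_j$ the shift parameters; this form, the function name `conv_layer`, the choice of ReLU for g() and the sample matrices are assumptions for the example, and no padding is applied.

```python
def relu(value):
    """Activation g(): zero for negative inputs, identity otherwise."""
    return value if value > 0 else 0


def conv_layer(inputs, kernel, stride_i=1, stride_j=1, g=relu):
    """Apply the filter to each input sub-matrix with the given strides (no padding)."""
    k_x, k_y = len(kernel), len(kernel[0])
    o_x = (len(inputs) - k_x) // stride_i + 1
    o_y = (len(inputs[0]) - k_y) // stride_j + 1
    outputs = []
    for i in range(o_x):
        row = []
        for j in range(o_y):
            acc = 0
            for t in range(k_x):      # index t of the synaptic coefficient
                for l in range(k_y):  # index l of the synaptic coefficient
                    acc += inputs[i * stride_i + t][j * stride_j + l] * kernel[t][l]
            row.append(g(acc))
        outputs.append(row)
    return outputs


# Hypothetical 4x4 input matrix and 2x2 filter.
input_matrix = [[1, 0, 2, 0],
                [0, 3, 0, 1],
                [2, 0, 1, 0],
                [0, 1, 0, 2]]
filter_matrix = [[1, 0],
                 [0, -1]]

out_stride1 = conv_layer(input_matrix, filter_matrix)
out_stride2 = conv_layer(input_matrix, filter_matrix, stride_i=2, stride_j=2)
```

With a stride of 2, equal to the kernel size, the four receptive fields do not overlap, which matches the remark made above about strides greater than or equal to the kernel size.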

By way of example, the ReLU function may be used as the activation function of the network; it is defined such that g(x)=0 if x<0 and g(x)=x if x≥0. Using the ReLU function as an activation function generates a significant amount of null data in the intermediate layers of the network. This justifies the interest in taking advantage of this characteristic to reduce the number of calculation cycles by avoiding multiplications with zero data when calculating the weighted sum of a neuron, in order to save processing time and energy. The use of this type of activation function makes the computer circuit compatible with the technical solution according to the invention applied to the data propagated or back-propagated in the neural network.
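The effect of such an activation function on data density can be shown with a few lines of Python. The pre-activation values below are invented for the illustration; the actual proportion of null data depends on the network and on its inputs.

```python
def relu(x):
    """ReLU activation: zero for negative inputs, identity otherwise."""
    return x if x > 0 else 0


# Hypothetical pre-activation values of eight neurons.
pre_activations = [-3, 2, -1, 0, 5, -4, -2, 1]

activations = [relu(v) for v in pre_activations]
zero_count = activations.count(0)
zero_ratio = zero_count / len(activations)
```

In this invented example, five of the eight values passed to the next layer are null, so five of the eight multiplications of the following weighted sum could be avoided.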

Furthermore, it is possible to carry out “pruning” operations during the network learning phase. This is a mechanism for zeroing synaptic coefficients with values below a certain threshold. The use of this mechanism makes the computer circuit compatible with the technical solution according to the invention applied to synaptic weights.
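A minimal sketch of such a pruning step is given below. It assumes that the comparison is made on the magnitude of the coefficient, which the text does not specify; the function name `prune`, the threshold of 0.1 and the kernel values are likewise assumptions.

```python
def prune(weights, threshold):
    """Set to zero every coefficient whose magnitude is below the threshold."""
    return [[0 if abs(w) < threshold else w for w in row] for row in weights]


# Hypothetical 3x3 convolution kernel.
kernel = [[0.02, -0.40, 0.01],
          [0.75, -0.03, 0.00],
          [-0.05, 0.30, 0.04]]

pruned = prune(kernel, threshold=0.1)
nonzero = sum(1 for row in pruned for w in row if w != 0)
```

After pruning, only three of the nine coefficients remain non-zero, which is the situation in which a parsimony analysis applied to the synaptic weights becomes worthwhile.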

Generally, each layer of convolutional neurons, denoted $C_k$, can receive a plurality of input matrices on several input channels of rank p=0 to P, with P a positive integer, and/or calculate several output matrices on a plurality of output channels of rank q=0 to Q, with Q a positive integer. We denote by $[W]_{p,q,k}$ the filter corresponding to the convolution kernel which connects the output matrix $[O]_q$ to an input matrix $[I]_p$ in the neuron layer $C_k$. Different filters can be associated with different input matrices, for the same output matrix.

[0072] Figures 2a-2b illustrate a case where a single output matrix [O] (and therefore a single output channel) is connected to a single input matrix [I] (and therefore a single input channel).

FIG. 2c illustrates another case where several output matrices $[O]_q$ are each connected to several input matrices $[I]_p$. In this case, each output matrix $[O]_q$ of layer $C_k$ is connected to each input matrix $[I]_p$ via a convolution kernel $[W]_{p,q,k}$ which can be different depending on the output matrix.

Furthermore, when an output matrix is connected to several input matrices, the convolution layer performs, in addition to each convolution operation described above, a sum of the output values of the neurons obtained for each input matrix. In other words, the output value of an output neuron (belonging to an output channel) is in this case equal to the sum of the output values obtained for each convolution operation applied to each input matrix (or input channel).

The values of the output neurons $O_{i,j}$ of the output matrix $[O]_q$ are in this case given by the following relationship:

[0077] With p=0 to P the rank of an input matrix $[I]_p$ connected to the output matrix $[O]_q$ of layer $C_k$ of rank q=0 to Q via the filter $[W]_{p,q,k}$ composed of synaptic coefficients $w_{p,q,t,l}$ of ranks t=0 to $K_x-1$ and l=0 to $K_y-1$.
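By way of a hedged illustration, a relationship consistent with the definitions above can be written as follows. The summation bounds, the notation $x_{p,\cdot,\cdot}$ for the elements of the input matrix $[I]_p$ and the application of g() after the sum over the input channels are assumptions; $s_i$ and $s_j$ denote the vertical and horizontal shift parameters.

```latex
O_{i,j} = g\left( \sum_{p=0}^{P} \; \sum_{t=0}^{K_x-1} \; \sum_{l=0}^{K_y-1}
          x_{p,\; i \cdot s_i + t,\; j \cdot s_j + l} \cdot w_{p,q,t,l} \right)
```

Written this way, each term of the triple sum is one MAC operation, and any term for which the datum or the synaptic coefficient is zero contributes nothing to the result.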

FIG. 3 illustrates an example of a functional diagram of the general architecture of the calculation circuit of a convolutional neural network according to the invention.

The circuit CALC for calculating a convolutional neural network comprises an external volatile memory MEM_EXT for storing the input and output data of all the neurons of at least the layer of the network being calculated during an inference or learning phase, and a system integrated on a single chip, SoC.

[0080] The SoC integrated system comprises a calculation network MAC_RES consisting of a plurality of calculation units for calculating neurons of a layer of the neural network, an internal volatile memory MEM_INT for storing the input and output data of the neurons of the layer being calculated, a weight memory stage MEM_POIDS comprising a plurality of internal non-volatile memories of rank n=0 to N-1 denoted MEM_POIDS_n to store the synaptic coefficients of the weight matrices, a memory control circuit CONT_MEM connected to the set of memories MEM_INT, MEM_EXT and MEM_POIDS to act as an interface between the external memory MEM_EXT and the system on chip SoC, and a set of address generators ADD_GEN to organize the distribution of data and synaptic coefficients during a calculation phase and to organize the transfer of the calculated results from the various calculation units of the calculation network MAC_RES to one of the memories MEM_EXT or MEM_INT.

The SoC system-on-chip notably comprises an image interface denoted I/O to receive the input images for the entire network during an inference or learning phase. It should be noted that the input data received via the I/O interface is not limited to images but can be, more generally, of a diverse nature.

The SoC system-on-chip also includes a processor PROC for configuring the MAC_RES calculation network and the ADD_GEN address generators according to the type of neural layer calculated and the calculation phase carried out. The processor PROC is connected to an internal non-volatile memory MEM_PROG which contains the programs executable by the processor PROC.

Optionally, the SoC system on chip comprises a calculation accelerator of the SIMD (Single Instruction on Multiple Data) type connected to the processor PROC to improve the performance of the processor PROC.

The external data memory MEM_EXT and the internal data memory MEM_INT can be implemented with DRAM type memories.

The internal data memory MEM_INT can also be implemented with SRAM type memories. The processor PROC, the accelerator SIMD, the program memory MEM_PROG, the set of address generators ADD_GEN and the memory control circuit CONT_MEM form part of the control means of the convolutional neural network calculation circuit CALC.

The weight memories MEM_POIDS_n can be implemented with memories based on an emerging NVM technology.

FIG. 4 illustrates an example of a functional diagram of the calculation network MAC_RES implemented in the system on chip SoC according to a first embodiment of the invention. The calculation network MAC_RES comprises a plurality of groups of calculation units denoted Gj of rank j=1 to M, with M a positive integer; each group comprises a plurality of calculation units denoted PE_n of rank n=0 to N-1, with N a positive integer representing the number of output channels.

Without loss of generality, the example implementation illustrated in FIG. 4 comprises 9 groups of calculation units, each comprising 128 calculation units denoted PE_n. This design choice makes it possible to cover a wide range of convolution types, such as 3x3 stride 1, 3x3 stride 2, 5x5 stride 1, 7x7 stride 2, 1x1 stride 1 and 11x11 stride 4, by relying on the spatial parallelism provided by the groups of calculation units while calculating 128 output channels in parallel.

During the calculation of a layer of neurons, each of the groups of calculation units Gj receives the input data xij from a memory integrated in the calculation network MAC_RES, denoted MEM_A, which holds input data xij of the layer being calculated. The memory MEM_A receives a subset of the input data from the external memory MEM_EXT or from the internal memory MEM_INT. Input data from one or more input channels are used to calculate one or more output matrices on one or more output channels.

The memory MEM_A comprises a write port connected to the memories MEM_EXT or MEM_INT and 9 read ports, each connected to a flow management circuit CGF which is itself connected to a group of calculation units Gj. For each group Gj of calculation units PE_n, the flow management circuit CGF is configured to distribute, at each calculation cycle, a non-zero input datum xij coming from the first data memory MEM_A to the calculation units PE_n belonging to the group Gj.

As described previously, the system on chip SoC comprises a plurality of weight memories MEM_POIDS_n of rank n=0 to N-1. The computer circuit further comprises a plurality of weight buffer memories denoted BUFF_B_nj of rank n=0 to N-1 and of rank j=1 to M. The buffer memory of rank n and j receives the weights coming from the weight memory MEM_POIDS_n of rank n and distributes them to the calculation unit PE_n of the same rank n belonging to the group Gj of rank j. By way of example, the rank 0 weight memory MEM_POIDS_0 is connected to 9 weight buffer memories BUFF_B_0j. The weight buffer memory BUFF_B_01 is connected to the calculation unit PE_0 of the first group of calculation units G1, the weight buffer memory BUFF_B_02 is connected to the calculation unit PE_0 of the second group of calculation units G2, and so on. The set of weight buffer memories BUFF_B_0j of rank j belong to the flow management circuit CGFj associated with the group of the same rank j.

Each flow management circuit CGFj associated with a group of calculation units Gj also generates jump information, in the form of one or more signals, which is used to control the reading of the synaptic coefficients.

This makes it possible to synchronize, at the input of each calculation unit PE_n of rank n, the distribution of the synaptic weights wij coming from the weight memory MEM_POIDS_n of rank n with the distribution of the non-zero input data xij coming from the first data memory MEM_A via the flow management circuit CGFj. The computer thus calculates said weighted sum in the correct order.

Each weight memory MEM_POIDS_n of rank n contains all the weight matrices [W]_{p,n}^k associated with the synapses connected to all the neurons of the output matrix of a layer of neurons of rank k of the network, said output matrix corresponding to the output channel of the same rank n, with n an integer varying from 0 to 127 in the example implementation of FIG. 4.

Advantageously, the calculation network MAC_RES comprises a circuit for calculating an average or a maximum, denoted POOL, making it possible to carry out the “Max Pool” or “Average Pool” layer calculations. “Max pooling” of an input matrix [I] generates an output matrix [O] smaller than the input matrix by taking the maximum of the values of a sub-matrix of the input matrix [I], for example [X1], as the value of an output neuron, for example O00. “Average pooling” calculates the average value of all the neurons of a sub-matrix of the input matrix.

Advantageously, the calculation network MAC_RES also comprises a circuit, denoted ACT, for calculating an activation function of the kind generally used in convolutional neural networks. The activation function g(x) is a non-linear function, for example a ReLU function.

To simplify the description of the first embodiment of the invention, the description below will be limited to the description of the solution with a single calculation unit PE corresponding to a single group of calculation units and a single output channel.

The input data xij received by a layer being calculated constitute the first operand of the MAC operation carried out by the calculation unit PE. The synaptic weights wij connected to a layer being calculated constitute the second operand of the MAC operation performed by the calculation unit PE.

FIG. 5a illustrates the implementation of the flow management circuit CGF, which takes advantage of the low density of non-zero values in the data xij to limit the number of calculation cycles of a weighted sum.

The calculation circuit CALC comprises a first data memory MEM_A for storing the first set of data corresponding to the input data xij; a second data memory MEM_B (corresponding to MEM_POIDS_0) for storing the second set of data corresponding to the synaptic weights wij; and a calculation unit PE_0 for calculating a sum of the input data xij weighted by the synaptic weights wij.

The calculation circuit CALC further comprises a flow management circuit CGF configured to distribute, on each cycle, a non-zero input datum xij coming from the first data memory MEM_A to the calculation unit PE, so that no multiplication is performed with a zero input datum. Moreover, the flow management circuit CGF is configured to generate at least one jump indicator according to the number of zero data skipped between two non-zero data distributed successively. The jump indicator or indicators then make it possible to generate a new distribution sequence comprising only non-zero data. More generally, the jump indicators are used to synchronize the distribution of the synaptic weights from the second memory MEM_B, so that each input datum xij is multiplied by the corresponding synaptic weight wij.
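By way of illustration only, this behaviour can be sketched in software, assuming that flat Python lists stand in for the memories MEM_A and MEM_B. The function name `sparse_weighted_sum` and the returned cycle count are hypothetical and are not part of the described circuit. The jump indicator is modelled here as the number of zero data skipped before each distributed datum, and the weight read pointer advances by the same amount so that each non-zero datum meets its own weight.

```python
def sparse_weighted_sum(data, weights):
    """Weighted sum that spends one MAC cycle per non-zero datum only.

    Returns (total, cycles, jumps), where jumps[k] is the number of zero
    data skipped just before the k-th distributed datum.
    """
    assert len(data) == len(weights)
    total = 0
    cycles = 0
    jumps = []
    w_ptr = 0      # read pointer into the weight sequence (assumed model of ADD2)
    skipped = 0    # zero data seen since the last distributed datum
    for x in data:
        if x == 0:
            skipped += 1
            continue
        w_ptr += skipped              # the weight pointer follows the same jump
        total += x * weights[w_ptr]   # one multiply-accumulate cycle
        w_ptr += 1
        cycles += 1
        jumps.append(skipped)
        skipped = 0
    return total, cycles, jumps
```

With twelve data of which four are non-zero, such a model performs four multiply-accumulate cycles instead of twelve while producing the same sum as the dense calculation.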

[0102] By way of example, the input data of an input matrix [I] in the external memory MEM_A are arranged such that all the channels of the same pixel of the input image are stored sequentially. For example, if the input matrix is an image of size NxN composed of 3 input channels of RGB colors (Red, Green, Blue), the input data xij are stored as follows:

[0103]

x00R x00G x00B, x01R x01G x01B, x02R x02G x02B, ..., x0(N-1)R x0(N-1)G x0(N-1)B

x10R x10G x10B, x11R x11G x11B, x12R x12G x12B, ..., x1(N-1)R x1(N-1)G x1(N-1)B

x20R x20G x20B, x21R x21G x21B, x22R x22G x22B, ..., x2(N-1)R x2(N-1)G x2(N-1)B

...

[0104]

x(N-1)0R x(N-1)0G x(N-1)0B, x(N-1)1R x(N-1)1G x(N-1)1B, ..., x(N-1)(N-1)R x(N-1)(N-1)G x(N-1)(N-1)B
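The storage order listed above can be expressed as a simple address calculation. The following is a minimal sketch, assuming a row-major pixel order as the listing suggests; the function names `flat_index` and `interleave` are hypothetical.

```python
def flat_index(i, j, c, n, channels=3):
    """Position of channel c of pixel (i, j) in a row-major,
    channel-interleaved N x N image (c = 0, 1, 2 for R, G, B)."""
    return (i * n + j) * channels + c


def interleave(image):
    """Flatten image[i][j] = (R, G, B) into the sequence
    x00R, x00G, x00B, x01R, x01G, x01B, ..."""
    return [value for row in image for pixel in row for value in pixel]
```

For a 2x2 image, channel B of pixel (1, 0) is found at position `flat_index(1, 0, 2, n=2)`, that is 8, of the flattened sequence.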

The second data memory MEM_B (corresponding to MEM_POIDS_0) is connected to a weight buffer memory BUFF_B, which stores a subset of the weights wij coming from the memory MEM_B.

The computer circuit comprises a first sequencer circuit SEQ1 capable of controlling reading in the first data memory MEM_A according to a first predefined addressing sequence. The first addressing sequence constitutes a raw sequence before the parsimony processing which is the subject of the invention.

Similarly, the computer circuit further comprises a second sequencer circuit SEQ2 capable of controlling reading in the second data memory MEM_B according to a second predefined addressing sequence.

The computer circuit further comprises a distribution circuit DIST associated with each calculation unit PE to successively deliver a new pair of first and second data associated with the output of the flow management circuit CGF.

The flow management circuit CGF receives the data xij in the form of vectors V=(a1, a2, a3, ..., aL) composed of L data, with L a strictly positive integer. By way of example, with L=4, the flow management circuit CGF first receives and processes a first vector V1=(x00R, x00G, x00B, x01R), then a second vector V2=(x01G, x01B, x02R, x02G), and so on until the last vector Vk=(x(N-1)(N-2)B, x(N-1)(N-1)R, x(N-1)(N-1)G, x(N-1)(N-1)B).

To simplify the illustration of the embodiments of the invention, we will consider the following sequence: V1=(x1, x2, x3, x4), V2=(x5, x6, x7, x8), V3=(x9, x10, x11, x12), ..., Vk=(x_{4(k-1)+1}, x_{4(k-1)+2}, x_{4(k-1)+3}, x_{4(k-1)+4}), with k a non-zero natural number.
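The grouping of the data sequence into vectors of L data can be sketched as follows. The function name `to_vectors` is hypothetical, and padding an incomplete last vector with zeros is an assumption of this sketch; the description does not say how such a vector would be handled.

```python
def to_vectors(sequence, length=4, pad=0):
    """Split a flat data sequence into consecutive vectors of `length` items.

    Assumption: an incomplete final vector is padded with `pad` (zero).
    """
    items = list(sequence)
    vectors = []
    for start in range(0, len(items), length):
        chunk = items[start:start + length]
        chunk += [pad] * (length - len(chunk))
        vectors.append(tuple(chunk))
    return vectors
```

With L=4, the first element of the k-th vector is the datum of rank 4(k-1)+1 in the original sequence, in agreement with the expression of Vk given above.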

The flow management circuit CGF comprises a first buffer memory BUFF_A for storing all or part of the first data delivered sequentially by the first sequencer circuit SEQ1 and a second buffer memory BUFF_B for storing all or part of the second data delivered sequentially by the second sequencer circuit SEQ2.

The flow management circuit CGF also comprises a first processing circuit CT_A for processing a vector V1 stored in the buffer memory BUFF_A and a second processing circuit CT_B for processing the synaptic coefficients stored in the buffer memory BUFF_B.

The buffer memory BUFF_A operates according to the “first in first out” (FIFO) principle.

The first processing circuit CT_A comprises a first control circuit ADD1 for the read and write pointers of the first buffer memory BUFF_A. The first processing circuit CT_A performs the following operations to obtain a new sequence that contains no zero input data xij. First, it analyzes the first data xij delivered by said first sequencer circuit SEQ1 in the form of vectors in order to detect the zero data and define a first jump indicator is1 between two successive non-zero data. Second, it controls the transfer to the distribution circuit DIST of a first datum read from the first data buffer memory BUFF_A as a function of said first jump indicator is1.

The second processing circuit CT_B symmetrically comprises a second circuit for controlling the read and write pointers ADD2 of the second buffer memory. The processing circuit CT_B is capable of controlling the transfer to the distribution circuit of a second data item read from the second data buffer memory BUFF_B as a function of said first jump indicator is1.

Advantageously, in a particular embodiment, the second processing circuit CT_B analyzes the second data (here, the weights wij) delivered by the second sequencer circuit SEQ2 in order to detect the zero data and define a second jump indicator is2 between two successive non-zero data. In addition, the second processing circuit CT_B controls the transfer to the distribution circuit of a second datum read from the second data buffer memory BUFF_B as a function of said first and second jump indicators is1 and is2. In this case, the assembly formed by the first and second processing circuits CT_A and CT_B is capable of controlling the transfer to the distribution circuit DIST of a second datum read from the second data buffer memory as a function of said first and second jump indicators.

FIG. 5b illustrates a first embodiment of the invention in which the parsimony analysis is carried out only on the first addressing sequence of the first input data xij to generate a first jump indicator is1.

The transfer to the distribution circuit of a first datum xij read from the first data buffer memory MEM_A is carried out according to the first jump indicator is1. The transfer to the distribution circuit DIST of a second datum wij read from the second data buffer memory BUFF_B is also carried out according to the first jump indicator is1. This is referred to as a master-slave arrangement because the new distribution sequence of the set of second data depends on the first jump indicator is1 resulting from the analysis of the addressing sequence of the first input data xij. The first processing circuit CT_A comprises a data parsimony management stage SPAR2 configured to generate a word jump signal mot_0 between two successive non-zero data, intended for the two pointer control circuits ADD1 and ADD2. The word jump signal mot_0 constitutes a first component of the first jump indicator is1.

Advantageously, the flow management circuit CGF comprises upstream of the first buffer memory BUFF_A a parsimony management stage by vector SPAR1 configured to generate a jump signal by vector vect_0 intended for the two pointer control circuits ADD1 and ADD2 when a vector V1 is zero. The jump signal by vector vect_0 constitutes a second component of the first jump indicator is1.

The vector parsimony management stage SPAR1 is configured to process the data vectors V supplied one after the other by the data memory MEM_A by detecting whether a vector V=(x1, x2, x3, x4) is a null vector, in the sense that x1=x2=x3=x4=0. When the vector parsimony management stage SPAR1 detects a null vector, it generates a vector jump signal vect_0 towards the address generator ADD1 so that the detected null vector is not written to the buffer memory BUFF_A and processing passes to the next vector. The memory BUFF_A thus holds a stack of vectors V of four data xij, each comprising at least one non-zero datum.
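The filtering performed by the stage SPAR1 can be modelled, as a rough sketch, by the function below. The name `spar1_filter` is hypothetical, and storing a count of skipped null vectors next to each retained vector is a modelling choice of this sketch: in the description, vect_0 is a signal sent to the pointer control circuits rather than a value stored in BUFF_A.

```python
def spar1_filter(vectors):
    """Keep only the vectors containing at least one non-zero datum.

    Returns a list of (vect_0, vector) pairs, where vect_0 counts the
    null vectors skipped immediately before the retained vector.
    """
    buff_a = []
    skipped_vectors = 0
    for vector in vectors:
        if all(x == 0 for x in vector):
            skipped_vectors += 1      # null vector: not written to BUFF_A
            continue
        buff_a.append((skipped_vectors, tuple(vector)))
        skipped_vectors = 0
    return buff_a
```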

To detect a null vector, it is possible to calculate a zero data indicator for each datum x1, x2, x3, x4 of the vector being processed, either inside the vector parsimony management stage SPAR1 or earlier in the data flow chain. For each datum x1, x2, x3, x4 of the vector V, the zero data indicator can take the form of an additional bit concatenated to the word of bits constituting the datum itself.

Without loss of generality, in the illustrations presented, the calculation of the zero data indicator is an operation internal to the vector parsimony management stage SPAR1. It is nevertheless possible to calculate the zero data indicators and attach them to each input datum xij outside the flow management circuit CGF. By way of example, if the datum x1 is coded on l bits, x1(l+1) designates the zero data indicator bit of the datum x1.

The zero data indicator x1(l+1) has a first state N0 corresponding to a zero datum and a second state N1 corresponding to a non-zero datum.

The second stage for managing parsimony by data SPAR2 is configured to successively process the vectors stored in the buffer memory BUFF_A by distributing successively and at each calculation cycle a non-zero datum of the vector being processed.

The second data parsimony management stage SPAR2 performs a second function, namely the generation of a word jump signal mot_0 between two successive non-zero data. The combination of the vector jump signal vect_0 and the word jump signal mot_0 forms the first jump indicator is1. The first jump indicator is1 provides the address generator ADD1 with a new addressing sequence free of zero data. The address generator ADD1 thus controls the transfer to the distribution circuit, and from there to the calculation unit PE, of a first datum read from the first data buffer memory BUFF_A as a function of the first jump indicator is1.

In addition, the propagation of the jump indicator is1 to the address generator ADD2 associated with the processing circuit CT_B makes it possible to synchronize the distribution of the weights wij from the buffer memory BUFF_B to the calculation unit PE.

The data parsimony management stage SPAR2 performs another function consisting of the generation of a read trigger signal for the next vector followject when all the non-zero data of the vector V have been sent to the distribution device DIST. The followject signal is propagated to the address generator ADD1 to trigger the processing of the next vector in the buffer memory BUFF_A following the end of the analysis of the vector being processed.

The proposed solution thus makes it possible to minimize the number of calculation cycles needed to calculate a weighted sum: multiplications with zero data xij are avoided by detecting these zero data, and the distribution of the weights wij is synchronized by means of at least one read jump information.

The solution described according to the invention is symmetrical in the sense that the roles of the data and the weights in the detection and synchronization mechanism can be swapped. In other words, the master-slave relationship between the data xij and the weights wij can be inverted. It is thus possible to detect the zero weights wij and to synchronize the reading of the data xij according to jump information calculated from the processing of the weights wij.

FIG. 5c represents an example of an implementation of the parsimony management stage by vector and of the parsimony management stage by data, belonging to the flow management circuit according to the invention.

We illustrate the example where a vector V includes 4 values. The vector parsimony management stage SPAR1 comprises 4 null data detection circuits MNULL1, MNULL2, MNULL3 and MNULL4, each intended to calculate the null data indicator xi(l+1) of a datum belonging to the vector V received by the flow management circuit CGF. For example, MNULL1 receives the datum x1 at its input and generates at its output x1 concatenated with a null data indicator bit, with x1(l+1)=1 if x1 is null and x1(l+1)=0 if x1 is non-zero. A null data detection circuit MNULL can be implemented with logic gates forming a combinational logic circuit.

[0131] In the example of FIG. 5c, the zero data detection circuits have been integrated into the vector parsimony management stage SPAR1, but this is not limiting: the zero data indicator x1(l+1) can be calculated upstream of the flow management circuit CGF, and even upstream of the storage memory MEM_A.

Furthermore, the vector parsimony management stage SPAR1 comprises a register REG1 which stores the vector being analyzed by SPAR1 in the following form: V1 = (x1(l+1) x1, x2(l+1) x2, x3(l+1) x3, x4(l+1) x4).

The vector parsimony management stage SPAR1 further comprises a null vector detection logic circuit VNULL1 having 4 inputs for receiving the set of null data indicators xi(l+1) and an output for generating the vector jump signal vect_0 indicating whether or not the vector V is null. A null vector means a vector V1 = (x1(l+1) x1, x2(l+1) x2, x3(l+1) x3, x4(l+1) x4) such that x1=x2=x3=x4=0 or, in other words, a vector in which all the null data indicators xi(l+1) are in the first state N0. Thus, the vector jump signal vect_0 commands the first memory management circuit ADD1 not to write the null vectors coming from the first memory MEM_A to the buffer memory BUFF_A.
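The tagging performed by a detection circuit MNULL and the test performed by VNULL1 can be illustrated by the sketch below. It assumes unsigned data coded on `width` bits with the indicator placed just above the most significant data bit, and it takes the bit value 1 for a null datum from the MNULL1 example above. The function names are hypothetical.

```python
def mnull(x, width=8):
    """Return x with a null-data indicator bit concatenated above its
    `width` data bits (indicator = 1 when x is null, 0 otherwise)."""
    assert 0 <= x < (1 << width)
    flag = 1 if x == 0 else 0
    return (flag << width) | x


def split_tagged(tagged, width=8):
    """Separate a tagged word into (indicator bit, datum)."""
    return tagged >> width, tagged & ((1 << width) - 1)


def vnull(tagged_vector, width=8):
    """True when every indicator of the vector marks a null datum,
    which is a logical AND over the indicator bits."""
    return all(split_tagged(t, width)[0] == 1 for t in tagged_vector)
```

In hardware, the zero test amounts to a NOR over the data bits and the vector test to an AND over the four indicators, which is consistent with an implementation using simple logic gates.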

Thus, the first parsimony management stage SPAR1 performs a first filtering step, skipping four data at a time whenever a null vector is detected.

In steady state, the buffer memory BUFF_A then stores the data arranged in successive vectors V, each comprising at least one non-zero datum. The transition from the reading of one vector to the next in the buffer memory BUFF_A is controlled by the memory management circuit ADD1, and the stored vectors are read in “first in first out” (FIFO) order.

On the other hand, the data parsimony management stage SPAR2 comprises a register REG2 which stores the vector being processed by SPAR2. The data parsimony management stage SPAR2 is configured to successively process the vectors V coming from the buffer memory BUFF_A in the following way:

■ detect the position of the first non-zero datum belonging to the vector being processed by means of the null data indicator bit;

■ control the distribution of said detected non-zero datum to the calculation unit PE;

■ set to the first state N0 the null data indicator of the datum that has just been distributed in the vector V being processed;

■ calculate and generate a word jump signal mot_0 between two successive non-zero data.

For a vector V being analyzed, the steps described above are repeated by SPAR2 until all the zero data indicators of the vector are in the N0 state, which indicates that all the non-zero data of the vector V have been distributed and triggers the processing of the next vector from BUFF_A. The data parsimony management stage SPAR2 performs this function by generating the next-vector read trigger signal followject when all the zero data indicators of the vector are in the N0 state.

To perform the various functions detailed previously, the data parsimony management stage SPAR2 comprises a priority encoder stage ENC configured to iteratively generate a distribution control signal c1 corresponding to the index of the first non-zero input datum of the vector V1. To that end, the encoder stage ENC receives as input the zero data indicators xi(l+1) of the vector being analyzed and generates said index.

The distribution control signal c1 is then propagated as an input to the memory management circuit ADD1 for the generation of the word jump signal mot_0; once the datum has been distributed, its null data indicator in the vector V1 stored in the register REG2 is set to the first state N0.

The second data parsimony management stage SPAR2 also comprises a distribution member MUX controlled by the distribution control signal c1. This functionality can be implemented with a multiplexer having 4 inputs, each receiving a datum xij of the vector V, and one output.

The second data parsimony management stage SPAR2 further comprises a second null vector detection logic circuit VNUL2, which generates a read trigger signal followject for the next vector when all the zero data indicators of the data belonging to the vector V1 being processed are in the first state N0, and which controls the memory BUFF_A through the memory management circuit ADD1 so as to trigger the processing of the next vector V2 by SPAR2 at the next cycle.
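The iteration carried out by SPAR2 (priority encoder ENC, multiplexer MUX and detector VNUL2) can be sketched as a generator producing one tuple per calculation cycle. This is a behavioural model under stated assumptions, not the circuit itself: the name `spar2_process` is hypothetical, the word jump mot_0 is counted only inside the current vector (zeros at the end of the previous vector are ignored), and the next-vector trigger is raised on the cycle that distributes the last non-zero datum.

```python
def spar2_process(vector):
    """Distribute the non-zero data of one vector, one per cycle.

    Yields (c1, datum, mot_0, next_vector) for each cycle:
      c1           index selected by the priority encoder,
      datum        value routed by the multiplexer,
      mot_0        zero data skipped since the previous distributed datum,
      next_vector  True on the cycle that exhausts the vector.
    """
    pending = [x != 0 for x in vector]   # True while a datum is still to be sent
    previous = -1
    while any(pending):
        c1 = pending.index(True)         # priority encoder: first pending position
        mot_0 = c1 - previous - 1        # zeros skipped inside this vector
        pending[c1] = False              # mark the datum as distributed
        previous = c1
        yield c1, vector[c1], mot_0, not any(pending)
```

For the vector (4, 0, 0, 2), this model needs two cycles: it distributes the datum at index 0 with no jump, then the datum at index 3 with a jump of 2, and raises the next-vector trigger on the second cycle.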

Simultaneously, for each non-zero datum xij distributed, the two jump signals mot_0 and vect_0 form the two components of the first jump indicator is1. The latter is propagated to the second memory management circuit ADD2, which controls the reading of the buffer memory BUFF_B. This makes it possible to distribute the weight wij associated with the distributed datum xij according to the read sequence initially provided by the sequencers SEQ1 and SEQ2.

By way of illustration, consider the following data sequence composed of the successive vectors V1=(x1=4, x2=0, x3=0, x4=2), V2=(x5=0, x6=0, x7=0, x8=0), V3=(x9=0, x10=1, x11=0, x12=3). The first initially predefined addressing sequence, which governs the reading of the data from MEM_A, is: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12. The second initially predefined addressing sequence, which governs the reading of the weights from MEM_B, is: w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12. The new addressing sequence which governs the reading of the data from MEM_A after processing by the parsimony management circuit is x1, x4, x10, x12, and for the weights w1, w4, w10, w12. The jump indicator is1 = mot_0 + vect_0 x 4 successively takes the following values:

■ 1st cycle: is1 = mot_0 + vect_0 x 4 = 0 + 0x4 = 0

■ 2nd cycle: is1 = mot_0 + vect_0 x 4 = 2 + 0x4 = 2

■ 3rd cycle: is1 = mot_0 + vect_0 x 4 = 1 + 1x4 = 5

■ 4th cycle: is1 = mot_0 + vect_0 x 4 = 1 + 0x4 = 1
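The four values above can be reproduced with the short sketch below. The function name `jump_indicators` is hypothetical. The sketch counts every zero word of a vector that is not entirely null in mot_0, including zeros located at the end of such a vector; the example does not exercise that case, so this is an assumption.

```python
def jump_indicators(vectors):
    """Return (position, mot_0, vect_0, is1) for each non-zero datum.

    position is the 1-based rank in the original sequence and
    is1 = mot_0 + vect_0 * L is the number of zero data skipped since
    the previously distributed datum (L = vector length).
    """
    length = len(vectors[0])
    results = []
    mot_0 = 0    # zero words skipped in vectors that are not entirely null
    vect_0 = 0   # entirely null vectors skipped
    for v_index, vector in enumerate(vectors):
        if all(x == 0 for x in vector):
            vect_0 += 1
            continue
        for offset, x in enumerate(vector):
            if x == 0:
                mot_0 += 1
                continue
            position = v_index * length + offset + 1
            results.append((position, mot_0, vect_0, mot_0 + vect_0 * length))
            mot_0 = vect_0 = 0
    return results
```

Applied to V1=(4, 0, 0, 2), V2=(0, 0, 0, 0), V3=(0, 1, 0, 3), the function selects the positions 1, 4, 10 and 12 with is1 equal to 0, 2, 5 and 1 respectively.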

In certain embodiments applied to a convolutional neural network, the initial sequencing of the data and the weights may associate several weights with the same datum. Indeed, for an nxn convolution, an input datum xij is multiplied by n different weights of the weight matrix. A unit jump of the jump indicator is1 in the first data sequence xij thus corresponds to n jumps in the second addressing sequence of the weights wij.

FIG. 6a represents a second embodiment of the data flow management circuit CGF' according to the invention. The specificity of this embodiment is the joint parsimony management of the first set of input data (in the illustrated case, the data xij coming from a layer of the network) and of the second set of input data (in the illustrated case, the weights or synaptic coefficients wij). As in the first embodiment, the calculator circuit CALC in FIG. 6a comprises a first memory MEM_A for storing the first data xij and a first sequencer SEQ1 for controlling the reading of the memory MEM_A according to a first predefined sequence.

Likewise, the calculator circuit CALC in FIG. 6a comprises a second memory MEM_B for storing the second data wij and a second sequencer SEQ2 for controlling the reading of the memory MEM_B according to a second predefined sequence.

The advantage of the second embodiment is that the low density of non-zero values is managed simultaneously for the two sets of data constituting the operands of the multiplications of a weighted sum. The data distribution sequence is thus compressed further over time than in the first embodiment. Indeed, the master-slave arrangement described previously takes into account the low density of non-zero values of only one of the two sets of data. In the second embodiment, described below, the flow management circuit CGF' processes the two sets simultaneously. To do so, starting from the first and second initial distribution sequences provided by SEQ1 and SEQ2, each datum xij is paired with the corresponding weight wij to form a pair (xij, wij). The vectors V'1, V'2, V'3, ..., V'k then correspond to vectors of pairs V'1=((x1,w1), (x2,w2), (x3,w3), (x4,w4)), V'2=((x5,w5), (x6,w6), (x7,w7), (x8,w8)), V'3=((x9,w9), (x10,w10), (x11,w11), (x12,w12)) and so on.

In the second embodiment, the second processing circuit CT_B is able to analyze the second data delivered by said second sequencer circuit in order to detect the zero data and define a second jump indicator between two successive non-zero data, and to control the transfer to the distribution circuit of a second datum read from the second data buffer memory as a function of said first and second jump indicators.

Thus, the second processing circuit CT_B is capable of controlling the transfer to the distribution circuit of a weight wij read from the second data buffer memory BUFF_B as a function of the first and second jump indicators is1 and is2. Likewise, for the first processing circuit CT_A, the transfer of the data xij to a calculation unit PE is controlled according to the first and second jump indicators is1 and is2.

The vector parsimony management stage SPAR1 comprises 4 zero pair detection circuits CNULL1, CNULL2, CNULL3 and CNULL4, each intended to calculate the zero pair indicator CP1(l+1) = (x1, w1)(l+1) of a data pair belonging to the vector V' received by the flow management circuit CGF'. The term “zero pair” means any pair having at least one zero component.

For example, if CNULL1 receives the pair (x1=1, w1=0) (or (x1=0, w1=1)) at its input, it generates at its output the pair (x1, w1) concatenated with a zero pair indicator bit CP1(l+1) = (x1, w1)(l+1) in a first logic state N'0.

On the other hand, if CNULL1 receives the pair (x1=1, w1=2) at its input, it generates at its output the pair (x1, w1) concatenated with a zero pair indicator bit CP1(l+1) = (x1, w1)(l+1) in a second logic state N'1.

Alternatively, the zero pair indicator can be attached to each of the two data of the pair, so that each first or second datum is concatenated with the associated zero pair indicator bit; this allows the data to be stored in separate memories.

A zero pair detection circuit CNULL can be implemented with logic gates forming a combinational logic circuit.

In the example of FIG. 6a, the zero pair detection circuits have been integrated into the vector parsimony management stage SPAR1, but this is not limiting: the zero data indicator x1(l+1) can be calculated upstream of the flow management circuit CGF', and even upstream of the first and second storage memories MEM_A and MEM_B.

The parsimony processing of the vectors of pairs carried out by the vector parsimony management stage SPAR1 is similar to that of the first embodiment. The difference is that, in the second embodiment, the test is performed on the zero pair indicators. Thus, only the vectors comprising at least one pair having two non-zero components are transferred to the buffer memories BUFF_A and BUFF_B. To implement the FIFO type buffer memories, it is possible to store the data xij and the weights wij in the form of pairs belonging to a vector of pairs in a common buffer memory MEM_AB. It is also possible (as illustrated here) to keep two separate buffer memories MEM_A and MEM_B whose read and write pointers are managed respectively by the first and second memory control circuits ADD1 and ADD2.

In the illustrated embodiment, the processing circuits CT_A and CT_B share a register REG2' that receives the vector of pairs being analyzed, as well as a priority encoder circuit ENC' and a multiplexer MUX' that successively distribute the pairs having no zero component.

At the level of the processing circuits CT_A and CT_B, the same principle as in the first embodiment is applied to the data pairs and to the previously calculated zero pair indicators. The new sequence obtained then depends on a combination of the first jump indicator is1, linked to the first sequence of the first data (coming from MEM_A), and the second jump indicator is2, linked to the second sequence of the second data (coming from MEM_B).

[0161] To perform the various functions detailed above, the priority encoder stage ENC' is configured to iteratively generate a distribution control signal c'1 corresponding to the index of the first pair having two non-zero data of the vector V'i being processed. Indeed, the encoder stage ENC' receives as input the zero torque indicator CP1 (i+i) =(xi,wi)(i+i) of the vector being analyzed to generate said distribution control signal c'1 .

Then the distribution control signal c'1 is propagated as an input to the memory management circuit ADD1 for the generation of the second jump indicator word_0; followed by the setting to the first state N'0 of the zero torque indicator CP1 (i+i) =(x ₁ ,w ₁ )(i ₊₁ ) of the input data distributed in the vector V'j stored in the Reg2' register following its distribution. The multiplexer MUX' is controlled by the distribution control signal c'1.

The processing circuits CT_A and CT_B share a zero vector detection logic circuit VNUL2 to generate a read trigger signal for the next vector (suiv_vect) when all the zero pair indicators CPi(i+1) = (Xj, Wj)(i+1) of the vector being processed in the register REG2' are in the first state N'0.
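The iterative behaviour described above (priority encoding on the zero pair indicators, one flag per [datum, weight] pair, distribution of the selected pair, reset of its indicator, and request for the next vector once all indicators are in the first state) can be sketched as follows. This is a hedged software model only: `priority_encode`, `distribute_vector` and the way the per-word skip count is derived from successive indices are assumptions for the example, not elements taken from the description.

```python
def priority_encode(flags):
    """Return the index of the first indicator in state N'1 (value 1), or None if all are N'0."""
    for idx, flag in enumerate(flags):
        if flag:
            return idx
    return None


def distribute_vector(pairs):
    """Hypothetical model of the per-pair stage for one vector held in the register.

    Returns the list of (pair, skipped) tuples, where `skipped` is the number of
    pairs passed over since the previously distributed pair of the same vector.
    """
    # Zero pair indicators: 1 only when both components are non-zero.
    flags = [int(x != 0 and w != 0) for x, w in pairs]
    distributed = []
    previous = -1
    # The loop ends when every indicator is in the first state, which is the
    # condition under which the next vector would be read.
    while any(flags):
        c = priority_encode(flags)      # plays the role of the distribution control signal
        skipped = c - previous - 1      # assumed derivation of the per-word skip
        distributed.append((pairs[c], skipped))
        flags[c] = 0                    # indicator of the distributed pair reset to N'0
        previous = c
    return distributed


result = distribute_vector([(0, 3), (2, 5), (4, 0), (1, 7)])
```

With the vector above, only the pairs (2, 5) and (1, 7) are distributed, each after one skipped pair, so two multiplications are performed instead of four.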

A processing circuit has thus been produced that distributes the data and the weights based on an analysis not only of the data but also of the weights. This further shortens the sequence distributed to the calculation units PE, by avoiding multiplications not only by zero data but also by zero weights.

According to a variant of the embodiment described in FIG. 6a, it is possible to start from a single sequence of pairs [data, weight] already matched. In this case, the flow management circuit CGF' comprises a common processing circuit CT_AB, because there is a single joint sequence to be processed. In this variant, a single jump indicator is generated by the pair processing circuit. This joint jump indicator in the distribution sequence of the pairs [data, weight] makes it possible to avoid distributing pairs that include at least one null value. This variant can be implemented by using two memory management circuits ADD1 and ADD2 generating the same addresses or, alternatively, a common memory management circuit. The advantage of this variant is a reduction in the complexity and surface area of the flow management circuit CGF' compared with the embodiment of FIG. 6a.

[0167] FIG. 6b illustrates a third embodiment of the computer according to the invention, making it possible to reduce the size of the buffer memories of the flow management circuit by taking into account the multiplication of the same datum Xj by a plurality of weights (or synaptic coefficients).

In fact, when the computer circuit is dedicated to the calculation of a convolutional neural network, of size pxp, with p>1, the same datum is used in multiplication operations with p different weights; thus, in the initial sequence, the datum Xj concerned is copied p times, once for each of the p weights with which it is associated. The drawback of this mechanism is that it increases the size of the buffer memory MEM_A and therefore the area occupied by the integrated circuit comprising the computer circuit.

The embodiment illustrated in FIG. 6b aims to overcome this drawback. To this end, the flow management circuit is built according to the following architecture: instead of a single buffer memory operating as a FIFO and storing the pairs [Xj, Wj], there is a buffer memory operating as a FIFO (denoted here BUFF_B) dedicated only to the weights and a separate data buffer (denoted here BUFF_A).

The joint data and weight flow management circuit in the embodiment described does not process the data and the weights in pairs. On the one hand, the data buffer memory BUFF_A is a dual-port memory whose write pointer is incremented by 1 every p cycles, modulo the size of the buffer memory BUFF_A. On the other hand, the FIFO buffer memory denoted BUFF_B, which stores the weights, also stores the zero pair indicators previously calculated in a manner similar to the second embodiment. It is recalled that, for each pair [Xj, Wj], if at least one of the two components of the pair is zero, said pair is considered a zero pair. A vector is considered a zero vector if all its component pairs are zero according to this definition.

The writing of the data Xj in the buffer memory BUFF_A is thus carried out at a rate p times slower than the writing of the weights Wj.

As before, the weights are grouped by vectors (of 4 weights Wj for example), while the data buffer BUFF_A has a word width of one datum (for the data, the vector here reduces to a single datum Xj).

When a vector intended for the buffer memory BUFF_B (of FIFO type) comprises only weights of pairs in which at least one of the components is zero, the vector is not loaded into BUFF_B and the jump signal per vector is generated in a manner similar to the other embodiments.

The selection of the pairs having two non-zero components is carried out as previously by means of a priority encoder ENC, with a selection criterion based on the non-zero pair indicators, thus making it possible to generate a new distribution sequence of the weights Wj to the calculation unit PE.

Furthermore, the datum to be presented to the calculation unit PE opposite each weight is selected by incrementing the read pointer of the data buffer memory BUFF_A by (1 + N_ps)/p, modulo the size of the buffer memory BUFF_A, where N_ps is the number of weights skipped in the new sequence, obtained by means of the priority encoder ENC. Indeed, the encoder ENC gives the index of the selected weight within the output word of the weight memory BUFF_B, to which is added the number of null vectors skipped multiplied by 4 (if the width of the analyzed vector is 4 pairs). In this way, the datum that was opposite the selected weight when this weight was loaded into the weight buffer memory BUFF_B is retrieved. The operation of this system is therefore equivalent to the solution according to the second embodiment with joint processing of the first and second data, but with a lower storage capacity for the buffer memory BUFF_A.
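The relationship between the weights skipped and the read pointer of the data buffer can be illustrated with the following sketch. It is only a behavioural model under stated assumptions: each datum is taken to be associated with p consecutive weights, the fractional increment (1 + N_ps)/p is modelled by an integer weight-position counter followed by an integer division by p, and the names `select_weights` and `read_data_for_weights` are hypothetical.

```python
def select_weights(data, weights, p):
    """Positions, in the original weight sequence, of the pairs with two non-zero components."""
    return [k for k, w in enumerate(weights) if w != 0 and data[k // p] != 0]


def read_data_for_weights(data, weights, p):
    """Hypothetical model: each datum is stored once and shared by p consecutive weights."""
    assert len(weights) == p * len(data)
    out = []
    weight_pos = -1  # position of the previously distributed weight
    for k in select_weights(data, weights, p):
        n_ps = k - weight_pos - 1   # weights skipped since the last distributed weight
        weight_pos += 1 + n_ps      # advance by 1 + N_ps weight positions
        # The data read pointer moves p times slower than the weight position,
        # modulo the buffer size (here the buffer holds exactly len(data) words).
        read_ptr = (weight_pos // p) % len(data)
        out.append((data[read_ptr], weights[k]))
    return out


data = [2, 0, 3]
weights = [1, 0, 4, 5, 0, 6]
pairs = read_data_for_weights(data, weights, p=2)
```

In this example six weights are stored against only three data words, and the two distributed pairs recover the datum that faced each selected weight in the original sequence.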

Claims

1. Calculation circuit (CALC) for calculating a weighted sum of a set of first data (A, X) by a set of second data (B, W) comprising:

At least a first data memory (MEM_A) for storing the first data (A, X);

At least one second data memory (MEM_B) for storing the second data (B, W);

At least one calculation unit (PE) configured to perform the weighted sum calculation;

At least one first sequencer circuit (SEQ1) capable of controlling reading in the first data memory according to a first predefined addressing sequence;

At least one second sequencer circuit (SEQ2) capable of controlling reading in the second data memory according to a second predefined addressing sequence;

At least one flow management circuit (CGF) comprising:

o a plurality of null data detection circuits (MNULL1, MNULL2, MNULL3, MNULL4) each being configured to detect the null data delivered by said first sequencer circuit;

o a first buffer memory (BUFF_A) for storing all or part of the first data delivered sequentially by said first sequencer circuit;

o a second buffer memory (BUFF_B) for storing all or part of the second data delivered sequentially by said second sequencer circuit;

o a first processing circuit (CT_A) comprising a first pointer control circuit (ADD1) for reading and writing the first buffer memory and being capable of:

■ analyzing the first data delivered by said first sequencer circuit to define a first jump indicator (is1) between two successive non-zero data, and

■ controlling the transfer to the distribution circuit of a first data item read from the first data buffer memory as a function of said first jump indicator (is1);

o a second processing circuit (CT_B) comprising a second pointer control circuit (ADD2) for reading and writing the second buffer memory and being capable of controlling the transfer to the distribution circuit of a second data item read from the second data buffer memory as a function of said first jump indicator (is1).

2. Calculation circuit (CALC) according to claim 1 wherein each null data detection circuit (MNULL1, MNULL2, MNULL3, MNULL4) is configured to associate with each first input datum (x1, x2, x3, x4) a zero datum indicator (x1(i+1)) having a first state (N0) corresponding to a zero datum and a second state (N1) corresponding to a non-zero datum.

3. Calculation circuit (CALC) according to claim 1 or 2 wherein: the first sequencer circuit (SEQ1) is configured to deliver the first data by vectors (V1, V2) of N successive data, N being a non-zero natural integer; the first buffer memory (BUFF_A) is a memory capable of storing vectors of N data according to a “first in, first out” principle; the first processing circuit (CT_A) comprises

A parsimony management stage per data (SPAR2) intended to receive the vectors (V1, V2) coming from the first buffer memory (BUFF_A) and configured to generate a jump signal per word (mot_0) between two successive non-zero data, intended for the two pointer control circuits (ADD1, ADD2); said jump signal per word (mot_0) forming a first component of the first jump indicator (is1).

4. Calculation circuit (CALC) according to claim 3 wherein the first processing circuit (CT_A) comprises, upstream of the first buffer memory (BUFF_A), a parsimony management stage per vector (SPAR1) configured to generate a jump signal per vector (vect_0) intended for the two pointer control circuits (ADD1, ADD2) when a vector (V1) is zero; said jump signal per vector (vect_0) forming a second component of the first jump indicator (is1).

5. Calculation circuit (CALC) according to claim 4 wherein the parsimony management stage per vector (SPAR1) comprises a first null vector detection logic circuit (VNUL1) configured to generate, from the null data indicators (x1(i+1)), the jump signal per vector (vect_0) when a vector (V1) comprises only zero data.

6. Calculation circuit (CALC) according to any one of claims 3 to 5 in which the parsimony management stage by data (SPAR2) comprises:

A register (Reg2) to receive a non-zero vector (V1) at the output of the first buffer memory (BUFF_A);

A priority encoder stage (ENC) configured to carry out the following operation in an iterative manner:

o the generation of a distribution control signal (c1) corresponding to the index of the first non-zero input data of the vector (V1);

the first pointer control circuit (ADD1) being configured to carry out in the same iteration loop:

o the generation of the jump signal per word (mot_0) from the distribution control signal (c1);

o and the setting to the first state (N0) of the null data indicator (x1(i+1)) of the input data distributed in the vector (V1) stored in the register (Reg2), following its distribution;

A second zero vector detection logic circuit (VNUL2) for generating a read trigger signal for the next vector (suiv_vect) when all the zero data indicators (x1(i+1)) of the data belonging to said vector (V1) are in the first state (N0).

7. Calculation circuit (CALC) according to any one of claims 1 to 6 intended to calculate output data (Oj) of a layer of an artificial neural network from input data (Xjj), the neural network being composed of a succession of layers each consisting of a set of neurons, each layer being connected to an adjacent layer via a plurality of synapses associated with a set of synaptic coefficients (Wjj) forming at least one weight matrix ([W]_pq); the first set of data (A, X) corresponding to the input data (Xj) to a neuron of the layer being calculated; the second set of data (B, W) corresponding to the synaptic coefficients (Wj) connected to said neuron of the layer being calculated; the calculation circuit (CALC) comprising: at least one group of calculation units (Gj), each group of calculation units (Gj) comprising a plurality of calculation units (PE_k) of rank k=1 to K with K a strictly positive integer, a plurality of second data memories (MEM_B, MEM_POIDS) of rank k=1 to K to store the second set of data (B, W); each group of calculation units (Gj) being connected to a dedicated flow management circuit (CGF) further comprising: a plurality of second buffer memories (BUFF_B _{0 1} ) of rank k=1 to K such that each second buffer memory distributes to the calculation unit (PE_k) of the same rank k the input data (B, W) coming from the second data memory of the same rank k according to at least the first jump indicator (is1).

8. Calculation circuit (CALC) according to any one of claims 1 to 6 intended to calculate output data (Oj,j) of a layer of an artificial neural network from input data (Xjj), the neural network being composed of a succession of layers each consisting of a set of neurons, each layer being connected to an adjacent layer via a plurality of synapses associated with a set of synaptic coefficients (Wjj) forming at least one weight matrix ([W]_pq); the first set of data (A, W) corresponding to the synaptic coefficients (Wj) connected to said neuron of the layer being calculated; the second set of data (B, X) corresponding to the input data (Xj) of a neuron of the layer being calculated; the calculation circuit (CALC) comprising: at least one group of calculation units (Gj), each group of calculation units (Gj) comprising a plurality of calculation units (PE_k) of rank k=1 to K with K a strictly positive integer, a plurality of first data memories (MEM_A) of rank k=1 to K to store the first set of data (A, W);

A plurality of flow management circuits (CGF) of rank k=1 to K, each being configured such that for each calculation unit (PE _k ) of rank k belonging to a group of calculation units (Gj): the flow management circuit of rank k is configured to distribute the non-zero synaptic coefficients coming from the first data memory of the same rank k to the calculation unit of the same rank k.

9. Calculation circuit (CALC) according to claim 1, in which the second processing circuit is capable of:

- analyzing the second data delivered by said second sequencer circuit to search for the second zero data and define a second jump indicator (is2) between two successive non-zero data, and

- controlling the transfer to the distribution circuit of a second data item read from the second data buffer memory as a function of said first and second jump indicators (is1, is2); and in which the second processing circuit (CT_B) is adapted to control the transfer to the distribution circuit of a second data item read from the second data buffer memory as a function of said first and second jump indicators (is1, is2).

10. Calculation circuit (CALC) according to claim 9 in which the flow management circuit (CGF') is configured to read the data from the two memories (MEM_A, MEM_B) by vectors of N successive pairs ((x1, w1), (x2, w2), (x3, w3), (x4, w4)) according to the first and the second predefined addressing sequences of said data, N being a non-zero natural integer; the first and second jump indicators (is1, is2) are obtained by an analysis of said vectors such that the two data forming a distributed pair are non-zero.

11. Calculation circuit (CALC) according to claim 10 further comprising: A plurality of zero pair detection circuits (CNUL1, CNUL2, CNUL3, CNUL4) each being configured to associate with each pair of first and second input data ((x1, w1), (x2, w2), (x3, w3), (x4, w4)) a zero pair indicator (CP1(i+1)) having a first state (N'0) corresponding to a pair comprising at least one null datum and a second state (N'1) corresponding to a pair comprising only non-zero data.

12. Calculation circuit (CALC) according to one of claims 10 or 11 in which:

The assembly formed by the first and the second buffer memory (BUFF_A, BUFF_B) is a memory capable of storing vectors of N pairs according to a “first in, first out” principle;

The assembly formed by the first and the second processing circuit (CT_A, CT_B) comprises

A data parsimony management stage (SPAR2) intended to receive the vectors (V'1, V'2) coming from the assembly of the first and the second buffer memories (BUFF_A, BUFF_B) and configured to generate a jump signal per word (mot_0) between two successive pairs having two non-zero data, intended for the two pointer control circuits (ADD1, ADD2); said jump signal per word (mot_0) forming a first component of the first jump indicator (is1) and of the second jump indicator (is2).

13. Calculation circuit (CALC) according to any one of claims 10 to 12 comprising a vector parsimony management stage (SPAR1) upstream of the first and the second buffer memory (BUFF_A, BUFF_B), the vector parsimony management stage (SPAR1) comprising: A first null vector detection logic circuit (VNUL1) configured to generate a jump signal per vector (vect_0) when a vector (V'1) comprises only zero pair indicators (CP1(i+1)) in the first state (N'0); said jump signal per vector (vect_0) forming a second component of the first jump indicator (is1) and of the second jump indicator (is2).

14. Calculation circuit (CALC) according to any one of claims 11 to 13, in which the parsimony management stage per pair (SPAR2) comprises:

A register (Reg2') to receive a non-zero vector (V'1) at the output of the buffer memory (BUFF_A, BUFF_B);

A priority encoder stage (ENC') configured to carry out the following operation in an iterative manner:

o the generation of a distribution control signal (c'1) corresponding to the index of the first pair having two non-zero data of the vector (V'1);

At least the first or the second pointer control circuit (ADD1, ADD2) being configured to carry out in the same iteration loop:

o the setting to the first state (N'0) of the zero pair indicator (CP1(i+1)) of the input data pair distributed in the vector (V'1) stored in the register (Reg2'), following its distribution;

A distribution device (MUX') controlled by the distribution control signal (c'1);

A second null vector detection logic circuit (VNUL2) for generating a read trigger signal for the next vector (suiv_vect) when all the zero pair indicators (CP1(i+1)) of the pairs belonging to said vector (V'1) are in the first state (N'0).