WO2024111644A1 - Neural network circuit and neural network computing method

Info

Publication number: WO2024111644A1
Application number: PCT/JP2023/042052
Authority: WO, WIPO (PCT)
Prior art keywords: circuit, quantization, instruction, memory, convolution
Priority date: 2022-11-22 (Japanese Patent Application No. 2022-186308)
Other languages: French (fr), Japanese (ja)
Inventor: 潤一 金井
Original assignee: LeapMind株式会社
Application filed by: LeapMind株式会社
Publication: WO2024111644A1 (en)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention aims to provide a high-performance neural network circuit and neural network calculation method that can be incorporated into embedded devices such as IoT devices.
  • a neural network circuit includes a convolution circuit that performs a convolution operation on input data, and the convolution circuit has an instruction decompressor that decompresses compressed instruction commands that are instruction commands for the convolution circuit that operate the convolution circuit.
  • the neural network operation method is a control method for a neural network circuit including a convolution operation circuit that performs a convolution operation on input data, a quantization operation circuit that performs a quantization operation on the convolution operation output data of the convolution operation circuit, and an instruction fetch unit that reads from a memory a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit, and a compressed instruction command obtained by compressing an instruction command for the quantization operation circuit that operates the quantization operation circuit, and includes the steps of: making the instruction fetch unit read the instruction command for the convolution operation circuit and the instruction command for the quantization operation circuit separately from the memory and supply the instruction commands separately to the convolution operation circuit and the quantization operation circuit; making the convolution operation circuit and the quantization operation circuit restore the instruction command from the compressed instruction command; and making the convolution operation circuit and the quantization operation circuit operate in parallel based on the restored instruction command.
  • FIG. 1 is a diagram showing a convolutional neural network 200 (hereinafter referred to as "CNN 200").
  • the calculations performed by a neural network circuit 100 (hereinafter, referred to as "NN circuit 100") according to the first embodiment are at least a part of the trained CNN 200 used during inference.
  • the CNN 200 is a multi-layer network including a convolution layer 210 that performs a convolution operation, a quantization operation layer 220 that performs a quantization operation, and an output layer 230. In at least a part of the CNN 200, the convolution layer 210 and the quantization operation layer 220 are alternately connected.
  • the CNN 200 is a model that is widely used for image recognition and video recognition.
  • the CNN 200 may further include a layer having other functions such as a fully connected layer.
  • FIG. 2 is a diagram illustrating the convolution operation performed by the convolution layer 210.
  • the convolution layer 210 performs a convolution operation on the input data a using a weight w.
  • the convolution layer 210 performs a multiply-and-accumulate operation on the input data a and the weight w.
  • the input data a (also called activation data or feature map) to the convolutional layer 210 is multidimensional data such as image data.
  • the input data a is a three-dimensional tensor consisting of elements (x, y, c).
  • the convolutional layer 210 of the CNN 200 performs a convolution operation on the low-bit input data a.
  • the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3).
  • the elements of the input data a may be, for example, 4-bit or 8-bit unsigned integers.
  • CNN 200 may further have an input layer that performs type conversion and quantization before convolutional layer 210.
  • the weights w (also called filters or kernels) of the convolutional layer 210 are multidimensional data having elements that are learnable parameters.
  • the weights w are four-dimensional tensors consisting of elements (i, j, c, d).
  • the weights w have d three-dimensional tensors (hereinafter referred to as "weights wo") consisting of elements (i, j, c).
  • the weights w in the trained CNN 200 are trained data.
  • the convolutional layer 210 of the CNN 200 performs convolution operations using low-bit weights w.
  • the elements of the weights w are 1-bit signed integers (0, 1), where a value of "0" represents +1 and a value of "1" represents -1.
  • the quantization operation layer 220 performs quantization and other operations on the output of the convolution operation output by the convolution layer 210.
  • the quantization operation layer 220 has a pooling layer 221, a batch normalization layer 222, an activation function layer 223, and a quantization layer 224.
  • the batch normalization layer 222 normalizes the data distribution of the output data of the convolution layer 210 or the pooling layer 221, for example, by performing an operation as shown in Equation 4.
  • in Equation 4, u represents the input tensor, v represents the output tensor, α represents the scale, and β represents the bias.
  • α and β are trained constant vectors.
  • the input data a(x+i, y+j, c) in Equation 1 is divided in the c-axis direction by size Bc, and is represented by the divided input data a(x+i, y+j, co).
  • the divided input data a(x+i, y+j, co) is also simply referred to as "divided input data a".
  • the weight w(i,j,c,d) in Equation 1 is divided by the size Bc in the c-axis direction and the size Bd in the d-axis direction, and is expressed as the divided weight w(i,j,co,do).
  • the divided weight w(i, j, co, do) is also simply referred to as the "divided weight w".
  • the output data f(x, y, do) divided by size Bd is calculated using Equation 9.
  • the final output data f(x, y, d) can be calculated by combining the divided output data f(x, y, do).
  • the NN circuit 100 performs the convolution operation by expanding the input data a and the weights w in the convolution operation of the convolution layer 210.
  • FIG. 3 is a diagram for explaining the expansion of data in a convolution operation.
  • Divided input data a(x+i, y+j, co) is expanded into vector data having Bc elements.
  • the elements of divided input data a are indexed by ci (0 ≤ ci < Bc).
  • divided input data a expanded into vector data for each i and j is also referred to as "input vector A".
  • Input vector A has elements from divided input data a(x+i, y+j, co × Bc) to divided input data a(x+i, y+j, co × Bc + (Bc - 1)).
  • the divided weights w(i, j, co, do) are expanded into matrix data with Bc × Bd elements.
  • the elements of the divided weights w expanded into the matrix data are indexed by ci and di (0 ≤ di < Bd).
  • the divided weights w expanded into matrix data for each i and j are also referred to as the "weight matrix W".
  • the elements of the weight matrix W are the divided weights w(i, j, co × Bc, do × Bd) to w(i, j, co × Bc + (Bc - 1), do × Bd + (Bd - 1)).
  • Vector data is calculated by multiplying the input vector A by the weight matrix W.
  • the vector data calculated for i, j, and co is shaped into a three-dimensional tensor to obtain output data f(x, y, do).
  • the convolution operation of the convolution layer 210 can be performed by multiplying the vector data by the matrix data.
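As an illustration of the expansion described above, the blocked convolution can be written as a minimal NumPy sketch. It assumes Equation 1 (not reproduced on this page) is a standard stride-1 convolution without padding and that C and D are multiples of Bc and Bd; the function name and loop order are illustrative, not taken from the patent.

```python
import numpy as np

def conv_blocked(a, w, Bc, Bd):
    """Blocked convolution sketch: for each (i, j, co), an input vector A
    (Bc elements) is multiplied by a weight matrix W (Bc x Bd elements) and
    the Bd product-sum results O(di) are accumulated into f(x, y, do)."""
    X, Y, C = a.shape            # input data a(x, y, c)
    I, J, _, D = w.shape         # weights w(i, j, c, d)
    f = np.zeros((X - I + 1, Y - J + 1, D), dtype=np.int64)
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            for do in range(D // Bd):
                acc = np.zeros(Bd, dtype=np.int64)
                for co in range(C // Bc):
                    for i in range(I):
                        for j in range(J):
                            A = a[x + i, y + j, co * Bc:(co + 1) * Bc]
                            W = w[i, j, co * Bc:(co + 1) * Bc,
                                  do * Bd:(do + 1) * Bd]
                            acc += A @ W      # vector-matrix multiply of FIG. 3
                f[x, y, do * Bd:(do + 1) * Bd] = acc
    return f
```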
  • FIG. 4 is a diagram showing the overall configuration of an NN circuit 100 according to this embodiment.
  • the NN circuit 100 includes a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a controller 6, an IFU 7, and at least one neural network calculation core 10 (hereinafter also referred to as "NN calculation core 10").
  • the NN circuit 100 can implement multiple NN calculation cores 10.
  • the NN circuit 100 illustrated in FIG. 4 can implement up to four NN calculation cores 10.
  • the multiple NN calculation cores 10 constitute a "neural network calculation multi-core 10M" (hereinafter also referred to as "NN calculation multi-core 10M") that cooperates to execute at least some of the calculations of the CNN 200.
  • the multiple NN calculation cores 10 are daisy-chained. Note that the number of NN calculation cores 10 that can be implemented in the NN circuit 100 may be five or more.
  • the DMAC3 is connected to the external bus EB, and transfers data between an external memory 120 such as a DRAM and the NN calculation core 10.
  • the DMAC3 transfers data read from the external memory 120 to one of the multiple NN calculation cores 10.
  • the DMAC3 may be capable of transferring the same data read from the external memory 120 to multiple NN calculation cores 10, or may be capable of broadcasting the data.
  • the controller 6 is connected to the external bus EB and operates as a slave to the external host CPU 110.
  • the controller 6 has a bus bridge 60 and a register 61.
  • the bus bridge 60 relays bus access from the external bus EB to the internal bus IB.
  • the bus bridge 60 also relays write and read requests from the external host CPU 110 to the register 61.
  • the register 61 has a parameter register and a status register.
  • the parameter register is a register that controls the operation of the NN circuit 100.
  • the status register contains a pointer to the instruction sequence of each module, the number of instructions, etc., and is a register that indicates the status of the NN circuit 100.
  • the status register may also be configured to contain a semaphore S.
  • the external host CPU 110 can access the register 61 via the bus bridge 60 of the controller 6.
  • the controller 6 is connected to each block of the NN circuit 100 (DMAC3, IFU7, NN calculation core 10) via the internal bus IB.
  • the external host CPU 110 can access each block of the NN circuit 100 via the controller 6.
  • the external host CPU 110 can issue commands to the NN calculation core 10 via the controller 6.
  • each block can update a status register (which may include a semaphore S) held by the controller 6 via the internal bus IB.
  • the status register may be configured to be updated via a dedicated wiring connected to each block.
  • the IFU 7 reads instruction commands for each block (DMAC3, NN calculation core 10) of the NN circuit 100 from the external memory 120 via the external bus EB based on instructions from the external host CPU 110.
  • the IFU 7 also transfers the read instruction commands to each corresponding block (DMAC3, NN calculation core 10) of the NN circuit 100.
  • the instruction commands are stored in a compressed state (hereinafter also referred to as "compressed instruction commands") in the external memory 120.
  • the IFU 7 reads the compressed instruction commands.
  • FIG. 5 is a diagram showing the overall configuration of the NN calculation core 10.
  • the NN calculation core 10 includes a first memory 1, a second memory 2, a convolution calculation circuit 4, and a quantization calculation circuit 5.
  • the NN calculation core 10 is characterized in that the convolution calculation circuit 4 and the quantization calculation circuit 5 are formed in a loop shape via the first memory 1 and the second memory 2.
  • the first memory 1 is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the internal bus IB.
  • the external host CPU 110 can input and output data to and from the NN calculation core 10 by writing and reading data to and from the first memory 1.
  • the first memory 1 is connected to the input port of the convolution calculation circuit 4, and the convolution calculation circuit 4 can read data from the first memory 1.
  • the first memory 1 is also loop-connected (C1) to the output port of the quantization calculation circuit 5, and the quantization calculation circuit 5 can write data to the first memory 1.
  • the first memory 1 is also capable of data transfer via an inter-core connection (C2) between it and another NN calculation core 10, and the other NN calculation core 10 connected to the inter-core connection (C2) can write data to the first memory 1.
  • a daisy-chain connection is used as an example of the inter-core connection (C2).
  • the second memory 2 is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the internal bus IB.
  • the external host CPU 110 can input and output data to and from the NN calculation core 10 by writing and reading data to and from the second memory 2.
  • the second memory 2 is connected to an input port of the quantization calculation circuit 5, and the quantization calculation circuit 5 can read data from the second memory 2.
  • the second memory 2 is connected to an output port of the convolution calculation circuit 4, and the convolution calculation circuit 4 can write data to the second memory 2.
  • the convolution operation circuit 4 is a circuit that performs convolution operations in the convolution layer 210 of the trained CNN 200.
  • the convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs a convolution operation on the input data a.
  • the convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter also referred to as "convolution operation output data") to the second memory 2.
  • the quantization operation circuit 5 is a circuit that performs at least a part of the quantization operation in the quantization operation layer 220 of the trained CNN 200.
  • the quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2, and performs a quantization operation (an operation that includes at least quantization among pooling, batch normalization, activation function, and quantization) on the output data f of the convolution operation.
  • the quantization calculation circuit 5 writes the output data of the quantization calculation (hereinafter also referred to as "quantization calculation output data") to the first memory 1 connected in a loop (C1).
  • the quantization calculation circuit 5 can transfer data to other NN calculation cores 10 via the inter-core connection (C2), and the quantization calculation circuit 5 can output the quantization calculation output data to other NN calculation cores 10 connected in an inter-core connection (C2).
  • the NN calculation core 10 has a first memory 1, a second memory 2, etc., so that the number of times duplicate data is transferred can be reduced when data is transferred by the DMAC 3 from an external memory such as a DRAM. This makes it possible to significantly reduce the power consumption or processing load caused by memory access.
  • FIG. 6 is a timing chart showing an example of the operation of the NN calculation core 10.
  • the DMAC 3 stores the input data a of the layer 1 in the first memory 1.
  • the DMAC 3 may divide the input data a of the layer 1 and transfer it to the first memory 1 in accordance with the order of the convolution operation performed by the convolution operation circuit 4.
  • the convolution operation circuit 4 reads out the input data a of layer 1 stored in the first memory 1.
  • the convolution operation circuit 4 performs the convolution operation of layer 1 shown in FIG. 1 on the input data a of layer 1.
  • the output data f of the convolution operation of layer 1 is stored in the second memory 2.
  • the quantization calculation circuit 5 reads the output data f of layer 1 stored in the second memory 2.
  • the quantization calculation circuit 5 performs the quantization calculation of layer 2 on the output data f of layer 1.
  • the output data of the quantization calculation of layer 2 is stored in the first memory 1.
  • the convolution operation circuit 4 reads the output data of the quantization operation of layer 2 stored in the first memory 1.
  • the convolution operation circuit 4 performs the convolution operation of layer 3 using the output data of the quantization operation of layer 2 as input data a.
  • the output data f of the convolution operation of layer 3 is stored in the second memory 2.
  • the convolution circuit 4 reads the output data of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1.
  • the convolution circuit 4 performs the convolution operation of layer 2M-1 using the output data of the quantization operation of layer 2M-2 as input data a.
  • the output data f of the convolution operation of layer 2M-1 is stored in the second memory 2.
  • the quantization calculation circuit 5 reads the output data f of layer 2M-1 stored in the second memory 2.
  • the quantization calculation circuit 5 performs the quantization calculation of layer 2M on the output data f of the 2M-1 layer.
  • the output data of the quantization calculation of layer 2M is stored in the first memory 1.
  • the convolution operation circuit 4 reads the output data of the quantization operation of layer 2M stored in the first memory 1.
  • the convolution operation circuit 4 performs the convolution operation of layer 2M+1 using the output data of the quantization operation of layer 2M as input data a.
  • the output data f of the convolution operation of layer 2M+1 is stored in the second memory 2.
  • the convolution calculation circuit 4 and the quantization calculation circuit 5 alternately perform calculations to advance the calculations of the CNN 200 shown in FIG. 1.
  • the convolution calculation circuit 4 performs the convolution calculations of layers 2M-1 and 2M+1 by time sharing.
  • the quantization calculation circuit 5 performs the quantization calculations of layers 2M-2 and 2M by time sharing. Therefore, the circuit scale of the NN calculation core 10 is significantly smaller than when a separate convolution calculation circuit 4 and quantization calculation circuit 5 are implemented for each layer.
  • the NN calculation core 10 performs calculations for the CNN 200, which is a multi-layer structure of multiple layers, using a circuit formed in a loop.
  • the NN calculation core 10 can efficiently use hardware resources due to the loop circuit configuration. Since the NN calculation core 10 forms a circuit in a loop, the parameters in the convolution calculation circuit 4 and the quantization calculation circuit 5, which change in each layer, are updated appropriately.
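The alternating FIG. 6 schedule can be sketched as follows. `convolution` and `quantization` are placeholder functions standing in for the convolution operation circuit 4 and the quantization operation circuit 5; the sketch only models which memory each circuit reads and writes, under the assumption of an alternating sequence of odd convolution and even quantization layers.

```python
def convolution(layer, data):
    # placeholder for the convolution operation circuit 4 (odd layers 2M-1)
    return f"conv(layer {layer}: {data})"

def quantization(layer, data):
    # placeholder for the quantization operation circuit 5 (even layers 2M)
    return f"quant(layer {layer}: {data})"

def run_layers(num_layers, input_a):
    """Ping-pong schedule of FIG. 6: convolution reads the first memory 1
    and writes the second memory 2; quantization reads the second memory 2
    and writes back to the first memory 1 over the loop connection (C1)."""
    mem1, mem2 = input_a, None          # first memory 1 / second memory 2
    for layer in range(1, num_layers + 1, 2):
        mem2 = convolution(layer, mem1)               # layer 2M-1
        if layer + 1 <= num_layers:
            mem1 = quantization(layer + 1, mem2)      # layer 2M
    return mem1 if num_layers % 2 == 0 else mem2
```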
  • NN calculation core 10 transfers intermediate data to an external calculation device such as external host CPU 110. After the external calculation device performs calculations on the intermediate data, the results of the calculations by the external calculation device are input to first memory 1 and second memory 2. NN calculation core 10 resumes calculations on the results of the calculations by the external calculation device.
  • FIG. 7 is a timing chart showing another example of the operation of the NN calculation core 10.
  • the NN calculation core 10 may divide the input data a into partial tensors and perform operations on the partial tensors by time division.
  • the method of division into the partial tensors and the number of divisions are not particularly limited.
  • FIG. 7 shows an example of an operation when the input data a is decomposed into two partial tensors.
  • the decomposed partial tensors are called "first partial tensor a1" and "second partial tensor a2".
  • the convolution operation of layer 2M-1 is decomposed into a convolution operation corresponding to the first partial tensor a1 (indicated in FIG. 7 as "layer 2M-1 (a1)") and a convolution operation corresponding to the second partial tensor a2 (indicated in FIG. 7 as "layer 2M-1 (a2)").
  • the convolution and quantization operations corresponding to the first partial tensor a1 and the convolution and quantization operations corresponding to the second partial tensor a2 can be performed independently, as shown in FIG. 7.
  • the convolution operation circuit 4 performs a convolution operation of the layer 2M-1 corresponding to the first partial tensor a1 (operation shown by layer 2M-1 (a1) in FIG. 7). After that, the convolution operation circuit 4 performs a convolution operation of the layer 2M-1 corresponding to the second partial tensor a2 (operation shown by layer 2M-1 (a2) in FIG. 7). In parallel, the quantization operation circuit 5 performs a quantization operation of the layer 2M corresponding to the first partial tensor a1 (operation shown by layer 2M (a1) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of the layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of the layer 2M corresponding to the first partial tensor a1 in parallel.
  • the convolution operation circuit 4 performs a convolution operation of layer 2M+1 corresponding to the first partial tensor a1 (operation shown as layer 2M+1 (a1) in FIG. 7).
  • the quantization operation circuit 5 performs a quantization operation of layer 2M corresponding to the second partial tensor a2 (operation shown as layer 2M (a2) in FIG. 7).
  • the NN calculation core 10 can perform the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M corresponding to the second partial tensor a2 in parallel.
  • the convolution operation and quantization operation corresponding to the first partial tensor a1 and the convolution operation and quantization operation corresponding to the second partial tensor a2 can be performed independently. Therefore, the NN calculation core 10 may perform, for example, the convolution operation of the layer 2M-1 corresponding to the first partial tensor a1 and the quantization operation of the layer 2M+2 corresponding to the second partial tensor a2 in parallel. In other words, the convolution operation and quantization operation performed in parallel by the NN calculation core 10 are not limited to the operations of consecutive layers.
  • the NN calculation core 10 can operate the convolution calculation circuit 4 and the quantization calculation circuit 5 in parallel. As a result, the waiting time of the convolution calculation circuit 4 and the quantization calculation circuit 5 is reduced, improving the calculation processing efficiency of the NN calculation core 10.
  • the number of divisions is 2, but even if the number of divisions is greater than 2, the NN calculation core 10 can similarly operate the convolution calculation circuit 4 and the quantization calculation circuit 5 in parallel.
  • the NN calculation core 10 may perform in parallel a convolution operation of the layer 2M-1 corresponding to the second partial tensor a2 and a quantization operation of the layer 2M corresponding to the third partial tensor a3.
  • the order of the operations is appropriately changed depending on the storage status of the input data a in the first memory 1 and the second memory 2.
  • an example has been shown in which a partial tensor in the same layer is computed by the convolution computation circuit 4 or the quantization computation circuit 5, and then a partial tensor in the next layer is computed.
  • for example, the convolution computations of layer 2M-1 corresponding to the first partial tensor a1 and the second partial tensor a2 (computations shown by layer 2M-1 (a1) and layer 2M-1 (a2) in FIG. 7) are performed before the computations of the following layer (method 1).
  • the calculation method for the partial tensor is not limited to this.
  • the calculation method for the partial tensor may be a method in which some partial tensors in multiple layers are calculated and then the remaining partial tensors are calculated (method 2). For example, in the convolution calculation circuit 4, after performing a convolution calculation of the layer 2M-1 corresponding to the first partial tensor a1 and of the layer 2M+1 corresponding to the first partial tensor a1, a convolution calculation of the layer 2M-1 corresponding to the second partial tensor a2 and of the layer 2M+1 corresponding to the second partial tensor a2 may be performed.
  • the method of computing partial tensors may be a combination of methods 1 and 2. However, when using method 2, the computation must be performed according to the dependency relationship regarding the order of computation of partial tensors.
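A method-1 schedule like the one in FIG. 7 can be generated with the sketch below. Equal-length time slots and the `(layer, partial tensor index)` tuple format are assumptions for illustration; the quantization of a partial tensor is simply scheduled one slot behind its convolution.

```python
def pipeline_schedule(num_layer_pairs, num_parts):
    """Method-1 pipelining of FIG. 7: the quantization of (layer 2M, a_k)
    runs one slot behind the convolution of (layer 2M-1, a_k), so the two
    circuits work on different partial tensors in the same slot."""
    conv_jobs = [(2 * m - 1, p) for m in range(1, num_layer_pairs + 1)
                 for p in range(1, num_parts + 1)]
    quant_jobs = [(2 * m, p) for m in range(1, num_layer_pairs + 1)
                  for p in range(1, num_parts + 1)]
    schedule = []
    for t in range(len(conv_jobs) + 1):
        conv = conv_jobs[t] if t < len(conv_jobs) else None
        quant = quant_jobs[t - 1] if t >= 1 else None
        schedule.append((conv, quant))   # (layer, partial tensor index)
    return schedule

for slot in pipeline_schedule(num_layer_pairs=2, num_parts=2):
    print(slot)
# ((1, 1), None) -> ((1, 2), (2, 1)) -> ((3, 1), (2, 2)) -> ((3, 2), (4, 1)) -> (None, (4, 2))
```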
  • FIG. 8 is a diagram showing the NN computing multi-core 10M.
  • the NN computing multi-core 10M illustrated in Fig. 8 includes two daisy-chained NN computing cores 10. When distinguishing between the two NN computing cores 10, the two NN computing cores 10 are referred to as a "first NN computing core 10A" and a "second NN computing core 10B.”
  • in FIG. 8, the first memory 1 is abbreviated as "A", the convolution computing circuit 4 as "C", the second memory 2 as "F", and the quantization computing circuit 5 as "Q".
  • the quantization calculation circuit 5 of the first NN calculation core 10A and the first memory 1 of the second NN calculation core 10B are daisy-chain connected (C2).
  • the quantization calculation circuit 5 of the first NN calculation core 10A can write the quantization calculation output data to the first memory 1 of the first NN calculation core 10A, which is loop-connected (C1), and/or the first memory 1 of the second NN calculation core 10B, which is daisy-chain connected (C2).
  • the quantization calculation circuit 5 of the second NN calculation core 10B and the first memory 1 of the first NN calculation core 10A are daisy-chain connected (C2).
  • the quantization calculation circuit 5 of the second NN calculation core 10B can write the quantization calculation output data to the first memory 1 of the second NN calculation core 10B, which is loop-connected (C1), and/or the first memory 1 of the first NN calculation core 10A, which is daisy-chain connected (C2).
  • the multiple NN calculation cores 10 are daisy-chained.
  • the quantization calculation circuits 5 of the NN calculation cores 10 other than the final-stage NN calculation core 10 are daisy-chained (C2) with the first memory 1 of the subsequent-stage NN calculation core 10.
  • the quantization calculation circuit 5 of the final-stage NN calculation core 10 is daisy-chained (C2) with the first memory 1 of the first-stage NN calculation core 10.
  • the multiple NN calculation cores 10 are characterized by being formed in a daisy-chain loop (linked together).
  • the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are connected in a loop.
  • across the NN calculation multi-core 10M, these blocks are connected in a daisy-chain loop (linked together) so that the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are repeatedly arranged in the same order.
  • the multiple NN calculation cores 10 constituting the NN calculation multi-core 10M do not need to have the same hardware configuration.
  • the capacity and configuration of the first memory 1 of the first NN calculation core 10A may be different from the capacity and configuration of the first memory 1 of the second NN calculation core 10B.
  • the configuration of the quantization calculation circuit 5 of the first NN calculation core 10A may be different from the configuration of the quantization calculation circuit 5 of the second NN calculation core 10B.
  • FIG. 9 is an internal block diagram of the DMAC 3.
  • the DMAC 3 has a data transfer circuit 31 and a state controller 32.
  • the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, and when an instruction command is input, the DMAC 3 can perform DMA data transfer without requiring an external controller.
  • the data transfer circuit 31 is connected to the external bus EB, and performs DMA data transfer between an external memory 120 such as a DRAM and the NN calculation core 10.
  • the number of DMA channels of the data transfer circuit 31 is not limited.
  • the first NN calculation core 10A and the second NN calculation core 10B may each have a dedicated DMA channel.
  • the state controller 32 controls the state of the data transfer circuit 31.
  • the state controller 32 is also connected to the controller 6 via the internal bus IB.
  • the state controller 32 has an instruction queue 33 and a control circuit 34.
  • the instruction queue 33 is a queue in which instruction commands C3 for the DMAC3 are stored, and is configured, for example, as a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the IFU 7 or the internal bus IB.
  • the control circuit 34 is a state machine that decodes the instruction command C3 and sequentially controls the data transfer circuit 31 based on the instruction command C3.
  • the control circuit 34 may be implemented by a logic circuit or a CPU controlled by software.
  • FIG. 10 is a state transition diagram of the control circuit 34.
  • the control circuit 34 transitions from the idle state ST1 to the decode state ST2.
  • the control circuit 34 decodes the instruction command C3 output from the instruction queue 33.
  • the control circuit 34 also reads the semaphore S stored in the register 61 of the controller 6, and determines whether the operation of the data transfer circuit 31 instructed in the instruction command C3 can be executed. If it is not executable (Not ready), the control circuit 34 waits (Wait) until it is executable. If it is executable (Ready), the control circuit 34 transitions from the decode state ST2 to the execution state ST3.
  • control circuit 34 controls the data transfer circuit 31 to cause the data transfer circuit 31 to perform the operation instructed in the instruction command C3.
  • the control circuit 34 removes the executed instruction command C3 from the instruction queue 33 and updates the semaphore S stored in the register 61 of the controller 6. If there are instructions in the instruction queue 33 (Not empty), the control circuit 34 transitions from execution state ST3 to decode state ST2. If there are no instructions in the instruction queue 33 (Empty), the control circuit 34 transitions from execution state ST3 to idle state ST1.
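The FIG. 10 state transitions can be modeled as a small state machine. `semaphore_ready` and `execute` are illustrative callbacks standing in for the semaphore S check and the data transfer circuit 31; they are not interfaces defined by the patent.

```python
from collections import deque
from enum import Enum

class State(Enum):
    IDLE = 1
    DECODE = 2
    EXECUTE = 3

class ControlCircuit:
    def __init__(self, semaphore_ready, execute):
        self.queue = deque()               # instruction queue 33 (FIFO)
        self.state = State.IDLE
        self.semaphore_ready = semaphore_ready
        self.execute = execute

    def step(self):
        if self.state == State.IDLE and self.queue:
            self.state = State.DECODE
        elif self.state == State.DECODE:
            cmd = self.queue[0]            # decode instruction command C3
            if self.semaphore_ready(cmd):  # Ready; otherwise Wait in DECODE
                self.state = State.EXECUTE
        elif self.state == State.EXECUTE:
            self.execute(self.queue.popleft())  # run data transfer circuit 31
            # Not empty -> DECODE, Empty -> IDLE (semaphore S updated here)
            self.state = State.DECODE if self.queue else State.IDLE

cc = ControlCircuit(semaphore_ready=lambda cmd: True, execute=print)
cc.queue.extend(["C3: transfer input a", "C3: transfer weight w"])
while not (cc.state == State.IDLE and not cc.queue):
    cc.step()
```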
  • FIG. 11 is an internal block diagram of the convolution circuit 4.
  • the convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, a state controller 44, and an instruction decompressor 49.
  • the convolution operation circuit 4 has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, and when an instruction command is input, the convolution operation can be performed without the need for an external controller.
  • the weight memory 41 is a memory in which the weight w used in the convolution calculation is stored, and is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM).
  • the DMAC3 writes the weight w required for the convolution calculation to the weight memory 41 by DMA transfer.
  • FIG. 12 is an internal block diagram of the multiplier 42.
  • the multiplier 42 multiplies the input vector A by the weight matrix W.
  • the input vector A is vector data having Bc elements in which the divided input data a(x+i, y+j, co) is expanded for each i and j.
  • the weight matrix W is matrix data having Bc ⁇ Bd elements in which the divided weights w(i, j, co, do) are expanded for each i and j.
  • the multiplier 42 has Bc ⁇ Bd product-sum operation units 47 and can multiply the input vector A by the weight matrix W in parallel.
  • the multiplier 42 reads the input vector A and weight matrix W required for the multiplication from the first memory 1 and weight memory 41, and performs the multiplication.
  • the multiplier 42 outputs Bd product-sum operation results O(di).
  • FIG. 13 is an internal block diagram of the product-sum calculation unit 47.
  • the multiply-add unit 47 multiplies an element A(ci) of an input vector A by an element W(ci,di) of a weight matrix W.
  • the multiply-add unit 47 also adds the multiplication result to a multiplication result S(ci,di) of another multiply-add unit 47.
  • the multiply-add unit 47 outputs an addition result S(ci+1,di).
  • the element A(ci) is a 2-bit unsigned integer (0,1,2,3).
  • the element W(ci,di) is a 1-bit signed integer (0,1), where a value "0" represents +1 and a value "1" represents -1.
  • the multiply-and-accumulate unit 47 has an inverter 47a, a selector 47b, and an adder 47c.
  • the multiply-and-accumulate unit 47 performs multiplication using only the inverter 47a and the selector 47b, without using a multiplier.
  • when the element W(ci, di) is "0" (representing +1), the selector 47b selects the element A(ci) as the input.
  • when the element W(ci, di) is "1" (representing -1), the selector 47b selects the complement of the element A(ci) inverted by the inverter 47a.
  • the element W(ci, di) is also input to the carry-in of the adder 47c.
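The following sketch models one product-sum operation unit 47 behaviorally, assuming 16-bit wraparound arithmetic; the function name and bit-width handling are illustrative. Since the weight encodes +1 as 0 and -1 as 1, multiplying by -1 reduces to two's-complement negation, i.e. the inverter plus the weight bit on the adder's carry-in.

```python
def product_sum_unit(a_ci, w_ci_di, s_in, bits=16):
    """One product-sum operation unit 47: inverter 47a, selector 47b, and
    adder 47c with the weight bit W(ci, di) fed to the carry-in, so that
    -a = ~a + 1 is computed without a multiplier."""
    mask = (1 << bits) - 1
    inverted = ~a_ci & mask                           # inverter 47a
    selected = a_ci if w_ci_di == 0 else inverted     # selector 47b
    return (s_in + selected + w_ci_di) & mask         # adder 47c + carry-in

assert product_sum_unit(3, 0, 10) == 13   # 10 + (+1) * 3
assert product_sum_unit(3, 1, 10) == 7    # 10 + (-1) * 3
```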
  • FIG. 14 is an internal block diagram of the accumulator circuit 43.
  • the accumulator circuit 43 accumulates the product-sum operation results O(di) of the multiplier 42 in the second memory 2.
  • the accumulator circuit 43 has Bd accumulator units 48 and can accumulate the Bd product-sum operation results O(di) in parallel in the second memory 2.
  • FIG. 15 is an internal block diagram of the accumulator unit 48.
  • the accumulator unit 48 has an adder 48a and a mask unit 48b.
  • the adder 48a adds an element O(di) of the sum-of-products operation result O and a partial sum which is an intermediate result of the convolution operation shown in Equation 1 stored in the second memory 2.
  • the sum result is 16 bits per element.
  • the sum result is not limited to 16 bits per element, and may be, for example, 15 bits or 17 bits per element.
  • the adder 48a writes the result of the addition to the same address in the second memory 2.
  • the masking unit 48b masks the output from the second memory 2 and sets the addition target for element O(di) to zero.
  • the initialization signal clear is asserted when no intermediate partial sums are stored in the second memory 2.
  • the output data f(x, y, do) is stored in the second memory 2.
  • the state controller 44 controls the states of the multiplier 42 and the accumulator circuit 43.
  • the state controller 44 is also connected to the controller 6 via the internal bus IB.
  • the state controller 44 has an instruction queue 45 and a control circuit 46.
  • the instruction queue 45 is a queue in which the instruction command C4 for the convolution operation circuit 4 is stored, and is configured, for example, as a FIFO memory.
  • the instruction command C4 is written to the instruction queue 45 via the IFU 7 or the internal bus IB.
  • the control circuit 46 is a state machine that decodes the instruction command C4 and controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4.
  • the control circuit 46 has a similar configuration to the control circuit 34 of the state controller 32 of the DMAC3.
  • FIG. 16 is an internal block diagram of the instruction decompressor 49.
  • the instruction decompressor 49 decompresses the instruction command C4 from the compressed instruction command obtained by compressing the instruction command C4.
  • the instruction decompressor 49 includes a decompressor 49a and a ring buffer 49b.
  • the decompressor 49a decodes the compressed instruction command input from the IFU 7 and restores the instruction command C4 based on the data stored in the ring buffer 49b.
  • the ring buffer 49b is a ring-shaped buffer memory. Note that the ring buffer 49b is not limited to a ring-shaped buffer memory and may be a buffer memory of another type.
  • FIG. 17 is a diagram showing an example of a compressed instruction command that is decompressed by the instruction decompressor 49.
  • the Push instruction has an opcode field OF and an instruction field IF.
  • the opcode field OF of the Push instruction stores an opcode indicating that it is a Push instruction.
  • the instruction field IF stores an original instruction.
  • the Push instruction stores the original instruction in the ring buffer 49b and outputs the original instruction to the instruction queue 45.
  • the original instruction includes instructions such as an instruction to set input data a, an instruction to set weight w, and an instruction to set the output of convolution operation output data.
  • the Copy instruction has an opcode field OF, a seek field SF, and a count field CF.
  • the opcode field OF of the Copy instruction stores an opcode indicating that it is a Copy instruction.
  • the seek field SF stores a seek indicating an address in ring buffer 49b.
  • the count field CF stores a count indicating the number of instructions to copy.
  • the Copy instruction outputs to the instruction queue 45 the instructions stored in ring buffer 49b after the address indicated by the seek, up to the number of instructions indicated by the count.
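A minimal behavioral sketch of the Push/Copy decompression follows. The tuple encodings ("push", instruction) and ("copy", seek, count) and the ring size are illustrative stand-ins for the opcode field OF, instruction field IF, seek field SF, and count field CF.

```python
class InstructionDecompressor:
    """Push stores the original instruction in the ring buffer 49b and
    forwards it; Copy replays `count` stored instructions from `seek`."""
    def __init__(self, ring_size=64):
        self.ring = [None] * ring_size   # ring buffer 49b
        self.head = 0

    def decompress(self, compressed):
        queue = []                       # stands in for instruction queue 45
        for cmd in compressed:
            if cmd[0] == "push":         # ("push", original_instruction)
                _, instr = cmd
                self.ring[self.head % len(self.ring)] = instr
                self.head += 1
                queue.append(instr)
            elif cmd[0] == "copy":       # ("copy", seek, count)
                _, seek, count = cmd
                for k in range(count):
                    queue.append(self.ring[(seek + k) % len(self.ring)])
        return queue

d = InstructionDecompressor()
out = d.decompress([("push", "set input data a"), ("push", "set weight w"),
                    ("copy", 0, 2)])
assert out == ["set input data a", "set weight w",
               "set input data a", "set weight w"]
```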
  • the convolution operation circuit 4 of this embodiment can perform convolution operations without requiring an external controller, but in order to improve the degree of freedom of the convolution operation, it is preferable to be able to specify in detail the operation to be performed based on one instruction command C4.
  • by defining one instruction command C4 that executes a multiplication of one element (1x1) in a convolution operation, and by combining multiple such commands, it is possible to realize a variety of convolution operations, such as convolution operations using different weight filters.
  • specifying instruction commands C4 in detail increases the total number of instruction commands C4, which causes problems such as an increase in the amount of usage of the external memory 120 and pressure on the bandwidth of the external bus EB.
  • the convolution operation circuit 4 of this embodiment uses a compressed instruction command that compresses the instruction command C4.
  • the instruction commands C4 for the convolution operation circuit 4 tend to be successive instructions that repeatedly perform convolution operations, and similar instruction commands C4 tend to occur in succession within a short period of time. Therefore, by storing the instructions in the ring buffer 49b using the Push command described above, and then copying the instructions stored in the ring buffer 49b using the Copy command, it is possible to reduce the number of compressed instruction commands obtained by compressing the instruction commands C4 for the convolution operation circuit 4.
  • the instruction command C4 for the convolution operation circuit 4 that is input to the instruction decompressor 49 is compressed in advance by a tool such as a compiler that generates the instruction command C4.
  • FIG. 18 shows a modified example of the instruction decompressor 49.
  • the convolution circuit 4 may include a plurality of instruction decompressors 49. In the example shown in FIG. 18, three instruction decompressors 49 are provided in parallel. In this case, an individual instruction queue 45 corresponding to each instruction decompressor 49 is provided.
  • the instruction command C4 for the convolution circuit 4 is divided into three groups and input to the three instruction decompressors 49. For example, the instruction command C4 for the convolution circuit 4 is divided into an instruction for setting input data a, an instruction for setting weight w, and an instruction for setting the output of convolution output data.
  • control circuit 46 can efficiently read and execute the instructions stored in the instruction queue 45 divided into three groups.
  • FIG. 19 is an internal block diagram of the quantization calculation circuit 5.
  • the quantization operation circuit 5 has a quantization parameter memory 51, a vector operation circuit 52, a quantization circuit 53, a state controller 54, and an instruction decompressor 59.
  • the quantization operation circuit 5 has a state controller 54 dedicated to the vector operation circuit 52 and the quantization circuit 53, and when an instruction command is input, the quantization operation can be performed without the need for an external controller.
  • the quantization parameter memory 51 is a memory in which the quantization parameter q used in the quantization operation is stored, and is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM).
  • the DMAC3 writes the quantization parameter q required for the quantization operation to the quantization parameter memory 51 by DMA transfer.
  • FIG. 20 is an internal block diagram of the vector calculation circuit 52 and the quantization circuit 53.
  • the vector operation circuit 52 performs an operation on the output data f(x, y, do) stored in the second memory 2.
  • the vector operation circuit 52 has Bd operation units 57, and performs SIMD operations in parallel on the output data f(x, y, do).
  • FIG. 21 is a block diagram of the arithmetic unit 57.
  • the arithmetic unit 57 includes, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e.
  • the arithmetic unit 57 may further include other arithmetic units included in a known general-purpose SIMD arithmetic circuit.
  • the vector operation circuit 52 performs at least one of the operations of the pooling layer 221, the batch normalization layer 222, and the activation function layer 223 in the quantization operation layer 220 on the output data f(x, y, do) by combining the operators and the like contained in the operation unit 57.
  • the arithmetic unit 57 can add the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a.
  • the arithmetic unit 57 can store the addition result by the ALU 57a in the register 57d.
  • the arithmetic unit 57 can initialize the addition result by inputting "0" to the ALU 57a instead of the data stored in the register 57d by selecting the first selector 57b. For example, if the pooling area is 2x2, the shifter 57e can output the average value of the addition results by shifting the output of the ALU 57a to the right by 2 bits.
  • the vector calculation circuit 52 can perform the average pooling calculation shown in Equation 2 by repeating the above calculations by the Bd arithmetic units 57.
  • the arithmetic unit 57 can compare the data stored in the register 57d with the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a.
  • the arithmetic unit 57 controls the second selector 57c according to the comparison result by the ALU 57a, and can select the larger of the data stored in the register 57d and the element f(di).
  • the arithmetic unit 57 can initialize the comparison target by inputting the minimum possible value of the element f(di) to the ALU 57a via the first selector 57b.
  • since the element f(di) is a 16-bit signed integer, its minimum possible value is "0x8000" (-32768).
  • the vector calculation circuit 52 can perform the MAX pooling calculation of Equation 3 by repeating the above calculations by the Bd arithmetic units 57. Note that in the MAX pooling calculation, the shifter 57e does not shift the output of the second selector 57c.
  • the arithmetic unit 57 can subtract the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 using the ALU 57a.
  • the shifter 57e can shift the output of the ALU 57a to the left (i.e., multiplication) or right (i.e., division).
  • the vector arithmetic circuit 52 can perform the batch normalization calculation of Equation 4 by repeating the above calculations by the Bd arithmetic units 57.
  • the arithmetic unit 57 can compare the element f(di) of the output data f(x, y, do) read from the second memory 2 with the "0" selected by the first selector 57b using the ALU 57a. Depending on the comparison result by the ALU 57a, the arithmetic unit 57 can select and output either the element f(di) or the constant value "0" previously stored in the register 57d.
  • the vector arithmetic circuit 52 can perform the ReLU operation of Equation 5 by repeating the above calculations by the Bd arithmetic units 57.
  • the vector operation circuit 52 can perform average pooling, MAX pooling, batch normalization, activation function operations, and combinations of these operations. Since the vector operation circuit 52 can perform general-purpose SIMD operations, it may also perform other operations necessary for the operations in the quantization operation layer 220. In addition, the vector operation circuit 52 may also perform operations other than those in the quantization operation layer 220.
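The sketch below illustrates how the ALU 57a, the selectors, the register 57d, and the shifter 57e combine into the pooling and activation operations described above. It is a behavioral illustration under a 16-bit signed element assumption, not the circuit itself.

```python
def average_pool_2x2(window):
    """Accumulate with the ALU 57a into the register 57d, then divide by
    the 2x2 pooling area with a 2-bit right shift in the shifter 57e."""
    acc = 0                        # register 57d initialized via the "0" input
    for f_di in window:            # four elements of the pooling window
        acc = acc + f_di           # ALU 57a addition
    return acc >> 2                # shifter 57e: right shift by 2 bits = /4

def max_pool(window):
    acc = -(1 << 15)               # initialize to 0x8000, the 16-bit minimum
    for f_di in window:
        acc = max(acc, f_di)       # ALU compare + second selector 57c
    return acc

def relu(f_di):
    return f_di if f_di > 0 else 0  # compare with "0", select f(di) or "0"

assert average_pool_2x2([4, 8, 12, 16]) == 10
assert max_pool([-5, 3, -1]) == 3
assert relu(-7) == 0
```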
  • the quantization calculation circuit 5 does not have to have the vector calculation circuit 52. If the quantization calculation circuit 5 does not have the vector calculation circuit 52, the output data f(x, y, do) is input to the quantization circuit 53.
  • the quantization circuit 53 performs quantization on the output data of the vector calculation circuit 52. As shown in FIG. 20, the quantization circuit 53 has Bd quantization units 58, and performs calculations in parallel on the output data of the vector calculation circuit 52.
  • FIG. 22 is an internal block diagram of the quantization unit 58.
  • the quantization unit 58 quantizes the element in(di) of the output data of the vector operation circuit 52.
  • the quantization unit 58 has a comparator 58a and an encoder 58b.
  • the quantization unit 58 performs the operation (Equation 6) of the quantization layer 224 in the quantization operation layer 220 on the output data (16 bits/element) of the vector operation circuit 52.
  • the quantization unit 58 reads out the necessary quantization parameters q(th0, th1, th2) from the quantization parameter memory 51, and compares the input in(di) with the quantization parameter q by the comparator 58a.
  • the quantization unit 58 quantizes the comparison result by the comparator 58a to 2 bits/element by the encoder 58b. Since α(c) and β(c) in Equation 4 are parameters that differ for each c, the quantization parameters q(th0, th1, th2) that reflect α(c) and β(c) are parameters that differ for each in(di).
  • the quantization unit 58 classifies the input in (di) into four regions (e.g., in ⁇ th0, th0 ⁇ in ⁇ th1, th1 ⁇ in ⁇ th2, th2 ⁇ in) by comparing the input in (di) with three thresholds th0, th1, and th2, and outputs the classification result by encoding it into 2 bits.
  • the quantization unit 58 can also perform batch normalization and activation function calculations in addition to quantization by setting the quantization parameter q (th0, th1, th2).
  • the quantization unit 58 sets the threshold th0 as β(c) in Equation 4, and the threshold differences (th1 - th0) and (th2 - th1) as α(c) in Equation 4, and performs quantization, thereby enabling the batch normalization calculation shown in Equation 4 to be performed in conjunction with quantization.
  • by increasing (th1 - th0) and (th2 - th1), α(c) can be made smaller.
  • the quantization unit 58 can perform the activation function in conjunction with the quantization of the input in(di). For example, the quantization unit 58 saturates the output value in the region where in(di) ⁇ th0 and th2 ⁇ in(di). The quantization unit 58 can perform the calculation of the activation function in conjunction with the quantization by setting the quantization parameter q so that the output is nonlinear.
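A behavioral sketch of the quantization unit 58 follows, using the four regions of the comparison described above. The example thresholds derived from α(c) and β(c) are illustrative values, not taken from the patent.

```python
def quantize(in_di, th0, th1, th2):
    """Comparator 58a + encoder 58b: classify in(di) into four regions and
    emit a 2-bit code; the outer regions saturate, acting as an activation."""
    if in_di <= th0:
        return 0    # saturated low region
    elif in_di <= th1:
        return 1
    elif in_di <= th2:
        return 2
    else:
        return 3    # saturated high region

# illustrative per-channel thresholds folding batch normalization in:
# th0 derived from beta(c), threshold spacing derived from alpha(c)
beta_c, alpha_c = 100, 8
th0, th1, th2 = beta_c, beta_c + alpha_c, beta_c + 2 * alpha_c
assert [quantize(v, th0, th1, th2) for v in (50, 104, 112, 999)] == [0, 1, 2, 3]
```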
  • the state controller 54 controls the states of the vector operation circuit 52 and the quantization circuit 53.
  • the state controller 54 is also connected to the controller 6 via the internal bus IB.
  • the state controller 54 has an instruction queue 55 and a control circuit 56.
  • the instruction queue 55 is a queue in which the instruction command C5 for the quantization calculation circuit 5 is stored, and is configured, for example, as a FIFO memory.
  • the instruction command C5 is written to the instruction queue 55 via the IFU 7 or the internal bus IB.
  • the control circuit 56 is a state machine that decodes the instruction command C5 and controls the vector calculation circuit 52 and the quantization circuit 53 based on the instruction command C5.
  • the control circuit 56 has a similar configuration to the control circuit 34 of the state controller 32 of the DMAC3.
  • the instruction decompressor 59 restores (decompresses) the instruction command C5 from the compressed instruction command into which the instruction command C5 is compressed.
  • the instruction decompressor 59 has a configuration similar to that of the instruction decompressor 49 of the convolution operation circuit 4.
  • the quantization calculation circuit 5 writes quantization calculation output data having Bd elements to the first memory 1.
  • the preferred relationship between Bd and Bc is shown in Equation 10.
  • in Equation 10, n is an integer.
  • the controller 6 transfers the instruction command transferred from the external host CPU 110 via the internal bus IB to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.
  • the controller 6 may have an instruction memory that stores the instruction command for each circuit.
  • the controller 6 is connected to the external bus EB and operates as a slave to the external host CPU 110.
  • the controller 6 has registers 61 including a parameter register and a status register.
  • the parameter register is a register that controls the operation of the NN circuit 100.
  • the status register is a register that indicates the status of the NN circuit 100, including the semaphore S.
  • as described above, the NN circuit 100 is a high-performance neural network circuit that can be embedded in embedded devices such as IoT devices. By connecting multiple NN calculation cores 10, larger neural network calculations can be performed efficiently and quickly.
  • although the instruction command for controlling the neural network circuit 100 of the above embodiment is described as an example in which one instruction command is required for one calculation operation, the form of the instruction command is not limited to this.
  • the instruction command may be an embodiment in which multiple calculation operations can be executed by one or more instruction commands. Specifically, consecutive 1x1 convolution operations are executed based on multiple instruction commands.
  • the multiple instruction commands include at least an instruction to determine the range (offset and step) of the element A(ci) of the input vector A held in the first memory 1, an instruction to determine the range (offset and step) of the element W(ci, di) of the weight matrix W held in the weight memory 41, an instruction to determine the storage position (offset and step) in the second memory 2 where the product-sum operation result O(di) is stored, and an instruction to determine the number of repetitions (filter size) of the 1x1 convolution operation. A sketch of such a command set is given after this list.
  • in this way, the total number of instruction commands can be reduced, and the increase in the usage of the external memory 120 and the pressure on the bandwidth of the external bus EB can be mitigated.
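The sketch referenced above shows how such a multi-operation command set could be encoded; every field name here is hypothetical and only mirrors the offset/step ranges and repetition count named in the description.

```python
from dataclasses import dataclass

@dataclass
class ConvCommand:
    a_offset: int   # range start of element A(ci) in the first memory 1
    a_step: int     # range step of element A(ci)
    w_offset: int   # range start of element W(ci, di) in the weight memory 41
    w_step: int     # range step of element W(ci, di)
    o_offset: int   # storage start of O(di) in the second memory 2
    o_step: int     # storage step of O(di)
    repeat: int     # number of 1x1 convolution repetitions (filter size)

# e.g. one command describing a 3x3 filter as nine repeated 1x1 convolutions
cmd = ConvCommand(a_offset=0, a_step=1, w_offset=0, w_step=1,
                  o_offset=0, o_step=1, repeat=9)
```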
  • in the above embodiment, the first memory 1 and the second memory 2 are separate memories, but the aspects of the first memory 1 and the second memory 2 are not limited to this.
  • the first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.
  • the data input to the NN circuit 100 described in the above embodiment is not limited to a single format, and can be composed of still images, moving images, sounds, characters, numbers, and combinations of these.
  • the data input to the NN circuit 100 is not limited to these formats, and may be the measurement results of physical quantity measuring instruments that may be mounted on the edge device in which the NN circuit 100 is provided, such as optical sensors, thermometers, Global Positioning System (GPS) measuring instruments, angular velocity measuring instruments, and anemometers.
  • the input data may also be combined with different information received from peripheral devices via wired or wireless communication, such as base station information, information on vehicles and ships, weather information, congestion information and other peripheral information, financial information, and personal information.
  • the edge device in which the NN circuit 100 is provided is assumed to be a communication device such as a battery-powered mobile phone, a smart device such as a personal computer, a digital camera, a game device, a robot product, or another mobile device, but is not limited thereto. The NN circuit 100 can be used in products with strong demands for limiting the peak power that can be supplied by Power over Ethernet (PoE), reducing product heat generation, or operating for long periods, to obtain effects not seen in other prior art. For example, by applying the circuit to an in-vehicle camera mounted on a vehicle or ship, or a surveillance camera installed in a public facility or on the road, it is possible to realize long-term shooting, and the circuit also contributes to weight reduction and high durability. In addition, the circuit can be applied to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing sites and construction sites to obtain similar effects.
  • the NN circuit 100 may be realized in part or in whole by using one or more processors.
  • the NN circuit 100 may realize in part or in whole the input layer or output layer by software processing by a processor.
  • a part of the input layer or output layer realized by software processing is, for example, data normalization or conversion. This makes it possible to support various input formats or output formats.
  • the software executed by the processor may be configured to be rewritable using communication means or external media.
  • the NN circuit 100 may realize a part of the processing in the CNN 200 in combination with a graphics processing unit (GPU) or the like on the cloud.
  • the NN circuit 100 can realize more complex processing with fewer resources by performing further processing on the cloud in addition to the processing performed on the edge device in which the NN circuit 100 is provided, or by performing processing on the edge device in addition to the processing on the cloud. With such a configuration, the NN circuit 100 can distribute the processing and thereby reduce the amount of communication between the edge device and the cloud.
  • the present invention can be applied to neural network calculations.
100 Neural network circuit
10 Neural network calculation core (NN calculation core)
10A First neural network calculation core (first NN calculation core)
10B Second neural network calculation core (second NN calculation core)
10M Neural network calculation multi-core (NN calculation multi-core)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
7 IFU


Abstract

This neural network circuit comprises a convolution operation circuit that performs a convolution operation on input data. The convolution operation circuit includes an instruction decompressor that restores a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit.

Description

Neural network circuit and neural network operation method

The present invention relates to a neural network circuit and a neural network operation method. This application claims priority based on Japanese Patent Application No. 2022-186308, filed in Japan on November 22, 2022, the contents of which are incorporated herein by reference.

In recent years, convolutional neural networks (CNNs) have been used as models for image recognition and other purposes. A convolutional neural network has a multi-layer structure with convolution layers and pooling layers, and requires a large number of operations, including convolution operations. Various calculation methods have been devised to speed up the operations of convolutional neural networks (e.g., Patent Document 1).

JP 2018-077829 A

On the other hand, it is desired to realize image recognition and the like using convolutional neural networks also in embedded devices such as IoT devices. It is difficult to incorporate large-scale dedicated circuits such as those described in Patent Document 1 into embedded devices. In embedded devices with limited hardware resources such as CPUs and memory, it is also difficult to achieve sufficient computing performance for a convolutional neural network by software alone.

In light of the above circumstances, an object of the present invention is to provide a high-performance neural network circuit and neural network operation method that can be incorporated into embedded devices such as IoT devices.

In order to solve the above problems, the present invention proposes the following means.

A neural network circuit according to a first aspect of the present invention includes a convolution operation circuit that performs a convolution operation on input data, and the convolution operation circuit has an instruction decompressor that restores a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit.

A neural network operation method according to a second aspect of the present invention is a control method for a neural network circuit including: a convolution operation circuit that performs a convolution operation on input data; a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit; and an instruction fetch unit that reads from a memory a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit and a compressed instruction command obtained by compressing an instruction command for the quantization operation circuit that operates the quantization operation circuit. The method includes the steps of: causing the instruction fetch unit to read the instruction command for the convolution operation circuit and the instruction command for the quantization operation circuit separately from the memory and to supply the instruction commands separately to the convolution operation circuit and the quantization operation circuit; causing the convolution operation circuit and the quantization operation circuit to restore the instruction commands from the compressed instruction commands; and operating the convolution operation circuit and the quantization operation circuit in parallel based on the restored instruction commands.

The neural network circuit and the neural network operation method of the present invention are high-performance and can be incorporated into embedded devices such as IoT devices.
FIG. 1 is a diagram showing a convolutional neural network.
FIG. 2 is a diagram explaining the convolution operation performed by a convolution layer.
FIG. 3 is a diagram explaining the data expansion of the convolution operation.
FIG. 4 is a diagram showing the overall configuration of the neural network circuit according to the first embodiment.
FIG. 5 is a diagram showing the overall configuration of an NN calculation core.
FIG. 6 is a timing chart showing an operation example of the NN calculation core.
FIG. 7 is a timing chart showing another operation example of the NN calculation core.
FIG. 8 is a diagram showing an NN calculation multi-core.
FIG. 9 is an internal block diagram of the DMAC of the neural network circuit.
FIG. 10 is a state transition diagram of a control circuit of the DMAC.
FIG. 11 is an internal block diagram of a convolution operation circuit of the neural network circuit.
FIG. 12 is an internal block diagram of a multiplier of the convolution operation circuit.
FIG. 13 is an internal block diagram of a multiply-accumulate unit of the multiplier.
FIG. 14 is an internal block diagram of an accumulator circuit of the convolution operation circuit.
FIG. 15 is an internal block diagram of an accumulator unit of the accumulator circuit.
FIG. 16 is an internal block diagram of an instruction decompressor of the convolution operation circuit.
FIG. 17 is a diagram showing an example of a compressed instruction command restored by the instruction decompressor.
FIG. 18 is a diagram showing a modified example of the instruction decompressor.
FIG. 19 is an internal block diagram of a quantization operation circuit of the neural network circuit.
FIG. 20 is an internal block diagram of a vector operation circuit and a quantization circuit of the quantization operation circuit.
FIG. 21 is a block diagram of an operation unit.
FIG. 22 is an internal block diagram of a vector quantization unit of the quantization circuit.
(First Embodiment)
A first embodiment of the present invention will be described with reference to FIGS. 1 to 22.
FIG. 1 is a diagram showing a convolutional neural network 200 (hereinafter referred to as "CNN 200"). The calculations performed by the neural network circuit 100 (hereinafter referred to as "NN circuit 100") according to the first embodiment are at least a part of the trained CNN 200 used during inference.
[CNN200]
The CNN 200 is a network with a multi-layer structure including a convolution layer 210 that performs a convolution operation, a quantization operation layer 220 that performs a quantization operation, and an output layer 230. In at least a part of the CNN 200, the convolution layers 210 and the quantization operation layers 220 are alternately connected. The CNN 200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers having other functions, such as a fully connected layer.
FIG. 2 is a diagram explaining the convolution operation performed by the convolution layer 210.
The convolution layer 210 performs a convolution operation on the input data a using the weights w. The convolution layer 210 performs a multiply-accumulate operation with the input data a and the weights w as inputs.
The input data a to the convolution layer 210 (also called activation data or a feature map) is multidimensional data such as image data. In this embodiment, the input data a is a three-dimensional tensor consisting of elements (x, y, c). The convolution layer 210 of the CNN 200 performs the convolution operation on low-bit input data a. In this embodiment, the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3). The elements of the input data a may be, for example, 4-bit or 8-bit unsigned integers.
If the input data input to the CNN 200 differs in format from the input data a to the convolution layer 210, for example a 32-bit floating-point type, the CNN 200 may further have an input layer that performs type conversion and quantization before the convolution layer 210.
The weights w of the convolution layer 210 (also called filters or kernels) are multidimensional data having elements that are learnable parameters. In this embodiment, the weights w are a four-dimensional tensor consisting of elements (i, j, c, d). The weights w have d three-dimensional tensors (hereinafter referred to as "weights wo") consisting of elements (i, j, c). The weights w in the trained CNN 200 are trained data. The convolution layer 210 of the CNN 200 performs the convolution operation using low-bit weights w. In this embodiment, the elements of the weights w are 1-bit signed integers (0, 1), where the value "0" represents +1 and the value "1" represents -1.
The convolution layer 210 performs the convolution operation shown in Equation 1 and outputs the output data f. In Equation 1, s denotes the stride. The area indicated by the dotted line in FIG. 2 shows one of the areas ao (hereinafter referred to as the "application area ao") in which the weight wo is applied to the input data a. The elements of the application area ao are represented by (x+i, y+j, c).
f(x, y, d) = \sum_{i} \sum_{j} \sum_{c} a(x \cdot s + i, y \cdot s + j, c) \cdot w(i, j, c, d)   (Equation 1)
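As an illustration of Equation 1, the following is a minimal Python sketch of the direct convolution described above. The tensor shapes, zero-based indexing, and the absence of padding are assumptions made here for brevity; they are not taken from the specification.

```python
import numpy as np

def conv_eq1(a, w, s=1):
    """Direct convolution of Equation 1.
    a: input data, shape (X, Y, C), 2-bit unsigned values (0..3)
    w: weights, shape (K, K, C, D), entries +1 or -1
    s: stride"""
    X, Y, C = a.shape
    K, _, _, D = w.shape
    Xo, Yo = (X - K) // s + 1, (Y - K) // s + 1
    f = np.zeros((Xo, Yo, D), dtype=np.int32)
    for x in range(Xo):
        for y in range(Yo):
            ao = a[x*s:x*s+K, y*s:y*s+K, :]   # application area ao
            for d in range(D):
                f[x, y, d] = np.sum(ao * w[:, :, :, d])
    return f
```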
The quantization operation layer 220 performs quantization and the like on the output of the convolution operation output by the convolution layer 210. The quantization operation layer 220 has a pooling layer 221, a batch normalization layer 222, an activation function layer 223, and a quantization layer 224.
The pooling layer 221 performs operations such as average pooling (Equation 2) and max pooling (Equation 3) on the output data f of the convolution operation output by the convolution layer 210, thereby compressing the output data f of the convolution layer 210. In Equations 2 and 3, u denotes the input tensor, v denotes the output tensor, and T denotes the size of the pooling region. In Equation 3, max is a function that outputs the maximum value of u over the combinations of i and j included in T.
v(x, y, c) = \frac{1}{T^2} \sum_{i=0}^{T-1} \sum_{j=0}^{T-1} u(T \cdot x + i, T \cdot y + j, c)   (Equation 2)
v(x, y, c) = \max_{i, j \in T} u(T \cdot x + i, T \cdot y + j, c)   (Equation 3)
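A minimal sketch of the pooling operations of Equations 2 and 3, assuming non-overlapping T×T regions and input dimensions divisible by T (assumptions made here, not stated in the specification):

```python
import numpy as np

def average_pooling(u, T):
    """Equation 2: average pooling over non-overlapping T x T regions."""
    X, Y, C = u.shape
    v = np.zeros((X // T, Y // T, C))
    for x in range(X // T):
        for y in range(Y // T):
            v[x, y, :] = np.mean(u[T*x:T*x+T, T*y:T*y+T, :], axis=(0, 1))
    return v

def max_pooling(u, T):
    """Equation 3: max pooling over non-overlapping T x T regions."""
    X, Y, C = u.shape
    v = np.zeros((X // T, Y // T, C))
    for x in range(X // T):
        for y in range(Y // T):
            v[x, y, :] = np.max(u[T*x:T*x+T, T*y:T*y+T, :], axis=(0, 1))
    return v
```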
The batch normalization layer 222 normalizes the data distribution of the output data of the quantization operation layer 220 and the pooling layer 221 by, for example, an operation as shown in Equation 4. In Equation 4, u denotes the input tensor, v denotes the output tensor, α denotes a scale, and β denotes a bias. In the trained CNN 200, α and β are trained constant vectors.
v(x, y, c) = \alpha(c) \cdot \left( u(x, y, c) - \beta(c) \right)   (Equation 4)
The activation function layer 223 performs an activation function operation such as ReLU (Equation 5) on the output of the quantization operation layer 220, the pooling layer 221, or the batch normalization layer 222. In Equation 5, u is the input tensor and v is the output tensor. In Equation 5, max is a function that outputs the largest of its arguments.
v(x, y, c) = \max(0, u(x, y, c))   (Equation 5)
The quantization layer 224 quantizes the output of the pooling layer 221 or the activation function layer 223 based on quantization parameters, for example as shown in Equation 6. The quantization shown in Equation 6 reduces the input tensor u to 2 bits. In Equation 6, q(c) is a vector of quantization parameters. In the trained CNN 200, q(c) is a trained constant vector. The inequality sign "≤" in Equation 6 may be "<".
v(x, y, c) = \begin{cases} 0 & (u(x, y, c) \le q_0(c)) \\ 1 & (q_0(c) < u(x, y, c) \le q_1(c)) \\ 2 & (q_1(c) < u(x, y, c) \le q_2(c)) \\ 3 & (q_2(c) < u(x, y, c)) \end{cases}   (Equation 6)
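The following sketch chains the operations of Equations 4 to 6 in the order the quantization operation layer 220 describes them. The reconstructed form of Equation 4 and the per-channel threshold layout q(c) = (q0, q1, q2) are assumptions made here, not definitions taken from the specification:

```python
import numpy as np

def batch_norm(u, alpha, beta):
    """Equation 4 as reconstructed above: per-channel scale and shift."""
    return alpha * (u - beta)          # alpha, beta have shape (C,) and broadcast

def relu(u):
    """Equation 5: element-wise max(0, u)."""
    return np.maximum(0, u)

def quantize_2bit(u, q):
    """Equation 6: reduce u to 2 bits with per-channel thresholds.
    q: array of shape (C, 3) holding assumed thresholds q0 <= q1 <= q2."""
    v = np.zeros_like(u, dtype=np.uint8)
    v += (u > q[:, 0])                 # above q0 -> at least 1
    v += (u > q[:, 1])                 # above q1 -> at least 2
    v += (u > q[:, 2])                 # above q2 -> 3
    return v
```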
The output layer 230 is a layer that outputs the results of the CNN 200 by an identity function, a softmax function, or the like. The layer preceding the output layer 230 may be the convolution layer 210 or the quantization operation layer 220.
In the CNN 200, the quantized output data of the quantization layer 224 is input to the convolution layer 210, so the load of the convolution operation in the convolution layer 210 is small compared with other convolutional neural networks that do not perform quantization.
[Dividing the convolution operation]
The NN circuit 100 divides the input data of the convolution operation (Equation 1) of the convolution layer 210 into partial tensors for the operation. The method of division into partial tensors and the number of divisions are not particularly limited. A partial tensor is formed, for example, by dividing the input data a(x+i, y+j, c) into a(x+i, y+j, co). The NN circuit 100 can also perform the convolution operation (Equation 1) of the convolution layer 210 without dividing the input data.
In the input data division of the convolution operation, the variable c in Equation 1 is divided into blocks of size Bc, as shown in Equation 7. The variable d in Equation 1 is divided into blocks of size Bd, as shown in Equation 8. In Equation 7, co is an offset and ci is an index from 0 to (Bc-1). In Equation 8, do is an offset and di is an index from 0 to (Bd-1). The size Bc and the size Bd may be the same.
c = co \cdot Bc + ci   (Equation 7)
d = do \cdot Bd + di   (Equation 8)
The input data a(x+i, y+j, c) in Equation 1 is divided in the c-axis direction by the size Bc and is represented by the divided input data a(x+i, y+j, co). In the following description, the divided input data a is also referred to as "divided input data a".
The weight w(i, j, c, d) in Equation 1 is divided by the size Bc in the c-axis direction and the size Bd in the d-axis direction and is represented by the divided weight w(i, j, co, do). In the following description, the divided weight w is also referred to as the "divided weight w".
The output data f(x, y, do) divided by the size Bd is obtained by Equation 9. The final output data f(x, y, d) can be calculated by combining the divided output data f(x, y, do).
f(x, y, do) = \sum_{i} \sum_{j} \sum_{co} a(x \cdot s + i, y \cdot s + j, co) \cdot w(i, j, co, do)   (Equation 9)
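A minimal sketch of the blocked computation of Equations 7 to 9, assuming C and D are divisible by Bc and Bd (an assumption made here for brevity). For such inputs it returns the same result as the unblocked direct convolution sketched earlier:

```python
import numpy as np

Bc, Bd = 4, 4                          # block sizes for the c and d axes

def conv_blocked(a, w, s=1):
    """Blocked convolution: c = co*Bc + ci and d = do*Bd + di
    (Equations 7 and 8); partial outputs f(x, y, do) per d-block
    (Equation 9) are accumulated over the c-blocks."""
    X, Y, C = a.shape
    K, _, _, D = w.shape
    Xo, Yo = (X - K) // s + 1, (Y - K) // s + 1
    f = np.zeros((Xo, Yo, D), dtype=np.int32)
    for do in range(D // Bd):
        for co in range(C // Bc):
            ab = a[:, :, co*Bc:(co+1)*Bc]                    # divided input data
            wb = w[:, :, co*Bc:(co+1)*Bc, do*Bd:(do+1)*Bd]   # divided weight
            for x in range(Xo):
                for y in range(Yo):
                    ao = ab[x*s:x*s+K, y*s:y*s+K, :]
                    f[x, y, do*Bd:(do+1)*Bd] += np.einsum('ijc,ijcd->d', ao, wb)
    return f
```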
[Expanding data for convolution operations]
The NN circuit 100 performs the convolution operation by expanding the input data a and the weights w in the convolution operation of the convolution layer 210.
FIG. 3 is a diagram explaining the data expansion of the convolution operation.
The divided input data a(x+i, y+j, co) is expanded into vector data having Bc elements. The elements of the divided input data a are indexed by ci (0 ≤ ci < Bc). In the following description, the divided input data a expanded into vector data for each i and j is also referred to as the "input vector A". The input vector A has as elements the divided input data a(x+i, y+j, co×Bc) to the divided input data a(x+i, y+j, co×Bc+(Bc-1)).
The divided weight w(i, j, co, do) is expanded into matrix data having Bc×Bd elements. The elements of the divided weight w expanded into the matrix data are indexed by ci and di (0 ≤ di < Bd). In the following description, the divided weight w expanded into matrix data for each i and j is also referred to as the "weight matrix W". The weight matrix W has as elements the divided weight w(i, j, co×Bc, do×Bd) to the divided weight w(i, j, co×Bc+(Bc-1), do×Bd+(Bd-1)).
Vector data is calculated by multiplying the input vector A by the weight matrix W. The output data f(x, y, do) can be obtained by shaping the vector data calculated for each i, j, and co into a three-dimensional tensor. By expanding the data in this way, the convolution operation of the convolution layer 210 can be performed as a multiplication of vector data and matrix data.
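A sketch of this expansion: for each i, j, and co, the Bc-element input vector A is multiplied by the Bc×Bd weight matrix W, and the Bd partial sums are accumulated into the output. The shapes and the loop order are illustrative assumptions, not taken from the specification:

```python
import numpy as np

def conv_as_vector_matrix(a, w, Bc, Bd, s=1):
    """Convolution as repeated input-vector x weight-matrix products."""
    X, Y, C = a.shape
    K, _, _, D = w.shape
    Xo, Yo = (X - K) // s + 1, (Y - K) // s + 1
    f = np.zeros((Xo, Yo, D), dtype=np.int32)
    for x in range(Xo):
        for y in range(Yo):
            for i in range(K):
                for j in range(K):
                    for co in range(C // Bc):
                        A = a[x*s+i, y*s+j, co*Bc:(co+1)*Bc]       # input vector A
                        for do in range(D // Bd):
                            W = w[i, j, co*Bc:(co+1)*Bc, do*Bd:(do+1)*Bd]  # weight matrix W
                            f[x, y, do*Bd:(do+1)*Bd] += A @ W      # Bd partial sums O(di)
    return f
```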
[NN circuit 100]
FIG. 4 is a diagram showing the overall configuration of the NN circuit 100 according to this embodiment.
The NN circuit 100 includes a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a controller 6, an IFU 7, and at least one neural network calculation core 10 (hereinafter also referred to as "NN calculation core 10").
The NN circuit 100 can implement a plurality of NN calculation cores 10. The NN circuit 100 illustrated in FIG. 4 can implement up to four NN calculation cores 10. The plurality of NN calculation cores 10 constitute a "neural network calculation multi-core 10M" (hereinafter also referred to as "NN calculation multi-core 10M") that cooperatively executes at least a part of the calculations of the CNN 200. In this embodiment, the plurality of NN calculation cores 10 are daisy-chained. The number of NN calculation cores 10 that can be implemented in the NN circuit 100 may be five or more.
The DMAC 3 is connected to the external bus EB and transfers data between an external memory 120 such as a DRAM and the NN calculation cores 10. The DMAC 3 transfers data read from the external memory 120 to one of the plurality of NN calculation cores 10. The DMAC 3 may be capable of transferring, or of broadcasting, the same data read from the external memory 120 to a plurality of NN calculation cores 10.
The controller 6 is connected to the external bus EB and operates as a slave of the external host CPU 110. The controller 6 has a bus bridge 60 and a register 61.
The bus bridge 60 relays bus accesses from the external bus EB to the internal bus IB. The bus bridge 60 also relays write requests and read requests from the external host CPU 110 to the register 61.
The register 61 has parameter registers and status registers. The parameter registers are registers that control the operation of the NN circuit 100. The status registers are registers that indicate the status of the NN circuit 100, including the instruction sequence pointer and the number of instructions of each module. The status registers may be configured to include a semaphore S. The external host CPU 110 can access the register 61 via the bus bridge 60 of the controller 6.
The controller 6 is connected to each block of the NN circuit 100 (the DMAC 3, the IFU 7, and the NN calculation cores 10) via the internal bus IB. The external host CPU 110 can access each block of the NN circuit 100 via the controller 6. For example, the external host CPU 110 can issue instructions to the NN calculation cores 10 via the controller 6. Each block can also update the status registers (which may include the semaphore S) of the controller 6 via the internal bus IB. The status registers may be configured to be updated via dedicated wiring connected to each block.
The IFU (instruction fetch unit) 7 reads instruction commands for the blocks of the NN circuit 100 (the DMAC 3 and the NN calculation cores 10) from the external memory 120 via the external bus EB based on instructions from the external host CPU 110. The IFU 7 then transfers the read instruction commands to the corresponding blocks of the NN circuit 100 (the DMAC 3 and the NN calculation cores 10). In this embodiment, the instruction commands are stored in the external memory 120 in a compressed state (hereinafter also referred to as "compressed instruction commands"). The IFU 7 reads the compressed instruction commands.
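A rough sketch of this flow is shown below. The actual compression format of the instruction commands is not described at this point in the specification, so zlib and the command strings here are stand-ins chosen only to illustrate fetching compressed streams per block and restoring them on the receiving side:

```python
import zlib

# Hypothetical stand-in: per-block compressed command streams in external memory.
external_memory = {
    "dmac":     zlib.compress(b"LOAD a -> mem1; LOAD w -> wmem"),
    "conv":     zlib.compress(b"CONV layer1; CONV layer3"),
    "quantize": zlib.compress(b"QUANT layer2; QUANT layer4"),
}

def ifu_fetch(block_name):
    """IFU role: read the compressed instruction commands for one block."""
    return external_memory[block_name]

def decompress_commands(compressed):
    """Decompressor role: restore the original instruction command stream."""
    return zlib.decompress(compressed).decode().split("; ")

for block in ("dmac", "conv", "quantize"):
    print(block, decompress_commands(ifu_fetch(block)))
```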
[NN calculation core 10]
FIG. 5 is a diagram showing the overall configuration of the NN calculation core 10.
The NN calculation core 10 includes a first memory 1, a second memory 2, a convolution operation circuit 4, and a quantization operation circuit 5. The NN calculation core 10 is characterized in that the convolution operation circuit 4 and the quantization operation circuit 5 form a loop via the first memory 1 and the second memory 2.
The first memory 1 is a rewritable memory, such as a volatile memory composed of, for example, an SRAM (static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the internal bus IB. The external host CPU 110 can input data to and output data from the NN calculation core 10 by writing and reading data to and from the first memory 1.
The first memory 1 is connected to an input port of the convolution operation circuit 4, and the convolution operation circuit 4 can read data from the first memory 1. The first memory 1 is also loop-connected (C1) to an output port of the quantization operation circuit 5, and the quantization operation circuit 5 can write data to the first memory 1. Furthermore, the first memory 1 can receive data transfers over an inter-core connection (C2) with another NN calculation core 10, and the other NN calculation core 10 connected by the inter-core connection (C2) can write data to the first memory 1. In this embodiment, a daisy-chain connection is used as an example of the inter-core connection (C2).
The second memory 2 is a rewritable memory, such as a volatile memory composed of, for example, an SRAM (static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the internal bus IB. The external host CPU 110 can input data to and output data from the NN calculation core 10 by writing and reading data to and from the second memory 2.
The second memory 2 is connected to an input port of the quantization operation circuit 5, and the quantization operation circuit 5 can read data from the second memory 2. The second memory 2 is also connected to an output port of the convolution operation circuit 4, and the convolution operation circuit 4 can write data to the second memory 2.
The convolution operation circuit 4 is a circuit that performs the convolution operation in the convolution layer 210 of the trained CNN 200. The convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs the convolution operation on the input data a. The convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter also referred to as "convolution operation output data") to the second memory 2.
The quantization operation circuit 5 is a circuit that performs at least a part of the quantization operation in the quantization operation layer 220 of the trained CNN 200. The quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2 and performs the quantization operation (an operation including at least quantization among pooling, batch normalization, the activation function, and quantization) on the output data f of the convolution operation.
The quantization operation circuit 5 writes the output data of the quantization operation (hereinafter also referred to as "quantization operation output data") to the loop-connected (C1) first memory 1. The quantization operation circuit 5 can also transfer data to another NN calculation core 10 via the inter-core connection (C2), and can output the quantization operation output data to another NN calculation core 10 connected by the inter-core connection (C2).
Since the NN calculation core 10 has the first memory 1, the second memory 2, and the like, the number of transfers of duplicated data in data transfers by the DMAC 3 from an external memory such as a DRAM can be reduced. This makes it possible to significantly reduce the power consumption and the processing load caused by memory accesses.
[Operation Example 1 of the NN calculation core 10]
FIG. 6 is a timing chart showing an operation example of the NN calculation core 10.
The DMAC 3 stores the input data a of layer 1 in the first memory 1. The DMAC 3 may divide the input data a of layer 1 and transfer it to the first memory 1 in accordance with the order of the convolution operations performed by the convolution operation circuit 4.
The convolution operation circuit 4 reads the input data a of layer 1 stored in the first memory 1 and performs the layer-1 convolution operation shown in FIG. 1 on it. The output data f of the layer-1 convolution operation is stored in the second memory 2.
The quantization operation circuit 5 reads the layer-1 output data f stored in the second memory 2 and performs the layer-2 quantization operation on it. The output data of the layer-2 quantization operation is stored in the first memory 1.
The convolution operation circuit 4 reads the output data of the layer-2 quantization operation stored in the first memory 1 and performs the layer-3 convolution operation using it as the input data a. The output data f of the layer-3 convolution operation is stored in the second memory 2.
The convolution operation circuit 4 reads the output data of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1 and performs the convolution operation of layer 2M-1 using it as the input data a. The output data f of the convolution operation of layer 2M-1 is stored in the second memory 2.
The quantization operation circuit 5 reads the output data f of layer 2M-1 stored in the second memory 2 and performs the quantization operation of layer 2M on it. The output data of the quantization operation of layer 2M is stored in the first memory 1.
The convolution operation circuit 4 reads the output data of the quantization operation of layer 2M stored in the first memory 1 and performs the convolution operation of layer 2M+1 using it as the input data a. The output data f of the convolution operation of layer 2M+1 is stored in the second memory 2.
The convolution operation circuit 4 and the quantization operation circuit 5 perform their operations alternately, advancing the computation of the CNN 200 shown in FIG. 1. In the NN calculation core 10, the convolution operation circuit 4 performs the convolution operations of layer 2M-1 and layer 2M+1 by time division, and the quantization operation circuit 5 performs the quantization operations of layer 2M-2 and layer 2M by time division. Therefore, the circuit scale of the NN calculation core 10 is significantly smaller than in a case where separate convolution operation circuits 4 and quantization operation circuits 5 are implemented for each layer.
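A minimal sketch of this alternation, with the convolution and quantization operations modeled as placeholder functions and the two memories as variables that are ping-ponged between layers:

```python
def run_layers(input_data, num_layers, conv, quantize):
    """Alternate the two circuits as in FIG. 6: odd layers are convolution
    (first memory -> second memory), even layers are quantization
    (second memory -> first memory)."""
    first_memory = input_data      # holds input data a
    second_memory = None           # holds convolution output data f
    for layer in range(1, num_layers + 1):
        if layer % 2 == 1:
            second_memory = conv(first_memory)
        else:
            first_memory = quantize(second_memory)
    return second_memory if num_layers % 2 == 1 else first_memory

# Placeholder operations standing in for the hardware circuits:
result = run_layers("a", 4, conv=lambda a: f"conv({a})",
                    quantize=lambda f: f"quant({f})")
print(result)                      # quant(conv(quant(conv(a))))
```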
The NN calculation core 10 computes the CNN 200, which has a multi-layer structure, with circuits formed in a loop. This loop-shaped circuit configuration allows the NN calculation core 10 to use hardware resources efficiently. Because the NN calculation core 10 forms its circuits in a loop, the parameters of the convolution operation circuit 4 and the quantization operation circuit 5, which change from layer to layer, are updated as appropriate.
If the computation of the CNN 200 includes operations that the NN calculation core 10 cannot perform, the NN calculation core 10 transfers intermediate data to an external computing device such as the external host CPU 110. After the external computing device performs operations on the intermediate data, the results are input to the first memory 1 or the second memory 2, and the NN calculation core 10 resumes computation on them.
[Operation Example 2 of the NN calculation core 10]
FIG. 7 is a timing chart showing another operation example of the NN calculation core 10.
The NN calculation core 10 may divide the input data a into partial tensors and operate on the partial tensors by time division. The method of division into partial tensors and the number of divisions are not particularly limited.
FIG. 7 shows an operation example in which the input data a is decomposed into two partial tensors, referred to as the "first partial tensor a1" and the "second partial tensor a2". For example, the convolution operation of layer 2M-1 is decomposed into the convolution operation corresponding to the first partial tensor a1 (denoted "layer 2M-1 (a1)" in FIG. 7) and the convolution operation corresponding to the second partial tensor a2 (denoted "layer 2M-1 (a2)" in FIG. 7).
The convolution operation and the quantization operation corresponding to the first partial tensor a1 and those corresponding to the second partial tensor a2 can be performed independently, as shown in FIG. 7.
The convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 (the operation denoted layer 2M-1 (a1) in FIG. 7). Thereafter, the convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 (the operation denoted layer 2M-1 (a2) in FIG. 7). Meanwhile, the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the first partial tensor a1 (the operation denoted layer 2M (a1) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of layer 2M corresponding to the first partial tensor a1 in parallel.
Next, the convolution operation circuit 4 performs the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 (the operation denoted layer 2M+1 (a1) in FIG. 7), while the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the second partial tensor a2 (the operation denoted layer 2M (a2) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M corresponding to the second partial tensor a2 in parallel.
Because the operations corresponding to the first partial tensor a1 and those corresponding to the second partial tensor a2 are independent, the NN calculation core 10 may also, for example, perform the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M+2 corresponding to the second partial tensor a2 in parallel. That is, the convolution operation and the quantization operation that the NN calculation core 10 performs in parallel are not limited to operations of consecutive layers.
By dividing the input data a into partial tensors, the NN calculation core 10 can operate the convolution operation circuit 4 and the quantization operation circuit 5 in parallel. As a result, the time during which the convolution operation circuit 4 and the quantization operation circuit 5 wait is reduced, and the operation processing efficiency of the NN calculation core 10 is improved. Although the number of divisions in the operation example of FIG. 7 is two, the NN calculation core 10 can likewise operate the convolution operation circuit 4 and the quantization operation circuit 5 in parallel when the number of divisions is greater than two.
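A sketch of the FIG. 7 style schedule for one layer pair, modeling only the pairing of the two circuits per time slot; the actual control is performed by instruction commands, not by such a scheduler:

```python
def pipeline_schedule(num_parts, m):
    """Pairs the convolution circuit (layer 2M-1) and the quantization
    circuit (layer 2M) so that they work on different partial tensors
    in the same time slot, as in FIG. 7."""
    slots = []
    for t in range(num_parts + 1):
        conv_op = f"layer {2*m-1} conv (a{t+1})" if t < num_parts else "idle"
        quant_op = f"layer {2*m} quant (a{t})" if t > 0 else "idle"
        slots.append((conv_op, quant_op))
    return slots

for conv_op, quant_op in pipeline_schedule(num_parts=2, m=1):
    print(f"conv: {conv_op:22s} | quant: {quant_op}")
```

With two partial tensors this prints the overlap described above: while the quantization circuit processes layer 2M (a1), the convolution circuit already processes layer 2M-1 (a2).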
For example, when the input data a is divided into a "first partial tensor a1", a "second partial tensor a2", and a "third partial tensor a3", the NN calculation core 10 may perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of layer 2M corresponding to the third partial tensor a3 in parallel. The order of the operations is changed as appropriate depending on how the input data a is stored in the first memory 1 and the second memory 2.
As a method of operating on the partial tensors, the example described above (method 1) performs the operations on the partial tensors of one layer in the convolution operation circuit 4 or the quantization operation circuit 5 before performing the operations on the partial tensors of the next layer. For example, as shown in FIG. 7, the convolution operation circuit 4 performs the convolution operations of layer 2M-1 corresponding to the first partial tensor a1 and the second partial tensor a2 (the operations denoted layer 2M-1 (a1) and layer 2M-1 (a2) in FIG. 7), and then performs the convolution operations of layer 2M+1 corresponding to the first partial tensor a1 and the second partial tensor a2 (the operations denoted layer 2M+1 (a1) and layer 2M+1 (a2) in FIG. 7).
However, the method of operating on the partial tensors is not limited to this. The operations may instead be performed on some of the partial tensors across a plurality of layers before operating on the remaining partial tensors (method 2). For example, the convolution operation circuit 4 may perform the convolution operations of layer 2M-1 and layer 2M+1 corresponding to the first partial tensor a1, and then perform the convolution operations of layer 2M-1 and layer 2M+1 corresponding to the second partial tensor a2.
The method of operating on the partial tensors may also combine method 1 and method 2. When method 2 is used, however, the operations must be performed in accordance with the dependency relations concerning the operation order of the partial tensors.
[NN calculation multi-core 10M]
FIG. 8 is a diagram showing the NN calculation multi-core 10M.
The NN calculation multi-core 10M illustrated in FIG. 8 includes two daisy-chained NN calculation cores 10. When the two NN calculation cores 10 are distinguished, they are referred to as the "first NN calculation core 10A" and the "second NN calculation core 10B". In FIG. 8, the first memory 1 is abbreviated as "A", the convolution operation circuit 4 as "C", the second memory 2 as "F", and the quantization operation circuit 5 as "Q".
Specifically, the quantization operation circuit 5 of the first NN calculation core 10A and the first memory 1 of the second NN calculation core 10B are daisy-chain connected (C2). The quantization operation circuit 5 of the first NN calculation core 10A can write the quantization operation output data to the loop-connected (C1) first memory 1 of the first NN calculation core 10A and/or to the daisy-chain connected (C2) first memory 1 of the second NN calculation core 10B.
Likewise, the quantization operation circuit 5 of the second NN calculation core 10B and the first memory 1 of the first NN calculation core 10A are daisy-chain connected (C2). The quantization operation circuit 5 of the second NN calculation core 10B can write the quantization operation output data to the loop-connected (C1) first memory 1 of the second NN calculation core 10B and/or to the daisy-chain connected (C2) first memory 1 of the first NN calculation core 10A.
When the NN calculation multi-core 10M includes three or more NN calculation cores 10, the plurality of NN calculation cores 10 are likewise daisy-chained. The quantization operation circuit 5 of each NN calculation core 10 other than the final-stage NN calculation core 10 is daisy-chain connected (C2) to the first memory 1 of the following NN calculation core 10. The quantization operation circuit 5 of the final-stage NN calculation core 10 is daisy-chain connected (C2) to the first memory 1 of the first-stage NN calculation core 10. The plurality of NN calculation cores 10 are thus formed in a daisy-chain loop.
In one NN calculation core 10, the first memory (A) 1, the convolution operation circuit (C) 4, the second memory (F) 2, and the quantization operation circuit (Q) 5 are connected in a loop. In the NN calculation multi-core 10M, these elements are connected in a daisy-chain loop so that the first memory (A) 1, the convolution operation circuit (C) 4, the second memory (F) 2, and the quantization operation circuit (Q) 5 are arranged repeatedly in the same order.
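A minimal sketch of this daisy-chain loop with two cores, using identity functions as stand-ins for the convolution and quantization operations:

```python
class Core:
    """One NN calculation core: first memory -> conv -> second memory ->
    quantization, with the quantization output routed either back to this
    core's first memory (loop C1) or to the next core's first memory
    (daisy chain C2)."""
    def __init__(self, name, conv, quantize):
        self.name, self.conv, self.quantize = name, conv, quantize
        self.first_memory = None
        self.next_core = None              # daisy-chain neighbor

    def step(self, to_next=False):
        second_memory = self.conv(self.first_memory)
        out = self.quantize(second_memory)
        target = self.next_core if (to_next and self.next_core) else self
        target.first_memory = out          # C2 when to_next, otherwise C1

# Two cores in a daisy-chain loop, as in FIG. 8 (identity ops as stand-ins).
core_a = Core("10A", conv=lambda x: x, quantize=lambda x: x)
core_b = Core("10B", conv=lambda x: x, quantize=lambda x: x)
core_a.next_core, core_b.next_core = core_b, core_a

core_a.first_memory = "layer-1 input"
core_a.step(to_next=True)                  # 10A's output feeds 10B's first memory
print(core_b.first_memory)
```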
The plurality of NN calculation cores 10 constituting the NN calculation multi-core 10M need not have the same hardware configuration. For example, the capacity and configuration of the first memory 1 of the first NN calculation core 10A may differ from those of the first memory 1 of the second NN calculation core 10B. Likewise, the configuration of the quantization operation circuit 5 of the first NN calculation core 10A may differ from that of the quantization operation circuit 5 of the second NN calculation core 10B.
Next, each component of the NN circuit 100 will be described in detail.
[DMAC3]
FIG. 9 is an internal block diagram of the DMAC 3.
The DMAC 3 has a data transfer circuit 31 and a state controller 32. Because the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, once an instruction command is input, it can perform DMA data transfer without requiring an external controller.
The data transfer circuit 31 is connected to the external bus EB and performs DMA data transfer between an external memory 120 such as a DRAM and the NN calculation cores 10. The number of DMA channels of the data transfer circuit 31 is not limited; for example, the first NN calculation core 10A and the second NN calculation core 10B may each have a dedicated DMA channel.
The state controller 32 controls the state of the data transfer circuit 31. The state controller 32 is also connected to the controller 6 via the internal bus IB. The state controller 32 has an instruction queue 33 and a control circuit 34.
The instruction queue 33 is a queue in which the instruction commands C3 for the DMAC 3 are stored, and is composed of, for example, a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the IFU 7 or the internal bus IB.
The control circuit 34 is a state machine that decodes the instruction commands C3 and sequentially controls the data transfer circuit 31 based on them. The control circuit 34 may be implemented as a logic circuit or as a CPU controlled by software.
FIG. 10 is a state transition diagram of the control circuit 34.
When an instruction command C3 is input to the instruction queue 33 (Not empty), the control circuit 34 transitions from the idle state ST1 to the decode state ST2.
In the decode state ST2, the control circuit 34 decodes the instruction command C3 output from the instruction queue 33. The control circuit 34 also reads the semaphore S stored in the register 61 of the controller 6 and determines whether the operation of the data transfer circuit 31 instructed by the instruction command C3 can be executed. If it cannot be executed (Not ready), the control circuit 34 waits (Wait) until it becomes executable. If it can be executed (ready), the control circuit 34 transitions from the decode state ST2 to the execution state ST3.
In the execution state ST3, the control circuit 34 controls the data transfer circuit 31 so that it performs the operation instructed by the instruction command C3. When the operation of the data transfer circuit 31 is finished, the control circuit 34 removes the executed instruction command C3 from the instruction queue 33 and updates the semaphore S stored in the register 61 of the controller 6. If there are instructions in the instruction queue 33 (Not empty), the control circuit 34 transitions from the execution state ST3 to the decode state ST2. If there are no instructions in the instruction queue 33 (empty), the control circuit 34 transitions from the execution state ST3 to the idle state ST1.
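A minimal sketch of this state machine in Python; the command representation and the semaphore check are placeholders (in the hardware, the semaphore S lives in the register 61 of the controller 6):

```python
from collections import deque

IDLE, DECODE, EXECUTE = "ST1_idle", "ST2_decode", "ST3_execute"

def run_dmac(instruction_queue, semaphore_ready, transfer):
    """FIG. 10: idle while the queue is empty, decode the next command and
    wait on the semaphore until it is executable, execute it, then remove
    it from the queue and decide the next state from the queue contents."""
    state = IDLE
    while True:
        if state == IDLE:
            if not instruction_queue:          # empty -> stay idle (stop here)
                return
            state = DECODE                     # Not empty -> ST2
        elif state == DECODE:
            command = instruction_queue[0]
            if not semaphore_ready(command):   # Not ready -> Wait
                continue
            state = EXECUTE                    # ready -> ST3
        elif state == EXECUTE:
            transfer(command)                  # instructed DMA transfer
            instruction_queue.popleft()        # remove finished command C3
            state = DECODE if instruction_queue else IDLE

run_dmac(deque(["C3: DRAM -> first memory"]), lambda c: True, print)
```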
[Convolution operation circuit 4]
 FIG. 11 is an internal block diagram of the convolution operation circuit 4.
 The convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, a state controller 44, and an instruction decompressor 49. Because it has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, the convolution operation circuit 4 can perform a convolution operation without an external controller once an instruction command is input.
 The weight memory 41 is a memory in which the weights w used in the convolution operation are stored; it is a rewritable memory, for example a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the weights w required for the convolution operation to the weight memory 41 by DMA transfer.
 FIG. 12 is an internal block diagram of the multiplier 42.
 The multiplier 42 multiplies the input vector A by the weight matrix W. As described above, the input vector A is vector data having Bc elements, in which the divided input data a(x+i, y+j, co) is expanded for each i and j. The weight matrix W is matrix data having Bc×Bd elements, in which the divided weights w(i, j, co, do) are expanded for each i and j. The multiplier 42 has Bc×Bd product-sum operation units 47 and can perform the multiplication of the input vector A by the weight matrix W in parallel.
 The multiplier 42 reads the input vector A and the weight matrix W required for the multiplication from the first memory 1 and the weight memory 41, and performs the multiplication. The multiplier 42 outputs Bd product-sum operation results O(di).
 FIG. 13 is an internal block diagram of the product-sum operation unit 47.
 The product-sum operation unit 47 multiplies an element A(ci) of the input vector A by an element W(ci, di) of the weight matrix W. The product-sum operation unit 47 also adds the multiplication result to the result S(ci, di) passed from another product-sum operation unit 47, and outputs the addition result S(ci+1, di). The element A(ci) is a 2-bit unsigned integer (0, 1, 2, 3). The element W(ci, di) is a 1-bit signed integer (0, 1), where the value "0" represents +1 and the value "1" represents -1.
 The product-sum operation unit 47 has an inverter 47a, a selector 47b, and an adder 47c. The product-sum operation unit 47 performs the multiplication using only the inverter 47a and the selector 47b, without using a multiplier. When the element W(ci, di) is "0", the selector 47b selects the element A(ci) as input. When the element W(ci, di) is "1", the selector 47b selects the complement of the element A(ci) produced by the inverter 47a. The element W(ci, di) is also input to the carry-in of the adder 47c. When the element W(ci, di) is "0", the adder 47c outputs the value obtained by adding the element A(ci) to S(ci, di). When W(ci, di) is "1", the adder 47c outputs the value obtained by subtracting the element A(ci) from S(ci, di).
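 In software terms, each unit adds either +A(ci) or -A(ci) to the incoming partial sum. The Python sketch below is an illustrative reading aid (the function name and argument layout are assumptions); it mirrors the hardware trick of realizing -A(ci) as the one's complement of A(ci) plus a carry-in of 1.

```python
def product_sum_unit(a: int, w_bit: int, s_in: int) -> int:
    """One product-sum operation unit 47.

    a     : 2-bit unsigned activation element A(ci), i.e. 0..3
    w_bit : 1-bit weight element W(ci, di); 0 encodes +1, 1 encodes -1
    s_in  : partial sum S(ci, di) from the neighboring unit
    Returns the partial sum S(ci+1, di).
    """
    assert 0 <= a <= 3 and w_bit in (0, 1)
    if w_bit == 0:
        return s_in + a           # selector 47b passes A(ci); carry-in is 0
    return s_in + (~a + 1)        # inverted A(ci) plus carry-in 1, i.e. s_in - a
```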
 FIG. 14 is an internal block diagram of the accumulator circuit 43.
 The accumulator circuit 43 accumulates the product-sum operation results O(di) of the multiplier 42 in the second memory 2. The accumulator circuit 43 has Bd accumulator units 48 and can accumulate the Bd product-sum operation results O(di) in the second memory 2 in parallel.
 FIG. 15 is an internal block diagram of the accumulator unit 48.
 The accumulator unit 48 has an adder 48a and a mask unit 48b. The adder 48a adds an element O(di) of the product-sum operation result O to the partial sum, stored in the second memory 2, that is an intermediate result of the convolution operation shown in Equation 1. The addition result is 16 bits per element; it is not limited to 16 bits per element, however, and may be, for example, 15 or 17 bits per element.
 The adder 48a writes the addition result to the same address in the second memory 2. When the initialization signal clear is asserted, the mask unit 48b masks the output from the second memory 2 so that zero is added to the element O(di). The initialization signal clear is asserted when no intermediate partial sum is stored in the second memory 2.
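 The per-element behavior of the accumulator unit 48 can be sketched as follows; the function name is an illustrative assumption, and the 16-bit wrap-around mirrors the element width used in this embodiment.

```python
def accumulator_unit(o_di: int, mem_word: int, clear: bool) -> int:
    """Add O(di) to the partial sum read from the second memory 2."""
    partial = 0 if clear else mem_word   # mask unit 48b zeroes the memory output
    return (partial + o_di) & 0xFFFF     # adder 48a; written back to the same address
```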
 When the convolution operation by the multiplier 42 and the accumulator circuit 43 is completed, the output data f(x, y, do) is stored in the second memory 2.
 The state controller 44 controls the states of the multiplier 42 and the accumulator circuit 43. The state controller 44 is also connected to the controller 6 via the internal bus IB. The state controller 44 has an instruction queue 45 and a control circuit 46.
 The instruction queue 45 is a queue in which instruction commands C4 for the convolution operation circuit 4 are stored; it is configured, for example, as a FIFO memory. Instruction commands C4 are written to the instruction queue 45 via the IFU 7 or via the internal bus IB.
 The control circuit 46 is a state machine that decodes the instruction commands C4 and controls the multiplier 42 and the accumulator circuit 43 based on them. The control circuit 46 has the same configuration as the control circuit 34 of the state controller 32 of the DMAC 3.
 FIG. 16 is an internal block diagram of the instruction decompressor 49.
 The instruction decompressor 49 restores (decompresses) the instruction commands C4 from compressed instruction commands in which the instruction commands C4 have been compressed. The instruction decompressor 49 has a decompressor 49a and a ring buffer 49b. The decompressor 49a decodes the compressed instruction commands input from the IFU 7 and restores the instruction commands C4 based on the data stored in the ring buffer 49b. The ring buffer 49b is a ring-shaped buffer memory; it is not limited to a ring-shaped buffer memory, however, and may be a buffer memory of another type.
 FIG. 17 is a diagram showing an example of the compressed instruction commands decompressed by the instruction decompressor 49.
 A Push instruction has an opcode field OF and an instruction field IF. The opcode field OF of a Push instruction stores an opcode indicating that the instruction is a Push instruction. The instruction field IF stores an original instruction. The Push instruction stores the original instruction in the ring buffer 49b and outputs the original instruction to the instruction queue 45. The original instructions include, for example, an instruction that sets the input data a, an instruction that sets the weights w, and an instruction that sets the output of the convolution operation output data.
 A Copy instruction has an opcode field OF, a seek field SF, and a count field CF. The opcode field OF of a Copy instruction stores an opcode indicating that the instruction is a Copy instruction. The seek field SF stores a seek value indicating an address in the ring buffer 49b. The count field CF stores a count value indicating the number of instructions to copy. The Copy instruction outputs to the instruction queue 45 the instructions stored in the ring buffer 49b from the address indicated by the seek value onward, for the number of instructions indicated by the count value.
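 A minimal Python model of this Push/Copy scheme follows; the class layout, the buffer depth, and the use of a plain list in place of the instruction queue 45 are illustrative assumptions.

```python
class InstructionDecompressor:
    """Behavioral model of the decompressor 49a with ring buffer 49b."""

    def __init__(self, depth: int = 256):     # depth is an assumed parameter
        self.ring = [None] * depth             # ring buffer 49b
        self.head = 0                          # next write position
        self.queue = []                        # stands in for instruction queue 45

    def push(self, instruction):
        """Push: store the original instruction and emit it downstream."""
        self.ring[self.head % len(self.ring)] = instruction
        self.head += 1
        self.queue.append(instruction)

    def copy(self, seek: int, count: int):
        """Copy: replay `count` stored instructions starting at address `seek`."""
        for i in range(count):
            self.queue.append(self.ring[(seek + i) % len(self.ring)])
```

 For example, a pattern of commands emitted once with push(...) can later be replayed with a single copy(seek, count), so the repeated pattern never has to be fetched again over the external bus EB.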
 The convolution operation circuit 4 of this embodiment can execute convolution operations without requiring an external controller, but to increase the flexibility of the convolution operation it is preferable that the operation executed by a single instruction command C4 can be specified at a fine granularity. As an example, by specifying one instruction command C4 that executes a product of single elements (1×1) of the convolution operation and combining a plurality of such commands, a variety of convolution operations can be realized, such as convolution operations using different weight filters. On the other hand, such finely specified instruction commands C4 increase the total number of instruction commands C4, which causes problems such as an increase in the usage of the external memory 120 and pressure on the bandwidth of the external bus EB. To solve this problem, the convolution operation circuit 4 of this embodiment uses compressed instruction commands in which the instruction commands C4 are compressed.
 The instruction commands C4 for the convolution operation circuit 4 tend to consist of runs of instructions that repeatedly perform convolution operations, so similar instruction commands C4 tend to appear in succession within a short period. Therefore, by storing instructions in the ring buffer 49b with the Push instruction described above and copying the stored instructions with the Copy instruction, the number of compressed instruction commands obtained by compressing the instruction commands C4 for the convolution operation circuit 4 can be reduced.
 Note that the instruction commands C4 for the convolution operation circuit 4 that are input to the instruction decompressor 49 are compressed in advance by a tool, such as a compiler, that generates the instruction commands C4.
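 The compression algorithm on the tool side is not specified in this description. Purely as an illustrative assumption, a greedy compressor could emit a Copy command whenever a run of upcoming instructions already sits in the ring buffer, and a Push command otherwise:

```python
def compress(instructions):
    """Greedy, illustrative compressor producing ("Push", insn) and
    ("Copy", seek, count) tuples. Assumes the whole stream fits in the
    ring buffer, so eviction is not modeled."""
    ring, out, i = [], [], 0
    while i < len(instructions):
        best_len, best_seek = 0, 0
        for seek in range(len(ring)):          # longest match already in the ring
            n = 0
            while (i + n < len(instructions) and seek + n < len(ring)
                   and ring[seek + n] == instructions[i + n]):
                n += 1
            if n > best_len:
                best_len, best_seek = n, seek
        if best_len >= 2:                      # a Copy pays off for runs of >= 2
            out.append(("Copy", best_seek, best_len))
            i += best_len
        else:
            out.append(("Push", instructions[i]))
            ring.append(instructions[i])       # mirror the decompressor's Push
            i += 1
    return out
```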
 FIG. 18 is a diagram showing a modified example of the instruction decompressor 49.
 The convolution operation circuit 4 may include a plurality of instruction decompressors 49. In the example shown in FIG. 18, three instruction decompressors 49 are provided in parallel, and an individual instruction queue 45 is provided for each instruction decompressor 49. The instruction commands C4 for the convolution operation circuit 4 are divided into three groups and input to the three instruction decompressors 49. For example, the instruction commands C4 for the convolution operation circuit 4 are divided into instructions that set the input data a, instructions that set the weights w, and instructions that set the output of the convolution operation output data. Because instructions of the same type are more likely to be input to a given instruction decompressor 49, the utilization efficiency of the ring buffer 49b improves and the compression ratio of the instructions improves. In addition, the control circuit 46 can efficiently read and execute the instructions stored in the three groups of instruction queues 45.
[Quantization operation circuit 5]
 FIG. 19 is an internal block diagram of the quantization operation circuit 5.
 The quantization operation circuit 5 has a quantization parameter memory 51, a vector operation circuit 52, a quantization circuit 53, a state controller 54, and an instruction decompressor 59. Because it has a state controller 54 dedicated to the vector operation circuit 52 and the quantization circuit 53, the quantization operation circuit 5 can perform a quantization operation without an external controller once an instruction command is input.
 The quantization parameter memory 51 is a memory in which the quantization parameters q used in the quantization operation are stored; it is a rewritable memory, for example a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the quantization parameters q required for the quantization operation to the quantization parameter memory 51 by DMA transfer.
 FIG. 20 is an internal block diagram of the vector operation circuit 52 and the quantization circuit 53.
 The vector operation circuit 52 performs operations on the output data f(x, y, do) stored in the second memory 2. The vector operation circuit 52 has Bd arithmetic units 57 and performs SIMD operations on the output data f(x, y, do) in parallel.
 FIG. 21 is a block diagram of the arithmetic unit 57.
 The arithmetic unit 57 has, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e. The arithmetic unit 57 may further include other operators found in known general-purpose SIMD arithmetic circuits.
 By combining the operators of the arithmetic units 57, the vector operation circuit 52 performs, on the output data f(x, y, do), at least one of the operations of the pooling layer 221, the Batch Normalization layer 222, and the activation function layer 223 in the quantization operation layer 220.
 The arithmetic unit 57 can use the ALU 57a to add the data stored in the register 57d to an element f(di) of the output data f(x, y, do) read from the second memory 2, and can store the addition result in the register 57d. By having the first selector 57b input "0" to the ALU 57a instead of the data stored in the register 57d, the arithmetic unit 57 can initialize the accumulation. For example, when the pooling region is 2×2, the shifter 57e can output the average of the accumulated values by shifting the output of the ALU 57a right by 2 bits. By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the average pooling operation shown in Equation 2.
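 For a 2×2 pooling region, the accumulate-then-shift behavior amounts to the short sketch below; the function name is an illustrative assumption.

```python
def average_pool_2x2(f00: int, f01: int, f10: int, f11: int) -> int:
    """2x2 average pooling on one arithmetic unit 57."""
    acc = 0                          # first selector 57b feeds 0 to initialize
    for f in (f00, f01, f10, f11):
        acc += f                     # ALU 57a adds; result kept in register 57d
    return acc >> 2                  # shifter 57e: a 2-bit right shift divides by 4
```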
 The arithmetic unit 57 can use the ALU 57a to compare the data stored in the register 57d with an element f(di) of the output data f(x, y, do) read from the second memory 2. The arithmetic unit 57 controls the second selector 57c according to the comparison result of the ALU 57a and can select the larger of the data stored in the register 57d and the element f(di). By having the first selector 57b input the minimum value that the element f(di) can take to the ALU 57a, the arithmetic unit 57 can initialize the comparison target to the minimum value. In this embodiment the element f(di) is a 16-bit signed integer, so the minimum value that it can take is "0x8000". By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the MAX pooling operation of Equation 3. In the MAX pooling operation, the shifter 57e does not shift the output of the second selector 57c.
 The arithmetic unit 57 can use the ALU 57a to perform a subtraction between the data stored in the register 57d and an element f(di) of the output data f(x, y, do) read from the second memory 2. The shifter 57e can shift the output of the ALU 57a left (that is, multiply) or right (that is, divide). By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the Batch Normalization operation of Equation 4.
 The arithmetic unit 57 can use the ALU 57a to compare an element f(di) of the output data f(x, y, do) read from the second memory 2 with the "0" selected by the first selector 57b. According to the comparison result of the ALU 57a, the arithmetic unit 57 can select and output either the element f(di) or the constant value "0" stored in advance in the register 57d. By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the ReLU operation of Equation 5.
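 Equations 4 and 5 are not reproduced in this excerpt; the following is therefore only a hedged sketch of the subtract-and-shift Batch Normalization and the compare-against-zero ReLU. The power-of-two scaling through `shift` is an assumption consistent with the shifter description above.

```python
def batch_norm_shift(f_di: int, beta: int, shift: int) -> int:
    """Sketch of shift-based Batch Normalization: subtract an offset
    (ALU 57a), then scale by a power of two (shifter 57e)."""
    centered = f_di - beta
    return centered << shift if shift >= 0 else centered >> -shift

def relu(f_di: int) -> int:
    """Sketch of ReLU: compare f(di) with 0 and select f(di) or the constant 0."""
    return f_di if f_di > 0 else 0
```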
 The vector operation circuit 52 can perform average pooling, MAX pooling, Batch Normalization, activation function operations, and combinations of these operations. Because the vector operation circuit 52 can perform general-purpose SIMD operations, it may also perform other operations required by the quantization operation layer 220, and it may perform operations other than those of the quantization operation layer 220.
 Note that the quantization operation circuit 5 does not have to include the vector operation circuit 52. When the quantization operation circuit 5 does not include the vector operation circuit 52, the output data f(x, y, do) is input to the quantization circuit 53.
 The quantization circuit 53 quantizes the output data of the vector operation circuit 52. As shown in FIG. 20, the quantization circuit 53 has Bd quantization units 58 and operates on the output data of the vector operation circuit 52 in parallel.
 FIG. 22 is an internal block diagram of the quantization unit 58.
 The quantization unit 58 quantizes an element in(di) of the output data of the vector operation circuit 52. The quantization unit 58 has a comparator 58a and an encoder 58b. The quantization unit 58 performs, on the output data of the vector operation circuit 52 (16 bits per element), the operation of the quantization layer 224 in the quantization operation layer 220 (Equation 6). The quantization unit 58 reads the necessary quantization parameters q (th0, th1, th2) from the quantization parameter memory 51 and compares the input in(di) with the quantization parameters q using the comparator 58a. The quantization unit 58 encodes the comparison result of the comparator 58a into 2 bits per element using the encoder 58b. Because α(c) and β(c) in Equation 4 are parameters that differ for each variable c, the quantization parameters q (th0, th1, th2) that reflect α(c) and β(c) differ for each in(di).
 The quantization unit 58 classifies the input in(di) into four regions (for example, in ≤ th0, th0 < in ≤ th1, th1 < in ≤ th2, th2 < in) by comparing it with the three thresholds th0, th1, and th2, and encodes the classification result into 2 bits for output. Depending on the setting of the quantization parameters q (th0, th1, th2), the quantization unit 58 can also perform Batch Normalization and activation function operations together with the quantization.
 By performing quantization with the threshold th0 set to β(c) of Equation 4 and the threshold differences (th1 - th0) and (th2 - th1) set to α(c) of Equation 4, the quantization unit 58 can carry out the Batch Normalization operation of Equation 4 together with the quantization. Increasing (th1 - th0) and (th2 - th1) decreases α(c); decreasing (th1 - th0) and (th2 - th1) increases α(c).
 The quantization unit 58 can also apply an activation function together with the quantization of the input in(di). For example, the quantization unit 58 saturates the output value in the regions where in(di) ≤ th0 and th2 < in(di). By setting the quantization parameters q so that the output becomes nonlinear, the quantization unit 58 can perform the activation function operation together with the quantization.
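 Taken together, the threshold comparison and 2-bit encoding behave like the sketch below; the function name and signature are illustrative assumptions.

```python
def quantize_2bit(x: int, th0: int, th1: int, th2: int) -> int:
    """Quantization unit 58: classify a 16-bit input into four regions with
    three thresholds and encode the region in 2 bits (0..3). Choosing th0 as
    beta(c) and the gaps (th1 - th0) and (th2 - th1) as alpha(c) folds Batch
    Normalization into this step; saturation at both ends realizes the
    activation function."""
    if x <= th0:
        return 0
    if x <= th1:
        return 1
    if x <= th2:
        return 2
    return 3
```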
 The state controller 54 controls the states of the vector operation circuit 52 and the quantization circuit 53. The state controller 54 is also connected to the controller 6 via the internal bus IB. The state controller 54 has an instruction queue 55 and a control circuit 56.
 The instruction queue 55 is a queue in which instruction commands C5 for the quantization operation circuit 5 are stored; it is configured, for example, as a FIFO memory. Instruction commands C5 are written to the instruction queue 55 via the IFU 7 or via the internal bus IB.
 The control circuit 56 is a state machine that decodes the instruction commands C5 and controls the vector operation circuit 52 and the quantization circuit 53 based on them. The control circuit 56 has the same configuration as the control circuit 34 of the state controller 32 of the DMAC 3.
 The instruction decompressor 59 restores (decompresses) the instruction commands C5 from compressed instruction commands in which the instruction commands C5 have been compressed. The instruction decompressor 59 has the same configuration as the instruction decompressor 49 of the convolution operation circuit 4.
 The quantization operation circuit 5 writes quantization operation output data having Bd elements to the first memory 1. A preferred relationship between Bd and Bc is shown in Equation 10, in which n is an integer.
[Equation 10]
[Controller 6]
 The controller 6 transfers instruction commands transferred from the external host CPU 110, via the internal bus IB, to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5. The controller 6 may have an instruction memory that stores the instruction commands for each circuit.
 The controller 6 is connected to the external bus EB and operates as a slave of the external host CPU 110. The controller 6 has registers 61 including a parameter register and a status register. The parameter register is a register that controls the operation of the NN circuit 100. The status register is a register that indicates the status of the NN circuit 100, including the semaphore S.
 According to the neural network circuit 100 of this embodiment, the NN circuit 100, which can be embedded in embedded devices such as IoT devices, can be operated with high performance. By connecting a plurality of NN calculation cores 10, a larger number of neural network operations can be performed efficiently and at high speed.
 The first embodiment of the present invention has been described above in detail with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like within a scope that does not depart from the gist of the present invention. The components shown in the above embodiment and the modified examples can be combined as appropriate.
(Variation 1)
 In the above embodiment, the instruction commands that control the neural network circuit 100 were described as requiring one instruction command per operation, but the form of the instruction commands is not limited to this. The instruction commands may take a form in which a plurality of operations are executed by one or more instruction commands. Specifically, a series of consecutive 1×1 convolution operations may be executed based on a plurality of instruction commands. The plurality of instruction commands include at least an instruction that determines the range (offset and step) of the elements A(ci) of the input vector A held in the first memory 1, an instruction that determines the range (offset and step) of the elements W(ci, di) of the weight matrix W held in the weight memory 41, an instruction that determines the storage position (offset and step) in the second memory 2 where the product-sum operation results O(di) are stored, and an instruction that determines the number of repetitions (filter size) of the 1×1 convolution operation, as sketched below. By executing a plurality of operations with fewer instruction commands in this way, the total number of instruction commands can also be reduced. Furthermore, by using the configuration of this embodiment, the number of instruction commands can be reduced even further, lessening the increase in the usage of the external memory 120 and the pressure on the bandwidth of the external bus EB.
(Variation 2)
 In the above embodiment, the first memory 1 and the second memory 2 are separate memories, but the form of the first memory 1 and the second memory 2 is not limited to this. The first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.
(Variation 3)
 The data input to the NN circuit 100 described in the above embodiment is not limited to a single format and may consist of still images, moving images, audio, text, numerical values, and combinations thereof. The data input to the NN circuit 100 is not limited to the measurement results of physical quantity measuring instruments, such as optical sensors, thermometers, Global Positioning System (GPS) instruments, angular velocity instruments, and anemometers, that may be mounted on the edge device in which the NN circuit 100 is provided. The data may be combined with different kinds of information, such as peripheral information received from peripheral devices by wired or wireless communication (base station information, information on vehicles and ships, weather information, and information on congestion), financial information, and personal information.
(Variation 4)
 The edge device in which the NN circuit 100 is provided is assumed to be a battery-driven mobile device, such as a communication device such as a mobile phone, a smart device such as a personal computer, a digital camera, a game device, or a robot product, but is not limited to these. Effects not obtained by other prior examples can also be obtained by using the circuit in products with strict limits on the peak power that can be supplied, for example via Power over Ethernet (PoE), or with strong demands for reduced heat generation or long-duration operation. For example, applying the circuit to in-vehicle cameras mounted on vehicles and ships, or to surveillance cameras installed in public facilities and on roads, not only enables long-duration recording but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying the circuit to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing and construction sites.
(Variation 5)
 Part or all of the NN circuit 100 may be realized using one or more processors. For example, part or all of the input layer or the output layer may be realized by software processing on a processor. The part of the input layer or the output layer realized by software processing is, for example, data normalization or conversion; this makes it possible to support various input and output formats. The software executed by the processor may be configured to be rewritable via communication means or external media.
(Variation 6)
 Part of the processing of the CNN 200 may be realized by combining the NN circuit 100 with a Graphics Processing Unit (GPU) or the like on a cloud. By performing further processing on the cloud in addition to the processing performed on the edge device in which the NN circuit 100 is provided, or by performing processing on the edge device in addition to the processing on the cloud, the NN circuit 100 can realize more complex processing with fewer resources. With such a configuration, the NN circuit 100 can reduce the amount of communication between the edge device and the cloud by distributing the processing.
 The effects described in this specification are merely explanatory or illustrative and are not limiting. In other words, the technology according to the present disclosure may achieve other effects that are apparent to those skilled in the art from the description in this specification, in addition to or in place of the above effects.
 The present invention can be applied to neural network operations.
200 Convolutional neural network (CNN)
100 Neural network circuit (NN circuit)
10 Neural network calculation core (NN calculation core)
10A First neural network calculation core (first NN calculation core)
10B Second neural network calculation core (second NN calculation core)
10M Neural network calculation multi-core (NN calculation multi-core)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
7 Instruction fetch unit (IFU)

Claims (7)

  1.  A neural network circuit comprising:
     a convolution operation circuit that performs a convolution operation on input data,
     wherein the convolution operation circuit has an instruction decompressor that restores an instruction command for the convolution operation circuit, the instruction command operating the convolution operation circuit, from a compressed instruction command in which the instruction command is compressed.
  2.  The neural network circuit according to claim 1, further comprising:
     a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit,
     wherein the quantization operation circuit has an instruction decompressor that restores an instruction command for the quantization operation circuit, the instruction command operating the quantization operation circuit, from a compressed instruction command in which the instruction command is compressed.
  3.  The neural network circuit according to claim 2, further comprising:
     an instruction fetch unit that reads, from a memory, instruction commands that operate the convolution operation circuit or the quantization operation circuit,
     wherein the instruction fetch unit inputs the instruction commands to the instruction decompressors.
  4.  The neural network circuit according to claim 1,
     wherein the convolution operation circuit has a plurality of the instruction decompressors, and
     the compressed instruction commands are divided and input to different instruction decompressors.
  5.  The neural network circuit according to claim 2, further comprising:
     a first memory that stores the input data; and
     a second memory that stores the convolution operation output data,
     wherein quantization operation output data of the quantization operation circuit is stored in the first memory, and
     the quantization operation output data stored in the first memory is input to the convolution operation circuit as the input data.
  6.  The neural network circuit according to claim 5,
     wherein the first memory, the convolution operation circuit, the second memory, and the quantization operation circuit are formed in a loop.
  7.  A method for controlling a neural network circuit, the neural network circuit comprising:
     a convolution operation circuit that performs a convolution operation on input data;
     a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit; and
     an instruction fetch unit that reads, from a memory, a compressed instruction command in which an instruction command for the convolution operation circuit that operates the convolution operation circuit is compressed, and a compressed instruction command in which an instruction command for the quantization operation circuit that operates the quantization operation circuit is compressed,
     the method comprising the steps of:
     causing the instruction fetch unit to read the instruction command for the convolution operation circuit and the instruction command for the quantization operation circuit separately from the memory, and to supply the instruction commands separately to the convolution operation circuit and the quantization operation circuit;
     causing the convolution operation circuit and the quantization operation circuit to restore the instruction commands from the compressed instruction commands; and
     operating the convolution operation circuit and the quantization operation circuit in parallel based on the restored instruction commands.
PCT/JP2023/042052 2022-11-22 2023-11-22 Neural network circuit and neural network computing method WO2024111644A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022186308A JP2024075106A (en) 2022-11-22 2022-11-22 Neural network circuit and neural network operation method
JP2022-186308 2022-11-22

Publications (1)

Publication Number Publication Date
WO2024111644A1 true WO2024111644A1 (en) 2024-05-30

Family

ID=91196083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/042052 WO2024111644A1 (en) 2022-11-22 2023-11-22 Neural network circuit and neural network computing method

Country Status (2)

Country Link
JP (1) JP2024075106A (en)
WO (1) WO2024111644A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09265397A (en) * 1996-03-29 1997-10-07 Hitachi Ltd Processor for vliw instruction
US20160140686A1 (en) * 2014-11-18 2016-05-19 Intel Corporation Efficient preemption for graphics processors
JP2021082285A (en) * 2019-11-15 2021-05-27 インテル コーポレイション Data locality enhancement for graphics processing units
JP2022030486A (en) * 2020-08-07 2022-02-18 LeapMind株式会社 Neural network circuit and method for controlling neural network circuit

Also Published As

Publication number Publication date
JP2024075106A (en) 2024-06-03
