WO2024111644A1 - Neural network circuit and neural network computing method

Info

Publication number: WO2024111644A1
Application number: PCT/JP2023/042052
Authority: WO, WIPO (PCT)
Prior art keywords: circuit, quantization, instruction, memory, convolution
Priority date: 2022-11-22 (Japanese Patent Application No. 2022-186308)
Other languages: French (fr), Japanese (ja)
Inventor: 潤一 金井
Original assignee: LeapMind株式会社
Application filed by: LeapMind株式会社
Publication: WO2024111644A1 (en)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention aims to provide a high-performance neural network circuit and neural network calculation method that can be incorporated into embedded devices such as IoT devices.
  • a neural network circuit includes a convolution circuit that performs a convolution operation on input data, and the convolution circuit has an instruction decompressor that decompresses compressed instruction commands that are instruction commands for the convolution circuit that operate the convolution circuit.
  • the neural network operation method is a control method for a neural network circuit including a convolution operation circuit that performs a convolution operation on input data, a quantization operation circuit that performs a quantization operation on the convolution operation output data of the convolution operation circuit, and an instruction fetch unit that reads from a memory a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit, and a compressed instruction command obtained by compressing an instruction command for the quantization operation circuit that operates the quantization operation circuit, and includes the steps of: making the instruction fetch unit read the instruction command for the convolution operation circuit and the instruction command for the quantization operation circuit separately from the memory and supply the instruction commands separately to the convolution operation circuit and the quantization operation circuit; making the convolution operation circuit and the quantization operation circuit restore the instruction command from the compressed instruction command; and making the convolution operation circuit and the quantization operation circuit operate in parallel based on the restored instruction command.
  • FIG. 1 is a diagram showing a convolutional neural network 200 (hereinafter referred to as "CNN 200").
  • the calculations performed by a neural network circuit 100 (hereinafter, referred to as "NN circuit 100") according to the first embodiment are at least a part of the trained CNN 200 used during inference.
  • the CNN 200 is a multi-layer network including a convolution layer 210 that performs a convolution operation, a quantization operation layer 220 that performs a quantization operation, and an output layer 230. In at least a part of the CNN 200, the convolution layer 210 and the quantization operation layer 220 are alternately connected.
  • the CNN 200 is a model that is widely used for image recognition and video recognition.
  • the CNN 200 may further include a layer having other functions such as a fully connected layer.
  • FIG. 2 is a diagram illustrating the convolution operation performed by the convolution layer 210.
  • the convolution layer 210 performs a convolution operation on the input data a using a weight w.
  • the convolution layer 210 performs a multiply-and-accumulate operation on the input data a and the weight w.
  • the input data a (also called activation data or feature map) to the convolutional layer 210 is multidimensional data such as image data.
  • the input data a is a three-dimensional tensor consisting of elements (x, y, c).
  • the convolutional layer 210 of the CNN 200 performs a convolution operation on the low-bit input data a.
  • the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3).
  • the elements of the input data a may be, for example, 4-bit or 8-bit unsigned integers.
  • CNN 200 may further have an input layer that performs type conversion and quantization before convolutional layer 210.
  • the weights w (also called filters or kernels) of the convolutional layer 210 are multidimensional data having elements that are learnable parameters.
  • the weights w are four-dimensional tensors consisting of elements (i, j, c, d).
  • the weights w have d three-dimensional tensors (hereinafter referred to as "weights wo") consisting of elements (i, j, c).
  • the weights w in the trained CNN 200 are trained data.
  • the convolutional layer 210 of the CNN 200 performs convolution operations using low-bit weights w.
  • the elements of the weights w are 1-bit signed integers (0, 1), where a value of "0" represents +1 and a value of "1" represents -1.
  • the quantization operation layer 220 performs quantization and other operations on the output of the convolution operation output by the convolution layer 210.
  • the quantization operation layer 220 has a pooling layer 221, a batch normalization layer 222, an activation function layer 223, and a quantization layer 224.
  • the batch normalization layer 222 normalizes the data distribution of the output data of the convolution layer 210 or the pooling layer 221, for example, by performing an operation as shown in Equation 4.
  • in Equation 4, u represents the input tensor, v represents the output tensor, α represents the scale, and β represents the bias.
  • α and β are trained constant vectors.
  • the input data a(x+i, y+j, c) in Equation 1 is divided in the c-axis direction by size Bc, and is represented by the divided input data a(x+i, y+j, co).
  • the divided input data a(x+i, y+j, co) is also simply referred to as "divided input data a".
  • the weight w(i,j,c,d) in Equation 1 is divided by the size Bc in the c-axis direction and the size Bd in the d-axis direction, and is expressed as the divided weight w(i,j,co,do).
  • the divided weight w(i, j, co, do) is also simply referred to as the "divided weight w".
  • the output data f(x, y, do) divided by size Bd is calculated using Equation 9.
  • the final output data f(x, y, d) can be calculated by combining the divided output data f(x, y, do).
  • the NN circuit 100 performs the convolution operation by expanding the input data a and the weights w in the convolution operation of the convolution layer 210.
  • FIG. 3 is a diagram for explaining the expansion of data in a convolution operation.
  • Divided input data a(x+i, y+j, co) is expanded into vector data having Bc elements.
  • the elements of divided input data a are indexed by ci (0 ≤ ci < Bc).
  • divided input data a expanded into vector data for each i and j is also referred to as "input vector A".
  • Input vector A has elements from divided input data a(x+i, y+j, co × Bc) to divided input data a(x+i, y+j, co × Bc + (Bc - 1)).
  • the divided weights w(i, j, co, do) are expanded into matrix data with Bc × Bd elements.
  • the elements of the divided weights w expanded into the matrix data are indexed by ci and di (0 ≤ di < Bd).
  • the divided weights w expanded into matrix data for each i and j are also referred to as the "weight matrix W".
  • the elements of the weight matrix W are the divided weights w(i, j, co × Bc, do × Bd) to w(i, j, co × Bc + (Bc - 1), do × Bd + (Bd - 1)).
  • Vector data is calculated by multiplying the input vector A by the weight matrix W.
  • the vector data calculated for i, j, and co is shaped into a three-dimensional tensor to obtain output data f(x, y, do).
  • the convolution operation of the convolution layer 210 can be performed by multiplying the vector data by the matrix data.
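As an illustration of the expansion described above, the blocked convolution can be written as a minimal NumPy sketch. It assumes Equation 1 (not reproduced on this page) is a standard stride-1 convolution without padding and that C and D are multiples of Bc and Bd; the function name and loop order are illustrative, not taken from the patent.

```python
import numpy as np

def conv_blocked(a, w, Bc, Bd):
    """Blocked convolution sketch: for each (i, j, co), an input vector A
    (Bc elements) is multiplied by a weight matrix W (Bc x Bd elements) and
    the Bd product-sum results O(di) are accumulated into f(x, y, do)."""
    X, Y, C = a.shape            # input data a(x, y, c)
    I, J, _, D = w.shape         # weights w(i, j, c, d)
    f = np.zeros((X - I + 1, Y - J + 1, D), dtype=np.int64)
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            for do in range(D // Bd):
                acc = np.zeros(Bd, dtype=np.int64)
                for co in range(C // Bc):
                    for i in range(I):
                        for j in range(J):
                            A = a[x + i, y + j, co * Bc:(co + 1) * Bc]
                            W = w[i, j, co * Bc:(co + 1) * Bc,
                                  do * Bd:(do + 1) * Bd]
                            acc += A @ W      # vector-matrix multiply of FIG. 3
                f[x, y, do * Bd:(do + 1) * Bd] = acc
    return f
```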
  • FIG. 4 is a diagram showing the overall configuration of an NN circuit 100 according to this embodiment.
  • the NN circuit 100 includes a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a controller 6, an IFU 7, and at least one neural network calculation core 10 (hereinafter also referred to as "NN calculation core 10").
  • the NN circuit 100 can implement multiple NN calculation cores 10.
  • the NN circuit 100 illustrated in FIG. 4 can implement up to four NN calculation cores 10.
  • the multiple NN calculation cores 10 constitute a "neural network calculation multi-core 10M" (hereinafter also referred to as "NN calculation multi-core 10M") that cooperates to execute at least some of the calculations of the CNN 200.
  • the multiple NN calculation cores 10 are daisy-chained. Note that the number of NN calculation cores 10 that can be implemented in the NN circuit 100 may be five or more.
  • the DMAC3 is connected to the external bus EB, and transfers data between an external memory 120 such as a DRAM and the NN calculation core 10.
  • the DMAC3 transfers data read from the external memory 120 to one of the multiple NN calculation cores 10.
  • the DMAC3 may be capable of transferring the same data read from the external memory 120 to multiple NN calculation cores 10, or may be capable of broadcasting the data.
  • the controller 6 is connected to the external bus EB and operates as a slave to the external host CPU 110.
  • the controller 6 has a bus bridge 60 and a register 61.
  • the bus bridge 60 relays bus access from the external bus EB to the internal bus IB.
  • the bus bridge 60 also relays write and read requests from the external host CPU 110 to the register 61.
  • the register 61 has a parameter register and a status register.
  • the parameter register is a register that controls the operation of the NN circuit 100.
  • the status register contains a pointer to the instruction sequence of each module, the number of instructions, etc., and is a register that indicates the status of the NN circuit 100.
  • the status register may also be configured to contain a semaphore S.
  • the external host CPU 110 can access the register 61 via the bus bridge 60 of the controller 6.
  • the controller 6 is connected to each block of the NN circuit 100 (DMAC3, IFU7, NN calculation core 10) via the internal bus IB.
  • the external host CPU 110 can access each block of the NN circuit 100 via the controller 6.
  • the external host CPU 110 can issue commands to the NN calculation core 10 via the controller 6.
  • each block can update a status register (which may include a semaphore S) held by the controller 6 via the internal bus IB.
  • the status register may be configured to be updated via a dedicated wiring connected to each block.
  • the IFU 7 reads instruction commands for each block (DMAC3, NN calculation core 10) of the NN circuit 100 from the external memory 120 via the external bus EB based on instructions from the external host CPU 110.
  • the IFU 7 also transfers the read instruction commands to each corresponding block (DMAC3, NN calculation core 10) of the NN circuit 100.
  • the instruction commands are stored in a compressed state (hereinafter also referred to as "compressed instruction commands") in the external memory 120.
  • the IFU 7 reads the compressed instruction commands.
  • FIG. 5 is a diagram showing the overall configuration of the NN calculation core 10.
  • the NN calculation core 10 includes a first memory 1, a second memory 2, a convolution calculation circuit 4, and a quantization calculation circuit 5.
  • the NN calculation core 10 is characterized in that the convolution calculation circuit 4 and the quantization calculation circuit 5 are formed in a loop shape via the first memory 1 and the second memory 2.
  • the first memory 1 is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the internal bus IB.
  • the external host CPU 110 can input and output data to and from the NN calculation core 10 by writing and reading data to and from the first memory 1.
  • the first memory 1 is connected to the input port of the convolution calculation circuit 4, and the convolution calculation circuit 4 can read data from the first memory 1.
  • the first memory 1 is also loop-connected (C1) to the output port of the quantization calculation circuit 5, and the quantization calculation circuit 5 can write data to the first memory 1.
  • the first memory 1 is also capable of data transfer via an inter-core connection (C2) between it and another NN calculation core 10, and the other NN calculation core 10 connected to the inter-core connection (C2) can write data to the first memory 1.
  • a daisy-chain connection is used as an example of the inter-core connection (C2).
  • the second memory 2 is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the internal bus IB.
  • the external host CPU 110 can input and output data to and from the NN calculation core 10 by writing and reading data to and from the second memory 2.
  • the second memory 2 is connected to an input port of the quantization calculation circuit 5, and the quantization calculation circuit 5 can read data from the second memory 2.
  • the second memory 2 is connected to an output port of the convolution calculation circuit 4, and the convolution calculation circuit 4 can write data to the second memory 2.
  • the convolution operation circuit 4 is a circuit that performs convolution operations in the convolution layer 210 of the trained CNN 200.
  • the convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs a convolution operation on the input data a.
  • the convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter also referred to as "convolution operation output data") to the second memory 2.
  • the quantization operation circuit 5 is a circuit that performs at least a part of the quantization operation in the quantization operation layer 220 of the trained CNN 200.
  • the quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2, and performs a quantization operation (an operation that includes at least quantization among pooling, batch normalization, activation function, and quantization) on the output data f of the convolution operation.
  • the quantization calculation circuit 5 writes the output data of the quantization calculation (hereinafter also referred to as "quantization calculation output data") to the first memory 1 connected in a loop (C1).
  • the quantization calculation circuit 5 can transfer data to other NN calculation cores 10 via the inter-core connection (C2), and the quantization calculation circuit 5 can output the quantization calculation output data to other NN calculation cores 10 connected in an inter-core connection (C2).
  • the NN calculation core 10 has a first memory 1, a second memory 2, etc., so that the number of times duplicate data is transferred can be reduced when data is transferred by the DMAC 3 from an external memory such as a DRAM. This makes it possible to significantly reduce the power consumption or processing load caused by memory access.
  • FIG. 6 is a timing chart showing an example of the operation of the NN calculation core 10.
  • the DMAC 3 stores the input data a of the layer 1 in the first memory 1.
  • the DMAC 3 may divide the input data a of the layer 1 and transfer it to the first memory 1 in accordance with the order of the convolution operation performed by the convolution operation circuit 4.
  • the convolution operation circuit 4 reads out the input data a of layer 1 stored in the first memory 1.
  • the convolution operation circuit 4 performs the convolution operation of layer 1 shown in FIG. 1 on the input data a of layer 1.
  • the output data f of the convolution operation of layer 1 is stored in the second memory 2.
  • the quantization calculation circuit 5 reads the output data f of layer 1 stored in the second memory 2.
  • the quantization calculation circuit 5 performs the quantization calculation of layer 2 on the output data f of layer 1.
  • the output data of the quantization calculation of layer 2 is stored in the first memory 1.
  • the convolution operation circuit 4 reads the output data of the quantization operation of layer 2 stored in the first memory 1.
  • the convolution operation circuit 4 performs the convolution operation of layer 3 using the output data of the quantization operation of layer 2 as input data a.
  • the output data f of the convolution operation of layer 3 is stored in the second memory 2.
  • the convolution circuit 4 reads the output data of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1.
  • the convolution circuit 4 performs the convolution operation of layer 2M-1 using the output data of the quantization operation of layer 2M-2 as input data a.
  • the output data f of the convolution operation of layer 2M-1 is stored in the second memory 2.
  • the quantization calculation circuit 5 reads the output data f of layer 2M-1 stored in the second memory 2.
  • the quantization calculation circuit 5 performs the quantization calculation of layer 2M on the output data f of the 2M-1 layer.
  • the output data of the quantization calculation of layer 2M is stored in the first memory 1.
  • the convolution operation circuit 4 reads the output data of the quantization operation of layer 2M stored in the first memory 1.
  • the convolution operation circuit 4 performs the convolution operation of layer 2M+1 using the output data of the quantization operation of layer 2M as input data a.
  • the output data f of the convolution operation of layer 2M+1 is stored in the second memory 2.
  • the convolution calculation circuit 4 and the quantization calculation circuit 5 alternately perform calculations to advance the calculations of the CNN 200 shown in FIG. 1.
  • the convolution calculation circuit 4 performs the convolution calculations of layers 2M-1 and 2M+1 by time sharing.
  • the quantization calculation circuit 5 performs the quantization calculations of layers 2M-2 and 2M by time sharing. Therefore, the circuit scale of the NN calculation core 10 is significantly smaller than when a separate convolution calculation circuit 4 and quantization calculation circuit 5 are implemented for each layer.
  • the NN calculation core 10 performs calculations for the CNN 200, which is a multi-layer structure of multiple layers, using a circuit formed in a loop.
  • the NN calculation core 10 can efficiently use hardware resources due to the loop circuit configuration. Since the NN calculation core 10 forms a circuit in a loop, the parameters in the convolution calculation circuit 4 and the quantization calculation circuit 5, which change in each layer, are updated appropriately.
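The alternating FIG. 6 schedule can be sketched as follows. `convolution` and `quantization` are placeholder functions standing in for the convolution operation circuit 4 and the quantization operation circuit 5; the sketch only models which memory each circuit reads and writes, under the assumption of an alternating sequence of odd convolution and even quantization layers.

```python
def convolution(layer, data):
    # placeholder for the convolution operation circuit 4 (odd layers 2M-1)
    return f"conv(layer {layer}: {data})"

def quantization(layer, data):
    # placeholder for the quantization operation circuit 5 (even layers 2M)
    return f"quant(layer {layer}: {data})"

def run_layers(num_layers, input_a):
    """Ping-pong schedule of FIG. 6: convolution reads the first memory 1
    and writes the second memory 2; quantization reads the second memory 2
    and writes back to the first memory 1 over the loop connection (C1)."""
    mem1, mem2 = input_a, None          # first memory 1 / second memory 2
    for layer in range(1, num_layers + 1, 2):
        mem2 = convolution(layer, mem1)               # layer 2M-1
        if layer + 1 <= num_layers:
            mem1 = quantization(layer + 1, mem2)      # layer 2M
    return mem1 if num_layers % 2 == 0 else mem2
```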
  • NN calculation core 10 transfers intermediate data to an external calculation device such as external host CPU 110. After the external calculation device performs calculations on the intermediate data, the results of the calculations by the external calculation device are input to first memory 1 and second memory 2. NN calculation core 10 resumes calculations on the results of the calculations by the external calculation device.
  • FIG. 7 is a timing chart showing another example of the operation of the NN calculation core 10.
  • the NN calculation core 10 may divide the input data a into partial tensors and perform operations on the partial tensors by time division.
  • the method of division into the partial tensors and the number of divisions are not particularly limited.
  • FIG. 7 shows an example of an operation when the input data a is decomposed into two partial tensors.
  • the decomposed partial tensors are called "first partial tensor a1" and "second partial tensor a2".
  • the convolution operation of layer 2M-1 is decomposed into a convolution operation corresponding to the first partial tensor a1 (indicated in FIG. 7 as "layer 2M-1 (a1)") and a convolution operation corresponding to the second partial tensor a2 (indicated in FIG. 7 as "layer 2M-1 (a2)").
  • the convolution and quantization operations corresponding to the first partial tensor a1 and the convolution and quantization operations corresponding to the second partial tensor a2 can be performed independently, as shown in FIG. 7.
  • the convolution operation circuit 4 performs a convolution operation of the layer 2M-1 corresponding to the first partial tensor a1 (operation shown by layer 2M-1 (a1) in FIG. 7). After that, the convolution operation circuit 4 performs a convolution operation of the layer 2M-1 corresponding to the second partial tensor a2 (operation shown by layer 2M-1 (a2) in FIG. 7). In parallel, the quantization operation circuit 5 performs a quantization operation of the layer 2M corresponding to the first partial tensor a1 (operation shown by layer 2M (a1) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of the layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of the layer 2M corresponding to the first partial tensor a1 in parallel.
  • the convolution operation circuit 4 performs a convolution operation of layer 2M+1 corresponding to the first partial tensor a1 (operation shown as layer 2M+1 (a1) in FIG. 7).
  • the quantization operation circuit 5 performs a quantization operation of layer 2M corresponding to the second partial tensor a2 (operation shown as layer 2M (a2) in FIG. 7).
  • the NN calculation core 10 can perform the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M corresponding to the second partial tensor a2 in parallel.
  • the convolution operation and quantization operation corresponding to the first partial tensor a1 and the convolution operation and quantization operation corresponding to the second partial tensor a2 can be performed independently. Therefore, the NN calculation core 10 may perform, for example, the convolution operation of the layer 2M-1 corresponding to the first partial tensor a1 and the quantization operation of the layer 2M+2 corresponding to the second partial tensor a2 in parallel. In other words, the convolution operation and quantization operation performed in parallel by the NN calculation core 10 are not limited to the operations of consecutive layers.
  • the NN calculation core 10 can operate the convolution calculation circuit 4 and the quantization calculation circuit 5 in parallel. As a result, the waiting time of the convolution calculation circuit 4 and the quantization calculation circuit 5 is reduced, improving the calculation processing efficiency of the NN calculation core 10.
  • the number of divisions is 2, but even if the number of divisions is greater than 2, the NN calculation core 10 can similarly operate the convolution calculation circuit 4 and the quantization calculation circuit 5 in parallel.
  • the NN calculation core 10 may perform in parallel a convolution operation of the layer 2M-1 corresponding to the second partial tensor a2 and a quantization operation of the layer 2M corresponding to the third partial tensor a3.
  • the order of the operations is appropriately changed depending on the storage status of the input data a in the first memory 1 and the second memory 2.
  • an example has been shown in which a partial tensor in the same layer is computed by the convolution computation circuit 4 or the quantization computation circuit 5, and then a partial tensor in the next layer is computed.
  • for example, the convolution computations of layer 2M-1 corresponding to the first partial tensor a1 and the second partial tensor a2 (computations shown by layer 2M-1 (a1) and layer 2M-1 (a2) in FIG. 7) are performed before the computations of the following layer (method 1).
  • the calculation method for the partial tensor is not limited to this.
  • the calculation method for the partial tensor may be a method in which some partial tensors in multiple layers are calculated and then the remaining partial tensors are calculated (method 2). For example, in the convolution calculation circuit 4, after performing a convolution calculation of the layer 2M-1 corresponding to the first partial tensor a1 and of the layer 2M+1 corresponding to the first partial tensor a1, a convolution calculation of the layer 2M-1 corresponding to the second partial tensor a2 and of the layer 2M+1 corresponding to the second partial tensor a2 may be performed.
  • the method of computing partial tensors may be a combination of methods 1 and 2. However, when using method 2, the computation must be performed according to the dependency relationship regarding the order of computation of partial tensors.
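A method-1 schedule like the one in FIG. 7 can be generated with the sketch below. Equal-length time slots and the `(layer, partial tensor index)` tuple format are assumptions for illustration; the quantization of a partial tensor is simply scheduled one slot behind its convolution.

```python
def pipeline_schedule(num_layer_pairs, num_parts):
    """Method-1 pipelining of FIG. 7: the quantization of (layer 2M, a_k)
    runs one slot behind the convolution of (layer 2M-1, a_k), so the two
    circuits work on different partial tensors in the same slot."""
    conv_jobs = [(2 * m - 1, p) for m in range(1, num_layer_pairs + 1)
                 for p in range(1, num_parts + 1)]
    quant_jobs = [(2 * m, p) for m in range(1, num_layer_pairs + 1)
                  for p in range(1, num_parts + 1)]
    schedule = []
    for t in range(len(conv_jobs) + 1):
        conv = conv_jobs[t] if t < len(conv_jobs) else None
        quant = quant_jobs[t - 1] if t >= 1 else None
        schedule.append((conv, quant))   # (layer, partial tensor index)
    return schedule

for slot in pipeline_schedule(num_layer_pairs=2, num_parts=2):
    print(slot)
# ((1, 1), None) -> ((1, 2), (2, 1)) -> ((3, 1), (2, 2)) -> ((3, 2), (4, 1)) -> (None, (4, 2))
```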
  • FIG. 8 is a diagram showing the NN computing multi-core 10M.
  • the NN computing multi-core 10M illustrated in Fig. 8 includes two daisy-chained NN computing cores 10. When distinguishing between the two NN computing cores 10, the two NN computing cores 10 are referred to as a "first NN computing core 10A" and a "second NN computing core 10B.”
  • in FIG. 8, the first memory 1 is abbreviated as "A", the convolution computing circuit 4 as "C", the second memory 2 as "F", and the quantization computing circuit 5 as "Q".
  • the quantization calculation circuit 5 of the first NN calculation core 10A and the first memory 1 of the second NN calculation core 10B are daisy-chain connected (C2).
  • the quantization calculation circuit 5 of the first NN calculation core 10A can write the quantization calculation output data to the first memory 1 of the first NN calculation core 10A, which is loop-connected (C1), and/or the first memory 1 of the second NN calculation core 10B, which is daisy-chain connected (C2).
  • the quantization calculation circuit 5 of the second NN calculation core 10B and the first memory 1 of the first NN calculation core 10A are daisy-chain connected (C2).
  • the quantization calculation circuit 5 of the second NN calculation core 10B can write the quantization calculation output data to the first memory 1 of the second NN calculation core 10B, which is loop-connected (C1), and/or the first memory 1 of the first NN calculation core 10A, which is daisy-chain connected (C2).
  • the multiple NN calculation cores 10 are daisy-chained.
  • the quantization calculation circuits 5 of the NN calculation cores 10 other than the final-stage NN calculation core 10 are daisy-chained (C2) with the first memory 1 of the subsequent-stage NN calculation core 10.
  • the quantization calculation circuit 5 of the final-stage NN calculation core 10 is daisy-chained (C2) with the first memory 1 of the first-stage NN calculation core 10.
  • the multiple NN calculation cores 10 are characterized by being formed in a daisy-chain loop (linked together).
  • the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are connected in a loop.
  • across the NN calculation multi-core 10M, these blocks are connected in a daisy-chain loop (linked together) so that the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are repeatedly arranged in the same order.
  • the multiple NN calculation cores 10 constituting the NN calculation multi-core 10M do not need to have the same hardware configuration.
  • the capacity and configuration of the first memory 1 of the first NN calculation core 10A may be different from the capacity and configuration of the first memory 1 of the second NN calculation core 10B.
  • the configuration of the quantization calculation circuit 5 of the first NN calculation core 10A may be different from the configuration of the quantization calculation circuit 5 of the second NN calculation core 10B.
  • FIG. 9 is an internal block diagram of the DMAC 3.
  • the DMAC 3 has a data transfer circuit 31 and a state controller 32.
  • the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, and when an instruction command is input, the DMAC 3 can perform DMA data transfer without requiring an external controller.
  • the data transfer circuit 31 is connected to the external bus EB, and performs DMA data transfer between an external memory 120 such as a DRAM and the NN calculation core 10.
  • the number of DMA channels of the data transfer circuit 31 is not limited.
  • the first NN calculation core 10A and the second NN calculation core 10B may each have a dedicated DMA channel.
  • the state controller 32 controls the state of the data transfer circuit 31.
  • the state controller 32 is also connected to the controller 6 via the internal bus IB.
  • the state controller 32 has an instruction queue 33 and a control circuit 34.
  • the instruction queue 33 is a queue in which instruction commands C3 for the DMAC3 are stored, and is configured, for example, as a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the IFU 7 or the internal bus IB.
  • the control circuit 34 is a state machine that decodes the instruction command C3 and sequentially controls the data transfer circuit 31 based on the instruction command C3.
  • the control circuit 34 may be implemented by a logic circuit or a CPU controlled by software.
  • FIG. 10 is a state transition diagram of the control circuit 34.
  • the control circuit 34 transitions from the idle state ST1 to the decode state ST2.
  • the control circuit 34 decodes the instruction command C3 output from the instruction queue 33.
  • the control circuit 34 also reads the semaphore S stored in the register 61 of the controller 6, and determines whether the operation of the data transfer circuit 31 instructed in the instruction command C3 can be executed. If it is not executable (Not ready), the control circuit 34 waits (Wait) until it is executable. If it is executable (Ready), the control circuit 34 transitions from the decode state ST2 to the execution state ST3.
  • control circuit 34 controls the data transfer circuit 31 to cause the data transfer circuit 31 to perform the operation instructed in the instruction command C3.
  • the control circuit 34 removes the executed instruction command C3 from the instruction queue 33 and updates the semaphore S stored in the register 61 of the controller 6. If there are instructions in the instruction queue 33 (Not empty), the control circuit 34 transitions from execution state ST3 to decode state ST2. If there are no instructions in the instruction queue 33 (Empty), the control circuit 34 transitions from execution state ST3 to idle state ST1.
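The FIG. 10 state transitions can be modeled as a small state machine. `semaphore_ready` and `execute` are illustrative callbacks standing in for the semaphore S check and the data transfer circuit 31; they are not interfaces defined by the patent.

```python
from collections import deque
from enum import Enum

class State(Enum):
    IDLE = 1
    DECODE = 2
    EXECUTE = 3

class ControlCircuit:
    def __init__(self, semaphore_ready, execute):
        self.queue = deque()               # instruction queue 33 (FIFO)
        self.state = State.IDLE
        self.semaphore_ready = semaphore_ready
        self.execute = execute

    def step(self):
        if self.state == State.IDLE and self.queue:
            self.state = State.DECODE
        elif self.state == State.DECODE:
            cmd = self.queue[0]            # decode instruction command C3
            if self.semaphore_ready(cmd):  # Ready; otherwise Wait in DECODE
                self.state = State.EXECUTE
        elif self.state == State.EXECUTE:
            self.execute(self.queue.popleft())  # run data transfer circuit 31
            # Not empty -> DECODE, Empty -> IDLE (semaphore S updated here)
            self.state = State.DECODE if self.queue else State.IDLE

cc = ControlCircuit(semaphore_ready=lambda cmd: True, execute=print)
cc.queue.extend(["C3: transfer input a", "C3: transfer weight w"])
while not (cc.state == State.IDLE and not cc.queue):
    cc.step()
```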
  • FIG. 11 is an internal block diagram of the convolution circuit 4.
  • the convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, a state controller 44, and an instruction decompressor 49.
  • the convolution operation circuit 4 has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, and when an instruction command is input, the convolution operation can be performed without the need for an external controller.
  • the weight memory 41 is a memory in which the weight w used in the convolution calculation is stored, and is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM).
  • the DMAC3 writes the weight w required for the convolution calculation to the weight memory 41 by DMA transfer.
  • FIG. 12 is an internal block diagram of the multiplier 42.
  • the multiplier 42 multiplies the input vector A by the weight matrix W.
  • the input vector A is vector data having Bc elements in which the divided input data a(x+i, y+j, co) is expanded for each i and j.
  • the weight matrix W is matrix data having Bc ⁇ Bd elements in which the divided weights w(i, j, co, do) are expanded for each i and j.
  • the multiplier 42 has Bc ⁇ Bd product-sum operation units 47 and can multiply the input vector A by the weight matrix W in parallel.
  • the multiplier 42 reads the input vector A and weight matrix W required for the multiplication from the first memory 1 and weight memory 41, and performs the multiplication.
  • the multiplier 42 outputs Bd product-sum operation results O(di).
  • FIG. 13 is an internal block diagram of the product-sum calculation unit 47.
  • the multiply-add unit 47 multiplies an element A(ci) of an input vector A by an element W(ci,di) of a weight matrix W.
  • the multiply-add unit 47 also adds the multiplication result to a multiplication result S(ci,di) of another multiply-add unit 47.
  • the multiply-add unit 47 outputs an addition result S(ci+1,di).
  • the element A(ci) is a 2-bit unsigned integer (0,1,2,3).
  • the element W(ci,di) is a 1-bit signed integer (0,1), where a value "0" represents +1 and a value "1" represents -1.
  • the multiply-and-accumulate unit 47 has an inverter 47a, a selector 47b, and an adder 47c.
  • the multiply-and-accumulate unit 47 performs multiplication using only the inverter 47a and the selector 47b, without using a multiplier.
  • when the element W(ci, di) is "0" (representing +1), the selector 47b selects the element A(ci) as the input.
  • when the element W(ci, di) is "1" (representing -1), the selector 47b selects the complement of the element A(ci) inverted by the inverter 47a.
  • the element W(ci, di) is also input to the carry-in of the adder 47c.
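The following sketch models one product-sum operation unit 47 behaviorally, assuming 16-bit wraparound arithmetic; the function name and bit-width handling are illustrative. Since the weight encodes +1 as 0 and -1 as 1, multiplying by -1 reduces to two's-complement negation, i.e. the inverter plus the weight bit on the adder's carry-in.

```python
def product_sum_unit(a_ci, w_ci_di, s_in, bits=16):
    """One product-sum operation unit 47: inverter 47a, selector 47b, and
    adder 47c with the weight bit W(ci, di) fed to the carry-in, so that
    -a = ~a + 1 is computed without a multiplier."""
    mask = (1 << bits) - 1
    inverted = ~a_ci & mask                           # inverter 47a
    selected = a_ci if w_ci_di == 0 else inverted     # selector 47b
    return (s_in + selected + w_ci_di) & mask         # adder 47c + carry-in

assert product_sum_unit(3, 0, 10) == 13   # 10 + (+1) * 3
assert product_sum_unit(3, 1, 10) == 7    # 10 + (-1) * 3
```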
  • FIG. 14 is an internal block diagram of the accumulator circuit 43.
  • the accumulator circuit 43 accumulates the product-sum operation results O(di) of the multiplier 42 in the second memory 2.
  • the accumulator circuit 43 has Bd accumulator units 48 and can accumulate the Bd product-sum operation results O(di) in parallel in the second memory 2.
  • FIG. 15 is an internal block diagram of the accumulator unit 48.
  • the accumulator unit 48 has an adder 48a and a mask unit 48b.
  • the adder 48a adds an element O(di) of the sum-of-products operation result O and a partial sum which is an intermediate result of the convolution operation shown in Equation 1 stored in the second memory 2.
  • the sum result is 16 bits per element.
  • the sum result is not limited to 16 bits per element, and may be, for example, 15 bits or 17 bits per element.
  • the adder 48a writes the result of the addition to the same address in the second memory 2.
  • the masking unit 48b masks the output from the second memory 2 and sets the addition target for element O(di) to zero.
  • the initialization signal clear is asserted when no intermediate partial sums are stored in the second memory 2.
  • the output data f(x, y, do) is stored in the second memory 2.
  • the state controller 44 controls the states of the multiplier 42 and the accumulator circuit 43.
  • the state controller 44 is also connected to the controller 6 via the internal bus IB.
  • the state controller 44 has an instruction queue 45 and a control circuit 46.
  • the instruction queue 45 is a queue in which the instruction command C4 for the convolution operation circuit 4 is stored, and is configured, for example, as a FIFO memory.
  • the instruction command C4 is written to the instruction queue 45 via the IFU 7 or the internal bus IB.
  • the control circuit 46 is a state machine that decodes the instruction command C4 and controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4.
  • the control circuit 46 has a similar configuration to the control circuit 34 of the state controller 32 of the DMAC3.
  • FIG. 16 is an internal block diagram of the instruction decompressor 49.
  • the instruction decompressor 49 decompresses the instruction command C4 from the compressed instruction command obtained by compressing the instruction command C4.
  • the instruction decompressor 49 includes a decompressor 49a and a ring buffer 49b.
  • the decompressor 49a decodes the compressed instruction command input from the IFU 7 and restores the instruction command C4 based on the data stored in the ring buffer 49b.
  • the ring buffer 49b is a ring-shaped buffer memory. Note that the ring buffer 49b is not limited to a ring-shaped buffer memory and may be a buffer memory of another type.
  • FIG. 17 is a diagram showing an example of a compressed instruction command that is decompressed by the instruction decompressor 49.
  • the Push instruction has an opcode field OF and an instruction field IF.
  • the opcode field OF of the Push instruction stores an opcode indicating that it is a Push instruction.
  • the instruction field IF stores an original instruction.
  • the Push instruction stores the original instruction in the ring buffer 49b and outputs the original instruction to the instruction queue 45.
  • the original instruction includes instructions such as an instruction to set input data a, an instruction to set weight w, and an instruction to set the output of convolution operation output data.
  • the Copy instruction has an opcode field OF, a seek field SF, and a count field CF.
  • the opcode field OF of the Copy instruction stores an opcode indicating that it is a Copy instruction.
  • the seek field SF stores a seek indicating an address in ring buffer 49b.
  • the count field CF stores a count indicating the number of instructions to copy.
  • the Copy instruction outputs to the instruction queue 45 the instructions stored in ring buffer 49b after the address indicated by the seek, up to the number of instructions indicated by the count.
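A minimal behavioral sketch of the Push/Copy decompression follows. The tuple encodings ("push", instruction) and ("copy", seek, count) and the ring size are illustrative stand-ins for the opcode field OF, instruction field IF, seek field SF, and count field CF.

```python
class InstructionDecompressor:
    """Push stores the original instruction in the ring buffer 49b and
    forwards it; Copy replays `count` stored instructions from `seek`."""
    def __init__(self, ring_size=64):
        self.ring = [None] * ring_size   # ring buffer 49b
        self.head = 0

    def decompress(self, compressed):
        queue = []                       # stands in for instruction queue 45
        for cmd in compressed:
            if cmd[0] == "push":         # ("push", original_instruction)
                _, instr = cmd
                self.ring[self.head % len(self.ring)] = instr
                self.head += 1
                queue.append(instr)
            elif cmd[0] == "copy":       # ("copy", seek, count)
                _, seek, count = cmd
                for k in range(count):
                    queue.append(self.ring[(seek + k) % len(self.ring)])
        return queue

d = InstructionDecompressor()
out = d.decompress([("push", "set input data a"), ("push", "set weight w"),
                    ("copy", 0, 2)])
assert out == ["set input data a", "set weight w",
               "set input data a", "set weight w"]
```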
  • the convolution operation circuit 4 of this embodiment can perform convolution operations without requiring an external controller, but in order to improve the degree of freedom of the convolution operation, it is preferable to be able to specify in detail the operation to be performed based on one instruction command C4.
  • by defining one instruction command C4 that executes a multiplication of one element (1x1) in a convolution operation, and by combining multiple such commands, it is possible to realize a variety of convolution operations, such as convolution operations using different weight filters.
  • specifying instruction commands C4 in detail increases the total number of instruction commands C4, which causes problems such as an increase in the amount of usage of the external memory 120 and pressure on the bandwidth of the external bus EB.
  • the convolution operation circuit 4 of this embodiment uses a compressed instruction command that compresses the instruction command C4.
  • the instruction commands C4 for the convolution operation circuit 4 tend to be successive instructions that repeatedly perform convolution operations, and similar instruction commands C4 tend to occur in succession within a short period of time. Therefore, by storing the instructions in the ring buffer 49b using the Push command described above, and then copying the instructions stored in the ring buffer 49b using the Copy command, it is possible to reduce the number of compressed instruction commands obtained by compressing the instruction commands C4 for the convolution operation circuit 4.
  • the instruction command C4 for the convolution operation circuit 4 that is input to the instruction decompressor 49 is compressed in advance by a tool such as a compiler that generates the instruction command C4.
  • FIG. 18 shows a modified example of the instruction decompressor 49.
  • the convolution circuit 4 may include a plurality of instruction decompressors 49. In the example shown in FIG. 18, three instruction decompressors 49 are provided in parallel. In this case, an individual instruction queue 45 corresponding to each instruction decompressor 49 is provided.
  • the instruction command C4 for the convolution circuit 4 is divided into three groups and input to the three instruction decompressors 49. For example, the instruction command C4 for the convolution circuit 4 is divided into an instruction for setting input data a, an instruction for setting weight w, and an instruction for setting the output of convolution output data.
  • control circuit 46 can efficiently read and execute the instructions stored in the instruction queue 45 divided into three groups.
  • FIG. 19 is an internal block diagram of the quantization calculation circuit 5.
  • the quantization operation circuit 5 has a quantization parameter memory 51, a vector operation circuit 52, a quantization circuit 53, a state controller 54, and an instruction decompressor 59.
  • the quantization operation circuit 5 has a state controller 54 dedicated to the vector operation circuit 52 and the quantization circuit 53, and when an instruction command is input, the quantization operation can be performed without the need for an external controller.
  • the quantization parameter memory 51 is a memory in which the quantization parameter q used in the quantization operation is stored, and is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM).
  • the DMAC3 writes the quantization parameter q required for the quantization operation to the quantization parameter memory 51 by DMA transfer.
  • FIG. 20 is an internal block diagram of the vector calculation circuit 52 and the quantization circuit 53.
  • the vector operation circuit 52 performs an operation on the output data f(x, y, do) stored in the second memory 2.
  • the vector operation circuit 52 has Bd operation units 57, and performs SIMD operations in parallel on the output data f(x, y, do).
  • FIG. 21 is a block diagram of the arithmetic unit 57.
  • the arithmetic unit 57 includes, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e.
  • the arithmetic unit 57 may further include other arithmetic units included in a known general-purpose SIMD arithmetic circuit.
  • the vector operation circuit 52 performs at least one of the operations of the pooling layer 221, the batch normalization layer 222, and the activation function layer 223 in the quantization operation layer 220 on the output data f(x, y, do) by combining the operators and the like contained in the operation unit 57.
  • the arithmetic unit 57 can add the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a.
  • the arithmetic unit 57 can store the addition result by the ALU 57a in the register 57d.
  • the arithmetic unit 57 can initialize the addition result by inputting "0" to the ALU 57a instead of the data stored in the register 57d by selecting the first selector 57b. For example, if the pooling area is 2x2, the shifter 57e can output the average value of the addition results by shifting the output of the ALU 57a to the right by 2 bits.
  • the vector calculation circuit 52 can perform the average pooling calculation shown in Equation 2 by repeating the above calculations by the Bd arithmetic units 57.
  • the arithmetic unit 57 can compare the data stored in the register 57d with the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a.
  • the arithmetic unit 57 controls the second selector 57c according to the comparison result by the ALU 57a, and can select the larger of the data stored in the register 57d and the element f(di).
  • the arithmetic unit 57 can initialize the comparison target by inputting the minimum possible value of the element f(di) to the ALU 57a via the first selector 57b.
  • since the element f(di) is a 16-bit signed integer, its minimum possible value is "0x8000" (-32768).
  • the vector calculation circuit 52 can perform the MAX pooling calculation of Equation 3 by repeating the above calculations by the Bd arithmetic units 57. Note that in the MAX pooling calculation, the shifter 57e does not shift the output of the second selector 57c.
  • the arithmetic unit 57 can subtract the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 using the ALU 57a.
  • the shifter 57e can shift the output of the ALU 57a to the left (i.e., multiplication) or right (i.e., division).
  • the vector arithmetic circuit 52 can perform the batch normalization calculation of Equation 4 by repeating the above calculations by the Bd arithmetic units 57.
  • the arithmetic unit 57 can compare the element f(di) of the output data f(x, y, do) read from the second memory 2 with the "0" selected by the first selector 57b using the ALU 57a. Depending on the comparison result by the ALU 57a, the arithmetic unit 57 can select and output either the element f(di) or the constant value "0" previously stored in the register 57d.
  • the vector arithmetic circuit 52 can perform the ReLU operation of Equation 5 by repeating the above calculations by the Bd arithmetic units 57.
  • the vector operation circuit 52 can perform average pooling, MAX pooling, batch normalization, activation function operations, and combinations of these operations. Since the vector operation circuit 52 can perform general-purpose SIMD operations, it may also perform other operations necessary for the operations in the quantization operation layer 220. In addition, the vector operation circuit 52 may also perform operations other than those in the quantization operation layer 220.
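The sketch below illustrates how the ALU 57a, the selectors, the register 57d, and the shifter 57e combine into the pooling and activation operations described above. It is a behavioral illustration under a 16-bit signed element assumption, not the circuit itself.

```python
def average_pool_2x2(window):
    """Accumulate with the ALU 57a into the register 57d, then divide by
    the 2x2 pooling area with a 2-bit right shift in the shifter 57e."""
    acc = 0                        # register 57d initialized via the "0" input
    for f_di in window:            # four elements of the pooling window
        acc = acc + f_di           # ALU 57a addition
    return acc >> 2                # shifter 57e: right shift by 2 bits = /4

def max_pool(window):
    acc = -(1 << 15)               # initialize to 0x8000, the 16-bit minimum
    for f_di in window:
        acc = max(acc, f_di)       # ALU compare + second selector 57c
    return acc

def relu(f_di):
    return f_di if f_di > 0 else 0  # compare with "0", select f(di) or "0"

assert average_pool_2x2([4, 8, 12, 16]) == 10
assert max_pool([-5, 3, -1]) == 3
assert relu(-7) == 0
```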
  • the quantization calculation circuit 5 does not have to have the vector calculation circuit 52. If the quantization calculation circuit 5 does not have the vector calculation circuit 52, the output data f(x, y, do) is input to the quantization circuit 53.
  • the quantization circuit 53 performs quantization on the output data of the vector calculation circuit 52. As shown in FIG. 20, the quantization circuit 53 has Bd quantization units 58, and performs calculations in parallel on the output data of the vector calculation circuit 52.
  • FIG. 22 is an internal block diagram of the quantization unit 58.
  • the quantization unit 58 quantizes the element in(di) of the output data of the vector operation circuit 52.
  • the quantization unit 58 has a comparator 58a and an encoder 58b.
  • the quantization unit 58 performs the operation (Equation 6) of the quantization layer 224 in the quantization operation layer 220 on the output data (16 bits/element) of the vector operation circuit 52.
  • the quantization unit 58 reads out the necessary quantization parameters q(th0, th1, th2) from the quantization parameter memory 51, and compares the input in(di) with the quantization parameter q by the comparator 58a.
  • the quantization unit 58 quantizes the comparison result by the comparator 58a to 2 bits/element by the encoder 58b. Since α(c) and β(c) in Equation 4 are parameters that differ for each c, the quantization parameters q(th0, th1, th2) that reflect α(c) and β(c) are parameters that differ for each in(di).
  • the quantization unit 58 classifies the input in (di) into four regions (e.g., in ⁇ th0, th0 ⁇ in ⁇ th1, th1 ⁇ in ⁇ th2, th2 ⁇ in) by comparing the input in (di) with three thresholds th0, th1, and th2, and outputs the classification result by encoding it into 2 bits.
  • the quantization unit 58 can also perform batch normalization and activation function calculations in addition to quantization by setting the quantization parameter q (th0, th1, th2).
  • the quantization unit 58 sets the threshold th0 as β(c) in Equation 4, and the threshold differences (th1 - th0) and (th2 - th1) as α(c) in Equation 4, and performs quantization, thereby enabling the batch normalization calculation shown in Equation 4 to be performed in conjunction with quantization.
  • by increasing (th1 - th0) and (th2 - th1), α(c) can be made smaller.
  • the quantization unit 58 can perform the activation function in conjunction with the quantization of the input in(di). For example, the quantization unit 58 saturates the output value in the region where in(di) ⁇ th0 and th2 ⁇ in(di). The quantization unit 58 can perform the calculation of the activation function in conjunction with the quantization by setting the quantization parameter q so that the output is nonlinear.
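A behavioral sketch of the quantization unit 58 follows, using the four regions of the comparison described above. The example thresholds derived from α(c) and β(c) are illustrative values, not taken from the patent.

```python
def quantize(in_di, th0, th1, th2):
    """Comparator 58a + encoder 58b: classify in(di) into four regions and
    emit a 2-bit code; the outer regions saturate, acting as an activation."""
    if in_di <= th0:
        return 0    # saturated low region
    elif in_di <= th1:
        return 1
    elif in_di <= th2:
        return 2
    else:
        return 3    # saturated high region

# illustrative per-channel thresholds folding batch normalization in:
# th0 derived from beta(c), threshold spacing derived from alpha(c)
beta_c, alpha_c = 100, 8
th0, th1, th2 = beta_c, beta_c + alpha_c, beta_c + 2 * alpha_c
assert [quantize(v, th0, th1, th2) for v in (50, 104, 112, 999)] == [0, 1, 2, 3]
```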
  • the state controller 54 controls the states of the vector operation circuit 52 and the quantization circuit 53.
  • the state controller 54 is also connected to the controller 6 via the internal bus IB.
  • the state controller 54 has an instruction queue 55 and a control circuit 56.
  • the instruction queue 55 is a queue in which the instruction command C5 for the quantization calculation circuit 5 is stored, and is configured, for example, as a FIFO memory.
  • the instruction command C5 is written to the instruction queue 55 via the IFU 7 or the internal bus IB.
  • the control circuit 56 is a state machine that decodes the instruction command C5 and controls the vector calculation circuit 52 and the quantization circuit 53 based on the instruction command C5.
  • the control circuit 56 has a similar configuration to the control circuit 34 of the state controller 32 of the DMAC3.
  • the instruction decompressor 59 restores (decompresses) the instruction command C5 from the compressed instruction command into which the instruction command C5 is compressed.
  • the instruction decompressor 59 has a configuration similar to that of the instruction decompressor 49 of the convolution operation circuit 4.
  • the quantization calculation circuit 5 writes quantization calculation output data having Bd elements to the first memory 1.
  • the preferred relationship between Bd and Bc is shown in Equation 10.
  • in Equation 10, n is an integer.
  • the controller 6 transfers the instruction command transferred from the external host CPU 110 via the internal bus IB to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.
  • the controller 6 may have an instruction memory that stores the instruction command for each circuit.
  • the controller 6 is connected to the external bus EB and operates as a slave to the external host CPU 110.
  • the controller 6 has registers 61 including a parameter register and a status register.
  • the parameter register is a register that controls the operation of the NN circuit 100.
  • the status register is a register that indicates the status of the NN circuit 100, including the semaphore S.
  • as described above, the NN circuit 100 is a high-performance neural network circuit that can be embedded in embedded devices such as IoT devices. By connecting multiple NN calculation cores 10, larger neural network calculations can be performed efficiently and quickly.
  • although the instruction command for controlling the neural network circuit 100 of the above embodiment is described as an example in which one instruction command is required for one calculation operation, the form of the instruction command is not limited to this.
  • the instruction command may be an embodiment in which multiple calculation operations can be executed by one or more instruction commands. Specifically, consecutive 1x1 convolution operations are executed based on multiple instruction commands.
  • the multiple instruction commands include at least an instruction to determine the range (offset and step) of the element A(ci) of the input vector A held in the first memory 1, an instruction to determine the range (offset and step) of the element W(ci, di) of the weight matrix W held in the weight memory 41, an instruction to determine the storage position (offset and step) in the second memory 2 where the product-sum operation result O(di) is stored, and an instruction to determine the number of repetitions (filter size) of the 1x1 convolution operation. A sketch of such a command set is given after this list.
  • in this way, the total number of instruction commands can be reduced, and the increase in the usage of the external memory 120 and the pressure on the bandwidth of the external bus EB can be mitigated.
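The sketch referenced above shows how such a multi-operation command set could be encoded; every field name here is hypothetical and only mirrors the offset/step ranges and repetition count named in the description.

```python
from dataclasses import dataclass

@dataclass
class ConvCommand:
    a_offset: int   # range start of element A(ci) in the first memory 1
    a_step: int     # range step of element A(ci)
    w_offset: int   # range start of element W(ci, di) in the weight memory 41
    w_step: int     # range step of element W(ci, di)
    o_offset: int   # storage start of O(di) in the second memory 2
    o_step: int     # storage step of O(di)
    repeat: int     # number of 1x1 convolution repetitions (filter size)

# e.g. one command describing a 3x3 filter as nine repeated 1x1 convolutions
cmd = ConvCommand(a_offset=0, a_step=1, w_offset=0, w_step=1,
                  o_offset=0, o_step=1, repeat=9)
```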
  • in the above embodiment, the first memory 1 and the second memory 2 are separate memories, but the aspects of the first memory 1 and the second memory 2 are not limited to this.
  • the first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.
  • the data input to the NN circuit 100 described in the above embodiment is not limited to a single format, and can be composed of still images, moving images, sounds, characters, numbers, and combinations of these.
  • the data input to the NN circuit 100 is not limited to these formats, and may be the measurement results of physical quantity measuring instruments that may be mounted on the edge device in which the NN circuit 100 is provided, such as optical sensors, thermometers, Global Positioning System (GPS) measuring instruments, angular velocity measuring instruments, and anemometers.
  • the input data may also be combined with different information received from peripheral devices via wired or wireless communication, such as base station information, information on vehicles and ships, weather information, congestion information and other peripheral information, financial information, and personal information.
  • the edge device in which the NN circuit 100 is provided is assumed to be a communication device such as a battery-powered mobile phone, a smart device such as a personal computer, a digital camera, a game device, a robot product, or another mobile device, but is not limited thereto. The NN circuit 100 can be used in products with strong demands for limiting the peak power that can be supplied by Power over Ethernet (PoE), reducing product heat generation, or operating for long periods, to obtain effects not seen in other prior art. For example, by applying the circuit to an in-vehicle camera mounted on a vehicle or ship, or a surveillance camera installed in a public facility or on the road, it is possible to realize long-term shooting, and the circuit also contributes to weight reduction and high durability. In addition, the circuit can be applied to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing sites and construction sites to obtain similar effects.
  • the NN circuit 100 may be realized in part or in whole by using one or more processors.
  • the NN circuit 100 may realize in part or in whole the input layer or output layer by software processing by a processor.
  • a part of the input layer or output layer realized by software processing is, for example, data normalization or conversion. This makes it possible to support various input formats or output formats.
  • the software executed by the processor may be configured to be rewritable using communication means or external media.
  • the NN circuit 100 may realize a part of the processing in the CNN 200 in combination with a graphics processing unit (GPU) or the like on the cloud.
  • the NN circuit 100 can realize more complex processing with fewer resources by performing further processing on the cloud in addition to the processing performed on the edge device in which the NN circuit 100 is provided, or by performing processing on the edge device in addition to the processing on the cloud. With such a configuration, the NN circuit 100 can distribute the processing and thereby reduce the amount of communication between the edge device and the cloud.
  • the present invention can be applied to neural network calculations.
100 Neural network circuit
10 Neural network calculation core (NN calculation core)
10A First neural network calculation core (first NN calculation core)
10B Second neural network calculation core (second NN calculation core)
10M Neural network calculation multi-core (NN calculation multi-core)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
7 IFU


Abstract

This neural network circuit comprises a convolution operation circuit that performs a convolution operation on input data. The convolution operation circuit includes an instruction decompressor that restores a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit.

Description

Neural network circuit and neural network operation method

The present invention relates to a neural network circuit and a neural network operation method. This application claims priority based on Japanese Patent Application No. 2022-186308, filed in Japan on November 22, 2022, the contents of which are incorporated herein by reference.

In recent years, convolutional neural networks (CNNs) have been used as models for image recognition and other purposes. A convolutional neural network has a multi-layer structure with convolution layers and pooling layers, and requires a large number of operations, including convolution operations. Various calculation methods have been devised to speed up the operations of convolutional neural networks (e.g., Patent Document 1).

JP 2018-077829 A

On the other hand, it is desired to realize image recognition and the like using convolutional neural networks also in embedded devices such as IoT devices. It is difficult to incorporate large-scale dedicated circuits such as those described in Patent Document 1 into embedded devices. In embedded devices with limited hardware resources such as CPUs and memory, it is also difficult to achieve sufficient computing performance for a convolutional neural network by software alone.

In light of the above circumstances, an object of the present invention is to provide a high-performance neural network circuit and neural network operation method that can be incorporated into embedded devices such as IoT devices.

In order to solve the above problems, the present invention proposes the following means.

A neural network circuit according to a first aspect of the present invention includes a convolution operation circuit that performs a convolution operation on input data, and the convolution operation circuit has an instruction decompressor that restores a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit.

A neural network operation method according to a second aspect of the present invention is a control method for a neural network circuit including: a convolution operation circuit that performs a convolution operation on input data; a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit; and an instruction fetch unit that reads from a memory a compressed instruction command obtained by compressing an instruction command for the convolution operation circuit that operates the convolution operation circuit and a compressed instruction command obtained by compressing an instruction command for the quantization operation circuit that operates the quantization operation circuit. The method includes the steps of: causing the instruction fetch unit to read the instruction command for the convolution operation circuit and the instruction command for the quantization operation circuit separately from the memory and to supply the instruction commands separately to the convolution operation circuit and the quantization operation circuit; causing the convolution operation circuit and the quantization operation circuit to restore the instruction commands from the compressed instruction commands; and operating the convolution operation circuit and the quantization operation circuit in parallel based on the restored instruction commands.

The neural network circuit and the neural network operation method of the present invention are high-performance and can be incorporated into embedded devices such as IoT devices.
FIG. 1 is a diagram showing a convolutional neural network.
FIG. 2 is a diagram explaining the convolution operation performed by a convolution layer.
FIG. 3 is a diagram explaining the data expansion of the convolution operation.
FIG. 4 is a diagram showing the overall configuration of the neural network circuit according to the first embodiment.
FIG. 5 is a diagram showing the overall configuration of an NN calculation core.
FIG. 6 is a timing chart showing an operation example of the NN calculation core.
FIG. 7 is a timing chart showing another operation example of the NN calculation core.
FIG. 8 is a diagram showing an NN calculation multi-core.
FIG. 9 is an internal block diagram of the DMAC of the neural network circuit.
FIG. 10 is a state transition diagram of a control circuit of the DMAC.
FIG. 11 is an internal block diagram of a convolution operation circuit of the neural network circuit.
FIG. 12 is an internal block diagram of a multiplier of the convolution operation circuit.
FIG. 13 is an internal block diagram of a multiply-accumulate unit of the multiplier.
FIG. 14 is an internal block diagram of an accumulator circuit of the convolution operation circuit.
FIG. 15 is an internal block diagram of an accumulator unit of the accumulator circuit.
FIG. 16 is an internal block diagram of an instruction decompressor of the convolution operation circuit.
FIG. 17 is a diagram showing an example of a compressed instruction command restored by the instruction decompressor.
FIG. 18 is a diagram showing a modified example of the instruction decompressor.
FIG. 19 is an internal block diagram of a quantization operation circuit of the neural network circuit.
FIG. 20 is an internal block diagram of a vector operation circuit and a quantization circuit of the quantization operation circuit.
FIG. 21 is a block diagram of an operation unit.
FIG. 22 is an internal block diagram of a vector quantization unit of the quantization circuit.
(First Embodiment)
A first embodiment of the present invention will be described with reference to FIGS. 1 to 22.
FIG. 1 is a diagram showing a convolutional neural network 200 (hereinafter referred to as "CNN 200"). The calculations performed by the neural network circuit 100 (hereinafter referred to as "NN circuit 100") according to the first embodiment are at least a part of the trained CNN 200 used during inference.
[CNN200]
The CNN 200 is a network with a multi-layer structure including a convolution layer 210 that performs a convolution operation, a quantization operation layer 220 that performs a quantization operation, and an output layer 230. In at least a part of the CNN 200, the convolution layers 210 and the quantization operation layers 220 are alternately connected. The CNN 200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers having other functions, such as a fully connected layer.
FIG. 2 is a diagram explaining the convolution operation performed by the convolution layer 210.
The convolution layer 210 performs a convolution operation on the input data a using the weights w. The convolution layer 210 performs a multiply-accumulate operation with the input data a and the weights w as inputs.
The input data a to the convolution layer 210 (also called activation data or a feature map) is multidimensional data such as image data. In this embodiment, the input data a is a three-dimensional tensor consisting of elements (x, y, c). The convolution layer 210 of the CNN 200 performs the convolution operation on low-bit input data a. In this embodiment, the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3). The elements of the input data a may be, for example, 4-bit or 8-bit unsigned integers.
If the input data input to the CNN 200 differs in format from the input data a to the convolution layer 210, for example a 32-bit floating-point type, the CNN 200 may further have an input layer that performs type conversion and quantization before the convolution layer 210.
The weights w of the convolution layer 210 (also called filters or kernels) are multidimensional data having elements that are learnable parameters. In this embodiment, the weights w are a four-dimensional tensor consisting of elements (i, j, c, d). The weights w have d three-dimensional tensors (hereinafter referred to as "weights wo") consisting of elements (i, j, c). The weights w in the trained CNN 200 are trained data. The convolution layer 210 of the CNN 200 performs the convolution operation using low-bit weights w. In this embodiment, the elements of the weights w are 1-bit signed integers (0, 1), where the value "0" represents +1 and the value "1" represents -1.
The convolution layer 210 performs the convolution operation shown in Equation 1 and outputs the output data f. In Equation 1, s denotes the stride. The area indicated by the dotted line in FIG. 2 shows one of the areas ao (hereinafter referred to as the "application area ao") in which the weight wo is applied to the input data a. The elements of the application area ao are represented by (x+i, y+j, c).
f(x, y, d) = \sum_{i} \sum_{j} \sum_{c} a(x \cdot s + i, y \cdot s + j, c) \cdot w(i, j, c, d)   (Equation 1)
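As an illustration of Equation 1, the following is a minimal Python sketch of the direct convolution described above. The tensor shapes, zero-based indexing, and the absence of padding are assumptions made here for brevity; they are not taken from the specification.

```python
import numpy as np

def conv_eq1(a, w, s=1):
    """Direct convolution of Equation 1.
    a: input data, shape (X, Y, C), 2-bit unsigned values (0..3)
    w: weights, shape (K, K, C, D), entries +1 or -1
    s: stride"""
    X, Y, C = a.shape
    K, _, _, D = w.shape
    Xo, Yo = (X - K) // s + 1, (Y - K) // s + 1
    f = np.zeros((Xo, Yo, D), dtype=np.int32)
    for x in range(Xo):
        for y in range(Yo):
            ao = a[x*s:x*s+K, y*s:y*s+K, :]   # application area ao
            for d in range(D):
                f[x, y, d] = np.sum(ao * w[:, :, :, d])
    return f
```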
The quantization operation layer 220 performs quantization and the like on the output of the convolution operation output by the convolution layer 210. The quantization operation layer 220 has a pooling layer 221, a batch normalization layer 222, an activation function layer 223, and a quantization layer 224.
The pooling layer 221 performs operations such as average pooling (Equation 2) and max pooling (Equation 3) on the output data f of the convolution operation output by the convolution layer 210, thereby compressing the output data f of the convolution layer 210. In Equations 2 and 3, u denotes the input tensor, v denotes the output tensor, and T denotes the size of the pooling region. In Equation 3, max is a function that outputs the maximum value of u over the combinations of i and j included in T.
v(x, y, c) = \frac{1}{T^2} \sum_{i=0}^{T-1} \sum_{j=0}^{T-1} u(T \cdot x + i, T \cdot y + j, c)   (Equation 2)
v(x, y, c) = \max_{i, j \in T} u(T \cdot x + i, T \cdot y + j, c)   (Equation 3)
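A minimal sketch of the pooling operations of Equations 2 and 3, assuming non-overlapping T×T regions and input dimensions divisible by T (assumptions made here, not stated in the specification):

```python
import numpy as np

def average_pooling(u, T):
    """Equation 2: average pooling over non-overlapping T x T regions."""
    X, Y, C = u.shape
    v = np.zeros((X // T, Y // T, C))
    for x in range(X // T):
        for y in range(Y // T):
            v[x, y, :] = np.mean(u[T*x:T*x+T, T*y:T*y+T, :], axis=(0, 1))
    return v

def max_pooling(u, T):
    """Equation 3: max pooling over non-overlapping T x T regions."""
    X, Y, C = u.shape
    v = np.zeros((X // T, Y // T, C))
    for x in range(X // T):
        for y in range(Y // T):
            v[x, y, :] = np.max(u[T*x:T*x+T, T*y:T*y+T, :], axis=(0, 1))
    return v
```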
The batch normalization layer 222 normalizes the data distribution of the output data of the quantization operation layer 220 and the pooling layer 221 by, for example, an operation as shown in Equation 4. In Equation 4, u denotes the input tensor, v denotes the output tensor, α denotes a scale, and β denotes a bias. In the trained CNN 200, α and β are trained constant vectors.
v(x, y, c) = \alpha(c) \cdot \left( u(x, y, c) - \beta(c) \right)   (Equation 4)
The activation function layer 223 performs an activation function operation such as ReLU (Equation 5) on the output of the quantization operation layer 220, the pooling layer 221, or the batch normalization layer 222. In Equation 5, u is the input tensor and v is the output tensor. In Equation 5, max is a function that outputs the largest of its arguments.
v(x, y, c) = \max(0, u(x, y, c))   (Equation 5)
The quantization layer 224 quantizes the output of the pooling layer 221 or the activation function layer 223 based on quantization parameters, for example as shown in Equation 6. The quantization shown in Equation 6 reduces the input tensor u to 2 bits. In Equation 6, q(c) is a vector of quantization parameters. In the trained CNN 200, q(c) is a trained constant vector. The inequality sign "≤" in Equation 6 may be "<".
v(x, y, c) = \begin{cases} 0 & (u(x, y, c) \le q_0(c)) \\ 1 & (q_0(c) < u(x, y, c) \le q_1(c)) \\ 2 & (q_1(c) < u(x, y, c) \le q_2(c)) \\ 3 & (q_2(c) < u(x, y, c)) \end{cases}   (Equation 6)
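The following sketch chains the operations of Equations 4 to 6 in the order the quantization operation layer 220 describes them. The reconstructed form of Equation 4 and the per-channel threshold layout q(c) = (q0, q1, q2) are assumptions made here, not definitions taken from the specification:

```python
import numpy as np

def batch_norm(u, alpha, beta):
    """Equation 4 as reconstructed above: per-channel scale and shift."""
    return alpha * (u - beta)          # alpha, beta have shape (C,) and broadcast

def relu(u):
    """Equation 5: element-wise max(0, u)."""
    return np.maximum(0, u)

def quantize_2bit(u, q):
    """Equation 6: reduce u to 2 bits with per-channel thresholds.
    q: array of shape (C, 3) holding assumed thresholds q0 <= q1 <= q2."""
    v = np.zeros_like(u, dtype=np.uint8)
    v += (u > q[:, 0])                 # above q0 -> at least 1
    v += (u > q[:, 1])                 # above q1 -> at least 2
    v += (u > q[:, 2])                 # above q2 -> 3
    return v
```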
The output layer 230 is a layer that outputs the results of the CNN 200 by an identity function, a softmax function, or the like. The layer preceding the output layer 230 may be the convolution layer 210 or the quantization operation layer 220.
In the CNN 200, the quantized output data of the quantization layer 224 is input to the convolution layer 210, so the load of the convolution operation in the convolution layer 210 is small compared with other convolutional neural networks that do not perform quantization.
[Dividing the convolution operation]
The NN circuit 100 divides the input data of the convolution operation (Equation 1) of the convolution layer 210 into partial tensors for the operation. The method of division into partial tensors and the number of divisions are not particularly limited. A partial tensor is formed, for example, by dividing the input data a(x+i, y+j, c) into a(x+i, y+j, co). The NN circuit 100 can also perform the convolution operation (Equation 1) of the convolution layer 210 without dividing the input data.
In the input data division of the convolution operation, the variable c in Equation 1 is divided into blocks of size Bc, as shown in Equation 7. The variable d in Equation 1 is divided into blocks of size Bd, as shown in Equation 8. In Equation 7, co is an offset and ci is an index from 0 to (Bc-1). In Equation 8, do is an offset and di is an index from 0 to (Bd-1). The size Bc and the size Bd may be the same.
c = co \cdot Bc + ci   (Equation 7)
d = do \cdot Bd + di   (Equation 8)
The input data a(x+i, y+j, c) in Equation 1 is divided in the c-axis direction by the size Bc and is represented by the divided input data a(x+i, y+j, co). In the following description, the divided input data a is also referred to as "divided input data a".
The weight w(i, j, c, d) in Equation 1 is divided by the size Bc in the c-axis direction and the size Bd in the d-axis direction and is represented by the divided weight w(i, j, co, do). In the following description, the divided weight w is also referred to as the "divided weight w".
The output data f(x, y, do) divided by the size Bd is obtained by Equation 9. The final output data f(x, y, d) can be calculated by combining the divided output data f(x, y, do).
f(x, y, do) = \sum_{i} \sum_{j} \sum_{co} a(x \cdot s + i, y \cdot s + j, co) \cdot w(i, j, co, do)   (Equation 9)
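A minimal sketch of the blocked computation of Equations 7 to 9, assuming C and D are divisible by Bc and Bd (an assumption made here for brevity). For such inputs it returns the same result as the unblocked direct convolution sketched earlier:

```python
import numpy as np

Bc, Bd = 4, 4                          # block sizes for the c and d axes

def conv_blocked(a, w, s=1):
    """Blocked convolution: c = co*Bc + ci and d = do*Bd + di
    (Equations 7 and 8); partial outputs f(x, y, do) per d-block
    (Equation 9) are accumulated over the c-blocks."""
    X, Y, C = a.shape
    K, _, _, D = w.shape
    Xo, Yo = (X - K) // s + 1, (Y - K) // s + 1
    f = np.zeros((Xo, Yo, D), dtype=np.int32)
    for do in range(D // Bd):
        for co in range(C // Bc):
            ab = a[:, :, co*Bc:(co+1)*Bc]                    # divided input data
            wb = w[:, :, co*Bc:(co+1)*Bc, do*Bd:(do+1)*Bd]   # divided weight
            for x in range(Xo):
                for y in range(Yo):
                    ao = ab[x*s:x*s+K, y*s:y*s+K, :]
                    f[x, y, do*Bd:(do+1)*Bd] += np.einsum('ijc,ijcd->d', ao, wb)
    return f
```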
[Expanding data for convolution operations]
The NN circuit 100 performs the convolution operation by expanding the input data a and the weights w in the convolution operation of the convolution layer 210.
FIG. 3 is a diagram explaining the data expansion of the convolution operation.
The divided input data a(x+i, y+j, co) is expanded into vector data having Bc elements. The elements of the divided input data a are indexed by ci (0 ≤ ci < Bc). In the following description, the divided input data a expanded into vector data for each i and j is also referred to as the "input vector A". The input vector A has as elements the divided input data a(x+i, y+j, co×Bc) to the divided input data a(x+i, y+j, co×Bc+(Bc-1)).
The divided weight w(i, j, co, do) is expanded into matrix data having Bc×Bd elements. The elements of the divided weight w expanded into the matrix data are indexed by ci and di (0 ≤ di < Bd). In the following description, the divided weight w expanded into matrix data for each i and j is also referred to as the "weight matrix W". The weight matrix W has as elements the divided weight w(i, j, co×Bc, do×Bd) to the divided weight w(i, j, co×Bc+(Bc-1), do×Bd+(Bd-1)).
Vector data is calculated by multiplying the input vector A by the weight matrix W. The output data f(x, y, do) can be obtained by shaping the vector data calculated for each i, j, and co into a three-dimensional tensor. By expanding the data in this way, the convolution operation of the convolution layer 210 can be performed as a multiplication of vector data and matrix data.
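A sketch of this expansion: for each i, j, and co, the Bc-element input vector A is multiplied by the Bc×Bd weight matrix W, and the Bd partial sums are accumulated into the output. The shapes and the loop order are illustrative assumptions, not taken from the specification:

```python
import numpy as np

def conv_as_vector_matrix(a, w, Bc, Bd, s=1):
    """Convolution as repeated input-vector x weight-matrix products."""
    X, Y, C = a.shape
    K, _, _, D = w.shape
    Xo, Yo = (X - K) // s + 1, (Y - K) // s + 1
    f = np.zeros((Xo, Yo, D), dtype=np.int32)
    for x in range(Xo):
        for y in range(Yo):
            for i in range(K):
                for j in range(K):
                    for co in range(C // Bc):
                        A = a[x*s+i, y*s+j, co*Bc:(co+1)*Bc]       # input vector A
                        for do in range(D // Bd):
                            W = w[i, j, co*Bc:(co+1)*Bc, do*Bd:(do+1)*Bd]  # weight matrix W
                            f[x, y, do*Bd:(do+1)*Bd] += A @ W      # Bd partial sums O(di)
    return f
```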
[NN circuit 100]
FIG. 4 is a diagram showing the overall configuration of the NN circuit 100 according to this embodiment.
The NN circuit 100 includes a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a controller 6, an IFU 7, and at least one neural network calculation core 10 (hereinafter also referred to as "NN calculation core 10").
The NN circuit 100 can implement a plurality of NN calculation cores 10. The NN circuit 100 illustrated in FIG. 4 can implement up to four NN calculation cores 10. The plurality of NN calculation cores 10 constitute a "neural network calculation multi-core 10M" (hereinafter also referred to as "NN calculation multi-core 10M") that cooperatively executes at least a part of the calculations of the CNN 200. In this embodiment, the plurality of NN calculation cores 10 are daisy-chained. The number of NN calculation cores 10 that can be implemented in the NN circuit 100 may be five or more.
The DMAC 3 is connected to the external bus EB and transfers data between an external memory 120 such as a DRAM and the NN calculation cores 10. The DMAC 3 transfers data read from the external memory 120 to one of the plurality of NN calculation cores 10. The DMAC 3 may be capable of transferring, or of broadcasting, the same data read from the external memory 120 to a plurality of NN calculation cores 10.
The controller 6 is connected to the external bus EB and operates as a slave of the external host CPU 110. The controller 6 has a bus bridge 60 and a register 61.
The bus bridge 60 relays bus accesses from the external bus EB to the internal bus IB. The bus bridge 60 also relays write requests and read requests from the external host CPU 110 to the register 61.
The register 61 has parameter registers and status registers. The parameter registers are registers that control the operation of the NN circuit 100. The status registers are registers that indicate the status of the NN circuit 100, including the instruction sequence pointer and the number of instructions of each module. The status registers may be configured to include a semaphore S. The external host CPU 110 can access the register 61 via the bus bridge 60 of the controller 6.
The controller 6 is connected to each block of the NN circuit 100 (the DMAC 3, the IFU 7, and the NN calculation cores 10) via the internal bus IB. The external host CPU 110 can access each block of the NN circuit 100 via the controller 6. For example, the external host CPU 110 can issue instructions to the NN calculation cores 10 via the controller 6. Each block can also update the status registers (which may include the semaphore S) of the controller 6 via the internal bus IB. The status registers may be configured to be updated via dedicated wiring connected to each block.
The IFU (instruction fetch unit) 7 reads instruction commands for the blocks of the NN circuit 100 (the DMAC 3 and the NN calculation cores 10) from the external memory 120 via the external bus EB based on instructions from the external host CPU 110. The IFU 7 then transfers the read instruction commands to the corresponding blocks of the NN circuit 100 (the DMAC 3 and the NN calculation cores 10). In this embodiment, the instruction commands are stored in the external memory 120 in a compressed state (hereinafter also referred to as "compressed instruction commands"). The IFU 7 reads the compressed instruction commands.
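A rough sketch of this flow is shown below. The actual compression format of the instruction commands is not described at this point in the specification, so zlib and the command strings here are stand-ins chosen only to illustrate fetching compressed streams per block and restoring them on the receiving side:

```python
import zlib

# Hypothetical stand-in: per-block compressed command streams in external memory.
external_memory = {
    "dmac":     zlib.compress(b"LOAD a -> mem1; LOAD w -> wmem"),
    "conv":     zlib.compress(b"CONV layer1; CONV layer3"),
    "quantize": zlib.compress(b"QUANT layer2; QUANT layer4"),
}

def ifu_fetch(block_name):
    """IFU role: read the compressed instruction commands for one block."""
    return external_memory[block_name]

def decompress_commands(compressed):
    """Decompressor role: restore the original instruction command stream."""
    return zlib.decompress(compressed).decode().split("; ")

for block in ("dmac", "conv", "quantize"):
    print(block, decompress_commands(ifu_fetch(block)))
```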
[NN calculation core 10]
FIG. 5 is a diagram showing the overall configuration of the NN calculation core 10.
The NN calculation core 10 includes a first memory 1, a second memory 2, a convolution operation circuit 4, and a quantization operation circuit 5. The NN calculation core 10 is characterized in that the convolution operation circuit 4 and the quantization operation circuit 5 form a loop via the first memory 1 and the second memory 2.
The first memory 1 is a rewritable memory, such as a volatile memory composed of, for example, an SRAM (static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the internal bus IB. The external host CPU 110 can input data to and output data from the NN calculation core 10 by writing and reading data to and from the first memory 1.
The first memory 1 is connected to an input port of the convolution operation circuit 4, and the convolution operation circuit 4 can read data from the first memory 1. The first memory 1 is also loop-connected (C1) to an output port of the quantization operation circuit 5, and the quantization operation circuit 5 can write data to the first memory 1. Furthermore, the first memory 1 can receive data transfers over an inter-core connection (C2) with another NN calculation core 10, and the other NN calculation core 10 connected by the inter-core connection (C2) can write data to the first memory 1. In this embodiment, a daisy-chain connection is used as an example of the inter-core connection (C2).
The second memory 2 is a rewritable memory, such as a volatile memory composed of, for example, an SRAM (static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the internal bus IB. The external host CPU 110 can input data to and output data from the NN calculation core 10 by writing and reading data to and from the second memory 2.
The second memory 2 is connected to an input port of the quantization operation circuit 5, and the quantization operation circuit 5 can read data from the second memory 2. The second memory 2 is also connected to an output port of the convolution operation circuit 4, and the convolution operation circuit 4 can write data to the second memory 2.
The convolution operation circuit 4 is a circuit that performs the convolution operation in the convolution layer 210 of the trained CNN 200. The convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs the convolution operation on the input data a. The convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter also referred to as "convolution operation output data") to the second memory 2.
The quantization operation circuit 5 is a circuit that performs at least a part of the quantization operation in the quantization operation layer 220 of the trained CNN 200. The quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2 and performs the quantization operation (an operation including at least quantization among pooling, batch normalization, the activation function, and quantization) on the output data f of the convolution operation.
The quantization operation circuit 5 writes the output data of the quantization operation (hereinafter also referred to as "quantization operation output data") to the loop-connected (C1) first memory 1. The quantization operation circuit 5 can also transfer data to another NN calculation core 10 via the inter-core connection (C2), and can output the quantization operation output data to another NN calculation core 10 connected by the inter-core connection (C2).
Since the NN calculation core 10 has the first memory 1, the second memory 2, and the like, the number of transfers of duplicated data in data transfers by the DMAC 3 from an external memory such as a DRAM can be reduced. This makes it possible to significantly reduce the power consumption and the processing load caused by memory accesses.
[Operation Example 1 of the NN calculation core 10]
FIG. 6 is a timing chart showing an operation example of the NN calculation core 10.
The DMAC 3 stores the input data a of layer 1 in the first memory 1. The DMAC 3 may divide the input data a of layer 1 and transfer it to the first memory 1 in accordance with the order of the convolution operations performed by the convolution operation circuit 4.
The convolution operation circuit 4 reads the input data a of layer 1 stored in the first memory 1 and performs the layer-1 convolution operation shown in FIG. 1 on it. The output data f of the layer-1 convolution operation is stored in the second memory 2.
The quantization operation circuit 5 reads the layer-1 output data f stored in the second memory 2 and performs the layer-2 quantization operation on it. The output data of the layer-2 quantization operation is stored in the first memory 1.
The convolution operation circuit 4 reads the output data of the layer-2 quantization operation stored in the first memory 1 and performs the layer-3 convolution operation using it as the input data a. The output data f of the layer-3 convolution operation is stored in the second memory 2.
The convolution operation circuit 4 reads the output data of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1 and performs the convolution operation of layer 2M-1 using it as the input data a. The output data f of the convolution operation of layer 2M-1 is stored in the second memory 2.
The quantization operation circuit 5 reads the output data f of layer 2M-1 stored in the second memory 2 and performs the quantization operation of layer 2M on it. The output data of the quantization operation of layer 2M is stored in the first memory 1.
The convolution operation circuit 4 reads the output data of the quantization operation of layer 2M stored in the first memory 1 and performs the convolution operation of layer 2M+1 using it as the input data a. The output data f of the convolution operation of layer 2M+1 is stored in the second memory 2.
The convolution operation circuit 4 and the quantization operation circuit 5 perform their operations alternately, advancing the computation of the CNN 200 shown in FIG. 1. In the NN calculation core 10, the convolution operation circuit 4 performs the convolution operations of layer 2M-1 and layer 2M+1 by time division, and the quantization operation circuit 5 performs the quantization operations of layer 2M-2 and layer 2M by time division. Therefore, the circuit scale of the NN calculation core 10 is significantly smaller than in a case where separate convolution operation circuits 4 and quantization operation circuits 5 are implemented for each layer.
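A minimal sketch of this alternation, with the convolution and quantization operations modeled as placeholder functions and the two memories as variables that are ping-ponged between layers:

```python
def run_layers(input_data, num_layers, conv, quantize):
    """Alternate the two circuits as in FIG. 6: odd layers are convolution
    (first memory -> second memory), even layers are quantization
    (second memory -> first memory)."""
    first_memory = input_data      # holds input data a
    second_memory = None           # holds convolution output data f
    for layer in range(1, num_layers + 1):
        if layer % 2 == 1:
            second_memory = conv(first_memory)
        else:
            first_memory = quantize(second_memory)
    return second_memory if num_layers % 2 == 1 else first_memory

# Placeholder operations standing in for the hardware circuits:
result = run_layers("a", 4, conv=lambda a: f"conv({a})",
                    quantize=lambda f: f"quant({f})")
print(result)                      # quant(conv(quant(conv(a))))
```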
The NN calculation core 10 computes the CNN 200, which has a multi-layer structure, with circuits formed in a loop. This loop-shaped circuit configuration allows the NN calculation core 10 to use hardware resources efficiently. Because the NN calculation core 10 forms its circuits in a loop, the parameters of the convolution operation circuit 4 and the quantization operation circuit 5, which change from layer to layer, are updated as appropriate.
If the computation of the CNN 200 includes operations that the NN calculation core 10 cannot perform, the NN calculation core 10 transfers intermediate data to an external computing device such as the external host CPU 110. After the external computing device performs operations on the intermediate data, the results are input to the first memory 1 or the second memory 2, and the NN calculation core 10 resumes computation on them.
[Operation Example 2 of the NN calculation core 10]
FIG. 7 is a timing chart showing another operation example of the NN calculation core 10.
The NN calculation core 10 may divide the input data a into partial tensors and operate on the partial tensors by time division. The method of division into partial tensors and the number of divisions are not particularly limited.
FIG. 7 shows an operation example in which the input data a is decomposed into two partial tensors, referred to as the "first partial tensor a1" and the "second partial tensor a2". For example, the convolution operation of layer 2M-1 is decomposed into the convolution operation corresponding to the first partial tensor a1 (denoted "layer 2M-1 (a1)" in FIG. 7) and the convolution operation corresponding to the second partial tensor a2 (denoted "layer 2M-1 (a2)" in FIG. 7).
The convolution operation and the quantization operation corresponding to the first partial tensor a1 and those corresponding to the second partial tensor a2 can be performed independently, as shown in FIG. 7.
The convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 (the operation denoted layer 2M-1 (a1) in FIG. 7). Thereafter, the convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 (the operation denoted layer 2M-1 (a2) in FIG. 7). Meanwhile, the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the first partial tensor a1 (the operation denoted layer 2M (a1) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of layer 2M corresponding to the first partial tensor a1 in parallel.
Next, the convolution operation circuit 4 performs the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 (the operation denoted layer 2M+1 (a1) in FIG. 7), while the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the second partial tensor a2 (the operation denoted layer 2M (a2) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of layer 2M+1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M corresponding to the second partial tensor a2 in parallel.
Because the operations corresponding to the first partial tensor a1 and those corresponding to the second partial tensor a2 are independent, the NN calculation core 10 may also, for example, perform the convolution operation of layer 2M-1 corresponding to the first partial tensor a1 and the quantization operation of layer 2M+2 corresponding to the second partial tensor a2 in parallel. That is, the convolution operation and the quantization operation that the NN calculation core 10 performs in parallel are not limited to operations of consecutive layers.
By dividing the input data a into partial tensors, the NN calculation core 10 can operate the convolution operation circuit 4 and the quantization operation circuit 5 in parallel. As a result, the time during which the convolution operation circuit 4 and the quantization operation circuit 5 wait is reduced, and the operation processing efficiency of the NN calculation core 10 is improved. Although the number of divisions in the operation example of FIG. 7 is two, the NN calculation core 10 can likewise operate the convolution operation circuit 4 and the quantization operation circuit 5 in parallel when the number of divisions is greater than two.
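A sketch of the FIG. 7 style schedule for one layer pair, modeling only the pairing of the two circuits per time slot; the actual control is performed by instruction commands, not by such a scheduler:

```python
def pipeline_schedule(num_parts, m):
    """Pairs the convolution circuit (layer 2M-1) and the quantization
    circuit (layer 2M) so that they work on different partial tensors
    in the same time slot, as in FIG. 7."""
    slots = []
    for t in range(num_parts + 1):
        conv_op = f"layer {2*m-1} conv (a{t+1})" if t < num_parts else "idle"
        quant_op = f"layer {2*m} quant (a{t})" if t > 0 else "idle"
        slots.append((conv_op, quant_op))
    return slots

for conv_op, quant_op in pipeline_schedule(num_parts=2, m=1):
    print(f"conv: {conv_op:22s} | quant: {quant_op}")
```

With two partial tensors this prints the overlap described above: while the quantization circuit processes layer 2M (a1), the convolution circuit already processes layer 2M-1 (a2).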
For example, when the input data a is divided into a "first partial tensor a1", a "second partial tensor a2", and a "third partial tensor a3", the NN calculation core 10 may perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a2 and the quantization operation of layer 2M corresponding to the third partial tensor a3 in parallel. The order of the operations is changed as appropriate depending on how the input data a is stored in the first memory 1 and the second memory 2.
As a method of operating on the partial tensors, the example described above (method 1) performs the operations on the partial tensors of one layer in the convolution operation circuit 4 or the quantization operation circuit 5 before performing the operations on the partial tensors of the next layer. For example, as shown in FIG. 7, the convolution operation circuit 4 performs the convolution operations of layer 2M-1 corresponding to the first partial tensor a1 and the second partial tensor a2 (the operations denoted layer 2M-1 (a1) and layer 2M-1 (a2) in FIG. 7), and then performs the convolution operations of layer 2M+1 corresponding to the first partial tensor a1 and the second partial tensor a2 (the operations denoted layer 2M+1 (a1) and layer 2M+1 (a2) in FIG. 7).
However, the method of operating on the partial tensors is not limited to this. The operations may instead be performed on some of the partial tensors across a plurality of layers before operating on the remaining partial tensors (method 2). For example, the convolution operation circuit 4 may perform the convolution operations of layer 2M-1 and layer 2M+1 corresponding to the first partial tensor a1, and then perform the convolution operations of layer 2M-1 and layer 2M+1 corresponding to the second partial tensor a2.
The method of operating on the partial tensors may also combine method 1 and method 2. When method 2 is used, however, the operations must be performed in accordance with the dependency relations concerning the operation order of the partial tensors.
[NN calculation multi-core 10M]
FIG. 8 is a diagram showing the NN calculation multi-core 10M.
The NN calculation multi-core 10M illustrated in FIG. 8 includes two daisy-chained NN calculation cores 10. When the two NN calculation cores 10 are distinguished, they are referred to as the "first NN calculation core 10A" and the "second NN calculation core 10B". In FIG. 8, the first memory 1 is abbreviated as "A", the convolution operation circuit 4 as "C", the second memory 2 as "F", and the quantization operation circuit 5 as "Q".
Specifically, the quantization operation circuit 5 of the first NN calculation core 10A and the first memory 1 of the second NN calculation core 10B are daisy-chain connected (C2). The quantization operation circuit 5 of the first NN calculation core 10A can write the quantization operation output data to the loop-connected (C1) first memory 1 of the first NN calculation core 10A and/or to the daisy-chain connected (C2) first memory 1 of the second NN calculation core 10B.
Likewise, the quantization operation circuit 5 of the second NN calculation core 10B and the first memory 1 of the first NN calculation core 10A are daisy-chain connected (C2). The quantization operation circuit 5 of the second NN calculation core 10B can write the quantization operation output data to the loop-connected (C1) first memory 1 of the second NN calculation core 10B and/or to the daisy-chain connected (C2) first memory 1 of the first NN calculation core 10A.
When the NN calculation multi-core 10M includes three or more NN calculation cores 10, the plurality of NN calculation cores 10 are likewise daisy-chained. The quantization operation circuit 5 of each NN calculation core 10 other than the final-stage NN calculation core 10 is daisy-chain connected (C2) to the first memory 1 of the following NN calculation core 10. The quantization operation circuit 5 of the final-stage NN calculation core 10 is daisy-chain connected (C2) to the first memory 1 of the first-stage NN calculation core 10. The plurality of NN calculation cores 10 are thus formed in a daisy-chain loop.
In one NN calculation core 10, the first memory (A) 1, the convolution operation circuit (C) 4, the second memory (F) 2, and the quantization operation circuit (Q) 5 are connected in a loop. In the NN calculation multi-core 10M, these elements are connected in a daisy-chain loop so that the first memory (A) 1, the convolution operation circuit (C) 4, the second memory (F) 2, and the quantization operation circuit (Q) 5 are arranged repeatedly in the same order.
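A minimal sketch of this daisy-chain loop with two cores, using identity functions as stand-ins for the convolution and quantization operations:

```python
class Core:
    """One NN calculation core: first memory -> conv -> second memory ->
    quantization, with the quantization output routed either back to this
    core's first memory (loop C1) or to the next core's first memory
    (daisy chain C2)."""
    def __init__(self, name, conv, quantize):
        self.name, self.conv, self.quantize = name, conv, quantize
        self.first_memory = None
        self.next_core = None              # daisy-chain neighbor

    def step(self, to_next=False):
        second_memory = self.conv(self.first_memory)
        out = self.quantize(second_memory)
        target = self.next_core if (to_next and self.next_core) else self
        target.first_memory = out          # C2 when to_next, otherwise C1

# Two cores in a daisy-chain loop, as in FIG. 8 (identity ops as stand-ins).
core_a = Core("10A", conv=lambda x: x, quantize=lambda x: x)
core_b = Core("10B", conv=lambda x: x, quantize=lambda x: x)
core_a.next_core, core_b.next_core = core_b, core_a

core_a.first_memory = "layer-1 input"
core_a.step(to_next=True)                  # 10A's output feeds 10B's first memory
print(core_b.first_memory)
```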
The plurality of NN calculation cores 10 constituting the NN calculation multi-core 10M need not have the same hardware configuration. For example, the capacity and configuration of the first memory 1 of the first NN calculation core 10A may differ from those of the first memory 1 of the second NN calculation core 10B. Likewise, the configuration of the quantization operation circuit 5 of the first NN calculation core 10A may differ from that of the quantization operation circuit 5 of the second NN calculation core 10B.
Next, each component of the NN circuit 100 will be described in detail.
[DMAC3]
FIG. 9 is an internal block diagram of the DMAC 3.
The DMAC 3 has a data transfer circuit 31 and a state controller 32. Because the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, once an instruction command is input, it can perform DMA data transfer without requiring an external controller.
The data transfer circuit 31 is connected to the external bus EB and performs DMA data transfer between an external memory 120 such as a DRAM and the NN calculation cores 10. The number of DMA channels of the data transfer circuit 31 is not limited; for example, the first NN calculation core 10A and the second NN calculation core 10B may each have a dedicated DMA channel.
The state controller 32 controls the state of the data transfer circuit 31. The state controller 32 is also connected to the controller 6 via the internal bus IB. The state controller 32 has an instruction queue 33 and a control circuit 34.
The instruction queue 33 is a queue in which the instruction commands C3 for the DMAC 3 are stored, and is composed of, for example, a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the IFU 7 or the internal bus IB.
The control circuit 34 is a state machine that decodes the instruction commands C3 and sequentially controls the data transfer circuit 31 based on them. The control circuit 34 may be implemented as a logic circuit or as a CPU controlled by software.
FIG. 10 is a state transition diagram of the control circuit 34.
When an instruction command C3 is input to the instruction queue 33 (Not empty), the control circuit 34 transitions from the idle state ST1 to the decode state ST2.
In the decode state ST2, the control circuit 34 decodes the instruction command C3 output from the instruction queue 33. The control circuit 34 also reads the semaphore S stored in the register 61 of the controller 6 and determines whether the operation of the data transfer circuit 31 instructed by the instruction command C3 can be executed. If it cannot be executed (Not ready), the control circuit 34 waits (Wait) until it becomes executable. If it can be executed (ready), the control circuit 34 transitions from the decode state ST2 to the execution state ST3.
In the execution state ST3, the control circuit 34 controls the data transfer circuit 31 so that it performs the operation instructed by the instruction command C3. When the operation of the data transfer circuit 31 is finished, the control circuit 34 removes the executed instruction command C3 from the instruction queue 33 and updates the semaphore S stored in the register 61 of the controller 6. If there are instructions in the instruction queue 33 (Not empty), the control circuit 34 transitions from the execution state ST3 to the decode state ST2. If there are no instructions in the instruction queue 33 (empty), the control circuit 34 transitions from the execution state ST3 to the idle state ST1.
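A minimal sketch of this state machine in Python; the command representation and the semaphore check are placeholders (in the hardware, the semaphore S lives in the register 61 of the controller 6):

```python
from collections import deque

IDLE, DECODE, EXECUTE = "ST1_idle", "ST2_decode", "ST3_execute"

def run_dmac(instruction_queue, semaphore_ready, transfer):
    """FIG. 10: idle while the queue is empty, decode the next command and
    wait on the semaphore until it is executable, execute it, then remove
    it from the queue and decide the next state from the queue contents."""
    state = IDLE
    while True:
        if state == IDLE:
            if not instruction_queue:          # empty -> stay idle (stop here)
                return
            state = DECODE                     # Not empty -> ST2
        elif state == DECODE:
            command = instruction_queue[0]
            if not semaphore_ready(command):   # Not ready -> Wait
                continue
            state = EXECUTE                    # ready -> ST3
        elif state == EXECUTE:
            transfer(command)                  # instructed DMA transfer
            instruction_queue.popleft()        # remove finished command C3
            state = DECODE if instruction_queue else IDLE

run_dmac(deque(["C3: DRAM -> first memory"]), lambda c: True, print)
```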
[Convolution operation circuit 4]
 FIG. 11 is an internal block diagram of the convolution operation circuit 4.
 The convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, a state controller 44, and an instruction decompressor 49. Because it has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, the convolution operation circuit 4 can perform a convolution operation without an external controller once an instruction command is input.
 The weight memory 41 is a memory in which the weights w used in the convolution operation are stored; it is a rewritable memory, for example a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the weights w required for the convolution operation to the weight memory 41 by DMA transfer.
 FIG. 12 is an internal block diagram of the multiplier 42.
 The multiplier 42 multiplies the input vector A by the weight matrix W. As described above, the input vector A is vector data having Bc elements, in which the divided input data a(x+i, y+j, co) is expanded for each i and j. The weight matrix W is matrix data having Bc×Bd elements, in which the divided weights w(i, j, co, do) are expanded for each i and j. The multiplier 42 has Bc×Bd product-sum operation units 47 and can perform the multiplication of the input vector A by the weight matrix W in parallel.
 The multiplier 42 reads the input vector A and the weight matrix W required for the multiplication from the first memory 1 and the weight memory 41, and performs the multiplication. The multiplier 42 outputs Bd product-sum operation results O(di).
 FIG. 13 is an internal block diagram of the product-sum operation unit 47.
 The product-sum operation unit 47 multiplies an element A(ci) of the input vector A by an element W(ci, di) of the weight matrix W. The product-sum operation unit 47 also adds the multiplication result to the result S(ci, di) passed from another product-sum operation unit 47, and outputs the addition result S(ci+1, di). The element A(ci) is a 2-bit unsigned integer (0, 1, 2, 3). The element W(ci, di) is a 1-bit signed integer (0, 1), where the value "0" represents +1 and the value "1" represents -1.
 The product-sum operation unit 47 has an inverter 47a, a selector 47b, and an adder 47c. The product-sum operation unit 47 performs the multiplication using only the inverter 47a and the selector 47b, without using a multiplier. When the element W(ci, di) is "0", the selector 47b selects the element A(ci) as input. When the element W(ci, di) is "1", the selector 47b selects the complement of the element A(ci) produced by the inverter 47a. The element W(ci, di) is also input to the carry-in of the adder 47c. When the element W(ci, di) is "0", the adder 47c outputs the value obtained by adding the element A(ci) to S(ci, di). When W(ci, di) is "1", the adder 47c outputs the value obtained by subtracting the element A(ci) from S(ci, di).
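 In software terms, each unit adds either +A(ci) or -A(ci) to the incoming partial sum. The Python sketch below is an illustrative reading aid (the function name and argument layout are assumptions); it mirrors the hardware trick of realizing -A(ci) as the one's complement of A(ci) plus a carry-in of 1.

```python
def product_sum_unit(a: int, w_bit: int, s_in: int) -> int:
    """One product-sum operation unit 47.

    a     : 2-bit unsigned activation element A(ci), i.e. 0..3
    w_bit : 1-bit weight element W(ci, di); 0 encodes +1, 1 encodes -1
    s_in  : partial sum S(ci, di) from the neighboring unit
    Returns the partial sum S(ci+1, di).
    """
    assert 0 <= a <= 3 and w_bit in (0, 1)
    if w_bit == 0:
        return s_in + a           # selector 47b passes A(ci); carry-in is 0
    return s_in + (~a + 1)        # inverted A(ci) plus carry-in 1, i.e. s_in - a
```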
 FIG. 14 is an internal block diagram of the accumulator circuit 43.
 The accumulator circuit 43 accumulates the product-sum operation results O(di) of the multiplier 42 in the second memory 2. The accumulator circuit 43 has Bd accumulator units 48 and can accumulate the Bd product-sum operation results O(di) in the second memory 2 in parallel.
 FIG. 15 is an internal block diagram of the accumulator unit 48.
 The accumulator unit 48 has an adder 48a and a mask unit 48b. The adder 48a adds an element O(di) of the product-sum operation result O to the partial sum, stored in the second memory 2, that is an intermediate result of the convolution operation shown in Equation 1. The addition result is 16 bits per element; it is not limited to 16 bits per element, however, and may be, for example, 15 or 17 bits per element.
 The adder 48a writes the addition result to the same address in the second memory 2. When the initialization signal clear is asserted, the mask unit 48b masks the output from the second memory 2 so that zero is added to the element O(di). The initialization signal clear is asserted when no intermediate partial sum is stored in the second memory 2.
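 The per-element behavior of the accumulator unit 48 can be sketched as follows; the function name is an illustrative assumption, and the 16-bit wrap-around mirrors the element width used in this embodiment.

```python
def accumulator_unit(o_di: int, mem_word: int, clear: bool) -> int:
    """Add O(di) to the partial sum read from the second memory 2."""
    partial = 0 if clear else mem_word   # mask unit 48b zeroes the memory output
    return (partial + o_di) & 0xFFFF     # adder 48a; written back to the same address
```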
 When the convolution operation by the multiplier 42 and the accumulator circuit 43 is completed, the output data f(x, y, do) is stored in the second memory 2.
 The state controller 44 controls the states of the multiplier 42 and the accumulator circuit 43. The state controller 44 is also connected to the controller 6 via the internal bus IB. The state controller 44 has an instruction queue 45 and a control circuit 46.
 The instruction queue 45 is a queue in which instruction commands C4 for the convolution operation circuit 4 are stored; it is configured, for example, as a FIFO memory. Instruction commands C4 are written to the instruction queue 45 via the IFU 7 or via the internal bus IB.
 The control circuit 46 is a state machine that decodes the instruction commands C4 and controls the multiplier 42 and the accumulator circuit 43 based on them. The control circuit 46 has the same configuration as the control circuit 34 of the state controller 32 of the DMAC 3.
 FIG. 16 is an internal block diagram of the instruction decompressor 49.
 The instruction decompressor 49 restores (decompresses) the instruction commands C4 from compressed instruction commands in which the instruction commands C4 have been compressed. The instruction decompressor 49 has a decompressor 49a and a ring buffer 49b. The decompressor 49a decodes the compressed instruction commands input from the IFU 7 and restores the instruction commands C4 based on the data stored in the ring buffer 49b. The ring buffer 49b is a ring-shaped buffer memory; it is not limited to a ring-shaped buffer memory, however, and may be a buffer memory of another type.
 FIG. 17 is a diagram showing an example of the compressed instruction commands decompressed by the instruction decompressor 49.
 A Push instruction has an opcode field OF and an instruction field IF. The opcode field OF of a Push instruction stores an opcode indicating that the instruction is a Push instruction. The instruction field IF stores an original instruction. The Push instruction stores the original instruction in the ring buffer 49b and outputs the original instruction to the instruction queue 45. The original instructions include, for example, an instruction that sets the input data a, an instruction that sets the weights w, and an instruction that sets the output of the convolution operation output data.
 A Copy instruction has an opcode field OF, a seek field SF, and a count field CF. The opcode field OF of a Copy instruction stores an opcode indicating that the instruction is a Copy instruction. The seek field SF stores a seek value indicating an address in the ring buffer 49b. The count field CF stores a count value indicating the number of instructions to copy. The Copy instruction outputs to the instruction queue 45 the instructions stored in the ring buffer 49b from the address indicated by the seek value onward, for the number of instructions indicated by the count value.
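 A minimal Python model of this Push/Copy scheme follows; the class layout, the buffer depth, and the use of a plain list in place of the instruction queue 45 are illustrative assumptions.

```python
class InstructionDecompressor:
    """Behavioral model of the decompressor 49a with ring buffer 49b."""

    def __init__(self, depth: int = 256):     # depth is an assumed parameter
        self.ring = [None] * depth             # ring buffer 49b
        self.head = 0                          # next write position
        self.queue = []                        # stands in for instruction queue 45

    def push(self, instruction):
        """Push: store the original instruction and emit it downstream."""
        self.ring[self.head % len(self.ring)] = instruction
        self.head += 1
        self.queue.append(instruction)

    def copy(self, seek: int, count: int):
        """Copy: replay `count` stored instructions starting at address `seek`."""
        for i in range(count):
            self.queue.append(self.ring[(seek + i) % len(self.ring)])
```

 For example, a pattern of commands emitted once with push(...) can later be replayed with a single copy(seek, count), so the repeated pattern never has to be fetched again over the external bus EB.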
 The convolution operation circuit 4 of this embodiment can execute convolution operations without requiring an external controller, but to increase the flexibility of the convolution operation it is preferable that the operation executed by a single instruction command C4 can be specified at a fine granularity. As an example, by specifying one instruction command C4 that executes a product of single elements (1×1) of the convolution operation and combining a plurality of such commands, a variety of convolution operations can be realized, such as convolution operations using different weight filters. On the other hand, such finely specified instruction commands C4 increase the total number of instruction commands C4, which causes problems such as an increase in the usage of the external memory 120 and pressure on the bandwidth of the external bus EB. To solve this problem, the convolution operation circuit 4 of this embodiment uses compressed instruction commands in which the instruction commands C4 are compressed.
 The instruction commands C4 for the convolution operation circuit 4 tend to consist of runs of instructions that repeatedly perform convolution operations, so similar instruction commands C4 tend to appear in succession within a short period. Therefore, by storing instructions in the ring buffer 49b with the Push instruction described above and copying the stored instructions with the Copy instruction, the number of compressed instruction commands obtained by compressing the instruction commands C4 for the convolution operation circuit 4 can be reduced.
 Note that the instruction commands C4 for the convolution operation circuit 4 that are input to the instruction decompressor 49 are compressed in advance by a tool, such as a compiler, that generates the instruction commands C4.
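 The compression algorithm on the tool side is not specified in this description. Purely as an illustrative assumption, a greedy compressor could emit a Copy command whenever a run of upcoming instructions already sits in the ring buffer, and a Push command otherwise:

```python
def compress(instructions):
    """Greedy, illustrative compressor producing ("Push", insn) and
    ("Copy", seek, count) tuples. Assumes the whole stream fits in the
    ring buffer, so eviction is not modeled."""
    ring, out, i = [], [], 0
    while i < len(instructions):
        best_len, best_seek = 0, 0
        for seek in range(len(ring)):          # longest match already in the ring
            n = 0
            while (i + n < len(instructions) and seek + n < len(ring)
                   and ring[seek + n] == instructions[i + n]):
                n += 1
            if n > best_len:
                best_len, best_seek = n, seek
        if best_len >= 2:                      # a Copy pays off for runs of >= 2
            out.append(("Copy", best_seek, best_len))
            i += best_len
        else:
            out.append(("Push", instructions[i]))
            ring.append(instructions[i])       # mirror the decompressor's Push
            i += 1
    return out
```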
 FIG. 18 is a diagram showing a modified example of the instruction decompressor 49.
 The convolution operation circuit 4 may include a plurality of instruction decompressors 49. In the example shown in FIG. 18, three instruction decompressors 49 are provided in parallel, and an individual instruction queue 45 is provided for each instruction decompressor 49. The instruction commands C4 for the convolution operation circuit 4 are divided into three groups and input to the three instruction decompressors 49. For example, the instruction commands C4 for the convolution operation circuit 4 are divided into instructions that set the input data a, instructions that set the weights w, and instructions that set the output of the convolution operation output data. Because instructions of the same type are more likely to be input to a given instruction decompressor 49, the utilization efficiency of the ring buffer 49b improves and the compression ratio of the instructions improves. In addition, the control circuit 46 can efficiently read and execute the instructions stored in the three groups of instruction queues 45.
[Quantization operation circuit 5]
 FIG. 19 is an internal block diagram of the quantization operation circuit 5.
 The quantization operation circuit 5 has a quantization parameter memory 51, a vector operation circuit 52, a quantization circuit 53, a state controller 54, and an instruction decompressor 59. Because it has a state controller 54 dedicated to the vector operation circuit 52 and the quantization circuit 53, the quantization operation circuit 5 can perform a quantization operation without an external controller once an instruction command is input.
 The quantization parameter memory 51 is a memory in which the quantization parameters q used in the quantization operation are stored; it is a rewritable memory, for example a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the quantization parameters q required for the quantization operation to the quantization parameter memory 51 by DMA transfer.
 FIG. 20 is an internal block diagram of the vector operation circuit 52 and the quantization circuit 53.
 The vector operation circuit 52 performs operations on the output data f(x, y, do) stored in the second memory 2. The vector operation circuit 52 has Bd arithmetic units 57 and performs SIMD operations on the output data f(x, y, do) in parallel.
 FIG. 21 is a block diagram of the arithmetic unit 57.
 The arithmetic unit 57 has, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e. The arithmetic unit 57 may further include other operators found in known general-purpose SIMD arithmetic circuits.
 By combining the operators of the arithmetic units 57, the vector operation circuit 52 performs, on the output data f(x, y, do), at least one of the operations of the pooling layer 221, the Batch Normalization layer 222, and the activation function layer 223 in the quantization operation layer 220.
 The arithmetic unit 57 can use the ALU 57a to add the data stored in the register 57d to an element f(di) of the output data f(x, y, do) read from the second memory 2, and can store the addition result in the register 57d. By having the first selector 57b input "0" to the ALU 57a instead of the data stored in the register 57d, the arithmetic unit 57 can initialize the accumulation. For example, when the pooling region is 2×2, the shifter 57e can output the average of the accumulated values by shifting the output of the ALU 57a right by 2 bits. By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the average pooling operation shown in Equation 2.
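 For a 2×2 pooling region, the accumulate-then-shift behavior amounts to the short sketch below; the function name is an illustrative assumption.

```python
def average_pool_2x2(f00: int, f01: int, f10: int, f11: int) -> int:
    """2x2 average pooling on one arithmetic unit 57."""
    acc = 0                          # first selector 57b feeds 0 to initialize
    for f in (f00, f01, f10, f11):
        acc += f                     # ALU 57a adds; result kept in register 57d
    return acc >> 2                  # shifter 57e: a 2-bit right shift divides by 4
```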
 The arithmetic unit 57 can use the ALU 57a to compare the data stored in the register 57d with an element f(di) of the output data f(x, y, do) read from the second memory 2. The arithmetic unit 57 controls the second selector 57c according to the comparison result of the ALU 57a and can select the larger of the data stored in the register 57d and the element f(di). By having the first selector 57b input the minimum value that the element f(di) can take to the ALU 57a, the arithmetic unit 57 can initialize the comparison target to the minimum value. In this embodiment the element f(di) is a 16-bit signed integer, so the minimum value that it can take is "0x8000". By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the MAX pooling operation of Equation 3. In the MAX pooling operation, the shifter 57e does not shift the output of the second selector 57c.
 The arithmetic unit 57 can use the ALU 57a to perform a subtraction between the data stored in the register 57d and an element f(di) of the output data f(x, y, do) read from the second memory 2. The shifter 57e can shift the output of the ALU 57a left (that is, multiply) or right (that is, divide). By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the Batch Normalization operation of Equation 4.
 The arithmetic unit 57 can use the ALU 57a to compare an element f(di) of the output data f(x, y, do) read from the second memory 2 with the "0" selected by the first selector 57b. According to the comparison result of the ALU 57a, the arithmetic unit 57 can select and output either the element f(di) or the constant value "0" stored in advance in the register 57d. By repeating these operations with the Bd arithmetic units 57, the vector operation circuit 52 can perform the ReLU operation of Equation 5.
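 Equations 4 and 5 are not reproduced in this excerpt; the following is therefore only a hedged sketch of the subtract-and-shift Batch Normalization and the compare-against-zero ReLU. The power-of-two scaling through `shift` is an assumption consistent with the shifter description above.

```python
def batch_norm_shift(f_di: int, beta: int, shift: int) -> int:
    """Sketch of shift-based Batch Normalization: subtract an offset
    (ALU 57a), then scale by a power of two (shifter 57e)."""
    centered = f_di - beta
    return centered << shift if shift >= 0 else centered >> -shift

def relu(f_di: int) -> int:
    """Sketch of ReLU: compare f(di) with 0 and select f(di) or the constant 0."""
    return f_di if f_di > 0 else 0
```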
 The vector operation circuit 52 can perform average pooling, MAX pooling, Batch Normalization, activation function operations, and combinations of these operations. Because the vector operation circuit 52 can perform general-purpose SIMD operations, it may also perform other operations required by the quantization operation layer 220, and it may perform operations other than those of the quantization operation layer 220.
 Note that the quantization operation circuit 5 does not have to include the vector operation circuit 52. When the quantization operation circuit 5 does not include the vector operation circuit 52, the output data f(x, y, do) is input to the quantization circuit 53.
 The quantization circuit 53 quantizes the output data of the vector operation circuit 52. As shown in FIG. 20, the quantization circuit 53 has Bd quantization units 58 and operates on the output data of the vector operation circuit 52 in parallel.
 FIG. 22 is an internal block diagram of the quantization unit 58.
 The quantization unit 58 quantizes an element in(di) of the output data of the vector operation circuit 52. The quantization unit 58 has a comparator 58a and an encoder 58b. The quantization unit 58 performs, on the output data of the vector operation circuit 52 (16 bits per element), the operation of the quantization layer 224 in the quantization operation layer 220 (Equation 6). The quantization unit 58 reads the necessary quantization parameters q (th0, th1, th2) from the quantization parameter memory 51 and compares the input in(di) with the quantization parameters q using the comparator 58a. The quantization unit 58 encodes the comparison result of the comparator 58a into 2 bits per element using the encoder 58b. Because α(c) and β(c) in Equation 4 are parameters that differ for each variable c, the quantization parameters q (th0, th1, th2) that reflect α(c) and β(c) differ for each in(di).
 The quantization unit 58 classifies the input in(di) into four regions (for example, in ≤ th0, th0 < in ≤ th1, th1 < in ≤ th2, th2 < in) by comparing it with the three thresholds th0, th1, and th2, and encodes the classification result into 2 bits for output. Depending on the setting of the quantization parameters q (th0, th1, th2), the quantization unit 58 can also perform Batch Normalization and activation function operations together with the quantization.
 By performing quantization with the threshold th0 set to β(c) of Equation 4 and the threshold differences (th1 - th0) and (th2 - th1) set to α(c) of Equation 4, the quantization unit 58 can carry out the Batch Normalization operation of Equation 4 together with the quantization. Increasing (th1 - th0) and (th2 - th1) decreases α(c); decreasing (th1 - th0) and (th2 - th1) increases α(c).
 The quantization unit 58 can also apply an activation function together with the quantization of the input in(di). For example, the quantization unit 58 saturates the output value in the regions where in(di) ≤ th0 and th2 < in(di). By setting the quantization parameters q so that the output becomes nonlinear, the quantization unit 58 can perform the activation function operation together with the quantization.
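 Taken together, the threshold comparison and 2-bit encoding behave like the sketch below; the function name and signature are illustrative assumptions.

```python
def quantize_2bit(x: int, th0: int, th1: int, th2: int) -> int:
    """Quantization unit 58: classify a 16-bit input into four regions with
    three thresholds and encode the region in 2 bits (0..3). Choosing th0 as
    beta(c) and the gaps (th1 - th0) and (th2 - th1) as alpha(c) folds Batch
    Normalization into this step; saturation at both ends realizes the
    activation function."""
    if x <= th0:
        return 0
    if x <= th1:
        return 1
    if x <= th2:
        return 2
    return 3
```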
 The state controller 54 controls the states of the vector operation circuit 52 and the quantization circuit 53. The state controller 54 is also connected to the controller 6 via the internal bus IB. The state controller 54 has an instruction queue 55 and a control circuit 56.
 The instruction queue 55 is a queue in which instruction commands C5 for the quantization operation circuit 5 are stored; it is configured, for example, as a FIFO memory. Instruction commands C5 are written to the instruction queue 55 via the IFU 7 or via the internal bus IB.
 The control circuit 56 is a state machine that decodes the instruction commands C5 and controls the vector operation circuit 52 and the quantization circuit 53 based on them. The control circuit 56 has the same configuration as the control circuit 34 of the state controller 32 of the DMAC 3.
 The instruction decompressor 59 restores (decompresses) the instruction commands C5 from compressed instruction commands in which the instruction commands C5 have been compressed. The instruction decompressor 59 has the same configuration as the instruction decompressor 49 of the convolution operation circuit 4.
 The quantization operation circuit 5 writes quantization operation output data having Bd elements to the first memory 1. A preferred relationship between Bd and Bc is shown in Equation 10, in which n is an integer.
[Equation 10]
[Controller 6]
 The controller 6 transfers instruction commands transferred from the external host CPU 110, via the internal bus IB, to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5. The controller 6 may have an instruction memory that stores the instruction commands for each circuit.
 The controller 6 is connected to the external bus EB and operates as a slave of the external host CPU 110. The controller 6 has registers 61 including a parameter register and a status register. The parameter register is a register that controls the operation of the NN circuit 100. The status register is a register that indicates the status of the NN circuit 100, including the semaphore S.
 According to the neural network circuit 100 of this embodiment, the NN circuit 100, which can be embedded in embedded devices such as IoT devices, can be operated with high performance. By connecting a plurality of NN calculation cores 10, a larger number of neural network operations can be performed efficiently and at high speed.
 The first embodiment of the present invention has been described above in detail with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like within a scope that does not depart from the gist of the present invention. The components shown in the above embodiment and the modified examples can be combined as appropriate.
(Variation 1)
 In the above embodiment, the instruction commands that control the neural network circuit 100 were described as requiring one instruction command per operation, but the form of the instruction commands is not limited to this. The instruction commands may take a form in which a plurality of operations are executed by one or more instruction commands. Specifically, a series of consecutive 1×1 convolution operations may be executed based on a plurality of instruction commands. The plurality of instruction commands include at least an instruction that determines the range (offset and step) of the elements A(ci) of the input vector A held in the first memory 1, an instruction that determines the range (offset and step) of the elements W(ci, di) of the weight matrix W held in the weight memory 41, an instruction that determines the storage position (offset and step) in the second memory 2 where the product-sum operation results O(di) are stored, and an instruction that determines the number of repetitions (filter size) of the 1×1 convolution operation, as sketched below. By executing a plurality of operations with fewer instruction commands in this way, the total number of instruction commands can also be reduced. Furthermore, by using the configuration of this embodiment, the number of instruction commands can be reduced even further, lessening the increase in the usage of the external memory 120 and the pressure on the bandwidth of the external bus EB.
(Variation 2)
 In the above embodiment, the first memory 1 and the second memory 2 are separate memories, but the form of the first memory 1 and the second memory 2 is not limited to this. The first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.
(Variation 3)
 The data input to the NN circuit 100 described in the above embodiment is not limited to a single format and may consist of still images, moving images, audio, text, numerical values, and combinations thereof. The data input to the NN circuit 100 is not limited to the measurement results of physical quantity measuring instruments, such as optical sensors, thermometers, Global Positioning System (GPS) instruments, angular velocity instruments, and anemometers, that may be mounted on the edge device in which the NN circuit 100 is provided. The data may be combined with different kinds of information, such as peripheral information received from peripheral devices by wired or wireless communication (base station information, information on vehicles and ships, weather information, and information on congestion), financial information, and personal information.
(Variation 4)
 The edge device in which the NN circuit 100 is provided is assumed to be a battery-driven mobile device, such as a communication device such as a mobile phone, a smart device such as a personal computer, a digital camera, a game device, or a robot product, but is not limited to these. Effects not obtained by other prior examples can also be obtained by using the circuit in products with strict limits on the peak power that can be supplied, for example via Power over Ethernet (PoE), or with strong demands for reduced heat generation or long-duration operation. For example, applying the circuit to in-vehicle cameras mounted on vehicles and ships, or to surveillance cameras installed in public facilities and on roads, not only enables long-duration recording but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying the circuit to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing and construction sites.
(Variation 5)
 Part or all of the NN circuit 100 may be realized using one or more processors. For example, part or all of the input layer or the output layer may be realized by software processing on a processor. The part of the input layer or the output layer realized by software processing is, for example, data normalization or conversion; this makes it possible to support various input and output formats. The software executed by the processor may be configured to be rewritable via communication means or external media.
(Variation 6)
 Part of the processing of the CNN 200 may be realized by combining the NN circuit 100 with a Graphics Processing Unit (GPU) or the like on a cloud. By performing further processing on the cloud in addition to the processing performed on the edge device in which the NN circuit 100 is provided, or by performing processing on the edge device in addition to the processing on the cloud, the NN circuit 100 can realize more complex processing with fewer resources. With such a configuration, the NN circuit 100 can reduce the amount of communication between the edge device and the cloud by distributing the processing.
 The effects described in this specification are merely explanatory or illustrative and are not limiting. In other words, the technology according to the present disclosure may achieve other effects that are apparent to those skilled in the art from the description in this specification, in addition to or in place of the above effects.
 The present invention can be applied to neural network operations.
200 Convolutional neural network (CNN)
100 Neural network circuit (NN circuit)
10 Neural network calculation core (NN calculation core)
10A First neural network calculation core (first NN calculation core)
10B Second neural network calculation core (second NN calculation core)
10M Neural network calculation multi-core (NN calculation multi-core)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
7 Instruction fetch unit (IFU)

Claims (7)

  1.  A neural network circuit comprising:
     a convolution operation circuit that performs a convolution operation on input data,
     wherein the convolution operation circuit has an instruction decompressor that restores an instruction command for the convolution operation circuit, the instruction command operating the convolution operation circuit, from a compressed instruction command in which the instruction command is compressed.
  2.  The neural network circuit according to claim 1, further comprising:
     a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit,
     wherein the quantization operation circuit has an instruction decompressor that restores an instruction command for the quantization operation circuit, the instruction command operating the quantization operation circuit, from a compressed instruction command in which the instruction command is compressed.
  3.  The neural network circuit according to claim 2, further comprising:
     an instruction fetch unit that reads, from a memory, instruction commands that operate the convolution operation circuit or the quantization operation circuit,
     wherein the instruction fetch unit inputs the instruction commands to the instruction decompressors.
  4.  The neural network circuit according to claim 1,
     wherein the convolution operation circuit has a plurality of the instruction decompressors, and
     the compressed instruction commands are divided and input to different instruction decompressors.
  5.  The neural network circuit according to claim 2, further comprising:
     a first memory that stores the input data; and
     a second memory that stores the convolution operation output data,
     wherein quantization operation output data of the quantization operation circuit is stored in the first memory, and
     the quantization operation output data stored in the first memory is input to the convolution operation circuit as the input data.
  6.  The neural network circuit according to claim 5,
     wherein the first memory, the convolution operation circuit, the second memory, and the quantization operation circuit are formed in a loop.
  7.  A method for controlling a neural network circuit, the neural network circuit comprising:
     a convolution operation circuit that performs a convolution operation on input data;
     a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit; and
     an instruction fetch unit that reads, from a memory, a compressed instruction command in which an instruction command for the convolution operation circuit that operates the convolution operation circuit is compressed, and a compressed instruction command in which an instruction command for the quantization operation circuit that operates the quantization operation circuit is compressed,
     the method comprising the steps of:
     causing the instruction fetch unit to read the instruction command for the convolution operation circuit and the instruction command for the quantization operation circuit separately from the memory, and to supply the instruction commands separately to the convolution operation circuit and the quantization operation circuit;
     causing the convolution operation circuit and the quantization operation circuit to restore the instruction commands from the compressed instruction commands; and
     operating the convolution operation circuit and the quantization operation circuit in parallel based on the restored instruction commands.
PCT/JP2023/042052 2022-11-22 2023-11-22 Neural network circuit and neural network computing method WO2024111644A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022186308A JP2024075106A (en) 2022-11-22 2022-11-22 Neural network circuit and neural network operation method
JP2022-186308 2022-11-22

Publications (1)

Publication Number Publication Date
WO2024111644A1 true WO2024111644A1 (en) 2024-05-30

Family

ID=91196083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/042052 WO2024111644A1 (en) 2022-11-22 2023-11-22 Neural network circuit and neural network computing method

Country Status (2)

Country Link
JP (1) JP2024075106A (en)
WO (1) WO2024111644A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09265397A (en) * 1996-03-29 1997-10-07 Hitachi Ltd Processor for vliw instruction
US20160140686A1 (en) * 2014-11-18 2016-05-19 Intel Corporation Efficient preemption for graphics processors
JP2021082285A (en) * 2019-11-15 2021-05-27 インテル コーポレイション Data locality enhancement for graphics processing units
JP2022030486A (en) * 2020-08-07 2022-02-18 LeapMind株式会社 Neural network circuit and method for controlling neural network circuit

Also Published As

Publication number Publication date
JP2024075106A (en) 2024-06-03
