WO2020238843A1 - Neural network computing device and method, and computing device - Google Patents

Neural network computing device and method, and computing device

Info

Publication number
WO2020238843A1
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
convolution
circuit
output
dram cells
Prior art date
Application number
PCT/CN2020/092053
Other languages
English (en)
Chinese (zh)
Inventor
李�诚
杨伟
李鸽子
苏尔达山·奇拉格
韦恩·诺伯特
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2020238843A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention relates to the technical field of neural networks, in particular to a neural network computing device, method and computing device.
  • convolutional neural network calculations are highly parallel and data-intensive calculations
  • the processor of the computing device uses the von Neumann architecture.
  • the processor first stores the data of the convolutional neural network into the memory.
  • the memory is dynamic random access memory (DRAM).
  • DRAM dynamic random access memory
  • the data bus transfers the data stored in the DRAM to the neural network computing node, and the calculation of the convolutional neural network is completed in the neural network computing node.
  • the data bus outputs the calculation result to the DRAM.
  • the data bus needs to transfer data back and forth between DRAM and neural network computing nodes.
  • the bandwidth of the data bus connecting the processor and the DRAM becomes insufficient, so the memory wall problem is very likely to occur.
  • the embodiments of the present invention provide a neural network computing device, a method, and a computing device, which can complete neural network calculations in memory and solve the problem of the memory wall.
  • the technical scheme is as follows:
  • a neural network computing device in the first aspect, includes:
  • the first storage array includes multiple rows and multiple columns of dynamic random access memory (DRAM) cells, the first row of DRAM cells in the storage array is used for storing a convolution kernel, and at least two rows of DRAM cells in the storage array are used for storing input data on which convolution calculation is to be performed;
  • the first calculation circuit, connected to the storage array, is used to perform multiple convolution calculations on the input data and the convolution kernel, where each convolution calculation multiplies and adds the convolution kernel with the input data stored in one row of the at least two rows of DRAM cells.
  • the neural network computing device stores the convolution kernel and the input data on which convolution calculation is to be performed in a first storage array that includes multiple rows and multiple columns of DRAM cells.
  • the first calculation circuit can read the convolution kernel and the input data of the convolution calculation to be performed from the first storage array, and complete the convolution calculation.
  • the neural network calculation is thus realized in the memory, the transmission of neural network data between the memory and the neural network computing node is reduced, and the problem of the memory wall is solved.
  • the first calculation circuit includes:
  • a plurality of latches, where the input of each latch is connected to a column of DRAM cells, the output of each latch is connected to one of a plurality of multipliers, and each latch is used to cache the one-bit value of an element of the convolution kernel stored in the DRAM cell of the corresponding column;
  • a plurality of multipliers, where the first input of each multiplier is connected to a column of DRAM cells, and the second input of each multiplier is connected to the output of a latch;
  • a plurality of adders, where the input end of each adder is connected to at least two multipliers;
  • wherein, in one convolution calculation, the multiple multipliers are used to multiply the convolution kernel cached by the multiple latches with the input data stored in one row of the multiple rows of DRAM cells, the multiple adders are used to add the output results of the multiple multipliers, and the output of the multiple adders is the result of one convolution calculation.
  • the neural network computing device further includes:
  • a second calculation circuit, where the second calculation circuit is connected to the first calculation circuit and includes a summation circuit, which is used to sum, during one summation calculation, the results of the convolution calculations output by the multiple adders.
  • the second calculation circuit is configured to perform multiple summation calculations, and the second calculation circuit further includes:
  • a buffer connected to the summation circuit, for buffering the calculation result of the summation circuit during multiple summation calculations
  • a pooling circuit where the pooling circuit is connected to the buffer and configured to perform a pooling operation on the calculation results of the multiple summation calculations buffered in the buffer.
  • the second calculation circuit further includes:
  • a selection circuit, where the first input terminal of the selection circuit is connected to the output terminal of the pooling circuit, the second input terminal of the selection circuit is connected to the output terminal of the summing circuit, and the selection circuit is used to select and output either the output result of the pooling circuit or the output result of the summing circuit;
  • an activation circuit, connected to the output terminal of the selection circuit, which is used to perform an activation operation on the output result of the selection circuit and to input the activated result to a second storage array, where the second storage array includes multiple rows and multiple columns of DRAM cells.
  • each of the multipliers includes:
  • at least one XOR unit, where the first input terminal of each XOR unit is connected to a latch, the second input terminal of each XOR unit is connected to a column of DRAM cells, and each XOR unit is used to XOR the one-bit value of an element of the convolution kernel cached by a latch with the one-bit value of an element of the input data stored in one row of the at least two rows of DRAM cells, where each DRAM cell in the at least two rows of DRAM cells is used to store a one-bit value of an element of the input data on which convolution calculation is to be performed;
  • a combination unit, where the first input terminal of the combination unit is connected to the output terminal of the at least one XOR unit, the second input terminal of the combination unit is connected to at least one latch, and the combination unit is used to combine the data output by the at least one XOR unit with the data output by the at least one latch.
  • the neural network calculation is implemented in the memory, the transmission of neural network data between the memory and the neural network computing node is reduced, and the memory wall problem is solved.
  • when the convolution calculation is implemented, the multiplication in the convolution calculation is split into an exclusive-OR operation, and the result of the convolution calculation can be obtained through the XOR unit, thereby avoiding the process of splitting the convolution calculation into many basic calculation logic operations and further reducing the power consumption of implementing the convolution calculation.
  • in a second aspect, a computing device includes a controller for receiving a convolution kernel and input data on which convolution calculation is to be performed, and for sending the convolution kernel and the input data to a memory connected to the controller;
  • the memory includes:
  • the first storage array includes multiple rows and multiple columns of dynamic random access memory (DRAM) cells for storing the convolution kernel and input data sent by the controller, where the first row of DRAM cells in the storage array is used for storing the convolution kernel, and at least two rows of DRAM cells in the storage array are used to store input data on which convolution calculation is to be performed; and
  • the first calculation circuit, connected to the storage array, is used to perform multiple convolution calculations on the input data and the convolution kernel, where each convolution calculation multiplies and adds the convolution kernel with the input data stored in one row of the at least two rows of DRAM cells.
  • the first calculation circuit includes:
  • a plurality of latches, where the input of each latch is connected to a column of DRAM cells, the output of each latch is connected to one of a plurality of multipliers, and each latch is used to cache the one-bit value of an element of the convolution kernel stored in the DRAM cell of the corresponding column;
  • a plurality of multipliers, where the first input of each multiplier is connected to a column of DRAM cells, and the second input of each multiplier is connected to the output of a latch;
  • a plurality of adders, where the input end of each adder is connected to at least two multipliers;
  • wherein, in one convolution calculation, the multiple multipliers are used to multiply the convolution kernel cached by the multiple latches with the input data stored in one row of the multiple rows of DRAM cells, the multiple adders are used to add the output results of the multiple multipliers, and the output of the multiple adders is the result of one convolution calculation.
  • the memory further includes:
  • a second calculation circuit, where the second calculation circuit is connected to the first calculation circuit and includes a summation circuit, which is used to sum, during one summation calculation, the results of the convolution calculations output by the multiple adders.
  • the second calculation circuit is configured to perform multiple summation calculations, and the second calculation circuit further includes:
  • a buffer connected to the summation circuit, for buffering the calculation result of the summation circuit during multiple summation calculations
  • a pooling circuit, where the pooling circuit is connected to the buffer and configured to perform a pooling operation on the calculation results buffered in the buffer during the multiple summation calculations.
  • the second calculation circuit further includes:
  • a selection circuit, where the first input terminal of the selection circuit is connected to the output terminal of the pooling circuit, the second input terminal of the selection circuit is connected to the output terminal of the summing circuit, and the selection circuit is used to select and output either the output result of the pooling circuit or the output result of the summing circuit;
  • an activation circuit, connected to the output terminal of the selection circuit, which is used to perform an activation operation on the output result of the selection circuit and to input the activated result to a second storage array, where the second storage array includes multiple rows and multiple columns of DRAM cells.
  • each of the multipliers includes:
  • at least one XOR unit, where the first input terminal of each XOR unit is connected to a latch, the second input terminal of each XOR unit is connected to a column of DRAM cells, and each XOR unit is used to XOR the one-bit value of an element of the convolution kernel cached by a latch with the one-bit value of an element of the input data stored in one row of the at least two rows of DRAM cells, where each DRAM cell in the at least two rows of DRAM cells is used to store a one-bit value of an element of the input data on which convolution calculation is to be performed;
  • a combination unit, where the first input terminal of the combination unit is connected to the output terminal of the at least one XOR unit, the second input terminal of the combination unit is connected to at least one latch, and the combination unit is used to combine the data output by the at least one XOR unit with the data output by the at least one latch.
  • a neural network calculation method is provided, the method is executed by a neural network calculation device, and includes:
  • each convolution calculation is used to multiply and add the convolution kernel with the input data stored in one row of the at least two rows of DRAM cells.
  • the performing multiple convolution operations on the input data and the convolution kernel based on the first calculation circuit connected to the first storage array includes:
  • the convolution kernel stored in the storage array is stored in a plurality of latches in the first calculation circuit, where each latch is used to cache an element of the convolution kernel stored in the DRAM cell of the corresponding column, and each DRAM cell in the first row of DRAM cells is used to store an element of the convolution kernel;
  • the output results of the multiple multipliers are added, and the output of the multiple adders is the result of a convolution calculation.
  • performing a multiplication operation on the convolution kernel cached by the multiple latches and the input data stored in one row of DRAM cells among the multiple rows of DRAM cells includes:
  • XORing the one-bit value of an element of the convolution kernel cached by a latch with the one-bit value of an element of the input data stored in one row of the at least two rows of DRAM cells, where each DRAM cell in the at least two rows of DRAM cells is used to store a one-bit value of an element of the input data on which convolution calculation is to be performed;
  • the combination unit is used to combine the data output by the at least one XOR unit with the data output by the at least one latch.
  • the method further includes:
  • the results of the one-time convolution calculation output by the plurality of adders are summed.
  • the method further includes:
  • a pooling operation is performed on the calculation results of the multiple summation calculation processes buffered in the buffer.
  • the method further includes:
  • an activation operation is performed on the output result of the selection circuit.
  • FIG. 1 is a schematic diagram of performing convolution processing on a feature map based on a convolution kernel according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of layer calculation of a convolutional neural network provided by an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a neural network computing device provided by an embodiment of the present invention.
  • FIG. 4 is an exploded schematic diagram of a neural network calculation process provided by an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a circuit for realizing exclusive OR calculation provided by an embodiment of the present invention.
  • Figure 6 is a schematic diagram of an adder provided by an embodiment of the present invention.
  • FIG. 7 is a block diagram of logic circuits near PSA in a neural network computing device provided by an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a second calculation circuit provided by an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a pooling calculation process provided by an embodiment of the present invention.
  • FIG. 10 is a block diagram of a logic circuit near the SSA in a neural network computing device according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.
  • Figure 12 is a schematic diagram of an implementation environment provided by an embodiment of the present invention.
  • Figure 13 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • FIG. 14 is a flowchart of a neural network calculation method provided by an embodiment of the present invention.
  • FIG. 15 is a schematic diagram of a convolution calculation process provided by an embodiment of the present invention.
  • FIG. 16 is a schematic diagram of deploying calculation data according to an embodiment of the present invention.
  • the convolutional neural network is constructed from N layers (N is a positive integer greater than 1), and the output of any layer can be used as the input of the next layer; for example, the feature map output by the first layer can be used as the input feature map of the second layer, the feature map output by the second layer can be used as the input feature map of the third layer, and so on, until the Nth layer directly outputs the feature map.
  • Convolutional neural network layers: from the perspective of their position, the layers of a convolutional neural network can be divided into an input layer, hidden layers, and an output layer.
  • the input layer refers to the first layer;
  • the output layer refers to the last layer;
  • a hidden layer refers to each layer between the input layer and the output layer.
  • the layers of convolutional neural networks can be divided into convolutional layers, pooling layers, activation function layers, and fully connected layers. Among them, the convolutional layer is the core layer in the convolutional neural network.
  • Other layers can be regarded as variants of the convolutional layer or further processing of the feature map output by the convolutional layer.
  • Feature map: it can also be called a feature mapping, activation map, convolved feature, etc.
  • the feature map is the input and output data of each layer in the convolutional neural network.
  • the feature map may be image data or voice data.
  • the embodiment of the present invention does not specifically limit the data type of the feature map.
  • the feature map output from the first layer to the (N-1)th layer includes multiple feature points.
  • Each feature point in the feature map represents a feature element on the feature map.
  • the feature map is image data
  • the feature point represents a pixel value.
  • the feature map is voice data
  • the feature point represents the voice value.
  • the feature map output by the Nth layer may include only one feature point that represents the category to which the feature map recognized by the convolutional neural network belongs, or the feature map output by the Nth layer may include multiple feature points, where each feature point represents a category to which the recognized feature map may belong and the probability of belonging to that category.
  • Convolution kernel (kernel): it can also be called a filter, feature detector, kernel function, or weight kernel, etc.
  • the convolution kernel can be regarded as a weight matrix with a scanning window.
  • the weight matrix includes multiple weights, and each weight can also be called a parameter. Different convolution kernels correspond to different weights and can therefore detect different features in the feature map.
  • the size of the convolution kernel (filter size, that is, the size of the weight matrix) can be determined according to actual business requirements; the size of each convolution kernel in a convolutional layer is L×M (L and M are positive integers greater than 1), for example 2×2, and the size of each convolution kernel in a fully connected layer is 1×1.
  • Each convolution kernel is used to convolve the feature map to obtain an output feature map.
  • Each layer may include at least one convolution kernel, and at least one feature map can be output through the at least one convolution kernel.
  • Scan window: used to determine the area of the feature map targeted by each convolution operation.
  • the size of the scan window is equal to the size of the convolution kernel (the filter size, that is, the size of the weight matrix).
  • the stride of the scan window is used to determine the distance of each slide of the scan window; the stride is determined according to actual needs and can be 1.
  • convolution processing can be understood as the process of corresponding multiplication and accumulation.
  • the scanning window of the convolution kernel will slide on the feature map. By scanning each area of the feature map, each area of the feature map is calculated in turn.
  • when the scan window of the convolution kernel slides to any area of the feature map, the feature elements in that area are read and point-multiplied with the convolution kernel to obtain a feature point.
  • the scanning window will slide by one step to be in the next area of the feature map, and output a feature point again.
  • all the output feature points can be combined into a feature map and output to the next layer.
  • FIG. 1 shows a schematic diagram of performing convolution processing on a feature map based on a convolution kernel.
  • the size of the convolution kernel is 3×3, the convolution kernel can be expressed as [1, 0, 1; 0, 1, 0; 1, 0, 1], and the stride of the scan window is 1.
  • after each convolution, the scan window slides by one stride, and point multiplication is then performed on the feature matrix and the convolution kernel in the scan window at the new position to obtain another feature point.
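The sliding-window multiply-accumulate described above can be summarised in a few lines of code. The following is a minimal NumPy sketch of the general idea only (the function and variable names are ours, not from the patent); the stride is fixed at 1 as in the Figure 1 example.

```python
import numpy as np

def conv2d_single(feature_map: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide the kernel's scan window over the feature map; each position
    yields one feature point via element-wise multiplication and accumulation."""
    kh, kw = kernel.shape
    fh, fw = feature_map.shape
    oh = (fh - kh) // stride + 1
    ow = (fw - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=feature_map.dtype)
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)   # point multiplication = one feature point
    return out

# Figure 1 style example: 3x3 kernel [1,0,1; 0,1,0; 1,0,1], stride 1
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
fmap = np.arange(25).reshape(5, 5)
print(conv2d_single(fmap, kernel))
```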
  • the large cuboid in Figure 2 is the feature map to be recognized in a certain layer of the convolutional neural network, and the small cube is the convolution kernel.
  • the length, width, and height of the large cuboid are 3, 3, and 2, respectively.
  • the feature map can be expressed as 3×3×2, where the height of the large cuboid can be regarded as the depth of the feature map, that is, the number of layers of the feature map.
  • the feature map in Figure 2 consists of two layers of data, and the length, width, and height of the small cube are all 2.
  • the convolution kernel can be expressed as 2×2×2, where the height of the small cube can be regarded as the depth of the convolution kernel, or the number of convolution kernels; that is, the feature map requires 2 convolution kernels for convolution, and each convolution kernel is used to perform the convolution calculation on one layer of the feature map. When the convolution calculation of each convolution kernel is completed, a 2×2 feature map is obtained; the 2×2 feature map is then pooled, and the pooled feature map is output.
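To make the Figure 2 example concrete, here is a hedged NumPy sketch that convolves a 3×3×2 feature map with a 2×2×2 kernel (one 2×2 kernel per layer, the per-layer results summed), producing a 2×2 map that is then pooled. The concrete numbers and the use of max pooling are illustrative assumptions, not values from the patent.

```python
import numpy as np

def conv_layerwise(fmap: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """fmap: (depth, H, W); kernels: (depth, kh, kw). Each kernel convolves its
    own layer of the feature map and the per-layer results are summed."""
    depth = fmap.shape[0]
    _, kh, kw = kernels.shape
    oh, ow = fmap.shape[1] - kh + 1, fmap.shape[2] - kw + 1
    out = np.zeros((oh, ow))
    for d in range(depth):
        for i in range(oh):
            for j in range(ow):
                out[i, j] += np.sum(fmap[d, i:i + kh, j:j + kw] * kernels[d])
    return out

fmap = np.arange(18, dtype=float).reshape(2, 3, 3)   # 3x3x2 feature map
kernels = np.ones((2, 2, 2))                         # 2x2x2 convolution kernel
conv_out = conv_layerwise(fmap, kernels)             # 2x2 feature map
pooled = conv_out.max()                              # pooling of the 2x2 map down to one value
print(conv_out, pooled)
```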
  • the foregoing calculation process may be implemented in a neural network computing device.
  • FIG. 3 is a neural network computing device provided by an embodiment of the present invention.
  • the neural network computing device includes: a first storage array, including multiple rows and multiple columns of dynamic random access memory (DRAM) cells, where the first row of DRAM cells in the storage array is used to store the convolution kernel and at least two rows of DRAM cells in the storage array are used to store the input data on which convolution calculation is to be performed; and a first calculation circuit, connected to the storage array, which is used to perform multiple convolution calculations on the input data and the convolution kernel, where each convolution calculation multiplies and adds the convolution kernel with the input data stored in one row of the at least two rows of DRAM cells.
  • DRAM dynamic random access memory
  • the first storage array may be a subarray in the DRAM, and a DRAM cell may be a cell in the subarray.
  • multiple DRAM cells can be arranged in a row of DRAM cells, each row of DRAM cells is connected to a word line, and any word line is used to select the row of DRAM cells connected to it; multiple DRAM cells can also be arranged in a column of DRAM cells, each column of DRAM cells is connected to a bit line, and any bit line is used to select the column of DRAM cells connected to it.
  • Each DRAM cell can include a capacitor and a transistor.
  • the capacitor can store 1-bit data.
  • the amount of charge (potential) after charging and discharging corresponds to binary data 0 and 1, respectively.
  • a DRAM cell can be used to store 0 or 1. Since a weight can be represented by at least one binary bit value, a weight can be stored by at least one DRAM cell; that is, a DRAM cell can be used to store a 1-bit binary value of a weight.
  • a DRAM cell can also store a 1-bit binary value of the input data on which convolution calculation is to be performed.
  • the neural network computing device may also include a row decoder and a column decoder, the row decoder is connected to each word line in the first storage array, Thus, the DRAM cells of each row in the first memory array can be connected through word lines; the column decoder is connected to each bit line in the first memory array, so that each column of DRAM cells in the first memory array can be connected through bit lines.
  • the neural network computing device can activate the DRAM cells connected to any word line and the DRAM cells connected to any bit line through the row decoder and column decoder, so that the intersecting DRAM cell can be read and written: a read operation is used to read the data stored in the intersecting DRAM cell, and a write operation is used to store data in the intersecting DRAM cell, which further realizes the reading and storing of data in the first storage array.
  • the row decoder can select the corresponding row address (a row address corresponds to a word line) through the row address in the row address buffer to activate a specific row address, and then realize the addressing of the row address.
  • the column decoder selects the corresponding column address (a column address corresponds to a bit line) through the column address in the column address buffer to activate the specific column address, and then realize the addressing of the column address.
  • the activated row address and the activated column address intersect at the position where a DRAM cell is located, so that the DRAM cell can be addressed, and data can be stored in the DRAM cell through a write operation, or The data stored in the DRAM cell is read through a read operation.
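As a rough software model of the addressing just described (an illustrative assumption, not the patent's circuit), a cell is selected by activating one word line through the row decoder and one bit line through the column decoder; the cell at the intersection can then be read or written.

```python
class DramArray:
    """Toy model of a DRAM subarray: word lines select rows, bit lines select columns."""
    def __init__(self, rows: int, cols: int):
        self.cells = [[0] * cols for _ in range(rows)]

    def write(self, row: int, col: int, bit: int) -> None:
        # row decoder activates the word line, column decoder the bit line
        self.cells[row][col] = bit & 1

    def read(self, row: int, col: int) -> int:
        # in hardware a sense amplifier would amplify the stored charge here
        return self.cells[row][col]

array = DramArray(rows=16, cols=64)
array.write(row=9, col=3, bit=1)     # e.g. store one bit of a convolution kernel
print(array.read(row=9, col=3))
```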
  • the first storage array may further include at least one primary sense amplifier (PSA) and at least one secondary sense amplifier (SSA), where the input end of one PSA is connected to a column of DRAM cells and the PSA is used to amplify the amount of charge read from a DRAM cell in that column.
  • PSA primary sense amplifier
  • SSA secondary sense amplifier
  • the amplified charge amount is used to indicate whether the data stored in the DRAM cell is 0 or 1.
  • when the amplified charge amount is greater than a preset value, it indicates that the data stored in the storage element is 1; otherwise, it indicates that the data stored in the storage element is 0. The embodiment of the present invention does not specifically limit the preset value.
  • the input end of an SSA is connected to multiple first storage arrays, and the SSA is used to select and read the output result of the PSA, and then write the output result to the read-write bus of the neural network computing device.
  • the first row of DRAM cells mentioned above is used to refer to any row of DRAM cells in the first storage array.
  • each convolution kernel can be stored in any row of DRAM cells in the first storage array.
  • the convolution kernel 1 is stored in the 9th row in the storage array
  • the convolution kernel 2 is stored in the 10th row in the storage array.
  • the neural network computing device may include multiple first storage arrays.
  • the input feature map is split into multiple layers, and a convolution kernel is used to convolve the feature map of each layer; therefore, one convolution kernel may correspond to a large amount of input data on which convolution calculation is to be performed.
  • the neural network computing device can also split a convolution kernel with too many weights into multiple sub-convolution kernels, and each sub-convolution kernel can correspond to multiple groups of input data on which convolution calculation is to be performed.
  • the number of convolution kernels in any network layer, and the amount of input data on which convolution calculation is to be performed corresponding to those convolution kernels, is relatively large.
  • a single first storage array may not be able to store such a large number of convolution kernels and so much input data; therefore, multiple first storage arrays can be used to jointly store the convolution kernels of any network layer and the input data on which convolution calculation is to be performed.
  • FIG. 4 is a decomposition schematic diagram of a neural network calculation process provided by an embodiment of the present invention. It can be seen from Figure 4 that the input feature map is a three-dimensional feature map that is split into multiple two-dimensional sub-feature maps; the calculation within a 4*4 scan area is further decomposed into a 3*3 convolution kernel sliding 4 times over that 4*4 area, where the area covered by the scan window after each slide is regarded as a sub-feature map, and each convolution kernel convolves each sub-feature map.
  • the data on each sub-feature map can form a set of input data for convolution calculation.
  • the input data of each group of convolution calculations to be executed can be stored in any row of DRAM cells other than the first row of DRAM cells in the storage array; therefore, at least two rows of DRAM cells in the storage array can be used to store input data on which convolution calculation is to be performed.
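The layout just described (kernel elements along one row, one scan window's worth of input data per additional row) can be sketched as follows; this is only an illustration, and the helper name and the flattening order are assumptions.

```python
import numpy as np

def layout_rows(fmap: np.ndarray, kernel: np.ndarray):
    """Return (kernel_row, data_rows): the flattened kernel occupies one row,
    and each sliding-window position of the feature map occupies one further row."""
    kh, kw = kernel.shape
    kernel_row = kernel.reshape(-1)
    data_rows = []
    for i in range(fmap.shape[0] - kh + 1):
        for j in range(fmap.shape[1] - kw + 1):
            data_rows.append(fmap[i:i + kh, j:j + kw].reshape(-1))
    return kernel_row, np.stack(data_rows)

kernel_row, data_rows = layout_rows(np.arange(16).reshape(4, 4), np.ones((3, 3)))
print(kernel_row.shape, data_rows.shape)   # (9,), (4, 9): a 3x3 kernel has 4 windows on a 4x4 map
```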
  • a convolution calculation includes multiple calculation steps: for example, each weight is multiplied by the corresponding input data, and the products of the weights in the convolution kernel and the corresponding input data are added to obtain a feature point.
  • the convolution kernel may correspond to multiple sets of input data on which convolution calculation is to be performed, so multiple feature points corresponding to the convolution kernel are obtained; these feature points can form the feature map of the convolution kernel after one pass of convolution, that is, the result of the convolution calculation. Therefore, the first calculation circuit is connected to the first storage array so that the first calculation circuit can read the convolution kernel and the input data of the convolution calculation to be performed from the first storage array and perform the convolution calculation based on the read data.
  • the first calculation circuit includes: a plurality of latches, where the input of each latch is connected to a column of DRAM cells, the output of each latch is connected to one of a plurality of multipliers, and each latch is used to cache the one-bit value of an element of the convolution kernel stored in the DRAM cell of the corresponding column, where each DRAM cell in the first row of DRAM cells is used to store the one-bit value of an element of the convolution kernel; a plurality of multipliers, where the first input of each multiplier is connected to a column of DRAM cells and the second input terminal is connected to the output terminal of a latch; and a plurality of adders, where the input terminal of each adder is connected to at least two multipliers. In one convolution calculation, the multiple multipliers perform a point multiplication of the convolution kernel cached by the plurality of latches with the input data stored in one row of the at least two rows of DRAM cells, and the multiple adders add the outputs of the multipliers.
  • the weight in the convolution kernel can be regarded as an element in the convolution kernel, that is, a weight is an element, and accordingly, a piece of data in the input data to be performed convolution calculation can also be regarded as the convolution calculation to be performed An element in the input data.
  • the quantized convolution kernel and the quantized input data are stored in the first storage array, and the quantized data are represented by binary values, so an element of the convolution kernel can be represented by multiple bits of data. For example, the weight value 3 is quantized as the binary bits 11, where the first 1 is the highest bit of the weight value 3 and the second 1 is the second highest bit, so two DRAM cells in a row can be used to store the weight 3.
  • likewise, multiple DRAM cells in a row can be used to store one element of the input data on which convolution is to be performed.
  • the embodiment of the present invention does not specifically limit the number of bits of the quantized weights and the quantized input data.
  • when the neural network computing device performs convolution calculation, it calculates according to the rows of the first storage array; the process is as follows:
  • the neural network computing device selects the row of DRAM cells storing the convolution kernel, opens the multiple latches, and outputs the elements of the convolution kernel to the multiple latches for caching; it then disconnects the multiple latches from the first storage array and opens a row of DRAM cells (for example, the row of DRAM cells storing the input data of the first window), so that the multipliers perform the first multiplication calculation on the input data in that row of DRAM cells and the elements of the convolution kernel held in the latches. Each adder, connected to at least two multipliers, performs the first addition operation on the calculation results of at least two multipliers, and the calculation results of the multiple adders constitute the result of the first convolution calculation, which is also one feature point.
  • in the second calculation, the row of DRAM cells storing the input data of the second window is opened, and the multipliers perform the second multiplication calculation on the input data stored in that row of DRAM cells and the convolution kernel in the latches.
  • the adders perform the second addition operation on the calculation results of at least two multipliers, and the calculation results of the multiple adders form the result of the second convolution calculation, and so on, until the calculation of all the input data on which convolution calculation is to be performed is completed, where the first window is the window after the first slide of the scan window of the convolution kernel on the input feature map, and the second window is the window after the second slide of the scan window of the convolution kernel on the input feature map.
  • when the neural network computing device implements the calculation of any network layer of the convolutional neural network, multiple convolution kernels may be required.
  • the neural network computing device performs the above row-wise convolution calculation on the first storage array, so that the convolution result of each convolution kernel can be obtained.
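Functionally, the row-by-row procedure above amounts to the following loop (a behavioural sketch only, not a circuit description): `kernel_row` plays the role of the latched convolution kernel and each `data_row` plays the role of one opened row of DRAM cells.

```python
import numpy as np

def convolve_rows(kernel_row: np.ndarray, data_rows: np.ndarray) -> np.ndarray:
    """Latch the kernel row once, then for each data row do the element-wise
    multiplications (the multipliers) and reduce them to one feature point (the adders)."""
    latched = kernel_row.copy()              # the latches cache the kernel elements
    feature_points = []
    for data_row in data_rows:               # open one row of DRAM cells per convolution
        products = latched * data_row        # one multiplier per column
        feature_points.append(products.sum())
    return np.array(feature_points)

kernel_row = np.array([1, 0, 1, 0])
data_rows = np.array([[1, 2, 3, 4],          # input data of the first window
                      [5, 6, 7, 8]])         # input data of the second window
print(convolve_rows(kernel_row, data_rows))  # [4, 12]
```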
  • each multiplier includes: at least one XOR unit, where the first input terminal of each XOR unit is connected to a latch, the second input terminal of each XOR unit is connected to a column of DRAM cells, and each XOR unit is used to XOR the one-bit value of an element of the convolution kernel cached by a latch with the one-bit value of an element of the input data stored in one row of the at least two rows of DRAM cells, where each DRAM cell in the at least two rows of DRAM cells is used to store a one-bit value of an element of the input data on which convolution calculation is to be performed; and a combination unit, where the first input terminal of the combination unit is connected to the output terminal of the at least one XOR unit, the second input terminal of the combination unit is connected to at least one latch, and the combination unit is used to combine the data output by the at least one XOR unit with the data output by the at least one latch.
  • the output data of the combination unit is referred to as the first data
  • the first data is the product of an element in the convolution kernel and an element in the input data for which the convolution calculation is to be performed.
  • An exclusive OR unit can output the second highest bit of the first data, then the combination unit can combine the weight in the convolution kernel with the second highest bit of the first data to obtain the first data.
  • D0 can be represented by a 1-bit value D0_0 and a 1-bit value D0_1, that is, D0 = D0_0 D0_1; the product of D0 and W0 can then be expressed according to the following rule.
  • when quantizing the weights, if any weight is less than or equal to a first preset value, the weight is converted to 0, and when any weight is greater than the first preset value, the weight is converted to 1. Then, in a possible implementation, for any weight stored in the at least one latch: when the weight is less than or equal to the first preset value, the weight is used as the highest bit of the first data and the exclusive OR of the weight and the corresponding input data is used as the second highest bit of the first data, so that the first data is obtained; when the weight is greater than the first preset value, the weight is used as the highest bit of second data, the exclusive OR of the weight and the corresponding input data is used as the second highest bit of the second data, and the second data is added to a second preset value to obtain the first data.
  • the embodiment of the present invention does not specifically limit the first preset value and the second preset value.
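The exact expression for the product of D0 and W0 is not reproduced above, so the following is only one plausible reading of the rule just described, with the preset values assumed (first preset value 0, second preset value 1) and the weight bit taken to encode +1 or -1; every name here is ours.

```python
def xor_multiply(w_bit: int, d_bit: int, second_preset: int = 1) -> int:
    """Hypothetical reading of the XOR-based multiplier: w_bit is the binarised
    weight (assumed to encode +1 when 0 and -1 when 1) and d_bit is a one-bit
    value of an input element; the result is returned as a 2-bit value."""
    x = w_bit ^ d_bit                       # output of the XOR unit
    if w_bit == 0:                          # weight <= first preset value
        return (w_bit << 1) | x             # highest bit = weight, next bit = XOR result
    # weight > first preset value: form the second data, then add the second preset value
    return (((w_bit << 1) | x) + second_preset) & 0b11

for w in (0, 1):
    for d in (0, 1):
        print(w, d, format(xor_multiply(w, d), '02b'))
```

Under these assumed values the output is the 2-bit two's-complement representation of (+1 or -1) times the data bit, so the combination unit only has to concatenate the weight bit with the XOR output (and add a constant when the weight bit is 1), which is consistent with the claim that the multiplication need not be decomposed into many basic logic operations.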
  • FIG. 5 is a schematic diagram of a circuit for implementing XOR calculation according to an embodiment of the present invention.
  • a latch is connected to a switch, and this switch is connected to a PSA.
  • the neural network computing device closes the switch, uses a read operation through the row decoder to open the row where the weight W0 is located, and writes the weight W0 stored in that row into the latch through the PSA; the switch is then opened to disconnect the data bus from the latch, a read operation is used to select the row where the input data D0 is located, and after PSA amplification the input data bit D0_1 is fed into the exclusive OR unit.
  • after quantization, each weight in each convolution kernel may be 1-bit data or multi-bit data, and each input data element may likewise be 1-bit or multi-bit.
  • therefore, the amount of data stored in the first storage array is relatively large, and stepwise summation through multiple adders is used to obtain the result of one convolution calculation.
  • Figure 6 is a schematic diagram of an adder provided by an embodiment of the present invention.
  • this adder, also called an addition tree, can add 32 pieces of 3-bit data to obtain an 8-bit sum; when the convolution kernel size is greater than 32, the result of one convolution calculation is the sum of more than 32 pieces of 3-bit data, so the 8-bit sum obtained by one addition tree cannot be used as the result of a convolution calculation.
  • the partial sums output by multiple addition trees need to be summed again to obtain the result of one convolution calculation.
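An addition tree of the kind described, which sums many narrow operands pairwise so the result grows into a wider word, can be modelled as below; the widths (32 inputs of 3 bits, an 8-bit sum) come from the text, while the code itself is a generic illustration.

```python
def adder_tree(values):
    """Pairwise-reduce a list of small integers, as an addition tree would:
    each level halves the number of partial sums."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

inputs = [7] * 32                   # 32 values of at most 3 bits (maximum 7)
total = adder_tree(inputs)          # 224, which fits in 8 bits
print(total)                        # larger kernels need the partial sums summed again
```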
  • FIG. 7 is a block diagram of the logic circuit near the PSA in the neural network computing device according to an embodiment of the present invention.
  • the dashed box is the new logic circuit diagram in the neural network computing device. It can be seen that each PSA is connected to a column of the first storage array. The first row in the first storage array stores the weights W0 and W1 in the convolution kernel.
  • the first storage array also stores a set of input data D0 (D0_0 D0_1) and D1 (D1_0 D1_1). The weight W0 can be stored in a latch through the PSA; the W0 held in the latch and D0_0 in the first storage array are input to an XOR unit, which outputs the XOR of W0 and D0_0. The XOR of W0 and D0_0, the XOR of W0 and D0_1, and W0 are input to the combination unit, and the combination unit outputs the product D0*W0 of D0 and W0. In the same way, another combination unit obtains the product D1*W1 of D1 and W1, and D0*W0 and D1*W1 are summed through the adder tree to obtain the dot product of the convolution kernel [W0, W1] with the set of input data [D0, D1], which is the result of one convolution calculation of the convolution kernel [W0, W1].
  • the neural network computing device further includes: a second calculation circuit connected to the first calculation circuit, and the second calculation circuit includes a summation circuit for performing a summation calculation In the process, the results of the convolution calculation output from the multiple adders in the first calculation circuit are summed.
  • the summing circuit can obtain the final convolution result through cyclic accumulation.
  • the first input of the summing circuit is connected to the output of the summing circuit, and the second input of the summing circuit is connected to the multiple adders.
  • the result of one convolution calculation received at the second input terminal is added to the sum of N convolution calculations fed back at the first input terminal, so the summing circuit can output the sum of N+1 convolution calculations, where N is a positive integer.
  • S2 is the summing result of the first convolution calculation output by the first calculation circuit
  • T2 is the summation result of two convolution calculations.
  • when the summation circuit sums the results of multiple convolution calculations, each time it performs an accumulation it also needs to determine whether the convolution result accumulated this time is the last of the multiple convolution results; if it is the last one, the summing circuit ends the accumulation, otherwise the summing circuit continues to receive convolution results and accumulates again.
  • in the schematic diagram of the second calculation circuit provided in FIG. 8, the 8-bit data received at input port 1 of the summation circuit is the output data of a first calculation circuit.
  • the 32-bit data received at input port 2 is the result of the previous summation of the summing circuit. After each summation is completed, the summing circuit feeds the result back to input port 2 so that, once input port 1 receives another piece of 8-bit data, another summation can be performed, until the final 32-bit summation result is obtained; this final 32-bit summation result is also the final convolution result.
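Behaviourally, the feedback summing circuit is just an accumulator: each cycle it adds a new 8-bit partial result (input port 1) to its previous 32-bit sum (fed back on input port 2). A minimal sketch, with the class and method names assumed:

```python
class SummingCircuit:
    """Accumulates 8-bit partial convolution results into a 32-bit running sum."""
    def __init__(self):
        self.acc = 0                                       # value fed back to input port 2

    def accumulate(self, partial_sum: int) -> int:
        self.acc = (self.acc + partial_sum) & 0xFFFFFFFF   # keep 32 bits
        return self.acc

summer = SummingCircuit()
for partial in [200, 150, 255, 90]:    # 8-bit outputs of the first calculation circuit
    final = summer.accumulate(partial)
print(final)                            # final 32-bit convolution result (695 here)
```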
  • the computing device may perform a pooling operation on the final convolution result to downsample the final convolution result.
  • the second calculation circuit is used to perform multiple summation calculations, and the second calculation circuit further includes: a buffer connected to the summation circuit for buffering the calculation results of the summation circuit during the multiple summation calculations; and a pooling circuit connected to the buffer, which is used to perform a pooling calculation on the calculation results buffered in the buffer during the multiple summation calculations.
  • the pooling operation is described as follows:
  • the purpose of the pooling operation is to down-sample the first feature map.
  • some relatively large feature points can be selected from the first feature map to achieve down-sampling processing.
  • a 4*4 first feature map can be pooled using four 2*2 pooling kernels, where 2*2 indicates the size of the pooling kernel.
  • the neural network computing device can split the 4*4 first feature map into four 2*2 first sub-feature maps 1-4, where the size of each 2*2 first sub-feature map is the same as the size of a pooling kernel, and then filter out the largest feature point from each first sub-feature map.
  • for example, if the feature points in the first sub-feature map 1 include 1, 1, 5, and 6, the feature point 6 is filtered out of the first sub-feature map 1; in this way four feature points are screened out of the first feature map, and these four feature points are combined into a second feature map, which is the result of the pooling operation on the first feature map.
  • the pooling circuit further includes a plurality of comparators, and each comparator is used to compare the feature points in one first sub-feature map.
  • a comparator includes two input interfaces (input interface 1 and input interface 2) and one output interface, where input interface 1 is connected to the output interface of the comparator.
  • the output interface feeds the larger feature point back to one input interface for the next comparison.
  • after the cyclic comparison, the larger feature point is output to the combination sub-unit, so each comparator can output one feature point and the pooling circuit can output a second feature map.
  • the pooling circuit in FIG. 8 cyclically compares the 16 pieces of 32-bit data output from the buffer, and one piece of 32-bit data is obtained, which is the result of the pooling operation.
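The cyclic-comparison pooling can be modelled as repeatedly keeping the larger of the held value and the next incoming feature point, one comparator per pooling window. A small sketch follows; the 2*2 window split and the example values 1, 1, 5, 6 come from the text above, while the remaining numbers and all names are ours.

```python
def comparator_max(values):
    """Cyclic comparison: hold the current maximum and compare it with each next input."""
    held = values[0]
    for v in values[1:]:
        held = v if v > held else held        # the comparator keeps the larger feature point
    return held

first_feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
second_feature_map = []
for i in (0, 2):                              # split the 4x4 map into four 2x2 sub-feature maps
    for j in (0, 2):
        window = [first_feature_map[i][j], first_feature_map[i][j + 1],
                  first_feature_map[i + 1][j], first_feature_map[i + 1][j + 1]]
        second_feature_map.append(comparator_max(window))
print(second_feature_map)                     # [6, 8, 3, 4]
```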
  • the neural network computing device also needs to perform an activation operation on the pooling result to obtain the feature map output by the network layer.
  • the activation operation refers to the operation of using the activation function to calculate the feature map, so as to realize the nonlinear transformation of the feature map.
  • the activation function can be a rectified linear unit (ReLU) function.
  • the final convolution result may not be pooled, but the activation operation may be directly performed.
  • the second calculation circuit further includes a selection circuit, where the first input terminal of the selection circuit is connected to the output terminal of the pooling circuit, the second input terminal of the selection circuit is connected to the output terminal of the summing circuit, and the selection circuit is used to select and output either the output result of the pooling circuit or the output result of the summing circuit; and an activation circuit, connected to the output terminal of the selection circuit, which is used to perform an activation operation on the output result of the selection circuit and input the activated result to the second storage array, which includes multiple rows and multiple columns of DRAM cells.
  • the selection circuit in Figure 7 has input interfaces 0 and 1, where input interface 0 is connected to the summing circuit and input interface 1 is connected to the pooling circuit. When the output result of the summing circuit meets the preset condition, the selection circuit outputs the output result of the summing circuit, and when the output result of the pooling circuit meets the preset condition, the selection circuit outputs the output result of the pooling circuit.
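The selection-then-activation step amounts to choosing either the pooled result or the raw summation result and applying ReLU to it. A minimal sketch, under the assumption that the "preset condition" reduces to a flag saying whether this layer is pooled:

```python
def relu(x: float) -> float:
    """Rectified linear unit used by the activation circuit."""
    return x if x > 0.0 else 0.0

def select_and_activate(sum_result: float, pool_result: float, use_pooling: bool) -> float:
    """Selection circuit: pass either the pooling output or the summing output
    on to the activation circuit."""
    selected = pool_result if use_pooling else sum_result
    return relu(selected)

print(select_and_activate(sum_result=-3.0, pool_result=7.0, use_pooling=True))   # 7.0
print(select_and_activate(sum_result=-3.0, pool_result=7.0, use_pooling=False))  # 0.0
```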
  • the second storage array may be the first storage array, or may be another storage array other than the first storage array.
  • the output result is input to the second storage array, so that the neural network computing device can perform the calculation of the next network layer based on the second storage array; in this way, the calculation of each network layer of the neural network can be completed within the neural network computing device.
  • the result output by the activation circuit is also the calculation result of the entire neural network.
  • FIG. 10 is a block diagram of the logic circuit near the SSA in the neural network computing device according to an embodiment of the present invention.
  • the adder outputs the result of each convolution calculation of the convolution kernel to the summing circuit in the second calculation circuit through the SSA.
  • the results of multiple convolution calculations are summed to obtain the final convolution result, which is then written into the buffer, and the convolution result in the buffer is output to the pooling circuit.
  • the final convolution result is pooled by the pooling circuit, the pooling result is input to the activation unit and activated there, so that the activation unit outputs the output result of one network layer; the result of the activation unit is then input to the array write-back circuit, and the array write-back circuit writes this result back into the second storage array so that the next network layer can be calculated.
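Putting the pieces together, the per-layer dataflow described for Figure 10 (convolution results, summation, buffer, pooling, activation, write-back) can be summarised in one function. This is only a behavioural outline built from the sketches above, and all helper names are assumptions.

```python
def compute_layer(partial_conv_results, window_size, use_pooling=True):
    """Behavioural outline of one network layer inside the memory: sum the partial
    results, optionally pool them, apply ReLU, and return the values that the
    array write-back circuit would store into the second storage array."""
    # summation circuit: accumulate each group of partial results into a final convolution result
    buffered = [sum(group) for group in partial_conv_results]
    # pooling circuit: keep the maximum of each pooling window in the buffer
    if use_pooling:
        buffered = [max(buffered[i:i + window_size])
                    for i in range(0, len(buffered), window_size)]
    # activation circuit (ReLU); the result is written back for the next network layer
    return [v if v > 0 else 0 for v in buffered]

layer_output = compute_layer([[3, -1, 2], [4, 4, -6], [-2, -2, -1], [1, 0, 0]], window_size=2)
print(layer_output)   # [4, 1]
```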
  • the neural network computing device can include multiple banks, each bank can include multiple blocks, and each block includes 16 first storage arrays. In fact, after layout and routing it is found that if all of them are configured as computing circuits, the area overhead of the neural network computing device is too large. Therefore, eight of the first storage arrays are configured as computing arrays (8 multiply-accumulate (MAC) rows) and the remaining eight first storage arrays remain unchanged, which achieves a good balance between area and performance. In this case, each bank has 64 SSAs, and all the first storage arrays can compute independently in parallel.
  • multiple banks have independent control and read-write buses and can also run in parallel. This architecture fully exploits the repetitiveness and parallelism of the internal resources of the computing device and can maximize the parallelism of the distributed computing resources. Because the design of the first storage array is highly repetitive, for modules that may fail during manufacturing, the DRAM cells can be checked and marked before data is stored, which ensures the yield of chip production.
  • the neural network computing device may also include a data input/output interface, where the data input interface is used to receive the convolution kernels, the input data on which convolution calculation is to be performed, and the like sent by other devices, and the data output interface is used to output the data produced by the last network layer of the neural network to other devices.
  • by storing the convolution kernel and the input data on which convolution calculation is to be performed in a first storage array that includes multiple rows and multiple columns of DRAM cells, the first calculation circuit can read the convolution kernel and the input data of the convolution calculation to be performed from the first storage array and complete the convolution calculation, thereby realizing the neural network calculation in the memory, reducing the transmission of neural network data between the memory and the neural network computing node, and solving the memory wall problem.
  • the neural network computing device splits the multiplication operation in the convolution calculation into an exclusive-OR operation, thereby avoiding the process of splitting the convolution calculation into many basic calculation logic operations and further reducing the power consumption of implementing the convolution calculation.
  • the neural network computing device may include multiple first storage arrays, so that parallel operation can be realized, and the efficiency of completing the convolutional neural network computing process in the memory is improved.
  • the calculation processes of convolution, pooling, and activation in the convolutional neural network can all be completed in the neural network computing device, so that all the calculation processes of the neural network can be realized in the memory, which further reduces the transmission of neural network data between the memory and the neural network computing node.
  • the neural network computing device can be installed on a computer device, and the computer device controls the neural network computing device, so that the computer device can complete the entire calculation process of the neural network on the neural network computing device.
  • FIG. 11 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.
  • the computing device 1100 includes:
  • the controller 1101 is configured to receive the convolution kernel and the input data of the convolution calculation to be performed, and send the convolution kernel and the input data to the memory 1102 connected to the controller;
  • the memory 1102 includes:
  • the first storage array includes multiple rows and multiple columns of dynamic random access memory (DRAM) cells for storing the convolution kernel and input data sent by the controller, where the first row of DRAM cells in the storage array is used for storing the convolution kernel, and at least two rows of DRAM cells in the storage array are used to store input data on which convolution calculation is to be performed; and
  • the first calculation circuit, connected to the storage array, is used to perform multiple convolution calculations on the input data and the convolution kernel, where each convolution calculation multiplies and adds the convolution kernel with the input data stored in one row of the at least two rows of DRAM cells.
  • the first calculation circuit includes:
  • a plurality of latches, where the input of each latch is connected to a column of DRAM cells, the output of each latch is connected to one of a plurality of multipliers, and each latch is used to cache the one-bit value of an element of the convolution kernel stored in the DRAM cell of the corresponding column;
  • a plurality of multipliers, where the first input of each multiplier is connected to a column of DRAM cells, and the second input of each multiplier is connected to the output of a latch;
  • a plurality of adders, where the input end of each adder is connected to at least two multipliers;
  • wherein, in one convolution calculation, the multiple multipliers are used to multiply the convolution kernel cached by the multiple latches with the input data stored in one row of the multiple rows of DRAM cells, the multiple adders are used to add the output results of the multiple multipliers, and the output of the multiple adders is the result of one convolution calculation.
  • the memory further includes:
  • a second calculation circuit, where the second calculation circuit is connected to the first calculation circuit and includes a summation circuit, and the summation circuit is used to sum the results of one convolution calculation output by the plurality of adders.
  • the second calculation circuit is configured to perform multiple summation calculations, and the second calculation circuit further includes:
  • a buffer connected to the summation circuit, for buffering the calculation result of the summation circuit during multiple summation calculations
  • a pooling circuit, where the pooling circuit is connected to the buffer and configured to perform a pooling operation on the calculation results of the multiple summation calculation process buffered in the buffer.
  • the second calculation circuit further includes:
  • a selection circuit, where the first input terminal of the selection circuit is connected to the output terminal of the pooling circuit, the second input terminal of the selection circuit is connected to the output terminal of the summation circuit, and the selection circuit is used to select and output either the output result of the pooling circuit or the output result of the summation circuit; and
  • an activation circuit, connected to the output terminal of the selection circuit, used to perform an activation operation on the output result of the selection circuit and input the output result to a second storage array, the second storage array including multiple rows and multiple columns of DRAM cells.
  • each of the multipliers includes:
  • at least one XOR unit, where the first input of each XOR unit is connected to a latch, the second input of each XOR unit is connected to a row of DRAM cells, and each XOR unit is used to perform an exclusive OR operation on the one-bit value of an element of the convolution kernel cached by a latch and the one-bit value of an element of the input data stored in one row of the at least two rows of DRAM cells, wherein each DRAM cell in the at least two rows of DRAM cells is used to store a one-bit value of an element of the input data on which the convolution calculation is to be performed; and
  • a combination unit, where the first input terminal of the combination unit is connected to the output terminal of the at least one XOR unit, the second input terminal of the combination unit is connected to at least one latch, and the combination unit is used to combine the data output by the at least one XOR unit with the data output by the at least one latch.
  • the first calculation circuit can read the convolution kernel and the input data of the convolution calculation to be performed from the first storage array and complete the convolution calculation, thereby realizing the neural network calculation in the memory, reducing the transmission of neural network data between the memory and the neural network computing node, and solving the memory wall problem.
  • the multiplication operation in the convolution calculation is split into an exclusive OR operation, thereby avoiding splitting the convolution calculation into many basic logic operations and further reducing the power consumption when implementing the convolution calculation.
  • the memory may include multiple first storage arrays, so that parallel operation can be realized, and the efficiency of completing the convolutional neural network calculation process in the memory is improved.
  • the calculation processes of convolution, pooling and activation in the convolutional neural network can be completed in the memory, so that all the calculation processes in the neural network can be realized in the memory, which can further reduce the transmission of neural network data between the memory and the neural network computing nodes.
  • FIG. 12 is a schematic diagram of an implementation environment provided by an embodiment of the present invention.
  • the computer device in FIG. 12 includes a bus interface (peripheral component interconnect express, PCI-e), a controller, and dual in-line memory modules (DIMM); a user can install the neural network computing device in a DIMM. The user can train a quantized neural network according to the needs of the application scenario to obtain a network structure and weight values that meet the accuracy requirements. Then, the computer device uses a compiler to process the original data of the application scenario (pictures, audio, etc.), the network structure and the weights, allocates them to the memory address space, and generates an instruction set.
  • the computer device writes the original data, weights and instruction set into the neural network computing device. The neural network computing device then performs the calculation automatically. After the calculation is completed, a message is sent to the central processing unit (CPU) of the computer device through an interrupt, and the CPU reads back the result data to complete the entire neural network calculation operation.
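  • The host-side control flow described above can be summarised by the following sketch; the class and method names (write, wait_for_interrupt, read_results, compile) are hypothetical placeholders, since no software interface is defined here:

```python
# Hypothetical host-side control flow (names are placeholders, not a real API).

class NeuralNetworkComputingDevice:
    """Stand-in for the DIMM-installed neural network computing device."""
    def write(self, raw_data, weights, instructions): ...
    def wait_for_interrupt(self): ...          # device raises an interrupt when done
    def read_results(self): ...

def run_network(compiler, device, raw_data, network_structure, weights):
    # The compiler maps raw data, structure and weights into the memory address space
    # and generates an instruction set (as described for FIG. 12).
    data, w, instructions = compiler.compile(raw_data, network_structure, weights)
    device.write(data, w, instructions)        # CPU writes everything into the device
    device.wait_for_interrupt()                # device computes autonomously
    return device.read_results()               # CPU reads the result back
```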
  • when the compiler compiles the original data, it needs to call the functions in a function library to compile.
  • the compiler can compile the original data based on a programming language and send the compiled result to the driver of the computing device, and the compiled result is then run in the computer device at runtime under the control of the driver.
  • the application programming interface (API) or user interface can also call functions in the function library to implement the functions of the API.
  • FIG. 13 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • the computer device 1300 may vary greatly due to different configurations or performance, and may include one or more processors (CPU) 1301 and one or more memories 1302, where the memory 1302 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1301 to implement the methods provided in the following method embodiments.
  • the computer device 1300 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output.
  • the computer device 1300 may also include other components for implementing device functions, which are not described here.
  • a computer-readable storage medium such as a memory including instructions, which may be executed by a processor in a terminal to complete the neural network calculation method in the following embodiments.
  • the computer-readable storage medium may be a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a compact disc read-only memory (CD-ROM), Tapes, floppy disks and optical data storage devices, etc.
  • In order to further illustrate the process by which the neural network computing device implements the calculation of the convolutional neural network, refer to the flowchart of a neural network computing method provided by an embodiment of the present invention shown in FIG. 14, which specifically includes:
  • Step 1401. The computer device sends the convolution kernel and the input data of the convolution calculation to be performed on the first network layer to the neural network computing device.
  • the neural network computing device may be the memory in the computer device.
  • the memory may include multiple rows and multiple columns of DRAM cells.
  • the first network layer can be any layer in the neural network.
  • the input data on which the convolution calculation is to be performed can be divided into multiple groups of input data, each group including multiple input data. Each group of input data corresponds to one position of the scan window of a convolution kernel: after the scan window slides once over the feature map input to any network layer, the data in the area covered by the scan window forms one group. Because the area of the input feature map is larger than the area of the convolution kernel, the scan window of a convolution kernel needs to slide multiple times over the input feature map, so one convolution kernel can correspond to multiple groups of input data. Since one convolution kernel can include multiple weights, one weight corresponds to one input data in a group of input data. Each group of input data can be regarded as a sub-feature map.
  • the input feature map in FIG. 15 consists of 2 layers and there are 2 convolution kernels. Convolution kernel 1 slides 4 times over the first layer of the input feature map, yielding 4 sub-feature maps corresponding to convolution kernel 1, namely sub-feature maps 1-4; convolution kernel 2 slides 4 times over the second layer of the feature map, yielding 4 sub-feature maps corresponding to convolution kernel 2, namely sub-feature maps A-D.
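  • The relationship between the scan window and the sub-feature maps can be illustrated with the following sketch, which assumes a stride of 1 and a square kernel purely for illustration; the function name extract_sub_feature_maps is hypothetical:

```python
# Sketch: sliding a convolution kernel's scan window over one layer of the input
# feature map to obtain the sub-feature maps (groups of input data).
# Stride 1 and a square kernel are assumed here purely for illustration.

def extract_sub_feature_maps(feature_map_layer, kernel_size):
    h, w = len(feature_map_layer), len(feature_map_layer[0])
    sub_maps = []
    for i in range(h - kernel_size + 1):
        for j in range(w - kernel_size + 1):
            window = [feature_map_layer[i + di][j + dj]
                      for di in range(kernel_size)
                      for dj in range(kernel_size)]
            sub_maps.append(window)            # one slide -> one sub-feature map
    return sub_maps

# A 3x3 layer with a 2x2 kernel slides 4 times, matching the 4 sub-feature maps of FIG. 15.
layer = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(len(extract_sub_feature_maps(layer, 2)))  # -> 4
```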
  • Step 1402. The neural network computing device receives the convolution kernel of the first network layer of the convolutional neural network and the input data on which the convolution calculation is to be performed.
  • the input data of the first network layer may be the data to be processed that is input to the neural network computing device for the first time, or the output data of other network layers.
  • the input data of other network layers may be the output data of the upper layer of the other network layer, or the data to be processed that is input to the neural network computing device for the first time.
  • the neural network computing device can directly obtain multiple groups of input data on which the convolution calculation is to be performed, based on the positional relationship between the convolution kernel of each layer and the output data of the previous layer, where one convolution kernel can correspond to at least one group of input data on which the convolution calculation is to be performed.
  • Step 1403. The neural network computing device stores the convolution kernel in a first row of dynamic random access memory (DRAM) cells in a first storage array, where the first storage array includes multiple rows and multiple columns of DRAM cells.
  • Step 1404. The neural network computing device stores the input data in at least two rows of DRAM cells other than the first row of DRAM cells in the first storage array.
  • since each convolution kernel can correspond to at least one sub-feature map (that is, one group of input data), there is a correspondence between the multiple sub-feature maps and the at least one convolution kernel. Therefore, the neural network computing device can directly store the multiple sub-feature maps and the at least one convolution kernel in association based on this correspondence.
  • the DRAM cells storing any convolution kernel and the sub-feature maps corresponding to that convolution kernel are located in the same target column of the first storage array, so that the input data in a sub-feature map can correspond to the weights in the convolution kernel.
  • FIG. 16 is a schematic diagram of deployed calculation data provided by an embodiment of the present invention. It can be seen from the first storage array in FIG. 16 that:
  • convolution kernel 1 is stored in the 9th row of the Ath target column of the first storage array, so the coordinates of convolution kernel 1 in the first storage array can be expressed as (9, A);
  • the weight W0 of convolution kernel 1 is stored in the A1 sub-column of the 9th row, that is, the coordinates of the weight W0 in the first storage array can be expressed as (9, A1);
  • convolution kernel 2 is stored in the 10th row of the first storage array, and convolution kernels 1-2, sub-feature maps 1-4 and sub-feature maps A-D are all located in the A column of the first storage array, so the correspondence between a kernel and a feature map is realized by activating the data of two rows in the same target column.
  • convolution kernel 1 thus corresponds to sub-feature map 1, which can also be regarded as the scan window of convolution kernel 1 covering sub-feature map 1 after sliding once. The neural network computing device can then perform a convolution calculation based on the data in the 9th row and the 1st row of the Ath target column.
  • the neural network computing device can store data at any memory address by performing a write operation on that memory address in the first storage array. For example, to store the weight W in the cell corresponding to the memory address xxxx, the neural network computing device can charge the capacitor in the DRAM cell corresponding to the memory address xxxx through a write operation, and the amount of charge in the capacitor represents the weight W, so that the weight W is stored in the cell corresponding to the memory address xxxx. Since the weights in the convolution kernel and the input data on which the convolution calculation is to be performed are all represented by quantized data, the quantized data may be 1 bit or multiple bits, and the DRAM cell corresponding to each memory address in the first storage array can only store 1 bit of data.
  • multiple adjacent DRAM cells in the first storage array can therefore be used to store one weight or one input data. For example, W0, W1, D0_0, D0_1, D1_0 and D1_1 in FIG. 10 are all 1-bit data: W0 and W1 are two weights in the convolution kernel, D0_0 and D0_1 together represent D0, and D1_0 and D1_1 together represent D1, where D0 is the feature element corresponding to W0 in the sub-feature map and D1 is the feature element corresponding to W1 in the sub-feature map. In this way, multiple DRAM cells can be used to store one weight or one input data.
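  • The deployment described above can be illustrated with the following simplified sketch, which assumes 2-bit quantized values and hypothetical helper names (to_bits, deploy); the actual number of bits and addressing scheme are determined by the quantization and the array layout:

```python
# Simplified sketch of the FIG. 16 deployment (hypothetical names): each element is
# quantized to a fixed number of bits, every bit occupies one DRAM cell, and the kernel
# row and its corresponding sub-feature-map rows share the same target column.

BITS = 2  # e.g. D0_0/D0_1 style 2-bit values, as in the example above

def to_bits(value, bits=BITS):
    return [(value >> b) & 1 for b in reversed(range(bits))]

def deploy(kernel, sub_feature_maps, kernel_row_index):
    """Returns {row_index: list of 1-bit cell values} for one target column."""
    array = {kernel_row_index: [bit for w in kernel for bit in to_bits(w)]}
    for row, sub_map in enumerate(sub_feature_maps):           # e.g. rows 1..4
        array[row + 1] = [bit for d in sub_map for bit in to_bits(d)]
    return array

layout = deploy(kernel=[1, 2], sub_feature_maps=[[3, 0], [1, 2]], kernel_row_index=9)
print(layout[9])   # kernel row: bits of W0 then W1 -> [0, 1, 1, 0]
print(layout[1])   # sub-feature map 1: bits of D0 then D1 -> [1, 1, 0, 0]
```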
  • a target list is used to record the fault status of each DRAM cell in the first storage array.
  • if any DRAM cell is marked as faulty, that DRAM cell cannot be used to store data.
  • if any DRAM cell is marked as fault-free, that DRAM cell can be used to store data.
  • in this way, the DRAM cells in the first storage array can be verified and marked, the weights or input data can be prevented from being stored in a faulty DRAM cell, and the weights or input data will not be read from a faulty cell of the first storage array, so that errors can be avoided in the subsequent calculation process and the output result of the neural network computing device is more accurate.
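  • A minimal sketch of consulting the target list before a write operation is given below; the addresses, the fault_list structure and the store_bit helper are illustrative assumptions:

```python
# Sketch: the target list records the fault status of each DRAM cell; a bit is only
# written to a cell that is marked fault-free (addresses and names are illustrative).

fault_list = {("row9", "colA1"): False, ("row9", "colA2"): True}   # True = faulty

def store_bit(cell_address, bit, memory):
    if fault_list.get(cell_address, False):
        raise ValueError(f"cell {cell_address} is marked faulty and cannot store data")
    memory[cell_address] = bit

memory = {}
store_bit(("row9", "colA1"), 1, memory)   # fault-free cell: write succeeds
```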
  • Step 1405. The neural network computing device performs multiple convolution calculations on the input data and the convolution kernel based on the first calculation circuit connected to the first storage array, where each convolution calculation multiplies and adds the convolution kernel with the input data stored in one row of the at least two rows of DRAM cells.
  • the first calculation circuit may also implement a convolution calculation through a latch, a multiplier, and an adder.
  • this step 1405 can be implemented through the process shown in the following steps 51-53.
  • Step 51. The neural network computing device stores the convolution kernel stored in the storage array into a plurality of latches in the first calculation circuit, where each latch is used to cache the convolution kernel element stored in the DRAM cell of the corresponding column, and each DRAM cell in the first row of DRAM cells is used to store one element of the convolution kernel.
  • the row of DRAM cells storing the convolution kernel in the first storage array can be opened, and since each latch is connected to the DRAM cells of one column, by turning on the latch switch, the 1-bit value of each element of the convolution kernel can be stored in the latch on each column.
  • Step 52. Based on the plurality of multipliers in the first calculation circuit, the neural network computing device performs a multiplication operation on the convolution kernel cached by the plurality of latches and the input data stored in one row of DRAM cells among the multiple rows of DRAM cells.
  • an exclusive OR operation can be performed on the elements of the convolution kernel and a group of input data, and the product of the convolution kernel and the input data stored in one row of the multiple rows of DRAM cells is obtained based on the result of the exclusive OR operation and the weight value.
  • this step 52 can be implemented through the following steps 1-2.
  • Step 1. Based on at least one XOR unit in the multiplier, the neural network computing device performs an exclusive OR operation on the one-bit value of an element of the convolution kernel cached by a latch and the one-bit value of an element of the input data stored in one row of DRAM cells among the at least two rows of DRAM cells, wherein each DRAM cell in the at least two rows of DRAM cells is used to store a one-bit value of an element of the input data on which the convolution calculation is to be performed.
  • the 1-bit value of the target element stored in the latch on any column is input to the XOR unit of the multiplier, and the 1-bit value of the input data corresponding to the target element stored in that column is also input to the XOR unit of the multiplier.
  • the XOR unit can XOR the 1-bit value of the target element with the 1-bit value of the input data and output the second highest bit of the first data. Since the convolution kernel may include multiple weights, the above XOR process needs to be performed for each weight; therefore, the neural network computing device performs the above operation on each weight of the convolution kernel stored in the first storage array, so that multiple second highest bits of the first data can be obtained.
  • Step 2. Based on the combination unit in the multiplier, the neural network computing device combines the data output by the at least one XOR unit with the data output by the at least one latch.
  • the data output by the XOR unit is the second highest bit of the first data, and the data output by the combination unit is one piece of first data.
  • the data output by the at least one XOR unit is input to the combination unit, each 1-bit value of the element stored in the latch is also input to the combination unit, and the combination unit combines these input data into one piece of first data.
  • since each convolution kernel corresponds to multiple input data, multiple pieces of first data will be obtained.
  • based on the data in FIG. 16, the multiple pieces of first data corresponding to convolution kernel 1 include D0*W0, D1*W1, D3*W2 and D4*W3.
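  • The following heavily hedged sketch shows one plausible reading of steps 1-2, assuming a sign-magnitude quantization in which the XOR of the two sign bits yields the sign bit of the first data and the combination unit appends the magnitude bits cached in the latch; the exact bit layout is not fixed by the description above:

```python
# Hedged sketch of the XOR-based multiplication (steps 1-2 above), assuming sign-magnitude
# quantization: the XOR of the two 1-bit sign values gives the sign of the product
# ("first data"), and the combination unit appends the magnitude bits from the latch.
# This is one plausible reading, not the definitive bit layout.

def xor_unit(weight_sign_bit, input_sign_bit):
    return weight_sign_bit ^ input_sign_bit          # exclusive OR of the two 1-bit values

def combination_unit(xor_output_bit, latched_magnitude_bits):
    return [xor_output_bit] + list(latched_magnitude_bits)   # first data: [sign, magnitude...]

# Example: weight -3 (sign 1, magnitude bits [1, 1]) multiplied by a +1 input (sign 0)
first_data = combination_unit(xor_unit(1, 0), [1, 1])
print(first_data)   # -> [1, 1, 1], i.e. sign bit 1 with magnitude 3 (product -3)
```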
  • Step 53 The neural network computing device performs an addition operation on the output results of the multiple multipliers based on the multiple adders in the first calculation circuit, and the output of the multiple adders is the result of a convolution calculation.
  • Multiple adders can sum multiple first data, which can be regarded as a multi-part summation.
  • P0 is also the result of multi-part summation.
  • step 1405 is performed for each group of data corresponding to the convolution kernel, so that multiple multi-part summation results can be obtained.
  • the multiple multi-part summation results can form a convolution feature map.
  • when there are multiple convolution kernels, the neural network computing device performs step 1405 on these convolution kernels to obtain multiple convolution feature maps. For example, based on the multi-part summations in FIG. 15, a convolution feature map 1 is obtained from convolution kernel 1 and a convolution feature map 2 is obtained from convolution kernel 2.
  • step 1405 is performed for each group of input data corresponding to a convolution kernel, and then multiple feature points can be obtained, and these multiple feature points are also the result of multiple convolutions of one convolution kernel.
  • the process of step 1405 can be performed for each convolution kernel, so that the results of multiple convolution calculations for each convolution kernel can be obtained.
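  • A sketch of the addition in step 53 is given below; a simple pairwise adder tree is assumed purely for illustration, and the function name adder_tree is hypothetical:

```python
# Sketch of step 53: the plurality of adders reduces the multiple pieces of first data
# (D0*W0, D1*W1, D3*W2, D4*W3, ...) to one multi-part summation result such as P0.
# A simple pairwise adder tree is assumed for illustration.

def adder_tree(first_data):
    values = list(first_data)
    while len(values) > 1:                       # each level adds pairs of inputs
        paired = values[0:len(values) - len(values) % 2]
        values = [paired[i] + paired[i + 1] for i in range(0, len(paired), 2)] \
                 + values[len(paired):]          # carry an odd leftover to the next level
    return values[0]

# P0 = D0*W0 + D1*W1 + D3*W2 + D4*W3
print(adder_tree([2, -3, 5, -7]))   # -> -3
```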
  • Step 1406. The neural network computing device sums the results of one convolution calculation output by the multiple adders, based on the summation circuit in the second calculation circuit of the neural network computing device.
  • the summation circuit can realize multiple summation, and can sum up the results of the current convolution calculation and the results of multiple previous convolution calculations.
  • the neural network computing device inputs the result of each convolution calculation to the summation circuit, and the summation circuit accumulates it with the summation result of the previous layer held in the summation circuit, so that the results of multiple convolution calculations can be summed; multiple summations are thus achieved to obtain the final convolution result.
  • the neural network computing device sums the convolution feature map 1 and the convolution feature map 2 to obtain the matrix of the first feature map [R0, R1; R2, R3].
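  • The accumulation performed by the summation circuit can be sketched as follows, with plain Python lists standing in for the circuit; the function name summation_circuit and the example values are illustrative:

```python
# Sketch of step 1406: the summation circuit accumulates the current convolution results
# with the results already held from previous convolution calculations, e.g. adding
# convolution feature map 1 and convolution feature map 2 element-wise to obtain the
# first feature map [R0, R1; R2, R3].

def summation_circuit(accumulator, current_results):
    if accumulator is None:                       # first convolution calculation
        return list(current_results)
    return [a + c for a, c in zip(accumulator, current_results)]

feature_map_1 = [1, 2, 3, 4]      # multi-part sums from convolution kernel 1
feature_map_2 = [10, 20, 30, 40]  # multi-part sums from convolution kernel 2

acc = summation_circuit(None, feature_map_1)
acc = summation_circuit(acc, feature_map_2)
print(acc)   # -> [11, 22, 33, 44], i.e. [R0, R1, R2, R3]
```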
  • Step 1407. The neural network computing device caches the calculation results of the summation circuit during the multiple summation calculation process, based on the buffer in the neural network computing device.
  • the calculation result in the multiple summation calculation process is the final convolution result.
  • Step 1408. The neural network computing device performs a pooling operation on the calculation results of the multiple summation calculation process buffered in the buffer, based on the pooling circuit in the neural network computing device.
  • the neural network computing device inputs the final convolution result into the multiple comparators of the pooling circuit, and the multiple comparators perform cyclic comparison so as to output multiple representative feature points; these representative feature points are the result of the pooling operation.
  • the neural network computing device inputs the first feature map [R0, R1; R2, R3] to the pooling circuit, and the pooling circuit performs the pooling operation on the first feature map [R0, R1; R2, R3] and outputs the pooling result D0.
  • the specific comparison process of the comparator has been described in the foregoing, so I will not repeat it here.
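  • A sketch of the cyclic comparison is given below; max pooling is assumed purely for illustration, since the comparison rule is not fixed above, and the function name pooling_circuit is hypothetical:

```python
# Sketch of step 1408: the comparators of the pooling circuit cyclically compare the
# buffered convolution results and keep one representative feature point per window.
# Max pooling is assumed here for illustration.

def pooling_circuit(feature_map, window):
    """feature_map: flat list of convolution results; window: number of values compared."""
    results = []
    for start in range(0, len(feature_map), window):
        best = feature_map[start]
        for value in feature_map[start + 1:start + window]:   # cyclic comparison
            if value > best:                                   # comparator keeps the larger value
                best = value
        results.append(best)
    return results

# The first feature map [R0, R1, R2, R3] pooled over one 2x2 window gives D0.
print(pooling_circuit([11, 22, 33, 44], window=4))   # -> [44]
```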
  • Step 1409. The neural network computing device selects and outputs either the output result of the pooling circuit or the output result of the summation circuit, based on the selection circuit in the neural network computing device.
  • in some embodiments, the neural network computing device may not perform the process shown in steps 1407-1408 but may directly activate the final convolution result (step 1410), while other embodiments need to perform the process shown in steps 1407-1408.
  • the output result of the summing circuit and the output result of the pooling circuit can be input to the selection circuit, so that the selection circuit can select the input of the activation operation according to the preset conditions .
  • the selection process of the selection circuit has been described in the foregoing, and will not be repeated here.
  • Step 1410. Based on the activation circuit in the neural network computing device, the neural network computing device performs an activation operation on the output result of the selection circuit, and uses the output result of the activation circuit as the input data of the convolution calculation to be performed in the next network layer after the first network layer. When the first network layer is the last network layer of the convolutional neural network, the neural network computing device outputs the output result of the activation circuit.
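  • Steps 1409-1410 can be sketched together as follows; a ReLU-style activation is assumed purely as an example, since the activation function is not specified above, and the function names are hypothetical:

```python
# Sketch of steps 1409-1410: the selection circuit forwards either the pooling result or
# the summation result, and the activation circuit applies an activation to it before the
# value is written to the second storage array. ReLU is assumed only as an example.

def selection_circuit(pooling_output, summation_output, use_pooling):
    return pooling_output if use_pooling else summation_output

def activation_circuit(values):
    return [max(0, v) for v in values]            # hypothetical ReLU-style activation

selected = selection_circuit([44], [11, 22, 33, 44], use_pooling=True)
print(activation_circuit(selected))               # -> [44], stored for the next network layer
```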
  • since the convolutional neural network has been mapped to the neural network computing device at this point, when an input feature map needs to be calculated based on a target convolutional neural network, and the target convolutional neural network is not the same neural network as the aforementioned convolutional neural network, it is necessary to remap the target convolutional neural network to the neural network computing device so that the neural network computing device can realize the calculation process of the target convolutional neural network.
  • when the neural network computing device implements the calculation process of the convolutional neural network based on other input feature maps, it receives, for the first network layer of the convolutional neural network, the input data of the first network layer on which the convolution calculation is to be performed, and performs the above neural network calculation steps; when the neural network computing device implements the calculation process of the target convolutional neural network based on other input feature maps, it receives, for the first network layer of the target convolutional neural network, the convolution kernel of each network layer of the target convolutional neural network and the input data of the first network layer of the target convolutional neural network on which the convolution calculation is to be performed, and performs the above neural network calculation steps.
  • since the convolution kernel and the input data of the convolution calculation to be performed are stored in the first storage array, the first calculation circuit can read the convolution kernel and the input data from the first storage array and complete the convolution calculation, thereby realizing the neural network calculation in the memory, reducing the transmission of neural network data between the memory and the neural network computing node, and solving the memory wall problem.
  • by splitting the multiplication operation in the convolution calculation into an exclusive OR operation, the process of splitting the convolution calculation into many basic logic operations is avoided, and the power consumption when implementing the convolution calculation is further reduced.
  • the calculation processes of convolution, pooling and activation in the convolutional neural network can be completed in the neural network computing device, so that all the calculation processes in the neural network can be realized in the memory, which can further reduce the transmission of neural network data between the memory and the neural network computing nodes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention, which belongs to the technical field of neural networks, relates to a neural network computing device and method, and a computing device. The neural network computing device stores a convolution kernel and input data on which a convolution calculation is to be performed in a first storage area of a memory, so that a first calculation circuit can read the convolution kernel and the input data on which the convolution calculation is to be performed from the first storage area, and a convolution calculation is completed. The present invention implements neural network calculation in the memory, reduces the transmission of neural network data between the memory and a neural network computing node, and solves the memory wall problem.
PCT/CN2020/092053 2019-05-24 2020-05-25 Dispositif et procédé de calcul de réseau neuronal, et dispositif de calcul WO2020238843A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910441581.5 2019-05-24
CN201910441581.5A CN111985602A (zh) 2019-05-24 2019-05-24 神经网络计算设备、方法以及计算设备

Publications (1)

Publication Number Publication Date
WO2020238843A1 true WO2020238843A1 (fr) 2020-12-03

Family

ID=73437022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092053 WO2020238843A1 (fr) 2019-05-24 2020-05-25 Dispositif et procédé de calcul de réseau neuronal, et dispositif de calcul

Country Status (2)

Country Link
CN (1) CN111985602A (fr)
WO (1) WO2020238843A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222129A (zh) * 2021-04-02 2021-08-06 西安电子科技大学 一种基于多级缓存循环利用的卷积运算处理单元及系统
CN113672855A (zh) * 2021-08-25 2021-11-19 恒烁半导体(合肥)股份有限公司 一种存内运算方法、装置及其应用
CN113792868A (zh) * 2021-09-14 2021-12-14 绍兴埃瓦科技有限公司 神经网络计算模块、方法和通信设备
CN114119344A (zh) * 2021-11-30 2022-03-01 中国科学院半导体研究所 数据处理方法及数据处理装置
CN116167425A (zh) * 2023-04-26 2023-05-26 浪潮电子信息产业股份有限公司 一种神经网络加速方法、装置、设备及介质

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022124993A1 (fr) * 2020-12-11 2022-06-16 National University Of Singapore Matrice à décalage planaire pour accélérateurs de dcnn
CN113138957A (zh) * 2021-03-29 2021-07-20 北京智芯微电子科技有限公司 用于神经网络推理的芯片及加速神经网络推理的方法
CN113487020B (zh) * 2021-07-08 2023-10-17 中国科学院半导体研究所 用于神经网络计算的参差存储结构及神经网络计算方法
CN115049885B (zh) * 2022-08-16 2022-12-27 之江实验室 一种存算一体卷积神经网络图像分类装置及方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169563A (zh) * 2017-05-08 2017-09-15 中国科学院计算技术研究所 应用于二值权重卷积网络的处理系统及方法
CN108573305A (zh) * 2017-03-15 2018-09-25 杭州海康威视数字技术股份有限公司 一种数据处理方法、设备及装置
US20190005376A1 (en) * 2017-06-30 2019-01-03 Intel Corporation In-memory spiking neural networks for memory array architectures
CN109784483A (zh) * 2019-01-24 2019-05-21 电子科技大学 基于fd-soi工艺的二值化卷积神经网络内存内计算加速器

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153873B (zh) * 2017-05-08 2018-06-01 中国科学院计算技术研究所 一种二值卷积神经网络处理器及其使用方法
CN107657581B (zh) * 2017-09-28 2020-12-22 中国人民解放军国防科技大学 一种卷积神经网络cnn硬件加速器及加速方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573305A (zh) * 2017-03-15 2018-09-25 杭州海康威视数字技术股份有限公司 一种数据处理方法、设备及装置
CN107169563A (zh) * 2017-05-08 2017-09-15 中国科学院计算技术研究所 应用于二值权重卷积网络的处理系统及方法
US20190005376A1 (en) * 2017-06-30 2019-01-03 Intel Corporation In-memory spiking neural networks for memory array architectures
CN109784483A (zh) * 2019-01-24 2019-05-21 电子科技大学 基于fd-soi工艺的二值化卷积神经网络内存内计算加速器

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222129A (zh) * 2021-04-02 2021-08-06 西安电子科技大学 一种基于多级缓存循环利用的卷积运算处理单元及系统
CN113222129B (zh) * 2021-04-02 2024-02-13 西安电子科技大学 一种基于多级缓存循环利用的卷积运算处理单元及系统
CN113672855A (zh) * 2021-08-25 2021-11-19 恒烁半导体(合肥)股份有限公司 一种存内运算方法、装置及其应用
CN113672855B (zh) * 2021-08-25 2024-05-28 恒烁半导体(合肥)股份有限公司 一种存内运算方法、装置及其应用
CN113792868A (zh) * 2021-09-14 2021-12-14 绍兴埃瓦科技有限公司 神经网络计算模块、方法和通信设备
CN113792868B (zh) * 2021-09-14 2024-03-29 绍兴埃瓦科技有限公司 神经网络计算模块、方法和通信设备
CN114119344A (zh) * 2021-11-30 2022-03-01 中国科学院半导体研究所 数据处理方法及数据处理装置
CN116167425A (zh) * 2023-04-26 2023-05-26 浪潮电子信息产业股份有限公司 一种神经网络加速方法、装置、设备及介质
CN116167425B (zh) * 2023-04-26 2023-08-04 浪潮电子信息产业股份有限公司 一种神经网络加速方法、装置、设备及介质

Also Published As

Publication number Publication date
CN111985602A (zh) 2020-11-24

Similar Documents

Publication Publication Date Title
WO2020238843A1 (fr) Dispositif et procédé de calcul de réseau neuronal, et dispositif de calcul
US20210117810A1 (en) On-chip code breakpoint debugging method, on-chip processor, and chip breakpoint debugging system
US10242728B2 (en) DPU architecture
WO2022037257A1 (fr) Moteur de calcul de convolution, puce d'intelligence artificielle et procédé de traitement de données
US20160093343A1 (en) Low power computation architecture
US10180808B2 (en) Software stack and programming for DPU operations
WO2019205617A1 (fr) Procédé et appareil de calcul pour la multiplication de matrices
CN110222818B (zh) 一种用于卷积神经网络数据存储的多bank行列交织读写方法
WO2023116314A1 (fr) Appareil et procédé d'accélération de réseau neuronal, dispositif, et support de stockage informatique
US9922696B1 (en) Circuits and micro-architecture for a DRAM-based processing unit
US20190244084A1 (en) Processing element and operating method thereof in neural network
US20180181861A1 (en) Neuromorphic circuits for storing and generating connectivity information
US11487342B2 (en) Reducing power consumption in a neural network environment using data management
CN113762493A (zh) 神经网络模型的压缩方法、装置、加速单元和计算系统
CN116167424B (zh) 基于cim的神经网络加速器、方法、存算处理系统与设备
CN114003198A (zh) 内积处理部件、任意精度计算设备、方法及可读存储介质
WO2022179075A1 (fr) Procédé et appareil de traitement de données, dispositif informatique et support de stockage
US11500629B2 (en) Processing-in-memory (PIM) system including multiplying-and-accumulating (MAC) circuit
US11847451B2 (en) Processing-in-memory (PIM) device for implementing a quantization scheme
WO2022047802A1 (fr) Dispositif de traitement en mémoire et procédé de traitement de données associé
CN115221102A (zh) 用于优化片上系统的卷积运算操作的方法和相关产品
US20220108203A1 (en) Machine learning hardware accelerator
US11663446B2 (en) Data reuse and efficient processing scheme in executing convolutional neural network
CN114330687A (zh) 数据处理方法及装置、神经网络处理装置
CN111382835A (zh) 一种神经网络压缩方法、电子设备及计算机可读介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20813775

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20813775

Country of ref document: EP

Kind code of ref document: A1