CN108304922B - Computing device and computing method for neural network computing


Info

Publication number
CN108304922B
CN108304922B
Authority
CN
China
Prior art keywords
matrix
neural network
computing
line buffer
data element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710025196.3A
Other languages
Chinese (zh)
Other versions
CN108304922A (en)
Inventor
刘武龙
姚骏
汪玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huawei Technologies Co Ltd
Original Assignee
Tsinghua University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Huawei Technologies Co Ltd
Priority to CN201710025196.3A
Priority to PCT/CN2017/115038
Priority to EP17891227.5A
Publication of CN108304922A
Priority to US16/511,560
Application granted
Publication of CN108304922B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/0265 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, the criterion being a learning criterion
    • G05B13/027 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, the criterion being a learning criterion using neural networks only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Abstract

The application provides a computing device and a computing method for neural network computing. The computing device comprises: a first calculation unit, configured to perform a first operation M times on an input first matrix to obtain a second matrix; a second calculation unit, configured to perform a second operation on the input second matrix; and a control unit, configured to control the first calculation unit to perform the i-th of the M first operations on the first matrix to obtain the i-th data element of the second matrix, store the i-th data element of the second matrix into a first storage unit, and, if the data elements currently stored in the first storage unit can be used to perform the second operation once, control the second calculation unit to perform the second operation once. The computing device and the computing method can reduce the storage overhead of a computing device used for neural network computing.

Description

Computing device and computing method for neural network computing
Technical Field
The present application relates to the field of data processing, and more particularly, to a computing device and a computing method for neural network computing.
Background
Neural networks (e.g., deep neural networks) are widely used in the fields of computer vision, natural language processing, big data mining, and the like. The neural network calculation has the following two typical characteristics:
1) Computation intensive
The operations performed by a neural network are mainly multidimensional matrix multiplications, whose computational complexity is generally O(N³). For example, a 22-layer GoogLeNet typically requires about 6 GFLOPs (billions of floating-point operations) of computation.
2) Memory access intensive
The training process of a neural network generally requires massive amounts of data, and it also requires a large amount of storage space for caching the connection weights of the neurons and the intermediate data calculated by each neural network layer.
The prior art offers a wide variety of computing devices specialized for neural network computing, such as logic-circuit-based computing devices or crossbar-array-based computing devices. However, prior-art computing devices for neural network computing all need a large amount of storage resources to store the intermediate data produced by each neural network layer, placing high demands on the storage capacity of the computing device and incurring high storage overhead.
Disclosure of Invention
The application provides a computing device and a computing method for neural network computing, so as to reduce storage overhead of the computing device for the neural network computing.
In a first aspect, a computing device for neural network computing is provided, where the neural network includes a K-th neural network layer and a (K+1)-th neural network layer, the operations performed by the K-th neural network layer include a first operation, and the operations performed by the (K+1)-th neural network layer include a second operation, where K is a positive integer not less than 1. The computing device includes: a first calculation unit, configured to perform the first operation M times on an input first matrix to obtain a second matrix, where M is a positive integer not less than 1; a second calculation unit, configured to perform the second operation on the input second matrix; and a control unit, configured to: control the first calculation unit to perform the i-th of the M first operations on the first matrix to obtain the i-th data element of the second matrix, where 1 ≤ i ≤ M; store the i-th data element of the second matrix into a first storage unit; and, if the data elements currently stored in the first storage unit can be used to perform the second operation once, control the second calculation unit to perform the second operation once; where the first operation is a convolution operation and the second operation is a convolution or pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.
In the prior art, the (K+1)-th neural network layer starts its calculation only after the K-th neural network layer has completed its calculation; the computing device therefore needs to store all calculation results of the K-th neural network layer, which results in a large storage overhead. In this scheme, before the K-th neural network layer has completed all first operations on the input matrix, the second calculation unit can be controlled to perform the second operation once as soon as the first storage unit has stored the data elements required to perform it. In other words, the scheme does not require that the calculation of the (K+1)-th neural network layer wait until the calculation of the K-th neural network layer is complete: once the first storage unit stores data elements sufficient for one second operation, the (K+1)-th neural network layer can be controlled, through this inter-layer flow control mechanism, to perform the second operation once, improving the calculation efficiency of the neural network.
Furthermore, because this scheme triggers the (K+1)-th neural network layer to start calculating before the K-th neural network layer has finished, the first storage unit does not need to hold all the intermediate data produced by the K-th neural network layer at the same time; it only needs to hold part of the intermediate data passing between the K-th and (K+1)-th neural network layers, which reduces the storage overhead of the data.
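The inter-layer flow control can be illustrated with a short Python sketch (an illustration under assumed parameters, not the patented hardware itself): the first operation here is a "same" convolution that produces the second matrix element by element, the second operation is a 2 × 2 max pooling, and a pooling window fires as soon as all of its inputs exist. All function and variable names are invented for the example.

```python
import numpy as np

def pipelined_layers(first_matrix, kernel, pool=2):
    """Sketch: produce the second matrix element by element and trigger each
    second operation (2x2 max pooling) as soon as its inputs are available,
    instead of waiting for all M first operations to finish."""
    H, W = first_matrix.shape
    h, w = kernel.shape
    padded = np.pad(first_matrix, ((h // 2,), (w // 2,)))
    second = np.full((H, W), np.nan)        # stands in for the first storage unit
    pooled = {}
    for r in range(H):                      # M = H * W first operations, row-major
        for c in range(W):
            second[r, c] = np.sum(padded[r:r + h, c:c + w] * kernel)
            # Flow control: fire every pooling window whose inputs now exist.
            for pr in range(0, H - pool + 1, pool):
                for pc in range(0, W - pool + 1, pool):
                    win = second[pr:pr + pool, pc:pc + pool]
                    if (pr, pc) not in pooled and not np.isnan(win).any():
                        pooled[(pr, pc)] = win.max()   # one second operation
    return second, pooled

second, pooled = pipelined_layers(np.arange(9.).reshape(3, 3), np.ones((3, 3)))
```

With the 3 × 3 input of FIG. 1, the single 2 × 2 pooling window fires right after the fifth convolution result (element (2,2) of the second matrix) is produced, matching the worked example given later in the description.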
With reference to the first aspect, in certain implementations of the first aspect, the computing device includes the first storage unit, the first storage unit includes a first line buffer, the first line buffer includes N registers, and the N registers in the first line buffer sequentially store each element of a third matrix in a row-first or column-first manner, where the third matrix is the matrix obtained by padding the second matrix with 0 elements so that the second operation can be performed on it, and N = (h − 1) × (W + p) + w, where h denotes the number of rows of the kernel corresponding to the second operation, w denotes the number of columns of that kernel, W denotes the number of columns of the second matrix, and p denotes the number of rows or columns of 0 elements with which the second matrix must be padded in order to perform the second operation on it; h, w, W, p and N are all positive integers not less than 1.
By setting N = (h − 1) × (W + p) + w, the first line buffer realizes data caching between neural network layers at the minimum storage cost.
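As a quick numerical check of the formula (a sketch; the variable names are ours), the 3 × 3 kernel and 3-column second matrix used in the FIG. 4 example later in the description give the 13 registers described there:

```python
# Sanity check of N = (h - 1) * (W + p) + w for the FIG. 4 example:
# a 3x3 kernel for the second operation (h = w = 3) and a 3-column
# second matrix (W = 3) padded with p = 2 columns of 0 elements.
h, w, W, p = 3, 3, 3, 2
N = (h - 1) * (W + p) + w
print(N)  # 13, the number of registers in the first line buffer of FIG. 4
```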
With reference to the first aspect, in certain implementations of the first aspect, the second calculation unit is a crossbar array, X target registers among the N registers are directly connected to the X rows of the second calculation unit, and the X target registers are the (1 + k × (W + p))-th to the (w + k × (W + p))-th registers among the N registers, where k is an integer taking values from 0 to h − 1, and X = h × w. The control unit is specifically configured to: store the i-th data element of the second matrix in the first line buffer; and, if the data elements currently stored in the X target registers can be used to perform the second operation once, control the second calculation unit to operate and perform the second operation once on the data elements stored in the X target registers.
Because the X target registers are directly connected to the X rows of the second calculation unit, the data to be calculated can be fed into the second calculation unit without complex addressing operations, which improves the calculation efficiency of the neural network.
With reference to the first aspect, in certain implementations of the first aspect, the first calculation unit is a crossbar array, the first operation is a convolution operation, and the kernel of the first operation and the kernel of the second operation are the same size. The computing device further comprises a second storage unit, the second storage unit includes a second line buffer, the second line buffer includes N registers, and the N registers in the second line buffer sequentially store each element of a fourth matrix in a row-first or column-first manner, where the fourth matrix is the matrix obtained by padding the first matrix with 0 elements so that the first operation can be performed on it. The control unit is specifically configured to: in the n-th clock cycle, control the first calculation unit to perform the i-th first operation on the first matrix to obtain the i-th data element of the second matrix, where the i-th data element of the second matrix is located in the last column of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the row following the row of the i-th data element, or the i-th data element of the second matrix is located in the last row of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the column following the column of the i-th data element. The control unit is further configured to: in the (n+t)-th clock cycle, control the first calculation unit to perform the (i+1)-th of the M first operations on the first matrix, where t is a positive integer greater than 1; and control the first line buffer to store a 0 element in at least one clock cycle from the (n+1)-th to the (n+t)-th clock cycle.
By controlling the first line buffer to read in 0 elements during the idle clock cycles between the i-th and the (i+1)-th first operations, the waste of clock cycles is reduced and the calculation efficiency of the neural network is improved.
With reference to the first aspect, in certain implementations of the first aspect, the control unit is specifically configured to: in the n-th clock cycle, control the first calculation unit to perform the i-th first operation on the first matrix to obtain the i-th data element of the second matrix, where the i-th data element of the second matrix is located in the last column of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the row following the row of the i-th data element, or the i-th data element of the second matrix is located in the last row of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the column following the column of the i-th data element. The control unit is further configured to: in the (n+t)-th clock cycle, control the first calculation unit to perform the (i+1)-th of the M first operations on the first matrix, where t is a positive integer greater than 1; and control the first line buffer to store a 0 element in at least one clock cycle from the (n+1)-th to the (n+t)-th clock cycle.
By controlling the first line buffer to read in 0 elements during the idle clock cycles between the i-th and the (i+1)-th first operations, the waste of clock cycles is reduced and the calculation efficiency of the neural network is improved.
With reference to the first aspect, in certain implementations of the first aspect, t = (s − 1) × (W + p) + (w − 1), and the control unit is specifically configured to: control the first line buffer to sequentially store (s − 1) × (W + p) + (w − 1) 0 elements from the (n+1)-th to the (n+t)-th clock cycle, where s denotes the sliding step of the first operation.
By controlling the first line buffer to read in one 0 element in each idle clock cycle between the i-th and the (i+1)-th first operations, the waste of clock cycles is avoided and the efficiency of the neural network calculation is maximized.
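A small worked example of the formula (our reading of it, under the FIG. 1 parameters; the variable names are ours):

```python
# Idle-cycle zero count between producing the last element of one output
# row and the first element of the next: stride s = 1, second matrix with
# W = 3 columns, p = 2 padding columns, kernel with w = 3 columns.
s, W, p, w = 1, 3, 2, 3
t = (s - 1) * (W + p) + (w - 1)
print(t)  # 2: the first line buffer reads one 0 element in each of the
          # 2 idle clock cycles (the row-end and row-start padding zeros)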
With reference to the first aspect, in certain implementations of the first aspect, the first computing unit is a crossbar array.
A calculation unit in the form of a crossbar array can convert digital operations into analog operations, improving the calculation efficiency of the neural network.
In a second aspect, a computing method for neural network computing is provided, where the neural network includes a K-th neural network layer and a (K+1)-th neural network layer, the operations performed by the K-th neural network layer include a first operation, and the operations performed by the (K+1)-th neural network layer include a second operation, where K is a positive integer not less than 1. The computing device to which the computing method is applied includes: a first calculation unit, configured to perform the first operation M times on an input first matrix to obtain a second matrix, where M is a positive integer not less than 1; and a second calculation unit, configured to perform the second operation on the input second matrix. The computing method includes: controlling the first calculation unit to perform the i-th of the M first operations on the first matrix to obtain the i-th data element of the second matrix, where 1 ≤ i ≤ M; storing the i-th data element of the second matrix into a first storage unit; and, if the data elements currently stored in the first storage unit can be used to perform the second operation once, controlling the second calculation unit to perform the second operation once; where the first operation is a convolution operation and the second operation is a convolution or pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.
In the prior art, the (K+1)-th neural network layer starts its calculation only after the K-th neural network layer has completed its calculation; the computing device therefore needs to store all calculation results of the K-th neural network layer, which results in a large storage overhead. In this scheme, before the K-th neural network layer has completed all first operations on the input matrix, the second calculation unit can be controlled to perform the second operation once as soon as the first storage unit has stored the data elements required to perform it. In other words, the scheme does not require that the calculation of the (K+1)-th neural network layer wait until the calculation of the K-th neural network layer is complete: once the first storage unit stores data elements sufficient for one second operation, the (K+1)-th neural network layer can be controlled, through this inter-layer flow control mechanism, to perform the second operation once, improving the calculation efficiency of the neural network.
Furthermore, because this scheme triggers the (K+1)-th neural network layer to start calculating before the K-th neural network layer has finished, the first storage unit does not need to hold all the intermediate data produced by the K-th neural network layer at the same time; it only needs to hold part of the intermediate data passing between the K-th and (K+1)-th neural network layers, which reduces the storage overhead of the data.
With reference to the second aspect, in certain implementations of the second aspect, the computing device includes the first storage unit, the first storage unit includes a first line buffer, the first line buffer includes N registers, and the N registers in the first line buffer sequentially store each element of a third matrix in a row-first or column-first manner, where the third matrix is the matrix obtained by padding the second matrix with 0 elements so that the second operation can be performed on it, and N = (h − 1) × (W + p) + w, where h denotes the number of rows of the kernel corresponding to the second operation, w denotes the number of columns of that kernel, W denotes the number of columns of the second matrix, and p denotes the number of rows or columns of 0 elements with which the second matrix must be padded in order to perform the second operation on it; h, w, W, p and N are all positive integers not less than 1.
By setting N = (h − 1) × (W + p) + w, the first line buffer realizes data caching between neural network layers at the minimum storage cost.
With reference to the second aspect, in certain implementations of the second aspect, the second calculation unit is a crossbar array, X target registers among the N registers are directly connected to the X rows of the second calculation unit, and the X target registers are the (1 + k × (W + p))-th to the (w + k × (W + p))-th registers among the N registers, where k is an integer taking values from 0 to h − 1, and X = h × w. Storing the i-th data element of the second matrix into the first storage unit includes: storing the i-th data element of the second matrix in the first line buffer. Controlling the second calculation unit to perform the second operation once if the data elements currently stored in the first storage unit can be used to perform the second operation once includes: if the data elements currently stored in the X target registers can be used to perform the second operation once, controlling the second calculation unit to operate and perform the second operation once on the data elements stored in the X target registers.
Because the X target registers are directly connected to the X rows of the second calculation unit, the data to be calculated can be fed into the second calculation unit without complex addressing operations, which improves the calculation efficiency of the neural network.
With reference to the second aspect, in some implementations of the second aspect, controlling the first calculation unit to perform the i-th of the M first operations on the first matrix includes: in the n-th clock cycle, controlling the first calculation unit to perform the i-th first operation on the first matrix to obtain the i-th data element of the second matrix, where the i-th data element of the second matrix is located in the last column of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the row following the row of the i-th data element, or the i-th data element of the second matrix is located in the last row of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the column following the column of the i-th data element. The computing method further includes: in the (n+t)-th clock cycle, controlling the first calculation unit to perform the (i+1)-th of the M first operations on the first matrix, where t is a positive integer greater than 1; and controlling the first line buffer to store a 0 element in at least one clock cycle from the (n+1)-th to the (n+t)-th clock cycle.
By controlling the first line buffer to read in 0 elements during the idle clock cycles between the i-th and the (i+1)-th first operations, the waste of clock cycles is reduced and the calculation efficiency of the neural network is improved.
With reference to the second aspect, in some implementations of the second aspect, the first calculation unit is a crossbar array, the first operation is a convolution operation, and the kernel of the first operation and the kernel of the second operation are the same size. The computing device further comprises a second storage unit, the second storage unit includes a second line buffer, the second line buffer includes N registers, and the N registers in the second line buffer sequentially store each element of a fourth matrix in a row-first or column-first manner, where the fourth matrix is the matrix obtained by padding the first matrix with 0 elements so that the first operation can be performed on it. Controlling the first calculation unit to perform the i-th of the M first operations on the first matrix includes: in the n-th clock cycle, controlling the first calculation unit to perform the i-th first operation on the first matrix to obtain the i-th data element of the second matrix, where the i-th data element of the second matrix is located in the last column of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the row following the row of the i-th data element, or the i-th data element of the second matrix is located in the last row of the second matrix and the (i+1)-th data element of the second matrix is located at the starting position of the column following the column of the i-th data element. The computing method further includes: in the (n+t)-th clock cycle, controlling the first calculation unit to perform the (i+1)-th of the M first operations on the first matrix, where t is a positive integer greater than 1; and controlling the first line buffer to store a 0 element in at least one clock cycle from the (n+1)-th to the (n+t)-th clock cycle.
By controlling the first line buffer to read in 0 elements during the idle clock cycles between the i-th and the (i+1)-th first operations, the waste of clock cycles is reduced and the calculation efficiency of the neural network is improved.
With reference to the second aspect, in certain implementations of the second aspect, controlling the first line buffer to store 0 elements in at least one clock cycle between the (n+1)-th and the (n+t)-th clock cycle includes: controlling the first line buffer to sequentially store (s − 1) × (W + p) + (w − 1) 0 elements from the (n+1)-th to the (n+t)-th clock cycle, where s denotes the sliding step of the first operation.
By controlling the first line buffer to read in 0 elements during the idle clock cycles between the i-th and the (i+1)-th first operations, the waste of clock cycles is reduced and the calculation efficiency of the neural network is improved.
With reference to the second aspect, in certain implementations of the second aspect, the first calculation unit is a crossbar array.
A calculation unit in the form of a crossbar array can convert digital operations into analog operations, improving the calculation efficiency of the neural network.
In a third aspect, a computer-readable medium is provided, storing program code for execution by a computing device, the program code comprising instructions for performing the method of the second aspect.
In some aspects or some implementations of some aspects above, the first operation is a convolution operation.
In some aspects or some implementations of some aspects above, the second operation is a convolution operation.
In some aspects or some implementations of some aspects above, the first computational unit is a crossbar array.
In some aspects or some implementations of some aspects above, the second computational unit is a crossbar array.
In some aspects or some implementations of some aspects described above, the kernel of the first operation is the same size as the kernel of the second operation.
The technical scheme provided by the application can reduce the storage overhead of data and improve the calculation efficiency of the neural network.
Drawings
Fig. 1 is a diagram showing an example of a calculation process of a convolution operation.
Fig. 2 is a diagram showing an example of the structure of the crossbar array.
FIG. 3 is a schematic block diagram of a computing device of one embodiment of the present application.
Fig. 4 is a diagram illustrating a structure of a line buffer according to an embodiment of the present application.
FIG. 5 is a diagram illustrating a comparison of the line buffer storage state with the convolution operation process according to an embodiment of the present application.
FIG. 6 is a schematic block diagram of a computing device of another embodiment of the present application.
Fig. 7 is a schematic block diagram of a computing device of yet another embodiment of the present application.
FIG. 8 is an exemplary diagram of a convolution operation according to one embodiment of the present application.
Fig. 9 is an exemplary diagram of a convolution operation according to another embodiment of the present application.
Fig. 10 is a schematic flow chart of a calculation method for neural network calculation according to an embodiment of the present application.
Detailed Description
For ease of understanding, the neural network and the computing device used for the neural network calculations will be described in detail.
Neural networks typically include multiple neural network layers, each of which may implement a different operation. Common neural network layers include convolutional layers, pooling layers, fully-connected layers, and the like. Adjacent neural network layers can be combined in various ways; the more common combinations include convolutional layer-convolutional layer and convolutional layer-pooling layer-convolutional layer. A convolutional layer is mainly used to perform a convolution operation on its input matrix, and a pooling layer is mainly used to perform a pooling operation on its input matrix. Both convolution and pooling operations correspond to a kernel, where the kernel corresponding to a convolution operation may be referred to as a convolution kernel. The convolution operation and the pooling operation are described in detail below.
Convolution operations are mainly used in the field of image processing, where the input matrix may also be referred to as a feature map. A convolution operation corresponds to a convolution kernel. The convolution kernel may also be referred to as a weight matrix, in which each element is a weight. In the convolution process, a sliding window divides the input matrix into a number of sub-matrices of the same size as the weight matrix; each sub-matrix is multiplied with the weight matrix element by element and the products are accumulated, so the result obtained is a weighted sum of the data elements in that sub-matrix.
For ease of understanding, the process of the convolution operation is illustrated below in conjunction with FIG. 1.
As shown in fig. 1, the input matrix is a 3 × 3 matrix. To ensure that the input matrix and the output matrix have the same dimensions, 2 rows and 2 columns of 0 elements need to be added around the edge of the input matrix before the convolution operation is performed, converting the input matrix into a 5 × 5 matrix. The size of the sliding window equals the size of the convolution kernel; fig. 1 uses a 3 × 3 weight matrix as the convolution kernel. The sliding window slides by a certain sliding step s, starting from the upper-left corner of the input matrix; fig. 1 takes the sliding step s = 1 as an example. The output matrix is obtained by performing 9 convolution operations in the manner shown in fig. 1: the first convolution operation yields element (1,1) of the output matrix, the second convolution operation yields element (1,2), and so on.
It should be understood that the convolution operation generally requires the input matrix and the output matrix to have the same dimension, but the embodiment of the present application is not limited thereto, and may not require the input matrix and the output matrix to have the same dimension. If the convolution operation does not require the input matrix and output matrix dimensions to be consistent, then the input matrix may not be complemented by 0 before performing the convolution operation.
It should also be understood that the above is described by taking the example that the sliding step s of the convolution operation is 1, but the embodiment of the present application is not limited thereto, and the sliding step of the convolution operation may also be greater than 1.
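For concreteness, a minimal Python sketch of this sliding-window convolution (an illustration only; the function name and the way the padding is split are our assumptions):

```python
import numpy as np

def conv2d(x, kernel, stride=1, same=True):
    """Sliding-window convolution as described above. With same=True the
    input is padded with zeros so that, at stride 1, the output keeps the
    input's dimensions."""
    h, w = kernel.shape
    if same:
        # h - 1 rows and w - 1 columns of 0 elements in total, split evenly
        x = np.pad(x, ((h // 2, h - 1 - h // 2), (w // 2, w - 1 - w // 2)))
    H = (x.shape[0] - h) // stride + 1
    W = (x.shape[1] - w) // stride + 1
    out = np.empty((H, W))
    for r in range(H):
        for c in range(W):
            sub = x[r * stride:r * stride + h, c * stride:c * stride + w]
            out[r, c] = np.sum(sub * kernel)   # weighted sum per sub-matrix
    return out

# FIG. 1 setting: 3x3 input, 3x3 kernel, stride 1 -> 9 operations, 3x3 output
print(conv2d(np.arange(1., 10.).reshape(3, 3), np.ones((3, 3))))
```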
The pooling operation is typically used to reduce the dimensionality of the input matrix, i.e., to down-sample it. Pooling is similar to convolution in that it is also computed by applying a kernel to the input matrix, so there is likewise a sliding window, and the sliding step of a pooling operation is typically greater than 1 (though it may also equal 1). There are various types of pooling operations, such as average pooling and maximum pooling. Average pooling computes the average of all elements in the sliding window; maximum pooling computes the maximum of all elements in the sliding window. The pooling process is otherwise similar to the convolution process, except that the operation applied to the data elements in the sliding window differs, and it is not described in detail here.
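A corresponding sketch of the pooling operation, under the same assumptions as the convolution sketch above:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Average or max pooling: the same sliding-window scheme as above,
    but the window contents are reduced instead of weighted."""
    H = (x.shape[0] - size) // stride + 1
    W = (x.shape[1] - size) // stride + 1
    out = np.empty((H, W))
    for r in range(H):
        for c in range(W):
            win = x[r * stride:r * stride + size, c * stride:c * stride + size]
            out[r, c] = win.max() if mode == "max" else win.mean()
    return out

print(pool2d(np.arange(16.).reshape(4, 4)))              # maximum pooling
print(pool2d(np.arange(16.).reshape(4, 4), mode="avg"))  # average pooling
```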
As noted above, neural networks typically have multiple neural network layers. A computing device for neural network computing (which may be, for example, a neural network accelerator) includes computing units corresponding to respective neural network layers, and the computing units corresponding to each neural network layer may be used to perform operations or operations of that neural network layer. It should be noted that the computing units corresponding to the neural network layers may be integrated together or separated from each other, which is not specifically limited in this embodiment of the present application.
The computational cells may be implemented using logic computational circuitry or a crossbar array. The logic calculation circuit may be, for example, a complementary metal-oxide-semiconductor transistor (CMOS) based logic calculation circuit.
The calculation unit in the form of a cross array is a calculation unit that has recently come into widespread use, and when a neural network operation is performed using the cross array, the connection weights of the neurons may be stored in a Non-Volatile Memory (NVM) of the cross array. This may reduce the storage overhead of the computing device since the NVM may still be able to efficiently store data in the event of a power loss. The crossbar array is described in detail below in conjunction with fig. 2.
As shown in fig. 2, the crossbar array (crossbar or xbar) has a row-column crossed structure. Each cross node is provided with an NVM (such a cross node is hereinafter referred to as an NVM node) used for data storage and calculation. The type of NVM in the NVM node is not specifically limited in the embodiments of the present application; it may be, for example, a resistive random access memory (RRAM), a ferroelectric random access memory (FeRAM), a magnetic random access memory (MRAM), a phase-change random access memory (PRAM), and the like.
Since the calculation of the neural network layer is mainly based on vector-matrix multiplication or matrix-matrix multiplication, the cross array is well suited for the neural network calculation. The basic working principle of the crossbar array in neural network computation is described in detail below.
Each NVM node in the crossbar array is first initialized to store the connection weights of the neurons. Taking the cross array as an example for performing convolution operation of the convolutional layer, as shown in fig. 2, assuming that the convolutional layer performs T types of convolution operations, since each type of convolution operation corresponds to one two-dimensional convolution kernel (a convolution kernel is a weight matrix and is therefore two-dimensional), each two-dimensional convolution kernel may be vector-expanded to obtain a one-dimensional convolution kernel vector, and then the convolution kernel vector is mapped onto the T columns of the cross array, so that the NVM node of each column stores one convolution kernel vector. Taking a convolution kernel with a two-dimensional convolution kernel of 3 × 3 as an example, vector expansion may be performed on the convolution kernel of 3 × 3 to obtain a one-dimensional convolution kernel vector including 9 data elements, and then the 9 data elements of the one-dimensional convolution kernel vector are respectively stored in 9 NVM nodes in a certain column of the cross array, where the data element stored in each NVM node may be represented by a resistance value (or referred to as a conductance value) of the NVM node.
Each sub-matrix of the input matrix is subjected to a convolution operation. Before the convolution operation is performed on a sub-matrix, the sub-matrix may be converted into a vector to be calculated. As shown in fig. 2, assuming that the dimension of the vector to be calculated is n, the n elements of the vector are represented by digital signals D1 to Dn, respectively. The digital signals D1 to Dn are then converted into analog signals V1 to Vn by a digital-to-analog converter (DAC), so that the n elements of the vector to be calculated are represented by the analog signals V1 to Vn. The analog signals V1 to Vn are then input into the n rows of the crossbar array. The conductance value of each NVM node in a column represents the magnitude of the weight stored by that node, so that, after the analog signals V1 to Vn act on the corresponding NVM nodes of each column, the current output by each NVM node represents the product of the weight stored by the node and the data element represented by the analog signal it receives. Because each column of the crossbar array corresponds to one convolution kernel vector, the sum of the output currents of a column represents the result of the matrix product of the convolution kernel corresponding to that column and the sub-matrix corresponding to the vector to be calculated. Then, as shown in fig. 2, the result of the matrix multiplication is converted from an analog quantity to a digital quantity by an analog-to-digital converter (ADC) at the end of each column of the crossbar array, and is output.
Based on the working principle, the cross array converts the matrix-matrix multiplication into the multiplication operation of two vectors (a vector to be calculated and a convolution kernel vector), can quickly obtain a calculation result based on analog calculation, and is very suitable for processing the operations of vector-matrix multiplication or matrix-matrix multiplication and the like. Since more than 90% of the operations in the neural network are all such operations, the crossbar array is very suitable for being used as a computing unit in the neural network, and is particularly suitable for processing convolution operation.
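The mapping just described can be modelled in a few lines (an idealized sketch in exact arithmetic; DAC/ADC quantization and device non-idealities are ignored, and all names are ours):

```python
import numpy as np

def crossbar_multiply(sub_matrix, kernels):
    """Idealized crossbar: each 2-D kernel is vector-expanded into a column
    of the conductance matrix G; the flattened sub-matrix drives the rows,
    and each column's summed current is one kernel's dot-product result."""
    G = np.stack([k.ravel() for k in kernels], axis=1)  # shape (h*w, T)
    return sub_matrix.ravel() @ G                       # T column currents

# T = 2 convolution kernels of size 3x3 mapped onto 2 columns (9 NVM nodes each)
kernels = [np.ones((3, 3)), np.eye(3)]
sub_matrix = np.arange(9.).reshape(3, 3)    # one sub-matrix of the input
print(crossbar_multiply(sub_matrix, kernels))  # [36. 12.]
```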
In addition to the calculation unit, the calculation apparatus for neural network calculation includes a storage unit for storing intermediate data of each neural network layer or connection weights of neurons (if the calculation unit is a cross array, the connection weights of the neurons may be stored in NVM nodes of the cross array). The memory unit of the conventional computing device for neural network computing is generally implemented by a Dynamic Random Access Memory (DRAM), and may also be implemented by an enhanced dynamic random access memory (eDRAM).
As indicated above, the neural network has the characteristics of dense computation and dense access, and therefore, a large amount of storage resources are required to store intermediate data obtained by the operation of each neural network layer, and the storage overhead is large.
In order to solve the above problem, a computing device for neural network computation according to an embodiment of the present application is described in detail below with reference to fig. 3.
Fig. 3 is a schematic block diagram of a computing device for neural network computing according to an embodiment of the present disclosure. The neural network comprises a K-th neural network layer and a (K+1)-th neural network layer, the operations performed by the K-th neural network layer include a first operation, and the operations performed by the (K+1)-th neural network layer include a second operation, where K is a positive integer not less than 1.
the computing device 300 includes a first computing unit 310, a second computing unit 330, and a control unit 340.
The first calculating unit 310 is configured to perform M first operations on the input first matrix to obtain a second matrix, where M is a positive integer not less than 1.
The second calculation unit 330 is configured to perform a second operation on the input second matrix.
The control unit 340 is configured to:
controlling the first calculating unit 310 to perform the i-th of the M first operations on the first matrix to obtain the i-th data element of the second matrix, where 1 ≤ i ≤ M;
storing the ith data element of the second matrix in the first storage unit 320;
if the data element currently stored in the first storage unit 320 can be used to perform the second operation once, controlling the second calculation unit 330 to perform the second operation once;
wherein the first operation is a convolution operation and the second operation is a convolution operation or a pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.
In the prior art, the (K+1)-th neural network layer starts its calculation only after the K-th neural network layer has completed its calculation; the computing device therefore needs to store all calculation results of the K-th neural network layer, which results in a large storage overhead. In the embodiment of the present application, before the K-th neural network layer has completed all first operations on the input matrix, the second calculation unit can be controlled to perform the second operation once as soon as the first storage unit has stored the data elements required to perform it. In other words, the embodiment of the present application does not require that the calculation of the (K+1)-th neural network layer wait until the calculation of the K-th neural network layer is complete: once the first storage unit stores data elements sufficient for one second operation, the (K+1)-th neural network layer can be controlled, through the inter-layer flow control mechanism, to perform the second operation once, which improves the calculation efficiency of the neural network.
Further, the embodiment of the present application triggers the (K+1)-th neural network layer to perform calculation before the calculation of the K-th neural network layer is completed, which means that the first storage unit does not need to hold all the intermediate data calculated by the K-th neural network layer at the same time; it only needs to hold part of the intermediate data passing between the K-th and (K+1)-th neural network layers, so the storage overhead of the data can be reduced.
It is indicated above that the first calculation unit 310 is configured to perform the first operation M times on the input matrix. M represents the number of times the input matrix needs to perform the first operation. The specific value of M is related to one or more of the dimensions of the input matrix, the type of the first operation, the size of the sliding window corresponding to the first operation, the sliding step length, and the like, which is not specifically limited in this embodiment of the present application. Taking fig. 1 as an example, the input matrix is a 3 × 3 matrix, the size of the sliding window is 3 × 3, the sliding step is 1, and M is equal to 9.
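For the sliding-window case, M follows directly from the padded input size, the window size and the sliding step; a hedged one-liner for the FIG. 1 setting (variable names are ours):

```python
# How M follows from the quantities above, in the FIG. 1 setting:
# a 3x3 input padded to 5x5, a 3x3 sliding window, sliding step 1.
H_pad, W_pad, h, w, s = 5, 5, 3, 3, 1
M = ((H_pad - h) // s + 1) * ((W_pad - w) // s + 1)
print(M)  # 9, matching the 9 convolution operations of FIG. 1
```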
As indicated above, the first storage unit 320 is used for storing the output matrix calculated by the first calculation unit 310. It should be understood that "output matrix" is relative to the first calculation unit 310; the same matrix is in fact the input matrix of the second calculation unit 330.
The embodiment of the present application does not specifically limit the type of the first storage unit 320. In some embodiments, the first storage unit 320 may be a DRAM; in some embodiments, the first storage unit 320 may be an eDRAM; in some embodiments, the first storage unit 320 may be a line buffer (LB). The following description takes the case where the first storage unit 320 is an LB as an example.
In some embodiments, first storage unit 320 may be part of computing device 300. For example, the first storage unit 320 may be integrated with the computing unit in the computing device 300 on a chip, dedicated to neural network computation. In other embodiments, first storage unit 320 may be a memory located external to computing device 300.
The computing device of the embodiment of the present application may be a general computing device supporting neural network computing, or may be a computing device dedicated to neural network computing, for example, may be a neural network accelerator.
The control unit 340 in the embodiment of the present application is mainly used to implement the control logic in the computing device 300, and the control unit 340 may be a complete control unit or may be formed by combining a plurality of separate sub-units.
The type of the first calculating unit 310 is not particularly limited in the embodiment of the present application, and for example, the first calculating unit may be implemented by a cross array, or may also be implemented by a logic calculating circuit, for example, a CMOS-based logic calculating circuit.
As indicated above, if the data element currently stored in the first storage unit 320 can be used to perform the second operation once, the control unit 340 controls the second calculation unit 330 to perform the second operation once. In other words, if the data element currently stored by the first storage unit 320 contains a data element required to perform the second operation once, the control unit 340 controls the second calculation unit 330 to perform the second operation once.
Assuming that the second operation performed by the second computing unit 330 is a pooling operation and the size of the sliding window corresponding to the second operation is 2 × 2, the second computing unit 330 needs to obtain the elements (1,1), (1,2), (2,1), (2,2) of the second matrix when performing the pooling operation for the 1 st time. Taking the input matrix shown in fig. 1 as the first matrix as an example, when the first calculating unit 310 performs the fifth convolution operation, the first storing unit 320 obtains the element (2,2) of the second matrix, and at this time, the first storing unit 320 stores the elements (1,1), (1,2), (1,3), (2,1), (2,2) of the second matrix, which include the elements (1,1), (1,2), (2,1), (2,2) of the second matrix, which are required for the second calculating unit 330 to perform the 1 st pooling operation. Therefore, after the first calculation unit 310 performs the fifth convolution operation, the second calculation unit 330 may be controlled to perform one pooling operation.
Alternatively, in some embodiments, the first calculation unit 310 may be a crossbar array and the first operation may be a convolution operation.
Alternatively, in some embodiments, the second computing unit 330 may be a crossbar array and the second operation may be a convolution operation.
Optionally, in some embodiments, the cores of the first operation and the second operation may be the same size.
As can be seen from the above description of the embodiment of fig. 2, the crossbar array converts operations based on digital signals into operations based on analog signals (analog operations for short) through the DAC, and analog operations have the characteristic of fast calculation, which improves the efficiency of neural network computation. Furthermore, the crossbar array stores the convolution kernels in its NVM nodes, and NVM nodes are non-volatile, so the convolution kernels do not need to be stored in a storage unit, reducing the storage overhead of the storage unit.
Optionally, in some embodiments, computing device 300 may also include a first storage unit 320. Further, as shown in fig. 4, in some embodiments, the first memory unit 320 may include a first line buffer 410. Each register 420 in the first line buffer 410 may be used to store one element in the matrix.
It should be understood that a line buffer (LB) may also be referred to as a line cascade register. A line buffer may be formed by a number of registers connected end to end, each register storing one data element. The registers in a line buffer may also be referred to as shift registers: whenever the 1st register of the line buffer reads in a new data element, the older data elements in the line buffer are shifted back by one register, and the data element in the last register is discarded.
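A minimal sketch of this shift-register behaviour (illustration only; the class and method names are invented):

```python
from collections import deque

class LineBuffer:
    """N registers connected end to end: each new data element shifts the
    older elements back by one register, and the element that falls out of
    the last register is discarded."""

    def __init__(self, n):
        self._regs = deque(maxlen=n)    # index 0 is the 1st (newest) register

    def push(self, element):
        self._regs.appendleft(element)  # older elements shift back; overflow
                                        # from the last register is dropped

    def read(self, position):
        """Contents of the register at a 1-based position, if filled."""
        return self._regs[position - 1] if position <= len(self._regs) else None
```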
It should be understood that the storage medium of the register 420 is not particularly limited in the embodiments of the present application. For example, the storage medium of the register 420 may be a Static Random Access Memory (SRAM) or an NVM.
As can be seen in FIG. 4, the first storage unit 320 may include Cin line buffers (Cin ≥ 1). The first line buffer 410 may be any one of the Cin line buffers. Cin may represent the number of convolution kernels contained in the first calculation unit 310. In other words, each convolution kernel stored in the first calculation unit 310 may correspond to one line buffer; when the first calculation unit 310 performs a convolution operation using a certain convolution kernel, the control unit 340 stores the intermediate data calculated with that convolution kernel into the line buffer corresponding to it.
The computing device provided by the embodiment of the application uses the line buffer as the storage unit, and compared with the DRAM and the eDRAM, the line buffer has the characteristics of simplicity in operation and high addressing speed, and the efficiency of neural network computing can be improved.
Optionally, in some embodiments, the first line buffer 410 may include N registers 420. The N registers 420 in the first line buffer 410 may store each element of the third matrix in turn in a row-first or column-first manner. The third matrix is obtained by padding the second matrix with 0 elements so that the second operation can be performed on it, where N ≥ (h − 1) × (W + p) + w. h may represent the number of rows of the kernel corresponding to the second operation; w may represent the number of columns of that kernel; W may represent the number of columns of the second matrix; and p may represent the number of rows or columns of 0 elements with which the second matrix needs to be padded in order to perform the second operation on it. h, w, W, p and N are all positive integers not less than 1.
As noted above, the N registers 420 in the first line buffer 410 may store each element of the third matrix in turn in a row-first or column-first manner. Row-first means that the first line buffer 410 reads the 0th through the last element of row 0 of the third matrix, then the 0th through the last element of row 1, and so on. Column-first means that the first line buffer 410 reads the 0th through the last element of column 0 of the third matrix, then the 0th through the last element of column 1, and so on. Whether the first line buffer 410 reads the elements of the third matrix in a row-first or column-first manner depends on the sliding direction of the sliding window corresponding to the second operation: if the sliding window first slides along the rows of the matrix, the first line buffer 410 may read the elements row-first; if it first slides along the columns, the first line buffer 410 may read the elements column-first.
It should be noted that, if the second operation is a convolution operation, it is generally required that the dimensions of the input matrix and the output matrix are consistent, and therefore, 0 needs to be complemented for the second matrix to obtain a third matrix, but the embodiment of the present application is not limited thereto. In some embodiments, it may not be required that the input matrix and the output matrix are consistent, in which case, the number of rows and/or columns that need to be 0-complemented for the second matrix is 0 (i.e., the second matrix does not need to be 0-complemented), and in this case, the third matrix and the second matrix in the embodiment of the present invention are the same matrix, and p is equal to 0.
As noted above, N = (h − 1) × (W + p) + w. The meaning of this value of N is described in detail below with reference to fig. 4 and fig. 5. In the embodiment shown in fig. 4, h and w are both 3, i.e. the kernel of the second operation is a 3 × 3 convolution kernel, so the first line buffer contains 13 registers. Taking the input matrix shown in fig. 1 as the second matrix, the third matrix may be the 5 × 5 matrix obtained by padding the second matrix with 0 elements as shown in fig. 1. Assuming that the sliding window slides in the manner shown in fig. 1, i.e. along the row direction of the third matrix, the first line buffer 410 reads each element of the third matrix in turn in a row-first manner. When the first line buffer 410 has read the 13th element of the third matrix (corresponding to storage state 1 in fig. 5), it holds the elements required by the second calculation unit to perform the first second operation (the elements stored in the registers within the dashed box of fig. 5), and at this point the second calculation unit 330 may be controlled to perform the first second operation. Next, when the first line buffer 410 has read the 14th element of the third matrix (storage state 2 of fig. 5), it holds the elements required for the second second operation, and the second calculation unit 330 may be controlled to perform it. Likewise, when the first line buffer 410 has read the 15th element (storage state 3 of fig. 5), the second calculation unit 330 may be controlled to perform the third second operation. Next, when the first line buffer 410 reads the 16th and 17th elements (storage states 4 and 5 of fig. 5), the elements stored in the first line buffer 410 are insufficient and the second calculation unit 330 cannot yet perform the fourth second operation; at this time, the control unit 340 may control the second calculation unit to enter a sleep state. Next, when the first line buffer 410 has read the 18th element (storage state 6 of fig. 5), it holds the elements required for the fourth second operation, and the second calculation unit 330 may be controlled to perform it. The subsequent process is similar and is not described in detail here.
As can be seen from the process shown in fig. 5, the number N of registers in the first line buffer 410 is chosen such that, although the N registers cannot hold all the data elements calculated by the first calculation unit at the same time, the data elements required by the second calculation unit 330 to perform any one second operation always appear in the first line buffer 410 at the same time, specifically in the registers within the dashed box shown in fig. 5. If the number of registers in the first line buffer 410 were less than (h − 1) × (W + p) + w, there would be no guarantee that the data elements required by the second calculation unit 330 to perform any one second operation always appear in the first line buffer 410 at the same time; if it were greater than (h − 1) × (W + p) + w, register resources would be wasted.
Therefore, the embodiment of the present application sets N = (h-1) × (W + p) + w, so that the first line buffer realizes data caching between neural network layers at the minimum storage cost.
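To make this sizing rule concrete, the following small Python sketch (ours, not part of the patent) computes the minimum register count and checks it against the 13-register example above:

```python
def min_line_buffer_size(h: int, w: int, W: int, p: int) -> int:
    """Registers needed so that a complete h x w window of the padded
    matrix is resident at once: in a row-major stream whose padded row
    width is W + p, one window spans (h-1)*(W+p) + w stream positions."""
    return (h - 1) * (W + p) + w

# 3x3 kernel, 3-column second matrix, p = 2 rows/columns of padding -> 13.
assert min_line_buffer_size(h=3, w=3, W=3, p=2) == 13
```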
It should be noted that whether the first line buffer 410 reads in a data element computed by the first calculating unit 310 or pads a 0 element may be implemented by a two-way selector (MUX). Specifically, as shown in fig. 4, the first line buffer 410 may include a controller and a two-way selector MUX, where the controller sends a control signal to the MUX to select whether the MUX reads in a data element computed by the first calculation unit 310 or pads a 0 element. The control signals issued by the controller may come from pre-stored control instructions or logic.
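A minimal behavioral sketch of this selection logic (class and method names are our own assumptions) is shown below; in each clock cycle the MUX either admits the upstream data element or a 0 pad:

```python
from collections import deque

class FirstLineBuffer:
    """Behavioral model of the first line buffer: N shift registers
    fed through a two-way selector (MUX)."""

    def __init__(self, n: int):
        self.regs = deque([None] * n, maxlen=n)  # regs[-1] is the newest

    def clock(self, data_element, select_zero: bool):
        # The controller's signal drives the MUX: shift in the computed
        # data element, or a 0 pad instead.
        self.regs.append(0 if select_zero else data_element)

buf = FirstLineBuffer(13)
buf.clock(data_element=None, select_zero=True)  # pad a 0 element
buf.clock(data_element=7, select_zero=False)    # read a computed element
```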
Alternatively, in some embodiments, as shown in fig. 4, the second computing unit is a crossbar array, and X target registers among the N registers 420 are directly connected to X rows of the second computing unit 330, respectively. The X target registers are the (1 + k × (W + p))-th to the (w + k × (W + p))-th registers of the N registers 420, where k is an integer taking values from 0 to h-1, and X = h × w. The control unit 340 is specifically configured to: store the ith data element of the second matrix in the first line buffer 410; and, if the data elements currently stored in the X target registers 420 can be used to perform one second operation, control the second calculation unit 330 to work and perform one second operation on the data elements stored in the X target registers 420.
As can be seen from the above description, when N = (h-1) × (W + p) + w, the data required by the second computing unit to perform any second operation always appear at the same positions among the N registers, namely the positions of the X target registers 420. Taking fig. 4 as an example, h = w = 3, the first line buffer 410 includes N = 13 registers 420, and the X target registers are the 9 registers within the dashed frame, namely the 1st, 2nd, 3rd, 6th, 7th, 8th, 11th, 12th and 13th of the 13 registers.
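The index rule can be verified with a short sketch (ours):

```python
def target_register_indices(h: int, w: int, W: int, p: int) -> list[int]:
    """1-based indices of the X = h*w target registers: for each k in
    0..h-1, registers 1 + k*(W+p) through w + k*(W+p)."""
    return [k * (W + p) + j for k in range(h) for j in range(1, w + 1)]

# Reproduces the fig. 4 example: 9 target registers out of the 13.
assert target_register_indices(3, 3, 3, 2) == [1, 2, 3, 6, 7, 8, 11, 12, 13]
```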
As can be seen from fig. 5, the data elements required by the second calculation unit 330 to perform any one second operation always appear in these 9 registers. The present embodiment exploits this property of the first line buffer 410 by directly connecting the X target registers 420 to the X rows of the second calculation unit 330, so that the control unit needs no addressing operation: it only needs to switch the second calculation unit from the sleep state to the active state, and have it perform one second operation, whenever the X target registers 420 hold the data elements required for one second operation. Directly connecting the X target registers 420 to the X rows of the second calculation unit 330 therefore avoids addressing operations and improves the efficiency of the neural network calculation.
In some embodiments, directly connecting the X target registers 420 to the X rows of the second computing unit 330 may mean hard-wiring the X target registers 420 to the X rows of the second computing unit 330, respectively.
Optionally, in some embodiments, the control unit 340 may be specifically configured to: in the nth clock cycle, controlling the first calculating unit 310 to perform the ith first operation on the first matrix to obtain the ith data element of the second matrix, where the ith data element of the second matrix is located in the last column of the second matrix, and the (i + 1) th data element of the second matrix is located at the start position of the next row of the row where the ith data element is located, or the ith data element of the second matrix is located in the last row of the second matrix, and the (i + 1) th data element of the second matrix is located at the start position of the next column of the column where the ith data element is located; the control unit 340 may also be configured to: in the (n + t) th clock cycle, controlling the first computing unit 310 to execute the (i + 1) th first operation of the M first operations on the first matrix, wherein t is a positive integer greater than 1; and controlling the first line buffer to store the 0 element in at least one clock cycle from the (n + 1) th clock cycle to the (n + t) th clock cycle.
It was pointed out above that the first calculation unit 310 performs the first operation on the input first matrix. In some embodiments, if the first operation is a convolution operation that requires the input matrix and the output matrix to have consistent dimensions, the first matrix needs to be padded with 0 elements to obtain a fourth matrix before the first operation is performed. Further, in some embodiments, to store the elements of the fourth matrix, the computing device 300 may also configure the first computing unit 310 with a second storage unit identical in structure and/or function to the first storage unit 320 described above. The second storage unit may include a second line buffer, which may comprise N registers arranged to store each element of the fourth matrix sequentially in a row-first or column-first manner.
As shown in fig. 6, a second storage unit 350 may be connected to the first calculation unit 310 for storing the data elements required by the first calculation unit 310 to perform the first operation, while the first storage unit 320 is connected to the second calculation unit 330 for storing the data elements required by the second calculation unit 330 to perform the second operation. In other words, in the embodiment of the present application, crossbar arrays and line buffers are arranged alternately, which is equivalent to configuring, for each crossbar array, a cache located very close to it; this not only improves access efficiency but also facilitates the subsequent pipeline control mechanism. Taking fig. 7 as an example, the crossbar array at the (K-1)-th layer of the neural network is connected to the K-th line buffer, and the crossbar array at the K-th layer of the neural network is connected to the (K+1)-th line buffer. The K-th line buffer may include C_in1 individual line buffers, where C_in1 denotes the number of convolution kernels contained in the crossbar array at the (K-1)-th layer of the neural network. Similarly, the (K+1)-th line buffer may include C_in2 individual line buffers, where C_in2 denotes the number of convolution kernels contained in the crossbar array at the K-th layer of the neural network.
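As a rough structural sketch (layer names and kernel counts here are our own assumed values, not from the patent), the alternating arrangement can be written out as follows:

```python
# Each crossbar array feeds the next line buffer group, with one
# individual line buffer per convolution kernel (output channel).
stages = [
    {"crossbar": "layer K-1", "kernels": 4, "feeds": "K-th line buffer"},
    {"crossbar": "layer K",   "kernels": 8, "feeds": "(K+1)-th line buffer"},
]

for s in stages:
    print(f'{s["crossbar"]} crossbar -> {s["feeds"]} '
          f'({s["kernels"]} individual line buffers, one per kernel)')
```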
It should be noted that if the ith data element of the second matrix is located in the last column of the second matrix and the (i+1)th data element is located at the start of the row following the row of the ith data element, this indicates that the sliding window corresponding to the first operation slides in a row-first manner and has reached the end of a row of the fourth matrix upon computing the ith data element of the second matrix. Similarly, if the ith data element of the second matrix is located in the last row of the second matrix and the (i+1)th data element is located at the start of the column following the column of the ith data element, this indicates that the sliding window corresponding to the first operation slides in a column-first manner and has reached the end of a column of the fourth matrix upon computing the ith data element of the second matrix.
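As an illustration (our own helper, not part of the patent), the row-first wrap condition can be written as a predicate on the 1-based, row-major index of the output element:

```python
def wraps_after(i: int, out_cols: int) -> bool:
    """True when the i-th (1-based, row-major) output element lies in the
    last column of the second matrix, so the sliding window must wrap to
    the start of the next row of the fourth matrix after producing it."""
    return i % out_cols == 0

# For a 3-column second matrix, elements 3, 6 and 9 end their rows.
assert [i for i in range(1, 10) if wraps_after(i, 3)] == [3, 6, 9]
```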
Taking the case where the sliding window slides along the rows of the fourth matrix as an example: when the sliding window reaches the end of a row of the fourth matrix, the second line buffer needs to read in (s-1) × (W + p) + (w-1) data elements before the first calculating unit 310 can perform the (i+1)-th first operation. The sliding window's line-wrapping process may therefore introduce some idle cycles in which the first computing unit 310 feeds no data elements into the first line buffer 410; this is referred to as the Line Feeding bottleneck in the embodiments of the present application. To alleviate the Line Feeding bottleneck, the embodiments of the present application use these idle cycles to pad 0 elements into the first line buffer 410, preparing for the next second operation of the second computing unit 330.
In some embodiments, t is greater than (s-1) × (W + p) + (w-1).
In some embodiments, t = (s-1) × (W + p) + (w-1), and the control unit 340 may be specifically configured to control the first line buffer 410 to sequentially store (s-1) × (W + p) + (w-1) 0 elements in the (n+1)-th to the (n+t)-th clock cycles, s representing the sliding step size of the first operation.
As indicated above, when the sliding window slides along a row of the fourth matrix and reaches the end of that row, the second line buffer needs to read in (s-1) × (W + p) + (w-1) data elements before the first calculating unit 310 can perform the next first operation. The embodiment of the present application sets t = (s-1) × (W + p) + (w-1), which means that the control unit 340 uses exactly (s-1) × (W + p) + (w-1) clock cycles to supply the second line buffer with the data required for the next first operation, i.e., it controls the second line buffer to read 1 new data element in each clock cycle from the (n+1)-th to the (n+t)-th clock cycle. To alleviate the Line Feeding bottleneck to the maximum extent, in the (n+1)-th to the (n+t)-th clock cycles the control unit 340 may control the first line buffer 410 to sequentially read in (s-1) × (W + p) + (w-1) 0 elements, thereby maximizing the efficiency of the neural network calculation.
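The idle-cycle count can be restated as a one-line helper (a sketch using this section's notation):

```python
def line_feed_cycles(s: int, w: int, W: int, p: int) -> int:
    """Clock cycles needed for the sliding window to wrap to its next
    row: t = (s-1)*(W+p) + (w-1)."""
    return (s - 1) * (W + p) + (w - 1)

# With stride s = 1 and a 3x3 kernel over a 3-column matrix (p = 2),
# each line feed costs 2 idle cycles, as in the walk-through below.
assert line_feed_cycles(s=1, w=3, W=3, p=2) == 2
```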
For ease of understanding, the pipeline control mechanism of the control unit 340 is described in detail below, taking as an example the case where the K-th and (K+1)-th neural network layers are convolutional layers and the first and second operations are convolution operations. In the following embodiments, since the first calculation unit performs the operations of the K-th neural network layer and is a crossbar array, it is referred to as the K-th crossbar array, and the second line buffer that provides the first matrix for the first calculation unit is referred to as the K-th line buffer. Similarly, since the second calculation unit performs the operations of the (K+1)-th neural network layer and is a crossbar array, it is referred to as the (K+1)-th crossbar array, and the first line buffer that provides the second matrix for the second calculation unit is referred to as the (K+1)-th line buffer.
Step one, to ensure that the dimensions of the first matrix input to the K-th crossbar array and of the second matrix output by the K-th crossbar array are consistent, the multiplexer (MUX) in the K-th line buffer is controlled so that the K-th line buffer reads in the 0 elements of the first p/2 rows, each row containing W + p 0 elements.
Step two, sequentially read the data elements of the first h-1-p/2 rows of the first matrix, padding p/2 0 elements before the head and p/2 0 elements after the tail of each row of the first matrix.
Step three, continue reading the p/2 0 elements at the head of the h-th row of the first matrix and its first w-1-p/2 data elements, keeping the K-th crossbar array in the sleep state throughout.
Step four, continue reading the subsequent data elements of the h-th row of the first matrix. From this point on, each time one more data element is read in, the K-th line buffer holds exactly the data elements required for one convolution operation; the K-th crossbar array then performs one convolution operation and outputs the result to the (K+1)-th line buffer.
Step five, when the K-th crossbar array has computed to the end of a row of the fourth matrix (the matrix obtained after padding the first matrix with 0), (s-1) × (W + p) + (w-1) data elements must be prepared for the next convolution operation, where s is the sliding step. While the data of the next row is being prepared, the K-th crossbar array is controlled to stay in the Sleep state. During this time, the (K+1)-th line buffer cannot read in intermediate data computed by the K-th crossbar array; this phenomenon is referred to as the Line Feeding bottleneck in the embodiments of the present application.
To solve the Line Feeding problem, while the data required for the first convolution operation after the line feed is being prepared for the K-th crossbar array, the (K+1)-th line buffer can be controlled to read in 0 elements during these idle clock cycles, so that no clock cycles are wasted when no valid data elements flow in from the upper-layer crossbar array. This pipeline control mechanism improves the efficiency of the neural network calculation while ensuring that the whole pipeline occupies the fewest clock cycles.
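The rule can be summarized in a self-contained behavioral sketch (our simplification, not the patent's circuit): while layer K prepares its line feed and emits nothing, layer K+1's line buffer consumes the idle cycles by padding 0 elements.

```python
def k1_buffer_actions_per_row(W: int, w: int, p: int, s: int = 1) -> list[str]:
    """Actions of the (K+1)-th line buffer across one output row of the
    K-th crossbar array followed by that array's line feed."""
    actions = ["read_valid_element"] * W                     # W valid outputs
    actions += ["pad_zero"] * ((s - 1) * (W + p) + (w - 1))  # idle cycles reused
    return actions

print(k1_buffer_actions_per_row(W=3, w=3, p=2))
# ['read_valid_element'] x 3 followed by ['pad_zero'] x 2
```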
The K-th and (K+1)-th neural network layers above form a convolution-convolution connection structure. Besides the convolution-convolution structure, another common connection structure of neural network layers is the convolution-pooling-convolution structure, for which the pipeline control mechanism follows essentially the same principles as described above. Specifically, consider the convolution-pooling part of the convolution-pooling-convolution structure. Since the step size s of the pooling operation is generally greater than 1, the computing unit of the pooling layer (which may also be called a pooling circuit and need not be a crossbar array; for example, max-pooling may use an NVM multi-way comparator, while mean-pooling may use a crossbar array) can be switched to the Active state to perform one pooling operation each time s data elements flow into the line buffer corresponding to the pooling layer. In other words, the pooling unit may need to sleep for s-1 clock cycles per operation, waiting for s data elements to flow in from the convolutional layer. When the sliding window of the convolutional layer wraps to the next row of its zero-padded matrix, the Line Feeding problem of the convolution-convolution structure also arises: during the following w-1 clock cycles, the convolutional layer outputs no valid data to the pooling layer's line buffer. Since the pooling operation usually requires no zero padding, in these w-1 clock cycles the pooling layer's line buffer reads in no valid data and the pooling unit stays in the sleep state. Likewise, while the convolutional layer is in the line-feed process, the pooling layer's line buffer must prepare s-1 rows of data elements before computation can start, so the pooling unit may be in the sleep state for (s-1) × W clock cycles. For the pooling-convolution part, the pipeline control mechanism of the control unit is basically the same as that of the convolution-convolution structure, to whose description reference may be made; it is not detailed again here.
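The sleep bookkeeping described above for the convolution-pooling link can be restated in a small sketch (our own summary of the text):

```python
def pooling_idle_cycles(s: int, w: int, W: int) -> dict:
    """Sleep intervals of the pooling unit described above."""
    return {
        "between_ops": s - 1,           # wakes once per s inflowing elements
        "conv_row_wrap": w - 1,         # conv layer emits nothing while wrapping
        "conv_line_feed": (s - 1) * W,  # s-1 rows must accumulate before restart
    }

print(pooling_idle_cycles(s=2, w=3, W=3))
# {'between_ops': 1, 'conv_row_wrap': 2, 'conv_line_feed': 3}
```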
The embodiments of the present application will be described in more detail below with reference to specific examples. It should be noted that the examples of fig. 8 to 9 are only for assisting the skilled person in understanding the embodiments of the present application, and are not intended to limit the embodiments of the present application to the specific values or specific scenarios illustrated. It will be apparent to those skilled in the art that various equivalent modifications or variations are possible in light of the examples given in fig. 8-9, and such modifications or variations are intended to be included within the scope of the embodiments of the present application.
Taking the convolution-convolution type neural network connection structure as an example, assume that the output of the (K-1)-th layer's crossbar array for convolution computation has dimension 3x3, the convolution kernel size of each layer is 3x3, and the sliding step s is 1. To ensure that the input matrix (or input feature map) and the output matrix (or output feature map) of the K-th layer's crossbar array have the same size, zeros need to be padded around the periphery of the original output feature map of the (K-1)-th layer's crossbar array. As shown in fig. 8, the input matrix to be convolved at the K-th layer has dimension 3x3 and needs zero padding around its periphery, giving an input matrix of dimension 5x5. This input matrix may be the computed output of the (K-1)-th layer's crossbar array, or original input data (such as raw image, sound, or text data).
Since the original size of the input matrix is 3x3 and the convolution kernel is 3x3, the perimeter of the original input matrix needs to be padded with one ring of zeros to keep the input and output dimensions consistent. The register length N of the line buffer proposed in the embodiment of the present application is then N = (h-1) × (W + p) + w = (3-1) × (3+2) + 3 = 13. If all intermediate result data had to be stored in the conventional way, the memory size required for each neural network layer would be 5 × 5 = 25. The working principle of the line buffer provided by the embodiment of the present application is as follows: the data to be calculated flow into the line buffer sequentially in a row-read fashion, as shown in fig. 8. In the next clock cycle, the old data in the line buffer are shifted back one position in turn, and the MUX in the line buffer is controlled to write in the value at row 3, column 4 of the zero-padded input matrix, i.e., element (2,3). Meanwhile, the data elements in the line buffer that are due for convolution calculation (the data elements stored in the registers within the dashed frame of fig. 8) are read directly into the subsequent crossbar array for calculation. The path of a data element from being read into the line buffer to flowing into the subsequent crossbar array has very low computational cost and can typically be completed within one clock cycle.
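The storage-state timing can be reproduced with a short sketch (ours): stream the 5x5 padded matrix row-first and report the cycles at which a complete 3x3 window is resident in the 13 registers.

```python
def window_ready_cycles(h: int, w: int, rows: int, cols: int) -> list[int]:
    """1-based row-major stream positions at which an h x w sliding
    window of a rows x cols (padded) matrix becomes fully resident."""
    return sorted((i + h - 1) * cols + (j + w - 1) + 1
                  for i in range(rows - h + 1)
                  for j in range(cols - w + 1))

print(window_ready_cycles(3, 3, 5, 5))
# [13, 14, 15, 18, 19, 20, 23, 24, 25] -- no window completes at
# elements 16-17, which is exactly when the crossbar array sleeps.
```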
Based on the line buffer operating principle described with fig. 8, the pipeline control mechanism proposed in the embodiment of the present application is described by taking a convolution-convolution type neural network structure as an example. Assume that the calculation results output from each crossbar array all have dimension 3x3 and the convolution kernel size is 3x3; zero padding around the periphery of each crossbar array's output ensures that the input and output dimensions of each convolutional layer are consistent. Referring to the convolution-convolution type neural network connection structure shown in fig. 9, the pipeline control method proposed in the present application is as follows:
Step one, using 5 clock cycles, the multiplexer MUX is controlled to read the zero-padding data of the first row into the K-th line buffer; during this time the K-th layer crossbar array is in the sleep state.
Step two, using 1 clock cycle, continue reading the first 0 element at the head of the second row; the K-th layer crossbar array remains in the sleep state.
Step three, the MUX continues to be controlled to sequentially read the calculation output results (1,1), (1,2), (1,3) from the (K-1)-th layer crossbar array and the last 0 element of the second row (4 clock cycles in total); the K-th layer crossbar array remains in the sleep state. In the 4 clock cycles starting from the reading of the first valid calculation result (1,1) from the (K-1)-th layer crossbar array, the (K+1)-th line buffer uses these 4 cycles to read in advance the first 4 zero-padding elements of the first row of the K-th layer crossbar array's output matrix; at this time, the (K+1)-th layer crossbar array is in the sleep state.
Step four, the K-th line buffer continues to read the first 0 element at the head of the third row, and then reads the third-row, first-column calculation output result (2,1) from the (K-1)-th layer crossbar array (taking 2 clock cycles). During these two clock cycles, the K-th crossbar array is still in the sleep state. Meanwhile, in these 2 clock cycles, the (K+1)-th line buffer continues to read the 0 element at the end of the first row and the 0 element at the head of the second row, and the (K+1)-th layer crossbar array is in the sleep state.
Step five, in the next clock cycle, the K-th line buffer reads the third-row, second-column calculation output result (2,2) from the (K-1)-th layer crossbar array. As shown in fig. 8, the K-th line buffer now holds the data for one convolution operation of the K-th crossbar array. Because the crossbar array performs analog computation, within this clock cycle the buffered data can flow in sequence into the K-th crossbar array for one multiply-accumulate (convolution) calculation, and the calculation result (1,1) flows into the (K+1)-th line buffer. At this time, the K-th crossbar array is in the active state and the (K+1)-th crossbar array is in the sleep state.
Step six, in the following two clock cycles, the valid output (2,3) of the third row, third column is read from the (K-1)-th layer crossbar array and the 0 element at the end of the third row is read in; the K-th crossbar array performs two convolution operations, and the output results (1,2) and (1,3) flow into the (K+1)-th line buffer. During these two clock cycles, the K-th crossbar array is in the active state performing convolution operations, while the (K+1)-th crossbar array is still in the sleep state, waiting for the (K+1)-th line buffer to buffer the data required for one convolution operation.
Step seven, in the next two clock cycles, the K-th line buffer first buffers the 0 element at the head of the fourth row and then receives the calculation output result (3,1) from the (K-1)-th layer crossbar array. During these two clock cycles, the K-th crossbar array is in the sleep state, waiting for the K-th line buffer to buffer the data required for one convolution operation. In these two cycles, since the K-th crossbar array has no valid calculation output flowing into the (K+1)-th line buffer, the (K+1)-th line buffer is in the Line Feeding state and the (K+1)-th crossbar array is in the sleep state. To solve the Line Feeding problem of the (K+1)-th line buffer in these two clock cycles, the MUX of the (K+1)-th line buffer is controlled to use them to perform the zero-padding operation at the end of the second row and the head of the next row of the K-th crossbar array's output matrix, which resolves the problem that no valid data flows from the K-th crossbar array into the (K+1)-th line buffer during these two cycles.
Step eight, in the next clock cycle, the K-th line buffer reads the inflowing data (3,2) from the (K-1)-th layer crossbar array; the K-th crossbar array switches from the sleep state to the active state to perform one multiply-accumulate (convolution) calculation, and the calculation result (2,1) flows into the (K+1)-th line buffer. At this time, the (K+1)-th crossbar array is in the sleep state.
Step nine, in the next clock cycle, the K-th line buffer receives the output calculation result (3,3) from the (K-1)-th layer crossbar array; the K-th crossbar array is in the active state, performs one convolution calculation, and the calculation output result (2,2) flows into the (K+1)-th line buffer. At this time, the (K+1)-th crossbar array switches from the sleep state to the active state, performs one convolution calculation, and flows its result into the next-layer ((K+2)-th) line buffer in the same manner.
Method embodiments of the present application are described below, which correspond to apparatus embodiments, and therefore reference may be made to the foregoing apparatus embodiments for those parts not described in detail.
Fig. 10 is a schematic flow chart of a calculation method for neural network calculation according to an embodiment of the present application. The neural network comprises a K-th neural network layer and a (K+1)-th neural network layer; the operations performed by the K-th neural network layer include a first operation, and the operations performed by the (K+1)-th neural network layer include a second operation, where K is a positive integer not less than 1. The computing device to which the computing method is applied includes: a first calculation unit, configured to perform the first operation M times on an input first matrix to obtain a second matrix, M being a positive integer not less than 1; and a second calculation unit, configured to perform the second operation on the input second matrix. The calculation method of fig. 10 includes:
1010. controlling the first computing unit to execute the ith first operation in the M first operations on the first matrix to obtain the ith data element of the second matrix, wherein i is more than or equal to 1 and less than or equal to M;
1020. storing the ith data element of the second matrix into a first storage unit;
1030. if the data element currently stored in the first storage unit can be used for executing the second operation once, controlling the second computing unit to execute the second operation once;
wherein the first operation is a convolution operation and the second operation is a convolution operation or a pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.
Optionally, in some embodiments, the computing device includes the first storage unit, the first storage unit includes a first line buffer, and the first line buffer includes N registers. The N registers in the first line buffer sequentially store, in a row-first or column-first manner, each element of a third matrix, the third matrix being the matrix obtained after the second matrix is padded with 0 for performing the second operation on the second matrix, where N = (h-1) × (W + p) + w, h denotes the number of rows of the kernel corresponding to the second operation, w denotes the number of columns of the kernel corresponding to the second operation, W denotes the number of columns of the second matrix, and p denotes the number of rows or columns of 0 elements with which the second matrix needs to be padded for performing the second operation on the second matrix, where h, w, p, W and N are all positive integers not less than 1.
Optionally, in some embodiments, the second computing unit is a crossbar array, and X target registers among the N registers are directly connected to X rows of the second computing unit, respectively, where the X target registers are the (1 + k × (W + p))-th to the (w + k × (W + p))-th registers of the N registers, k is an integer taking values from 0 to h-1, and X = h × w. Step 1020 may include: storing the ith data element of the second matrix in the first line buffer. Step 1030 may include: if the data elements currently stored in the X target registers can be used to perform one second operation, controlling the second computing unit to work and perform one second operation on the data elements stored in the X target registers.
Optionally, in some embodiments, step 1010 may include: in an nth clock cycle, controlling the first computing unit to execute the ith first operation on the first matrix to obtain an ith data element of the second matrix, where the ith data element of the second matrix is located in a last column of the second matrix, and an (i + 1) th data element of the second matrix is located at a starting position of a row next to a row where the ith data element is located, or the ith data element of the second matrix is located in a last row of the second matrix, and an (i + 1) th data element of the second matrix is located at a starting position of a column next to a column where the ith data element is located; the calculation method of fig. 10 may further include: controlling the first computing unit to execute the (i + 1) th first operation in the M first operations on the first matrix in the (n + t) th clock cycle, wherein t is a positive integer greater than 1; and controlling the first line buffer to store a 0 element in at least one clock cycle from the n +1 clock cycle to the n + t clock cycle.
Optionally, in some embodiments, where t = (s-1) × (W + p) + (w-1), the controlling the first line buffer to store 0 elements in at least one clock cycle between the (n+1)-th clock cycle and the (n+t)-th clock cycle includes: controlling the first line buffer to sequentially store (s-1) × (W + p) + (w-1) 0 elements from the (n+1)-th clock cycle to the (n+t)-th clock cycle, where s represents the sliding step of the first operation.
Optionally, in some embodiments, the first computational unit is a crossbar array.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A computing device for neural network computing, wherein the neural network comprises a Kth neural network layer and a K +1 th neural network layer, wherein operations performed by the Kth neural network layer comprise first operations, and operations performed by the K +1 th neural network layer comprise second operations, wherein K is a positive integer not less than 1,
the computing device includes:
the first calculation unit is used for executing the first operation for M times on the input first matrix to obtain a second matrix, and M is a positive integer not less than 1;
a second calculation unit configured to perform the second operation on the input second matrix;
a control unit for:
controlling the first computing unit to execute the ith first operation in the M first operations on the first matrix to obtain the ith data element of the second matrix, wherein i is more than or equal to 1 and less than or equal to M;
storing the ith data element of the second matrix into a first storage unit;
before the K neural network layer calculation is completed, once the data elements currently stored by the first storage unit can be used for executing the second operation once, controlling the second calculation unit to execute the second operation once;
wherein the first operation is a convolution operation and the second operation is a convolution operation or a pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.
2. The computing device of claim 1, wherein the computing device includes the first storage unit, the first storage unit includes a first line buffer, the first line buffer includes N registers, the N registers in the first line buffer sequentially store each element of a third matrix in a row-first or column-first manner, the third matrix is a matrix obtained by padding the second matrix with 0 for performing the second operation on the second matrix, where N = (h-1) × (W + p) + w, h denotes a number of rows of the kernel corresponding to the second operation, w denotes a number of columns of the kernel corresponding to the second operation, W denotes a number of columns of the second matrix, and p denotes a number of rows or columns of 0 elements with which the second matrix needs to be padded for performing the second operation on the second matrix, wherein h, w, p, W and N are positive integers not less than 1.
3. The computing device of claim 2, wherein the second computing unit is a crossbar array, X target registers of the N registers are directly connected to X rows of the second computing unit, respectively, the X target registers being the (1 + k × (W + p))-th to the (w + k × (W + p))-th registers of the N registers, where k is an integer taking values from 0 to h-1, and X = h × w;
the control unit is specifically configured to:
storing an ith data element of the second matrix in the first line buffer;
and once the data elements currently stored in the X target registers can be used for executing a second operation once, controlling the second computing unit to work, and executing the second operation once on the data elements stored in the X target registers.
4. The computing device of claim 2, wherein the control unit is specifically to:
in an nth clock cycle, controlling the first computing unit to execute the ith first operation on the first matrix to obtain an ith data element of the second matrix, where the ith data element of the second matrix is located in a last column of the second matrix, and an (i + 1) th data element of the second matrix is located at a starting position of a row next to a row where the ith data element is located, or the ith data element of the second matrix is located in a last row of the second matrix, and an (i + 1) th data element of the second matrix is located at a starting position of a column next to a column where the ith data element is located;
the control unit is further configured to:
controlling the first computing unit to execute the (i + 1) th first operation in the M first operations on the first matrix in the (n + t) th clock cycle, wherein t is a positive integer greater than 1;
and controlling the first line buffer to store a 0 element in at least one clock cycle from the n +1 clock cycle to the n + t clock cycle.
5. The computing device of claim 4, wherein t = (s-1) × (W + p) + (w-1),
the control unit is specifically configured to:
and controlling the first line buffer to sequentially store (s-1) × (W + p) + (w-1) 0 elements from the (n+1)-th clock cycle to the (n+t)-th clock cycle, wherein s represents the sliding step of the first operation.
6. The computing device of any of claims 1-5, wherein the first computing unit is a crossbar array.
7. A computing method for neural network computation, wherein the neural network includes a K-th neural network layer and a K + 1-th neural network layer, and wherein operations performed by the K-th neural network layer include first operations and operations performed by the K + 1-th neural network layer include second operations, where K is a positive integer not less than 1, and a computing device to which the computing method is applied includes:
the first calculation unit is used for executing the first operation for M times on the input first matrix to obtain a second matrix, and M is a positive integer not less than 1;
a second calculation unit configured to perform the second operation on the input second matrix;
the calculation method comprises the following steps:
controlling the first computing unit to execute the ith first operation in the M first operations on the first matrix to obtain the ith data element of the second matrix, wherein i is more than or equal to 1 and less than or equal to M;
storing the ith data element of the second matrix into a first storage unit;
before the K neural network layer calculation is completed, once the data elements currently stored by the first storage unit can be used for executing the second operation once, controlling the second calculation unit to execute the second operation once;
wherein the first operation is a convolution operation and the second operation is a convolution operation or a pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.
8. The computing method of claim 7, wherein the computing device includes the first storage unit, the first storage unit includes a first line buffer, the first line buffer includes N registers, the N registers in the first line buffer sequentially store each element of a third matrix in a row-first or column-first manner, the third matrix is a matrix obtained by padding the second matrix with 0 for performing the second operation on the second matrix, where N = (h-1) × (W + p) + w, h denotes a number of rows of the kernel corresponding to the second operation, w denotes a number of columns of the kernel corresponding to the second operation, W denotes a number of columns of the second matrix, and p denotes a number of rows or columns of 0 elements with which the second matrix needs to be padded for performing the second operation on the second matrix, wherein h, w, p, W and N are positive integers not less than 1.
9. The computing method of claim 8, wherein the second computing unit is a crossbar array, X target registers of the N registers are directly connected to X rows of the second computing unit, respectively, the X target registers being the (1 + k × (W + p))-th to the (w + k × (W + p))-th registers of the N registers, where k is an integer taking values from 0 to h-1, and X = h × w;
the storing the ith data element of the second matrix into the first storage unit includes:
storing an ith data element of the second matrix in the first line buffer;
before the K neural network layer calculation is completed, once the data element currently stored by the first storage unit can be used for executing a second operation, controlling the second calculation unit to execute the second operation once, including:
and once the data elements currently stored in the X target registers can be used for executing a second operation once, controlling the second computing unit to work, and executing the second operation once on the data elements stored in the X target registers.
10. The computing method of claim 8, wherein said controlling the first computing unit to perform an ith first operation of the M first operations on the first matrix comprises:
in an nth clock cycle, controlling the first computing unit to execute the ith first operation on the first matrix to obtain an ith data element of the second matrix, where the ith data element of the second matrix is located in a last column of the second matrix, and an (i + 1) th data element of the second matrix is located at a starting position of a row next to a row where the ith data element is located, or the ith data element of the second matrix is located in a last row of the second matrix, and an (i + 1) th data element of the second matrix is located at a starting position of a column next to a column where the ith data element is located;
the calculation method further comprises:
controlling the first computing unit to execute the (i + 1) th first operation in the M first operations on the first matrix in the (n + t) th clock cycle, wherein t is a positive integer greater than 1;
and controlling the first line buffer to store a 0 element in at least one clock cycle from the n +1 clock cycle to the n + t clock cycle.
11. The calculation method according to claim 10, wherein t = (s-1) × (W + p) + (w-1),
said controlling said first line buffer to store a 0 element in at least one clock cycle between said n +1 clock cycle and said n + t clock cycle, comprising:
and controlling the first line buffer to sequentially store (s-1) × (W + p) + (w-1) 0 elements from the (n+1)-th clock cycle to the (n+t)-th clock cycle, wherein s represents the sliding step of the first operation.
12. The computing method of any of claims 7-11, wherein the first computing unit is a crossbar array.
CN201710025196.3A 2017-01-13 2017-01-13 Computing device and computing method for neural network computing Active CN108304922B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201710025196.3A CN108304922B (en) 2017-01-13 2017-01-13 Computing device and computing method for neural network computing
PCT/CN2017/115038 WO2018130029A1 (en) 2017-01-13 2017-12-07 Calculating device and calculation method for neural network calculation
EP17891227.5A EP3561737B1 (en) 2017-01-13 2017-12-07 Calculating device and calculation method for neural network calculation
US16/511,560 US20190340508A1 (en) 2017-01-13 2019-07-15 Computing Device and Computation Method for Neural Network Computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710025196.3A CN108304922B (en) 2017-01-13 2017-01-13 Computing device and computing method for neural network computing

Publications (2)

Publication Number Publication Date
CN108304922A CN108304922A (en) 2018-07-20
CN108304922B true CN108304922B (en) 2020-12-15

Family

ID=62839750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710025196.3A Active CN108304922B (en) 2017-01-13 2017-01-13 Computing device and computing method for neural network computing

Country Status (4)

Country Link
US (1) US20190340508A1 (en)
EP (1) EP3561737B1 (en)
CN (1) CN108304922B (en)
WO (1) WO2018130029A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402527B2 (en) 2017-01-04 2019-09-03 Stmicroelectronics S.R.L. Reconfigurable interconnect
CN207517054U (en) 2017-01-04 2018-06-19 意法半导体股份有限公司 Crossfire switchs
JP6805984B2 (en) * 2017-07-06 2020-12-23 株式会社デンソー Convolutional neural network
US10755215B2 (en) * 2018-03-22 2020-08-25 International Business Machines Corporation Generating wastage estimation using multiple orientation views of a selected product
CN110210610B (en) * 2018-03-27 2023-06-20 腾讯科技(深圳)有限公司 Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN109165730B (en) * 2018-09-05 2022-04-26 电子科技大学 State quantization network implementation method in cross array neuromorphic hardware
CN109740732B (en) * 2018-12-27 2021-05-11 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method and related equipment
CN109902822B (en) * 2019-03-07 2021-04-06 北京航空航天大学合肥创新研究院 Memory computing system and method based on Sgimenk track storage
CN109948790A (en) * 2019-03-27 2019-06-28 苏州浪潮智能科技有限公司 A kind of Processing with Neural Network method, apparatus, equipment and storage medium
JP7208528B2 (en) * 2019-05-23 2023-01-19 富士通株式会社 Information processing device, information processing method and information processing program
CN110490312B (en) * 2019-07-10 2021-12-17 瑞芯微电子股份有限公司 Pooling calculation method and circuit
US20210073317A1 (en) * 2019-09-05 2021-03-11 International Business Machines Corporation Performing dot product operations using a memristive crossbar array
US11769043B2 (en) * 2019-10-25 2023-09-26 Samsung Electronics Co., Ltd. Batch size pipelined PIM accelerator for vision inference on multiple images
CN112749778B (en) * 2019-10-29 2023-11-28 北京灵汐科技有限公司 Neural network mapping method and device under strong synchronization
US11372644B2 (en) * 2019-12-09 2022-06-28 Meta Platforms, Inc. Matrix processing instruction with optional up/down sampling of matrix
US11593609B2 (en) 2020-02-18 2023-02-28 Stmicroelectronics S.R.L. Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks
CN111814983B (en) * 2020-03-04 2023-05-30 中昊芯英(杭州)科技有限公司 Data processing method, device, chip and computer readable storage medium
US11500680B2 (en) * 2020-04-24 2022-11-15 Alibaba Group Holding Limited Systolic array-friendly data placement and control based on masked write
US11562240B2 (en) * 2020-05-27 2023-01-24 International Business Machines Corporation Efficient tile mapping for row-by-row convolutional neural network mapping for analog artificial intelligence network inference
US11531873B2 (en) 2020-06-23 2022-12-20 Stmicroelectronics S.R.L. Convolution acceleration with embedded vector decompression
US11537890B2 (en) * 2020-09-09 2022-12-27 Microsoft Technology Licensing, Llc Compressing weights for distributed neural networks
CN113947200B (en) * 2021-12-21 2022-03-18 珠海普林芯驰科技有限公司 Acceleration calculation method of neural network, accelerator and computer-readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9077313B2 (en) * 2011-10-14 2015-07-07 Vivante Corporation Low power and low memory single-pass multi-dimensional digital filtering
KR20130090147A (en) * 2012-02-03 2013-08-13 안병익 Neural network computing apparatus and system, and method thereof
WO2015016640A1 (en) * 2013-08-02 2015-02-05 Ahn Byungik Neural network computing device, system and method
KR20150016089A (en) * 2013-08-02 2015-02-11 안병익 Neural network computing apparatus and system, and method thereof
CN105184366B (en) * 2015-09-15 2018-01-09 中国科学院计算技术研究所 A kind of time-multiplexed general neural network processor
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks
CN106228240B (en) * 2016-07-30 2020-09-01 复旦大学 Deep convolution neural network implementation method based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Efficient Gaussian filtering using Cascaded Prefix Sums;John A Robinson;《2012 19th IEEE International Conference on Image Processing》;20130221;117-120 *
Low power Convolutional Neural Networks on a chip;Yu Wang et.al;《2016 IEEE International Symposium on Circuits and Systems (ISCAS)》;20160811;129-132 *

Also Published As

Publication number Publication date
EP3561737A4 (en) 2020-02-19
WO2018130029A1 (en) 2018-07-19
EP3561737A1 (en) 2019-10-30
CN108304922A (en) 2018-07-20
EP3561737B1 (en) 2023-03-01
US20190340508A1 (en) 2019-11-07

Similar Documents

Publication Publication Date Title
CN108304922B (en) Computing device and computing method for neural network computing
EP3265907B1 (en) Data processing using resistive memory arrays
CN109034373B (en) Parallel processor and processing method of convolutional neural network
CN111915001B (en) Convolution calculation engine, artificial intelligent chip and data processing method
CN108629406B (en) Arithmetic device for convolutional neural network
CN110989920B (en) Energy efficient memory system and method
EP3844610B1 (en) Method and system for performing parallel computation
CN110580519B (en) Convolution operation device and method thereof
KR20220149729A (en) Counter-based multiplication using processing-in-memory
CN110807170A (en) Multi-sample multi-channel convolution neural network Same convolution vectorization implementation method
CN110796236A (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
KR20220154764A (en) Inference engine circuit architecture
CN112395092A (en) Data processing method and artificial intelligence processor
WO2021232422A1 (en) Neural network arithmetic device and control method thereof
WO2021168644A1 (en) Data processing apparatus, electronic device, and data processing method
JP6906622B2 (en) Arithmetic circuit and arithmetic method
CN110399976B (en) Computing device and computing method
CN116050492A (en) Expansion unit
KR20240036594A (en) Subsum management and reconfigurable systolic flow architectures for in-memory computation
Liu et al. Era-bs: Boosting the efficiency of reram-based pim accelerator with fine-grained bit-level sparsity
EP3859535A1 (en) Streaming access memory device, system and method
CN112712457A (en) Data processing method and artificial intelligence processor
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
US20230237307A1 (en) Systems and methods for a hardware neural network engine
EP4296900A1 (en) Acceleration of 1x1 convolutions in convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant