CN107578098B - Neural network processor based on systolic array - Google Patents

Neural network processor based on systolic array

Info

Publication number
CN107578098B
CN107578098B (application CN201710777741.4A)
Authority
CN
China
Prior art keywords
data
array
neural network
processing unit
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710777741.4A
Other languages
Chinese (zh)
Other versions
CN107578098A (en)
Inventor
Han Yinhe (韩银和)
Xu Haobo (许浩博)
Wang Ying (王颖)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201710777741.4A
Publication of CN107578098A
Application granted
Publication of CN107578098B
Legal status: Active

Landscapes

  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The invention provides a neural network processor comprising a control unit, a calculation unit, a data storage unit and a weight storage unit. Under the control of the control unit, the calculation unit acquires data and weights from the data storage unit and the weight storage unit, respectively, to carry out neural-network-related operations. The calculation unit comprises an array controller and a plurality of processing units connected as a systolic array; the data and the weights are fed into the systolic array formed by the processing units from different directions, and each processing unit processes the data passing through it simultaneously and in parallel. The neural network processor can therefore achieve a high processing speed; at the same time, because input data are reused many times, a higher operation throughput can be achieved while consuming less memory access bandwidth.

Description

Neural network processor based on systolic array
Technical Field
The present invention relates to neural network technology, and more particularly, to neural network processor architectures.
Background
Deep learning has made major breakthroughs in recent years, and neural network models trained by deep learning algorithms have achieved remarkable results in application fields such as image recognition, speech processing and intelligent robotics. A deep neural network models the neural connection structure of the human brain, describing data features through multiple layered transformation stages when processing signals such as images, sounds and text. As the complexity of neural networks keeps increasing, however, neural network technology suffers in practical applications from problems such as high resource occupation, low operation speed and high energy consumption. Using hardware accelerators in place of traditional software computation has therefore become an effective way to improve the computational efficiency of neural networks; such accelerators include implementations based on general-purpose graphics processors, special-purpose processor chips and field-programmable gate arrays (FPGAs).
However, a neural network processor is both computation-intensive and memory-access-intensive. On the one hand, a neural network model contains a large number of multiply-add operations and other nonlinear operations, requiring the processor to sustain high-load operation to meet the computational demands of the model; on the other hand, the neural network computation involves a large number of parameter iterations, and the computing unit must perform a large number of memory accesses, which greatly raises the bandwidth requirements of the processor design and increases memory access power consumption.
Therefore, there is a need for an improvement to existing neural network processors to improve the computational efficiency of the neural network processors and reduce the hardware overhead.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a systolic array-based neural network processor.
The purpose of the invention is realized by the following technical scheme:
according to an embodiment of the present invention, there is provided a neural network processor including a control unit, a calculation unit, a data storage unit, and a weight storage unit, the calculation unit acquiring data and weights from the data storage unit and the weight storage unit, respectively, under the control of the control unit, to perform neural network related operations;
the computing unit comprises an array controller and a plurality of processing units connected in a pulsating array mode, the array controller loads weights and data into the processing unit array from different directions, and each processing unit operates the received data and weights and transmits the data and weights to the next processing unit along different directions.
In the above technical solution, the processing unit array may be a one-dimensional systolic array or a two-dimensional systolic array.
In the above technical solution, the processing unit may include a data register, a weight register, a multiplier, and an accumulator;
wherein the weight register receives a weight from the preceding processing unit in the column direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in the column direction;
the data register receives data from one processing unit in the row direction of the processing unit array, sends the data to the multiplier and transmits the data to the next processing unit in the row direction;
the multiplier multiplies the input data and the weight; the output of the multiplier is fed into the accumulator, where it is either accumulated with the value already in the accumulator or added to the partial sum input signal, and the result is then provided as the partial sum output.
In the above technical solution, the array controller may load data from a row direction of the processing unit array, and load a weight from a column direction of the processing unit array.
In the above technical solution, the control unit may load the data sequence participating in the operation from the storage unit in the form of a row vector, and load the weight sequence corresponding to the data sequence in the form of a column vector.
In the above technical solution, the array controller may load the data sequence and the weight sequence into the corresponding rows and columns of the processing unit array in order of increasing row number and column number, respectively, with adjacent rows and adjacent columns entering the array 1 clock cycle apart, thereby ensuring that each weight and the data element it is to be multiplied with enter the processing unit array in the same clock cycle.
Compared with the prior art, the invention has the advantages that:
A systolic array structure is adopted in the computing unit of the neural network processor, which improves the operation efficiency of the neural network processor and relieves the bandwidth requirements of the processor design.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 shows a general topology of a neural network;
FIG. 2 shows a schematic block diagram of a neural network convolution operation;
FIG. 3 shows a schematic block diagram of a neural network processor architecture, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computational unit of a neural network processor, according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a computing unit of a neural network processor, according to yet another embodiment of the present invention;
FIG. 6 shows a schematic diagram of a processing unit in a systolic array architecture according to one embodiment of the present invention;
FIG. 7 shows a schematic diagram of a computing process of a computing unit according to an embodiment of the invention;
FIG. 8 is a diagram illustrating a neural network processor according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A neural network is a mathematical model inspired by the structure and behavior of the human brain, and is generally divided into an input layer, one or more hidden layers and an output layer. Each layer consists of a number of neuron nodes; the output values of the neuron nodes of one layer are passed as inputs to the neuron nodes of the next layer, connecting the layers one by one. Neural networks have biomimetic characteristics: their process of multi-layer abstract iteration resembles the information-processing mode of the human brain and other perceptual organs.
Fig. 1 shows a common topology of a neural network. The first-layer input of the multilayer neural network structure is the original image (in the present invention, "original image" refers to the original data to be processed, not merely a photograph in the narrow sense). Typically, for each layer of the neural network, the node values of the next layer are obtained by calculation from the neuron node values of the current layer (also referred to herein as data) and the corresponding weights. For example, suppose x1, x2, ..., xn are neuron nodes of one layer that are all connected to a node y of the next layer, and w1, w2, ..., wn are the weights of the corresponding connections; the value of y is then defined as y = x1*w1 + x2*w2 + ... + xn*wn, i.e., y = x · w. Thus, each layer of the neural network involves a large number of multiply-add-based convolution operations. The convolution operation in a neural network generally proceeds as shown in fig. 2: a two-dimensional weight convolution kernel of size K x K scans the feature map; during the scan, the inner product of the kernel weights and the corresponding feature elements of the feature map is computed at each position, and the inner-product values are summed to obtain one feature element of the output layer. When a convolutional layer has N feature layers, N convolution kernels of size K x K are convolved with the feature maps in that layer, and the N inner-product values are summed to obtain one output-layer feature element. As the complexity of neural networks keeps increasing, such calculations will undoubtedly consume a large amount of resources; neural network computations are therefore typically implemented with a dedicated neural network processor.
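To make the multiply-add pattern concrete, the following minimal Python sketch (added here for illustration; the array shapes and the function name are assumptions, not part of the patent text) computes one output feature element by scanning K x K kernels over N feature layers and summing the inner products:

    import numpy as np

    def conv_output_element(feature_maps, kernels, row, col):
        """Compute one output-layer feature element at position (row, col).

        feature_maps: shape (N, H, W) -- N input feature layers
        kernels:      shape (N, K, K) -- one K x K weight kernel per layer
        """
        n, k, _ = kernels.shape
        total = 0.0
        for ch in range(n):  # sum the N inner-product values over the feature layers
            patch = feature_maps[ch, row:row + k, col:col + k]
            total += np.sum(patch * kernels[ch])  # inner product of kernel and patch
        return total

    # Example: 3 feature layers, 3 x 3 kernels, one output element at (0, 0)
    fm = np.random.rand(3, 8, 8)
    ker = np.random.rand(3, 3, 3)
    print(conv_output_element(fm, ker, 0, 0))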
Common neural network processors are based on a memory-control-computation architecture. The storage structure is used for storing data participating in calculation, neural network weight, operation instructions of the processor and the like; the control structure is used for analyzing the operation instruction and generating a control signal to control the scheduling and storage of data in the processor and the calculation process of the neural network; the computation fabric is responsible for neural network computation operations. The storage unit can store data (for example, raw feature map data) transmitted from the outside of the neural network processor, trained neural network weights, processing results or intermediate results generated in the calculation process, instruction information involved in the calculation, and the like.
Fig. 3 shows a schematic structural diagram of a neural network processor 300 according to an embodiment of the present invention. As shown in fig. 3, the storage unit is further subdivided into an input data storage unit 311, a weight storage unit 312, an instruction storage unit 313 and an output data storage unit 314, wherein the input data storage unit 311 is used for storing data participating in calculation, for example, data including original feature map data and data participating in intermediate layer calculation; the weight storage unit 312 is used for storing the trained neural network weights; the instruction storage unit 313 is used for storing instruction information participating in calculation, and the instruction can be analyzed into a control flow by the control unit 320 to schedule calculation of the neural network; the output data storage unit 314 is used for storing the calculated neuron response values. By subdividing the storage units, data with substantially consistent data types can be centrally stored, so that a suitable storage medium can be selected, and operations such as data addressing can be simplified. It should be understood that the input data storage unit 311 and the output data storage unit 314 may also be the same storage unit.
The control unit 320 is responsible for instruction decoding, data scheduling, process control, and the like. For example, the instructions stored in the instruction storage unit are acquired and analyzed, and then the data are scheduled according to the control signals obtained by analysis and the calculation unit is controlled to perform the related operation of the neural network. In an embodiment of the present invention, the layer data participating in the neural network operation is divided into different regions, each region being used as a matrix, so that the operation between the data and the weights is divided into a plurality of matrix operations (for example, as shown in fig. 2). In this way, the control unit loads the weight sequence and the data sequence participating in the operation from the memory unit in the form of a row vector or a column vector suitable for the matrix operation.
One or more computing units (e.g., computing units 330, 331, etc.) may be included in the neural network processor, and each computing unit may perform a corresponding neural network computation according to a control signal from the control unit 320, acquire data from each storage unit, perform the computation, and write the computation result to the storage unit. The respective calculation units may have the same configuration or different configurations, and may perform the same calculation or different calculations. In one embodiment of the present invention, a computing unit is provided that includes an array controller and a plurality of processing units organized in a systolic array, each processing unit having the same internal structure. The array controller is responsible for loading data into the systolic array, each processing unit is responsible for data calculation, the weight is input from the top of the systolic array and is propagated from top to bottom, the data is input from the left side of the systolic array and is propagated from left to right, each processing unit calculates the received data and the weight, and the result is output from the right side of the systolic array. The systolic array may be a one-dimensional or two-dimensional structure. However, it should be understood that the neural network processor may also include other computing units, and the control unit may select different computing units to process data according to actual requirements.
Fig. 4 shows a schematic structural diagram of a computing unit in a neural network processor according to an embodiment of the present invention. As shown in fig. 4, the systolic array has a one-dimensional structure in which the processing units are connected in series. For a corresponding weight sequence and data sequence to be operated on, the array controller loads each weight of the weight sequence into a different processing unit and keeps it there until the last element of the corresponding data sequence has completed its calculation with that weight, after which the next group of weights is loaded. Meanwhile, each datum of the data sequence is loaded into the systolic array from the left side in turn, and the processed data is transferred from the other side of the systolic array back to the array controller. In such a computing unit, the first datum first enters the first processing unit, is processed, and is then passed to the next processing unit while the second datum enters the first processing unit. By the time the first datum arrives at the last processing unit, it has been processed multiple times. The systolic architecture therefore reuses each input datum many times, so that a higher operation throughput can be achieved with smaller memory access bandwidth consumption.
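The reuse argument can be made concrete with a small functional sketch (written for this description and abstracting away pipeline timing; it assumes each processing unit simply multiply-accumulates every datum that flows through it): each datum is fetched from memory once, yet takes part in one multiply-accumulate per processing unit.

    def systolic_1d_pass(weights, data):
        """Functional view of the 1-D pipeline: PE i holds weights[i];
        every datum flows through all PEs and is multiply-accumulated
        once per PE it visits."""
        memory_reads = 0
        macs = 0
        outputs = []
        for x in data:                # one memory fetch per datum
            memory_reads += 1
            acc = 0
            for w in weights:         # the datum visits each PE in turn
                acc += w * x          # one multiply-accumulate per PE visit
                macs += 1
            outputs.append(acc)       # the processed datum exits the far side
        return outputs, memory_reads, macs

    _, reads, macs = systolic_1d_pass([1, 2, 3, 4], list(range(8)))
    print(reads, macs)  # 8 reads drive 32 MACs: 4-fold reuse of every datum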
Fig. 5 shows a schematic structural diagram of a computing unit in a neural network processor according to yet another embodiment of the present invention. In this embodiment, the processing units are organized as a two-dimensional array with rows and columns, and each processing unit is connected only to its adjacent processing units, i.e., processing units communicate only with their neighbors. The array controller is responsible for scheduling the data: it controls the relevant data to be input into the processing units from the upper side and the left side of the systolic array of the computing unit, with different data entering from different directions. For example, the array controller controls the weights to be input from above the processing unit array and propagated from top to bottom along the column direction, while the data are input from the left side of the processing unit array and propagated from left to right along the row direction. The present invention does not restrict the input directions or the systolic propagation directions of the various computational elements; the terms "left", "right", "up", "down" and the like used herein refer only to the directions as illustrated in the figures and should not be construed as limiting the physical implementation of the present invention.
As noted above, in embodiments of the present invention the processing units within a computing unit are homogeneous and perform the same operations. Fig. 6 shows a schematic structural diagram of a processing unit according to an embodiment of the present invention. As shown in fig. 6, the input signals of the processing unit are data, weight and partial sum, and the output signals are data output, weight output and partial sum output. Internally, the processing unit mainly comprises a data register, a weight register, a multiplier and an accumulator. The weight input signal is connected to the weight register and the multiplier, the data input signal is connected to the data register and the multiplier, and the partial sum input signal is connected to the accumulator. The weight register can send its value to the multiplier for processing, or pass it directly to the processing unit below; the data register can likewise send its value to the multiplier for processing, or pass it directly to the next unit on the right. The input data and the weight are multiplied in the multiplier; the output of the multiplier is fed into the accumulator, where it is either accumulated with the value already in the accumulator or added to the partial sum input signal, and the result is then provided as the partial sum output. These operations and transfers can be configured flexibly in response to control signals from the array controller. For example, each processing unit may perform the following operations (a minimal behavioral sketch in code follows the list):
1) receive a datum from the preceding node in the row direction and a weight from the preceding node in the column direction of the systolic flow;
2) compute the product of the two values and accumulate it with the previously registered result;
3) save the accumulated value, output the input received from the row direction to the next node in the row, and output the input received from the column direction to the next node in the column.
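The following Python sketch mirrors these three operations (a behavioral model written for illustration, not the actual circuit; register and method names are assumptions):

    class ProcessingElement:
        """Behavioral model of one PE: data register, weight register,
        multiplier and accumulator, as in fig. 6."""

        def __init__(self):
            self.data_reg = 0    # datum received from the row neighbour
            self.weight_reg = 0  # weight received from the column neighbour
            self.acc = 0         # registered accumulation result

        def step(self, data_in, weight_in, psum_in=None):
            """One clock cycle of the PE."""
            # 1) receive a datum from the row direction and a weight
            #    from the column direction
            self.data_reg = data_in
            self.weight_reg = weight_in
            # 2) multiply the two values; accumulate with the registered
            #    result, or add the incoming partial sum if one is supplied
            prod = self.data_reg * self.weight_reg
            self.acc = prod + (self.acc if psum_in is None else psum_in)
            # 3) keep the accumulated value and forward the registered
            #    inputs to the next row / column neighbours
            return self.data_reg, self.weight_reg, self.acc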
In addition, for processing units organized as a one-dimensional array, the weights do not need to be propagated downward. After the array controller has loaded each element of the weight sequence to be processed into the weight register of its processing unit, the weight register therefore produces no output; the weight is instead held in the register for a period of time, and once the calculation tasks associated with these weights are finished, the weight registers are cleared and the weights to be processed next are loaded.
The calculation process of a calculation unit using a two-dimensional array structure according to an embodiment of the present invention is described below with reference to fig. 7, taking as an example the multiplication of a 3 x 3 data matrix by a 3 x 3 weight matrix:
Data matrix

    A = [ 3 4 2
          2 5 3
          3 2 5 ]

Weight matrix

    B = [ 3 4 2
          2 5 3
          3 2 5 ]

(element values as traced cycle by cycle in fig. 7 below)
The array controller controls the data and the weights to be input into the processing units from the left and from above the processing unit array, respectively. For example, the row vectors of matrix A enter the corresponding rows of the processing unit array in order of increasing row number, with adjacent row vectors entering the array 1 clock cycle apart; that is, the element in row i, column k of matrix A enters the processing unit array at the same time as the element in row i-1, column k+1. Likewise, the column vectors of matrix B enter the corresponding columns of the processing unit array in order of increasing column number, with adjacent column vectors entering the array 1 clock cycle apart; that is, the element in row k, column j of matrix B enters the processing unit array at the same time as the element in row k-1, column j+1. The data matrix A thus enters the systolic array by rows while, in parallel in time, the weight matrix B enters the processing unit array by columns, so that the corresponding elements A(i,k) and B(k,j) of the two matrices enter the processing unit array in the same clock cycle, until all elements of matrix A and matrix B have traversed the entire rows and columns of the processing unit array. This time alignment is guaranteed by the input control of the array controller, which determines when each datum arrives at each processing unit. In this way the array controller directs data and weights into the systolic array of processing units from different directions, with the weights flowing from top to bottom and the data flowing from left to right. As the data flows, all processing units process the data flowing through them simultaneously and in parallel, so a high processing speed can be achieved. At the same time, once a datum has flowed through the processing unit array according to the predetermined dataflow pattern, all processing corresponding to that datum is finished and it need not be input again, which reduces memory access operations.
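One schedule consistent with this description (inferred from the staggering rule above and from the cycle-by-cycle trace that follows; the closed-form expression itself is not stated in the patent) is that A(i,k) and B(k,j) meet in PE(i,j) at cycle i + j + k - 2, counting from 1:

    def arrival_cycle(i, j, k):
        """Cycle (1-indexed) at which A[i][k] meets B[k][j] in PE(i, j),
        assuming row i of A and column j of B enter the array delayed by
        i - 1 and j - 1 cycles respectively."""
        return i + j + k - 2

    print(arrival_cycle(1, 1, 1))  # 1: A[1][1] meets B[1][1] in PE11 first
    print(arrival_cycle(3, 3, 3))  # 7: A[3][3] meets B[3][3] in PE33 last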
As shown in fig. 7, in the first cycle, data 3 and weight 3 enter processing unit PE11 and are multiplied inside that processing unit;
in the second cycle, the data 3 that entered PE11 from the left flows right into processing unit PE12, where weight 4 simultaneously enters from above, and the weight 3 that entered PE11 from above flows down into processing unit PE21, where data 2 simultaneously enters from the left;
in the third cycle, weight 3 flows into PE11 from above and data 2 flows into PE11 from the left; data 5 and weight 2 flow into PE21; data 4 and weight 5 flow into PE12; data 3 and weight 2 flow into PE13; data 2 and weight 4 flow into PE22; and data 3 and weight 3 flow into PE31.
In the fourth cycle, data 2 and weight 2 flow into PE12; data 4 and weight 3 flow into PE13; data 3 and weight 3 flow into PE21; data 5 and weight 5 flow into PE22; data 2 and weight 2 flow into PE23; data 2 and weight 2 flow into PE31; and data 3 and weight 4 flow into PE32.
In the fifth cycle, data 2 and weight 5 flow into PE13; data 3 and weight 2 flow into PE22; data 5 and weight 3 flow into PE23; data 5 and weight 3 flow into PE31; data 2 and weight 5 flow into PE32; and data 3 and weight 2 flow into PE33.
In the sixth cycle, data 3 and weight 5 flow into PE23; data 5 and weight 2 flow into PE32; and data 2 and weight 3 flow into PE33, while the value 5 is passed on toward PE33.
In the seventh cycle, data 5 and weight 5 flow into PE33.
The multiplication results are accumulated in the column direction; that is, the multiplication result of PE11 is transmitted to PE21 for accumulation, and the accumulated result is then transmitted to PE31 for further accumulation.
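The whole example can be reproduced with a short simulation (a behavioral sketch written for this description, with each PE accumulating its own result locally as described for fig. 6; the dataflow skew follows the schedule given above, and the result is checked against an ordinary matrix product):

    import numpy as np

    def systolic_matmul(A, B):
        """Simulate the skewed dataflow: at (0-indexed) cycle t, PE(i, j)
        receives A[i][k] and B[k][j] with k = t - i - j, multiplies them
        and accumulates the product locally."""
        n = A.shape[0]
        C = np.zeros_like(A)
        for t in range(3 * n - 2):           # 7 cycles for two 3 x 3 matrices
            for i in range(n):
                for j in range(n):
                    k = t - i - j            # which operand pair meets here now
                    if 0 <= k < n:
                        C[i, j] += A[i, k] * B[k, j]
        return C

    A = np.array([[3, 4, 2], [2, 5, 3], [3, 2, 5]])
    B = np.array([[3, 4, 2], [2, 5, 3], [3, 2, 5]])
    assert np.array_equal(systolic_matmul(A, B), A @ B)  # matches fig. 7's example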
Fig. 8 is a schematic diagram showing an execution flow of a neural network processor using the above computing unit according to an example of the present invention. In step S1, the control unit addresses the instruction storage unit and reads and parses the instruction to be executed next; in step S2, input data are acquired from the storage unit according to the storage address obtained by parsing the instruction; in step S3, data and weights are loaded from the input storage unit and the weight storage unit, respectively, into the calculation unit described above; in step S4, the calculation unit executes the operations of the neural network computation; in step S5, the neural network calculation result is stored in the output storage unit.
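Steps S1-S5 can be summarized in a small driver sketch (all names here are illustrative assumptions; the storage units are modeled as dictionaries and the instruction format is hypothetical):

    from dataclasses import dataclass

    @dataclass
    class Instruction:            # hypothetical decoded-instruction format
        data_addr: int            # where the input data lives
        weight_addr: int          # where the weights live
        out_addr: int             # where the result is written

    def run_layer(instr, input_store, weight_store, output_store, compute):
        """Mirror of the flow of fig. 8; step S1 (instruction fetch and
        decode) is assumed to have produced `instr` already."""
        data = input_store[instr.data_addr]          # S2: fetch input data
        weights = weight_store[instr.weight_addr]    # S3: load data and weights
        result = compute(data, weights)              # S4: perform the operation
        output_store[instr.out_addr] = result        # S5: store the response values
        return result

    # Usable with the systolic_matmul sketch above:
    # out = {}; run_layer(Instruction(0, 0, 0), {0: A}, {0: B}, out, systolic_matmul)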
Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (4)

1. A neural network processor comprises a control unit, a calculation unit, a data storage unit and a weight storage unit, wherein the calculation unit respectively acquires data and weights from the data storage unit and the weight storage unit under the control of the control unit to perform neural network related operation;
the computing unit comprises an array controller and a plurality of processing units connected as a systolic array, the array controller loads weights and data into the processing unit array from different directions, and each processing unit operates on the received data and weights and passes them on to the next processing unit along their respective directions;
wherein the array of processing units is a two-dimensional systolic array;
wherein the processing unit comprises a data register, a weight register, a multiplier and an accumulator;
wherein the weight register receives a weight from the preceding processing unit in the column direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in the column direction;
the data register receives data from the preceding processing unit in the row direction of the processing unit array, sends the data to the multiplier and passes the data on to the next processing unit in the row direction;
the multiplier multiplies the input data and the weight; the output of the multiplier is fed into the accumulator to be added to the partial sum input signal, and the result is then provided as the partial sum output.
2. The neural network processor of claim 1, wherein the array controller loads data from a row direction of the array of processing units and loads weights from a column direction of the array of processing units.
3. The neural network processor of claim 1, wherein the control unit loads the data sequence participating in the operation from the storage unit in a row vector form, and loads the weight sequence corresponding to the data sequence in a column vector form.
4. The neural network processor of claim 3, wherein the array controller loads the data sequence and the weight sequence into the corresponding rows and columns of the processing unit array in order of increasing row number and column number, respectively, with adjacent rows and adjacent columns entering the array 1 clock cycle apart, ensuring that each weight and the corresponding data to be calculated enter the processing unit array in the same clock cycle.
CN201710777741.4A 2017-09-01 2017-09-01 Neural network processor based on systolic array Active CN107578098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710777741.4A CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710777741.4A CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Publications (2)

Publication Number Publication Date
CN107578098A CN107578098A (en) 2018-01-12
CN107578098B (en) 2020-10-30

Family

ID=61030459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710777741.4A Active CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Country Status (1)

Country Link
CN (1) CN107578098B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10459876B2 (en) * 2018-01-31 2019-10-29 Amazon Technologies, Inc. Performing concurrent operations in a processing element
CN108628799B (en) * 2018-04-17 2021-09-14 上海交通大学 Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
US11501140B2 (en) * 2018-06-19 2022-11-15 International Business Machines Corporation Runtime reconfigurable neural network processor core
WO2020034079A1 (en) * 2018-08-14 2020-02-20 深圳市大疆创新科技有限公司 Systolic array-based neural network processing device
CN109902795B (en) * 2019-02-01 2023-05-23 京微齐力(北京)科技有限公司 Artificial intelligent module and system chip with processing unit provided with input multiplexer
CN109919321A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Unit has the artificial intelligence module and System on Chip/SoC of local accumulation function
CN109902836A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 The failure tolerant method and System on Chip/SoC of artificial intelligence module
CN109902063B (en) * 2019-02-01 2023-08-22 京微齐力(北京)科技有限公司 System chip integrated with two-dimensional convolution array
CN109902064A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of chip circuit of two dimension systolic arrays
CN109885512B (en) * 2019-02-01 2021-01-12 京微齐力(北京)科技有限公司 System chip integrating FPGA and artificial intelligence module and design method
CN109933371A (en) * 2019-02-01 2019-06-25 京微齐力(北京)科技有限公司 Its unit may have access to the artificial intelligence module and System on Chip/SoC of local storage
CN109902835A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 Processing unit is provided with the artificial intelligence module and System on Chip/SoC of general-purpose algorithm unit
CN109919323A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Edge cells have the artificial intelligence module and System on Chip/SoC of local accumulation function
CN110348564B (en) * 2019-06-11 2021-07-09 中国人民解放军国防科技大学 SCNN reasoning acceleration device based on systolic array, processor and computer equipment
CN110211618B (en) * 2019-06-12 2021-08-24 中国科学院计算技术研究所 Processing device and method for block chain
CN110543934B (en) * 2019-08-14 2022-02-01 北京航空航天大学 Pulse array computing structure and method for convolutional neural network
CN110851779B (en) * 2019-10-16 2021-09-14 北京航空航天大学 Systolic array architecture for sparse matrix operations
CN110705703B (en) * 2019-10-16 2022-05-27 北京航空航天大学 Sparse neural network processor based on systolic array
KR20210060024A (en) * 2019-11-18 2021-05-26 에스케이하이닉스 주식회사 Memory device including neural network processing circuit
US20210150311A1 (en) * 2019-11-19 2021-05-20 Alibaba Group Holding Limited Data layout conscious processing in memory architecture for executing neural network model
CN111368988B (en) * 2020-02-28 2022-12-20 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity
FR3115136A1 (en) 2020-10-12 2022-04-15 Thales METHOD AND DEVICE FOR PROCESSING DATA TO BE PROVIDED AS INPUT OF A FIRST SHIFT REGISTER OF A SYSTOLIC NEURONAL ELECTRONIC CIRCUIT
CN112632464B (en) * 2020-12-28 2022-11-29 上海壁仞智能科技有限公司 Processing device for processing data
CN112862067B (en) * 2021-01-14 2022-04-12 支付宝(杭州)信息技术有限公司 Method and device for processing business by utilizing business model based on privacy protection
CN112836813B (en) * 2021-02-09 2023-06-16 南方科技大学 Reconfigurable pulse array system for mixed-precision neural network calculation
CN113393376A (en) * 2021-05-08 2021-09-14 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113870273B (en) * 2021-12-02 2022-03-25 之江实验室 Neural network accelerator characteristic graph segmentation method based on pulse array
CN113869507B (en) * 2021-12-02 2022-04-15 之江实验室 Neural network accelerator convolution calculation device and method based on pulse array
CN114675806B (en) * 2022-05-30 2022-09-23 中科南京智能技术研究院 Pulsation matrix unit and pulsation matrix calculation device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805303B2 (en) * 2015-05-21 2017-10-31 Google Inc. Rotating data for neural network computations
CN106529670B (en) * 2016-10-27 2019-01-25 中国科学院计算技术研究所 It is a kind of based on weight compression neural network processor, design method, chip
CN106650924B (en) * 2016-10-27 2019-05-14 中国科学院计算技术研究所 A kind of processor based on time dimension and space dimension data stream compression, design method
CN107085562B (en) * 2017-03-23 2020-11-03 中国科学院计算技术研究所 Neural network processor based on efficient multiplexing data stream and design method
CN107016175B (en) * 2017-03-23 2018-08-31 中国科学院计算技术研究所 It is applicable in the Automation Design method, apparatus and optimization method of neural network processor

Also Published As

Publication number Publication date
CN107578098A (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN107578098B (en) Neural network processor based on systolic array
US11734006B2 (en) Deep vision processor
TWI639119B (en) Adaptive execution engine for convolution computing systems cross-reference to related applications
CN107169560B (en) Self-adaptive reconfigurable deep convolutional neural network computing method and device
CN107578095B (en) Neural computing device and processor comprising the computing device
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
US11194549B2 (en) Matrix multiplication system, apparatus and method
KR101788829B1 (en) Convolutional neural network computing apparatus
CN111897579A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN112084038B (en) Memory allocation method and device of neural network
US11321607B2 (en) Machine learning network implemented by statically scheduled instructions, with compiler
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN108628799B (en) Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
Meng et al. Accelerating proximal policy optimization on cpu-fpga heterogeneous platforms
KR102610842B1 (en) Processing element and operating method thereof in neural network
CN108304925B (en) Pooling computing device and method
CN110580519B (en) Convolution operation device and method thereof
CN112084037A (en) Memory allocation method and device of neural network
Chen et al. Tight compression: compressing CNN model tightly through unstructured pruning and simulated annealing based permutation
Meng et al. Ppoaccel: A high-throughput acceleration framework for proximal policy optimization
Clere et al. FPGA based reconfigurable coprocessor for deep convolutional neural network training
Tu et al. Neural approximating architecture targeting multiple application domains
CN220773595U (en) Reconfigurable processing circuit and processing core
Duranton et al. A general purpose digital architecture for neural network simulations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant