CN107578098A - Neural network processor based on systolic arrays - Google Patents

Neural network processor based on systolic arrays

Info

Publication number
CN107578098A
Authority
CN
China
Prior art keywords
data
weight
array
processing unit
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710777741.4A
Other languages
Chinese (zh)
Other versions
CN107578098B (en)
Inventor
韩银和 (Han Yinhe)
许浩博 (Xu Haobo)
王颖 (Wang Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201710777741.4A priority Critical patent/CN107578098B/en
Publication of CN107578098A publication Critical patent/CN107578098A/en
Application granted granted Critical
Publication of CN107578098B publication Critical patent/CN107578098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The present invention provides a neural network processor comprising a control unit, a computing unit, a data storage unit and a weight storage unit. Under the control of the control unit, the computing unit obtains data and weights from the data storage unit and the weight storage unit respectively and performs the neural-network-related operations. The computing unit comprises an array controller and multiple processing units connected in the manner of a systolic array: the array controller feeds data and weights from different directions into the systolic array formed by the processing units, and all processing units process the data flowing through them simultaneously and in parallel, so the neural network processor can reach a very high processing speed. At the same time the input data are reused many times, so a high computational throughput can be achieved while consuming less memory bandwidth.

Description

Neural network processor based on systolic arrays
Technical field
The present invention relates to neural network technology, and more particularly to neural network processor architectures.
Background art
Deep learning has achieved important breakthroughs in recent years, and neural network models trained with deep learning algorithms have attained remarkable results in application fields such as image recognition, speech processing and intelligent robotics. A deep neural network simulates the neural connection structure of the human brain by building a model that, when processing signals such as images, sound and text, describes data features hierarchically through multiple transformation stages. With the continuous increase in the complexity of neural networks, neural network techniques suffer in practical applications from problems such as high resource consumption, slow computation speed and high energy consumption. Using hardware accelerators in place of traditional software computation has become an effective way to improve the efficiency of neural network computing, for example neural network processors implemented with graphics processing units, application-specific processor chips and field-programmable gate arrays (FPGAs).
However, a neural network processor is both computation-intensive and memory-access-intensive. On the one hand, a neural network model contains a large number of multiply-add operations and other nonlinear operations, requiring the neural network processor to run at high utilization to satisfy the computational demands of the model; on the other hand, neural network computation involves a large number of parameter iterations, so the computing units must access memory frequently, which significantly increases the bandwidth requirements of the processor design and also increases memory-access power consumption.
Therefore, existing neural network processors need to be improved so as to increase their computational efficiency while reducing hardware overhead.
Summary of the invention
The object of the present invention is therefore to overcome the above defects of the prior art and to provide a neural network processor based on systolic arrays.
This object of the present invention is achieved through the following technical solutions:
According to one embodiment of the present invention, a neural network processor is provided, comprising a control unit, a computing unit, a data storage unit and a weight storage unit, the computing unit obtaining, under the control of the control unit, data and weights from the data storage unit and the weight storage unit respectively to perform the neural-network-related operations,
wherein the computing unit comprises an array controller and multiple processing units connected in the manner of a systolic array; the array controller loads the weights and data into the processing unit array from different directions, and each processing unit performs operations on the data and weight it receives and passes the data and weight along their respective directions to the next processing unit.
In the above technical solution, the processing unit array may be a one-dimensional systolic array or a two-dimensional systolic array.
In the above technical solution, the processing unit may comprise a data register, a weight register, a multiplier and an accumulator;
wherein the weight register receives a weight from the previous processing unit in the column direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in that direction;
the data register receives a datum from the previous processing unit in the row direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in that direction;
the multiplier multiplies the input datum and weight, and its output enters the accumulator, where it is either accumulated with the data held in the accumulator or added to the partial-sum input signal, the result being output as a partial sum.
In the above technical solution, the array controller may load the data from the row direction of the processing unit array and load the weights from the column direction of the processing unit array.
In the above technical solution, the control unit may load the data sequences participating in the operation from the storage unit as row vectors, and load the weight sequences corresponding to the data sequences as column vectors.
In the above technical solution, the array controller may load the data sequences and weight sequences into the corresponding rows and columns of the processing unit array in order of increasing row number and column number respectively, adjacent rows and adjacent columns entering the array 1 clock cycle apart in time, ensuring that each weight and the datum it is to be calculated with enter the processing unit array in the same clock cycle.
Compared with the prior art, the advantages of the invention are:
using a systolic array structure in the computing unit of the neural network processor improves the operational efficiency of the neural network processor and relieves the bandwidth demands of the processor design.
Brief description of the drawings
Embodiments of the present invention are further illustrated below with reference to the drawings, in which:
Fig. 1 shows a schematic diagram of a common neural network topology;
Fig. 2 shows a schematic block diagram of a neural network convolution operation;
Fig. 3 shows a schematic block diagram of the structure of a neural network processor according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of the structure of the computing unit of a neural network processor according to an embodiment of the present invention;
Fig. 5 shows a schematic diagram of the structure of the computing unit of a neural network processor according to another embodiment of the present invention;
Fig. 6 shows a schematic diagram of the structure of a processing unit in the systolic array structure according to an embodiment of the present invention;
Fig. 7 shows a schematic diagram of the calculation process of a computing unit according to an embodiment of the present invention;
Fig. 8 shows a schematic flow diagram of the execution of a neural network processor according to an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
A neural network is a mathematical model formed by modelling the structure and behaviour of the human brain. It is generally divided into an input layer, hidden layers, an output layer and similar structures, each layer consisting of multiple neuron nodes; the output values of the neuron nodes of one layer are passed as inputs to the neuron nodes of the next layer, layer after layer. The neural network itself has bionic characteristics, and its process of multi-layer abstraction and iteration processes information in a manner similar to the human brain and other perceptual organs.
Fig. 1 shows a schematic diagram of a common neural network topology. The first-layer input values of the multi-layer neural network structure are the original image ("original image" in the present invention refers to the raw data to be processed, not merely an image in the narrow sense obtained by taking a photograph). Typically, for each layer of the neural network, the node values of the next layer can be computed from the neuron node values of the current layer (also referred to herein as data) and their corresponding weight values. For example, suppose x1, x2, ..., xn are several neuron nodes of a certain layer of the neural network, all connected to a node y of the next layer, and w1, w2, ..., wn are the weights of the corresponding connections; then the value of y is defined as y = x1·w1 + x2·w2 + ... + xn·wn. Each layer of the neural network therefore involves a large number of convolution operations built from such multiply-add operations. The convolution operation process in a neural network is generally as shown in Fig. 2: a two-dimensional weight convolution kernel of size K×K scans a feature map; at each position the weights form an inner product with the corresponding feature elements of the feature map, and the inner product values are summed to obtain one output-layer feature element. When a convolutional layer has N feature map layers, N convolution kernels of size K×K perform convolution operations with the feature maps of that layer, and the N inner product values are summed to obtain one output-layer feature element. With the continuous increase of neural network complexity, such computation undoubtedly consumes a large amount of resources; a dedicated neural network processor is therefore commonly used to realize neural network computation.
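To make these two operations concrete (the patent itself contains no code, so the function names and NumPy formulation below are our own minimal sketch), the per-node inner product and the K×K kernel scan can be written as:

```python
import numpy as np

def neuron_output(x, w):
    """Next-layer node value y = sum_i x_i * w_i: the inner product of
    this layer's node values x with the connection weights w."""
    return float(np.dot(x, w))

def conv2d_single(feature_map, kernel):
    """Scan a KxK weight kernel over a feature map (stride 1, no padding);
    each output element is the inner product of the kernel with the KxK
    patch it currently covers, summed as described in the text."""
    H, W = feature_map.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            out[i, j] = np.sum(feature_map[i:i + K, j:j + K] * kernel)
    return out
```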
A common neural network processor is based on a storage-control-computation structure. The storage structure stores the data participating in the computation, the neural network weights and the operational instructions of the processor; the control structure parses the operational instructions and generates control signals to control the scheduling and storage of data within the processor and the computation process of the neural network; the computation structure is responsible for the neural network computation operations. The storage unit may store data transmitted from outside the neural network processor (for example, original feature map data), the trained neural network weights, results or intermediate results produced during computation, instruction information participating in the computation, and so on.
Fig. 3 shows a schematic diagram of the structure of a neural network processor according to an embodiment of the present invention. As shown in Fig. 3, the storage unit is further subdivided into an input data storage unit 311, a weight storage unit 312, an instruction storage unit 313 and an output data storage unit 314. The input data storage unit 311 stores the data participating in the computation, including original feature map data and the data participating in intermediate-layer computation; the weight storage unit 312 stores the trained neural network weights; the instruction storage unit 313 stores the instruction information participating in the computation, the instructions being parsed by the control unit 320 into a control flow that schedules the neural network computation; the output data storage unit 314 stores the computed neuron responses. By subdividing the storage unit in this way, data of essentially the same type can be stored centrally, so that a suitable storage medium can be selected and operations such as data addressing can be simplified. It should be understood that the input data storage unit 311 and the output data storage unit 314 may also be the same storage unit.
The control unit 320 is responsible for instruction decoding, data scheduling, process control and similar work, for example obtaining the instruction stored in the instruction storage unit and parsing it, then scheduling data according to the control signals obtained from parsing and controlling the computing unit to perform the relevant neural network operations. In an embodiment of the present invention, the feature map data participating in the neural network computation are divided into different regions, each region being treated as a matrix, so that the operations between data and weights are decomposed into multiple matrix operations (for example as shown in Fig. 2). In this way, the control unit loads the weight sequences and data sequences participating in the computation from the storage unit in the form of the row vectors or column vectors suited to matrix operations.
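The patent does not spell out how the regions are recast as matrices; one common technique consistent with this description, offered here only as an assumption rather than the inventors' exact scheme, is an im2col-style flattening, in which every K×K patch becomes one row of a matrix so that the whole convolution reduces to a single matrix product:

```python
import numpy as np

def im2col(feature_map, K):
    """Flatten every KxK patch of the feature map into one row, so that
    the convolution becomes a matrix product with the flattened kernel."""
    H, W = feature_map.shape
    rows = []
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            rows.append(feature_map[i:i + K, j:j + K].ravel())
    return np.array(rows)            # shape: (number of patches, K*K)

fm = np.arange(16, dtype=float).reshape(4, 4)
kern = np.ones((3, 3))
cols = im2col(fm, 3)                 # shape (4, 9)
out = cols @ kern.ravel()            # one inner product per patch;
                                     # matches conv2d_single(fm, kern).ravel()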
A neural network processor may contain one or more computing units (for example, computing units 330, 331, etc.). Each computing unit can perform the corresponding neural network computation according to control signals from the control unit 320, obtaining data from the storage units, performing the calculation and writing the calculation results back to the storage unit. The computing units may use the same structure or different structures, and may perform the same calculation or different calculations. The computing unit provided in one embodiment of the present invention comprises an array controller and multiple processing units organized in the form of a systolic array, each processing unit having the same internal structure. The array controller is responsible for loading data into the systolic array, and each processing unit is responsible for computing on the data: weights are input from the top of the systolic array and propagate from top to bottom, data are input from the left side of the systolic array and propagate from left to right; each processing unit operates on the data and weight it receives, and results are output from the right side of the systolic array. The systolic array may have a one-dimensional or a two-dimensional structure. It should be understood that the neural network processor may also include computing units that compute in other ways, and the control unit may select different computing units to process the data according to actual requirements.
Fig. 4 shows a schematic diagram of the structure of the computing unit in a neural network processor according to an embodiment of the present invention. As shown in Fig. 4, the systolic array here has a one-dimensional structure, the processing units being connected in series. For a weight sequence and data sequence awaiting computation, the array controller loads each weight of the weight sequence into a different processing unit, where it remains until the last element of the corresponding data sequence has completed its calculation with that weight, after which the next group of weights is loaded. At the same time, the data of the data sequence are loaded one by one into the systolic array from the left side, and the processed data are passed back to the array controller from the opposite side of the systolic array. In such a computing unit structure, the first datum first enters the first processing unit and, after being processed, is passed to the next processing unit while the second datum enters the first processing unit. Continuing in this way, by the time the first datum reaches the last processing unit it has been processed many times. This systolic architecture thus reuses the input data many times, so a higher computational throughput can be achieved while consuming less memory bandwidth.
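How the per-unit products are then combined is determined by the partial-sum path of the processing unit (Fig. 6, described below); the sketch here (our own timing model, with the combination network deliberately left abstract) simulates only the flow itself, to make the reuse claim concrete: each datum is fetched once but is multiplied by every resident weight as it travels down the chain.

```python
def systolic_1d(weights, data):
    """Logical simulation of the one-dimensional computing unit of Fig. 4:
    each PE holds one resident weight; data enter from the left, one new
    element per cycle, and shift one PE to the right each cycle.  Every PE
    multiplies the datum currently passing through it by its resident
    weight; the products are gathered into a table rather than routed."""
    n_pe = len(weights)
    pipeline = [None] * n_pe            # datum currently held by each PE
    products = {}                       # (PE index, data index) -> product
    stream = list(enumerate(data))      # tag each datum with its index
    for cycle in range(len(data) + n_pe - 1):
        # a new datum enters PE 0 from the left; everything shifts right
        pipeline = [stream.pop(0) if stream else None] + pipeline[:-1]
        for k, slot in enumerate(pipeline):
            if slot is not None:
                j, x = slot
                products[(k, j)] = weights[k] * x
    return products

prods = systolic_1d([1, 2, 3], [10, 20, 30, 40])
# each of the 4 data, read from memory once, met all 3 resident weights
assert prods[(2, 0)] == 3 * 10 and len(prods) == 3 * 4
```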
Fig. 5 shows a schematic diagram of the structure of the computing unit in a neural network processor according to another embodiment of the present invention. In this embodiment, the multiple processing units in the computing unit are organized in a two-dimensional array comprising row arrays and column arrays, and each processing unit is connected only to its adjacent processing units, i.e. a processing unit communicates only with its neighbours. The array controller is responsible for scheduling the data: it controls the input of the relevant data into the processing units from the top and the left side of the computing unit's systolic array, different data being input into the processing units from different directions. For example, the array controller controls the weights to be input from the top of the processing unit array and propagated downward along the column direction, while the data are input from the left side of the processing unit array and propagated from left to right along the row direction. The present invention does not restrict the input directions of the various computation elements or the directions of systolic propagation; the terms "left", "right", "top" and "bottom" used here refer only to the respective directions in the example figures and should not be construed as limiting the physical implementation of the present invention.
As noted above, in an embodiment of the present invention the processing units in a computing unit are homogeneous and perform the same operation. Fig. 6 gives a schematic diagram of the structure of a processing unit according to an embodiment of the present invention. As shown in Fig. 6, the input signals of the processing unit comprise data, weight and partial sum; the output signals comprise a data output, a weight output and a partial-sum output. The interior of the processing unit mainly comprises a data register, a weight register, a multiplier and an accumulator. The weight input signal is connected to the weight register and the multiplier, the data input signal is connected to the data register and the multiplier, and the partial-sum input signal is connected to the accumulator. The weight register can send the weight to the multiplier for processing, or pass it directly to the computation unit below; likewise the data register can send the datum to the multiplier for processing, or pass it directly to the next unit on the right. The input datum and weight are multiplied in the multiplier, whose output enters the accumulator, where it is either accumulated with the data held in the accumulator or added to the partial-sum input signal, the result being output as a partial sum. The above operations and transfers can be set flexibly in response to control signals from the array controller. For example, each processing unit may perform the following operations:
1) receive the data from the previous node in the row direction and the previous node in the column direction of the systolic flow;
2) compute the product of the two, and accumulate it with the previously stored result;
3) save the accumulated value, pass the input received from the column on to the next column node, and pass the input received from the row on to the next row node (a sketch of such a processing unit follows this list).
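A minimal sketch of such a processing unit (class and member names are our own; the Fig. 6 unit can either accumulate locally or add the incoming partial sum, and the variant below accumulates locally):

```python
class ProcessingUnit:
    """Sketch of the Fig. 6 processing unit: registers latch the
    row/column inputs, the multiplier forms their product, and the
    accumulator adds it to the locally stored partial sum."""
    def __init__(self):
        self.data_reg = None    # datum from the previous row node
        self.weight_reg = None  # weight from the previous column node
        self.acc = 0            # accumulated partial sum

    def step(self, data_in, weight_in):
        # 1) receive the data from the previous row and column nodes
        self.data_reg, self.weight_reg = data_in, weight_in
        # 2) multiply the two and accumulate with the stored result
        self.acc += self.data_reg * self.weight_reg
        # 3) pass the datum to the next row node and the weight to the
        #    next column node, unchanged
        return self.data_reg, self.weight_reg

pe = ProcessingUnit()
pe.step(3, 3); pe.step(4, 2); pe.step(2, 3)
assert pe.acc == 3*3 + 4*2 + 2*3   # element C11 of the Fig. 7 example below
```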
In addition, for processing units organized in the form of a one-dimensional array, the weights do not need to be propagated downward. Therefore, after the array controller has loaded each element of a pending weight sequence into the weight register of the corresponding processing unit, the weight register does not output the weight but retains it for a period of time; once the array controller determines that the weight has completed its associated computing tasks, the weight register is cleared and the next pending weight is loaded.
With reference to Fig. 7, the calculation process of a computing unit using the two-dimensional array structure according to an embodiment of the present invention is illustrated below with the example of multiplying two 3×3 matrices representing the data and the weights:
Data matrix:
A = | 3 4 2 |
    | 2 5 3 |
    | 3 2 5 |
Weight matrix:
B = | 3 4 2 |
    | 2 5 3 |
    | 3 2 5 |
The array controller controls the data and the weights so that they are input into the processing units from the left and the top of the processing unit array respectively. For example, the row vectors of matrix A enter the corresponding rows of the processing unit array in order of increasing row number, adjacent row vectors entering the processing unit array 1 clock cycle apart; thus the datum in row i, column k of matrix A enters the processing unit array at the same time as the datum in row i-1, column k+1. Likewise, the column vectors of matrix B enter the corresponding columns of the processing unit array in order of increasing column number, adjacent column vectors entering the processing unit array 1 clock cycle apart; thus the datum in row k, column j of matrix B enters the processing unit array at the same time as the datum in row k+1, column j-1. Moreover, data matrix A and weight matrix B advance into the processing unit array in parallel in time, i.e. the corresponding elements Ai,k and Bk,j that are to be multiplied arrive at the same processing unit in the same clock cycle, until all rows of matrix A and all columns of matrix B have passed completely through the processing unit array. The array controller is responsible for the input control that makes every datum arrive at each unit time-aligned. In this way, the array controller feeds data and weights from different directions into the systolic array formed by the processing units, the weights flowing from top to bottom and the data from left to right. As the data flow through, all processing units process the data passing through them simultaneously and in parallel, so a very high processing speed can be reached. At the same time, the predetermined dataflow pattern ensures that all the processing a datum requires is completed between its flowing into the processing unit array and its flowing out, so the data never need to be re-entered, which also reduces memory-access operations.
As shown in Fig. 7, in the first cycle, data 3 and 3 enter processing unit PE11 simultaneously and are multiplied in this unit;
in the second cycle, the data 3 that had flowed into PE11 from the left flows right into PE12 while data 4 enters PE12 from the top; the data 3 that had flowed into PE11 from the top flows down into PE21 while data 2 enters PE21 from the left; at the same time data 4 enters PE11 from the left and data 2 enters PE11 from the top;
in the third cycle, data 3 flows into PE11 from above and data 2 flows into PE11 from the left, data 5 and data 2 flow into PE21, data 4 and data 5 flow into PE12, data 3 and data 2 flow into PE13, data 2 and data 4 flow into PE22, and data 3 and data 3 flow into PE31;
in the fourth cycle, data 2 and data 2 enter PE12, data 4 and data 3 enter PE13, data 3 and data 3 enter PE21, data 5 and data 5 enter PE22, data 2 and data 2 enter PE23, data 2 and data 2 enter PE31, and data 3 and data 4 enter PE32;
in the fifth cycle, data 2 and data 5 flow into PE13, data 3 and data 2 flow into PE22, data 5 and data 3 flow into PE23, data 5 and data 3 flow into PE31, data 5 and data 2 flow into PE32, and data 3 and data 2 flow into PE33;
in the sixth cycle, data 3 and data 5 flow into PE23, data 5 and data 2 flow into PE32, and data 2 and data 3 flow into PE33;
in the seventh cycle, data 5 and data 5 flow into PE33.
The product results are accumulated in the column direction, i.e. the product result of PE11 is transferred into PE21 to be accumulated, and the accumulated result is then transferred into PE31 for further accumulation.
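To check that this skewed dataflow indeed computes the matrix product, the following cycle-accurate sketch (our own simulation, not the patent's hardware description; it models the locally accumulating variant of the Fig. 6 processing unit, in which each PE(i,j) keeps its own partial sum rather than passing it down the column) replays the Fig. 7 example:

```python
def systolic_matmul(A, B):
    """Rows of A, skewed one cycle apart, stream in from the left;
    columns of B, skewed one cycle apart, stream in from the top;
    every PE multiplies the pair it currently holds and accumulates
    locally, so PE(i,j) ends up holding C[i][j]."""
    n = len(A)
    acc  = [[0] * n for _ in range(n)]      # per-PE accumulator
    data = [[None] * n for _ in range(n)]   # datum in each PE, flows right
    wgt  = [[None] * n for _ in range(n)]   # weight in each PE, flows down
    for cycle in range(3 * n - 2):          # enough cycles to drain the array
        for i in range(n):                  # shift data right, inject A[i][k]
            for j in range(n - 1, 0, -1):
                data[i][j] = data[i][j - 1]
            k = cycle - i                   # row i is skewed by i cycles
            data[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):                  # shift weights down, inject B[k][j]
            for i in range(n - 1, 0, -1):
                wgt[i][j] = wgt[i - 1][j]
            k = cycle - j                   # column j is skewed by j cycles
            wgt[0][j] = B[k][j] if 0 <= k < n else None
        for i in range(n):                  # all PEs multiply-accumulate in parallel
            for j in range(n):
                if data[i][j] is not None and wgt[i][j] is not None:
                    acc[i][j] += data[i][j] * wgt[i][j]
    return acc

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
B = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
assert systolic_matmul(A, B) == [
    [sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
    for i in range(3)]
```

In the first cycle only PE11 is active (multiplying 3 by 3, as in the trace above), and after 3n - 2 cycles each accumulator holds one element of C = A·B.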
Fig. 8 shows a schematic flow diagram of the execution of a neural network processor using the above computing unit according to an example of the present invention. In step S1, the control unit addresses the storage unit, then reads and parses the instruction to be executed next; in step S2, the input data are obtained from the storage unit according to the storage addresses obtained by parsing the instruction; in step S3, the data and the weights are loaded from the input storage unit and the weight storage unit respectively into the computing unit according to the embodiments of the present invention described above; in step S4, the computing unit performs the arithmetic operations of the neural network computation; in step S5, the neural network computation results are stored back into the output storage unit.
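Strung together, the five steps form a simple fetch-load-compute-store loop. In the sketch below (our own illustration: the patent defines no instruction format, so the Instr fields and the dict standing in for the subdivided storage units of Fig. 3 are hypothetical; systolic_matmul is the simulation from the previous sketch), one such step is executed:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    """A hypothetical decoded instruction: operand and result addresses."""
    data_addr: str
    weight_addr: str
    out_addr: str

def run_step(memory, instr):
    """Steps S2-S5 of Fig. 8 against a dict standing in for the storage units."""
    data = memory[instr.data_addr]              # S2: fetch the input data
    weights = memory[instr.weight_addr]         # S3: load data and weights
    result = systolic_matmul(data, weights)     # S4: systolic computation
    memory[instr.out_addr] = result             # S5: store the results

memory = {"in": [[3, 4, 2], [2, 5, 3], [3, 2, 5]],
          "w":  [[3, 4, 2], [2, 5, 3], [3, 2, 5]]}
run_step(memory, Instr("in", "w", "out"))       # S1 (fetch/decode) elided
```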
Although the present invention has been described by means of preferred embodiments, it is not limited to the embodiments described here, and also includes various changes and variations made without departing from the scope of the present invention.

Claims (7)

1. A neural network processor, comprising a control unit, a computing unit, a data storage unit and a weight storage unit, the computing unit obtaining, under the control of the control unit, data and weights from the data storage unit and the weight storage unit respectively to perform the neural-network-related operations,
wherein the computing unit comprises an array controller and multiple processing units connected in the manner of a systolic array; the array controller loads the weights and data into the processing unit array from different directions, and each processing unit performs operations on the received data and weight and passes the data and weight along their respective directions to the next processing unit.
2. The neural network processor according to claim 1, wherein the processing unit array is a one-dimensional systolic array.
3. The neural network processor according to claim 1, wherein the processing unit array is a two-dimensional systolic array.
4. The neural network processor according to claim 3, wherein the processing unit comprises a data register, a weight register, a multiplier and an accumulator;
wherein the weight register receives a weight from the previous processing unit in the column direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in that direction;
the data register receives a datum from the previous processing unit in the row direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in that direction;
the multiplier multiplies the input datum and weight, and its output enters the accumulator, where it is either accumulated with the data in the accumulator or added to the partial-sum input signal, the result being output as a partial sum.
5. The neural network processor according to claim 3 or 4, wherein the array controller loads the data from the row direction of the processing unit array and loads the weights from the column direction of the processing unit array.
6. The neural network processor according to claim 3 or 4, wherein the control unit loads the data sequences participating in the operation from the storage unit as row vectors, and loads the weight sequences corresponding to the data sequences as column vectors.
7. The neural network processor according to claim 6, wherein the array controller loads the data sequences and weight sequences into the corresponding rows and columns of the processing unit array in order of increasing row number and column number respectively, adjacent rows and adjacent columns entering the array 1 clock cycle apart in time, and ensures that each weight and the datum it is to be calculated with enter the processing unit array in the same clock cycle.
CN201710777741.4A 2017-09-01 2017-09-01 Neural network processor based on systolic array Active CN107578098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710777741.4A CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710777741.4A CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Publications (2)

Publication Number Publication Date
CN107578098A true CN107578098A (en) 2018-01-12
CN107578098B CN107578098B (en) 2020-10-30

Family

ID=61030459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710777741.4A Active CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Country Status (1)

Country Link
CN (1) CN107578098B (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628799A (en) * 2018-04-17 2018-10-09 上海交通大学 Restructural single-instruction multiple-data systolic array architecture, processor and electric terminal
CN109885512A (en) * 2019-02-01 2019-06-14 京微齐力(北京)科技有限公司 The System on Chip/SoC and design method of integrated FPGA and artificial intelligence module
CN109902836A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 The failure tolerant method and System on Chip/SoC of artificial intelligence module
CN109902064A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of chip circuit of two dimension systolic arrays
CN109902835A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 Processing unit is provided with the artificial intelligence module and System on Chip/SoC of general-purpose algorithm unit
CN109902063A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of System on Chip/SoC being integrated with two-dimensional convolution array
CN109902795A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 Processing unit is provided with the artificial intelligence module and System on Chip/SoC of inputoutput multiplexer
CN109919323A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Edge cells have the artificial intelligence module and System on Chip/SoC of local accumulation function
CN109919321A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Unit has the artificial intelligence module and System on Chip/SoC of local accumulation function
CN109933371A (en) * 2019-02-01 2019-06-25 京微齐力(北京)科技有限公司 Its unit may have access to the artificial intelligence module and System on Chip/SoC of local storage
CN110211618A (en) * 2019-06-12 2019-09-06 中国科学院计算技术研究所 A kind of processing unit and method for block chain
CN110210615A (en) * 2019-07-08 2019-09-06 深圳芯英科技有限公司 It is a kind of for executing the systolic arrays system of neural computing
CN110348564A (en) * 2019-06-11 2019-10-18 中国人民解放军国防科技大学 SCNN reasoning acceleration device based on systolic array, processor and computer equipment
CN110543934A (en) * 2019-08-14 2019-12-06 北京航空航天大学 Pulse array computing structure and method for convolutional neural network
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array
CN110785778A (en) * 2018-08-14 2020-02-11 深圳市大疆创新科技有限公司 Neural network processing device based on pulse array
CN110851779A (en) * 2019-10-16 2020-02-28 北京航空航天大学 Systolic array architecture for sparse matrix operations
CN111368988A (en) * 2020-02-28 2020-07-03 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity
CN111684473A (en) * 2018-01-31 2020-09-18 亚马逊技术股份有限公司 Improving performance of neural network arrays
CN112204579A (en) * 2018-06-19 2021-01-08 国际商业机器公司 Runtime reconfigurable neural network processor core
CN112632464A (en) * 2020-12-28 2021-04-09 上海壁仞智能科技有限公司 Processing device for processing data
CN112819134A (en) * 2019-11-18 2021-05-18 爱思开海力士有限公司 Memory device including neural network processing circuit
CN112836813A (en) * 2021-02-09 2021-05-25 南方科技大学 Reconfigurable pulsation array system for mixed precision neural network calculation
CN112862067A (en) * 2021-01-14 2021-05-28 支付宝(杭州)信息技术有限公司 Method and device for processing business by utilizing business model based on privacy protection
CN112906877A (en) * 2019-11-19 2021-06-04 阿里巴巴集团控股有限公司 Data layout conscious processing in memory architectures for executing neural network models
CN113393376A (en) * 2021-05-08 2021-09-14 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113869507A (en) * 2021-12-02 2021-12-31 之江实验室 Neural network accelerator convolution calculation device and method based on pulse array
CN113870273A (en) * 2021-12-02 2021-12-31 之江实验室 Neural network accelerator characteristic graph segmentation method based on pulse array
FR3115136A1 (en) 2020-10-12 2022-04-15 Thales METHOD AND DEVICE FOR PROCESSING DATA TO BE PROVIDED AS INPUT OF A FIRST SHIFT REGISTER OF A SYSTOLIC NEURONAL ELECTRONIC CIRCUIT
CN114675806A (en) * 2022-05-30 2022-06-28 中科南京智能技术研究院 Pulsation matrix unit and pulsation matrix calculation device
CN110210615B (en) * 2019-07-08 2024-05-28 中昊芯英(杭州)科技有限公司 Systolic array system for executing neural network calculation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529670A (en) * 2016-10-27 2017-03-22 中国科学院计算技术研究所 Neural network processor based on weight compression, design method, and chip
US20170103318A1 (en) * 2015-05-21 2017-04-13 Google Inc. Rotating data for neural network computations
CN106650924A (en) * 2016-10-27 2017-05-10 中国科学院计算技术研究所 Processor based on time dimension and space dimension data flow compression and design method
CN107016175A (en) * 2017-03-23 2017-08-04 中国科学院计算技术研究所 It is applicable the Automation Design method, device and the optimization method of neural network processor
CN107085562A (en) * 2017-03-23 2017-08-22 中国科学院计算技术研究所 A kind of neural network processor and design method based on efficient multiplexing data flow

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170103318A1 (en) * 2015-05-21 2017-04-13 Google Inc. Rotating data for neural network computations
CN106529670A (en) * 2016-10-27 2017-03-22 中国科学院计算技术研究所 Neural network processor based on weight compression, design method, and chip
CN106650924A (en) * 2016-10-27 2017-05-10 中国科学院计算技术研究所 Processor based on time dimension and space dimension data flow compression and design method
CN107016175A (en) * 2017-03-23 2017-08-04 中国科学院计算技术研究所 It is applicable the Automation Design method, device and the optimization method of neural network processor
CN107085562A (en) * 2017-03-23 2017-08-22 中国科学院计算技术研究所 A kind of neural network processor and design method based on efficient multiplexing data flow

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUECHAO WEI et al.: "Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs", The 54th Annual Design Automation Conference (DAC) *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111684473A (en) * 2018-01-31 2020-09-18 亚马逊技术股份有限公司 Improving performance of neural network arrays
CN111684473B (en) * 2018-01-31 2021-10-22 亚马逊技术股份有限公司 Improving performance of neural network arrays
CN108628799B (en) * 2018-04-17 2021-09-14 上海交通大学 Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
CN108628799A (en) * 2018-04-17 2018-10-09 上海交通大学 Restructural single-instruction multiple-data systolic array architecture, processor and electric terminal
CN112204579A (en) * 2018-06-19 2021-01-08 国际商业机器公司 Runtime reconfigurable neural network processor core
CN110785778A (en) * 2018-08-14 2020-02-11 深圳市大疆创新科技有限公司 Neural network processing device based on pulse array
WO2020034079A1 (en) * 2018-08-14 2020-02-20 深圳市大疆创新科技有限公司 Systolic array-based neural network processing device
CN109902836A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 The failure tolerant method and System on Chip/SoC of artificial intelligence module
CN109902064A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of chip circuit of two dimension systolic arrays
CN109933371A (en) * 2019-02-01 2019-06-25 京微齐力(北京)科技有限公司 Its unit may have access to the artificial intelligence module and System on Chip/SoC of local storage
CN109919321A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Unit has the artificial intelligence module and System on Chip/SoC of local accumulation function
CN109885512A (en) * 2019-02-01 2019-06-14 京微齐力(北京)科技有限公司 The System on Chip/SoC and design method of integrated FPGA and artificial intelligence module
CN109902063B (en) * 2019-02-01 2023-08-22 京微齐力(北京)科技有限公司 System chip integrated with two-dimensional convolution array
CN109902835A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 Processing unit is provided with the artificial intelligence module and System on Chip/SoC of general-purpose algorithm unit
CN109902063A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of System on Chip/SoC being integrated with two-dimensional convolution array
CN109919323A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Edge cells have the artificial intelligence module and System on Chip/SoC of local accumulation function
CN109902795A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 Processing unit is provided with the artificial intelligence module and System on Chip/SoC of inputoutput multiplexer
CN109885512B (en) * 2019-02-01 2021-01-12 京微齐力(北京)科技有限公司 System chip integrating FPGA and artificial intelligence module and design method
CN110348564A (en) * 2019-06-11 2019-10-18 中国人民解放军国防科技大学 SCNN reasoning acceleration device based on systolic array, processor and computer equipment
CN110211618A (en) * 2019-06-12 2019-09-06 中国科学院计算技术研究所 A kind of processing unit and method for block chain
CN110210615B (en) * 2019-07-08 2024-05-28 中昊芯英(杭州)科技有限公司 Systolic array system for executing neural network calculation
CN110210615A (en) * 2019-07-08 2019-09-06 深圳芯英科技有限公司 It is a kind of for executing the systolic arrays system of neural computing
CN110543934B (en) * 2019-08-14 2022-02-01 北京航空航天大学 Pulse array computing structure and method for convolutional neural network
CN110543934A (en) * 2019-08-14 2019-12-06 北京航空航天大学 Pulse array computing structure and method for convolutional neural network
CN110705703A (en) * 2019-10-16 2020-01-17 北京航空航天大学 Sparse neural network processor based on systolic array
CN110851779B (en) * 2019-10-16 2021-09-14 北京航空航天大学 Systolic array architecture for sparse matrix operations
CN110851779A (en) * 2019-10-16 2020-02-28 北京航空航天大学 Systolic array architecture for sparse matrix operations
CN112819134A (en) * 2019-11-18 2021-05-18 爱思开海力士有限公司 Memory device including neural network processing circuit
CN112819134B (en) * 2019-11-18 2024-04-05 爱思开海力士有限公司 Memory device including neural network processing circuitry
CN112906877A (en) * 2019-11-19 2021-06-04 阿里巴巴集团控股有限公司 Data layout conscious processing in memory architectures for executing neural network models
CN111368988B (en) * 2020-02-28 2022-12-20 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity
CN111368988A (en) * 2020-02-28 2020-07-03 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity
WO2022078982A1 (en) 2020-10-12 2022-04-21 Thales Method and device for processing data to be supplied as input for a first shift register of a systolic neural electronic circuit
FR3115136A1 (en) 2020-10-12 2022-04-15 Thales METHOD AND DEVICE FOR PROCESSING DATA TO BE PROVIDED AS INPUT OF A FIRST SHIFT REGISTER OF A SYSTOLIC NEURONAL ELECTRONIC CIRCUIT
CN112632464A (en) * 2020-12-28 2021-04-09 上海壁仞智能科技有限公司 Processing device for processing data
CN112862067A (en) * 2021-01-14 2021-05-28 支付宝(杭州)信息技术有限公司 Method and device for processing business by utilizing business model based on privacy protection
CN112836813B (en) * 2021-02-09 2023-06-16 南方科技大学 Reconfigurable pulse array system for mixed-precision neural network calculation
CN112836813A (en) * 2021-02-09 2021-05-25 南方科技大学 Reconfigurable pulsation array system for mixed precision neural network calculation
CN113393376A (en) * 2021-05-08 2021-09-14 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113870273A (en) * 2021-12-02 2021-12-31 之江实验室 Neural network accelerator characteristic graph segmentation method based on pulse array
CN113870273B (en) * 2021-12-02 2022-03-25 之江实验室 Neural network accelerator characteristic graph segmentation method based on pulse array
CN113869507A (en) * 2021-12-02 2021-12-31 之江实验室 Neural network accelerator convolution calculation device and method based on pulse array
CN114675806A (en) * 2022-05-30 2022-06-28 中科南京智能技术研究院 Pulsation matrix unit and pulsation matrix calculation device

Also Published As

Publication number Publication date
CN107578098B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN107578098A (en) Neural network processor based on systolic arrays
CN107578095B (en) Neural computing device and processor comprising the computing device
CN107918794A (en) Neural network processor based on computing array
CN107153873B (en) A kind of two-value convolutional neural networks processor and its application method
CN105184366B (en) A kind of time-multiplexed general neural network processor
CN105512723B (en) A kind of artificial neural networks apparatus and method for partially connected
CN106951395A (en) Towards the parallel convolution operations method and device of compression convolutional neural networks
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN109190756A (en) Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
WO2022134391A1 (en) Fusion neuron model, neural network structure and training and inference methods therefor, storage medium, and device
CN108665059A (en) Convolutional neural networks acceleration system based on field programmable gate array
CN106529670A (en) Neural network processor based on weight compression, design method, and chip
Sripad et al. SNAVA—A real-time multi-FPGA multi-model spiking neural network simulation architecture
CN107341544A (en) A kind of reconfigurable accelerator and its implementation based on divisible array
CN106875013A (en) The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear
CN109472356A (en) A kind of accelerator and method of restructural neural network algorithm
CN109325591A (en) Neural network processor towards Winograd convolution
CN106951962A (en) Compound operation unit, method and electronic equipment for neutral net
CN107491811A (en) Method and system and neural network processor for accelerans network processing unit
CN106201651A (en) The simulator of neuromorphic chip
CN107423816A (en) A kind of more computational accuracy Processing with Neural Network method and systems
CN106650924A (en) Processor based on time dimension and space dimension data flow compression and design method
CN108510065A (en) Computing device and computational methods applied to long Memory Neural Networks in short-term
CN108446761A (en) A kind of neural network accelerator and data processing method
CN111401547B (en) HTM design method based on circulation learning unit for passenger flow analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant