CN109978156A - Integrated circuit chip device and Related product - Google Patents
- Publication number
- CN109978156A (application CN201711469408.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- layer
- circuit
- type
- input data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The present disclosure provides an integrated circuit chip device and related products. The device is used for performing training of a neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and multiple basic processing circuits. The main processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data. The multiple basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.
Description
Technical field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background technique
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neural network of the human brain from an information-processing perspective, builds a simple model, and forms different networks according to different connection schemes. In engineering and academia it is often referred to simply as a neural network or a neural-network-like model. A neural network is a computational model composed of a large number of interconnected nodes (or neurons). Existing neural network computation relies on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) to implement the forward operation of the network; such forward operation is computationally intensive and power-hungry.
Summary of the invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed and efficiency of a computing device.
In a first aspect, an integrated circuit chip device for performing training of a neural network is provided. The device is used for performing the training of the neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and multiple basic processing circuits. The main processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data. The multiple basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column.
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operation.
The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain the nth backward operation of the backward operation of the nth layer according to the training instruction, obtain an nth backward computational complexity according to the nth output result gradient, the nth-layer input data, the nth-layer weight group data, and the nth backward operation, and determine, according to the nth backward computational complexity, the nth backward data type corresponding to the nth output result gradient, the nth-layer input data, and the nth-layer weight group data.
The main processing circuit is configured to divide the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into a broadcast data block and a distribution data block according to the type of the nth backward operation, split the distribution data block of the nth backward data type to obtain multiple basic data blocks, distribute the multiple basic data blocks to at least one of the basic processing circuits connected to the main processing circuit, and broadcast the broadcast data block of the nth backward data type to the basic processing circuits connected to the main processing circuit.
The basic processing circuits are configured to perform, in parallel, the operations of the neural network on the broadcast data block of the nth backward data type and the basic data blocks of the nth backward data type to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
The main processing circuit is configured to process the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient; the nth backward data type includes a fixed-point type or a floating-point type.
The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer and perform the backward operation of the (n-1)th layer to obtain an (n-1)th-layer weight group gradient, and to update the weight group data of the corresponding layer using the (n-1)th-layer weight group gradient, where the weight group data include at least two weights.
In a second aspect, a neural network operation device is provided, which includes one or more of the integrated circuit chip devices provided in the first aspect.
In a third aspect, a combined processing device is provided, which includes the neural network operation device of the second aspect, a universal interconnection interface, and a general-purpose processing device; the neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.
In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.
In a fifth aspect, an electronic device is provided, which includes the chip of the fourth aspect.
It can be seen that, in the embodiments of the present disclosure, a data type conversion circuit is provided to convert the type of a data block before the operation, which saves transmission resources and computing resources; the solution therefore has the advantages of low power consumption and a small amount of computation.
Detailed description of the invention
Fig. 1 is a kind of training method schematic diagram of neural network.
Fig. 1 a is a kind of forward operation schematic diagram of neural network.
Fig. 1 b is a kind of schematic configuration diagram of fixed-point data type.
Fig. 2 a is convolution input data schematic diagram.
Fig. 2 b is convolution kernel schematic diagram.
Fig. 2 c is the operation window schematic diagram of a three-dimensional data block of input data.
Fig. 2 d is another operation window schematic diagram of a three-dimensional data block of input data.
Fig. 2e is yet another operation window schematic diagram of a three-dimensional data block of the input data.
Fig. 3 is a kind of structural schematic diagram of neural network chip.
Fig. 4a is a schematic diagram of a matrix multiplied by a matrix.
Fig. 4b is a method flowchart of a matrix multiplied by a matrix.
Fig. 4c is a schematic diagram of a matrix multiplied by a vector.
Fig. 4d is a method flowchart of a matrix multiplied by a vector.
Fig. 4 e is a kind of neural metwork training schematic diagram.
Fig. 4 f is another neural metwork training schematic diagram.
Fig. 4 g is neural network forward direction and reversed operation schematic diagram.
Fig. 4 h is neural metwork training multilayered structure schematic diagram.
Fig. 5a is a structural schematic diagram of a combined processing device also disclosed in the present disclosure.
Fig. 5b is another structural schematic diagram of a combined processing device also disclosed in the present disclosure.
Fig. 5c is a structural schematic diagram of a neural network processor board provided by an embodiment of the present disclosure.
Fig. 5d is a structural schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 5e is a structural schematic diagram of a neural network chip provided by an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
Specific embodiment
In order to enable those skilled in the art to better understand the solution of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, rather than all of them. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
In the device provided in the first aspect, the main processing circuit is specifically configured to compare the nth backward computational complexity with a preset threshold; if the nth backward computational complexity is higher than the preset threshold, the nth backward data type is determined to be the fixed-point type; if the nth backward computational complexity is lower than or equal to the preset threshold, the computing device determines the nth backward data type to be the floating-point type.
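As a rough sketch (not the patent's circuit logic), the threshold rule above can be expressed in Python; the function name and the string encoding of the two types are illustrative assumptions:

```python
FIXED_POINT = "fixed"      # fixed-point type
FLOATING_POINT = "float"   # floating-point type

def select_backward_dtype(complexity, threshold):
    """Pick the nth backward data type: a complexity above the preset
    threshold selects fixed point (cheaper arithmetic); a complexity
    lower than or equal to the threshold selects floating point."""
    return FIXED_POINT if complexity > threshold else FLOATING_POINT
```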
In the device provided in the first aspect, the main processing circuit is specifically configured to determine the (n+1)th backward data type to which the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belong; if the (n+1)th backward data type differs from the nth backward data type, the data type operation circuit converts the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belonging to the (n+1)th backward data type into the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belonging to the nth backward data type.
In the device provided in the first aspect, the main processing circuit is configured, when the nth backward operation is a convolution operation, to take the convolution input data as the nth-layer input data and the convolution kernel as the nth output result gradient, and to compute:
nth backward computational complexity = α * C * kH * kW * M * N * W * C * H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data.
If the complexity is greater than a set threshold, the nth backward data type is determined to be the fixed-point type; it is then determined whether the convolution input data and the convolution kernel are fixed-point data, and if they are not, the convolution input data are converted into fixed-point data and the convolution kernel is converted into fixed-point data, after which the convolution operation is performed on the convolution input data and the convolution kernel in the fixed-point type.
In the device provided in the first aspect, the main processing circuit is further configured, when the nth backward operation is a matrix-multiply-matrix operation, to take the input data as the nth-layer input data and the weight as the nth output result gradient, and to compute:
complexity = β * F * G * E * F;
where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the nth-layer input data, and E and F are the row and column values of the weight.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type; it is then determined whether the nth-layer input data and the weight are fixed-point data, and if they are not, the weight and the nth-layer input data are converted into fixed-point data, after which the matrix-multiply-matrix operation is performed on the nth-layer input data and the weight in the fixed-point type.
In the device provided in the first aspect, the integrated circuit chip device is further configured, when the nth backward operation is a matrix-multiply-vector operation, to take the input data as the nth-layer input data and the weight as the nth output result gradient, and to compute:
complexity = β * F * G * F;
where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the nth-layer input data, and F is the column value of the nth output result gradient.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type; it is then determined whether the nth-layer input data and the weight are fixed-point data, and if they are not, the k branch processing circuits are notified to convert the nth-layer input data into fixed-point data and to convert the weight into fixed-point data, after which the matrix-multiply-vector operation is performed on the nth-layer input data and the weight in the fixed-point type.
In the device provided in the first aspect, the main processing circuit is specifically configured such that, if the type of the nth backward operation is a multiplication operation, the nth-layer input data and the nth-layer weight group data are determined to be distribution data blocks and the nth output result gradient is determined to be a broadcast data block; if the type of the nth backward operation is a convolution operation, the nth-layer input data and the nth-layer weight group data are determined to be broadcast data blocks and the nth output result gradient is determined to be a distribution data block.
In the device provided in the first aspect, the n-layer backward operation further includes one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit, and each basic processing circuit includes a basic register or a basic on-chip cache circuit.
In the device provided in the first aspect, the main processing circuit includes one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
In the device provided in the first aspect, the nth output result gradient is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the nth-layer input data are one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the n-layer weight group data are one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
As shown in Fig. 1, the steps of neural network training include:
each layer of a (multilayer) neural network performs the forward operation in turn;
the backward operation is performed layer by layer in the reverse layer order to obtain the weight gradients;
the computed weight gradients are used to update the weights of the forward operation.
This is one iteration of neural network training; the entire training process needs to repeat this process many times (i.e., multiple iterations of computation).
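The three steps of one training iteration can be sketched as a minimal Python loop. The `Linear` toy layer (a single scalar weight), the squared-error loss, and all names are hypothetical simplifications, not the patent's hardware flow:

```python
class Linear:
    """Toy one-weight 'layer' (hypothetical, for illustration only)."""
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return self.weight * x

    def backward(self, inp, out_grad):
        # returns (weight gradient, input-data gradient)
        return out_grad * inp, out_grad * self.weight

def train_step(layers, x, target, lr):
    """One iteration: forward layer by layer, backward in reverse
    layer order, then update each layer's weight with its gradient."""
    activations = [x]
    for layer in layers:                       # step 1: forward operation
        activations.append(layer.forward(activations[-1]))
    grad = activations[-1] - target            # gradient of 0.5*(y - t)^2
    for layer, inp in zip(reversed(layers), reversed(activations[:-1])):
        w_grad, grad = layer.backward(inp, grad)  # step 2: backward operation
        layer.weight -= lr * w_grad               # step 3: weight update
    return activations[-1]
```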
Referring to Fig. 3, Fig. 3 shows an integrated circuit chip device. The device is used for performing the training of a neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and multiple basic processing circuits. The main processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data.
The multiple basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column.
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operation.
The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain the nth backward operation of the backward operation of the nth layer according to the training instruction, obtain an nth backward computational complexity according to the nth output result gradient, the nth-layer input data, the nth-layer weight group data, and the nth backward operation, and determine, according to the nth backward computational complexity, the nth backward data type corresponding to the nth output result gradient, the nth-layer input data, and the nth-layer weight group data.
The main processing circuit is configured to divide the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into a broadcast data block and a distribution data block according to the type of the nth backward operation, split the distribution data block of the nth backward data type to obtain multiple basic data blocks, distribute the multiple basic data blocks to at least one of the basic processing circuits connected to the main processing circuit, and broadcast the broadcast data block of the nth backward data type to the basic processing circuits connected to the main processing circuit.
The basic processing circuits are configured to perform, in parallel, the operations of the neural network on the broadcast data block of the nth backward data type and the basic data blocks of the nth backward data type to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
The main processing circuit is configured to process the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient; the nth backward data type includes a fixed-point type or a floating-point type.
The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer and perform the backward operation of the (n-1)th layer to obtain an (n-1)th-layer weight group gradient, and to update the weight group data of the corresponding layer using the (n-1)th-layer weight group gradient, where the weight group data include at least two weights.
As shown in Fig. 1a, in the forward operation of the neural network provided by an embodiment of the present disclosure, each layer uses its own input data and weights to compute the corresponding output data according to the operation rule specified by the type of the layer.
The forward operation process (also called inference) of a neural network processes the input data of each layer in turn and, through certain computations, obtains the output data. It has the following features:
The input of a layer:
the input of a layer can be the input data of the neural network;
the input of a layer can be the output of another layer;
the input of a layer can be the output of this layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer can obtain input from several of the above input sources at the same time.
The output of a layer:
the output of a layer can serve as the output result of the neural network;
the output of a layer can be the input of another layer;
the output of a layer can be the input of this layer at the next time step (the case of a recurrent neural network);
the output of a layer can be output to several of the above output destinations.
Specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:
a convolutional layer (i.e., performing the convolution operation);
a fully connected layer (performing the fully connected operation);
a normalization (regularization) layer, including types such as an LRN (Local Response Normalization) layer and a BN (Batch Normalization) layer;
a pooling layer;
an activation layer, including but not limited to the following types: a Sigmoid layer, a ReLU layer, a PReLU layer, a LeakyReLU layer, and a Tanh layer.
The backward operation of a layer needs to perform two parts of computation: one part uses the output data gradient, which may be sparsely represented, and the input data, which may be sparsely represented, to compute the gradient of the weights (used in the "weight update" step to update the weights of this layer); the other part uses the output data gradient, which may be sparsely represented, and the weights, which may be sparsely represented, to compute the input data gradient (used as the output data gradient of the next layer in the backward operation so that it can perform its own backward operation).
The backward operation propagates the gradients back starting from the last layer, in the order opposite to the forward operation.
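The two parts of a layer's backward operation can be sketched for a dense (non-sparse) fully connected layer y = W·x; the function name and the list-based matrices are illustrative assumptions:

```python
def fc_backward(weight, inp, out_grad):
    """Backward pass of a fully connected layer y = W x.

    Part 1: weight gradient dW = g * x^T (for the weight-update step).
    Part 2: input-data gradient dx = W^T g (passed back as the output
    data gradient of the previous layer in the backward order)."""
    rows, cols = len(weight), len(weight[0])
    w_grad = [[out_grad[r] * inp[c] for c in range(cols)] for r in range(rows)]
    in_grad = [sum(weight[r][c] * out_grad[r] for r in range(rows))
               for c in range(cols)]
    return w_grad, in_grad
```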
In an optional scheme, the output data gradient obtained by the backward computation of a layer can come from:
the gradient returned by the loss function (or cost function) at the end of the neural network;
the input data gradient of another layer;
the input data gradient of this layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer can obtain the output data gradient from several of the above sources at the same time.
After the backward operation of the neural network has been performed, the gradient of the weights of each layer is computed. In this step, the first input cache and the second input cache of the device are used to store the weights of this layer and the gradient of the weights respectively, and the weight gradient is then used in the operation unit to update the weights.
The operations mentioned above are all operations of one layer in the neural network. The implementation process for a multilayer neural network is as follows. In the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data computed in the operation unit as the input data of the next layer for computation (or performs certain operations on the output data before using it as the input data of the next layer), and at the same time the weights are replaced with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient computed in the operation unit as the output data gradient of the next layer for computation (or performs certain operations on the input data gradient before using it as the output data gradient of the next layer), and at the same time the weights are replaced with the weights of the next layer. (In the figures below, the backward operation is indicated with dotted arrows and the forward operation with solid arrows; the labels below each figure indicate the meaning of the figure.)
Representation method of fixed-point data
The fixed-point method refers to converting the representation of the data of a data block in the network into a data encoding with a specific, fixed position of the decimal point (the 0/1 bit layout in which the data are mapped onto the circuit device).
In an optional scheme, multiple data are grouped into a data block as a whole, and the data block is represented in fixed point using the same fixed-point representation method.
Fig. 1 b shows the specific table of short digit fixed-point data structure for storing data according to an embodiment of the present invention
Show method.Wherein, 1Bit are used to indicate symbol, and M are used to indicate integer part, and N for indicating fractional part;It compares
In 32 floating data representations, the short position fixed-point data representation that the present invention uses is less in addition to occupying number of bits
Outside, it for same layer, same type of data in neural network, such as all weight datas of first convolutional layer, also in addition sets
The position for having set a flag bit Point location record decimal point, can adjust in this way according to the distribution of real data
The precision and can indicate data area that data indicate.
Expression, that is, 32bit of floating number is indicated, but for this technical solution, uses fixed-point number that can reduce
The digit of the bit of one numerical value, to reduce the data volume of transmission and the data volume of operation.
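As a hedged illustration of the idea (the patent's concrete bit widths and Point location handling may differ), a short-bit fixed-point encode/decode with a shared decimal-point position could look like:

```python
def to_fixed(value, frac_bits, total_bits=16):
    """Encode a float as a signed fixed-point integer with `frac_bits`
    fractional bits (the 'point location'), saturating on overflow.
    Hypothetical helper; the 16-bit width is illustrative."""
    scaled = int(round(value * (1 << frac_bits)))
    lo = -(1 << (total_bits - 1))           # most negative code
    hi = (1 << (total_bits - 1)) - 1        # most positive code
    return max(lo, min(hi, scaled))

def from_fixed(code, frac_bits):
    """Decode a fixed-point code back to a float."""
    return code / (1 << frac_bits)
```

Values that fit within the integer and fractional bit budget round-trip exactly; values outside the representable range saturate, which is the trade-off between precision and range mentioned above.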
The input data are represented in Fig. 2a (N samples, each sample has C channels, and the feature map of each channel has height H and width W), and the weights, namely the convolution kernels, are represented in Fig. 2b (there are M convolution kernels, each with C channels, of height KH and width KW). The rule of the convolution operation is the same for all N samples of the input data; the following explains the process of the convolution operation on one sample. On one sample, each of the M convolution kernels performs the same operation: each kernel operation produces one planar feature map, and the M convolution kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product is computed at each planar position of one sample, and the kernel then slides along the H and W directions. For example, Fig. 2c shows the corresponding position at which a convolution kernel computes the inner product at the lower right corner of one sample of the input data; Fig. 2d shows the convolution position sliding one cell to the left; and Fig. 2e shows the convolution position sliding one cell upward.
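The sliding inner-product described above (Figs. 2c-2e) can be sketched naively for one sample and one kernel; stride 1 and no padding are assumed, and the nested-list layout is illustrative:

```python
def conv2d_single(inp, kernel):
    """Convolve one C x H x W sample with one C x KH x KW kernel:
    an inner product over all channels at each window position,
    sliding along the H and W directions."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    KC, KH, KW = len(kernel), len(kernel[0]), len(kernel[0][0])
    assert C == KC, "kernel channel count must match the input"
    out = [[0.0] * (W - KW + 1) for _ in range(H - KH + 1)]
    for i in range(H - KH + 1):
        for j in range(W - KW + 1):
            s = 0.0
            for c in range(C):            # inner product over channels
                for di in range(KH):      # ... and kernel height
                    for dj in range(KW):  # ... and kernel width
                        s += inp[c][i + di][j + dj] * kernel[c][di][dj]
            out[i][j] = s                 # one point of the feature map
    return out
```

Running all M kernels over the sample would yield the M planar feature maps described above.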
When the first operation is a convolution operation, the input data are the convolution input data and the weight data are the convolution kernel, and:
first complexity = α * C * kH * kW * M * N * W * C * H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data.
If the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are fixed-point data; if they are not, the convolution input data are converted into fixed-point data and the convolution kernel is converted into fixed-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed-point type.
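A sketch of the first-complexity formula and the threshold test, with the dimension names taken from the text (α, the tuple layouts, and the helper names are assumptions of this illustration):

```python
def conv_first_complexity(alpha, kernel_dims, input_dims):
    """First complexity = alpha * C*kH*kW*M * N*W*C*H, using the
    kernel dimensions (C, kH, kW, M) and the convolution input data
    dimensions (N, W, C, H) named above. alpha > 1 is assumed."""
    C, kH, kW, M = kernel_dims
    N, W, C_in, H = input_dims
    return alpha * C * kH * kW * M * N * W * C_in * H

def needs_fixed_point(complexity, threshold):
    """Above the set threshold, the data are converted to fixed point
    before the convolution is performed."""
    return complexity > threshold
```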
Specifically, the convolution processing can be handled using the chip structure shown in Fig. 3. When the first complexity is greater than the set threshold, the data type conversion circuit of the main processing circuit (which may be called the main unit) converts the data of some or all of the convolution kernels of the weights into fixed-point data, and the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights, through the horizontal data input interface, to the basic processing circuits (which may be called basic units) directly connected to the main processing circuit.
In an optional scheme, the control circuit of the main processing circuit sends the data of one convolution kernel of the weights, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st transmission sends the 1st number of the 3rd row, the 2nd transmission sends the 2nd number of the 3rd row, the 3rd transmission sends the 3rd number of the 3rd row, ...; or the 1st transmission sends the first two numbers of the 3rd row, the 2nd transmission sends the 3rd and 4th numbers of the 3rd row, the 3rd transmission sends the 5th and 6th numbers of the 3rd row, ....)
In another optional scheme, the control circuit of the main processing circuit sends the data of several convolution kernels of the weights, one number or a part of the numbers at a time for each kernel, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, ...; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, ....)
The control circuit of the main processing circuit partitions the input data according to convolution positions, and sends the data of some or all of the convolution positions in the input data, through the vertical data input interface, to the basic processing circuits that are directly connected to the main processing circuit;
In an optional scheme, the control circuit of the main processing circuit sends the data of a certain convolution position in the input data to a certain basic processing circuit one number, or one part of the numbers, at a time (for example, for a certain basic processing circuit: the 1st transfer sends the 1st number of the 3rd column, the 2nd transfer sends the 2nd number of the 3rd column, the 3rd transfer sends the 3rd number of the 3rd column, and so on; or the 1st transfer sends the first two numbers of the 3rd column, the 2nd transfer sends the 3rd and 4th numbers of the 3rd column, the 3rd transfer sends the 5th and 6th numbers of the 3rd column, and so on);
In another optional scheme, the control circuit of the main processing circuit sends the data of certain convolution positions in the input data to a certain basic processing circuit one number of each position, or one part of the numbers of each position, at a time (for example, for a certain basic processing circuit: the 1st transfer sends the 1st number of each of columns 3, 4 and 5, the 2nd transfer sends the 2nd number of each of columns 3, 4 and 5, the 3rd transfer sends the 3rd number of each of columns 3, 4 and 5, and so on; or the 1st transfer sends the first two numbers of each of columns 3, 4 and 5, the 2nd transfer sends the 3rd and 4th numbers of each of columns 3, 4 and 5, the 3rd transfer sends the 5th and 6th numbers of each of columns 3, 4 and 5, and so on);
After a basic processing circuit receives the weight data, it transmits those data through its lateral data output interface to the next basic processing circuit connected to it; after a basic processing circuit receives the input data, it transmits those data through its vertical data output interface to the next basic processing circuit connected to it;
Each basic processing circuit performs operations on the data it receives;
In an optional scheme, the basic processing circuit computes the multiplication of one or more groups of two numbers at a time, and then accumulates the results into its register and/or on-chip cache;
In an optional scheme, the basic processing circuit computes the inner product of one or more groups of two vectors at a time, and then accumulates the results into its register and/or on-chip cache;
After a basic processing circuit has computed a result, it can transmit the result out through its data output interface;
In an optional scheme, the computed result may be a final result or an intermediate result of the inner product operation;
Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result from that interface; if it does not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit.
After a basic processing circuit receives a computed result from another basic processing circuit, it transmits those data to another basic processing circuit or to the main processing circuit connected to it;
the result is output in the direction that allows direct output to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downward through their vertical output interfaces);
the main processing circuit receives the inner product operation results of all the basic processing circuits to obtain the output result.
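To make the grid data flow above concrete, here is a minimal software sketch, an illustration only and not the patent's circuit: weight data flow in row-wise, input data flow in column-wise, each basic processing circuit multiply-accumulates into its local register, and the finished sums correspond to what drains toward the main processing circuit. The function name and data layout are assumptions.

```python
def systolic_matmul(A, B):
    """Multiply A (m x k) by B (k x n) the way the grid would:
    the (i, j) circuit accumulates A[i][t] * B[t][j] over transfer steps t."""
    m, k, n = len(A), len(A[0]), len(B[0])
    acc = [[0] * n for _ in range(m)]           # per-circuit register / on-chip cache
    for t in range(k):                          # one number (or one part) per transfer
        for i in range(m):                      # lateral flow: row i of the weights
            for j in range(n):                  # vertical flow: column j of the inputs
                acc[i][j] += A[i][t] * B[t][j]  # multiply-accumulate in each circuit
    return acc                                  # results drain to the main circuit

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

The triple loop stands in for time steps of the grid; in hardware the m*n multiply-accumulates of one step t happen in parallel.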
Referring to Fig. 4a, Fig. 4a shows a matrix-multiply-matrix operation. For example, the first operation is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation, and the weight is the second matrix of the matrix-multiply-matrix operation;
first complexity = β*F*G*E*F, where β is a matrix coefficient whose value range is greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix;
if the first complexity is greater than the set threshold, determine whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are floating-point data, convert the first matrix into fixed-point data, convert the second matrix into fixed-point data, and then perform the matrix-multiply-matrix operation on the first matrix and the second matrix in fixed-point form.
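The threshold test above can be sketched as follows, assuming a simple string tag for the operands' current type; the function names and the example threshold value are illustrative, not from the patent.

```python
def first_complexity(beta, f1, g1, e2, f2):
    """First complexity = beta * F * G * E * F, with F, G the row and column
    values of the first matrix and E, F those of the second matrix."""
    return beta * f1 * g1 * e2 * f2

def operation_type(complexity, threshold, data_type):
    """If the complexity exceeds the threshold and the operands are
    floating point, the operation should run in fixed point."""
    if complexity > threshold and data_type == "float":
        return "fixed"
    return data_type

c = first_complexity(1, 1000, 1000, 1000, 1000)
print(operation_type(c, 10**9, "float"))  # fixed
```

Below the threshold (or for operands already in fixed point) no conversion is triggered by this check.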
Referring to Fig. 4b, the matrix-multiply-matrix operation is completed using the device shown in Fig. 3;
the following describes computing the multiplication of a matrix S of size M rows by L columns and a matrix P of size L rows by N columns (each row of matrix S is as long as each column of matrix P, as shown in Fig. 2d), where the neural network computing device possesses K basic processing circuits:
In step S401b, when the first complexity is greater than the set threshold, the main processing circuit converts matrix S and matrix P into fixed-point data, and the control circuit of the main processing circuit distributes each row of data in matrix S to one of the K basic processing circuits; the basic processing circuit stores the received data in its on-chip cache and/or register. Specifically, the data may be sent to those of the K basic processing circuits that are connected to the main processing circuit.
In an optional scheme, if the row number M of S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of matrix S to each of M basic processing circuits;
in an optional scheme, if the row number M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of matrix S to each basic processing circuit.
Let Mi rows of S be distributed to the i-th basic processing circuit, and let the set of these Mi rows be denoted Ai; Fig. 2e shows the calculation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit, the received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the subsequent data transfer volume is reduced, computational efficiency is improved, and power consumption is reduced.
In step S402b, the control circuit of the main processing circuit transmits the parts of matrix P to the basic processing circuits in a broadcast manner;
In an optional scheme, the parts of matrix P may each be broadcast only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of matrix P obtained this time, completing the inner product operations corresponding to every row of matrix Ai. Reuse in this embodiment specifically means repeated use of data by a basic processing circuit in its calculations; for example, reuse of the data of matrix P may mean that the data of matrix P are used multiple times.
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of matrix P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of matrix P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai in several passes;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of matrix P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of matrix P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai;
In an optional scheme, each basic processing circuit, for example the i-th basic processing circuit, calculates the inner products of the data of matrix Ai and the data of matrix P;
In step S403b, the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
In an optional scheme, each basic processing circuit may transmit the partial sums obtained from each of its inner product operations back to the main processing circuit for accumulation;
in an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may also be kept in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes;
In an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may, in some cases, be kept in the register and/or on-chip cache of the basic processing circuit for accumulation, in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation finishes.
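Steps S401b to S403b can be sketched as a small simulation, under the assumptions of round-robin row distribution and a single full broadcast of matrix P that each circuit fully reuses; all names are illustrative, not the patent's.

```python
def distribute_rows(S, K):
    """Step S401b: assign each row of S to one of K basic processing circuits."""
    parts = [[] for _ in range(K)]
    for r, row in enumerate(S):
        parts[r % K].append((r, row))   # circuit i holds its row set Ai
    return parts

def multiply_distributed(S, P, K):
    """Steps S402b-S403b: broadcast P once, compute inner products per circuit,
    and gather the accumulated results at the main processing circuit."""
    n = len(P[0])
    out = [[0] * n for _ in range(len(S))]
    for part in distribute_rows(S, K):          # each basic processing circuit
        for r, row in part:                     # each row of its Ai
            for j in range(n):                  # inner product with column j of P
                out[r][j] = sum(row[t] * P[t][j] for t in range(len(P)))
    return out

S = [[1, 0], [0, 2], [3, 1]]
P = [[4, 5], [6, 7]]
print(multiply_distributed(S, P, 2))  # [[4, 5], [12, 14], [18, 22]]
```

With M = 3 rows and K = 2 circuits this exercises the M > K case: one circuit holds two rows of S, the other holds one.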
Referring to Fig. 4c, Fig. 4c is a schematic diagram of a matrix-multiply-vector operation. For example, the first operation is a matrix-multiply-vector operation, the input data is the first matrix of the matrix-multiply-vector operation, and the weight is the vector of the matrix-multiply-vector operation;
first complexity = β*F*G*F, where β is a matrix coefficient whose value range is greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector;
if the first complexity is greater than the set threshold, determine whether the first matrix and the vector are floating-point data; if the first matrix and the vector are floating-point data, convert the first matrix into fixed-point data, convert the vector into fixed-point data, and then perform the matrix-multiply-vector operation on the first matrix and the vector in fixed-point form.
Referring to Fig. 4d, Fig. 4d provides an implementation method of matrix-multiply-vector, which may specifically include:
step S401, in which the data type conversion circuit of the main processing circuit converts each row of data in matrix S into fixed-point data, and the control circuit of the main processing circuit distributes the data to one of the K basic processing circuits, which stores the received distributed data in its on-chip cache and/or register;
In an optional scheme, if the row number M of matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of matrix S to each of the K basic processing circuits;
in an optional scheme, if the row number M of matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of matrix S to each basic processing circuit.
The set of rows of S distributed to the i-th basic processing circuit is denoted Ai and contains Mi rows in total; Fig. 2c shows the calculation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit, the received distributed data, such as matrix Ai, may be stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the subsequent transfer volume of the distributed data is reduced, computational efficiency is improved, and power consumption is reduced.
In step S402, the data type conversion circuit of the main processing circuit converts vector P into fixed-point data, and the control circuit of the main processing circuit transmits the parts of the fixed-point vector P to the K basic processing circuits in a broadcast manner;
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of vector P obtained this time, completing the inner product operations corresponding to every row of matrix Ai. The advantage is that the data transfer volume of repeated transmissions of vector P from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is reduced.
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of vector P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai in several passes. The advantage is that the transfer volume of a single transmission of vector P within a basic processing circuit is reduced, the capacity of the cache and/or register of the basic processing circuit can be reduced, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of vector P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai. The advantage is that the data transfer volume from the main processing circuit to the basic processing circuits is reduced, the data transfer volume within the basic processing circuits is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
In step S403, the inner product operator circuits of the K basic processing circuits calculate the inner products of the data of matrix S and vector P; for example, the i-th basic processing circuit calculates the inner products of the data of matrix Ai and the data of vector P;
in step S404, the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulation results, and transmit the accumulation results back to the main processing circuit in fixed-point form.
In an optional scheme, each basic processing circuit may transmit the partial sums obtained from its inner product operations (a partial sum is a part of the accumulation result; for example, if the accumulation result is F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5, a partial sum may be the value of F1*G1 + F2*G2 + F3*G3) back to the main processing circuit for accumulation. The advantage is that the amount of computation within the basic processing circuit is reduced, improving the operational efficiency of the basic processing circuit.
In an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may also be kept in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes. The advantage is that the data transfer volume between the basic processing circuits and the main processing circuit is reduced, operational efficiency is improved, and data transmission power consumption is reduced.
In an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may, in some cases, be kept in the register and/or on-chip cache of the basic processing circuit for accumulation, in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation finishes. The advantages are that the data transfer volume between the basic processing circuits and the main processing circuit is reduced, operational efficiency is improved, data transmission power consumption is reduced, the amount of computation within the basic processing circuit is reduced, and the operational efficiency of the basic processing circuit is improved.
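The partial-sum option above, where an accumulation result such as F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5 is split between the basic processing circuit and the main processing circuit, can be illustrated numerically:

```python
F = [1, 2, 3, 4, 5]
G = [10, 20, 30, 40, 50]

# F1*G1 + F2*G2 + F3*G3 computed on the basic processing circuit
partial = sum(f * g for f, g in zip(F[:3], G[:3]))
# the remaining terms accumulated afterwards on the main processing circuit
remainder = sum(f * g for f, g in zip(F[3:], G[3:]))
total = partial + remainder

print(partial, total)  # 140 550
```

However the five terms are split, the final accumulation result is the same; only where the addition happens (and hence the transfer volume and per-circuit workload) changes.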
Neural network training method
All of the data involved in the neural network training process may use different data representation methods;
specifically, the data representation methods include but are not limited to the following cases:
floating-point numbers of different bit widths;
fixed-point numbers of different bit widths, and fixed-point numbers with different fixed-point positions;
at different moments of the training process (that is, at different iteration counts or at initialization), in different phases (that is, forward or reverse operation), in different layers, for different data blocks of the same layer (that is, the multiple input data blocks and output data blocks), or for the sub-blocks into which the same data block is divided, it is possible to:
use fixed point or floating point respectively;
for fixed point:
use different fixed-point bit widths;
use different fixed-point offset values (that is, fixed-point positions);
The concrete implementation of neural network training is illustrated below with a practical example. Fig. 1a is a schematic diagram of the calculation of single-layer neural network training; as shown in Fig. 1a, the input data and the weights or parameters execute the operation of this layer. The technical solution provided by the embodiments of the present application determines, according to the input data, the weights and the forward operation amount of this layer, whether to convert the types of the input data and the weights. A specific way may be as follows: if the register or storage space occupied by storing the input data and the weights is greater than a set threshold, the forward operation amount of this layer is greater than a set operation amount, and the input data and the weight data are determined to be floating-point data, the input data and the weight data are converted into fixed-point data. If the register or storage space occupied by storing the input data and the weights is less than the set threshold and the input data and the weight data are fixed-point data, the input data and the weight data are converted into floating-point data before the operation of this layer is executed.
The principle of the above data type conversion in the present application is elaborated below. Fig. 1b shows a representation of fixed-point data. For a computing system, the storage bit number of one floating-point datum is 32 bits, whereas for fixed-point data, in particular data represented in the form shown in Fig. 1b, the storage bit number of one fixed-point datum can be 16 bits or fewer. For such a conversion, therefore, the transfer overhead between calculators can be significantly reduced; in addition, for the calculators, the storage space for data of fewer bits is smaller, i.e. the storage overhead is smaller, and the amount of computation is also reduced, i.e. the computational overhead is reduced; so both the computational overhead and the storage overhead can be reduced. However, the data type conversion itself also requires some overhead, hereinafter referred to as conversion overhead. For data with a large amount of computation and a large amount of data storage, the conversion overhead is almost negligible relative to the subsequent computational overhead, storage overhead and transfer overhead. The present application therefore adopts, for data with a large amount of computation and a large amount of data storage, the technical solution of converting the data type into fixed-point data. Conversely, for data with a small amount of computation and a small amount of data storage, the computational overhead, storage overhead and transfer overhead are themselves small; in this case, since the precision of fixed-point data is slightly lower than that of floating-point data, and the precision of the calculation must be guaranteed given that the amount of computation is small, the fixed-point data are here converted into floating-point data, i.e. the precision of the calculation is improved at the cost of a small increase in overhead.
An actual example is illustrated below. As shown in Fig. 4e, the operation of this layer is a matrix multiplication, and both the input data and the weight are matrices. For convenience of explanation, take matrix I as the input data and matrix W as the weight; as shown in Fig. 4e, output data = matrix I * matrix W. Here, if the sum of the column count and the row count of matrix I and matrix W is large, it can be considered that matrix I and matrix W occupy too much memory and/or register space and that the amount of computation is also large; in that case, if matrix I and matrix W are floating-point data, matrix I and matrix W are converted into fixed-point data before the matrix multiplication operation is executed.
For example, if matrix I is a 1000*1000 matrix and matrix W is also a 1000*1000 matrix, then the sum of the column count and the row count is 2000, which is very large, and the corresponding amount of computation is even larger: the multiplications of the inner products of the matrix-multiply-matrix operation number 10^9. For this technical solution, since matrix I and matrix W are very large, it is impossible to transmit all the data at once, so the same data may be transmitted several times; if they are transmitted as fixed-point data, the amount of data transmitted can be significantly reduced, thereby reducing the transfer overhead, and the calculation and storage with fewer bits can likewise reduce the computational overhead and the storage overhead.
The technical solution of converting fixed-point data into floating-point data is described taking the reverse operation as an example; as shown in Fig. 4g, the upward arrow direction in the computing structure indicates a reverse operation. For the reverse operation, the input is an output data gradient, which may specifically be as follows: if the output data gradient belongs to the last layer of the current iteration of calculation, the output data gradient is obtained by applying a preset operation to the output data of the last layer of the current iteration (the preset operation is not limited here; its concrete operation steps can be set by the manufacturer according to need); if the output data gradient belongs to a layer other than the last layer of the current iteration, for example the n-th layer of the current iteration, then the output data gradient is the input data gradient calculated by the reverse operation of the (n+1)-th layer.
An actual example is illustrated below. As shown in Fig. 4g, the operation of this layer is a multiplication in which the input data is a matrix and the weight is a scalar. For convenience of explanation, take matrix I as the input data and scalar C as the weight; as shown in Fig. 4g, output data = matrix I * C. Here, since the weight is a scalar, the amount of data computation is small; so if matrix I is fixed-point data, matrix I is converted into floating-point data before the matrix-multiply-scalar operation is executed.
For example, if matrix I is a 10*10 matrix and scalar C is a single number, then the sum of the column count and the row count is 20, which is small (it is assumed here that a value greater than 100 is considered large and a value less than 100 is considered small; the value 100 can be set arbitrarily by those skilled in the art), and the corresponding amount of computation is small: the multiplications of the inner products of the matrix multiplication number 10^2. Since the amount of computation is small, still calculating with fixed-point data would affect the precision; in order to make the calculation more precise, under the premise of a small amount of computation, the computational precision can be improved by calculating with floating-point data.
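The size heuristic shared by the two examples above, comparing the sum of row and column counts against the illustrative value 100, can be written down directly; the function name is an assumption.

```python
def pick_type(rows, cols, threshold=100):
    """Large operands run in fixed point (cheap transfer/storage);
    small operands stay in floating point (precision matters more)."""
    return "fixed" if rows + cols > threshold else "float"

print(pick_type(1000, 1000))  # 'fixed'  -> the 1000*1000 matrix-multiply case
print(pick_type(10, 10))      # 'float'  -> the 10*10 matrix-times-scalar case
```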
In an optional scheme, each data block of each layer in the network may each adopt a fixed fixed-point bit width, while its fixed-point position varies with the training iteration cycle;
Specifically, during the training process, the data representation method of a certain data block may be set as follows:
specifically, when training starts, an arbitrary data representation method may be selected for a certain data block;
in an optional scheme, a floating-point representation method of a specific bit width may be chosen;
in an optional scheme, a fixed-point representation method of a specific form may be chosen:
a specific fixed-point bit width may be chosen;
a specific fixed-point position may be chosen;
In an optional scheme, the fixed-point position may be set according to the maximum absolute value of all the data in the data block;
in an optional scheme, the fixed-point position may be set according to the minimum absolute value of all the data in the data block;
in an optional scheme, at initialization, the fixed-point position of this data block may be determined according to the fixed-point positions of other data blocks;
In an optional scheme, the fixed-point position of this data block may be set according to empirical values;
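The max-absolute-value initialization option above can be sketched as follows; the exact formula is an assumed common choice (enough integer bits for the largest magnitude, remaining bits fractional), not one given in the text.

```python
import math

def point_position_from_max(data, bits=16):
    """Choose a fixed-point position so that the largest absolute value in
    the data block still fits in the signed integer range of `bits` bits."""
    max_abs = max(abs(v) for v in data)
    int_bits = max(1, math.ceil(math.log2(max_abs + 1e-12)))  # integer-part bits
    return bits - 1 - int_bits      # bits left (minus sign bit) are fractional

block = [0.5, -3.2, 7.9, 1.0]
print(point_position_from_max(block))  # 12: 7.9 needs 3 integer bits of 16
```

With position 12, the representable range is about ±8, so 7.9 fits while the fractional resolution stays at 2^-12.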
Specifically, during the training process, the data representation method of a certain data block may be changed at any iteration cycle:
in an optional scheme, no adjustment may be made for a certain data block;
in an optional scheme, the adjustment may be made every certain number of iterations;
in an optional scheme, the adjustment may be made every certain number of training epochs;
in an optional scheme, the adjustment may be made at non-fixed iteration intervals;
in an optional scheme, the adjustment may be made at non-fixed training epoch intervals;
Specifically, during the training process, when the representation method of a certain data block is adjusted, it may be adjusted to an arbitrary data representation method;
in an optional scheme, if a data block is represented by fixed-point numbers of fixed fixed-point bit width, the fixed-point position of the data representation may be adjusted as follows:
in an optional scheme, the fixed-point position is set each time according to the setting method used to initialize the fixed-point position;
In an optional scheme, if the fixed-point position of a certain data block, calculated according to the initial setting method of the fixed-point position, has increased in some iteration cycle compared with the previous iteration cycle, the fixed-point position in this cycle is changed toward an increase; conversely, it is changed toward a decrease.
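The adjustment rule above, moving the fixed-point position toward an increase when the newly computed position grew relative to the previous iteration cycle and toward a decrease otherwise, can be sketched minimally; the step size of one per cycle is an assumption.

```python
def adjust_position(prev_pos, computed_pos):
    """Move the stored fixed-point position one step toward the position
    computed for the current iteration cycle."""
    if computed_pos > prev_pos:
        return prev_pos + 1     # change toward an increase
    if computed_pos < prev_pos:
        return prev_pos - 1     # change toward a decrease
    return prev_pos             # unchanged when the computed position is equal

pos = 10
for computed in [11, 12, 12, 9]:
    pos = adjust_position(pos, computed)
print(pos)  # 10 -> 11 -> 12 -> 12 -> 11
```

Stepping rather than jumping keeps the representation stable across noisy per-cycle statistics.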
The present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network including multiple layers; the integrated circuit chip device includes a processing circuit and an external interface;
the external interface is configured to receive a training instruction;
the processing circuit is configured to determine first-layer input data and first-layer weight data according to the training instruction, and to perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight data to obtain the n-th output result;
the processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, the n-th reverse operation of the reverse operation of the n-th layer; to obtain an n-th reverse computational complexity according to the n-th output result gradient, the n-th layer input data, the n-th layer weight group data and the n-th reverse operation; to determine, according to the n-th reverse computational complexity, the n-th reverse data type corresponding to the n-th output result gradient, the n-th layer input data and the n-th layer weight group data; and to perform the n-layer reverse operation of the neural network on the n-th output result gradient, the n-th layer input data and the n-th layer weight group data in the n-th reverse data type, to obtain the n weight gradients of the n-layer operation, wherein the n-th reverse data type includes a fixed-point type or a floating-point type;
the processing circuit is further configured to update the n weights of the n-layer operation using the n weight gradients.
The present disclosure also discloses a neural network computing device, which includes one or more of the chips shown in Fig. 3, and which is configured to acquire data to be operated on and control information from other processing devices, perform the specified neural network operations, and pass the execution results to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces and servers. When more than one of the chips shown in Fig. 3 is included, the chips can be linked through a specific structure and transmit data, for example interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the chips may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection mode may be any interconnection topology.
The neural network computing device has high compatibility and can be connected to various types of servers through the PCIE interface.
The present disclosure also discloses a combined processing device, which includes the above neural network computing device, a general interconnection interface, and other processing devices (i.e. general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete the operations specified by the user. Fig. 5a is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as central processing units (CPU), graphics processing units (GPU) and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, carrying data and performing basic controls of the neural network computing device such as starting and stopping; the other processing devices may also cooperate with the neural network computing device to jointly complete processing tasks.
The general interconnection interface is used to transmit data and control instructions between the neural network computing device and the other processing devices. The neural network computing device acquires the required input data from the other processing devices and writes them into the on-chip storage device of the neural network computing device; it may acquire control instructions from the other processing devices and write them into the on-chip control cache of the neural network computing device; it may also read the data in the memory module of the neural network computing device and transmit them to the other processing devices.
As shown in Fig. 5b, optionally, the structure further includes a storage device for storing the data required by this arithmetic unit/arithmetic device or by other arithmetic units; it is particularly suitable for data that are required for the operations of this neural network computing device but cannot all be saved in the internal storage of this neural network computing device or the other processing devices.
The combined processing device can serve as the system-on-chip (SoC) of devices such as a mobile phone, a robot, a drone, or video monitoring equipment, effectively reducing the die area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the general interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network interface card, or a WiFi interface.
Referring to Fig. 5c, Fig. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure. As shown in Fig. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in Fig. 5d, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The present disclosure does not limit the specific form of the neural network chip 111. The neural network chip 111 includes, but is not limited to, a chip integrating a neural network processor, and the chip may be made of silicon, germanium, a quantum material, a molecular material, or the like. The neural network chip may be packaged according to the actual situation (for example, a harsher environment) and different application requirements, so that most of the neural network chip is enclosed while the pins on the neural network chip are connected to the outside of the package structure through conductors such as gold wires, for circuit connection with outer layers.
The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the device shown in Fig. 1a or Fig. 1b.
The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; each may be a printed circuit board (PCB), a printed wiring board (PWB), or another circuit board. The material of the PCB is also not limited.
The second substrate 113 of the present disclosure is configured to carry the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, protects the neural network chip 111 and facilitates further packaging of the neural network chip package structure 11 with the first substrate 13.
The specific packaging method of the second electrical and non-electrical connection device 112, and the structure corresponding to that packaging method, are not limited; a suitable packaging method may be selected and simply improved according to the actual situation and different application requirements, for example: flip chip ball grid array package (Flip Chip Ball Grid Array Package, FCBGAP), low-profile quad flat package (Low-profile Quad Flat Package, LQFP), quad flat package with heat sink (Quad Flat Package with Heat sink, HQFP), quad flat no-lead package (Quad Flat Non-lead Package, QFN), or fine-pitch ball grid array package (Fine-pitch Ball Grid Array, FBGA).
Flip chip (Flip Chip) packaging is suitable for cases where the requirements on the area after packaging are high, or where there is sensitivity to conductor inductance and signal transmission time. In addition, wire bonding (Wire Bonding) may be used as the packaging method, which reduces cost and improves the flexibility of the package structure.
Ball grid array (Ball Grid Array) packaging can provide more pins, and the average conductor length of the pins is short, which supports high-speed signal transmission. The package may be replaced by a pin grid array (Pin Grid Array, PGA), zero insertion force (Zero Insertion Force, ZIF), single edge contact connection (Single Edge Contact Connection, SECC), land grid array (Land Grid Array, LGA), or the like.
Optionally, the neural network chip 111 and the second substrate 113 are packaged by flip chip ball grid array (Flip Chip Ball Grid Array) packaging; for a schematic diagram of the specific neural network chip package structure, refer to Fig. 6. As shown in Fig. 6, the neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected with the neural network chip 21, and the solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 with the second substrate 24, that is, realizing the packaging of the neural network chip 21.
The pins 26 are configured to connect with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling the transmission of external data and internal data and facilitating the processing of data by the neural network chip 21 or the neural network processor corresponding to the neural network chip 21. The type and number of pins are also not limited in the present disclosure; different pin forms may be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls. The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a radiator, for example, a fan.
For example, as shown in Fig. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used to dissipate heat generated when the neural network chip 21 operates.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected with the pads 22 and embedded in the solder balls 23, to enhance the bonding strength between the solder balls 23 and the pads 22. The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited here.
The present disclosure also does not limit the specific form of the first electrical and non-electrical connection device 12; refer to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or in a pluggable manner, which facilitates subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit for expanding storage capacity, such as synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM) or double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR), which improves the processing capacity of the neural network processor by expanding the memory.
The first substrate 13 may also include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a small form-factor pluggable (Small Form-factor Pluggable, SFP) interface, an Ethernet interface, a controller area network (Controller Area Network, CAN) bus interface, and the like, for data transmission between the package structure and an external circuit, which can improve computation speed and ease of operation.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board 10. The board exchanges data with an external circuit (for example, a computer motherboard) through a board interface (a slot or a ferrule); that is, the function of the neural network processor is realized directly by using the neural network processor board 10, while the neural network chip 111 is protected. Other modules may also be added to the neural network processor board 10, which expands the application range and improves the computational efficiency of the neural network processor.
In one embodiment, the present disclosure provides an electronic device, which includes the above neural network processor board 10 or the above neural network chip package structure 11.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument, and/or an electrocardiograph.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (16)
1. An integrated circuit chip device, wherein the device is configured to perform training of a neural network, the neural network includes n layers, and n is an integer greater than or equal to 2, characterized in that the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a data type conversion circuit configured to perform conversion between floating-point data and fixed-point data;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column;
the integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an n-th output result of the forward operation;
the main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, obtain an n-th backward operation of the backward operation of the n-th layer according to the training instruction, obtain an n-th backward computational complexity according to the n-th output result gradient, the n-th-layer input data, the n-th-layer weight group data, and the n-th backward operation, and determine, according to the n-th backward computational complexity, an n-th backward data type corresponding to the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data;
the main processing circuit is configured to divide the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data into a broadcast data block and a distribution data block according to the type of the n-th backward operation, split the distribution data block of the n-th backward data type to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to at least one of the basic processing circuits connected with the main processing circuit, and broadcast the broadcast data block of the n-th backward data type to the basic processing circuits connected with the main processing circuit;
the basic processing circuits are configured to perform the operations in the neural network in parallel according to the broadcast data block of the n-th backward data type and the basic data blocks of the n-th backward data type to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected with the main processing circuit;
the main processing circuit is configured to process the operation results to obtain an n-th-layer weight group gradient and an n-th-layer input data gradient, and update the n-th-layer weight group data using the n-th-layer weight group gradient; the n-th backward data type includes: a fixed-point type or a floating-point type;
the integrated circuit chip device is further configured to use the n-th-layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, perform the backward operations of the remaining n-1 layers to obtain the weight group gradients of the n-1 layers, and update the weight group data of the respective layers using the weight group gradients of the n-1 layers, wherein the weight group data includes at least two weights.
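The data type conversion described in claim 1 (floating point to fixed point and back) can be sketched in software. The 16-bit word width, 8 fractional bits, and saturating rounding below are illustrative assumptions, not parameters taken from the claim:

```python
def float_to_fixed(x, frac_bits=8, word_bits=16):
    """Quantize a float to a signed fixed-point integer with `frac_bits`
    fractional bits, saturating to the representable word range."""
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = int(round(x * (1 << frac_bits)))
    return max(lo, min(hi, q))

def fixed_to_float(q, frac_bits=8):
    """Inverse conversion: rescale the fixed-point integer back to a float."""
    return q / (1 << frac_bits)

# 1.5 is exactly representable with 8 fractional bits: 1.5 * 256 = 384
q = float_to_fixed(1.5)
assert fixed_to_float(q) == 1.5
```

Values outside the representable range saturate rather than wrap, which is the usual choice for fixed-point neural network arithmetic.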
2. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is specifically configured to compare the n-th backward computational complexity with a preset threshold; if the n-th backward computational complexity is higher than the preset threshold, the n-th backward data type is determined to be the fixed-point type; if the n-th backward computational complexity is lower than or equal to the preset threshold, the n-th backward data type is determined to be the floating-point type.
3. The integrated circuit chip device according to claim 2, characterized in that
the main processing circuit is specifically configured to determine an (n+1)-th backward data type to which the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belong; if the (n+1)-th backward data type differs from the n-th backward data type, the data type conversion circuit converts the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the (n+1)-th backward data type into the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the n-th backward data type.
4. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is configured to, when the n-th backward operation is a convolution operation, the convolution input data being the n-th-layer input data and the convolution kernel being the n-th output result gradient, compute:
the n-th backward computational complexity = α*C*kW*kH*M*N*W*C*H;
where α is a convolution coefficient with a value range greater than 1; C, kW, kH, M are the values of the four dimensions of the convolution kernel, and N, W, C, H are the values of the four dimensions of the convolution input data;
if the complexity is greater than a set threshold, the n-th backward data type is determined to be the fixed-point type; it is determined whether the convolution input data and the convolution kernel are fixed-point data, and if the convolution input data and the convolution kernel are not fixed-point data, the convolution input data is converted into fixed-point data and the convolution kernel is converted into fixed-point data, and then the convolution operation is performed on the convolution input data and the convolution kernel in the fixed-point data type.
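The complexity measure of claim 4 is the product of the kernel and input dimension sizes scaled by the convolution coefficient α; a minimal sketch, under the assumption that the kernel dimensions are ordered (C, kW, kH, M) and the input dimensions (N, W, C, H):

```python
def conv_backward_complexity(alpha, kernel_dims, input_dims):
    """Claim 4's measure: complexity = alpha * C*kW*kH*M * N*W*C*H."""
    c, kw, kh, m = kernel_dims       # kernel: channels, width, height, count
    n, w, c_in, h = input_dims       # input: batch, width, channels, height
    return alpha * c * kw * kh * m * n * w * c_in * h

# e.g. a 3x3 kernel with 16 input/output channels on a 1x32x16x32 input
assert conv_backward_complexity(2, (16, 3, 3, 16), (1, 32, 16, 32)) == 75_497_472
```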
5. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is further configured to, when the n-th backward operation is a matrix-multiply-matrix operation, the input data being the n-th-layer input data and the weight being the n-th output result gradient, compute:
complexity = β*F*G*E*F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and E and F are the row and column values of the weight;
if the complexity is greater than a set threshold, the n-th backward data type is determined to be the fixed-point type; it is determined whether the n-th-layer input data and the weight are fixed-point data, and if the n-th-layer input data and the weight are not fixed-point data, the n-th-layer input data is converted into fixed-point data and the weight is converted into fixed-point data, and then the matrix-multiply-matrix operation is performed on the n-th-layer input data and the weight in the fixed-point data type.
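Claim 5's measure multiplies the input's row and column counts by the weight's row and column counts (the claim reuses the symbol F for both the input row count and the weight column count); a minimal sketch taking the four values as independent parameters:

```python
def matmul_backward_complexity(beta, input_rows, input_cols,
                               weight_rows, weight_cols):
    """Claim 5's measure: complexity = beta * F*G*E*F, with (F, G) the
    input's row/column counts and (E, F) the weight's row/column counts."""
    return beta * input_rows * input_cols * weight_rows * weight_cols

# a 4x5 input against a 6x4 weight, with beta = 1
assert matmul_backward_complexity(1, 4, 5, 6, 4) == 480
```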
6. The integrated circuit chip device according to claim 1, characterized in that
the integrated circuit chip device is further configured to, when the n-th backward operation is a matrix-multiply-vector operation, the input data being the n-th-layer input data and the weight being the n-th output result gradient, compute:
complexity = β*F*G*F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and F is the column value of the n-th output result gradient;
if the complexity is greater than a set threshold, the n-th backward data type is determined to be the fixed-point type; it is determined whether the n-th-layer input data and the weight are fixed-point data, and if the n-th-layer input data and the weight are not fixed-point data, the k branch processing circuits are notified to convert the n-th-layer input data into fixed-point data and the weight into fixed-point data, and then the matrix-multiply-vector operation is performed on the n-th-layer input data and the weight in the fixed-point type.
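The full flow of claim 6 — compute the complexity, compare it with the threshold, and convert to fixed point only when the threshold is exceeded — can be illustrated end to end; a software sketch in which the threshold, the 8 fractional bits, and the pure-Python fixed-point simulation are illustrative assumptions rather than the claimed circuit:

```python
def matvec_with_type_selection(x, w, beta=1.0, threshold=1e6, frac_bits=8):
    """Sketch of claim 6: complexity = beta*F*G*F; above the threshold the
    matrix-vector product runs in simulated fixed point, otherwise in
    floating point."""
    f, g = len(x), len(x[0])
    complexity = beta * f * g * len(w)
    if complexity > threshold:
        scale = 1 << frac_bits
        xq = [[round(v * scale) for v in row] for row in x]  # quantize input
        wq = [round(v * scale) for v in w]                   # quantize weight
        # integer multiply-accumulate, rescaled back at the end
        return [sum(a * b for a, b in zip(row, wq)) / scale**2 for row in xq]
    return [sum(a * b for a, b in zip(row, w)) for row in x]

x, w = [[1.0, 2.0]], [3.0, 4.0]
assert matvec_with_type_selection(x, w) == [11.0]               # float path
assert matvec_with_type_selection(x, w, threshold=0) == [11.0]  # fixed path
```

With exactly representable values the two paths agree; in general the fixed-point path trades a small quantization error for cheaper arithmetic, which is the stated advantage of the disclosure.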
7. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is specifically configured to: if the type of the n-th backward operation is a multiplication operation, determine that the n-th-layer input data and the n-th-layer weight group data are both distribution data blocks and the n-th output result gradient is a broadcast data block;
if the type of the n-th backward operation is a convolution operation, determine that the n-th-layer input data and the n-th-layer weight group data are both broadcast data blocks and the n-th output result gradient is a distribution data block.
8. The integrated circuit chip device according to any one of claims 1-7, characterized in that
the n-layer backward operation further includes one of, or any combination of, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
9. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit includes: a main register or a main on-chip cache circuit;
the basic processing circuit includes: a basic register or a basic on-chip cache circuit.
10. The integrated circuit chip device according to claim 9, characterized in that
the main processing circuit includes one of, or any combination of, a vector computation unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
11. The integrated circuit chip device according to claim 9, characterized in that
the n-th output result gradient is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the n-th-layer input data is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the n-th-layer weight group data is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
12. A neural network computing device, characterized in that the neural network computing device includes one or more integrated circuit chip devices according to any one of claims 1-11.
13. A combined processing device, characterized in that the combined processing device includes: the neural network computing device according to claim 12, a general interconnection interface, and a general processing device;
the neural network computing device is connected with the general processing device through the general interconnection interface.
14. A chip, characterized in that the chip integrates the device according to any one of claims 1-13.
15. A smart device, characterized in that the smart device includes the chip according to claim 14.
16. A neural network operation method, characterized in that the method is applied in an integrated circuit chip device, the integrated circuit chip device includes the integrated circuit chip device according to any one of claims 1-11, and the integrated circuit chip device is configured to perform a training operation of the neural network.
Priority Applications (13)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711469408.3A CN109978156B (en) | 2017-12-28 | 2017-12-28 | Integrated circuit chip device and related product |
PCT/CN2018/123929 WO2019129070A1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
EP18896519.8A EP3719712B1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
EP20201907.1A EP3783477B1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
EP20203232.2A EP3789871B1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
US16/903,304 US11544546B2 (en) | 2017-12-27 | 2020-06-16 | Integrated circuit chip device |
US17/134,435 US11741351B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,486 US11748604B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,446 US11748603B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,444 US11748601B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,445 US11748602B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,487 US11748605B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US18/073,924 US11983621B2 (en) | 2017-12-27 | 2022-12-02 | Integrated circuit chip device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711469408.3A CN109978156B (en) | 2017-12-28 | 2017-12-28 | Integrated circuit chip device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109978156A true CN109978156A (en) | 2019-07-05 |
CN109978156B CN109978156B (en) | 2020-06-12 |
Family
ID=67075532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711469408.3A Active CN109978156B (en) | 2017-12-27 | 2017-12-28 | Integrated circuit chip device and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109978156B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115221102A (en) * | 2021-04-16 | 2022-10-21 | 中科寒武纪科技股份有限公司 | Method for optimizing convolution operation of system on chip and related product |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170063351A1 (en) * | 2015-08-31 | 2017-03-02 | Semiconductor Energy Laboratory Co., Ltd. | Semiconductor device or electronic device including the semiconductor device |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107169563A (en) * | 2017-05-08 | 2017-09-15 | 中国科学院计算技术研究所 | Processing system and method applied to two-value weight convolutional network |
US9779355B1 (en) * | 2016-09-15 | 2017-10-03 | International Business Machines Corporation | Back propagation gates and storage capacitor for neural networks |
CN109961138A (en) * | 2017-12-14 | 2019-07-02 | 北京中科寒武纪科技有限公司 | Neural network training method and Related product |
- 2017
  - 2017-12-28 CN CN201711469408.3A patent/CN109978156B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170063351A1 (en) * | 2015-08-31 | 2017-03-02 | Semiconductor Energy Laboratory Co., Ltd. | Semiconductor device or electronic device including the semiconductor device |
US9779355B1 (en) * | 2016-09-15 | 2017-10-03 | International Business Machines Corporation | Back propagation gates and storage capacitor for neural networks |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107169563A (en) * | 2017-05-08 | 2017-09-15 | 中国科学院计算技术研究所 | Processing system and method applied to two-value weight convolutional network |
CN109961138A (en) * | 2017-12-14 | 2019-07-02 | 北京中科寒武纪科技有限公司 | Neural network training method and Related product |
Non-Patent Citations (4)
Title |
---|
SHAOLI LIU等: "Cambricon: An Instruction Set Architecture for Neural Networks", 《2016 ACM/IEEE 43RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA)》 * |
YUNJI CHEN等: "DaDianNao: A Neural Network Supercomputer", 《2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE》 * |
毛健: "基于BP网络的神经元芯片的关键部件设计", 《万方数据知识服务平台》 * |
陆志坚: "基于FPGA的卷积神经网络并行结构研究", 《中国博士学位论文全文数据库 信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115221102A (en) * | 2021-04-16 | 2022-10-21 | 中科寒武纪科技股份有限公司 | Method for optimizing convolution operation of system on chip and related product |
CN115221102B (en) * | 2021-04-16 | 2024-01-19 | 中科寒武纪科技股份有限公司 | Method for optimizing convolution operation of system-on-chip and related product |
Also Published As
Publication number | Publication date |
---|---|
CN109978156B (en) | 2020-06-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109961138A (en) | Neural network training method and Related product | |
WO2019129070A1 (en) | Integrated circuit chip device | |
CN109978131A (en) | Integrated circuit chip device and Related product | |
CN111242294B (en) | Integrated circuit chip device and related products | |
CN109977446A (en) | Integrated circuit chip device and Related product | |
CN109961134A (en) | Integrated circuit chip device and Related product | |
CN109961131A (en) | Neural network forward operation method and Related product | |
CN109978156A (en) | Integrated circuit chip device and Related product | |
CN109961135A (en) | Integrated circuit chip device and Related product | |
CN109978148A (en) | Integrated circuit chip device and Related product | |
CN109978157A (en) | Integrated circuit chip device and Related product | |
CN109978147A (en) | Integrated circuit chip device and Related product | |
CN110175673A (en) | Processing method and accelerator | |
CN109960673A (en) | Integrated circuit chip device and Related product | |
CN109978151A (en) | Neural network processor board and Related product | |
CN109977071A (en) | Neural network processor board and Related product | |
CN109978150A (en) | Neural network processor board and Related product | |
CN109978158A (en) | Integrated circuit chip device and Related product | |
CN110197264A (en) | Neural network processor board and Related product | |
CN110197267A (en) | Neural network processor board and Related product | |
CN110490315A (en) | The reversed operation Sparse methods and Related product of neural network | |
CN109961133A (en) | Integrated circuit chip device and Related product | |
CN109978154A (en) | Integrated circuit chip device and Related product | |
CN109978130A (en) | Integrated circuit chip device and Related product | |
CN109978155A (en) | Integrated circuit chip device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant after: Zhongke Cambrian Technology Co., Ltd Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |