CN109978156A - Integrated circuit chip device and Related product - Google Patents
- Publication number
- CN109978156A (application CN201711469408.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- layer
- circuit
- type
- input data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The present disclosure provides an integrated circuit chip device and related products. The device is used for performing training of a neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and multiple basic processing circuits. The main processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data. The multiple basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.
Description
Technical field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background technique
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neural network of the human brain from an information-processing perspective, builds a simple model, and forms different networks according to different connection schemes. In engineering and academia it is often referred to simply as a neural network or a neural-network-like model. A neural network is a computational model composed of a large number of interconnected nodes (or neurons). Existing neural network computation relies on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) to implement the forward operation of the network; such forward operation is computationally intensive and power-hungry.
Summary of the invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed and efficiency of a computing device.
In a first aspect, an integrated circuit chip device for performing training of a neural network is provided. The device is used for performing the training of the neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and multiple basic processing circuits. The main processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data. The multiple basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column.
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operation.
The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain the nth backward operation of the backward operation of the nth layer according to the training instruction, obtain an nth backward computational complexity according to the nth output result gradient, the nth-layer input data, the nth-layer weight group data, and the nth backward operation, and determine, according to the nth backward computational complexity, the nth backward data type corresponding to the nth output result gradient, the nth-layer input data, and the nth-layer weight group data.
The main processing circuit is configured to divide the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into a broadcast data block and a distribution data block according to the type of the nth backward operation, split the distribution data block of the nth backward data type to obtain multiple basic data blocks, distribute the multiple basic data blocks to at least one of the basic processing circuits connected to the main processing circuit, and broadcast the broadcast data block of the nth backward data type to the basic processing circuits connected to the main processing circuit.
The basic processing circuits are configured to perform, in parallel, the operations of the neural network on the broadcast data block of the nth backward data type and the basic data blocks of the nth backward data type to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
The main processing circuit is configured to process the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient; the nth backward data type includes a fixed-point type or a floating-point type.
The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer and perform the backward operation of the (n-1)th layer to obtain an (n-1)th-layer weight group gradient, and to update the weight group data of the corresponding layer using the (n-1)th-layer weight group gradient, where the weight group data include at least two weights.
In a second aspect, a neural network operation device is provided, which includes one or more of the integrated circuit chip devices provided in the first aspect.
In a third aspect, a combined processing device is provided, which includes the neural network operation device of the second aspect, a universal interconnection interface, and a general-purpose processing device; the neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.
In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.
In a fifth aspect, an electronic device is provided, which includes the chip of the fourth aspect.
It can be seen that, in the embodiments of the present disclosure, a data type conversion circuit is provided to convert the type of a data block before the operation, which saves transmission resources and computing resources; the solution therefore has the advantages of low power consumption and a small amount of computation.
Detailed description of the invention
Fig. 1 is a kind of training method schematic diagram of neural network.
Fig. 1 a is a kind of forward operation schematic diagram of neural network.
Fig. 1 b is a kind of schematic configuration diagram of fixed-point data type.
Fig. 2 a is convolution input data schematic diagram.
Fig. 2 b is convolution kernel schematic diagram.
Fig. 2 c is the operation window schematic diagram of a three-dimensional data block of input data.
Fig. 2 d is another operation window schematic diagram of a three-dimensional data block of input data.
Fig. 2e is yet another operation window schematic diagram of a three-dimensional data block of the input data.
Fig. 3 is a kind of structural schematic diagram of neural network chip.
Fig. 4a is a schematic diagram of a matrix multiplied by a matrix.
Fig. 4b is a method flowchart of a matrix multiplied by a matrix.
Fig. 4c is a schematic diagram of a matrix multiplied by a vector.
Fig. 4d is a method flowchart of a matrix multiplied by a vector.
Fig. 4 e is a kind of neural metwork training schematic diagram.
Fig. 4 f is another neural metwork training schematic diagram.
Fig. 4 g is neural network forward direction and reversed operation schematic diagram.
Fig. 4 h is neural metwork training multilayered structure schematic diagram.
Fig. 5a is a structural schematic diagram of a combined processing device also disclosed in the present disclosure.
Fig. 5b is another structural schematic diagram of a combined processing device also disclosed in the present disclosure.
Fig. 5c is a structural schematic diagram of a neural network processor board provided by an embodiment of the present disclosure.
Fig. 5d is a structural schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 5e is a structural schematic diagram of a neural network chip provided by an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
Specific embodiment
In order to enable those skilled in the art to better understand the solution of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, rather than all of them. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
In the device provided in the first aspect, the main processing circuit is specifically configured to compare the nth backward computational complexity with a preset threshold; if the nth backward computational complexity is higher than the preset threshold, the nth backward data type is determined to be the fixed-point type; if the nth backward computational complexity is lower than or equal to the preset threshold, the computing device determines the nth backward data type to be the floating-point type.
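As a rough sketch (not the patent's circuit logic), the threshold rule above can be expressed in Python; the function name and the string encoding of the two types are illustrative assumptions:

```python
FIXED_POINT = "fixed"      # fixed-point type
FLOATING_POINT = "float"   # floating-point type

def select_backward_dtype(complexity, threshold):
    """Pick the nth backward data type: a complexity above the preset
    threshold selects fixed point (cheaper arithmetic); a complexity
    lower than or equal to the threshold selects floating point."""
    return FIXED_POINT if complexity > threshold else FLOATING_POINT
```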
In the device provided in the first aspect, the main processing circuit is specifically configured to determine the (n+1)th backward data type to which the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belong; if the (n+1)th backward data type differs from the nth backward data type, the data type operation circuit converts the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belonging to the (n+1)th backward data type into the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belonging to the nth backward data type.
In the device provided in the first aspect, the main processing circuit is configured, when the nth backward operation is a convolution operation, to take the convolution input data as the nth-layer input data and the convolution kernel as the nth output result gradient, and to compute:
nth backward computational complexity = α * C * kH * kW * M * N * W * C * H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data.
If the complexity is greater than a set threshold, the nth backward data type is determined to be the fixed-point type; it is then determined whether the convolution input data and the convolution kernel are fixed-point data, and if they are not, the convolution input data are converted into fixed-point data and the convolution kernel is converted into fixed-point data, after which the convolution operation is performed on the convolution input data and the convolution kernel in the fixed-point type.
In the device provided in the first aspect, the main processing circuit is further configured, when the nth backward operation is a matrix-multiply-matrix operation, to take the input data as the nth-layer input data and the weight as the nth output result gradient, and to compute:
complexity = β * F * G * E * F;
where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the nth-layer input data, and E and F are the row and column values of the weight.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type; it is then determined whether the nth-layer input data and the weight are fixed-point data, and if they are not, the weight and the nth-layer input data are converted into fixed-point data, after which the matrix-multiply-matrix operation is performed on the nth-layer input data and the weight in the fixed-point type.
In the device provided in the first aspect, the integrated circuit chip device is further configured, when the nth backward operation is a matrix-multiply-vector operation, to take the input data as the nth-layer input data and the weight as the nth output result gradient, and to compute:
complexity = β * F * G * F;
where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the nth-layer input data, and F is the column value of the nth output result gradient.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type; it is then determined whether the nth-layer input data and the weight are fixed-point data, and if they are not, the k branch processing circuits are notified to convert the nth-layer input data into fixed-point data and to convert the weight into fixed-point data, after which the matrix-multiply-vector operation is performed on the nth-layer input data and the weight in the fixed-point type.
In the device provided in the first aspect, the main processing circuit is specifically configured such that, if the type of the nth backward operation is a multiplication operation, the nth-layer input data and the nth-layer weight group data are determined to be distribution data blocks and the nth output result gradient is determined to be a broadcast data block; if the type of the nth backward operation is a convolution operation, the nth-layer input data and the nth-layer weight group data are determined to be broadcast data blocks and the nth output result gradient is determined to be a distribution data block.
In the device provided in the first aspect, the n-layer backward operation further includes one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit, and each basic processing circuit includes a basic register or a basic on-chip cache circuit.
In the device provided in the first aspect, the main processing circuit includes one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
In the device provided in the first aspect, the nth output result gradient is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the nth-layer input data are one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the n-layer weight group data are one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
As shown in Fig. 1, the steps of neural network training include:
each layer of a (multilayer) neural network performs the forward operation in turn;
the backward operation is performed layer by layer in the reverse layer order to obtain the weight gradients;
the computed weight gradients are used to update the weights of the forward operation.
This is one iteration of neural network training; the entire training process needs to repeat this process many times (i.e., multiple iterations of computation).
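The three steps of one training iteration can be sketched as a minimal Python loop. The `Linear` toy layer (a single scalar weight), the squared-error loss, and all names are hypothetical simplifications, not the patent's hardware flow:

```python
class Linear:
    """Toy one-weight 'layer' (hypothetical, for illustration only)."""
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return self.weight * x

    def backward(self, inp, out_grad):
        # returns (weight gradient, input-data gradient)
        return out_grad * inp, out_grad * self.weight

def train_step(layers, x, target, lr):
    """One iteration: forward layer by layer, backward in reverse
    layer order, then update each layer's weight with its gradient."""
    activations = [x]
    for layer in layers:                       # step 1: forward operation
        activations.append(layer.forward(activations[-1]))
    grad = activations[-1] - target            # gradient of 0.5*(y - t)^2
    for layer, inp in zip(reversed(layers), reversed(activations[:-1])):
        w_grad, grad = layer.backward(inp, grad)  # step 2: backward operation
        layer.weight -= lr * w_grad               # step 3: weight update
    return activations[-1]
```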
Referring to Fig. 3, Fig. 3 shows an integrated circuit chip device. The device is used for performing the training of a neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and multiple basic processing circuits. The main processing circuit includes a data type operation circuit for performing conversion between floating-point data and fixed-point data.
The multiple basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column.
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operation.
The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain the nth backward operation of the backward operation of the nth layer according to the training instruction, obtain an nth backward computational complexity according to the nth output result gradient, the nth-layer input data, the nth-layer weight group data, and the nth backward operation, and determine, according to the nth backward computational complexity, the nth backward data type corresponding to the nth output result gradient, the nth-layer input data, and the nth-layer weight group data.
The main processing circuit is configured to divide the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into a broadcast data block and a distribution data block according to the type of the nth backward operation, split the distribution data block of the nth backward data type to obtain multiple basic data blocks, distribute the multiple basic data blocks to at least one of the basic processing circuits connected to the main processing circuit, and broadcast the broadcast data block of the nth backward data type to the basic processing circuits connected to the main processing circuit.
The basic processing circuits are configured to perform, in parallel, the operations of the neural network on the broadcast data block of the nth backward data type and the basic data blocks of the nth backward data type to obtain operation results, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit.
The main processing circuit is configured to process the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient; the nth backward data type includes a fixed-point type or a floating-point type.
The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer and perform the backward operation of the (n-1)th layer to obtain an (n-1)th-layer weight group gradient, and to update the weight group data of the corresponding layer using the (n-1)th-layer weight group gradient, where the weight group data include at least two weights.
As shown in Fig. 1a, in the forward operation of the neural network provided by an embodiment of the present disclosure, each layer uses its own input data and weights to compute the corresponding output data according to the operation rule specified by the type of the layer.
The forward operation process (also called inference) of a neural network processes the input data of each layer in turn and, through certain computations, obtains the output data. It has the following features:
The input of a layer:
the input of a layer can be the input data of the neural network;
the input of a layer can be the output of another layer;
the input of a layer can be the output of this layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer can obtain input from several of the above input sources at the same time.
The output of a layer:
the output of a layer can serve as the output result of the neural network;
the output of a layer can be the input of another layer;
the output of a layer can be the input of this layer at the next time step (the case of a recurrent neural network);
the output of a layer can be output to several of the above output destinations.
Specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:
a convolutional layer (i.e., performing the convolution operation);
a fully connected layer (performing the fully connected operation);
a normalization (regularization) layer, including types such as an LRN (Local Response Normalization) layer and a BN (Batch Normalization) layer;
a pooling layer;
an activation layer, including but not limited to the following types: a Sigmoid layer, a ReLU layer, a PReLU layer, a LeakyReLU layer, and a Tanh layer.
The backward operation of a layer needs to perform two parts of computation: one part uses the output data gradient, which may be sparsely represented, and the input data, which may be sparsely represented, to compute the gradient of the weights (used in the "weight update" step to update the weights of this layer); the other part uses the output data gradient, which may be sparsely represented, and the weights, which may be sparsely represented, to compute the input data gradient (used as the output data gradient of the next layer in the backward operation so that it can perform its own backward operation).
The backward operation propagates the gradients back starting from the last layer, in the order opposite to the forward operation.
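The two parts of a layer's backward operation can be sketched for a dense (non-sparse) fully connected layer y = W·x; the function name and the list-based matrices are illustrative assumptions:

```python
def fc_backward(weight, inp, out_grad):
    """Backward pass of a fully connected layer y = W x.

    Part 1: weight gradient dW = g * x^T (for the weight-update step).
    Part 2: input-data gradient dx = W^T g (passed back as the output
    data gradient of the previous layer in the backward order)."""
    rows, cols = len(weight), len(weight[0])
    w_grad = [[out_grad[r] * inp[c] for c in range(cols)] for r in range(rows)]
    in_grad = [sum(weight[r][c] * out_grad[r] for r in range(rows))
               for c in range(cols)]
    return w_grad, in_grad
```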
In an optional scheme, the output data gradient obtained by the backward computation of a layer can come from:
the gradient returned by the loss function (or cost function) at the end of the neural network;
the input data gradient of another layer;
the input data gradient of this layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer can obtain the output data gradient from several of the above sources at the same time.
After the backward operation of the neural network has been performed, the gradient of the weights of each layer is computed. In this step, the first input cache and the second input cache of the device are used to store the weights of this layer and the gradient of the weights respectively, and the weight gradient is then used in the operation unit to update the weights.
The operations mentioned above are all operations of one layer in the neural network. The implementation process for a multilayer neural network is as follows. In the forward operation, after the forward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data computed in the operation unit as the input data of the next layer for computation (or performs certain operations on the output data before using it as the input data of the next layer), and at the same time the weights are replaced with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient computed in the operation unit as the output data gradient of the next layer for computation (or performs certain operations on the input data gradient before using it as the output data gradient of the next layer), and at the same time the weights are replaced with the weights of the next layer. (In the figures below, the backward operation is indicated with dotted arrows and the forward operation with solid arrows; the labels below each figure indicate the meaning of the figure.)
Representation method of fixed-point data
The fixed-point method refers to converting the representation of the data of a data block in the network into a data encoding with a specific, fixed position of the decimal point (the 0/1 bit layout in which the data are mapped onto the circuit device).
In an optional scheme, multiple data are grouped into a data block as a whole, and the data block is represented in fixed point using the same fixed-point representation method.
Fig. 1 b shows the specific table of short digit fixed-point data structure for storing data according to an embodiment of the present invention
Show method.Wherein, 1Bit are used to indicate symbol, and M are used to indicate integer part, and N for indicating fractional part;It compares
In 32 floating data representations, the short position fixed-point data representation that the present invention uses is less in addition to occupying number of bits
Outside, it for same layer, same type of data in neural network, such as all weight datas of first convolutional layer, also in addition sets
The position for having set a flag bit Point location record decimal point, can adjust in this way according to the distribution of real data
The precision and can indicate data area that data indicate.
Expression, that is, 32bit of floating number is indicated, but for this technical solution, uses fixed-point number that can reduce
The digit of the bit of one numerical value, to reduce the data volume of transmission and the data volume of operation.
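As a hedged illustration of the idea (the patent's concrete bit widths and Point location handling may differ), a short-bit fixed-point encode/decode with a shared decimal-point position could look like:

```python
def to_fixed(value, frac_bits, total_bits=16):
    """Encode a float as a signed fixed-point integer with `frac_bits`
    fractional bits (the 'point location'), saturating on overflow.
    Hypothetical helper; the 16-bit width is illustrative."""
    scaled = int(round(value * (1 << frac_bits)))
    lo = -(1 << (total_bits - 1))           # most negative code
    hi = (1 << (total_bits - 1)) - 1        # most positive code
    return max(lo, min(hi, scaled))

def from_fixed(code, frac_bits):
    """Decode a fixed-point code back to a float."""
    return code / (1 << frac_bits)
```

Values that fit within the integer and fractional bit budget round-trip exactly; values outside the representable range saturate, which is the trade-off between precision and range mentioned above.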
The input data are represented in Fig. 2a (N samples, each sample has C channels, and the feature map of each channel has height H and width W), and the weights, namely the convolution kernels, are represented in Fig. 2b (there are M convolution kernels, each with C channels, of height KH and width KW). The rule of the convolution operation is the same for all N samples of the input data; the following explains the process of the convolution operation on one sample. On one sample, each of the M convolution kernels performs the same operation: each kernel operation produces one planar feature map, and the M convolution kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product is computed at each planar position of one sample, and the kernel then slides along the H and W directions. For example, Fig. 2c shows the corresponding position at which a convolution kernel computes the inner product at the lower right corner of one sample of the input data; Fig. 2d shows the convolution position sliding one cell to the left; and Fig. 2e shows the convolution position sliding one cell upward.
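The sliding inner-product described above (Figs. 2c-2e) can be sketched naively for one sample and one kernel; stride 1 and no padding are assumed, and the nested-list layout is illustrative:

```python
def conv2d_single(inp, kernel):
    """Convolve one C x H x W sample with one C x KH x KW kernel:
    an inner product over all channels at each window position,
    sliding along the H and W directions."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    KC, KH, KW = len(kernel), len(kernel[0]), len(kernel[0][0])
    assert C == KC, "kernel channel count must match the input"
    out = [[0.0] * (W - KW + 1) for _ in range(H - KH + 1)]
    for i in range(H - KH + 1):
        for j in range(W - KW + 1):
            s = 0.0
            for c in range(C):            # inner product over channels
                for di in range(KH):      # ... and kernel height
                    for dj in range(KW):  # ... and kernel width
                        s += inp[c][i + di][j + dj] * kernel[c][di][dj]
            out[i][j] = s                 # one point of the feature map
    return out
```

Running all M kernels over the sample would yield the M planar feature maps described above.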
When the first operation is a convolution operation, the input data are the convolution input data and the weight data are the convolution kernel, and:
first complexity = α * C * kH * kW * M * N * W * C * H;
where α is a convolution coefficient with a value range greater than 1; C, kH, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data.
If the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are fixed-point data; if they are not, the convolution input data are converted into fixed-point data and the convolution kernel is converted into fixed-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed-point type.
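A sketch of the first-complexity formula and the threshold test, with the dimension names taken from the text (α, the tuple layouts, and the helper names are assumptions of this illustration):

```python
def conv_first_complexity(alpha, kernel_dims, input_dims):
    """First complexity = alpha * C*kH*kW*M * N*W*C*H, using the
    kernel dimensions (C, kH, kW, M) and the convolution input data
    dimensions (N, W, C, H) named above. alpha > 1 is assumed."""
    C, kH, kW, M = kernel_dims
    N, W, C_in, H = input_dims
    return alpha * C * kH * kW * M * N * W * C_in * H

def needs_fixed_point(complexity, threshold):
    """Above the set threshold, the data are converted to fixed point
    before the convolution is performed."""
    return complexity > threshold
```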
Specifically, the convolution processing can be handled using the chip structure shown in Fig. 3. When the first complexity is greater than the set threshold, the data type conversion circuit of the main processing circuit (which may be called the main unit) converts the data of some or all of the convolution kernels of the weights into fixed-point data, and the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights, through the horizontal data input interface, to the basic processing circuits (which may be called basic units) directly connected to the main processing circuit.
In an optional scheme, the control circuit of the main processing circuit sends the data of one convolution kernel of the weights, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st transmission sends the 1st number of the 3rd row, the 2nd transmission sends the 2nd number of the 3rd row, the 3rd transmission sends the 3rd number of the 3rd row, ...; or the 1st transmission sends the first two numbers of the 3rd row, the 2nd transmission sends the 3rd and 4th numbers of the 3rd row, the 3rd transmission sends the 5th and 6th numbers of the 3rd row, ....)
In another optional scheme, the control circuit of the main processing circuit sends the data of several convolution kernels of the weights, one number or a part of the numbers at a time for each kernel, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4, and 5, the 2nd transmission sends the 2nd number of each of rows 3, 4, and 5, the 3rd transmission sends the 3rd number of each of rows 3, 4, and 5, ...; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5, the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4, and 5, ....)
The control circuit of the main processing circuit partitions the input data according to convolution positions, and sends the data of some or all of the convolution positions in the input data, through the vertical data input interface, to the basic processing circuits that are directly connected to the main processing circuit;
In an optional scheme, the control circuit of the main processing circuit sends the data of a certain convolution position in the input data to a certain basic processing circuit one number, or one part of the numbers, at a time (for example, for a certain basic processing circuit: the 1st transfer sends the 1st number of the 3rd column, the 2nd transfer sends the 2nd number of the 3rd column, the 3rd transfer sends the 3rd number of the 3rd column, and so on; or the 1st transfer sends the first two numbers of the 3rd column, the 2nd transfer sends the 3rd and 4th numbers of the 3rd column, the 3rd transfer sends the 5th and 6th numbers of the 3rd column, and so on);
In another optional scheme, the control circuit of the main processing circuit sends the data of certain convolution positions in the input data to a certain basic processing circuit one number of each position, or one part of the numbers of each position, at a time (for example, for a certain basic processing circuit: the 1st transfer sends the 1st number of each of columns 3, 4 and 5, the 2nd transfer sends the 2nd number of each of columns 3, 4 and 5, the 3rd transfer sends the 3rd number of each of columns 3, 4 and 5, and so on; or the 1st transfer sends the first two numbers of each of columns 3, 4 and 5, the 2nd transfer sends the 3rd and 4th numbers of each of columns 3, 4 and 5, the 3rd transfer sends the 5th and 6th numbers of each of columns 3, 4 and 5, and so on);
After a basic processing circuit receives the weight data, it transmits those data through its lateral data output interface to the next basic processing circuit connected to it; after a basic processing circuit receives the input data, it transmits those data through its vertical data output interface to the next basic processing circuit connected to it;
Each basic processing circuit performs operations on the data it receives;
In an optional scheme, the basic processing circuit computes the multiplication of one or more groups of two numbers at a time, and then accumulates the results into its register and/or on-chip cache;
In an optional scheme, the basic processing circuit computes the inner product of one or more groups of two vectors at a time, and then accumulates the results into its register and/or on-chip cache;
After a basic processing circuit has computed a result, it can transmit the result out through its data output interface;
In an optional scheme, the computed result may be a final result or an intermediate result of the inner product operation;
Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result from that interface; if it does not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit.
After a basic processing circuit receives a computed result from another basic processing circuit, it transmits those data to another basic processing circuit or to the main processing circuit connected to it;
the result is output in the direction that allows direct output to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downward through their vertical output interfaces);
the main processing circuit receives the inner product operation results of all the basic processing circuits to obtain the output result.
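To make the grid data flow above concrete, here is a minimal software sketch, an illustration only and not the patent's circuit: weight data flow in row-wise, input data flow in column-wise, each basic processing circuit multiply-accumulates into its local register, and the finished sums correspond to what drains toward the main processing circuit. The function name and data layout are assumptions.

```python
def systolic_matmul(A, B):
    """Multiply A (m x k) by B (k x n) the way the grid would:
    the (i, j) circuit accumulates A[i][t] * B[t][j] over transfer steps t."""
    m, k, n = len(A), len(A[0]), len(B[0])
    acc = [[0] * n for _ in range(m)]           # per-circuit register / on-chip cache
    for t in range(k):                          # one number (or one part) per transfer
        for i in range(m):                      # lateral flow: row i of the weights
            for j in range(n):                  # vertical flow: column j of the inputs
                acc[i][j] += A[i][t] * B[t][j]  # multiply-accumulate in each circuit
    return acc                                  # results drain to the main circuit

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

The triple loop stands in for time steps of the grid; in hardware the m*n multiply-accumulates of one step t happen in parallel.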
Referring to Fig. 4a, Fig. 4a shows a matrix-multiply-matrix operation. For example, the first operation is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation, and the weight is the second matrix of the matrix-multiply-matrix operation;
first complexity = β*F*G*E*F, where β is a matrix coefficient whose value range is greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix;
if the first complexity is greater than the set threshold, determine whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are floating-point data, convert the first matrix into fixed-point data, convert the second matrix into fixed-point data, and then perform the matrix-multiply-matrix operation on the first matrix and the second matrix in fixed-point form.
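The threshold test above can be sketched as follows, assuming a simple string tag for the operands' current type; the function names and the example threshold value are illustrative, not from the patent.

```python
def first_complexity(beta, f1, g1, e2, f2):
    """First complexity = beta * F * G * E * F, with F, G the row and column
    values of the first matrix and E, F those of the second matrix."""
    return beta * f1 * g1 * e2 * f2

def operation_type(complexity, threshold, data_type):
    """If the complexity exceeds the threshold and the operands are
    floating point, the operation should run in fixed point."""
    if complexity > threshold and data_type == "float":
        return "fixed"
    return data_type

c = first_complexity(1, 1000, 1000, 1000, 1000)
print(operation_type(c, 10**9, "float"))  # fixed
```

Below the threshold (or for operands already in fixed point) no conversion is triggered by this check.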
Referring to Fig. 4b, the matrix-multiply-matrix operation is completed using the device shown in Fig. 3;
the following describes computing the multiplication of a matrix S of size M rows by L columns and a matrix P of size L rows by N columns (each row of matrix S is as long as each column of matrix P, as shown in Fig. 2d), where the neural network computing device possesses K basic processing circuits:
In step S401b, when the first complexity is greater than the set threshold, the main processing circuit converts matrix S and matrix P into fixed-point data, and the control circuit of the main processing circuit distributes each row of data in matrix S to one of the K basic processing circuits; the basic processing circuit stores the received data in its on-chip cache and/or register. Specifically, the data may be sent to those of the K basic processing circuits that are connected to the main processing circuit.
In an optional scheme, if the row number M of S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of matrix S to each of M basic processing circuits;
in an optional scheme, if the row number M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of matrix S to each basic processing circuit.
Let Mi rows of S be distributed to the i-th basic processing circuit, and let the set of these Mi rows be denoted Ai; Fig. 2e shows the calculation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit, the received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the subsequent data transfer volume is reduced, computational efficiency is improved, and power consumption is reduced.
In step S402b, the control circuit of the main processing circuit transmits the parts of matrix P to the basic processing circuits in a broadcast manner;
In an optional scheme, the parts of matrix P may each be broadcast only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of matrix P obtained this time, completing the inner product operations corresponding to every row of matrix Ai. Reuse in this embodiment specifically means repeated use of data by a basic processing circuit in its calculations; for example, reuse of the data of matrix P may mean that the data of matrix P are used multiple times.
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of matrix P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of matrix P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai in several passes;
in an optional scheme, the control circuit of the main processing circuit may broadcast the parts of matrix P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of matrix P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai;
In an optional scheme, each basic processing circuit, for example the i-th basic processing circuit, calculates the inner products of the data of matrix Ai and the data of matrix P;
In step S403b, the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
In an optional scheme, each basic processing circuit may transmit the partial sums obtained from each of its inner product operations back to the main processing circuit for accumulation;
in an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may also be kept in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes;
In an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may, in some cases, be kept in the register and/or on-chip cache of the basic processing circuit for accumulation, in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation finishes.
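Steps S401b to S403b can be sketched as a small simulation, under the assumptions of round-robin row distribution and a single full broadcast of matrix P that each circuit fully reuses; all names are illustrative, not the patent's.

```python
def distribute_rows(S, K):
    """Step S401b: assign each row of S to one of K basic processing circuits."""
    parts = [[] for _ in range(K)]
    for r, row in enumerate(S):
        parts[r % K].append((r, row))   # circuit i holds its row set Ai
    return parts

def multiply_distributed(S, P, K):
    """Steps S402b-S403b: broadcast P once, compute inner products per circuit,
    and gather the accumulated results at the main processing circuit."""
    n = len(P[0])
    out = [[0] * n for _ in range(len(S))]
    for part in distribute_rows(S, K):          # each basic processing circuit
        for r, row in part:                     # each row of its Ai
            for j in range(n):                  # inner product with column j of P
                out[r][j] = sum(row[t] * P[t][j] for t in range(len(P)))
    return out

S = [[1, 0], [0, 2], [3, 1]]
P = [[4, 5], [6, 7]]
print(multiply_distributed(S, P, 2))  # [[4, 5], [12, 14], [18, 22]]
```

With M = 3 rows and K = 2 circuits this exercises the M > K case: one circuit holds two rows of S, the other holds one.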
Referring to Fig. 4c, Fig. 4c is a schematic diagram of a matrix-multiply-vector operation. For example, the first operation is a matrix-multiply-vector operation, the input data is the first matrix of the matrix-multiply-vector operation, and the weight is the vector of the matrix-multiply-vector operation;
first complexity = β*F*G*F, where β is a matrix coefficient whose value range is greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector;
if the first complexity is greater than the set threshold, determine whether the first matrix and the vector are floating-point data; if the first matrix and the vector are floating-point data, convert the first matrix into fixed-point data, convert the vector into fixed-point data, and then perform the matrix-multiply-vector operation on the first matrix and the vector in fixed-point form.
Referring to Fig. 4d, Fig. 4d provides an implementation method of matrix-multiply-vector, which may specifically include:
step S401, in which the data type conversion circuit of the main processing circuit converts each row of data in matrix S into fixed-point data, and the control circuit of the main processing circuit distributes the data to one of the K basic processing circuits, which stores the received distributed data in its on-chip cache and/or register;
In an optional scheme, if the row number M of matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of matrix S to each of the K basic processing circuits;
in an optional scheme, if the row number M of matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of matrix S to each basic processing circuit.
The set of rows of S distributed to the i-th basic processing circuit is denoted Ai and contains Mi rows in total; Fig. 2c shows the calculation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit, the received distributed data, such as matrix Ai, may be stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the subsequent transfer volume of the distributed data is reduced, computational efficiency is improved, and power consumption is reduced.
In step S402, the data type conversion circuit of the main processing circuit converts vector P into fixed-point data, and the control circuit of the main processing circuit transmits the parts of the fixed-point vector P to the K basic processing circuits in a broadcast manner;
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of vector P obtained this time, completing the inner product operations corresponding to every row of matrix Ai. The advantage is that the data transfer volume of repeated transmissions of vector P from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is reduced.
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of vector P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai in several passes. The advantage is that the transfer volume of a single transmission of vector P within a basic processing circuit is reduced, the capacity of the cache and/or register of the basic processing circuit can be reduced, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
In an optional scheme, the control circuit of the main processing circuit may broadcast the parts of vector P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of vector P obtained in each broadcast, completing the inner product operations corresponding to every row of matrix Ai. The advantage is that the data transfer volume from the main processing circuit to the basic processing circuits is reduced, the data transfer volume within the basic processing circuits is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
In step S403, the inner product operator circuits of the K basic processing circuits calculate the inner products of the data of matrix S and vector P; for example, the i-th basic processing circuit calculates the inner products of the data of matrix Ai and the data of vector P;
in step S404, the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulation results, and transmit the accumulation results back to the main processing circuit in fixed-point form.
In an optional scheme, each basic processing circuit may transmit the partial sums obtained from its inner product operations (a partial sum is a part of the accumulation result; for example, if the accumulation result is F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5, a partial sum may be the value of F1*G1 + F2*G2 + F3*G3) back to the main processing circuit for accumulation. The advantage is that the amount of computation within the basic processing circuit is reduced, improving the operational efficiency of the basic processing circuit.
In an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may also be kept in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes. The advantage is that the data transfer volume between the basic processing circuits and the main processing circuit is reduced, operational efficiency is improved, and data transmission power consumption is reduced.
In an optional scheme, the partial sums obtained from the inner product operations performed by each basic processing circuit may, in some cases, be kept in the register and/or on-chip cache of the basic processing circuit for accumulation, in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation finishes. The advantages are that the data transfer volume between the basic processing circuits and the main processing circuit is reduced, operational efficiency is improved, data transmission power consumption is reduced, the amount of computation within the basic processing circuit is reduced, and the operational efficiency of the basic processing circuit is improved.
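The partial-sum option above, where an accumulation result such as F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5 is split between the basic processing circuit and the main processing circuit, can be illustrated numerically:

```python
F = [1, 2, 3, 4, 5]
G = [10, 20, 30, 40, 50]

# F1*G1 + F2*G2 + F3*G3 computed on the basic processing circuit
partial = sum(f * g for f, g in zip(F[:3], G[:3]))
# the remaining terms accumulated afterwards on the main processing circuit
remainder = sum(f * g for f, g in zip(F[3:], G[3:]))
total = partial + remainder

print(partial, total)  # 140 550
```

However the five terms are split, the final accumulation result is the same; only where the addition happens (and hence the transfer volume and per-circuit workload) changes.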
Neural network training method
All of the data involved in the neural network training process may use different data representation methods;
specifically, the data representation methods include but are not limited to the following cases:
floating-point numbers of different bit widths;
fixed-point numbers of different bit widths, and fixed-point numbers with different fixed-point positions;
at different moments of the training process (that is, at different iteration counts or at initialization), in different phases (that is, forward or reverse operation), in different layers, for different data blocks of the same layer (that is, the multiple input data blocks and output data blocks), or for the sub-blocks into which the same data block is divided, it is possible to:
use fixed point or floating point respectively;
for fixed point:
use different fixed-point bit widths;
use different fixed-point offset values (that is, fixed-point positions);
The concrete implementation of neural network training is illustrated below with a practical example. Fig. 1a is a schematic diagram of the calculation of single-layer neural network training; as shown in Fig. 1a, the input data and the weights or parameters execute the operation of this layer. The technical solution provided by the embodiments of the present application determines, according to the input data, the weights and the forward operation amount of this layer, whether to convert the types of the input data and the weights. A specific way may be as follows: if the register or storage space occupied by storing the input data and the weights is greater than a set threshold, the forward operation amount of this layer is greater than a set operation amount, and the input data and the weight data are determined to be floating-point data, the input data and the weight data are converted into fixed-point data. If the register or storage space occupied by storing the input data and the weights is less than the set threshold and the input data and the weight data are fixed-point data, the input data and the weight data are converted into floating-point data before the operation of this layer is executed.
The principle of the above data type conversion in the present application is elaborated below. Fig. 1b shows a representation of fixed-point data. For a computing system, the storage bit number of one floating-point datum is 32 bits, whereas for fixed-point data, in particular data represented in the form shown in Fig. 1b, the storage bit number of one fixed-point datum can be 16 bits or fewer. For such a conversion, therefore, the transfer overhead between calculators can be significantly reduced; in addition, for the calculators, the storage space for data of fewer bits is smaller, i.e. the storage overhead is smaller, and the amount of computation is also reduced, i.e. the computational overhead is reduced; so both the computational overhead and the storage overhead can be reduced. However, the data type conversion itself also requires some overhead, hereinafter referred to as conversion overhead. For data with a large amount of computation and a large amount of data storage, the conversion overhead is almost negligible relative to the subsequent computational overhead, storage overhead and transfer overhead. The present application therefore adopts, for data with a large amount of computation and a large amount of data storage, the technical solution of converting the data type into fixed-point data. Conversely, for data with a small amount of computation and a small amount of data storage, the computational overhead, storage overhead and transfer overhead are themselves small; in this case, since the precision of fixed-point data is slightly lower than that of floating-point data, and the precision of the calculation must be guaranteed given that the amount of computation is small, the fixed-point data are here converted into floating-point data, i.e. the precision of the calculation is improved at the cost of a small increase in overhead.
An actual example is illustrated below. As shown in Fig. 4e, the operation of this layer is a matrix multiplication, and both the input data and the weight are matrices. For convenience of explanation, take matrix I as the input data and matrix W as the weight; as shown in Fig. 4e, output data = matrix I * matrix W. Here, if the sum of the column count and the row count of matrix I and matrix W is large, it can be considered that matrix I and matrix W occupy too much memory and/or register space and that the amount of computation is also large; in that case, if matrix I and matrix W are floating-point data, matrix I and matrix W are converted into fixed-point data before the matrix multiplication operation is executed.
For example, if matrix I is a 1000*1000 matrix and matrix W is also a 1000*1000 matrix, then the sum of the column count and the row count is 2000, which is very large, and the corresponding amount of computation is even larger: the multiplications of the inner products of the matrix-multiply-matrix operation number 10^9. For this technical solution, since matrix I and matrix W are very large, it is impossible to transmit all the data at once, so the same data may be transmitted several times; if they are transmitted as fixed-point data, the amount of data transmitted can be significantly reduced, thereby reducing the transfer overhead, and the calculation and storage with fewer bits can likewise reduce the computational overhead and the storage overhead.
The technical solution of converting fixed-point data into floating-point data is described taking the reverse operation as an example; as shown in Fig. 4g, the upward arrow direction in the computing structure indicates a reverse operation. For the reverse operation, the input is an output data gradient, which may specifically be as follows: if the output data gradient belongs to the last layer of the current iteration of calculation, the output data gradient is obtained by applying a preset operation to the output data of the last layer of the current iteration (the preset operation is not limited here; its concrete operation steps can be set by the manufacturer according to need); if the output data gradient belongs to a layer other than the last layer of the current iteration, for example the n-th layer of the current iteration, then the output data gradient is the input data gradient calculated by the reverse operation of the (n+1)-th layer.
An actual example is illustrated below. As shown in Fig. 4g, the operation of this layer is a multiplication in which the input data is a matrix and the weight is a scalar. For convenience of explanation, take matrix I as the input data and scalar C as the weight; as shown in Fig. 4g, output data = matrix I * C. Here, since the weight is a scalar, the amount of data computation is small; so if matrix I is fixed-point data, matrix I is converted into floating-point data before the matrix-multiply-scalar operation is executed.
For example, if matrix I is a 10*10 matrix and scalar C is a single number, then the sum of the column count and the row count is 20, which is small (it is assumed here that a value greater than 100 is considered large and a value less than 100 is considered small; the value 100 can be set arbitrarily by those skilled in the art), and the corresponding amount of computation is small: the multiplications of the inner products of the matrix multiplication number 10^2. Since the amount of computation is small, still calculating with fixed-point data would affect the precision; in order to make the calculation more precise, under the premise of a small amount of computation, the computational precision can be improved by calculating with floating-point data.
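The size heuristic shared by the two examples above, comparing the sum of row and column counts against the illustrative value 100, can be written down directly; the function name is an assumption.

```python
def pick_type(rows, cols, threshold=100):
    """Large operands run in fixed point (cheap transfer/storage);
    small operands stay in floating point (precision matters more)."""
    return "fixed" if rows + cols > threshold else "float"

print(pick_type(1000, 1000))  # 'fixed'  -> the 1000*1000 matrix-multiply case
print(pick_type(10, 10))      # 'float'  -> the 10*10 matrix-times-scalar case
```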
In an optional scheme, each data block of each layer in the network may each adopt a fixed fixed-point bit width, while its fixed-point position varies with the training iteration cycle;
Specifically, during the training process, the data representation method of a certain data block may be set as follows:
specifically, when training starts, an arbitrary data representation method may be selected for a certain data block;
in an optional scheme, a floating-point representation method of a specific bit width may be chosen;
in an optional scheme, a fixed-point representation method of a specific form may be chosen:
a specific fixed-point bit width may be chosen;
a specific fixed-point position may be chosen;
In an optional scheme, the fixed-point position may be set according to the maximum absolute value of all the data in the data block;
in an optional scheme, the fixed-point position may be set according to the minimum absolute value of all the data in the data block;
in an optional scheme, at initialization, the fixed-point position of this data block may be determined according to the fixed-point positions of other data blocks;
In an optional scheme, the fixed-point position of this data block may be set according to empirical values;
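The max-absolute-value initialization option above can be sketched as follows; the exact formula is an assumed common choice (enough integer bits for the largest magnitude, remaining bits fractional), not one given in the text.

```python
import math

def point_position_from_max(data, bits=16):
    """Choose a fixed-point position so that the largest absolute value in
    the data block still fits in the signed integer range of `bits` bits."""
    max_abs = max(abs(v) for v in data)
    int_bits = max(1, math.ceil(math.log2(max_abs + 1e-12)))  # integer-part bits
    return bits - 1 - int_bits      # bits left (minus sign bit) are fractional

block = [0.5, -3.2, 7.9, 1.0]
print(point_position_from_max(block))  # 12: 7.9 needs 3 integer bits of 16
```

With position 12, the representable range is about ±8, so 7.9 fits while the fractional resolution stays at 2^-12.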
Specifically, during the training process, the data representation method of a certain data block may be changed at any iteration cycle:
in an optional scheme, no adjustment may be made for a certain data block;
in an optional scheme, the adjustment may be made every certain number of iterations;
in an optional scheme, the adjustment may be made every certain number of training epochs;
in an optional scheme, the adjustment may be made at non-fixed iteration intervals;
in an optional scheme, the adjustment may be made at non-fixed training epoch intervals;
Specifically, during the training process, when the representation method of a certain data block is adjusted, it may be adjusted to an arbitrary data representation method;
in an optional scheme, if a data block is represented by fixed-point numbers of fixed fixed-point bit width, the fixed-point position of the data representation may be adjusted as follows:
in an optional scheme, the fixed-point position is set each time according to the setting method used to initialize the fixed-point position;
In an optional scheme, if the fixed-point position of a certain data block, calculated according to the initial setting method of the fixed-point position, has increased in some iteration cycle compared with the previous iteration cycle, the fixed-point position in this cycle is changed toward an increase; conversely, it is changed toward a decrease.
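The adjustment rule above, moving the fixed-point position toward an increase when the newly computed position grew relative to the previous iteration cycle and toward a decrease otherwise, can be sketched minimally; the step size of one per cycle is an assumption.

```python
def adjust_position(prev_pos, computed_pos):
    """Move the stored fixed-point position one step toward the position
    computed for the current iteration cycle."""
    if computed_pos > prev_pos:
        return prev_pos + 1     # change toward an increase
    if computed_pos < prev_pos:
        return prev_pos - 1     # change toward a decrease
    return prev_pos             # unchanged when the computed position is equal

pos = 10
for computed in [11, 12, 12, 9]:
    pos = adjust_position(pos, computed)
print(pos)  # 10 -> 11 -> 12 -> 12 -> 11
```

Stepping rather than jumping keeps the representation stable across noisy per-cycle statistics.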
The present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network including multiple layers; the integrated circuit chip device includes a processing circuit and an external interface;
the external interface is configured to receive a training instruction;
the processing circuit is configured to determine first-layer input data and first-layer weight data according to the training instruction, and to perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight data to obtain the n-th output result;
the processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, the n-th reverse operation of the reverse operation of the n-th layer; to obtain an n-th reverse computational complexity according to the n-th output result gradient, the n-th layer input data, the n-th layer weight group data and the n-th reverse operation; to determine, according to the n-th reverse computational complexity, the n-th reverse data type corresponding to the n-th output result gradient, the n-th layer input data and the n-th layer weight group data; and to perform the n-layer reverse operation of the neural network on the n-th output result gradient, the n-th layer input data and the n-th layer weight group data in the n-th reverse data type, to obtain the n weight gradients of the n-layer operation, wherein the n-th reverse data type includes a fixed-point type or a floating-point type;
the processing circuit is further configured to update the n weights of the n-layer operation using the n weight gradients.
The present disclosure also discloses a neural network computing device, which includes one or more of the chips shown in Fig. 3, and which is configured to acquire data to be operated on and control information from other processing devices, perform the specified neural network operations, and pass the execution results to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces and servers. When more than one of the chips shown in Fig. 3 is included, the chips can be linked through a specific structure and transmit data, for example interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the chips may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, their interconnection mode may be any interconnection topology.
The neural network computing device has high compatibility and can be connected to various types of servers through the PCIE interface.
The present disclosure also discloses a combined processing device, which includes the above neural network computing device, a general interconnection interface, and other processing devices (i.e. general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete the operations specified by the user. Fig. 5a is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as central processing units (CPU), graphics processing units (GPU) and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, carrying data and performing basic controls of the neural network computing device such as starting and stopping; the other processing devices may also cooperate with the neural network computing device to jointly complete processing tasks.
The general interconnection interface is used to transmit data and control instructions between the neural network computing device and the other processing devices. The neural network computing device acquires the required input data from the other processing devices and writes them into the on-chip storage device of the neural network computing device; it may acquire control instructions from the other processing devices and write them into the on-chip control cache of the neural network computing device; it may also read the data in the memory module of the neural network computing device and transmit them to the other processing devices.
As shown in Fig. 5b, optionally, the structure further includes a storage device for storing the data required by this arithmetic unit/arithmetic device or by other arithmetic units; it is particularly suitable for data that are required for the operations of this neural network computing device but cannot all be saved in the internal storage of this neural network computing device or the other processing devices.
The combined processing device can serve as the system-on-chip (SoC) of devices such as a mobile phone, a robot, a drone, or video monitoring equipment, effectively reducing the die area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the general interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network interface card, or a WiFi interface.
Referring to Fig. 5c, Fig. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of the present disclosure. As shown in Fig. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in Fig. 5d, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The present disclosure does not limit the specific form of the neural network chip 111. The neural network chip 111 includes, but is not limited to, a chip integrating a neural network processor, and the chip may be made of silicon, germanium, a quantum material, a molecular material, or the like. The neural network chip may be packaged according to the actual situation (for example, a harsher environment) and different application requirements, so that most of the neural network chip is enclosed while the pins on the neural network chip are connected to the outside of the package structure through conductors such as gold wires, for circuit connection with outer layers.
The present disclosure does not limit the specific structure of the neural network chip 111; optionally, refer to the device shown in Fig. 1a or Fig. 1b.
The present disclosure does not limit the types of the first substrate 13 and the second substrate 113; each may be a printed circuit board (PCB), a printed wiring board (PWB), or another circuit board. The material of the PCB is also not limited.
The second substrate 113 of the present disclosure is configured to carry the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, protects the neural network chip 111 and facilitates further packaging of the neural network chip package structure 11 with the first substrate 13.
The specific packaging method of the second electrical and non-electrical connection device 112, and the structure corresponding to that packaging method, are not limited; a suitable packaging method may be selected and simply improved according to the actual situation and different application requirements, for example: flip chip ball grid array package (Flip Chip Ball Grid Array Package, FCBGAP), low-profile quad flat package (Low-profile Quad Flat Package, LQFP), quad flat package with heat sink (Quad Flat Package with Heat sink, HQFP), quad flat no-lead package (Quad Flat Non-lead Package, QFN), or fine-pitch ball grid array package (Fine-pitch Ball Grid Array, FBGA).
Flip chip (Flip Chip) packaging is suitable for cases where the requirements on the area after packaging are high, or where there is sensitivity to conductor inductance and signal transmission time. In addition, wire bonding (Wire Bonding) may be used as the packaging method, which reduces cost and improves the flexibility of the package structure.
Ball grid array (Ball Grid Array) packaging can provide more pins, and the average conductor length of the pins is short, which supports high-speed signal transmission. The package may be replaced by a pin grid array (Pin Grid Array, PGA), zero insertion force (Zero Insertion Force, ZIF), single edge contact connection (Single Edge Contact Connection, SECC), land grid array (Land Grid Array, LGA), or the like.
Optionally, the neural network chip 111 and the second substrate 113 are packaged by flip chip ball grid array (Flip Chip Ball Grid Array) packaging; for a schematic diagram of the specific neural network chip package structure, refer to Fig. 6. As shown in Fig. 6, the neural network chip package structure includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected with the neural network chip 21, and the solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 with the second substrate 24, that is, realizing the packaging of the neural network chip 21.
The pins 26 are configured to connect with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling the transmission of external data and internal data and facilitating the processing of data by the neural network chip 21 or the neural network processor corresponding to the neural network chip 21. The type and number of pins are also not limited in the present disclosure; different pin forms may be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between solder balls. The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a radiator, for example, a fan.
For example, as shown in Fig. 6a, the neural network chip package structure 11 includes: a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used to dissipate heat generated when the neural network chip 21 operates.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected with the pads 22 and embedded in the solder balls 23, to enhance the bonding strength between the solder balls 23 and the pads 22. The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited here.
The present disclosure also does not limit the specific form of the first electrical and non-electrical connection device 12; refer to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or in a pluggable manner, which facilitates subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit for expanding storage capacity, such as synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM) or double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR), which improves the processing capacity of the neural network processor by expanding the memory.
The first substrate 13 may also include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a small form-factor pluggable (Small Form-factor Pluggable, SFP) interface, an Ethernet interface, a controller area network (Controller Area Network, CAN) bus interface, and the like, for data transmission between the package structure and an external circuit, which can improve computation speed and ease of operation.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board 10. The board exchanges data with an external circuit (for example, a computer motherboard) through a board interface (a slot or a ferrule); that is, the function of the neural network processor is realized directly by using the neural network processor board 10, while the neural network chip 111 is protected. Other modules may also be added to the neural network processor board 10, which expands the application range and improves the computational efficiency of the neural network processor.
In one embodiment, the present disclosure provides an electronic device, which includes the above neural network processor board 10 or the above neural network chip package structure 11.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument, and/or an electrocardiograph.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.
Claims (16)
1. An integrated circuit chip device, wherein the device is configured to perform training of a neural network, the neural network includes n layers, and n is an integer greater than or equal to 2, characterized in that the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a data type conversion circuit configured to perform conversion between floating-point data and fixed-point data;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column;
the integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an n-th output result of the forward operation;
the main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result, obtain an n-th backward operation of the backward operation of the n-th layer according to the training instruction, obtain an n-th backward computational complexity according to the n-th output result gradient, the n-th-layer input data, the n-th-layer weight group data, and the n-th backward operation, and determine, according to the n-th backward computational complexity, an n-th backward data type corresponding to the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data;
the main processing circuit is configured to divide the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data into a broadcast data block and a distribution data block according to the type of the n-th backward operation, split the distribution data block of the n-th backward data type to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to at least one of the basic processing circuits connected with the main processing circuit, and broadcast the broadcast data block of the n-th backward data type to the basic processing circuits connected with the main processing circuit;
the basic processing circuits are configured to perform the operations in the neural network in parallel according to the broadcast data block of the n-th backward data type and the basic data blocks of the n-th backward data type to obtain operation results, and transmit the operation results to the main processing circuit through the basic processing circuits connected with the main processing circuit;
the main processing circuit is configured to process the operation results to obtain an n-th-layer weight group gradient and an n-th-layer input data gradient, and update the n-th-layer weight group data using the n-th-layer weight group gradient; the n-th backward data type includes: a fixed-point type or a floating-point type;
the integrated circuit chip device is further configured to use the n-th-layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, perform the backward operations of the remaining n-1 layers to obtain the weight group gradients of the n-1 layers, and update the weight group data of the respective layers using the weight group gradients of the n-1 layers, wherein the weight group data includes at least two weights.
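The data type conversion described in claim 1 (floating point to fixed point and back) can be sketched in software. The 16-bit word width, 8 fractional bits, and saturating rounding below are illustrative assumptions, not parameters taken from the claim:

```python
def float_to_fixed(x, frac_bits=8, word_bits=16):
    """Quantize a float to a signed fixed-point integer with `frac_bits`
    fractional bits, saturating to the representable word range."""
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = int(round(x * (1 << frac_bits)))
    return max(lo, min(hi, q))

def fixed_to_float(q, frac_bits=8):
    """Inverse conversion: rescale the fixed-point integer back to a float."""
    return q / (1 << frac_bits)

# 1.5 is exactly representable with 8 fractional bits: 1.5 * 256 = 384
q = float_to_fixed(1.5)
assert fixed_to_float(q) == 1.5
```

Values outside the representable range saturate rather than wrap, which is the usual choice for fixed-point neural network arithmetic.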
2. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is specifically configured to compare the n-th backward computational complexity with a preset threshold; if the n-th backward computational complexity is higher than the preset threshold, the n-th backward data type is determined to be the fixed-point type; if the n-th backward computational complexity is lower than or equal to the preset threshold, the n-th backward data type is determined to be the floating-point type.
3. The integrated circuit chip device according to claim 2, characterized in that
the main processing circuit is specifically configured to determine an (n+1)-th backward data type to which the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belong; if the (n+1)-th backward data type differs from the n-th backward data type, the data type conversion circuit converts the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the (n+1)-th backward data type into the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data belonging to the n-th backward data type.
4. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is configured to, when the n-th backward operation is a convolution operation, the convolution input data being the n-th-layer input data and the convolution kernel being the n-th output result gradient, compute:
the n-th backward computational complexity = α*C*kW*kH*M*N*W*C*H;
where α is a convolution coefficient with a value range greater than 1; C, kW, kH, M are the values of the four dimensions of the convolution kernel, and N, W, C, H are the values of the four dimensions of the convolution input data;
if the complexity is greater than a set threshold, the n-th backward data type is determined to be the fixed-point type; it is determined whether the convolution input data and the convolution kernel are fixed-point data, and if the convolution input data and the convolution kernel are not fixed-point data, the convolution input data is converted into fixed-point data and the convolution kernel is converted into fixed-point data, and then the convolution operation is performed on the convolution input data and the convolution kernel in the fixed-point data type.
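The complexity measure of claim 4 is the product of the kernel and input dimension sizes scaled by the convolution coefficient α; a minimal sketch, under the assumption that the kernel dimensions are ordered (C, kW, kH, M) and the input dimensions (N, W, C, H):

```python
def conv_backward_complexity(alpha, kernel_dims, input_dims):
    """Claim 4's measure: complexity = alpha * C*kW*kH*M * N*W*C*H."""
    c, kw, kh, m = kernel_dims       # kernel: channels, width, height, count
    n, w, c_in, h = input_dims       # input: batch, width, channels, height
    return alpha * c * kw * kh * m * n * w * c_in * h

# e.g. a 3x3 kernel with 16 input/output channels on a 1x32x16x32 input
assert conv_backward_complexity(2, (16, 3, 3, 16), (1, 32, 16, 32)) == 75_497_472
```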
5. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is further configured to, when the n-th backward operation is a matrix-multiply-matrix operation, the input data being the n-th-layer input data and the weight being the n-th output result gradient, compute:
complexity = β*F*G*E*F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and E and F are the row and column values of the weight;
if the complexity is greater than a set threshold, the n-th backward data type is determined to be the fixed-point type; it is determined whether the n-th-layer input data and the weight are fixed-point data, and if the n-th-layer input data and the weight are not fixed-point data, the n-th-layer input data is converted into fixed-point data and the weight is converted into fixed-point data, and then the matrix-multiply-matrix operation is performed on the n-th-layer input data and the weight in the fixed-point data type.
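Claim 5's measure multiplies the input's row and column counts by the weight's row and column counts (the claim reuses the symbol F for both the input row count and the weight column count); a minimal sketch taking the four values as independent parameters:

```python
def matmul_backward_complexity(beta, input_rows, input_cols,
                               weight_rows, weight_cols):
    """Claim 5's measure: complexity = beta * F*G*E*F, with (F, G) the
    input's row/column counts and (E, F) the weight's row/column counts."""
    return beta * input_rows * input_cols * weight_rows * weight_cols

# a 4x5 input against a 6x4 weight, with beta = 1
assert matmul_backward_complexity(1, 4, 5, 6, 4) == 480
```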
6. The integrated circuit chip device according to claim 1, characterized in that
the integrated circuit chip device is further configured to, when the n-th backward operation is a matrix-multiply-vector operation, the input data being the n-th-layer input data and the weight being the n-th output result gradient, compute:
complexity = β*F*G*F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th-layer input data, and F is the column value of the n-th output result gradient;
if the complexity is greater than a set threshold, the n-th backward data type is determined to be the fixed-point type; it is determined whether the n-th-layer input data and the weight are fixed-point data, and if the n-th-layer input data and the weight are not fixed-point data, the k branch processing circuits are notified to convert the n-th-layer input data into fixed-point data and the weight into fixed-point data, and then the matrix-multiply-vector operation is performed on the n-th-layer input data and the weight in the fixed-point type.
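The full flow of claim 6 — compute the complexity, compare it with the threshold, and convert to fixed point only when the threshold is exceeded — can be illustrated end to end; a software sketch in which the threshold, the 8 fractional bits, and the pure-Python fixed-point simulation are illustrative assumptions rather than the claimed circuit:

```python
def matvec_with_type_selection(x, w, beta=1.0, threshold=1e6, frac_bits=8):
    """Sketch of claim 6: complexity = beta*F*G*F; above the threshold the
    matrix-vector product runs in simulated fixed point, otherwise in
    floating point."""
    f, g = len(x), len(x[0])
    complexity = beta * f * g * len(w)
    if complexity > threshold:
        scale = 1 << frac_bits
        xq = [[round(v * scale) for v in row] for row in x]  # quantize input
        wq = [round(v * scale) for v in w]                   # quantize weight
        # integer multiply-accumulate, rescaled back at the end
        return [sum(a * b for a, b in zip(row, wq)) / scale**2 for row in xq]
    return [sum(a * b for a, b in zip(row, w)) for row in x]

x, w = [[1.0, 2.0]], [3.0, 4.0]
assert matvec_with_type_selection(x, w) == [11.0]               # float path
assert matvec_with_type_selection(x, w, threshold=0) == [11.0]  # fixed path
```

With exactly representable values the two paths agree; in general the fixed-point path trades a small quantization error for cheaper arithmetic, which is the stated advantage of the disclosure.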
7. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is specifically configured to: if the type of the n-th backward operation is a multiplication operation, determine that the n-th-layer input data and the n-th-layer weight group data are both distribution data blocks and the n-th output result gradient is a broadcast data block;
if the type of the n-th backward operation is a convolution operation, determine that the n-th-layer input data and the n-th-layer weight group data are both broadcast data blocks and the n-th output result gradient is a distribution data block.
8. The integrated circuit chip device according to any one of claims 1-7, characterized in that
the n-layer backward operation further includes one of, or any combination of, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
9. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit includes: a main register or a main on-chip cache circuit;
the basic processing circuit includes: a basic register or a basic on-chip cache circuit.
10. The integrated circuit chip device according to claim 9, characterized in that
the main processing circuit includes one of, or any combination of, a vector computation unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
11. The integrated circuit chip device according to claim 9, characterized in that
the n-th output result gradient is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the n-th-layer input data is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the n-th-layer weight group data is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
12. A neural network computing device, characterized in that the neural network computing device includes one or more integrated circuit chip devices according to any one of claims 1-11.
13. A combined processing device, characterized in that the combined processing device includes: the neural network computing device according to claim 12, a general interconnection interface, and a general processing device;
the neural network computing device is connected with the general processing device through the general interconnection interface.
14. A chip, characterized in that the chip integrates the device according to any one of claims 1-13.
15. A smart device, characterized in that the smart device includes the chip according to claim 14.
16. A neural network operation method, characterized in that the method is applied in an integrated circuit chip device, the integrated circuit chip device includes the integrated circuit chip device according to any one of claims 1-11, and the integrated circuit chip device is configured to perform a training operation of the neural network.
Priority Applications (13)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711469408.3A CN109978156B (en) | 2017-12-28 | 2017-12-28 | Integrated circuit chip device and related product |
PCT/CN2018/123929 WO2019129070A1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
EP18896519.8A EP3719712B1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
EP20201907.1A EP3783477B1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
EP20203232.2A EP3789871B1 (en) | 2017-12-27 | 2018-12-26 | Integrated circuit chip device |
US16/903,304 US11544546B2 (en) | 2017-12-27 | 2020-06-16 | Integrated circuit chip device |
US17/134,435 US11741351B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,486 US11748604B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,446 US11748603B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,444 US11748601B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,445 US11748602B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US17/134,487 US11748605B2 (en) | 2017-12-27 | 2020-12-27 | Integrated circuit chip device |
US18/073,924 US11983621B2 (en) | 2017-12-27 | 2022-12-02 | Integrated circuit chip device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711469408.3A CN109978156B (en) | 2017-12-28 | 2017-12-28 | Integrated circuit chip device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109978156A true CN109978156A (en) | 2019-07-05 |
CN109978156B CN109978156B (en) | 2020-06-12 |
Family
ID=67075532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711469408.3A Active CN109978156B (en) | 2017-12-27 | 2017-12-28 | Integrated circuit chip device and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109978156B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115221102A (en) * | 2021-04-16 | 2022-10-21 | 中科寒武纪科技股份有限公司 | Method for optimizing convolution operation of system on chip and related product |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170063351A1 (en) * | 2015-08-31 | 2017-03-02 | Semiconductor Energy Laboratory Co., Ltd. | Semiconductor device or electronic device including the semiconductor device |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107169563A (en) * | 2017-05-08 | 2017-09-15 | 中国科学院计算技术研究所 | Processing system and method applied to two-value weight convolutional network |
US9779355B1 (en) * | 2016-09-15 | 2017-10-03 | International Business Machines Corporation | Back propagation gates and storage capacitor for neural networks |
CN109961138A (en) * | 2017-12-14 | 2019-07-02 | 北京中科寒武纪科技有限公司 | Neural network training method and Related product |
- 2017
  - 2017-12-28 CN CN201711469408.3A patent/CN109978156B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170063351A1 (en) * | 2015-08-31 | 2017-03-02 | Semiconductor Energy Laboratory Co., Ltd. | Semiconductor device or electronic device including the semiconductor device |
US9779355B1 (en) * | 2016-09-15 | 2017-10-03 | International Business Machines Corporation | Back propagation gates and storage capacitor for neural networks |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107169563A (en) * | 2017-05-08 | 2017-09-15 | 中国科学院计算技术研究所 | Processing system and method applied to two-value weight convolutional network |
CN109961138A (en) * | 2017-12-14 | 2019-07-02 | 北京中科寒武纪科技有限公司 | Neural network training method and Related product |
Non-Patent Citations (4)
Title |
---|
SHAOLI LIU等: "Cambricon: An Instruction Set Architecture for Neural Networks", 《2016 ACM/IEEE 43RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA)》 * |
YUNJI CHEN等: "DaDianNao: A Neural Network Supercomputer", 《2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE》 * |
毛健: "基于BP网络的神经元芯片的关键部件设计", 《万方数据知识服务平台》 * |
陆志坚: "基于FPGA的卷积神经网络并行结构研究", 《中国博士学位论文全文数据库 信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115221102A (en) * | 2021-04-16 | 2022-10-21 | 中科寒武纪科技股份有限公司 | Method for optimizing convolution operation of system on chip and related product |
CN115221102B (en) * | 2021-04-16 | 2024-01-19 | 中科寒武纪科技股份有限公司 | Method for optimizing convolution operation of system-on-chip and related product |
Also Published As
Publication number | Publication date |
---|---|
CN109978156B (en) | 2020-06-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109961138A (en) | Neural network training method and Related product | |
WO2019129070A1 (en) | Integrated circuit chip device | |
CN109978131A (en) | Integrated circuit chip device and Related product | |
CN111242294B (en) | Integrated circuit chip device and related products | |
CN109977446A (en) | Integrated circuit chip device and Related product | |
CN109961134A (en) | Integrated circuit chip device and Related product | |
CN109961131A (en) | Neural network forward operation method and Related product | |
CN109978156A (en) | Integrated circuit chip device and Related product | |
CN109961135A (en) | Integrated circuit chip device and Related product | |
CN109978148A (en) | Integrated circuit chip device and Related product | |
CN109978157A (en) | Integrated circuit chip device and Related product | |
CN109978147A (en) | Integrated circuit chip device and Related product | |
CN110175673A (en) | Processing method and accelerator | |
CN109960673A (en) | Integrated circuit chip device and Related product | |
CN109978151A (en) | Neural network processor board and Related product | |
CN109977071A (en) | Neural network processor board and Related product | |
CN109978150A (en) | Neural network processor board and Related product | |
CN109978158A (en) | Integrated circuit chip device and Related product | |
CN110197264A (en) | Neural network processor board and Related product | |
CN110197267A (en) | Neural network processor board and Related product | |
CN110490315A (en) | The reversed operation Sparse methods and Related product of neural network | |
CN109961133A (en) | Integrated circuit chip device and Related product | |
CN109978154A (en) | Integrated circuit chip device and Related product | |
CN109978130A (en) | Integrated circuit chip device and Related product | |
CN109978155A (en) | Integrated circuit chip device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant after: Zhongke Cambrian Technology Co., Ltd Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |