CN109978148A - Integrated circuit chip device and related products - Google Patents

Integrated circuit chip device and related products

Info

Publication number
CN109978148A
Authority
CN
China
Prior art keywords
data
layer
circuit
type
input data
Prior art date
Legal status
Granted
Application number
CN201711467271.8A
Other languages
Chinese (zh)
Other versions
CN109978148B (en)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201711467271.8A
Publication of CN109978148A
Application granted
Publication of CN109978148B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Design And Manufacture Of Integrated Circuits (AREA)
  • Logic Circuits (AREA)

Abstract

The present disclosure provides an integrated circuit chip device and related products. The device is used to perform training of a neural network having n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit, k branch processing circuits, and k groups of basic processing circuits. The main processing circuit is connected to each of the k branch processing circuits; each of the k branch processing circuits corresponds to one group of the k groups of basic processing circuits, and each group includes at least one basic processing circuit. Each branch processing circuit includes a data type operation circuit for converting between floating-point data and fixed-point data. The technical solution provided by the present disclosure has the advantages of a small computation load and low power consumption.

Description

Integrated circuit chip device and related products
Technical field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background art
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neural network of the human brain from an information-processing perspective, builds a simple model, and composes different networks according to different connection schemes. In engineering and academia it is often referred to directly as a neural network or neural-network-like model. A neural network is a computational model consisting of a large number of interconnected nodes (or neurons). Existing neural network operations perform the forward operation of a neural network on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such forward operations are computationally intensive and power-hungry.
Summary of the invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products that can increase the processing speed and the efficiency of a computing device.
In a first aspect, an integrated circuit chip device for performing neural network training is provided. The device performs training of a neural network having n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit, k branch processing circuits, and k groups of basic processing circuits. The main processing circuit is connected to each of the k branch processing circuits; each of the k branch processing circuits corresponds to one group of the k groups of basic processing circuits, and each group includes at least one basic processing circuit;
Each branch processing circuit includes a data type operation circuit for converting between floating-point data and fixed-point data;
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the nth output result of the forward operation;
The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain the nth backward operation of the nth-layer backward operation according to the training instruction, obtain an nth backward computational complexity according to the nth output result gradient, the nth-layer input data, the nth-layer weight group data, and the nth backward operation, and determine, according to the nth backward computational complexity, the nth backward data type corresponding to the nth output result gradient, the nth-layer input data, and the nth-layer weight group data;
The main processing circuit is configured to divide the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into a broadcast data block and a distribution data block according to the type of the nth backward operation, split the distribution data block of the nth backward data type into a plurality of basic data blocks, distribute the basic data blocks to at least one of the k branch processing circuits, and broadcast the broadcast data block of the nth backward data type to the k branch processing circuits;
The k branch processing circuits are configured to use the data type operation circuit to convert the broadcast data block and the received basic data blocks into a broadcast data block of the nth backward data type and basic data blocks of the nth backward data type, and to forward the converted broadcast data block and basic data blocks to the basic processing circuits;
The k groups of basic processing circuits are configured to perform operations on the broadcast data block and the received basic data blocks in the nth backward data type to obtain operation results, and to transmit the operation results to the main processing circuit through the k branch processing circuits;
The main processing circuit is configured to process the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient; the nth backward data type includes the fixed-point type or the floating-point type;
The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer, perform the (n-1)th-layer backward operation to obtain an (n-1)th-layer weight group gradient, and update the weight group data of the corresponding layer using the (n-1)th-layer weight group gradient, where the weight group data includes at least two weights.
In a second aspect, a neural network operation device is provided. The neural network operation device includes one or more of the integrated circuit chip devices provided in the first aspect.
In a third aspect, a combined processing device is provided. The combined processing device includes the neural network operation device provided in the second aspect, a universal interconnection interface, and a general-purpose processing device;
The neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.
In a fourth aspect, a chip is provided that integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.
In a fifth aspect, an electronic device is provided that includes the chip of the fourth aspect.
As can be seen, the embodiments of the present disclosure provide a data conversion operation circuit that converts the type of a data block before operation, which saves transmission resources and computing resources; the solution therefore has the advantages of low power consumption and a small computation load.
Brief description of the drawings
Fig. 1 is a schematic diagram of a training method of a neural network.
Fig. 1a is a schematic diagram of a forward operation of a neural network.
Fig. 1b is a schematic structural diagram of a fixed-point data type.
Fig. 2a is a schematic diagram of convolution input data.
Fig. 2b is a schematic diagram of a convolution kernel.
Fig. 2c is a schematic diagram of an operation window of a three-dimensional data block of the input data.
Fig. 2d is a schematic diagram of another operation window of a three-dimensional data block of the input data.
Fig. 2e is a schematic diagram of yet another operation window of a three-dimensional data block of the input data.
Fig. 3 is a schematic structural diagram of a neural network chip.
Fig. 4a is a schematic diagram of a matrix-times-matrix operation.
Fig. 4b is a method flowchart of matrix-times-matrix.
Fig. 4c is a schematic diagram of a matrix-times-vector operation.
Fig. 4d is a method flowchart of matrix-times-vector.
Fig. 4e is a schematic diagram of neural network training.
Fig. 4f is another schematic diagram of neural network training.
Fig. 4g is a schematic diagram of the forward and backward operations of a neural network.
Fig. 4h is a schematic diagram of a multilayer structure for neural network training.
Fig. 5a is a structural schematic diagram of a combined processing device also disclosed by the present disclosure.
Fig. 5b is another structural schematic diagram of a combined processing device also disclosed by the present disclosure.
Fig. 5c is a structural schematic diagram of a neural network processor board provided by an embodiment of the present disclosure.
Fig. 5d is a structural schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 5e is a structural schematic diagram of a neural network chip provided by an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a neural network chip package structure provided by an embodiment of the present disclosure.
Fig. 6a is a schematic diagram of another neural network chip package structure provided by an embodiment of the present disclosure.
Specific embodiment
To help those skilled in the art better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
In the device provided in the first aspect, the main processing circuit is specifically configured to compare the nth backward computational complexity with a preset threshold; if the nth backward computational complexity is higher than the preset threshold, the nth backward data type is determined to be the fixed-point type, and if the nth backward computational complexity is lower than or equal to the preset threshold, the computing device determines the nth backward data type to be the floating-point type.
In the device provided in the first aspect, the main processing circuit is specifically configured to determine the (n+1)th backward data type to which the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belong; if the (n+1)th backward data type differs from the nth backward data type, the data type operation circuit converts the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belonging to the (n+1)th backward data type into the nth output result gradient, the nth-layer input data, and the nth-layer weight group data belonging to the nth backward data type.
In the device provided in the first aspect, the main processing circuit is configured such that, if the n-layer backward operation is a convolution operation, the convolution input data is the nth-layer input data, the convolution kernel is the nth output result gradient, and
nth backward computational complexity = α * C * kW * kW * M * N * W * C * H;
where α is a convolution coefficient with a value range greater than 1; C, kW, kW, and M are the values of the four dimensions of the convolution kernel; and N, W, C, and H are the values of the four dimensions of the convolution input data.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type, and it is determined whether the convolution input data and the convolution kernel are fixed-point data; if the convolution input data and the convolution kernel are not fixed-point data, the k branch processing circuits are notified to convert the convolution input data into fixed-point data and to convert the convolution kernel into fixed-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed-point type.
In the device provided in the first aspect, the main processing circuit is further configured such that, if the nth backward operation is a matrix-times-matrix operation, the input data is the nth-layer input data, the weight is the nth output result gradient, and
complexity = β * F * G * E * F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the nth-layer input data, and E and F are the row and column values of the weight.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type, and it is determined whether the nth-layer input data and the weight are fixed-point data; if the nth-layer input data and the weight are not fixed-point data, the k branch processing circuits are notified to convert the nth-layer input data into fixed-point data and to convert the weight into fixed-point data, and the matrix-times-matrix operation is then performed on the nth-layer input data and the weight in the fixed-point type.
In the device provided in the first aspect, the integrated circuit chip device is further configured such that, if the nth backward operation is a matrix-times-vector operation, the input data is the nth-layer input data, the weight is the nth output result gradient, and
complexity = β * F * G * F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the nth-layer input data, and F is the column value of the nth output result gradient.
If the complexity is greater than the set threshold, the nth backward data type is determined to be the fixed-point type, and it is determined whether the nth-layer input data and the weight are fixed-point data; if the nth-layer input data and the weight are not fixed-point data, the k branch processing circuits are notified to convert the nth-layer input data into fixed-point data and to convert the weight into fixed-point data, and the matrix-times-vector operation is then performed on the nth-layer input data and the weight in the fixed-point type.
In the device provided in the first aspect, the main processing circuit is specifically configured such that, if the type of the nth backward operation is a multiplication operation, the nth-layer input data and the nth-layer weight group data are determined to be distribution data blocks and the nth output result gradient is determined to be the broadcast data block; if the type of the nth backward operation is a convolution operation, the nth-layer input data and the nth-layer weight group data are determined to be broadcast data blocks and the nth output result gradient is determined to be the distribution data block.
In the device provided in the first aspect, the n-layer backward operation further includes one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit;
Each basic processing circuit includes a basic register or a basic on-chip cache circuit.
In the device provided in the first aspect, the main processing circuit includes one or any combination of: a vector operation circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
In the device provided in the first aspect, the nth output result gradient is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
The nth-layer input data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
The n-layer weight group data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
As shown in Fig. 1, the steps of neural network training include:
Each layer of a (multilayer) neural network performs the forward operation in turn;
The backward operation is performed layer by layer in the opposite order to obtain the weight gradients;
The computed weight gradients are used to update the weights of the forward operation;
This is one iteration of neural network training; the whole training process repeats this iteration (i.e., multiple iterations of computation) many times, as the sketch below illustrates.
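A minimal Python sketch of this iteration; the Layer interface (forward, backward, weight) is an illustrative assumption, not the patent's interface:

def train(layers, dataset, loss_grad_fn, lr=0.01, num_iters=100):
    for _ in range(num_iters):                   # the whole process repeats many times
        for x in dataset:
            acts = [x]                           # forward operation, layer by layer
            for layer in layers:
                acts.append(layer.forward(acts[-1]))
            grad = loss_grad_fn(acts[-1])        # gradient returned by the loss function
            for layer in reversed(layers):       # backward operation in opposite layer order
                w_grad, grad = layer.backward(grad)  # weight gradient and input-data gradient
                layer.weight -= lr * w_grad          # update the forward-operation weights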
Referring to Fig. 3, Fig. 3 shows an integrated circuit chip device used to perform training of a neural network. The neural network includes n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit, k branch processing circuits, and k groups of basic processing circuits. The main processing circuit is connected to each of the k branch processing circuits; each of the k branch processing circuits corresponds to one group of the k groups of basic processing circuits, and each group includes at least one basic processing circuit;
Each branch processing circuit includes a data type operation circuit for converting between floating-point data and fixed-point data;
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the nth output result of the forward operation;
The main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain the nth backward operation of the nth-layer backward operation according to the training instruction, obtain an nth backward computational complexity according to the nth output result gradient, the nth-layer input data, the nth-layer weight group data, and the nth backward operation, and determine, according to the nth backward computational complexity, the nth backward data type corresponding to the nth output result gradient, the nth-layer input data, and the nth-layer weight group data;
The main processing circuit is configured to divide the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into a broadcast data block and a distribution data block according to the type of the nth backward operation, split the distribution data block of the nth backward data type into a plurality of basic data blocks, distribute the basic data blocks to at least one of the k branch processing circuits, and broadcast the broadcast data block of the nth backward data type to the k branch processing circuits;
The k branch processing circuits are configured to use the data type operation circuit to convert the broadcast data block and the received basic data blocks into a broadcast data block of the nth backward data type and basic data blocks of the nth backward data type, and to forward the converted broadcast data block and basic data blocks to the basic processing circuits;
The k groups of basic processing circuits are configured to perform operations on the broadcast data block and the received basic data blocks in the nth backward data type to obtain operation results, and to transmit the operation results to the main processing circuit through the k branch processing circuits;
The main processing circuit is configured to process the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and to update the nth-layer weight group data using the nth-layer weight group gradient; the nth backward data type includes the fixed-point type or the floating-point type;
The integrated circuit chip device is further configured to use the nth-layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer, perform the (n-1)th-layer backward operation to obtain an (n-1)th-layer weight group gradient, and update the weight group data of the corresponding layer using the (n-1)th-layer weight group gradient, where the weight group data includes at least two weights. A structural sketch of this topology follows.
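The connection pattern just described can be sketched as follows; the class and builder names are illustrative assumptions, not elements of the patent:

from dataclasses import dataclass
from typing import List

@dataclass
class BasicCircuit:              # performs inner products on the blocks it receives
    idx: int

@dataclass
class BranchCircuit:             # holds the data type operation circuit for its group
    idx: int
    group: List[BasicCircuit]

def build_chip(k: int, group_size: int) -> List[BranchCircuit]:
    # One main processing circuit fans out to k branch processing circuits;
    # each branch circuit owns one group of at least one basic processing circuit.
    return [
        BranchCircuit(i, [BasicCircuit(i * group_size + j) for j in range(group_size)])
        for i in range(k)
    ]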
As shown in Fig. 1a, which illustrates a forward operation of the neural network provided by an embodiment of the present disclosure, each layer uses its own input data and weights to compute the corresponding output data according to the operation rules specified by the type of the layer;
The forward operation process of a neural network (also called inference) processes the input data of each layer in turn, performing certain computations to obtain the output data. It has the following features:
Input of a layer:
The input of a layer can be the input data of the neural network;
The input of a layer can be the output of another layer;
The input of a layer can be the output of the same layer at the previous time step (as in recurrent neural networks);
A layer can obtain input from several of the above input sources simultaneously;
Output of a layer:
The output of a layer can serve as the output result of the neural network;
The output of a layer can be the input of another layer;
The output of a layer can be the input of the same layer at the next time step (as in recurrent neural networks);
The output of a layer can be sent to several of the above output destinations;
Specifically, the types of operations of the layers in the neural network include but are not limited to the following:
Convolutional layers (performing convolution operations);
Fully connected layers (performing fully connected operations);
Normalization (regularization) layers, including LRN (Local Response Normalization) layers and BN (Batch Normalization) layers;
Pooling layers;
Activation layers, including but not limited to the following types: Sigmoid layers, ReLU layers, PReLU layers, LeakyReLU layers, and Tanh layers;
Backward operation of a layer: the backward operation of each layer needs to perform two parts of computation. One part uses the output data gradient (which may be sparsely represented) and the input data (which may be sparsely represented) to compute the gradient of the weights (used in the "weight update" step to update the weights of this layer); the other part uses the output data gradient (which may be sparsely represented) and the weights (which may be sparsely represented) to compute the input data gradient (which serves as the output data gradient of the next layer in the backward operation, so that that layer can perform its own backward operation);
The backward operation propagates gradients back from the last layer, in the order opposite to the forward operation.
In an optional scheme, the output data gradient obtained by the backward computation of a certain layer can come from:
the gradient returned by the loss function (or cost function) at the end of the neural network;
the input data gradient of another layer;
the input data gradient of the same layer at the previous time step (as in recurrent neural networks);
A layer can obtain output data gradients from several of the above sources simultaneously;
After the backward operation of the neural network has been performed, the gradient of the weights of each layer has been computed; in this step, the first input cache and the second input cache of the device are used to store the weights of this layer and the gradient of the weights respectively, and the operation unit then uses the weight gradient to update the weights;
The operations mentioned above are all operations of one layer of a neural network. When a multilayer neural network is implemented, in the forward operation, after the forward operation of one layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data computed in the operation unit as the input data of the next layer (or performs certain operations on that output data before using it as the input data of the next layer), and the weights are likewise replaced with the weights of the next layer. In the backward operation, after the backward operation of one layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient computed in the operation unit as the output data gradient of the next layer (or performs certain operations on that input data gradient before using it as the output data gradient of the next layer), and the weights are replaced with the weights of the next layer. (In the following figures, dotted arrows indicate the backward operation, solid arrows indicate the forward operation, and the labels below each figure indicate its meaning.)
Representation method of fixed-point data
The fixed-point method refers to converting the representation of the data of some data block in the network into a data encoding with a specific fixed position for the decimal point (the way the 0/1 bits of the data are laid out on the circuit device);
In an optional scheme, multiple data are grouped into a data block as a whole and the data block is represented in fixed point with the same encoding;
Fig. 1 b shows the specific table of short digit fixed-point data structure for storing data according to an embodiment of the present invention Show method.Wherein, 1Bit are used to indicate symbol, and M are used to indicate integer part, and N for indicating fractional part;It compares In 32 floating data representations, the short position fixed-point data representation that the present invention uses is less in addition to occupying number of bits Outside, it for same layer, same type of data in neural network, such as all weight datas of first convolutional layer, also in addition sets The position of a flag bit Point location record decimal point has been set, number can have been adjusted according to the distribution of real data in this way According to expression precision and can indicate data area.
A floating-point number is represented with 32 bits; for this technical solution, using fixed-point numbers reduces the number of bits of a value, which reduces both the volume of data transmitted and the volume of data operated on.
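A minimal sketch of such a conversion, assuming a 16-bit signed format whose point location gives the number of fractional bits (the rounding and saturation behavior are assumptions beyond what Fig. 1b specifies):

def to_fixed(x: float, bits: int = 16, point_loc: int = 8) -> int:
    # Encode x as a signed fixed-point integer with point_loc fractional bits.
    scaled = round(x * (1 << point_loc))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, scaled))        # saturate to the representable range

def from_fixed(q: int, point_loc: int = 8) -> float:
    # Decode back to a floating-point value.
    return q / (1 << point_loc)

q = to_fixed(3.14159)                      # -> 804 (3.14159 * 2**8, rounded)
print(from_fixed(q))                       # -> 3.140625; precision is set by point_loc

Storing 16 bits instead of a 32-bit float roughly halves the transfer and storage volume, at a precision cost governed by the chosen point location.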
The input data is represented as in Fig. 2a (N samples, each sample with C channels, the feature map of each channel having height H and width W); the weights, namely the convolution kernels, are represented as in Fig. 2b (M convolution kernels, each with C channels and with height and width KH and KW respectively). The rule of the convolution operation is the same for each of the N samples of the input data. The following explains the process of a convolution operation on one sample: on one sample, each of the M convolution kernels performs the same operation; each kernel operation produces one planar feature map, and the M kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of the sample, and the kernel then slides along the H and W directions. For example, Fig. 2c shows the corresponding diagram of a convolution kernel performing the inner product operation at the lower-right position of a sample of the input data; Fig. 2d shows the convolution position slid one cell to the left, and Fig. 2e shows it slid one cell upward.
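A naive Python sketch of this sliding-window inner product, assuming stride 1 and no padding (neither is specified above):

import numpy as np

def conv2d(inp: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    N, C, H, W = inp.shape                   # N samples, C channels, H x W maps (Fig. 2a)
    M, C2, KH, KW = kernels.shape            # M kernels, each C x KH x KW (Fig. 2b)
    assert C == C2, "input and kernel channel counts must match"
    out = np.empty((N, M, H - KH + 1, W - KW + 1))
    for n in range(N):                       # the rule is the same for every sample
        for m in range(M):                   # every kernel performs the same operation
            for i in range(H - KH + 1):      # slide along H ...
                for j in range(W - KW + 1):  # ... and along W (Figs. 2c to 2e)
                    window = inp[n, :, i:i + KH, j:j + KW]
                    out[n, m, i, j] = np.sum(window * kernels[m])  # inner product
    return out                               # M planar feature maps per sample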
When the first operation is a convolution operation, the input data is the convolution input data and the weight data is the convolution kernel;
first complexity = α * C * kW * kW * M * N * W * C * H;
where α is a convolution coefficient with a value range greater than 1; C, kW, kW, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data.
If the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are fixed-point data; if the convolution input data and the convolution kernel are not fixed-point data, the convolution input data is converted into fixed-point data and the convolution kernel is converted into fixed-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed-point type.
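A sketch of this test, with the formula taken verbatim from the text (C and kW each enter twice) and with α and the threshold treated as open parameters:

def choose_conv_dtype(alpha: float, C: int, kW: int, M: int,
                      N: int, W: int, H: int, threshold: float) -> str:
    # High complexity: the one-off conversion overhead is negligible, so use
    # fixed point to cut transfer and compute volume; otherwise stay in float.
    first_complexity = alpha * C * kW * kW * M * N * W * C * H
    return "fixed" if first_complexity > threshold else "float"

# Example: 3x3 kernels, 64 channels in and out, batch 8, 32x32 maps, alpha = 2
print(choose_conv_dtype(2.0, 64, 3, 64, 8, 32, 32, threshold=1e9))  # -> "fixed"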
Specifically, the convolution can be processed using the chip structure shown in Fig. 3. When the first complexity is greater than the set threshold, the data conversion operation circuit of the main processing circuit (which may also be called the main unit) converts the data in some or all of the convolution kernels of the weights into fixed-point data, and the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights, through the lateral data input interface, to the basic processing circuits (which may also be called basic units) that are directly connected to the main processing circuit;
In an optional scheme, the control circuit of the main processing circuit sends the data of a certain convolution kernel of the weights, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st time the 1st number of row 3 is sent, the 2nd time the 2nd number in the data of row 3, the 3rd time the 3rd number of row 3, ...; or the 1st time the first two numbers of row 3 are sent, the 2nd time the 3rd and 4th numbers of row 3, the 3rd time the 5th and 6th numbers of row 3, ...;)
In another optional scheme, the control circuit of the main processing circuit sends the data of several convolution kernels of the weights, one number or a part of the numbers per kernel at a time, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st time the 1st number of each of rows 3, 4, and 5 is sent, the 2nd time the 2nd number of each of rows 3, 4, and 5, the 3rd time the 3rd number of each of rows 3, 4, and 5, ...; or the 1st time the first two numbers of each of rows 3, 4, and 5 are sent, the 2nd time the 3rd and 4th numbers of each of rows 3, 4, and 5, the 3rd time the 5th and 6th numbers of each of rows 3, 4, and 5, ...;)
The control circuit of the main processing circuit partitions the input data according to the convolution positions, and sends the data of some or all of the convolution positions in the input data, through the vertical data input interface, to the basic processing circuits that are directly connected to the main processing circuit;
In an optional scheme, the control circuit of the main processing circuit sends the data of a certain convolution position in the input data, one number or a part of the numbers at a time, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st time the 1st number of column 3 is sent, the 2nd time the 2nd number in the data of column 3, the 3rd time the 3rd number of column 3, ...; or the 1st time the first two numbers of column 3 are sent, the 2nd time the 3rd and 4th numbers of column 3, the 3rd time the 5th and 6th numbers of column 3, ...;)
In another optional scheme, the control circuit of the main processing circuit sends the data of several convolution positions in the input data, one number or a part of the numbers each at a time, to a certain basic processing circuit. (For example, for a certain basic processing circuit: the 1st time the 1st number of each of columns 3, 4, and 5 is sent, the 2nd time the 2nd number of each of columns 3, 4, and 5, the 3rd time the 3rd number of each of columns 3, 4, and 5, ...; or the 1st time the first two numbers of each of columns 3, 4, and 5 are sent, the 2nd time the 3rd and 4th numbers of each of columns 3, 4, and 5, the 3rd time the 5th and 6th numbers of each of columns 3, 4, and 5, ...;)
After a basic processing circuit receives data of the weights, it transmits that data through its lateral data output interface to the next basic processing circuit connected to it; after a basic processing circuit receives data of the input data, it transmits that data through its vertical data output interface to the next basic processing circuit connected to it;
Each basic processing circuit operates on the data it receives;
In an optional scheme, a basic processing circuit computes the multiplication of one or more groups of two numbers at a time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more groups of two vectors at a time, and then accumulates the result into its register and/or on-chip cache;
After a basic processing circuit has computed a result, it can transmit the result out through its data output interface;
In an optional scheme, this result can be the final result or an intermediate result of the inner product operation;
Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted through that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit.
After a basic processing circuit receives a computation result from another basic processing circuit, it transmits the data to the other basic processing circuit or main processing circuit connected to it;
Results are output in the direction in which direct output to the main processing circuit is possible (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass their operation results downward through their vertical output interfaces);
The main processing circuit receives the inner product operation results of each basic processing circuit to obtain the output result.
Referring to Fig. 4a, Fig. 4a shows a matrix-times-matrix operation. If the first operation is a matrix-times-matrix operation, the input data is the first matrix of the matrix-times-matrix operation and the weight is the second matrix of the matrix-times-matrix operation;
first complexity = β * F * G * E * F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the first matrix, and E and F are the row and column values of the second matrix;
If the first complexity is greater than the set threshold, it is determined whether the first matrix and the second matrix are fixed-point data; if the first matrix and the second matrix are not fixed-point data, the first matrix is converted into fixed-point data and the second matrix is converted into fixed-point data, and the matrix-times-matrix operation is then performed on the first matrix and the second matrix in the fixed-point type. This test is sketched below.
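A sketch of these complexity tests, with the formulas copied verbatim (the symbol F does double duty in the text) and β and the threshold as open parameters; the matrix-times-vector case described later (Fig. 4c) is the same test with the E factor dropped:

def choose_matmul_dtype(beta: float, F: int, G: int, E: int, threshold: float) -> str:
    complexity = beta * F * G * E * F      # first matrix F x G, second matrix E x F
    return "fixed" if complexity > threshold else "float"

def choose_matvec_dtype(beta: float, F: int, G: int, threshold: float) -> str:
    complexity = beta * F * G * F          # first matrix F x G times a vector
    return "fixed" if complexity > threshold else "float"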
Referring to Fig. 4b, the matrix-times-matrix operation is completed using the device shown in Fig. 3;
The following describes the operation of computing the multiplication of a matrix S of size M rows by L columns and a matrix P of size L rows by N columns (each row of matrix S is as long as each column of matrix P, as shown in Fig. 2d), for a neural network computing device that has K basic processing circuits:
Step S401b: when the first complexity is greater than the set threshold, the main processing circuit converts matrix S and matrix P into fixed-point data; the control circuit of the main processing circuit distributes each row of data of matrix S to one of the K basic processing circuits, and the basic processing circuits store the received data in their on-chip caches and/or registers. Specifically, the data can be sent to the basic processing circuits, among the K basic processing circuits, that are connected to the main processing circuit.
In an optional scheme, if the number of rows M of S is less than or equal to K, the control circuit of the main processing circuit distributes one row of the S matrix to each of M basic processing circuits;
In an optional scheme, if the number of rows M of S is greater than K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit.
Let the Mi rows of S distributed to the i-th basic processing circuit be collectively denoted Ai; Fig. 2e shows the computation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit:
the received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the volume of data transmitted afterwards is reduced, the computational efficiency is improved, and the power consumption is reduced.
Step S402b: the control circuit of the main processing circuit transmits each part of matrix P to each basic processing circuit in a broadcast manner;
In an optional scheme, each part of matrix P can be broadcast only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of matrix P obtained this time, completing the inner product operations corresponding to the rows of matrix Ai. Reuse in this embodiment specifically means repeated use by the basic processing circuit in its computation; for example, reuse of the data of matrix P means that the data of matrix P is used multiple times.
In an optional scheme, the control circuit of the main processing circuit can broadcast each part of matrix P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of matrix P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai in several passes;
In an optional scheme, the control circuit of the main processing circuit can broadcast each part of matrix P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of matrix P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai;
In an optional scheme, each basic processing circuit, for example the i-th basic processing circuit, computes the inner products of the data of matrix Ai and the data of matrix P;
Step S403b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits the accumulated results back to the main processing circuit.
In an optional scheme, a basic processing circuit can transmit the partial sum obtained by each inner product operation back to the main processing circuit for accumulation;
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit can be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes;
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit can, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation and, in some cases, be transmitted to the main processing circuit for accumulation, and then be transmitted back to the main processing circuit after the accumulation finishes. A sketch of the distribution scheme of steps S401b to S403b follows.
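A host-side Python sketch of steps S401b to S403b; the round-robin row split is an assumption (the text only requires one or more rows per circuit), and the gather by the main processing circuit is modeled as an ordinary array write:

import numpy as np

def distributed_matmul(S: np.ndarray, P: np.ndarray, K: int) -> np.ndarray:
    M, L = S.shape                                   # S is M x L, P is L x N
    assert P.shape[0] == L
    row_groups = [range(i, M, K) for i in range(K)]  # rows Ai held by circuit i
    out = np.empty((M, P.shape[1]))
    for rows in row_groups:                          # each basic circuit, conceptually in parallel
        for r in rows:                               # inner products of row r with columns of P
            out[r] = S[r] @ P                        # P was broadcast to every circuit
    return out                                       # results gathered by the main circuit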
Referring to Fig. 4c, Fig. 4c is a schematic diagram of a matrix-times-vector operation. If the first operation is a matrix-times-vector operation, the input data is the first matrix of the matrix-times-vector operation and the weight is the vector of the matrix-times-vector operation;
first complexity = β * F * G * F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the first matrix, and F is the column value of the vector;
If the first complexity is greater than the set threshold, it is determined whether the first matrix and the vector are fixed-point data; if the first matrix and the vector are not fixed-point data, the first matrix is converted into fixed-point data and the vector is converted into fixed-point data, and the matrix-times-vector operation is then performed on the first matrix and the vector in the fixed-point type.
Referring to Fig. 4d, Fig. 4d provides an implementation method of matrix-times-vector, which can specifically include:
Step S401: the data conversion operation circuit of the main processing circuit converts each row of data in matrix S into fixed-point data, and the control circuit of the main processing circuit distributes the rows among the K basic processing circuits; the basic processing circuits store the received distribution data in their on-chip caches and/or registers;
In an optional scheme, if the number of rows M of matrix S is less than or equal to K, the control circuit of the main processing circuit distributes one row of the S matrix to each of the K basic processing circuits;
In an optional scheme, if the number of rows M of matrix S is greater than K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit.
The set of rows of S distributed to the i-th basic processing circuit is denoted Ai and contains Mi rows in total; Fig. 2c shows the computation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example in the i-th basic processing circuit, the received distribution data, such as matrix Ai, can be stored in the register and/or on-chip cache of the i-th basic processing circuit; the advantage is that the volume of subsequently transmitted distribution data is reduced, the computational efficiency is improved, and the power consumption is reduced.
Step S402: the data type operation circuit of the main processing circuit converts vector P into fixed-point data, and the control circuit of the main processing circuit broadcasts each part of the fixed-point vector P to the K basic processing circuits;
In an optional scheme, the control circuit of the main processing circuit can broadcast each part of vector P only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of vector P obtained this time, completing the inner product operations corresponding to the rows of matrix Ai. The advantage is that the volume of data repeatedly transmitted from the main processing circuit to the basic processing circuits for vector P is reduced, the execution efficiency is improved, and the transmission power consumption is reduced.
In an optional scheme, the control circuit of the main processing circuit can broadcast each part of vector P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit does not reuse the data of vector P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai in several passes. The advantage is that the volume of vector P data transmitted in a single transfer inside the basic processing circuit is reduced, the capacity of the basic processing circuit's cache and/or register can be reduced, the execution efficiency is improved, the transmission power consumption is reduced, and the cost is reduced.
In an optional scheme, the control circuit of the main processing circuit can broadcast each part of vector P multiple times into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit partially reuses the data of vector P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai. The advantage is that the volume of data transmitted from the main processing circuit to the basic processing circuits is reduced, the volume of data transmitted inside the basic processing circuits is also reduced, the execution efficiency is improved, and the transmission power consumption is reduced.
Step S403: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of matrix S and vector P; for example, the i-th basic processing circuit computes the inner product of the data of matrix Ai and the data of vector P;
Step S404: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain the accumulation results, and transmit the accumulation results back to the main processing circuit in fixed-point form.
In an optional scheme, each basic processing circuit can transmit the partial sums obtained by the inner product operations (a partial sum is a part of the accumulation result; for example, if the accumulation result is F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5, a partial sum can be the value of F1*G1 + F2*G2 + F3*G3) back to the main processing circuit for accumulation. The advantage is that the amount of computation inside the basic processing circuit is reduced and the operation efficiency of the basic processing circuit is improved.
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit can be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes. The advantage is that the volume of data transmitted between the basic processing circuits and the main processing circuit is reduced, the operation efficiency is improved, and the data transmission power consumption is reduced.
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit can, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation and, in some cases, be transmitted to the main processing circuit for accumulation, and then be transmitted back to the main processing circuit after the accumulation finishes. The advantage is that the volume of data transmitted between the basic processing circuits and the main processing circuit is reduced, the operation efficiency is improved, the data transmission power consumption is reduced, the amount of computation inside the basic processing circuit is reduced, and the operation efficiency of the basic processing circuit is improved. These three strategies are sketched below.
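A sketch of the three partial-sum strategies for one basic processing circuit computing an inner product in chunks; send_to_main stands in for the interconnect, and the chunking and hand-off policy are assumptions:

def inner_product(a, p, chunk, mode, send_to_main):
    acc = 0.0
    for s in range(0, len(a), chunk):
        part = sum(x * y for x, y in zip(a[s:s + chunk], p[s:s + chunk]))
        if mode == "return_partials":            # main circuit does all accumulation
            send_to_main(part)
        else:
            acc += part                          # accumulate in local register/cache
            if mode == "mixed" and (s // chunk) % 4 == 3:
                send_to_main(acc)                # hand some partial sums to the main circuit
                acc = 0.0
    if mode != "return_partials":
        send_to_main(acc)                        # transmit once accumulation finishes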
Neural network training method
All the data involved in the neural network training process can use different data representation methods;
Specifically, the data representation methods include but are not limited to the following cases:
Floating-point numbers of different bit widths;
Fixed-point numbers of different bit widths, and fixed-point numbers with different point positions;
At different moments of the training process (i.e., at different iteration counts or at initialization), in different phases (i.e., the forward or the backward operation), in different layers, for different data blocks within the same layer (i.e., multiple input data blocks and output data blocks), or for the sub-blocks into which the same data block is partitioned, it is possible:
to use fixed point or floating point respectively;
and for fixed point:
to use different fixed-point bit widths;
to use different fixed-point offset values (namely point positions);
The following uses a concrete example to illustrate a specific implementation of neural network training. Fig. 1a is a specific computation schematic of single-layer neural network training; as shown in Fig. 1a, the input data and the weights (or parameters) execute the operation of this layer. The technical solution provided by the embodiments of this application decides whether to convert the type of the input data and the weights according to the input data, the weights, and the forward operation amount of this layer. A specific way can be: if the register or storage space occupied by storing the input data and the weights is greater than a set threshold and the forward operation amount of this layer is greater than a set operation amount, then, when the input data and the weight data are determined to be floating-point data, the input data and the weight data are converted into fixed-point data; if the register or storage space occupied by storing the input data and the weights is less than the set threshold, then, when the input data and the weight data are fixed-point data, the input data and the weight data are converted into floating-point data before the operation of this layer is executed.
The principle of the above data type conversion is elaborated here. Fig. 1b shows a representation of fixed-point data. For a computing system, the storage bit count of one floating-point datum is 32 bits, while for fixed-point data, especially data represented as in Fig. 1b, the storage bit count of one datum can be 16 bits or fewer. This conversion therefore significantly reduces the transmission overhead between calculators; in addition, the storage space of data with fewer bits is smaller, i.e., the storage overhead is smaller, and the amount of computation is also reduced, i.e., the computation overhead is reduced. So the conversion can reduce both the computation overhead and the storage overhead. However, the conversion of data types itself requires some overhead, referred to below as the conversion overhead. For data with a large amount of computation and a large amount of data storage, the conversion overhead is almost negligible compared with the subsequent computation overhead, storage overhead, and transmission overhead, so for such data this application adopts the technical solution of converting the data type into fixed-point data. Conversely, for data with a small amount of computation and a small amount of data storage, the computation overhead, storage overhead, and transmission overhead are already small; since the precision of fixed-point data is slightly lower than that of floating-point data, and since the amount of computation is small, the precision of the computation must be guaranteed, so converting the fixed-point data into floating-point data improves the precision of the computation at the cost of a small overhead.
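A minimal sketch of this decision rule, with both thresholds left as open parameters of the solution:

def pick_layer_dtype(storage_bytes: int, op_count: int,
                     storage_thresh: int, op_thresh: int) -> str:
    # Large footprint and large operation count: the conversion overhead is
    # dwarfed by the transfer, storage, and compute savings of fixed point.
    if storage_bytes > storage_thresh and op_count > op_thresh:
        return "fixed"
    # Small layers: overheads are already small, so run in floating point
    # and keep the extra precision.
    return "float"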
The following illustrates with a concrete example. As shown in Fig. 4e, the operation mode of this layer is matrix multiplication, and both the input data and the weight are matrices. For convenience of explanation, the input data here is taken to be matrix I and the weight matrix W; as shown in Fig. 4e, output data = matrix I * matrix W. If the sum of the numbers of columns and rows of matrix I and matrix W is large, it can be considered that matrix I and matrix W occupy a large amount of space in memory and/or registers and that the amount of computation is also large; in that case, if matrix I and matrix W are floating-point data, matrix I and matrix W are converted into fixed-point data before the matrix multiplication is executed.
For example, suppose matrix I is a 1000*1000 matrix and matrix W is also a 1000*1000 matrix; then the sum of the numbers of columns and rows is 2000, which is very large, and the corresponding amount of computation is even larger: the multiplications of the inner product operations of the matrix-times-matrix computation number 10^9. For this technical solution, since the sizes of matrix I and matrix W are very large, it is impossible to transmit all the data at once, so the same data may be transmitted several times; if the data is transmitted as fixed-point data, the volume of transmitted data can be significantly reduced, which reduces the transmission overhead, and computing and storing with fewer bits likewise reduces the computation overhead and the storage overhead.
As for the technical solution of converting fixed-point data into floating-point data, take the backward operation as an example. In the computation structure shown in Fig. 4g, the upward-arrow direction is a backward operation. For the backward operation, its input is the output data gradient, which is specifically as follows: if the output data gradient belongs to the last layer of the current iteration, the output data gradient is obtained by applying a preset operation to the output data of the last layer of the current iteration (the preset operation can be set by the manufacturer according to its needs; the specific operation steps of the preset operation are not limited here); if the output data gradient does not belong to the last layer of the current iteration, for example if the output data gradient belongs to the nth layer of the current iteration, then the output data gradient is the input data gradient computed by the backward operation of the (n+1)th layer.
The following uses a concrete example. As shown in Fig. 4g, the operation of this layer is a matrix multiplication whose input data are a matrix and whose weight is a scalar; for convenience of explanation the input data are taken to be a matrix I and the weight a scalar C, so that, as shown in Fig. 4g, output data = matrix I * C. Because the weight is a scalar, the amount of computation is small; in that case, if matrix I is fixed-point data, it is converted into floating-point data before the matrix-times-scalar operation is executed.
For example, if matrix I is a 10*10 matrix and the weight is a scalar C, the sum of the column count and row count is 20, which is small (here, a value above 100 is considered large and a value below 100 small; the threshold of 100 can be set arbitrarily by those skilled in the art), and the corresponding amount of computation is very small: the inner-product multiplications of the matrix-times-scalar operation number 10^2. Because the amount of computation is small, continuing to calculate with fixed-point data would affect the precision; to obtain higher computational precision under the premise of a small amount of computation, the calculation is carried out with floating-point data.
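The opposite direction, again as a hedged sketch with illustrative names and the same assumed threshold:

```python
import numpy as np

THRESHOLD = 100  # row count + column count below this counts as "small"

def fixed_to_float(q: np.ndarray, pp: int) -> np.ndarray:
    """Recover float32 values from signed 16-bit fixed point."""
    return q.astype(np.float32) / (1 << pp)

def backward_scale(qI: np.ndarray, C: float, pp: int = 8) -> np.ndarray:
    """Reverse-operation output = I * C. Small workloads (e.g. 10x10, only
    10**2 multiplications) are promoted to floating point for precision."""
    if qI.shape[0] + qI.shape[1] < THRESHOLD:
        return fixed_to_float(qI, pp) * C  # small: pay a little overhead, regain precision
    qC = int(round(C * (1 << pp)))         # large: stay in fixed point
    return (qI.astype(np.int64) * qC).astype(np.float32) / (1 << (2 * pp))

qI = np.random.randint(-2000, 2000, size=(10, 10)).astype(np.int16)  # fixed-point matrix I
grad = backward_scale(qI, C=1.5)  # executed in floating point
```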
In an optional scheme, each data block of each layer in the network can adopt its own fixed fixed-point bit width, while its fixed-point position varies with the training iteration cycles;
Specifically, in the training process, the data representation of a data block can be set as follows:
Specifically, when training starts, an arbitrary data representation can be selected for a data block;
In an optional scheme, a floating-point representation of a specific bit width can be selected;
In an optional scheme, a fixed-point representation of a particular form can be selected:
A specific fixed-point bit width can be selected;
A specific fixed-point position can be selected;
In an optional scheme, the fixed-point position can be set according to the maximum of the absolute values of all data in the data block;
In an optional scheme, the fixed-point position can be set according to the minimum of the absolute values of all data in the data block;
In an optional scheme, at initialization the fixed-point position of this data block can be determined from the fixed-point positions of other data blocks;
In an optional scheme, the fixed-point position of this data block can be set from empirical values;
Specifically, in the training process, the data representation of a data block can be changed in any iteration cycle:
In an optional scheme, a data block may be left unadjusted;
In an optional scheme, the adjustment can be made every certain number of iterations;
In an optional scheme, the adjustment can be made every certain number of training epochs;
In an optional scheme, the adjustment can be made at non-fixed iteration intervals;
In an optional scheme, the adjustment can be made at non-fixed training-epoch intervals;
Specifically, in the training process, when the representation of a data block is adjusted, it can be adjusted to an arbitrary data representation:
In an optional scheme, if a data block is represented as fixed-point numbers of fixed fixed-point bit width, the fixed-point position of the representation can be adjusted as follows:
In an optional scheme, the fixed-point position is set each time according to the method used to initialize the fixed-point position;
In an optional scheme, if the fixed-point position of a data block, computed according to the initialization method for the fixed-point position, increased in some iteration cycle relative to the previous iteration cycle, the fixed-point position in this period is changed in the increasing direction; conversely, it is changed in the decreasing direction; a sketch of the maximum-absolute-value initialization and of this trend-following adjustment is given below.
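For illustration only, a minimal sketch of two of the schemes above, assuming a signed 16-bit fixed-point format with 15 magnitude bits; the helpers init_point_position and adjust_point_position are hypothetical names, not from the disclosure:

```python
import numpy as np

BITS = 15  # magnitude bits of a signed 16-bit fixed-point number (assumed format)

def init_point_position(block: np.ndarray) -> int:
    """Set the fixed-point position from the maximum absolute value in the
    data block, so that the largest value just fits in the magnitude bits."""
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return BITS
    int_bits = int(np.floor(np.log2(max_abs))) + 1  # bits needed left of the point
    return BITS - max(int_bits, 0)                  # remaining fractional bits

def adjust_point_position(block: np.ndarray, prev_pp: int) -> int:
    """Trend-following adjustment: if the freshly initialized position moved
    up relative to the previous iteration cycle, step the position up; if it
    moved down, step it down; otherwise keep it."""
    new_pp = init_point_position(block)
    if new_pp > prev_pp:
        return prev_pp + 1
    if new_pp < prev_pp:
        return prev_pp - 1
    return prev_pp

pp = init_point_position(np.random.randn(64, 64))        # initialization
pp = adjust_point_position(np.random.randn(64, 64), pp)  # one training-cycle update
```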
Present disclosure also provides an integrated circuit chip device for executing the training of a neural network, the neural network comprising multiple layers, the integrated circuit chip device comprising a processing circuit and an external interface;
the external interface is used for receiving a training instruction;
the processing circuit is used for determining first-layer input data and first-layer weight data according to the training instruction, and executing the n-layer forward operation of the neural network with the first-layer input data and the first-layer weight data to obtain the n-th output result;
the processing circuit is also used for obtaining the n-th output result gradient according to the n-th output result; obtaining the n-th reverse operation of the n-th layer's reverse operation according to the training instruction; obtaining the n-th reverse computational complexity according to the n-th output result gradient, the n-th layer input data, the n-th layer weight group data and the n-th reverse operation; determining, according to the n-th reverse computational complexity, the n-th reverse data type corresponding to the n-th output result gradient, the n-th layer input data and the n-th layer weight group data; and executing the n-layer reverse operation of the neural network on the n-th output result gradient, the n-th layer input data and the n-th layer weight group data in the n-th reverse data type to obtain the n weight gradients of the n-layer operation; the n-th reverse data type includes fixed-point type or floating-point type;
the processing circuit is also used for updating the n weights of the n-layer operation using the n weight gradients.
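The flow just described can be summarized, purely as a hedged software sketch for a stack of linear layers y = x @ W (the device realizes this in hardware; the complexity estimate, the threshold value and the helper cast are illustrative assumptions):

```python
import numpy as np

THRESHOLD = 1e6  # hypothetical preset complexity threshold

def cast(x: np.ndarray, dtype: str, pp: int = 8) -> np.ndarray:
    """Simulate the selected data type: a round trip through signed 16-bit
    fixed point reproduces the precision such data would have."""
    if dtype == "fixed":
        q = np.clip(np.round(x * (1 << pp)), -32768, 32767).astype(np.int16)
        return q.astype(np.float32) / (1 << pp)
    return x.astype(np.float32)

def train_step(x, weights, grad_out, lr=0.01):
    """One iteration: n-layer forward operation, n-layer reverse operation
    with a per-layer data type chosen from a complexity estimate, then the
    weight updates."""
    acts = [x]
    for W in weights:                      # n-layer forward operation
        acts.append(acts[-1] @ W)
    grad = grad_out                        # n-th output result gradient (preset operation)
    new_weights = list(weights)
    for i in range(len(weights) - 1, -1, -1):
        a, W = acts[i], weights[i]
        F, G = a.shape
        E, F2 = W.shape
        complexity = F * G * E * F2        # in the spirit of beta*F*G*E*F
        dtype = "fixed" if complexity > THRESHOLD else "float"
        a_c, g_c = cast(a, dtype), cast(grad, dtype)
        wg = a_c.T @ g_c                   # this layer's weight group gradient
        grad = g_c @ cast(W, dtype).T      # input data gradient -> next output gradient
        new_weights[i] = W - lr * wg       # update the weight group data
    return new_weights

ws = [np.random.randn(64, 64).astype(np.float32) * 0.1 for _ in range(3)]
x = np.random.randn(32, 64).astype(np.float32)
ws = train_step(x, ws, np.ones((32, 64), dtype=np.float32))
```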
Present disclosure also discloses a neural network computing device, which includes one or more of the chips shown in Fig. 3, used for acquiring the data to be operated on and control information from other processing devices, executing the specified neural network operations, and passing the execution results to peripheral equipment through an I/O interface. Peripheral equipment is, for example, a camera, a display, a mouse, a keyboard, a network card, a wifi interface or a server. When more than one chip as shown in Fig. 3 is included, the chips can be linked and transmit data through a specific structure, for example interconnected and transmitting data through a PCIE bus, so as to support larger-scale neural network operations. In that case, the chips can share the same control system or have independent control systems; they can share memory, or each accelerator can have its own memory. In addition, their interconnection manner can be any interconnection topology.
The neural network computing device has high compatibility and can be connected with various types of servers through a PCIE interface.
Present disclosure also discloses a combined processing device, which includes the above neural network computing device, a general interconnection interface and other processing devices (i.e. general-purpose processing devices). The neural network computing device interacts with the other processing devices to jointly complete the operation specified by the user. Fig. 5a is a schematic diagram of the combined processing device.
The other processing devices include one or more of general-purpose/special-purpose processor types such as a central processing unit CPU, a graphics processing unit GPU or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network computing device and external data and control, performing data transport and completing basic control of the neural network computing device such as starting and stopping; the other processing devices can also cooperate with the neural network computing device to jointly complete computing tasks.
The general interconnection interface is used for transmitting data and control instructions between the neural network computing device and the other processing devices. The neural network computing device obtains the required input data from the other processing devices and writes them to the on-chip storage device of the neural network computing device; it can obtain control instructions from the other processing devices and write them to an on-chip control cache of the neural network computing device; it can also read the data in the storage module of the neural network computing device and transmit them to the other processing devices.
As shown in Fig. 5b, the combined processing device optionally further includes a storage device for storing the data required by this computing unit/computing device or by other computing units, and it is particularly suitable for data that need to be operated on but cannot be fully saved in the internal storage of this neural network computing device or of the other processing devices.
The combined processing device can serve as the SoC system-on-chip of equipment such as mobile phones, robots, drones and video monitoring equipment, effectively reducing the die area of the control portion, increasing the processing speed and lowering the overall power consumption. In this case, the general interconnection interface of the combined processing device is connected with certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
Referring to Fig. 5c, Fig. 5c is a schematic structural diagram of a neural network processor board provided by an embodiment of present disclosure. As shown in Fig. 5c, the neural network processor board 10 includes a neural network chip packaging structure 11, a first electrical and non-electrical connection device 12 and a first substrate 13.
Present disclosure does not limit the specific structure of the neural network chip packaging structure 11; optionally, as shown in Fig. 5d, the neural network chip packaging structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112 and a second substrate 113.
The concrete form of the neural network chip 111 involved in present disclosure is not limited; the neural network chip 111 includes, but is not limited to, a neural network chip integrating a neural network processor, and the chip can be made of silicon material, germanium material, quantum material, molecular material or the like. The neural network chip can be packaged according to the actual situation (for example, a harsher environment) and different application demands, so that most of the neural network chip is wrapped, while the pins on the neural network chip are connected to the outside of the packaging structure through conductors such as gold wires for circuit connection with outer layers.
Present disclosure does not limit the specific structure of the neural network chip 111; optionally, reference can be made to the device shown in Fig. 1a or Fig. 1b.
Present disclosure does not limit the types of the first substrate 13 and the second substrate 113, which can be a printed circuit board (PCB) or a printed wiring board (PWB), or possibly another kind of circuit board. The material of which the PCB is made is likewise not limited.
The second substrate 113 involved in present disclosure is used for carrying the neural network chip 111, and the neural network chip packaging structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, is used for protecting the neural network chip 111 and facilitating the further packaging of the neural network chip packaging structure 11 with the first substrate 13.
The specific packaging manner of the above second electrical and non-electrical connection device 112, and the structure corresponding to that packaging manner, are not limited; a suitable packaging manner can be selected, and simply improved, according to the actual situation and different application demands, for example: flip chip ball grid array package (Flip Chip Ball Grid Array Package, FCBGAP), low-profile quad flat package (Low-profile Quad Flat Package, LQFP), quad flat package with heat sink (Quad Flat Package with Heat sink, HQFP), quad flat non-lead package (Quad Flat Non-lead Package, QFN) or fine-pitch ball grid array package (Fine-pitch Ball Grid Package, FBGA).
Flip chip (Flip Chip) is suitable for cases with high requirements on the area after packaging, or with sensitivity to the inductance of the conductors and to the signal transmission time. In addition, the wire bonding (Wire Bonding) packaging manner can be used, which reduces cost and increases the flexibility of the packaging structure.
Ball grid array (Ball Grid Array) can provide more pins, and the average conductor length of the pins is short, giving it the capability of transmitting signals at high speed; the package can be replaced by pin grid array (Pin Grid Array, PGA), zero insertion force (Zero Insertion Force, ZIF), single edge contact connection (Single Edge Contact Connection, SECC), land grid array (Land Grid Array, LGA) and the like.
Optionally, the packaging manner of flip chip ball grid array (Flip Chip Ball Grid Array) is used to package the neural network chip 111 and the second substrate 113; a schematic diagram of the specific neural network chip packaging structure can refer to Fig. 6. As shown in Fig. 6, the neural network chip packaging structure includes a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The pads 22 are connected with the neural network chip 21, and the solder balls 23 are formed by welding between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 and the second substrate 24, i.e. realizing the packaging of the neural network chip 21.
The pins 26 are used for connecting with the external circuit of the packaging structure (for example, the first substrate 13 on the neural network processor board 10), which enables the transmission of external data and internal data and facilitates the processing of data by the neural network chip 21 or by the neural network processor corresponding to the neural network chip 21. Present disclosure likewise does not limit the type and number of the pins; different pin forms can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip packaging structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23 and the connection points 25, for preventing interference between solder balls.
The material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference and the like.
Optionally, the neural network chip packaging structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 runs. The heat dissipation device can be a piece of metal with good thermal conductivity, a heat sink or a radiator, for example a fan.
For example, as shown in Fig. 6a, the neural network chip packaging structure 11 includes a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28 and a metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used for dissipating the heat generated when the neural network chip 21 runs.
Optionally, the neural network chip packaging structure 11 further includes a reinforcement structure, connected with the pads 22 and embedded in the solder balls 23, to enhance the bonding strength between the solder balls 23 and the pads 22.
The reinforcement structure can be a metal wire structure or a columnar structure, which is not limited here.
Present disclosure likewise does not limit the concrete form of the first electrical and non-electrical connection device 12; reference can be made to the description of the second electrical and non-electrical connection device 112, i.e. the neural network chip packaging structure 11 can be packaged by welding, or the second substrate 113 and the first substrate 13 can be connected by connecting wires or in a pluggable manner, which facilitates the subsequent replacement of the first substrate 13 or of the neural network chip packaging structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit extending the storage capacity, for example: synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR) and the like; extending the memory improves the processing capability of the neural network processor.
The first substrate 13 may also include a peripheral component interconnect express (Peripheral Component Interconnect-Express, PCI-E or PCIe) interface, a small form-factor pluggable (Small Form-factor Pluggable, SFP) interface, an Ethernet interface, a controller area network bus (Controller Area Network, CAN) interface and the like, for the data transmission between the packaging structure and external circuits, which can improve the operation speed and the convenience of operation.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip packaging structure 11, and the neural network chip packaging structure 11 is packaged into the neural network processor board 10; data interaction with an external circuit (for example, a computer motherboard) is carried out through an interface (slot or ferrule) on the board, i.e. the function of the neural network processor is realized directly through the neural network processor board 10, and the neural network chip 111 is protected. Other modules can also be added to the neural network processor board 10, which increases the application range and the operational efficiency of the neural network processor.
In one embodiment, present disclosure discloses an electronic device, which includes the above neural network processor board 10 or the neural network chip packaging structure 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicles include aircraft, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound instruments and/or electrocardiographs.
The specific embodiments described above further explain the purpose, technical solutions and beneficial effects of present disclosure in detail. It should be understood that the above are merely specific embodiments of present disclosure and are not intended to limit present disclosure; any modification, equivalent substitution, improvement and the like made within the spirit and principle of present disclosure shall be included within the protection scope of present disclosure.

Claims (16)

1. An integrated circuit chip device, the device being used for executing the training of a neural network, the neural network comprising n layers, where n is an integer greater than or equal to 2, characterized in that the integrated circuit chip device comprises: a main processing circuit, k branch processing circuits and k groups of basic processing circuits, the main processing circuit being separately connected with the k branch processing circuits, each branch processing circuit of the k branch processing circuits corresponding to one group of basic processing circuits of the k groups of basic processing circuits, one group of basic processing circuits comprising at least one basic processing circuit;
the branch processing circuit comprises a data type operation circuit for executing the conversion between floating-point type data and fixed-point type data;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
the main processing circuit is further used for obtaining the n-th output result gradient according to the n-th output result, obtaining the n-th reverse operation of the n-th layer's reverse operation according to the training instruction, obtaining the n-th reverse computational complexity according to the n-th output result gradient, the n-th layer input data, the n-th layer weight group data and the n-th reverse operation, and determining, according to the n-th reverse computational complexity, the n-th reverse data type corresponding to the n-th output result gradient, the n-th layer input data and the n-th layer weight group data;
the main processing circuit is used for dividing the n-th output result gradient, the n-th layer input data and the n-th layer weight group data into a broadcast data block and a distribution data block according to the type of the n-th reverse operation, splitting the distribution data block of the n-th reverse data type to obtain multiple basic data blocks, distributing the multiple basic data blocks to at least one branch processing circuit of the k branch processing circuits, and broadcasting the broadcast data block of the n-th reverse data type to the k branch processing circuits;
the k branch processing circuits are used for converting, through the data type operation circuit, the received broadcast data block and basic data blocks into a broadcast data block of the n-th reverse data type and received basic data blocks of the n-th reverse data type, and transmitting the broadcast data block of the n-th reverse data type and the received basic data blocks of the n-th reverse data type to the basic processing circuits;
the k groups of basic processing circuits are used for executing operations on the broadcast data block and the received basic data blocks in the n-th reverse data type to obtain operation results, and transmitting the operation results to the main processing circuit through the k branch processing circuits;
the main processing circuit is used for processing the operation results to obtain the n-th layer weight group gradient and the n-th layer input data gradient, and updating the n-th layer weight group data using the n-th layer weight group gradient; the n-th reverse data type includes fixed-point type or floating-point type;
the integrated circuit chip device is further used for taking the n-th layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, executing n-1 layers of reverse operations to obtain n-1 layers of weight group gradients, and updating the weight group data of the respective layers using the n-1 layers of weight group gradients, the weight group data comprising at least two weights.
2. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is specifically used for comparing the n-th reverse computational complexity with a preset threshold; if the n-th reverse computational complexity is higher than the preset threshold, the n-th reverse data type is determined to be fixed-point type; if the n-th reverse computational complexity is lower than or equal to the preset threshold, the n-th reverse data type is determined to be floating-point type.
3. The integrated circuit chip device according to claim 2, characterized in that
the main processing circuit is specifically used for determining the (n+1)-th reverse data type to which the n-th output result gradient, the n-th layer input data and the n-th layer weight group data belong; if the (n+1)-th reverse data type is different from the n-th reverse data type, the n-th output result gradient, the n-th layer input data and the n-th layer weight group data belonging to the (n+1)-th reverse data type are converted, through the data type operation circuit, into the n-th output result gradient, the n-th layer input data and the n-th layer weight group data belonging to the n-th reverse data type.
4. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is used for, if the n-th reverse operation is a convolution operation, taking the convolution input data to be the n-th layer input data and the convolution kernel to be the n-th output result gradient,
the n-th reverse computational complexity = α*C*kW*kH*M*N*W*C*H;
where α is a convolution coefficient with a value range greater than 1; C, kW, kH and M are the values of the four dimensions of the convolution kernel, and N, W, C and H are the values of the four dimensions of the convolution input data;
if the complexity is greater than a set threshold, the n-th reverse data type is determined to be floating-point type, and it is determined whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the k branch processing circuits are notified to convert the convolution input data into floating-point data and to convert the convolution kernel into floating-point data, and the convolution operation is then executed on the convolution input data and the convolution kernel in floating-point type.
5. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is further used for, if the n-th reverse operation is a matrix-times-matrix operation, taking the input data to be the n-th layer input data and the weights to be the n-th output result gradient;
complexity = β*F*G*E*F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th layer input data, and E and F are the row and column values of the weights;
if the complexity is greater than a set threshold, the n-th reverse data type is determined to be floating-point type, and it is determined whether the n-th layer input data and the weights are floating-point data; if the n-th layer input data and the weights are not floating-point data, the k branch processing circuits are notified to convert the n-th layer input data into floating-point data and to convert the weights into floating-point data, and the matrix-times-matrix operation is then executed on the n-th layer input data and the weights in floating-point type.
6. The integrated circuit chip device according to claim 1, characterized in that
the integrated circuit chip device is further used for, if the n-th reverse operation is a matrix-times-vector operation, taking the input data to be the n-th layer input data and the weights to be the n-th output result gradient;
complexity = β*F*G*F; where β is a matrix coefficient with a value range greater than or equal to 1, F and G are the row and column values of the n-th layer input data, and F is the column value of the n-th output result gradient;
if the complexity is greater than a set threshold, the n-th reverse data type is determined to be floating-point type, and it is determined whether the n-th layer input data and the weights are floating-point data; if the n-th layer input data and the weights are not floating-point data, the k branch processing circuits are notified to convert the n-th layer input data into floating-point data and to convert the weights into floating-point data, and the matrix-times-vector operation is then executed on the n-th layer input data and the weights in floating-point type.
7. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit is specifically used for: if the type of the n-th reverse operation is a multiplication operation, determining that the n-th layer input data and the n-th layer weight group data are both distribution data blocks and that the n-th output result gradient is a broadcast data block; if the type of the n-th reverse operation is a convolution operation, determining that the n-th layer input data and the n-th layer weight group data are both broadcast data blocks and that the n-th output result gradient is a distribution data block.
8. The integrated circuit chip device according to any one of claims 1-7, characterized in that
the n-layer reverse operations further include one of, or any combination of, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation and an activation operation.
9. The integrated circuit chip device according to claim 1, characterized in that
the main processing circuit includes a main register or a main on-chip cache circuit;
the basic processing circuit includes a basic register or a basic on-chip cache circuit.
10. The integrated circuit chip device according to claim 9, characterized in that
the main processing circuit includes one of, or any combination of, a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit and a data rearrangement circuit.
11. The integrated circuit chip device according to claim 9, characterized in that
the n-th output result gradient is one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block and an n-dimensional data block;
the n-th layer input data are one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block and an n-dimensional data block;
the n-th layer weight group data are one of, or any combination of, a vector, a matrix, a three-dimensional data block, a four-dimensional data block and an n-dimensional data block.
12. A neural network computing device, characterized in that the neural network computing device includes one or more integrated circuit chip devices according to any one of claims 1-11.
13. A combined processing device, characterized in that the combined processing device includes: the neural network computing device according to claim 12, a general interconnection interface and a general processing device;
the neural network computing device is connected with the general processing device through the general interconnection interface.
14. A chip, characterized in that the chip integrates the device according to any one of claims 1-13.
15. A smart device, characterized in that the smart device includes the chip according to claim 14.
16. A neural network operation method, characterized in that the method is applied in an integrated circuit chip device, the integrated circuit chip device including the integrated circuit chip device according to any one of claims 1-11, the integrated circuit chip device being used for executing the forward operation of the neural network.
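For illustration only (not part of the claims), a hedged sketch of the complexity estimates of claims 4-6 together with the threshold comparison of claim 2; α, β and the threshold are free parameters, and the dimension names follow the claim text:

```python
def conv_complexity(alpha, kernel_dims, input_dims):
    """Claim 4: complexity = alpha * C*kW*kH*M * N*W*C*H, with alpha > 1."""
    C, kW, kH, M = kernel_dims  # four dimensions of the convolution kernel
    N, W, Ci, H = input_dims    # four dimensions of the convolution input data
    return alpha * C * kW * kH * M * N * W * Ci * H

def matmul_complexity(beta, F, G, E, F2):
    """Claim 5: complexity = beta * F*G*E*F, with beta >= 1; F and G are the
    input data's row and column values, E and F the weights'."""
    return beta * F * G * E * F2

def matvec_complexity(beta, F, G, Fv):
    """Claim 6: complexity = beta * F*G*F, Fv being the gradient's column value."""
    return beta * F * G * Fv

def choose_reverse_dtype(complexity, threshold):
    """Claim 2: above the preset threshold -> fixed point, else floating point."""
    return "fixed" if complexity > threshold else "float"

# Example: the 1000*1000 matrix-times-matrix case with beta = 1.
c = matmul_complexity(1.0, 1000, 1000, 1000, 1000)
print(choose_reverse_dtype(c, threshold=1e9))  # -> "fixed"
```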
GR01 Patent grant