CN109978156B - Integrated circuit chip device and related product

Integrated circuit chip device and related product

Info

Publication number
CN109978156B
Authority
CN
China
Prior art keywords
data
nth
processing circuit
input data
layer
Prior art date
Legal status
Active
Application number
CN201711469408.3A
Other languages
Chinese (zh)
Other versions
CN109978156A (en)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Priority to CN201711469408.3A priority Critical patent/CN109978156B/en
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to EP18896519.8A priority patent/EP3719712B1/en
Priority to EP20201907.1A priority patent/EP3783477B1/en
Priority to PCT/CN2018/123929 priority patent/WO2019129070A1/en
Priority to EP20203232.2A priority patent/EP3789871B1/en
Publication of CN109978156A publication Critical patent/CN109978156A/en
Application granted granted Critical
Publication of CN109978156B publication Critical patent/CN109978156B/en
Priority to US16/903,304 priority patent/US11544546B2/en
Priority to US17/134,487 priority patent/US11748605B2/en
Priority to US17/134,435 priority patent/US11741351B2/en
Priority to US17/134,446 priority patent/US11748603B2/en
Priority to US17/134,444 priority patent/US11748601B2/en
Priority to US17/134,445 priority patent/US11748602B2/en
Priority to US17/134,486 priority patent/US11748604B2/en
Priority to US18/073,924 priority patent/US20230095610A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present disclosure provides an integrated circuit chip device and related products. The device is used for training a neural network that comprises n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device comprises: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a data type arithmetic circuit, which is used for executing conversion between floating point type data and fixed point type data. The plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the other adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row and the m basic processing circuits of the 1st column. The technical scheme provided by the disclosure has the advantages of a small amount of calculation and low power consumption.

Description

Integrated circuit chip device and related product
Technical Field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background
Artificial Neural Networks (ANN) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is also often referred to directly as a neural network or neural-like network. A neural network is an operational model formed by connecting a large number of nodes (or neurons). Existing neural network operations rely on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) to realize the forward operation of the neural network, and this forward operation involves a large amount of calculation and high power consumption.
Disclosure of Invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed and efficiency of a computing device.
In a first aspect, an integrated circuit chip apparatus is provided for performing training of a neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip apparatus includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes: a data type arithmetic circuit; the data type arithmetic circuit is used for executing conversion between floating point type data and fixed point type data;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the other adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row and the m basic processing circuits of the 1st column;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, determine the nth reverse operation of the nth layer according to the training instruction, obtain an nth reverse operation complexity according to the nth output result gradient, the nth layer input data, the nth layer weight group data and the nth reverse operation, and determine, according to the nth reverse operation complexity, an nth reverse data type corresponding to the nth output result gradient, the nth layer input data and the nth layer weight group data;
the main processing circuit is used for dividing the nth output result gradient, the nth layer input data and the nth layer weight group data into a broadcast data block and a distribution data block according to the nth reverse operation type, splitting the distribution data block of the nth reverse data type to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to at least one branch processing circuit in a basic processing circuit connected with the main processing circuit, and broadcasting the broadcast data block of the nth reverse data type to the basic processing circuit connected with the main processing circuit;
the basic processing circuit is used for executing operation in the neural network in a parallel mode according to the broadcast data block of the nth reverse data type and the basic data block of the nth reverse data type to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit;
the main processing circuit is used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient; the nth reverse data type includes: a fixed point type or a floating point type;
the integrated circuit chip device is further used for taking the nth layer input data gradient as the output result gradient of the (n-1)th layer, executing the (n-1)th layer reverse operation to obtain the (n-1)th layer weight group gradient, and updating the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient, wherein the weight group data comprises at least two weights.
In a second aspect, a neural network computing device is provided, which includes one or more integrated circuit chip devices provided in the first aspect.
In a third aspect, there is provided a combined processing apparatus comprising: the neural network arithmetic device, the universal interconnection interface and the universal processing device are provided by the second aspect;
the neural network operation device is connected with the general processing device through the general interconnection interface.
In a fourth aspect, a chip is provided that integrates the apparatus of the first aspect, the apparatus of the second aspect, or the apparatus of the third aspect.
In a fifth aspect, an electronic device is provided, which comprises the chip of the fourth aspect.
It can be seen that, in the embodiments of the present disclosure, a data type conversion circuit is provided to convert the type of a data block before the operation is performed, which saves transmission resources and calculation resources; the scheme therefore has the advantages of low power consumption and a small amount of calculation.
Drawings
Fig. 1 is a schematic diagram of a training method of a neural network.
FIG. 1a is a schematic diagram of a forward operation of a neural network.
FIG. 1b is a schematic block diagram of a fixed point data type.
Fig. 2a is a schematic diagram of convolved input data.
Fig. 2b is a schematic diagram of a convolution kernel.
FIG. 2c is a diagram of an operation window of a three-dimensional data block of input data.
FIG. 2d is a diagram of another exemplary window for inputting a three-dimensional data block of data.
FIG. 2e is a diagram of another operation window of a three-dimensional data block of input data.
Fig. 3 is a schematic structural diagram of a neural network chip.
Fig. 4a is a schematic diagram of matrix multiplication.
Fig. 4b is a flow chart of a method for multiplying a matrix by a matrix.
FIG. 4c is a diagram of a matrix multiplied by a vector.
FIG. 4d is a flow chart of a method for multiplying a matrix by a vector.
Fig. 4e is a schematic diagram of neural network training.
FIG. 4f is a schematic diagram of another neural network training scheme.
FIG. 4g is a diagram illustrating the forward and backward operations of the neural network.
FIG. 4h is a diagram of a multi-layer structure for neural network training.
Fig. 5a is a schematic structural diagram of a combined processing device according to the disclosure.
Fig. 5b is a schematic view of another structure of a combined processing device disclosed in the present disclosure.
Fig. 5c is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure.
Fig. 5d is a schematic structural diagram of a neural network chip package structure according to an embodiment of the present disclosure.
Fig. 5e is a schematic structural diagram of a neural network chip according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a neural network chip package structure according to an embodiment of the present disclosure.
Fig. 6a is a schematic diagram of another neural network chip package structure according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those skilled in the art, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to compare the nth inverse operation complexity with a preset threshold, determine that the nth inverse data type is a fixed-point type if the nth inverse operation complexity is higher than the preset threshold, and determine that the nth inverse data type is a floating-point type if the nth inverse operation complexity is lower than or equal to the preset threshold.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to determine an n +1 th reverse data type to which the nth output result gradient, the nth layer input data, and the nth layer weight group data belong, and if the n +1 th reverse data type is different from the nth reverse data type, convert the nth output result gradient, the nth layer input data, and the nth layer weight group data belonging to the n +1 th reverse data type into the nth output result gradient, the nth layer input data, and the nth layer weight group data belonging to the nth reverse data type through the data type operation circuit.
In the apparatus provided in the first aspect, the main processing circuit is configured to perform, as the nth layer reverse operation, a convolution operation, wherein the convolution input data is the nth layer input data and the convolution kernel is the nth output result gradient;
the nth reverse operation complexity = α × C × kH × kW × M × N × W × C × H;
wherein α is a convolution coefficient with a value greater than 1, C, kH, kW and M are the values of the four dimensions of the convolution kernel, and N, W, C and H are the values of the four dimensions of the convolution input data;
if the complexity is greater than the set threshold, the nth reverse data type is determined to be the fixed point data type; it is then determined whether the convolution input data and the convolution kernel are fixed point data, and if not, the convolution input data is converted into fixed point data and the convolution kernel is converted into fixed point data, after which the convolution operation is performed on the convolution input data and the convolution kernel in the fixed point data type.
In the apparatus provided in the first aspect, the main processing circuit is further configured to perform, as the nth reverse operation, a matrix-multiplied-by-matrix operation, wherein the input data is the nth layer input data and the weight is the nth output result gradient;
the complexity = β × F × G × E × F, wherein β is a matrix coefficient with a value greater than or equal to 1, F and G are the row and column values of the nth layer input data, and E and F are the row and column values of the weight;
if the complexity is greater than the set threshold, the nth reverse data type is determined to be the fixed point data type; it is then determined whether the nth layer input data and the weight are fixed point data, and if not, the nth layer input data and the weight are converted into fixed point data, after which the matrix-multiplied-by-matrix operation is performed on the nth layer input data and the weight in the fixed point data type.
In the apparatus provided in the first aspect, the integrated circuit chip apparatus is further configured to perform, as the nth reverse operation, a matrix-multiplied-by-vector operation, wherein the input data is the nth layer input data and the weight is the nth output result gradient;
the complexity = β × F × G × F, wherein β is a matrix coefficient with a value greater than or equal to 1, F and G are the row and column values of the nth layer input data, and F is the column value of the nth output result gradient;
if the complexity is greater than the set threshold, the nth reverse data type is determined to be the fixed point data type; it is then determined whether the nth layer input data and the weight are fixed point data, and if not, the k branch processing circuits are notified to convert the nth layer input data into fixed point data, the weight is converted into fixed point data, and then the matrix-multiplied-by-vector operation is performed on the nth layer input data and the weight in the fixed point data type.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to determine that the nth layer input data and the nth layer weight group data are both distribution data blocks if the nth inverse operation is a multiplication operation, and the nth output result gradient is a broadcast data block; if the type of the nth reverse operation is convolution operation, determining that the nth layer input data and the nth layer weight group data are broadcast data blocks, and the nth output result gradient is a distribution data block.
In the apparatus provided in the first aspect, the nth layer reverse operation further includes one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation and an activation operation.
In an apparatus provided in the first aspect, the main processing circuit includes: a main register or a main on-chip cache circuit;
the base processing circuit includes: basic registers or basic on-chip cache circuits.
In an apparatus provided in the first aspect, the main processing circuit includes: one or any combination of vector arithmetic unit circuit, arithmetic logic unit circuit, accumulator circuit, matrix transposition circuit, direct memory access circuit or data rearrangement circuit.
In the apparatus provided in the first aspect, the nth output result gradient is: one or any combination of vector, matrix, three-dimensional data block, four-dimensional data block and n-dimensional data block;
the nth layer of input data is as follows: one or any combination of vector, matrix, three-dimensional data block, four-dimensional data block and n-dimensional data block;
the n layers of weight group data are as follows: one or any combination of vectors, matrices, three-dimensional data blocks, four-dimensional data blocks, and n-dimensional data blocks.
As shown in fig. 1, the step of neural network training includes:
each layer in a (multi-layer) neural network performs forward operations in turn;
performing the reverse operation layer by layer in the opposite order to obtain the weight gradients;
updating the weights used in the forward operation with the computed weight gradients;
this constitutes one iteration of neural network training, and the whole training process needs to repeat this iteration many times (i.e., a plurality of iterative computations), as sketched below.
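The iteration structure described above can be illustrated with a minimal software sketch. The toy Dense layer, the squared-error loss gradient, the layer sizes and the learning rate below are illustrative assumptions and do not correspond to any circuit or interface defined in this disclosure:
```python
import numpy as np

class Dense:
    """Toy fully connected layer, used only to illustrate the training iteration."""
    def __init__(self, n_in, n_out):
        self.weights = np.random.randn(n_in, n_out) * 0.1

    def forward(self, x):
        return x @ self.weights

    def backward(self, x, out_grad):
        weight_grad = x.T @ out_grad           # gradient of this layer's weights
        in_grad = out_grad @ self.weights.T    # gradient passed back to the previous layer
        return weight_grad, in_grad

def train_step(layers, x, target, lr=0.01):
    acts = [x]
    for layer in layers:                       # forward operation, layer by layer
        acts.append(layer.forward(acts[-1]))
    grad = acts[-1] - target                   # gradient of an assumed squared-error loss
    for layer, inp in zip(reversed(layers), reversed(acts[:-1])):
        w_grad, grad = layer.backward(inp, grad)   # reverse operation, opposite order
        layer.weights -= lr * w_grad           # weight update with the computed gradient

layers = [Dense(8, 16), Dense(16, 4)]
for _ in range(100):                           # the iteration is repeated many times
    train_step(layers, np.random.randn(2, 8), np.zeros((2, 4)))
```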
Referring to fig. 3, fig. 3 shows an integrated circuit chip device used for training a neural network; the neural network includes n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes: a data type arithmetic circuit; the data type arithmetic circuit is used for executing conversion between floating point type data and fixed point type data;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the other adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row and the m basic processing circuits of the 1st column;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, determine the nth reverse operation of the nth layer according to the training instruction, obtain an nth reverse operation complexity according to the nth output result gradient, the nth layer input data, the nth layer weight group data and the nth reverse operation, and determine, according to the nth reverse operation complexity, an nth reverse data type corresponding to the nth output result gradient, the nth layer input data and the nth layer weight group data;
the main processing circuit is used for dividing the nth output result gradient, the nth layer input data and the nth layer weight group data into a broadcast data block and a distribution data block according to the nth reverse operation type, splitting the distribution data block of the nth reverse data type to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to at least one branch processing circuit in a basic processing circuit connected with the main processing circuit, and broadcasting the broadcast data block of the nth reverse data type to the basic processing circuit connected with the main processing circuit;
the basic processing circuit is used for executing operation in the neural network in a parallel mode according to the broadcast data block of the nth reverse data type and the basic data block of the nth reverse data type to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit;
the main processing circuit is used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient; the nth reverse data type includes: a fixed point type or a floating point type;
the integrated circuit chip device is further used for taking the nth layer input data gradient as the output result gradient of the (n-1)th layer, executing the (n-1)th layer reverse operation to obtain the (n-1)th layer weight group gradient, and updating the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient, wherein the weight group data comprises at least two weights.
As shown in fig. 1a, for the forward operation of the neural network provided by the embodiment of the present disclosure, each layer uses its own input data and weight to calculate according to the operation rule specified by the type of the layer to obtain corresponding output data;
the forward operation process (also called inference) of the neural network is a process of processing input data of each layer by layer and obtaining output data through certain calculation, and has the following characteristics:
input to a certain layer:
the input of a certain layer can be input data of a neural network;
the input of a certain layer may be the output of other layers;
the input of a certain layer may be the output of the same layer at a previous time step (corresponding to the case of a recurrent neural network);
a layer may obtain input from a plurality of said input sources simultaneously;
output of a certain layer:
the output of a certain layer can be used as the output result of the neural network;
the output of a certain layer may be the input of other layers;
the output of a certain layer may be the input of the same layer at the next time step (in the case of a recurrent neural network);
the output of a certain layer may be sent to a plurality of the above output destinations simultaneously;
specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:
convolutional layers (i.e., performing convolution operations);
fully-connected layers (i.e., performing fully-connected operations);
normalization (regularization) layer: including the LRN (Local Response Normalization) layer, the BN (Batch Normalization) layer, etc.;
a pooling layer;
an activation layer: including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer;
the inverse operations of the layers: each layer's inverse operation needs to perform two parts of computation: one part uses the output data gradient, which may be a sparse representation, and the input data, which may be a sparse representation, to calculate the gradient of the weights (used in the "weight update" step to update the weights of this layer); the other part uses the output data gradient, which may be a sparse representation, and the weights, which may be a sparse representation, to calculate the gradient of the input data (which serves as the output data gradient of the next layer in the reverse pass, for that layer's own inverse operation);
the backward operation propagates the gradient back from the last layer, in the reverse order of the forward operation, as sketched below.
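A minimal sketch of these two gradient computations for a dense (non-sparse) fully connected layer, written with NumPy; the function name, the shapes and the omission of sparse representations are assumptions made only for illustration:
```python
import numpy as np

def dense_layer_backward(input_data, weights, output_grad):
    """Inverse operation of one fully connected layer (dense case).

    Returns (weight_grad, input_grad): the former is used in the "weight update"
    step of this layer, the latter becomes the output data gradient handed to
    the previous layer's inverse operation.
    """
    weight_grad = input_data.T @ output_grad   # gradient of the weights
    input_grad = output_grad @ weights.T       # gradient of the input data
    return weight_grad, input_grad

# example shapes: batch of 4 samples, 8 inputs, 3 outputs
x = np.random.randn(4, 8)
w = np.random.randn(8, 3)
g_out = np.random.randn(4, 3)
g_w, g_in = dense_layer_backward(x, w, g_out)
assert g_w.shape == (8, 3) and g_in.shape == (4, 8)
```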
In one alternative, the inverse calculated output data gradient for a layer may be from:
the gradient returned by the loss function (also called the cost function) at the end of the neural network;
input data gradients for other layers;
the input data gradient of this layer at a previous time step (corresponding to the case of the recurrent neural network);
a layer may simultaneously acquire output data gradients from a plurality of said sources;
after the reverse operation of the neural network is executed, calculating the weight gradient of each layer, wherein in the step, a first input cache and a second input cache of the device are respectively used for storing the weight of the layer and the gradient of the weight, and then the weight is updated by using the weight gradient in an operation unit;
In the forward operation, after the artificial neural network of the previous layer has finished executing, the operation instruction of the next layer takes the output data calculated in the operation unit as the input data of the next layer (or performs some operation on that output data before using it as the input data of the next layer), and at the same time the weights are replaced by the weights of the next layer. In the reverse operation, after the reverse operation of the artificial neural network of the previous layer has finished, the operation instruction of the next layer takes the input data gradient calculated in the operation unit as the output data gradient of the next layer (or performs some operation on that input data gradient before using it as the output data gradient of the next layer), and at the same time the weights are replaced by the weights of the next layer. (In the drawings that follow, dashed arrows indicate the reverse operation, solid arrows indicate the forward operation, and the labels below the drawings indicate their meanings.)
Fixed point data representation method
The fixed-point method converts the data representation of a certain data block in the network into a representation with a specific, fixed decimal point position (mapped onto the 0/1 bits of a circuit device);
in one alternative scheme, a plurality of data are combined into a data block as a whole to be represented in a fixed point mode by using the same fixed point representation method;
FIG. 1b illustrates a specific representation of the short-bit fixed point data structure used for storing data according to an embodiment of the present disclosure, wherein 1 bit is used to represent the sign, M bits are used to represent the integer part, and N bits are used to represent the fractional part. Compared with a 32-bit floating point data representation, the short-bit fixed point data representation adopted here occupies fewer bits, and an additional flag bit, Point location, is provided to record the position of the decimal point for data of the same layer and the same type in the neural network, such as all the weight data of the first convolution layer; in this way the precision of the data representation and the representable data range can be adjusted according to the distribution of the actual data.
A floating point number occupies 32 bits, whereas in the present technical scheme a fixed point number can represent one value with fewer bits, so that both the amount of data transmitted and the amount of data operated on are reduced.
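A minimal sketch of this short-bit fixed point representation, assuming a 16-bit width with 1 sign bit and one shared point location per data block; the helper names, the bit width and the example values are illustrative assumptions only:
```python
def to_fixed(values, point_location, total_bits=16):
    """Quantize floats to fixed point integers sharing one Point location."""
    scale = 2.0 ** point_location              # negative point_location gives fractional bits
    qmin = -(1 << (total_bits - 1))            # 1 bit is reserved for the sign
    qmax = (1 << (total_bits - 1)) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def to_float(fixed_values, point_location):
    """Convert the stored integers back to floating point values."""
    scale = 2.0 ** point_location
    return [q * scale for q in fixed_values]

weights = [0.75, -1.5, 0.03125]
q = to_fixed(weights, point_location=-8)       # 8 fractional bits
print(q, to_float(q, -8))                      # these values round-trip exactly
```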
The input data is shown in fig. 2a (N samples, each sample having C channels, with the feature map of each channel having height H and width W), and the weights, i.e. the convolution kernels, are shown in fig. 2b (M convolution kernels, each with C channels, of height KH and width KW). For the N samples of input data, the rule of the convolution operation is the same; the following explains the convolution process for one sample. Each of the M convolution kernels performs the same kind of operation, each convolution kernel produces one planar feature map, and the M convolution kernels together compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product is computed at each planar position of the sample, and the kernel then slides along the H and W directions; for example, fig. 2c shows a convolution kernel computing an inner product at the lower-right position of one sample of input data, fig. 2d shows the position after the convolution slides one grid cell to the left, and fig. 2e shows the position after the convolution slides one grid cell upwards.
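The sliding inner-product view of convolution described above can be sketched in a few lines; the tensor shapes, a stride of 1 and the absence of padding are assumptions for illustration and are not prescribed by the disclosure:
```python
import numpy as np

def conv_single_kernel(sample, kernel):
    """sample: (C, H, W) input data; kernel: (C, kH, kW). Returns one feature map."""
    C, H, W = sample.shape
    _, kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):              # slide along the H direction
        for j in range(out.shape[1]):          # slide along the W direction
            window = sample[:, i:i + kH, j:j + kW]
            out[i, j] = np.sum(window * kernel)    # inner product at this position
    return out

sample = np.random.randn(3, 8, 8)              # C=3, H=8, W=8
kernels = np.random.randn(4, 3, 2, 2)          # M=4 kernels -> M feature maps
feature_maps = [conv_single_kernel(sample, k) for k in kernels]
```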
When the first operation is a convolution operation, the input data is convolution input data, the weight data is a convolution kernel,
the first complexity = α × C × kH × kW × M × N × W × C × H;
wherein α is a convolution coefficient with a value range larger than 1, C, kH, kW and M are values of four dimensions of a convolution kernel, and N, W, C, H is a value of four dimensions of convolution input data;
if the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are fixed point data; if they are not fixed point data, the convolution input data is converted into fixed point data and the convolution kernel is converted into fixed point data, and then the convolution operation is performed on the convolution input data and the convolution kernel in the fixed point data type.
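A sketch of this decision for the convolution case, following the fixed-point reading that is consistent with step S401b below (high complexity leads to fixed point). The coefficient value, the threshold, the scale factor and the tensor shapes are illustrative assumptions:
```python
import numpy as np

def first_complexity_conv(C, kH, kW, M, N, W, H, alpha=2.0):
    """alpha: convolution coefficient (> 1); C, kH, kW, M: kernel dims; N, W, C, H: input dims."""
    return alpha * C * kH * kW * M * N * W * C * H

def prepare_conv_operands(conv_input, kernel, complexity, threshold=1e9, frac_bits=8):
    """Convert both operands to fixed point only when the first complexity exceeds the threshold."""
    if complexity > threshold:
        scale = 1 << frac_bits                 # assumed shared point location
        if conv_input.dtype.kind == "f":
            conv_input = np.round(conv_input * scale).astype(np.int32)
        if kernel.dtype.kind == "f":
            kernel = np.round(kernel * scale).astype(np.int32)
    return conv_input, kernel

# example: kernel dims C=64, kH=kW=3, M=128; input dims N=4, W=H=56
cplx = first_complexity_conv(64, 3, 3, 128, 4, 56, 56)
x = np.random.randn(4, 64, 56, 56).astype(np.float32)
w = np.random.randn(128, 64, 3, 3).astype(np.float32)
x_q, w_q = prepare_conv_operands(x, w, cplx)   # converted: cplx far exceeds the threshold
```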
Specifically, the convolution processing may be performed by using a chip structure as shown in fig. 3, a data conversion operation circuit of a main processing circuit (which may also be referred to as a main unit) may convert data in part or all of convolution kernels of weights into fixed-point type data when the first complexity is greater than a set threshold, and a control circuit of the main processing circuit transmits data in part or all of convolution kernels of weights to basic processing circuits (which may also be referred to as basic units) directly connected to the main processing circuit through a horizontal data input interface;
in one alternative scheme, the control circuit of the main processing circuit sends the data of a certain convolution kernel in the weights to a certain basic processing circuit one number, or one part of the numbers, at a time; (for example, for a given basic processing circuit, the 1st number of row 3 is transmitted the 1st time, the 2nd number of row 3 the 2nd time, the 3rd number of row 3 the 3rd time, and so on; or the first two numbers of row 3 are transmitted the 1st time, the 3rd and 4th numbers of row 3 the 2nd time, the 5th and 6th numbers of row 3 the 3rd time, and so on;)
In another alternative, the control circuit of the main processing circuit sends the data of several convolution kernels in the weights to a certain basic processing circuit one number at a time; (for example, for a basic processing circuit, the 1st number of each of rows 3, 4 and 5 is transmitted the 1st time, the 2nd number of each of rows 3, 4 and 5 the 2nd time, the 3rd number of each of rows 3, 4 and 5 the 3rd time, and so on; or the first two numbers of each of rows 3, 4 and 5 are transmitted the 1st time, the 3rd and 4th numbers of each of rows 3, 4 and 5 the 2nd time, the 5th and 6th numbers of each of rows 3, 4 and 5 the 3rd time, and so on;)
The control circuit of the main processing circuit divides the input data according to the convolution positions, and the control circuit of the main processing circuit sends the data in partial or all convolution positions in the input data to the basic processing circuits which are directly connected with the main processing circuit through the vertical data input interface;
in one alternative, the control circuit of the main processing circuit sends data at a certain convolution position in the input data to a certain basic processing circuit one number or a part of numbers at a time; (for example, for a basic processing circuit, the 1 st transmission of the 1 st number of the 3 rd column, the 2 nd transmission of the 2 nd number in the 3 rd column data, the 3 rd transmission of the 3 rd column of … …, or the 1 st transmission of the first two numbers of the 3 rd column, the second transmission of the 3 rd and 4 th numbers of the 3 rd column, the third transmission of the 3 rd column of the 5 th and 6 th numbers of … …;)
In an alternative, the control circuit of the main processing circuit sends data of a certain number of convolution positions in the input data to a certain basic processing circuit one number or a part of numbers at a time; (for example, for a base processing circuit, the 1 st transmission of the 1 st number of columns 3,4,5 per column, the 2 nd transmission of the 2 nd number of columns 3,4,5 per column, the 3 rd transmission of the 3 rd number of columns 3,4,5 per column … …, or the 1 st transmission of the first two numbers of columns 3,4,5 per column, the second transmission of the 3 rd and 4 th numbers of columns 3,4,5 per column, the third transmission of the 5 th and 6 th numbers of columns 3,4,5 per column … …;)
After the basic processing circuit receives the data of the weight, the data is transmitted to the next basic processing circuit connected with the basic processing circuit through a transverse data output interface of the basic processing circuit; after receiving the data of the input data, the basic processing circuit transmits the data to the next basic processing circuit connected with the basic processing circuit through a vertical data output interface of the basic processing circuit;
each basic processing circuit operates on the received data;
in one alternative, the base processing circuitry computes a multiplication of one or more sets of two data at a time, and then accumulates the results onto registers and/or on-chip caches;
in one alternative, the base processing circuitry computes the inner product of one or more sets of two vectors at a time, and then accumulates the results onto a register and/or on-chip cache;
after the basic processing circuit calculates the result, the result can be transmitted out from the data output interface;
in one alternative, the calculation result may be the final result or an intermediate result of the inner product operation;
specifically, the result is transmitted from the output interface if the basic processing circuit has the output interface directly connected to the main processing circuit, and if not, the result is output in the direction of the basic processing circuit capable of directly outputting to the main processing circuit.
After receiving the calculation results from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or main processing circuits connected with the basic processing circuit;
outputting the result in a direction capable of being directly output to the main processing circuit (for example, the bottom row of basic processing circuits directly outputs the output result to the main processing circuit, and the other basic processing circuits transmit the operation result downwards from the vertical output interface);
the main processing circuit receives the inner product operation result of each basic processing circuit, and the output result can be obtained.
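The dataflow just described (weight data streamed in from the horizontal interfaces, input data from the vertical interfaces, a multiply-accumulate in every basic processing circuit, and results collected by the main processing circuit) can be modelled very roughly in software. The output-stationary simplification below ignores timing and interfaces and is only an illustrative assumption, not a description of the circuit:
```python
import numpy as np

def systolic_like_matmul(A, B):
    """Model where circuit (i, j) accumulates row i of A against column j of B."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    acc = np.zeros((m, n))                     # one local accumulator per basic circuit
    for step in range(k):                      # data streamed in over k steps
        a_col = A[:, step]                     # travels horizontally along each row
        b_row = B[step, :]                     # travels vertically down each column
        acc += np.outer(a_col, b_row)          # each circuit does one multiply-accumulate
    return acc                                 # collected by the main processing circuit

A = np.random.randn(4, 6)
B = np.random.randn(6, 5)
assert np.allclose(systolic_like_matmul(A, B), A @ B)
```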
Referring to fig. 4a, fig. 4a is a matrix-by-matrix operation, such as the first operation: matrix multiplication matrix operation, wherein the input data is a first matrix of the matrix multiplication matrix operation, and the weight is a second matrix of the matrix multiplication matrix operation;
the first complexity is β F G E F, wherein β is a matrix coefficient, the value range is more than or equal to 1, F, G is the row and column values of the first matrix, and E, F is the row and column values of the second matrix;
if the first complexity is greater than the set threshold, it is determined whether the first matrix and the second matrix are fixed point data; if they are not fixed point data, the first matrix is converted into fixed point data and the second matrix is converted into fixed point data, and then the matrix-multiplied-by-matrix operation is performed on the first matrix and the second matrix in the fixed point data type.
Referring to FIG. 4b, the matrix multiplication operation is performed using the apparatus shown in FIG. 3;
the following describes the operation of calculating the multiplication of a matrix S of size M rows and L columns and a matrix P of size L rows and N columns, (each row in the matrix S being the same length as each column of the matrix P, as shown in fig. 2 d) the neural network computing device possesses K basic processing circuits:
step S401b, when the first complexity is larger than the set threshold, the main processing circuit converts the matrix S and the matrix P into fixed point type data, the control circuit of the main processing circuit distributes each row of data in the matrix S to one of K basic processing circuits, and the basic processing circuits store the received data in an on-chip cache and/or a register; specifically, the K basic processing circuits may be sent to the basic processing circuit connected to the main processing circuit.
In one alternative, if the number of rows M of S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the S matrix to each of M basic processing circuits;
in another alternative, if the number of rows M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit.
In S, Mi rows are distributed to the ith basic processing circuit, and the set of Mi rows is called Ai, as shown in fig. 2e, which represents the calculation to be performed on the ith basic processing circuit.
In one alternative, in each base processing circuit, for example, in the ith base processing circuit:
the matrix Ai distributed by the main processing circuit is received and stored in the register and/or on-chip cache of the ith basic processing circuit; this has the advantages of reducing the subsequent data transmission amount, improving the calculation efficiency and reducing the power consumption.
Step S402b, the control circuit of the main processing circuit transmits each part in the matrix P to each basic processing circuit in a broadcasting mode;
in an alternative scheme, each part of the matrix P may be broadcast only once to the register or on-chip cache of each basic processing circuit, and the ith basic processing circuit fully multiplexes the data of the matrix P obtained this time to complete the inner product operation corresponding to each row of the matrix Ai; multiplexing in this embodiment specifically means that data is used repeatedly in the calculation by the basic processing circuit; for example, multiplexing the data of the matrix P means that the data of the matrix P is used multiple times.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit does not multiplex the data of the matrix P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai for multiple times;
in an alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit performs partial multiplexing on the data of the matrix P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai;
in one alternative, each basic processing circuit, for example the ith basic processing circuit, calculates the inner product of the data of matrix Ai and the data of matrix P;
in step S403b, the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit.
In one alternative, the base processing circuit may transmit the partial sums obtained by performing the inner product operation each time back to the main processing circuit for accumulation;
in an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time can be stored in a register and/or an on-chip cache of the basic processing circuit, and the partial sum is transmitted back to the main processing circuit after the accumulation is finished;
in an alternative, the partial sum obtained by the inner product operation performed by the basic processing circuit each time may be stored in a register and/or an on-chip buffer of the basic processing circuit in some cases for accumulation, and transmitted to the main processing circuit for accumulation in some cases, and transmitted back to the main processing circuit after the accumulation is finished.
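A software sketch of steps S401b to S403b above: rows of S are distributed over the K basic processing circuits, P is broadcast to all of them, each circuit computes the inner products for its rows, and the main processing circuit reassembles the result. The round-robin assignment, K = 4 and the identity stand-in for the fixed point conversion are assumptions for illustration:
```python
import numpy as np

def distributed_matmul(S, P, K=4, to_fixed=lambda x: x):
    S, P = to_fixed(S), to_fixed(P)            # S401b: conversion when complexity is high (stubbed)
    # S401b: distribute the rows of S over the K basic processing circuits
    assignments = [list(range(i, S.shape[0], K)) for i in range(K)]
    # S402b: broadcast P; S403b: each circuit computes the inner products for its rows
    partial = {i: S[rows] @ P for i, rows in enumerate(assignments) if rows}
    # the main processing circuit reassembles the result from the returned rows
    out = np.empty((S.shape[0], P.shape[1]))
    for i, rows in enumerate(assignments):
        if rows:
            out[rows] = partial[i]
    return out

S = np.random.randn(7, 5)                      # M=7 rows, L=5 columns
P = np.random.randn(5, 3)                      # L=5 rows, N=3 columns
assert np.allclose(distributed_matmul(S, P), S @ P)
```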
Fig. 4c is a schematic diagram of a matrix multiplied by a vector. If the first operation is: performing matrix multiplication vector operation, wherein the input data is a first matrix of the matrix multiplication vector operation, and the weight is a vector of the matrix multiplication vector operation;
the first complexity is β F G F, wherein β is a matrix coefficient, the value range is more than or equal to 1, F, G is the row and column values of the first matrix, and F is the column value of the vector;
if the first complexity is greater than the set threshold, it is determined whether the first matrix and the vector are fixed point data; if they are not fixed point data, the first matrix is converted into fixed point data and the vector is converted into fixed point data, and then the matrix-multiplied-by-vector operation is performed on the first matrix and the vector in the fixed point data type.
Referring to fig. 4d, fig. 4d provides an implementation method of matrix multiplication vector, which may specifically include:
step S401, each row of data in the matrix S is converted into fixed-point type data by a data conversion operation circuit of a main processing circuit, a control circuit of the main processing circuit distributes the data to one of K basic processing circuits, and the basic processing circuits store the received distributed data in an on-chip cache and/or a register of the basic processing circuit;
in an alternative, if the number of rows M of the matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits;
in an alternative, if the number of rows M of the matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit.
The set of rows in S distributed to the ith basic processing circuit is Ai, and there are Mi rows in total, as fig. 2c shows the calculations to be performed on the ith basic processing circuit.
In one alternative, in each base processing circuit, e.g., the ith base processing circuit, the received dispatch data, e.g., the matrix Ai, may be stored in a register and/or on-chip cache of the ith base processing circuit; the method has the advantages of reducing the data transmission quantity of the subsequent distribution data, improving the calculation efficiency and reducing the power consumption.
Step S402, a data type arithmetic circuit of the main processing circuit converts the vector P into fixed point type data, and a control circuit of the main processing circuit transmits each part in the fixed point type vector P to K basic processing circuits in a broadcasting mode;
in an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once to the register or on-chip buffer of each basic processing circuit, and the ith basic processing circuit may fully multiplex the data of the vector P obtained this time, and perform the inner product operation corresponding to each row in the matrix Ai. The method has the advantages of reducing the data transmission quantity of repeated transmission of the vector P from the main processing circuit to the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit does not multiplex the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai for multiple times; the method has the advantages of reducing the data transmission quantity of the vector P of single transmission in the basic processing circuit, reducing the capacity of the cache and/or the register of the basic processing circuit, improving the execution efficiency, reducing the transmission power consumption and reducing the cost.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit performs partial multiplexing on the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai; the method has the advantages of reducing the data transmission quantity from the main processing circuit to the basic processing circuit, reducing the data transmission quantity in the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
Step S403: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of the matrix S and the vector P; for example, the ith basic processing circuit computes the inner product of the data of the matrix Ai and the data of the vector P;
and S404, accumulating the results of the inner product operation by the accumulator circuits of the K basic processing circuits to obtain accumulated results, and transmitting the accumulated results back to the main processing circuit in a fixed-point type mode.
In an alternative, the partial sums obtained from each inner product operation performed by the basic processing circuit (i.e., a portion of the accumulated result; for example, if the inner product is F1×G1 + F2×G2 + F3×G3 + F4×G4 + F5×G5, a partial sum may be the value of F1×G1 + F2×G2 + F3×G3) may be transmitted back to the main processing circuit for accumulation; this has the advantage of reducing the amount of computation inside the basic processing circuit and improving the operation efficiency of the basic processing circuit.
In an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time can be stored in a register and/or an on-chip cache of the basic processing circuit, and the partial sum is transmitted back to the main processing circuit after the accumulation is finished; the method has the advantages of reducing the data transmission quantity between the basic processing circuit and the main processing circuit, improving the operation efficiency and reducing the data transmission power consumption.
In an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time is stored in a register and/or an on-chip cache of the basic processing circuit for accumulation in partial cases, and is transmitted to the main processing circuit for accumulation in partial cases, and is transmitted back to the main processing circuit after the accumulation is finished; the method has the advantages of reducing the data transmission quantity between the basic processing circuit and the main processing circuit, improving the operation efficiency, reducing the data transmission power consumption, reducing the operation quantity in the basic processing circuit and improving the operation efficiency of the basic processing circuit.
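The difference between the partial-sum alternatives above can be shown with a small sketch: in one case every partial sum is transmitted back and accumulated in the main processing circuit, in the other the basic processing circuit accumulates locally and returns a single value. The chunk size and function names are assumptions for illustration:
```python
def inner_product_send_partials(a_row, p_vec, chunk=2):
    """(a) Every partial sum is transmitted back; the main circuit accumulates."""
    received = []
    for start in range(0, len(a_row), chunk):
        partial = sum(x * y for x, y in zip(a_row[start:start + chunk],
                                            p_vec[start:start + chunk]))
        received.append(partial)               # one transfer back per partial sum
    return sum(received)                       # accumulation in the main processing circuit

def inner_product_accumulate_locally(a_row, p_vec, chunk=2):
    """(b) The basic circuit keeps a local accumulator and returns it once."""
    acc = 0.0
    for start in range(0, len(a_row), chunk):
        acc += sum(x * y for x, y in zip(a_row[start:start + chunk],
                                         p_vec[start:start + chunk]))
    return acc                                 # a single transfer back

row = [1.0, 2.0, 3.0, 4.0, 5.0]
vec = [0.5, -1.0, 2.0, 0.0, 1.0]
assert abs(inner_product_send_partials(row, vec)
           - inner_product_accumulate_locally(row, vec)) < 1e-9
```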
Neural network training method
All data involved in the neural network training process can adopt different data representation methods;
specifically, the data representation method includes, but is not limited to, the following cases:
floating point numbers of different bit widths;
fixed point numbers with different bit widths and fixed point numbers in different fixed point positions;
different times of the training process (specifically, different iteration times or initialization time), different stages of the training process (i.e., forward or reverse operation), different layers, different data blocks in the same layer (i.e., multiple input data blocks and output data blocks), or different sub-data blocks divided in the same data block may:
fixed-point or floating-point may be used, respectively;
for fixed points:
using different fixed point bit widths;
using different fixed point offset values (i.e., fixed point positions);
A specific implementation method of neural network training is described below through a practical example. Fig. 1a shows a specific calculation schematic diagram of single-layer neural network training: as shown in fig. 1a, the input data and the weights (or parameters) execute the operation of this layer. The technical scheme provided by the embodiments of the present application determines whether to convert the types of the input data and the weights according to the input data, the weights and the forward operation amount of this layer, and the specific manner may be as follows: if the register or memory space occupied by storing the input data and the weights is larger than a set threshold and the forward operation amount of this layer is larger than a set operation amount, and the input data and the weight data are determined to be floating point data, the input data and the weight data are converted into fixed point data. If the register or memory space occupied by storing the input data and the weights is smaller than the set threshold, and the input data and the weight data are fixed point data, they are converted into floating point data, and then the operation of this layer is executed.
Regarding the principle of the above data type conversion, the present application describes the expression of fixed point type data in detail with reference to fig. 1b. For a computing system, the number of storage bits of one floating point datum is 32 bits, while for fixed point data, in particular data represented with the fixed point type shown in fig. 1b, the number of storage bits of one fixed point datum can be less than 16 bits. The conversion therefore greatly reduces the transmission overhead between the computing circuits; in addition, data occupying fewer bits requires less storage space, i.e. the storage overhead is smaller, and the amount of calculation is also reduced, i.e. the calculation overhead is reduced. So both the calculation overhead and the storage overhead can be reduced, but the data type conversion itself also incurs some overhead, referred to below simply as the conversion overhead. For data involving a large amount of calculation and a large amount of data storage, the conversion overhead is almost negligible relative to the subsequent calculation overhead, storage overhead and transmission overhead; therefore, for such data, the present application adopts the technical scheme of converting the data type into fixed point type data. Conversely, for data with a small amount of calculation and a small amount of data storage, the calculation overhead, storage overhead and transmission overhead are already relatively low; in this case, since the precision of fixed point data is slightly lower than that of floating point data, and the calculation precision needs to be guaranteed given the small amount of calculation, the fixed point type data is converted into floating point data, i.e. the calculation precision is improved at a relatively small increase in overhead.
In the following, a practical example is described, as shown in fig. 4e, the operation manner of the present layer is matrix multiplication, the input data and the weights are both matrices, for convenience of description, the input data here is exemplified by a matrix I, the weights are exemplified by a matrix W, and as shown in fig. 4e, the output data is represented by a matrix I x matrix W; here, if the sum of the number of columns and the number of rows of the matrix I and the matrix W is large, the space occupied by the matrix I and the matrix W in the memory and/or the register is large, and the calculation amount is also large, so that if the matrix I and the matrix W are floating point data, the matrix I and the matrix W are converted into fixed point data, and then the matrix multiplication is performed.
For example, if the matrix I is a 1000 × 1000 matrix and the matrix W is also a 1000 × 1000 matrix, the sum of the number of rows and the number of columns is 2000, and the corresponding amount of calculation is large: the number of multiplications required by the inner products of the matrix-matrix multiplication is 10⁹.
For the technical solution of converting fixed point data into floating point data, take the reverse operation as an example; in the calculation structure shown in fig. 4g, the direction of the upward arrow is the reverse operation. In the reverse operation the data concerned is an output data gradient. Specifically, if the layer is the last layer of the current iterative computation, the output data gradient is obtained by applying a preset operation to the output data of the last layer of the current iterative computation (the preset operation may be set by the manufacturer according to its needs, and the specific steps of the preset operation are not limited here); if the layer is not the last layer of the current iterative computation, for example the nth layer of the current iterative computation, the output data gradient is the input data gradient computed by the reverse operation of the (n+1)th layer.
A practical example is described below. As shown in fig. 4g, the operation of this layer is matrix multiplication, the input data is a matrix and the weight is a scalar. For convenience of description, the input data here is a matrix I and the weight is a scalar C; as shown in fig. 4g, the output data is matrix I × C. In this case, since the weight is scalar data, the calculation amount is small; therefore, if the matrix I is fixed point data, it is first converted into floating point data and the matrix-by-scalar multiplication is then performed.
For example, if the matrix I is a 10 × 10 matrix and the scalar is C, the sum of the numbers of rows and columns is 20, which is small (assuming that a value greater than 100 is considered large and a value less than 100 is considered small; those skilled in the art can set this threshold of 100 arbitrarily), so the corresponding calculation amount is small and the inner-product multiplications required number only 10^2. Since the calculation amount is small, performing the calculation with fixed point data would affect the precision.
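The decision rule illustrated by the two examples above can be summarized in the following sketch; the complexity measure (the sum of the row and column counts) and the threshold of 100 come directly from the examples, while the function names and the use of float16 as a stand-in for the fixed point type of fig. 1b are assumptions made for illustration only.

import numpy as np

COMPLEXITY_THRESHOLD = 100  # the illustrative threshold used in the examples above

def multiply_with_type_selection(a, b):
    # Multiply two operands (matrix @ matrix, or matrix * scalar), choosing the
    # data type from the sum of all row/column counts involved.
    complexity = sum(np.shape(a)) + sum(np.shape(b))  # a scalar contributes 0
    op = np.matmul if np.ndim(b) == 2 else np.multiply
    if complexity > COMPLEXITY_THRESHOLD:
        # Large workload (e.g. two 1000 x 1000 matrices: complexity 4000 and
        # about 10^9 multiplications): quantize before multiplying.
        return op(np.asarray(a, np.float16), np.asarray(b, np.float16)).astype(np.float32)
    # Small workload (e.g. a 10 x 10 matrix times a scalar: complexity 20 and
    # about 10^2 multiplications): keep or restore full floating point precision.
    return op(np.asarray(a, np.float32), np.asarray(b, np.float32))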
In an alternative, each data block of each layer in the network may adopt its own fixed point bit width, but the fixed point position associated with that bit width changes with the training iteration cycle;
specifically, in the training process, the data representation method of a certain data block may be set as follows;
specifically, when training is started, any data representation method can be selected for a certain data block;
in one alternative, a floating point representation of a particular bit width may be selected;
in one alternative, a particular form of fixed point representation may be selected;
a particular fixed point bit width may be selected;
a particular fixed point position may be selected;
in an alternative, the fixed point position may be set according to the maximum of the absolute values of all data in the data block;
in one alternative, the fixed point position may be set according to the minimum of the absolute values of all the data in the data block;
in one alternative, the fixed-point position of the data block at initialization time can be determined according to the fixed-point positions of other data blocks;
in one alternative, the fixed point position of the data block may be set based on empirical values;
specifically, during the training process, the data representation method of a certain data block may be changed after any number of iteration cycles;
in one alternative, no adjustment may be made for a certain data block;
in one alternative, the adjustment may be performed at regular intervals of iterations;
in an alternative, the adjustment may be made at regular intervals of the number of epochs of training;
in one alternative, the adjustment may be made at intervals of non-fixed iterations;
in one alternative, the adjustment may be made at non-fixed intervals of the number of training epochs;
specifically, in the training process, when the representation method of a certain data block is adjusted, the representation method can be adjusted to any data representation method;
in one alternative, if a data block is represented using fixed point numbers of a fixed bit width, the fixed point position of its data representation may be adjusted in the following manner (see also the sketch following this list):
in one alternative, the fixed point position is reset each time according to the method used to initialize the fixed point position;
in an alternative, if the fixed point position calculated for a data block according to the fixed point position initialization method increases in an iteration period compared with the previous iteration period, the fixed point position of this period is changed in the increasing direction; otherwise, it is changed in the decreasing direction.
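As a non-limiting illustration of the initialization and adjustment alternatives listed above, the following sketch derives a fixed point position from the maximum absolute value in a data block and then moves the position by at most one step per adjustment; the 16-bit width, the function names and the single-step update are assumptions for illustration and not features of the disclosed device.

import math

def init_fixed_point_position(data_block, bit_width=16):
    # Choose a position such that the largest absolute value in the block still
    # fits in the signed integer range (one bit is reserved for the sign).
    max_abs = max(abs(x) for x in data_block) or 1e-30
    integer_bits = max(0, math.ceil(math.log2(max_abs)) + 1)
    return max(0, bit_width - 1 - integer_bits)  # remaining bits hold the fraction

def adjust_fixed_point_position(prev_position, data_block, bit_width=16):
    # Move one step towards whatever the initialization rule would currently
    # choose, rather than jumping to it directly.
    target = init_fixed_point_position(data_block, bit_width)
    if target > prev_position:
        return prev_position + 1
    if target < prev_position:
        return prev_position - 1
    return prev_position

# Usage sketch: re-evaluate the position every few iteration periods or epochs.
position = init_fixed_point_position([0.5, -1.2, 3.7])               # e.g. 12
position = adjust_fixed_point_position(position, [0.5, -2.9, 7.4])   # moves to 11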
The present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network including a plurality of layers, the integrated circuit chip device comprising: a processing circuit and an external interface;
the external interface is used for receiving a training instruction;
the processing circuit is used for determining first-layer input data and first-layer weight data according to the training instruction, and executing n layers of forward operations of the neural network through the first-layer input data and the first-layer weight data to obtain an nth output result;
the processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain an nth reverse operation of the nth layer of reverse operation according to the training instruction, obtain an nth reverse operation complexity according to the nth output result gradient, the nth layer of input data, the nth layer of weight group data and the nth reverse operation, determine an nth reverse data type corresponding to the nth output result gradient, the nth layer of input data and the nth layer of weight group data according to the nth reverse operation complexity, and execute n layers of reverse operations of the neural network on the nth output result gradient, the nth layer of input data and the nth layer of weight group data according to the nth reverse data type to obtain n weight gradients of the n layers of operations; the nth reverse data type includes: a fixed point type or a floating point type;
the processing circuit is further configured to update the n weights of the n layers of operations by applying the n weight gradients.
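Purely as an explanatory sketch of the training flow described above, with assumed helper names, matrix-multiply layers, and a squared-error gradient standing in for the preset operation on the last layer's output, the forward pass followed by the reverse pass with complexity-driven data type selection might look as follows:

import numpy as np

THRESHOLD = 100  # assumed complexity threshold for choosing the reverse data type

def train_one_iteration(x, weights, target, lr=0.01):
    # Forward operation, layer 1 .. n.
    acts = [x]
    for w in weights:
        acts.append(acts[-1] @ w)

    # nth output result gradient (here: gradient of a squared-error loss).
    grad = acts[-1] - target

    new_weights = list(weights)
    # Reverse operation, layer n .. 1.
    for i in range(len(weights) - 1, -1, -1):
        complexity = sum(acts[i].shape) + sum(weights[i].shape)
        # float16 stands in for the fixed point type when the workload is large.
        dtype = np.float16 if complexity > THRESHOLD else np.float32
        w_grad = acts[i].astype(dtype).T @ grad.astype(dtype)   # ith layer weight group gradient
        grad = grad.astype(dtype) @ weights[i].astype(dtype).T  # input data gradient for layer i-1
        new_weights[i] = weights[i] - lr * w_grad.astype(np.float32)
    return new_weights

# Usage: two layers of shape (8, 8) trained for one iteration.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(2)]
ws = train_one_iteration(rng.standard_normal((4, 8)), ws, np.zeros((4, 8)))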
The disclosure also discloses a neural network computing device, which includes one or more chips shown in fig. 3 and is used for acquiring data to be computed and control information from other processing devices, executing a specified neural network operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, wifi interfaces and servers. When more than one chip shown in fig. 3 is included, the chips can be linked and transmit data through a specific structure, for example interconnected via a PCIE bus, to support larger-scale neural network operations. In this case, the chips may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The neural network arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The disclosure also discloses a combined processing device, which includes the above neural network computing device, the universal interconnect interface, and other processing devices (i.e., general processing devices). The neural network arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. FIG. 5a is a schematic view of a combined processing apparatus.
Other processing devices include one or more of general purpose/special purpose processors such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network arithmetic device and external data and control, performing data transportation and basic control of the neural network arithmetic device such as starting and stopping; the other processing devices may also cooperate with the neural network arithmetic device to complete the computing task.
The universal interconnection interface is used for transmitting data and control instructions between the neural network arithmetic device and the other processing devices. The neural network arithmetic device acquires the required input data from the other processing devices and writes it into the storage device on the neural network arithmetic device chip; it can obtain control instructions from the other processing devices and write them into a control cache on the neural network arithmetic device chip; it can also read the data in its own storage module and transmit the data to the other processing devices.
As shown in fig. 5b, the structure may further include a storage device for storing data required by the present arithmetic unit/arithmetic device or other arithmetic units; it is particularly suitable for data to be computed that cannot be entirely held in the internal storage of the present neural network arithmetic device or the other processing devices.
The combined processing device can be used as an SOC (system on chip) of equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
Referring to fig. 5c, fig. 5c is a schematic structural diagram of a neural network processor board card according to an embodiment of the disclosure. As shown in fig. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate (substrate) 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11, and optionally, as shown in fig. 5d, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 related to the present disclosure is not limited; the neural network chip 111 includes, but is not limited to, a neural network chip integrating a neural network processor, and the chip may be made of silicon material, germanium material, quantum material, molecular material, or the like. The neural network chip can be packaged according to practical conditions (such as a harsher environment) and different application requirements, so that most of the neural network chip is enclosed, while the pins on the neural network chip are connected to the outer side of the packaging structure through conductors such as gold wires for circuit connection with a further outer layer.
The present disclosure is not limited to the specific structure of the neural network chip 111, and please refer to the apparatus shown in fig. 1a or fig. 1 b.
The type of the first substrate 13 and the second substrate 113 is not limited in this disclosure, and may be a Printed Circuit Board (PCB) or a Printed Wiring Board (PWB), and may be other circuit boards. The material of the PCB is not limited.
The second substrate 113 according to the present disclosure is used for carrying the neural network chip 111, and the neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, is used for protecting the neural network chip 111 and facilitating the further packaging of the neural network chip package structure 11 with the first substrate 13.
The specific packaging method and the corresponding structure of the second electrical and non-electrical connection device 112 are not limited; an appropriate packaging method can be selected, and simply improved, according to actual conditions and different application requirements, for example: a Flip Chip Ball Grid Array Package (FCBGAP), a Low-profile Quad Flat Package (LQFP), a Quad Flat Package with Heat sink (HQFP), a Quad Flat Non-lead Package (QFN), or a Fine-pitch Ball Grid Array (FBGA) package.
Flip Chip packaging is suitable for cases where a small area after packaging is required or where the package is sensitive to lead inductance and signal transmission time. In addition, a Wire Bonding packaging mode can be used, which reduces the cost and improves the flexibility of the packaging structure.
Ball Grid Array (BGA) packaging can provide more pins with a short average lead length, and thus supports high-speed signal transmission; the package may alternatively be a Pin Grid Array (PGA), Zero Insertion Force (ZIF), Single Edge Contact Connection (SECC), Land Grid Array (LGA) package, or the like.
Optionally, the neural network chip 111 and the second substrate 113 are packaged in a Flip Chip Ball Grid Array (FCBGA) manner, and a schematic diagram of a specific neural network chip packaging structure may refer to fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, the bonding pads 22, the solder balls 23, the second substrate 24, the connection points 25 on the second substrate 24, and the pins 26.
The bonding pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed between the bonding pads 22 and the connection points 25 on the second substrate 24 by soldering, so that the neural network chip 21 and the second substrate 24 are connected, that is, the package of the neural network chip 21 is realized.
The pins 26 are used for connecting with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), so as to realize transmission of external data and internal data, and facilitate processing of data by the neural network chip 21 or a neural network processor corresponding to the neural network chip 21. The present disclosure is also not limited to the type and number of pins, and different pin types can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip packaging structure further includes an insulating filler disposed in the gaps between the bonding pads 22, the solder balls 23 and the connection points 25, for preventing interference between adjacent solder balls.
Wherein, the material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a heat dissipator such as a fan.
For example, as shown in fig. 6a, the neural network chip package structure 11 includes: the neural network chip 21, the bonding pads 22, the solder balls 23, the second substrate 24, the connection points 25 on the second substrate 24, the pins 26, the insulating filler 27, the thermal grease 28 and the metal housing heat sink 29. The thermal grease 28 and the metal housing heat sink 29 are used to dissipate heat generated during operation of the neural network chip 21.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected to the bonding pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the bonding pad 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The present disclosure is not limited to the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112, that is, the neural network chip package structure 11 may be packaged by soldering, or a connection wire or plug connection may be used to connect the second substrate 113 and the first substrate 13, so as to facilitate subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit used to expand the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), etc.; expanding the memory improves the processing capability of the neural network processor.
The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, for data transmission between the package structure and an external circuit, which can improve the operation speed and the convenience of operation.
The neural network processor is packaged into a neural network chip 111, the neural network chip 111 is packaged into a neural network chip packaging structure 11, the neural network chip packaging structure 11 is packaged into a neural network processor board card 10, and data interaction is performed with an external circuit (for example, a computer motherboard) through an interface (a slot or a plug core) on the board card, that is, the function of the neural network processor is directly realized by using the neural network processor board card 10, and the neural network chip 111 is protected. And other modules can be added to the neural network processor board card 10, so that the application range and the operation efficiency of the neural network processor are improved.
In one embodiment, the present disclosure discloses an electronic device comprising the above neural network processor board card 10 or the neural network chip package 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, cameras, servers, webcams, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The above embodiments further describe in detail the objects, technical solutions and advantages of the present disclosure. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit the disclosure; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (15)

1. An integrated circuit chip apparatus, the apparatus being configured to perform training of a neural network, the neural network comprising n layers, where n is an integer greater than or equal to 2, the integrated circuit chip apparatus comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes: a data type arithmetic circuit; the data type arithmetic circuit is used for executing conversion between floating point type data and fixed point type data;
the plurality of base processing circuits are distributed in an array; each basic processing circuit is connected with other adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1 st row, the n basic processing circuits of the m th row and the m basic processing circuits of the 1 st column;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, obtain an nth reverse operation of an nth layer of reverse operation according to the training instruction, obtain an nth reverse operation complexity according to the nth output result gradient, nth layer of input data, nth layer of weight group data and the nth reverse operation, and determine an nth reverse data type corresponding to the nth output result gradient, the nth layer of input data and the nth layer of weight group data according to the nth reverse operation complexity;
the main processing circuit is used for dividing the nth output result gradient, the nth layer input data and the nth layer weight group data into a broadcast data block and a distribution data block according to the nth reverse operation type, splitting the distribution data block of the nth reverse data type to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to at least one branch processing circuit in a basic processing circuit connected with the main processing circuit, and broadcasting the broadcast data block of the nth reverse data type to the basic processing circuit connected with the main processing circuit;
the basic processing circuit is used for executing operation in the neural network in a parallel mode according to the broadcast data block of the nth reverse data type and the basic data block of the nth reverse data type to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit;
the main processing circuit is used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient; the nth reverse data type includes: a fixed point type or a floating point type;
the integrated circuit chip device is further configured to perform n-1 layers of reverse operations by using the nth layer input data gradient as the (n-1)th output result gradient of the (n-1)th layer to obtain n-1 layers of weight group gradients, and to update the weight group data of the corresponding layers by applying the n-1 layers of weight group gradients, wherein the weight group data includes: at least two weights;
the main processing circuit is specifically configured to determine that the nth layer input data and the nth layer weight group data are both distribution data blocks and the nth output result gradient is a broadcast data block if the nth reverse operation is a multiplication operation; and to determine that the nth layer input data and the nth layer weight group data are broadcast data blocks and the nth output result gradient is a distribution data block if the type of the nth reverse operation is a convolution operation.
2. The integrated circuit chip apparatus of claim 1,
the main processing circuit is specifically configured to compare the nth inverse operation complexity with a preset threshold, determine that the nth inverse data type is a fixed-point type if the nth inverse operation complexity is higher than the preset threshold, and determine that the nth inverse data type is a floating-point type if the nth inverse operation complexity is lower than or equal to the preset threshold.
3. The integrated circuit chip apparatus of claim 2,
the main processing circuit is specifically configured to determine an n +1 th reverse data type to which the nth output result gradient, the nth layer input data, and the nth layer weight group data belong, and convert the nth output result gradient, the nth layer input data, and the nth layer weight group data belonging to the n +1 th reverse data type into the nth output result gradient, the nth layer input data, and the nth layer weight group data belonging to the nth reverse data type through the data type operation circuit if the n +1 th reverse data type is different from the nth reverse data type.
4. The integrated circuit chip apparatus of claim 1,
the main processing circuit is specifically configured for the case where the nth reverse operation is a convolution operation, the convolution input data being the nth layer input data and the convolution kernel being the nth output result gradient,
the nth reverse operation complexity = α × C × kH × kW × M × N × W × C1 × H;
wherein α is a convolution coefficient with a value range larger than 1, C, kH, kW and M are values of four dimensions of a convolution kernel, and N, W, C1 and H are values of four dimensions of convolution input data;
if the nth reverse operation complexity is greater than the set threshold, the nth reverse data type is determined to be the fixed point data type, it is determined whether the convolution input data and the convolution kernel are fixed point data, and if the convolution input data and the convolution kernel are not fixed point data, the convolution input data is converted into fixed point data, the convolution kernel is converted into fixed point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed point data type.
5. The integrated circuit chip apparatus of claim 1,
the main processing circuit is further specifically configured for the case where the nth reverse operation is a matrix-multiply-matrix operation, the input data being the nth layer input data and the weight being the nth output result gradient;
the nth reverse operation complexity = β × F1 × G × E × F, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the nth layer input data, and E and F are the row and column values of the weight;
if the nth reverse operation complexity is greater than the set threshold, the nth reverse data type is determined to be the fixed point data type, it is determined whether the nth layer input data and the weight are fixed point data, and if the nth layer input data and the weight are not fixed point data, the nth layer input data is converted into fixed point data, the weight is converted into fixed point data, and the matrix-multiply-matrix operation is then performed on the nth layer input data and the weight in the fixed point data type.
6. The integrated circuit chip apparatus of claim 1,
the integrated circuit chip device is further configured for the case where the nth reverse operation is a matrix-multiply-vector operation, the input data being the nth layer input data and the weight being the nth output result gradient;
the nth reverse operation complexity = β × F1 × G × F3, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the nth layer input data, and F3 is the column value of the nth output result gradient;
if the nth reverse operation complexity is greater than the set threshold, the nth reverse data type is determined to be the fixed point data type, it is determined whether the nth layer input data and the weight are fixed point data, and if the nth layer input data and the weight are not fixed point data, the branch processing circuit is informed to convert the nth layer input data into fixed point data and to convert the weight into fixed point data, and the matrix-multiply-vector operation is then performed on the nth layer input data and the weight in the fixed point data type.
7. The integrated circuit chip apparatus of any of claims 1-6,
the n layers of reverse operations further comprise one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation and an activation operation.
8. The integrated circuit chip apparatus of claim 1,
the main processing circuit includes: a main register or a main on-chip cache circuit;
the basic processing circuit includes: a basic register or a basic on-chip cache circuit.
9. The integrated circuit chip apparatus of claim 8,
the main processing circuit includes: one or any combination of vector arithmetic unit circuit, arithmetic logic unit circuit, accumulator circuit, matrix transposition circuit, direct memory access circuit or data rearrangement circuit.
10. The integrated circuit chip apparatus of claim 8,
the nth output result gradient is as follows: one or any combination of vector, matrix, three-dimensional data block and four-dimensional data block;
the nth layer of input data is as follows: one or any combination of vector, matrix, three-dimensional data block and four-dimensional data block;
the n layers of weight group data are as follows: vector, matrix, three-dimensional data block, four-dimensional data block or any combination thereof.
11. A neural network operation device, comprising one or more integrated circuit chip devices as claimed in any one of claims 1 to 10.
12. A combined processing apparatus, characterized in that the combined processing apparatus comprises: the neural network operation device according to claim 11, a universal interconnection interface, and a general processing device;
the neural network operation device is connected with the general processing device through the general interconnection interface.
13. A chip incorporating a device according to any one of claims 1 to 12.
14. A smart device, characterized in that it comprises a chip according to claim 13.
15. A method of operation of a neural network, the method being implemented within an integrated circuit chip device, the integrated circuit chip device comprising: the integrated circuit chip apparatus of any of claims 1-10, the integrated circuit chip apparatus to perform a training operation of a neural network.
CN201711469408.3A 2017-12-27 2017-12-28 Integrated circuit chip device and related product Active CN109978156B (en)

Priority Applications (13)

Application Number Priority Date Filing Date Title
CN201711469408.3A CN109978156B (en) 2017-12-28 2017-12-28 Integrated circuit chip device and related product
EP18896519.8A EP3719712B1 (en) 2017-12-27 2018-12-26 Integrated circuit chip device
EP20201907.1A EP3783477B1 (en) 2017-12-27 2018-12-26 Integrated circuit chip device
PCT/CN2018/123929 WO2019129070A1 (en) 2017-12-27 2018-12-26 Integrated circuit chip device
EP20203232.2A EP3789871B1 (en) 2017-12-27 2018-12-26 Integrated circuit chip device
US16/903,304 US11544546B2 (en) 2017-12-27 2020-06-16 Integrated circuit chip device
US17/134,487 US11748605B2 (en) 2017-12-27 2020-12-27 Integrated circuit chip device
US17/134,486 US11748604B2 (en) 2017-12-27 2020-12-27 Integrated circuit chip device
US17/134,445 US11748602B2 (en) 2017-12-27 2020-12-27 Integrated circuit chip device
US17/134,435 US11741351B2 (en) 2017-12-27 2020-12-27 Integrated circuit chip device
US17/134,446 US11748603B2 (en) 2017-12-27 2020-12-27 Integrated circuit chip device
US17/134,444 US11748601B2 (en) 2017-12-27 2020-12-27 Integrated circuit chip device
US18/073,924 US20230095610A1 (en) 2017-12-27 2022-12-02 Integrated circuit chip device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711469408.3A CN109978156B (en) 2017-12-28 2017-12-28 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
CN109978156A CN109978156A (en) 2019-07-05
CN109978156B true CN109978156B (en) 2020-06-12

Family

ID=67075532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711469408.3A Active CN109978156B (en) 2017-12-27 2017-12-28 Integrated circuit chip device and related product

Country Status (1)

Country Link
CN (1) CN109978156B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221102B (en) * 2021-04-16 2024-01-19 中科寒武纪科技股份有限公司 Method for optimizing convolution operation of system-on-chip and related product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN107169563A (en) * 2017-05-08 2017-09-15 中国科学院计算技术研究所 Processing system and method applied to two-value weight convolutional network
US9779355B1 (en) * 2016-09-15 2017-10-03 International Business Machines Corporation Back propagation gates and storage capacitor for neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017037568A1 (en) * 2015-08-31 2017-03-09 Semiconductor Energy Laboratory Co., Ltd. Semiconductor device or electronic device including the semiconductor device
CN109961138B (en) * 2017-12-14 2020-04-14 中科寒武纪科技股份有限公司 Neural network training method and related product

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779355B1 (en) * 2016-09-15 2017-10-03 International Business Machines Corporation Back propagation gates and storage capacitor for neural networks
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN107169563A (en) * 2017-05-08 2017-09-15 中国科学院计算技术研究所 Processing system and method applied to two-value weight convolutional network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Cambricon: An Instruction Set Architecture for Neural Networks; Shaoli Liu et al.; 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA); 20160825; pp. 393-405 *
DaDianNao: A Neural Network Supercomputer; Yunji Chen et al.; 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture; 20150119; pp. 609-622 *
Design of Key Components of a Neuron Chip Based on BP Networks; Mao Jian; Wanfang Data Knowledge Service Platform; 20130423; pp. 1-42 *
Research on Parallel Architecture of Convolutional Neural Networks Based on FPGA; Lu Zhijian; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20140415 (No. 04); pp. 1-42 *

Also Published As

Publication number Publication date
CN109978156A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN109961138B (en) Neural network training method and related product
US11748605B2 (en) Integrated circuit chip device
CN109978131B (en) Integrated circuit chip apparatus, method and related product
CN109961136B (en) Integrated circuit chip device and related product
US11507810B2 (en) Integrated circuit chip apparatus
CN109961134B (en) Integrated circuit chip device and related product
CN109977446B (en) Integrated circuit chip device and related product
CN109961131B (en) Neural network forward operation method and related product
CN109961135B (en) Integrated circuit chip device and related product
CN109978156B (en) Integrated circuit chip device and related product
CN109978148B (en) Integrated circuit chip device and related product
CN109978157B (en) Integrated circuit chip device and related product
CN110197267B (en) Neural network processor board card and related product
CN109978152B (en) Integrated circuit chip device and related product
CN109978158B (en) Integrated circuit chip device and related product
CN109960673B (en) Integrated circuit chip device and related product
CN109961133B (en) Integrated circuit chip device and related product
CN109961137B (en) Integrated circuit chip device and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant