CN110197275A - Integrated circuit chip device and Related product - Google Patents


Info

Publication number: CN110197275A
Authority: CN (China)
Prior art keywords: data block, circuit, data, basic processing, processed
Legal status: Granted
Application number: CN201810164844.8A
Other languages: Chinese (zh)
Other versions: CN110197275B
Inventor: Not disclosed (不公告发明人)
Assignee (current and original): Shanghai Cambricon Information Technology Co Ltd
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority application: CN201810164844.8A (granted as CN110197275B)
Related application: PCT/CN2019/076088 (published as WO2019165946A1)
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides an integrated circuit chip device and related products. The device is used for performing training of a neural network, the neural network includes n layers, and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, at least one of the basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are used to perform compression processing on data in the neural network operation. The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column. The technical solution provided by the present disclosure has the advantages of a small amount of computation and low power consumption.

Description

Integrated circuit chip device and related product
Technical field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background technique
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuronal networks of the human brain from an information-processing perspective, builds a simple model, and forms different networks according to different connection patterns. In engineering and academia, it is often referred to simply as a neural network or neural-network-like model. A neural network is a computational model composed of a large number of interconnected nodes (or neurons). Existing neural network operations perform the forward operation of a neural network on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); such a forward operation involves a large amount of computation and high power consumption.
Summary of the invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed of a computing device and improve its efficiency.
In a first aspect, an integrated circuit chip device for performing training of a neural network is provided. The device is used to perform training of a neural network that includes n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits. The main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits (i.e., some or all of the basic processing circuits) includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are used to perform compression processing on data in the neural network operation;
The plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
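The array topology above can be sketched as follows. This is an illustrative model only (the function name and grid representation are assumptions, not from the patent): in an m-row by n-column array, the main processing circuit has direct links to row 1, row m, and column 1.

```python
# Hypothetical sketch: list the grid positions (1-indexed) of the basic
# processing circuits that the main processing circuit connects to directly:
# the n circuits of the first row, the n circuits of the m-th row, and the
# m circuits of the first column.
def main_circuit_neighbors(m, n):
    positions = set()
    for col in range(1, n + 1):
        positions.add((1, col))   # first row
        positions.add((m, col))   # m-th row
    for row in range(1, m + 1):
        positions.add((row, 1))   # first column
    return sorted(positions)

# For a 3x4 array: rows 1 and 3 contribute 4 circuits each, column 1
# contributes 3, and 2 corner circuits are shared, so 9 distinct links.
print(len(main_circuit_neighbors(3, 4)))  # 9
```

All other basic processing circuits exchange data only through their adjacent neighbors, which is what allows distributed and broadcast data to flow through the array.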
The integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
The main processing circuit is further configured to obtain the n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, the n-th backward operation instruction of the n-th-layer backward operation as well as the n-th-layer input data and the n-th-layer weight group data required by the n-th backward operation instruction; to divide the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data into a vertical data block and a horizontal data block according to the n-th backward operation instruction; to determine, according to the operation control of the n-th backward operation instruction, whether to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and to send, according to the n-th backward operation instruction, the processed first data block to at least one of the basic processing circuits connected to the main processing circuit;
The plurality of basic processing circuits are configured to determine, according to the operation control of the n-th backward operation instruction, whether to start the second mapping circuit to process a second data block; to perform the operations in the neural network in parallel according to the processed second data block to obtain an operation result; and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit. The second data block is a data block that the basic processing circuit determines to have received from the main processing circuit, and the second data block is associated with the processed first data block;
The main processing circuit is further configured to process the operation result to obtain the n-th-layer weight group gradient and the n-th-layer input data gradient, and to update the n-th-layer weight group data using the n-th-layer weight group gradient;
The integrated circuit chip device is further configured to use the n-th-layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, perform the (n-1)-th-layer backward operation to obtain the (n-1)-th-layer weight group gradient, and update the weight group data of the corresponding layer using the (n-1)-th-layer weight group gradient, where the weight group data includes at least two weights.
In a second aspect, a neural network operation device is provided. The neural network operation device includes one or more of the integrated circuit chip devices provided in the first aspect.
In a third aspect, a combined processing device is provided. The combined processing device includes: the neural network operation device provided in the second aspect, a universal interconnection interface, and a general-purpose processing device;
The neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.
In a fourth aspect, a chip is provided, which integrates the device of the first aspect, the device of the second aspect, or the device of the third aspect.
In a fifth aspect, an electronic device is provided, which includes the chip of the fourth aspect.
It can be seen that, in the embodiments of the present disclosure, the mapping circuits compress the data blocks before the operation is performed, which saves transmission resources and computing resources; the solution therefore has the advantages of low power consumption and a small amount of computation.
Detailed description of the invention
Fig. 1 is a schematic diagram of a neural network training method.
Fig. 1a is a schematic diagram of a neural network forward operation.
Fig. 1b is a schematic diagram of a neural network operation.
Fig. 2a is a schematic diagram of convolution input data.
Fig. 2b is a schematic diagram of a convolution kernel.
Fig. 2c is a schematic diagram of an operation window of a three-dimensional data block of the input data.
Fig. 2d is a schematic diagram of another operation window of a three-dimensional data block of the input data.
Fig. 2e is a schematic diagram of yet another operation window of a three-dimensional data block of the input data.
Fig. 3 is a schematic structural diagram of a neural network chip.
Fig. 4a is a schematic diagram of matrix-multiply-matrix.
Fig. 4b is a flowchart of a matrix-multiply-matrix method.
Fig. 4c is a schematic diagram of matrix-multiply-vector.
Fig. 4d is a flowchart of a matrix-multiply-vector method.
Fig. 4e is a schematic diagram of neural network training.
Fig. 4f is another schematic diagram of neural network training.
Fig. 4g is a schematic diagram of neural network forward and backward operations.
Fig. 4h is a schematic diagram of a multilayer structure for neural network training.
Fig. 5 is a schematic structural diagram of a neural network chip provided by an embodiment of the present disclosure;
Fig. 6a and Fig. 6b are schematic structural diagrams of two mapping circuits provided by embodiments of the present application.
Specific embodiment
To enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
In the device provided in the first aspect, the integrated circuit chip device is configured to receive a training instruction, determine first-layer input data and first-layer weight group data according to the training instruction, and perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain the n-th output result of the forward operation;
The main processing circuit is further configured to obtain the n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, the n-th backward operation instruction of the n-th-layer backward operation as well as the n-th-layer input data and the n-th-layer weight group data required by the n-th backward operation instruction; to divide the n-th output result gradient, the n-th-layer input data, and the n-th-layer weight group data into a vertical data block and a horizontal data block according to the n-th backward operation instruction; to determine, according to the operation control of the n-th backward operation instruction, whether to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block includes the horizontal data block and/or the vertical data block; and to send, according to the n-th backward operation instruction, the processed first data block to at least one of the basic processing circuits connected to the main processing circuit;
The plurality of basic processing circuits are configured to determine, according to the operation control of the n-th backward operation instruction, whether to start the second mapping circuit to process a second data block; to perform the operations in the neural network in parallel according to the processed second data block to obtain an operation result; and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit. The second data block is a data block that the basic processing circuit determines to have received from the main processing circuit, and the second data block is associated with the processed first data block;
The main processing circuit is further configured to process the operation result to obtain the n-th-layer weight group gradient and the n-th-layer input data gradient, and to update the n-th-layer weight group data using the n-th-layer weight group gradient;
The integrated circuit chip device is further configured to use the n-th-layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, perform the (n-1)-th-layer backward operation to obtain the (n-1)-th-layer weight group gradient, and update the weight group data of the corresponding layer using the (n-1)-th-layer weight group gradient, where the weight group data includes at least two weights.
In the device provided in the first aspect, when the first data block includes a horizontal data block and a vertical data block, the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with an identification data block associated with the horizontal data block, and a processed vertical data block with an identification data block associated with the vertical data block; to split the processed horizontal data block and its associated identification data block into a plurality of basic data blocks and their respective associated identification data blocks; to distribute the plurality of basic data blocks and their respective associated identification data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected to it. The identification data block may specifically be represented by direct index or stride index, and optionally also by List of Lists (LIL), Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), Hybrid (HYB), and similar formats; this application is not limited in this respect.
Taking the case where the identification data block is represented by direct index as an example, the identification data block may specifically be a data block composed of 0s and 1s, where 0 indicates that the absolute value of the corresponding datum (such as a weight or an input neuron) in the data block is less than or equal to a first threshold, and 1 indicates that the absolute value of the corresponding datum is greater than the first threshold. The first threshold is user-defined or set arbitrarily on the device side, for example 0.05 or 0.
To save transmission volume and improve data transfer efficiency, when the main processing circuit sends data to the basic processing circuits, it may specifically distribute the target data of the plurality of basic data blocks together with their respective associated identification data blocks to the basic processing circuits connected to it; optionally, it may also broadcast the target data of the processed vertical data block together with the identification data block associated with the vertical data block to the basic processing circuits connected to it. The target data refers to the data in a data block whose absolute values are greater than the first threshold, or to the nonzero data in a data block (here, specifically the processed horizontal data block or the processed vertical data block).
Correspondingly, the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block; to process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; to perform the backward operation on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit. The backward operation includes, but is not limited to, one or a combination of the following: convolution operation (i.e., inner product operation), product operation, bias operation, fully connected operation, GEMM operation, GEMV operation, and activation operation;
The main processing circuit is configured to process the operation result to obtain the instruction result.
For example, the horizontal data block is a matrix of M1 rows and N1 columns, and a basic data block is a matrix of M2 rows and N2 columns, where M1 > M2 and N1 > N2. Correspondingly, the identification data block associated with the horizontal data block is likewise a matrix of M1 rows and N1 columns, and the identification data block associated with a basic data block is likewise a matrix of M2 rows and N2 columns. Taking a 2*2 basic data block as an example, with the first threshold set to 0.05, each element of the associated 2*2 identification data block is 1 where the absolute value of the corresponding element of the basic data block is greater than 0.05, and 0 otherwise. The processing of data blocks by the first mapping circuit and the second mapping circuit will be described in detail later.
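The direct-index rule above can be sketched in a few lines. The 2*2 values below are hypothetical (the patent's concrete example matrices are not given in this text); only the rule itself — mask out elements whose absolute value is at most the first threshold — comes from the description:

```python
FIRST_THRESHOLD = 0.05  # example threshold value given in the text

def identification_block(block, threshold=FIRST_THRESHOLD):
    """Direct-index identification data block: 1 if |x| > threshold, else 0."""
    return [[1 if abs(x) > threshold else 0 for x in row] for row in block]

def target_data(block, mask):
    """Keep only the elements marked 1 in the mask (the 'target data')."""
    return [x for row, mrow in zip(block, mask)
              for x, m in zip(row, mrow) if m]

basic = [[0.5, 0.0], [0.03, -0.8]]       # hypothetical 2x2 basic data block
mask = identification_block(basic)
print(mask)                               # [[1, 0], [0, 1]]
print(target_data(basic, mask))           # [0.5, -0.8]
```

Sending only the target data plus the 0/1 mask, instead of the full block, is what saves transmission volume in the scheme described above.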
In the device provided in the first aspect, when the first data block includes a horizontal data block, the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and the identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to a prestored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; to split the processed horizontal data block and the identification data block associated with the horizontal data block into a plurality of basic data blocks and their respective associated identification data blocks; to distribute the plurality of basic data blocks and their respective associated identification data blocks to the basic processing circuits connected to it; and to broadcast the vertical data block to the basic processing circuits connected to it;
The basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, obtaining a processed vertical data block; to perform the backward operation on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit.
In an alternative embodiment, the main processing circuit is further specifically configured to split the vertical data block, or the processed vertical data block together with the identification data block associated with the vertical data block, into a plurality of partial vertical data blocks and their respective associated identification data blocks, and to broadcast the plurality of partial vertical data blocks and their respective associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks together form the vertical data block or the processed vertical data block.
Correspondingly, the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; to process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and to perform the backward operation on the processed partial vertical data block and the processed basic data block.
The connection identification data block is obtained by performing an element-wise AND operation on the identification data block associated with the basic data block and the identification data block associated with the partial vertical data block. Optionally, the connection identification data block indicates the positions at which the data of both data blocks (specifically, the basic data block and the vertical data block) have absolute values greater than the first threshold. Details are described later.
For example, if the identification data block associated with a basic data block and the identification data block associated with a partial vertical data block are 0/1 matrices of matching dimensions, the corresponding connection identification data block is obtained as their element-wise AND.
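The element-wise AND that produces the connection identification data block can be sketched as follows; the two masks below are illustrative assumptions, not the patent's example values:

```python
# Connection identification data block: element-wise AND of the 0/1 mask
# associated with a basic data block and the 0/1 mask associated with a
# partial vertical data block of the same shape. A 1 survives only where
# both blocks hold above-threshold data.
def connection_identification(mask_a, mask_b):
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]

basic_mask = [[1, 0], [1, 1]]       # hypothetical basic-data-block mask
vertical_mask = [[1, 1], [0, 1]]    # hypothetical partial-vertical-block mask
print(connection_identification(basic_mask, vertical_mask))  # [[1, 0], [0, 1]]
```

Positions where the joint mask is 0 contribute nothing to the inner product, so the second mapping circuit can discard those elements from both operands before computing.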
In the device provided in the first aspect, when the first data block includes a vertical data block, the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block, obtaining a processed vertical data block and the identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to a prestored identification data block associated with the vertical data block, obtaining a processed vertical data block; to split the horizontal data block into a plurality of basic data blocks; to distribute the plurality of basic data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to it;
The basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block, obtaining a processed basic data block; to perform the backward operation on the processed vertical data block and the processed basic data block to obtain an operation result; and to send the operation result to the main processing circuit.
In an alternative embodiment, the main processing circuit is further specifically configured to split the processed vertical data block and the identification data block associated with the vertical data block into a plurality of partial vertical data blocks and their associated identification data blocks, and to broadcast the plurality of partial vertical data blocks and their respective associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks together form the vertical data block or the processed vertical data block.
Correspondingly, the basic processing circuit is specifically configured to process the basic data block according to the identification data block associated with the partial vertical data block, obtaining a processed basic data block, and to perform the backward operation on the processed basic data block and the partial vertical data block.
In the device provided in the first aspect, the main processing circuit is specifically configured to broadcast the vertical data block (specifically, the vertical data block or the processed vertical data block) to the basic processing circuits connected to it in a single broadcast.
In the device provided in the first aspect, the basic processing circuit is specifically configured to perform the backward operation on the basic data block (which may likewise be the basic data block or the processed basic data block) and the vertical data block to obtain a backward operation result, to accumulate the backward operation results to obtain an operation result, and to send the operation result to the main processing circuit.
In the device provided in the first aspect, the basic processing circuit is specifically configured to perform the backward operation on the basic data block and the vertical data block to obtain a processing result, to accumulate the processing results to obtain an operation result, and to send the operation result to the main processing circuit;
The main processing circuit is configured to accumulate the operation results to obtain an accumulation result, and to arrange the accumulation result to obtain the instruction result.
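The division of labor described above — each basic processing circuit accumulates its own inner products, and the main processing circuit accumulates the partial results — can be sketched as below. The function names and data layout are illustrative assumptions:

```python
# Each basic processing circuit forms inner products of the rows of its basic
# data block with the broadcast vertical data block and accumulates them into
# a single partial result.
def basic_circuit_result(basic_rows, vertical):
    total = 0.0
    for row in basic_rows:
        total += sum(a * b for a, b in zip(row, vertical))  # inner product
    return total

# The main processing circuit accumulates the partial results it receives.
def main_circuit_accumulate(partial_results):
    return sum(partial_results)

vertical = [1.0, 2.0]                     # broadcast vertical data block
partials = [basic_circuit_result([[1.0, 0.0]], vertical),   # -> 1.0
            basic_circuit_result([[0.0, 3.0]], vertical)]   # -> 6.0
print(main_circuit_accumulate(partials))  # 7.0
```

Accumulating locally before transmitting is what keeps the traffic back to the main processing circuit to one value per basic processing circuit rather than one value per product.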
In the device provided in the first aspect, the main processing circuit is specifically configured to divide the vertical data block into a plurality of partial vertical data blocks and to broadcast the plurality of partial vertical data blocks to the basic processing circuits in multiple broadcasts; the plurality of partial vertical data blocks together form the vertical data block.
In the device provided in the first aspect, the main processing circuit is specifically configured to: when the type of the first operation instruction is a multiplication instruction, determine that the input data is the horizontal data block and the weight data is the vertical data block; and when the type of the first operation instruction is a convolution instruction, determine that the input data is the vertical data block and the weight data is the horizontal data block.
In the device provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the partial vertical data block (specifically, the partial vertical data block or the processed partial vertical data block) and the basic data block to obtain an inner product processing result, to accumulate the inner product processing result to obtain a partial operation result, and to send the partial operation result to the main processing circuit.
In the device provided in the first aspect, the basic processing circuit is specifically configured to reuse the partial vertical data block n times, performing the inner product operation of the partial vertical data block with each of n basic data blocks to obtain n partial processing results; the n partial processing results are separately accumulated to obtain n partial operation results, and the n partial operation results are sent to the main processing circuit, where n is an integer greater than or equal to 2.
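The reuse described above can be sketched as follows: the broadcast partial vertical data block is received once and multiplied against each of the n locally held basic data blocks, yielding n separately accumulated partial operation results. The data values here are illustrative assumptions:

```python
# Reuse one partial vertical data block across n basic data blocks, producing
# one accumulated partial operation result per basic data block.
def reuse_inner_products(partial_vertical, basic_blocks):
    results = []
    for block in basic_blocks:            # n basic data blocks
        acc = 0.0
        for row in block:                 # accumulate the inner products
            acc += sum(a * b for a, b in zip(row, partial_vertical))
        results.append(acc)
    return results                        # n partial operation results

v = [1.0, -1.0]                           # broadcast once, used n = 3 times
blocks = [[[2.0, 1.0]], [[0.5, 0.5]], [[3.0, 0.0]]]
print(reuse_inner_products(v, blocks))    # [1.0, 0.0, 3.0]
```

Reusing the broadcast block n times means it only crosses the interconnect once, which is the transmission saving the scheme is after.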
In the device provided in the first aspect, the n-layer backward operation further includes one or a combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
In the device provided in the first aspect, the main processing circuit includes a main register or a main on-chip cache circuit;
The basic processing circuit includes a basic register or a basic on-chip cache circuit.
In the device provided in the first aspect, the main processing circuit includes one or a combination of: a vector operation unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, and a data rearrangement circuit.
In the device provided in the first aspect, the n-th output result gradient is one or a combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
The n-th-layer input data may be represented as a tensor, specifically one or a combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
The n-th-layer weight group data may be represented as a tensor, specifically one or a combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
As shown in Fig. 1, the steps of neural network training include:
Each layer of a (multilayer) neural network performs the forward operation in turn;
The backward operation is performed layer by layer in the reverse order to obtain the weight gradients;
The computed weight gradients are used to update the weights used in the forward operation.
This is one iteration of neural network training; the whole training process repeats this process many times (i.e., many iterations).
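The iteration above (forward pass, backward pass in reverse layer order, weight update) can be sketched for a toy network of scalar "layers" with a squared-error loss. This is a generic gradient-descent illustration under stated simplifications (identity activations, scalar weights), not the patent's hardware mapping:

```python
# One training iteration: forward through all layers, backward in reverse
# order computing each layer's weight gradient, then update the weights.
def train_step(weights, x, target, lr=0.1):
    # Forward operation: each "layer" multiplies by its scalar weight.
    activations = [x]
    for w in weights:
        activations.append(w * activations[-1])
    # Output result gradient (derivative of 0.5 * (out - target)^2).
    grad = activations[-1] - target
    # Backward operation, layer by layer in reverse order.
    for i in reversed(range(len(weights))):
        w_grad = grad * activations[i]   # gradient w.r.t. this layer's weight
        grad = grad * weights[i]         # becomes previous layer's output gradient
        weights[i] -= lr * w_grad        # update with the computed gradient
    return weights

w = [0.5, 0.5]
for _ in range(200):                     # repeat the process many times
    w = train_step(w, x=1.0, target=1.0)
print(round(w[0] * w[1], 3))             # the layer product converges toward 1.0
```

The key structural point mirrored here is that each layer's input-data gradient becomes the output-result gradient of the layer before it, exactly as the device description states for layers n down to 1.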
Referring to Fig. 3, Fig. 3 shows an integrated circuit chip device. The device is used for performing training of a neural network, the neural network includes n layers, and n is an integer greater than or equal to 2. The integrated circuit chip device includes a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are used to perform compression processing on data in the neural network operation;
The multiple based process circuit is in array distribution;Each based process circuit and other adjacent based process electricity Road connection, the n based process circuit and the 1st of n based process circuit of the 1st row of main process task circuit connection, m row M based process circuit of column;
The integrated circuit chip device determines that first layer inputs number according to the training instruction for receiving training instruction According to first layer weight group data, the n-layer for executing neural network to first layer input data and first layer weight group data is positive Operation obtains the n-th output result of forward operation;
The main process task circuit is also used to obtain the n-th output result gradient according to the n-th output result, according to described in N-th layer needed for training instruction obtains the n-th reversed operational order and the n-th reversed operational order of the reversed operation of n-th layer Input data and n-th layer weight group data;Result gradient, n-th are exported by described n-th according to the described n-th reversed operational order Layer input data and n-th layer weight group data are divided into vertical data block and lateral data block;According to the described n-th reversed operation The operation control of instruction determines that the first mapping circuit of starting handles the first data block, first data that obtain that treated Block;First data block includes the lateral data block and/or the vertical data block;Refer to according to the described n-th reversed operation By treated, the first data block is sent at least one base in the based process circuit being connected with the main process task circuit for order Plinth processing circuit;
The multiple based process circuit, for determining whether to open according to the operation control of the described n-th reversed operational order Dynamic second mapping circuit handles the second data block, and according to treated, the second data block executes nerve net in a parallel fashion Operation in network obtains operation result, and by the operation result by passing with the based process circuit of the main process task circuit connection It is defeated by the main process task circuit;Second data block is the reception main process task circuit hair that the based process circuit determines The data block sent, second data block and treated first data block associated;
The main process task circuit is also used to be handled the operation result to obtain n-th layer weight group gradient and n-th layer is defeated Enter data gradient, n-th layer weight group data are updated using the n-th layer weight group gradient;
The integrated circuit chip device, be also used to using n-th layer input data gradient as (n-1)th layer (n-1)th output As a result gradient executes n-1 layers of reversed operation and obtains n-1 layers of weight group gradient, using n-1 layers of weight group gradient updating respective layer Weight group data, the weight group data include at least two weights.
As shown in Fig. 1a, in the forward operation of a neural network provided by an embodiment of the present disclosure, each layer uses its own input data and weights to compute the corresponding output data according to the operation rule specified by the type of the layer.
The forward operation process (also called inference) of a neural network processes the input data of each layer in turn and, after certain computations, obtains the output data. It has the following features:
The input of a layer:
the input of a layer can be the input data of the neural network;
the input of a layer can be the output of another layer;
the input of a layer can be the output of the same layer at the previous moment (the case of a recurrent neural network);
a layer can obtain input from multiple of the above input sources simultaneously.
The output of a layer:
the output of a layer can serve as the output result of the neural network;
the output of a layer can be the input of another layer;
the output of a layer can be the input of the same layer at the next moment (the case of a recurrent neural network);
the output of a layer can be sent to multiple of the above output directions.
Specifically, the types of operations of the layers in the neural network include but are not limited to the following:
convolutional layer (i.e., executing a convolution operation);
fully connected layer (executing a fully connected operation);
normalization (regularization) layer: including types such as the LRN (Local Response Normalization) layer and the BN (Batch Normalization) layer;
pooling layer;
activation layer: including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer.
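The activation layer types listed above can be written out as scalar functions, as a sketch; the PReLU and LeakyReLU slope values below are illustrative defaults, not values from the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def prelu(x, a=0.25):          # a is a learned slope in practice
    return x if x > 0 else a * x

def leaky_relu(x, slope=0.01):  # fixed small slope for negative inputs
    return x if x > 0 else slope * x
```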
The backward operation of a layer needs to execute two parts of computation. One part uses the output data gradient, which may be sparsely represented, and the input data, which may be sparsely represented, to compute the gradient of the weights (used in the "weight update" step to update the weights of this layer). The other part uses the output data gradient, which may be sparsely represented, and the weights, which may be sparsely represented, to compute the input data gradient (used as the output data gradient of the next layer in the backward operation, so that it can perform its own backward operation).
The backward operation propagates the gradients back starting from the last layer, in the order opposite to the forward operation.
In an optional scheme, the output data gradient obtained by the backward computation of a layer can come from:
the gradient returned by the loss function (or cost function) at the end of the neural network;
the input data gradient of another layer;
the input data gradient of the same layer at the previous moment (the case of a recurrent neural network);
a layer can obtain output data gradients from multiple of the above sources simultaneously.
After the backward operation of the neural network has been executed, the gradient of the weights of each layer has been computed. In this step, the first input cache and the second input cache of the device are used to store the weights of the layer and the weight gradients, respectively, and then the operation unit uses the weight gradients to update the weights.
The operations mentioned above are all operations of one layer of the neural network. When a multilayer neural network is implemented, in the forward operation, after the forward operation of one layer of the artificial neural network is completed, the operation instruction of the next layer takes the output data computed in the operation unit as the input data of the next layer and performs the operation (or performs certain operations on that output data before using it as the input data of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of one layer of the artificial neural network is completed, the operation instruction of the next layer takes the input data gradient computed in the operation unit as the output data gradient of the next layer and performs the operation (or performs certain operations on that input data gradient before using it as the output data gradient of the next layer), and at the same time replaces the weights with the weights of the next layer. This is shown specifically in Fig. 1b, where the dotted arrows indicate the backward operation, the solid arrows indicate the forward operation, and the labels below each figure indicate its meaning.
The data involved in this application (i.e., the data in the data blocks) are data after compression processing, which can be realized specifically in the first mapping circuit and the second mapping circuit. It should be understood that, since a neural network is an algorithm with a high computation amount and a high memory access amount, the more weights there are, the more the computation amount and the memory access amount increase. In particular, for small weights (for example 0, or weights smaller than a set value), compression processing needs to be performed on these small-weight data in order to increase the computation rate and reduce overhead. In practice, data compression processing is most effective when applied to sparse neural networks, for example in reducing the workload of data computation, reducing data overhead, and increasing the data computation rate.
Taking the input data as an example, specific embodiments of data compression processing are described below. The input data includes but is not limited to at least one input neuron and/or at least one weight.
In the first embodiment:
After the first mapping circuit receives the first input data sent by the main processing circuit (specifically, a data block to be computed, such as a horizontal data block or a vertical data block), the first mapping circuit can process the first input data to obtain the processed first input data and mask data associated with the first input data. The mask data is used to indicate whether the absolute value of the first input data is greater than a first threshold, for example 0.5 or 0.
Specifically, when the absolute value of the first input data is greater than the first threshold, the input data is retained; otherwise the first input data is deleted or set to 0. For example, with an input matrix data block and a first threshold of 0.05, after processing by the first mapping circuit the processed matrix data block and the identification data block associated with the matrix data block (also called a mask matrix) can be obtained.
Further, to reduce the amount of transmitted data, when the main processing circuit distributes data to the basic processing circuits connected to it, it can transmit only the target data in the processed matrix data block (in this example 1, 0.06, and 0.5) and the identification data block associated with the matrix data block. In specific implementation, the main processing circuit can distribute the target data in the processed matrix data block to the basic processing circuits according to a set rule, for example sending successively in row order or successively in column order, which this application does not limit. Correspondingly, after a basic processing circuit receives the target data and the identification data block associated with the target data, it restores the processed matrix data block according to the set rule (for example row order). In this example, a basic processing circuit can, from the received data (1, 0.06, and 0.5) and the identification data block, determine the matrix data block to which the data corresponds (i.e., the matrix data block processed by the first mapping circuit in the main processing circuit).
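The thresholding, mask generation, and row-order restoration described above can be sketched as follows. This is a behavioral sketch in NumPy; the matrix values and function names are ours, chosen so that the surviving target data matches the 1, 0.06, 0.5 of the example, and are not taken from the circuit itself.

```python
import numpy as np

def first_mapping(block, threshold):
    """Sketch of the first mapping circuit: threshold, mask, target data."""
    mask = (np.abs(block) > threshold).astype(np.int8)   # identification (mask) data block
    processed = np.where(mask.astype(bool), block, 0.0)  # processed matrix data block
    target = block[mask.astype(bool)]                    # data actually transmitted (row order)
    return processed, mask, target

def restore(target, mask):
    """Sketch of the basic processing circuit restoring the block in row order."""
    out = np.zeros(mask.shape)
    out[mask.astype(bool)] = target
    return out

block = np.array([[1.0, 0.03, 0.0],
                  [0.0, 0.06, 0.5]])
processed, mask, target = first_mapping(block, 0.05)  # target: [1.0, 0.06, 0.5]
restored = restore(target, mask)                      # equals processed
```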
In embodiments of the present invention, the first input data can be the horizontal data block and/or the vertical data block.
Correspondingly, the second mapping circuit processes the second input data using the mask data associated with the first input data, to obtain the processed second input data, where the first input data is different from the second input data. For example, when the first input data is at least one weight, the second input data can be at least one input neuron; or, when the first input data is at least one input neuron, the second input data can be at least one weight.
In embodiments of the present invention, the second input data is different from the first input data; the second input data can be any of the following: a horizontal data block, a basic data block, a vertical data block, and a partial vertical data block.
For example, when the first input data is a horizontal data block, the second input data is a partial vertical data block. Suppose the second input data is a matrix data block; after processing with the mask matrix of the above example, the processed partial vertical data block is obtained. Since in practical applications the dimensions of the matrix data blocks involved in the input data are large, the blocks here are only for illustration and do not constitute a limitation.
In the second embodiment:
The first mapping circuit can be used to process both the first input data and the second input data, to obtain the processed first input data together with first mask data associated with the first input data, and the processed second input data together with second mask data associated with the second input data. The first mask data or second mask data is used to indicate whether the absolute value of the first or second input data is greater than a second threshold, the second threshold being set by the user side or the device side, for example 0.05 or 0.
The processed first input data or second input data can be compressed input data, or can be the input data before processing. For example, the first input data is a horizontal data block, such as the matrix data block in the above example; after processing by the first mapping circuit, the processed horizontal data block is obtained. The processed horizontal data block here can be the original matrix data block, or can be the matrix data block after compression processing. It should be understood that, since the purpose of this application is to reduce the amount of transmitted data and improve the data processing efficiency in the basic processing circuits, the processed input data (for example the processed basic data block or partial vertical data block) should preferably be compressed data. Preferably, the data that the main processing circuit sends to the basic processing circuits is specifically the target data in the processed input data; the target data can specifically be data whose absolute value is greater than a preset threshold, or non-zero data, and so on.
Correspondingly, in the basic processing circuit, the second mapping circuit can obtain connection identifier data from the first mask data associated with the first input data and the second mask data associated with the second input data. The connection identifier data is used to indicate the positions where the absolute values of both the first input data and the second input data are greater than a third threshold, the third threshold being set by the user side or the device side, for example 0.05 or 0. Further, the second mapping circuit can process the received first input data and second input data respectively according to the connection identifier data, to obtain the processed first input data and the processed second input data.
For example, the first input data is a matrix data block, and the second input data block is likewise a matrix data block. After processing by the first mapping circuit, the first mask data block associated with the first input data and the processed first input data block are obtained; correspondingly, the second mask data block associated with the second input data and the processed second input data block are obtained. Correspondingly, to increase the data transfer rate, the main processing circuit can send to the basic processing circuit only the target data (1, 0.06, and 0.5) in the processed first input data block together with the first mask data block associated with the first input data block, and at the same time send the target data (1, 1.1, 0.6, 0.3, and 0.5) in the processed second input data block together with the second mask data block associated with the second input data block.
Correspondingly, after receiving the above data, the basic processing circuit can use the second mapping circuit to perform an element-wise AND operation on the first mask data block and the second mask data block, obtaining the connection identifier data block. In the basic processing circuit, the first data block to which the target data corresponds (i.e., the first data block processed by the first mapping circuit) can be determined from the first mask data block and the target data in the received first data block; correspondingly, the second data block to which the target data corresponds (i.e., the second data block processed by the first mapping circuit) can be determined from the second mask data block and the target data in the received second data block. Then, after the second mapping circuit obtains the connection identifier data block, it applies the connection identifier data block to the determined first data block and the determined second data block respectively, to obtain the first data block and the second data block processed by the second mapping circuit.
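The element-wise AND of the two mask blocks and the subsequent filtering of both restored data blocks can be sketched as follows; the mask and data values are illustrative (the real blocks are much larger), and the interpretation of "applying" the connection identifier as zeroing the dropped positions is our reading of the text.

```python
import numpy as np

# First and second mask data blocks, as produced by the first mapping circuit.
mask1 = np.array([[1, 0, 1],
                  [0, 1, 1]], dtype=np.int8)
mask2 = np.array([[1, 1, 0],
                  [1, 1, 1]], dtype=np.int8)

# Second mapping circuit: element-wise AND gives the connection identifier
# data block, marking positions where both inputs are significant.
connection = mask1 & mask2            # [[1, 0, 0], [0, 1, 1]]

# Apply it to both (already restored) input data blocks.
data1 = np.array([[1.0, 0.0, 0.06],
                  [0.0, 0.3, 0.5 ]])
data2 = np.array([[2.0, 1.1, 0.0 ],
                  [0.6, 0.3, 0.5 ]])
out1 = data1 * connection             # positions dropped by either mask become 0
out2 = data2 * connection
```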
In the third embodiment:
The first mapping circuit may not be provided in the main processing circuit; instead, the main processing circuit can send third input data, together with prestored third mask data associated with the third input data, to the basic processing circuits connected to it. The second mapping circuit is provided in the basic processing circuits. A specific embodiment of the data compression processing performed by the second mapping circuit is described below.
It should be understood that the third input data includes but is not limited to a basic data block, a partial vertical data block, a vertical data block, and so on. Similarly, in a neural network processor, the third input data can also be at least one weight and/or at least one input neuron, which this application does not limit.
In the second mapping circuit, the second mapping circuit can process the third input data according to the received third mask data associated with the third input data, to obtain the processed third input data, so that related operations, such as an inner product operation, can subsequently be executed on the processed third input data.
For example, the third input data received by the second mapping circuit is a matrix data block, and the prestored third identification data block associated with the third input data (likewise a mask matrix data block) corresponds to it. Further, the second mapping circuit processes the third input data block according to the third identification data block, obtaining the processed third input data block.
In addition, the input neurons and output neurons mentioned in the embodiments of the present invention do not mean the neurons in the input layer and the neurons in the output layer of the entire neural network. For any two adjacent layers in the neural network, the neurons in the lower layer of the network feed-forward operation are the input neurons, and the neurons in the upper layer of the network feed-forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, with K = 1, 2, 3, ..., L-1. For layer K and layer K+1, layer K is called the input layer, and the neurons in this layer are the above input neurons; layer K+1 is called the output layer, and the neurons in this layer are the above output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
In the fourth embodiment:
No mapping circuit is provided in the main processing circuit; the first mapping circuit and the second mapping circuit are provided in the basic processing circuits. For the data processing of the first mapping circuit and the second mapping circuit, refer to the first to third embodiments described above, which will not be repeated here.
Optionally, there is also a fifth embodiment. In the fifth embodiment, no mapping circuit is provided in the basic processing circuits; the first mapping circuit and the second mapping circuit are provided in the main processing circuit. For the data processing of the first mapping circuit and the second mapping circuit, refer to the first to third embodiments described above, which will not be repeated here. That is, the compression processing of the data is completed in the main processing circuit, and the processed input data (specifically, the processed neurons and processed weights) is sent to the basic processing circuits, so that the basic processing circuits correspondingly execute arithmetic operations using the processed input data.
The specific structure of the mapping circuits involved in this application is described below. Figs. 6a and 6b show two possible mapping circuits. The mapping circuit shown in Fig. 6a includes a comparator and selectors; this application does not limit the numbers of comparators and selectors. Fig. 6a shows one comparator and two selectors. The comparator is used to determine whether the input data meets a preset condition. The preset condition can be set by the user side or the device side as described above, for example that the absolute value of the input data is greater than or equal to a preset threshold. If the preset condition is met, the comparator determines that the input data is allowed to be output, and the mask data associated with that input data is 1; otherwise it determines that the input data is not output, or defaults the input data to 0, and correspondingly the mask data associated with that input data is 0. That is, after the comparator, the mask data associated with the input data is known.
Further, after the comparator has judged the preset condition on the input data, the obtained mask data can be input into a selector, so that the selector uses the mask data to decide whether to output the corresponding input data, that is, to obtain the processed input data.
Taking the input data being a matrix data block as an example, as in Fig. 6a, the comparator can judge the preset condition on each data item in the matrix data block, obtaining the identification data block (mask matrix) associated with the matrix data block. Further, the first selector screens the matrix data block using the identification data block: the data in the matrix data block whose absolute values are greater than or equal to the preset threshold (meeting the preset condition) are retained, and the remaining data are deleted, outputting the processed matrix data block. Optionally, the second selector uses the identification data block to process other input data (for example a second matrix data block), for example performing an element-wise AND operation, retaining the data in the second matrix data block whose absolute values are greater than or equal to the preset threshold, and outputting the processed second matrix data block.
It should be understood that, corresponding to the first and second embodiments above, the specific structure of the first mapping circuit can include at least one comparator and at least one selector, for example the comparator and the first selector of Fig. 6a in the above example; the specific structure of the second mapping circuit can include one or more selectors, for example the second selector of Fig. 6a in the above example.
Fig. 6b shows the structural schematic diagram of another mapping circuit. As in Fig. 6b, this mapping circuit includes selectors; the number of selectors is not limited and can be one or more. Specifically, a selector is used to select, according to the mask data associated with the input data, the input data to be output: the data in the input data whose absolute values are greater than or equal to a preset threshold are output, and the remaining data are deleted/not output, so as to obtain the processed input data.
Taking the input data being a matrix data block as an example, the matrix data block and the identification data block associated with the matrix data block are input into the mapping circuit; the selector can select from the matrix data block according to the identification data block, outputting the data whose absolute values are greater than or equal to the preset threshold and not outputting the remaining data, thereby outputting the processed matrix data block.
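Behaviorally, the comparator and selectors of Figs. 6a and 6b can be modeled as follows. This is a software sketch under our own naming; the hardware is of course not implemented this way, and the threshold and data values are illustrative.

```python
def comparator(data, threshold):
    """Fig. 6a comparator: emit the mask bit for each input item."""
    return [1 if abs(x) >= threshold else 0 for x in data]

def selector(data, mask):
    """Fig. 6a/6b selector: output only the items whose mask bit is 1."""
    return [x for x, m in zip(data, mask) if m]

data = [1.0, 0.02, -0.5, 0.0]
mask = comparator(data, 0.05)                  # mask for the first block
kept = selector(data, mask)                    # first selector output
other = selector([0.3, 0.7, 0.9, 0.1], mask)   # second selector reuses the mask
```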
It should be understood that the structure shown in Fig. 6b can be applied to the second mapping circuit in the third embodiment above, that is, the specific structure of the second mapping circuit in the third embodiment above can include at least one selector. Similarly, the first mapping circuit and the second mapping circuit designed into the main processing circuit and the basic processing circuits can be obtained by combining or splitting the functional components shown in Figs. 6a and 6b, which this application does not limit.
Based on the foregoing embodiments, the specific implementation of several neural network forward operations is described below by way of example. When the first operation instruction is a convolution instruction, the input data block (i.e., the data in the input data block) is the convolution input data, and the weight data (block) is the convolution kernel. The convolution operation is described below; in the figures, one square represents one data item. The input data is shown in Fig. 2a (N samples, each sample having C channels, and the feature map of each channel having height H and width W); the weights, i.e., the convolution kernels, are shown in Fig. 2b (M convolution kernels, each with C channels, and with height and width KH and KW respectively). For the N samples of the input data, the rule of the convolution operation is the same; the following explains the process of performing the convolution operation on one sample. On one sample, each of the M convolution kernels performs the same operation, and each convolution kernel operation obtains one planar feature map; the M convolution kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of one sample, and the kernel is then slid along the H and W directions. For example, Fig. 2c shows the corresponding diagram of an inner product operation performed at the lower-right-corner position of a convolution kernel in one sample of the input data; Fig. 2d shows the convolution position slid one cell to the left, and Fig. 2e shows the convolution position slid one cell upward.
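The sliding-window inner products described above correspond to the usual direct convolution. A plain-loop NumPy sketch follows, assuming stride 1 and no padding (the text does not fix these parameters):

```python
import numpy as np

def conv_forward(x, kernels):
    """x: (N, C, H, W) input; kernels: (M, C, KH, KW). M feature maps per sample."""
    N, C, H, W = x.shape
    M, _, KH, KW = kernels.shape
    out = np.zeros((N, M, H - KH + 1, W - KW + 1))
    for n in range(N):                            # same rule for every sample
        for m in range(M):                        # each kernel yields one feature map
            for i in range(H - KH + 1):
                for j in range(W - KW + 1):       # slide along the H and W directions
                    # inner product of the kernel with one window position
                    out[n, m, i, j] = np.sum(x[n, :, i:i+KH, j:j+KW] * kernels[m])
    return out

y = conv_forward(np.ones((1, 2, 4, 4)), np.ones((3, 2, 2, 2)))
# y.shape == (1, 3, 3, 3); every entry is 2*2*2 = 8
```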
Specifically, the convolution process can be handled using the chip structure shown in Fig. 3. The first mapping circuit of the main processing circuit can process the data in some or all of the convolution kernels of the weights, obtaining the corresponding mask data and the processed weight data (i.e., the processed data in some or all of the convolution kernels of the weights).
The control circuit of the main processing circuit sends the data in some or all of the convolution kernels of the weights (the data can be the original weight data or the processed weight data) through the horizontal data input interface to those basic processing circuits directly connected to the main processing circuit (also called basic units); at the same time, the control circuit also sends the mask data associated with that data to the basic processing circuits connected to the main processing circuit.
In an optional scheme, the control circuit of the main processing circuit sends the data of one convolution kernel of the weights to a certain basic processing circuit, one number or a part of the numbers at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of the 3rd row, the 2nd sends the 2nd number of the 3rd-row data, the 3rd sends the 3rd number of the 3rd row, ...; or the 1st transmission sends the first two numbers of the 3rd row, the 2nd sends the 3rd and 4th numbers of the 3rd row, the 3rd sends the 5th and 6th numbers of the 3rd row, ...). At the same time, the control circuit also sends the mask data corresponding to that convolution kernel of the weights to that basic processing circuit, one number or a part of the data at a time, in the same way as above.
In an optional scheme, another situation is that the control circuit of the main processing circuit sends the data of several convolution kernels of the weights to a certain basic processing circuit, one number or a part of the numbers at a time for each kernel (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4, and 5; the 2nd sends the 2nd number of each of rows 3, 4, and 5; the 3rd sends the 3rd number of each of rows 3, 4, and 5, ...; or the 1st transmission sends the first two numbers of each of rows 3, 4, and 5; the 2nd sends the 3rd and 4th numbers of each of rows 3, 4, and 5; the 3rd sends the 5th and 6th numbers of each of rows 3, 4, and 5, ...). Correspondingly, the control circuit also sends the mask data associated with those convolution kernels of the weights to that basic processing circuit, one number or a part of the data at a time, using the same method as above.
The control circuit of the main processing circuit divides the input data according to convolution positions, and sends the data at some or all of the convolution positions in the input data through the vertical data input interface to those basic processing circuits directly connected to the main processing circuit. Correspondingly, the control circuit can likewise divide the mask data associated with the input data according to convolution positions, and at the same time also send the mask data corresponding to the data at some or all of the convolution positions in the input data to the basic processing circuits electrically connected to the main processing circuit.
In an optional scheme, the control circuit of the main processing circuit sends the data of a certain convolution position in the input data, together with the mask data associated with that data, to a certain basic processing circuit, one number or a part of the numbers at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of the 3rd column, the 2nd sends the 2nd number of the 3rd-column data, the 3rd sends the 3rd number of the 3rd column, ...; or the 1st transmission sends the first two numbers of the 3rd column, the 2nd sends the 3rd and 4th numbers of the 3rd column, the 3rd sends the 5th and 6th numbers of the 3rd column, ...).
In an optional scheme, another situation is that the control circuit of the main processing circuit sends the data of several convolution positions in the input data, together with the mask data associated with that data, to a certain basic processing circuit, one number or a part of the numbers at a time for each position (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of columns 3, 4, and 5; the 2nd sends the 2nd number of each of columns 3, 4, and 5; the 3rd sends the 3rd number of each of columns 3, 4, and 5, ...; or the 1st transmission sends the first two numbers of each of columns 3, 4, and 5; the 2nd sends the 3rd and 4th numbers of each of columns 3, 4, and 5; the 3rd sends the 5th and 6th numbers of each of columns 3, 4, and 5, ...).
After a basic processing circuit receives weight data (specifically, data of a convolution kernel in the weight, referred to as weight data for short, or the mask data associated with that weight data), it transmits the data through its horizontal data output interface to the next basic processing circuit connected to it. After a basic processing circuit receives input data (which may be the input data sent by the main processing circuit together with the mask data associated with that input data), it transmits the data through its vertical data output interface to the next basic processing circuit connected to it.
Specifically, the control circuit of the main processing circuit may send the input data together with the mask data associated with the input data to a basic processing circuit, and the basic processing circuit receives the input data and its associated mask data.
Each basic processing circuit performs an operation on the received data. Specifically, the basic processing circuit may enable its second mapping circuit to obtain connection identifier data from the mask data associated with the input data and the mask data associated with the weight data (i.e., the mask data associated with the convolution kernel in the weight), and then use the connection identifier data to select, from the input data and the weight data, the data whose absolute values are greater than a preset threshold for the multiplication operation.
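As a purely illustrative software sketch of the scheme above (the function names `make_mask` and `masked_multiply`, and the convention of a mask bit of 1 for values kept, are assumptions for illustration and not part of the disclosure), the connection identifier can be modeled as the element-wise AND of the input-data mask and the weight mask, with multiplications performed only at positions kept by both:

```python
import numpy as np

def make_mask(data, threshold=0.0):
    # 1 where |value| > threshold, 0 elsewhere: the mask data an
    # illustrative first mapping circuit would associate with `data`.
    return (np.abs(data) > threshold).astype(np.int8)

def masked_multiply(inp, inp_mask, wt, wt_mask):
    # Connection identifier data = element-wise AND of the two masks;
    # only positions marked 1 in both masks contribute a multiplication.
    conn = inp_mask & wt_mask
    return np.where(conn == 1, inp * wt, 0.0)

inp = np.array([0.0, 2.0, -3.0, 0.0])
wt  = np.array([1.0, 0.0,  4.0, 5.0])
out = masked_multiply(inp, make_mask(inp), wt, make_mask(wt))
```

Only one of the four positions survives both masks here, so a single multiplication is actually needed, which is the point of selecting by the connection identifier.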
In an optional scheme, when the amount of data received by a basic processing circuit (specifically, a data block to be computed, such as the data of a convolution kernel in the weight together with its associated mask data, or the input data together with its associated mask data) exceeds a preset threshold, that basic processing circuit no longer receives new input data, e.g. the data of certain convolution kernels in the weight and the corresponding associated mask data that the main processing circuit would subsequently transmit, until the basic processing circuit again has sufficient buffer/storage space, after which it receives the data newly sent by the main processing circuit.
In an optional scheme, a basic processing circuit computes the multiplication of one or more groups of two data each time, and then accumulates the result into its register and/or on-chip cache;
In an optional scheme, a basic processing circuit computes the inner product of one or more groups of two vectors each time, and then accumulates the result into its register and/or on-chip cache;
After a basic processing circuit computes a result, it can transmit the result out through its data output interface;
In an optional scheme, the result can be the final result or an intermediate result of the inner product operation;
Specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of the basic processing circuit that can output directly to the main processing circuit.
After a basic processing circuit receives a computation result from another basic processing circuit, it transmits that result to the other basic processing circuit or the main processing circuit connected to it;
The result is output in the direction in which output can reach the main processing circuit directly (for example, the bottom row of basic processing circuits output their results directly to the main processing circuit, and the other basic processing circuits pass their operation results downward through their vertical output interfaces);
The main processing circuit receives the inner product operation results of the respective basic processing circuits to obtain the output result.
The following describes how the device shown in Fig. 3 completes a tensor-multiply-tensor operation. A tensor is the same as the previously described data block, and may be any one or a combination of a matrix, a vector, a three-dimensional data block, a four-dimensional data block and a higher-dimensional data block; the specific implementations of the matrix-multiply-matrix and matrix-multiply-vector operations are shown in Fig. 4b and Fig. 4d respectively.
Referring to Fig. 4a, Fig. 4a is a schematic diagram of a matrix-multiply-matrix operation. The forward operation indicated by the first operation instruction is a matrix-multiply-matrix operation, the input data is the first matrix of the matrix-multiply-matrix operation, and the weight is the second matrix of the matrix-multiply-matrix operation.
Referring to Fig. 4b, the matrix-multiply-matrix operation is completed using the device shown in Fig. 3;
The following describes computing the multiplication of a matrix S of size M rows by L columns and a matrix P of size L rows by N columns (each row of matrix S has the same length as each column of matrix P, as shown in Fig. 2d), where the neural network computing device possesses K basic processing circuits:
Step S401b: the control circuit of the main processing circuit distributes each row of data in matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received data in its on-chip cache and/or register;
In an optional scheme, the data of the matrix S are processed data. Specifically, the main processing circuit enables the first mapping circuit to process matrix S, so as to obtain the processed matrix S and the first identifier (mask) matrix associated with matrix S. Alternatively, the first mapping circuit of the main processing circuit processes matrix S according to a prestored first mask matrix associated with matrix S, obtaining the processed matrix S. Further, the control circuit sends each row of data in the processed matrix S, together with the corresponding associated identifier data of that row in the first mask matrix, jointly to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data whose absolute values are greater than a preset threshold, or the non-zero data, in the processed matrix S, thereby reducing the amount of data transmitted.
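A minimal software sketch of what such a first mapping circuit produces (the name `first_mapping` and the packed-values representation are assumptions for illustration; the actual circuit implementation is not limited to this):

```python
import numpy as np

def first_mapping(S, threshold=0.0):
    # Derive the identifier (mask) matrix for S and keep only the
    # values worth transmitting (|v| > threshold): the processed
    # matrix S plus its associated mask.
    mask = (np.abs(S) > threshold).astype(np.int8)
    processed = S * mask           # processed matrix S (small values zeroed)
    packed = S[mask.astype(bool)]  # only the data actually sent
    return processed, mask, packed

S = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 3.0]])
processed_S, mask_S, sent = first_mapping(S)
```

Here only three of six values need to be transmitted; the mask matrix lets a receiving basic processing circuit reconstruct which positions they occupy.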
In an optional scheme, if the number of rows M of S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of matrix S to each of M basic processing circuits respectively; optionally, it also sends at the same time the identifier data of the corresponding row in the first identifier matrix;
In an optional scheme, if M > K, the control circuit of the main processing circuit distributes the data of one or more rows of matrix S to each basic processing circuit respectively; optionally, it also sends at the same time the identifier data of the corresponding row or rows in the first identifier matrix;
Mi rows of S are distributed to the i-th basic processing circuit, and the set of these Mi rows is denoted Ai; Fig. 2e shows the computation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example the i-th basic processing circuit:
The received matrix Ai distributed by the main processing circuit is stored in the register and/or on-chip cache of the i-th basic processing circuit. The advantage is that subsequent data transmission is reduced, computational efficiency is improved, and power consumption is reduced.
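The row-distribution rule for both cases (M ≤ K and M > K) can be sketched as follows; the round-robin assignment is one possible policy assumed for illustration, since the disclosure does not fix how rows are apportioned when M > K:

```python
def distribute_rows(M, K):
    # Distribute the M rows of matrix S over K basic processing
    # circuits: if M <= K each circuit receives at most one row,
    # otherwise rows are dealt out round-robin so each circuit
    # receives one or more rows (the set Ai for circuit i).
    assignment = [[] for _ in range(K)]
    for row in range(M):
        assignment[row % K].append(row)
    return assignment

groups = distribute_rows(5, 3)  # M > K: circuits get 2, 2 and 1 rows
```

With M = 5 and K = 3, circuits 0 and 1 each hold two rows of S while circuit 2 holds one, matching the "one row or multiple rows" case.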
Step S402b: the control circuit of the main processing circuit transmits the portions of matrix P to the respective basic processing circuits in a broadcast manner;
In an optional scheme, the data (portions) of the matrix P may be processed data. Specifically, the main processing circuit enables the first mapping circuit to process matrix P, obtaining the processed matrix P and the second identifier (mask) matrix associated with matrix P. Alternatively, the first mapping circuit of the main processing circuit processes matrix P according to a prestored second mask matrix associated with matrix P, obtaining the processed matrix P. Further, the control circuit sends the data (i.e., the portions) of the processed matrix P, together with the corresponding associated identifier data of those data in the second mask matrix, jointly to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data whose absolute values are greater than a preset threshold, or the non-zero data, in the processed matrix P, thereby reducing the amount of data transmitted.
In an optional scheme, each portion of matrix P may be broadcast only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of matrix P obtained this one time, completing the inner product operation corresponding to each row in matrix Ai. Reuse in this embodiment specifically means that a basic processing circuit uses the same data repeatedly in its computation; for example, reuse of the data of matrix P means that the data of matrix P are used multiple times.
In an optional scheme, the control circuit of the main processing circuit may broadcast the portions of matrix P multiple times into the registers or on-chip caches of the respective basic processing circuits, and the i-th basic processing circuit does not reuse the data of matrix P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai in several batches;
In an optional scheme, the control circuit of the main processing circuit may broadcast the portions of matrix P multiple times into the registers or on-chip caches of the respective basic processing circuits, and the i-th basic processing circuit partially reuses the data of matrix P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai;
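The "broadcast once, fully reuse" scheme can be sketched as follows (illustrative only; `full_reuse_matmul` is an assumed name, and a real basic processing circuit would interleave this with accumulation hardware rather than build whole result rows in software):

```python
import numpy as np

def full_reuse_matmul(Ai, P):
    # P is received a single time by the circuit holding Ai; every
    # row of Ai then reuses the same copy of P to form its row of
    # inner products (one inner product per column of P).
    out = []
    for row in Ai:  # each row completes all its inner products
        out.append([float(np.dot(row, P[:, j])) for j in range(P.shape[1])])
    return np.array(out)

Ai = np.array([[1.0, 2.0], [3.0, 4.0]])
P  = np.array([[5.0, 6.0], [7.0, 8.0]])
R = full_reuse_matmul(Ai, P)
```

Each element of P is read once per row of Ai but transmitted only once, which is the transmission saving the full-reuse scheme targets.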
In an optional scheme, each basic processing circuit, for example the i-th basic processing circuit, computes the inner products of the data of matrix Ai and the data of matrix P;
Step S403b: the accumulator circuit of each basic processing circuit accumulates the results of the inner product operations and transmits them back to the main processing circuit.
Optionally, before step S403b, the inner product operator circuit of a basic processing circuit needs to compute the inner products of the data of matrix S and matrix P; specifically, there are the following several embodiments.
In a specific embodiment, the basic processing circuit receives data of the processed matrix S together with the corresponding associated identifier data of those data in the first mask matrix, and at the same time also receives data of matrix P. Accordingly, the basic processing circuit enables the second mapping circuit to process the received data of matrix P based on the received identifier data in the first mask matrix, obtaining the data of the processed matrix P. Further, the basic processing circuit enables the inner product operator circuit to perform the inner product operation on the received data of the processed matrix S and the data of the processed matrix P, obtaining the result of the inner product operation.
In a specific embodiment, the basic processing circuit receives data of the processed matrix P together with the corresponding associated identifier data of those data in the second mask matrix, and at the same time also receives data of matrix S. Accordingly, the basic processing circuit enables the second mapping circuit to process the received data of matrix S based on the received identifier data in the second mask matrix, obtaining the data of the processed matrix S. Further, the basic processing circuit enables the inner product operator circuit to perform the inner product operation on the received data of the processed matrix P and the data of the processed matrix S, obtaining the result of the inner product operation.
In a specific embodiment, the basic processing circuit receives data of the processed matrix S together with the corresponding associated identifier data of those data in the first mask matrix, and at the same time also receives data of the processed matrix P together with the corresponding associated identifier data of those data in the second mask matrix. Accordingly, the basic processing circuit enables the second mapping circuit to obtain a relation identifier matrix from the received identifier data in the first mask matrix and in the second mask matrix, and then uses the identifier data in the relation identifier matrix to process the received data of matrix S and of matrix P respectively, obtaining the data of the processed matrix S and of the processed matrix P. Further, the inner product operator circuit is enabled to perform the inner product operation on the data of the processed matrix S and the data of the processed matrix P, obtaining the result of the inner product operation. For example, the i-th basic processing circuit receives matrix Ai, the identifier matrix Bi associated with Ai, matrix P, and the second identifier matrix associated with matrix P; at this time the second mapping circuit can be enabled to obtain the relation identifier matrix from Bi and the second identifier matrix, and then to process matrix Ai and matrix P, simultaneously or separately, using the relation identifier matrix, obtaining the processed matrix Ai and the processed matrix P. Then the inner product operator circuit is enabled to perform the inner product operation on the processed matrix Ai and the processed matrix P.
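For the third embodiment, the relation identifier along the shared dimension of one row of Ai and one column of P can be sketched as the element-wise AND of the two masks, so that only positions kept by both masks enter the inner product (names and the AND convention are illustrative assumptions):

```python
import numpy as np

def relation_inner_product(row, row_mask, col, col_mask):
    # Relation identifier = element-wise AND of the row's mask and
    # the column's mask along the shared dimension L; the inner
    # product operator only multiplies at positions kept by both.
    relation = row_mask & col_mask
    keep = relation.astype(bool)
    return float(np.dot(row[keep], col[keep]))

row = np.array([1.0, 2.0, 3.0, 4.0])
col = np.array([5.0, 6.0, 7.0, 8.0])
row_mask = np.array([1, 1, 0, 1], dtype=np.int8)
col_mask = np.array([1, 0, 1, 1], dtype=np.int8)
v = relation_inner_product(row, row_mask, col, col_mask)
```

Of the four positions, only the 1st and 4th survive both masks, so two multiplications replace four.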
In an optional scheme, a basic processing circuit may transmit each partial sum obtained by its inner product operations back to the main processing circuit for accumulation;
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit may also be kept in the register and/or on-chip cache of the basic processing circuit and accumulated there, and transmitted back to the main processing circuit after the accumulation is finished;
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit may also, in some cases, be kept in the register and/or on-chip cache of the basic processing circuit and accumulated there, and in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation is finished.
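The difference between the schemes is where the partial sums are accumulated. A sketch under assumed names (`inner_product_partials`, a chunk size of 2) of computing an inner product in partial sums and accumulating them locally before returning one value:

```python
def inner_product_partials(row, col, chunk=2):
    # The inner product is computed in chunks; each chunk yields a
    # partial sum that can either be sent back to the main circuit
    # immediately or kept and accumulated on-chip.
    parts = []
    for i in range(0, len(row), chunk):
        parts.append(sum(a * b for a, b in zip(row[i:i+chunk], col[i:i+chunk])))
    return parts

def accumulate_locally(parts):
    # "Accumulate in the basic circuit, return one number" strategy.
    total = 0.0
    for p in parts:
        total += p
    return total

parts = inner_product_partials([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
total = accumulate_locally(parts)
```

Returning `parts` element by element corresponds to the first scheme; returning only `total` corresponds to the second; mixing the two corresponds to the third.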
Referring to Fig. 4c, Fig. 4c is a schematic diagram of a matrix-multiply-vector operation. The forward operation indicated by the first operation instruction is a matrix-multiply-vector operation, the input data is the first matrix of the matrix-multiply-vector operation, and the weight is the vector of the matrix-multiply-vector operation. Referring to Fig. 4d, Fig. 4d provides an implementation method of matrix-multiply-vector, which may specifically include:
Step S401: the control circuit of the main processing circuit distributes each row of data in matrix S to one of the K basic processing circuits, and the basic processing circuit stores the received distributed data in the on-chip cache and/or register of the basic processing circuit;
In an optional scheme, the data of the matrix S are processed data. Specifically, the main processing circuit enables the first mapping circuit to process matrix S, obtaining the processed matrix S and the first identifier (mask) matrix associated with matrix S. Alternatively, the first mapping circuit of the main processing circuit processes matrix S according to a prestored first mask matrix associated with matrix S, obtaining the processed matrix S. Further, the control circuit sends each row of data in the processed matrix S, together with the corresponding associated identifier data of that row in the first mask matrix, jointly to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data whose absolute values are greater than a preset threshold, or the non-zero data, in the processed matrix S, thereby reducing the amount of data transmitted. For example, the set of rows of the processed matrix S distributed to the i-th basic processing circuit is denoted Ai, comprising Mi rows in total; correspondingly, an identifier matrix Bi corresponding to Ai is also distributed at the same time, where Bi is a part of the first mask matrix and comprises at least Mi rows.
In an optional scheme, if the number of rows M of matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of matrix S to each of the K basic processing circuits respectively; optionally, it also sends at the same time the identifier data of the corresponding row in the first identifier matrix;
In an optional scheme, if M > K, the control circuit of the main processing circuit distributes the data of one or more rows of matrix S to each basic processing circuit respectively; optionally, it also sends at the same time the identifier data of the corresponding row or rows in the first identifier matrix;
The set of rows of S distributed to the i-th basic processing circuit is denoted Ai, comprising Mi rows in total; Fig. 2c shows the computation to be performed on the i-th basic processing circuit.
In an optional scheme, in each basic processing circuit, for example the i-th basic processing circuit, the received distributed data such as matrix Ai may be stored in the register and/or on-chip cache of the i-th basic processing circuit. The advantage is that the subsequent transmission of the distributed data is reduced, computational efficiency is improved, and power consumption is reduced.
Step S402: the control circuit of the main processing circuit transmits the portions of vector P to the K basic processing circuits in a broadcast manner;
In an optional scheme, the data (portions) of the vector P may be processed data. Specifically, the main processing circuit enables the first mapping circuit to process vector P, obtaining the processed vector P and the second identifier (mask) matrix associated with vector P. Alternatively, the first mapping circuit of the main processing circuit processes vector P according to a prestored second mask matrix associated with vector P, obtaining the processed vector P. Further, the control circuit sends the data (i.e., the portions) of the processed vector P, together with the corresponding associated identifier data of those data in the second mask matrix, jointly to one or more of the K basic processing circuits. When the main processing circuit sends data to a basic processing circuit, it may specifically send only the data whose absolute values are greater than a preset threshold, or the non-zero data, in the processed vector P, thereby reducing the amount of data transmitted.
In an optional scheme, the control circuit of the main processing circuit may broadcast each portion of vector P only once into the register or on-chip cache of each basic processing circuit, and the i-th basic processing circuit fully reuses the data of vector P obtained this one time, completing the inner product operation corresponding to each row in matrix Ai. The advantage is that the repeated transmission of vector P from the main processing circuit to the basic processing circuits is reduced, execution efficiency is improved, and transmission power consumption is reduced.
In an optional scheme, the control circuit of the main processing circuit may broadcast the portions of vector P multiple times into the registers or on-chip caches of the respective basic processing circuits, and the i-th basic processing circuit does not reuse the data of vector P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai in several batches. The advantage is that the amount of vector P data transmitted in a single transmission within a basic processing circuit is reduced, the capacity of the basic processing circuit's cache and/or registers can be reduced, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
In an optional scheme, the control circuit of the main processing circuit may broadcast the portions of vector P multiple times into the registers or on-chip caches of the respective basic processing circuits, and the i-th basic processing circuit partially reuses the data of vector P obtained each time, completing the inner product operations corresponding to the rows of matrix Ai. The advantage is that the amount of data transmitted from the main processing circuit to the basic processing circuits is reduced, the amount of data transmitted within the basic processing circuits is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
Step S403: the inner product operator circuits of the K basic processing circuits compute the inner products of the data of matrix S and vector P; for example, the i-th basic processing circuit computes the inner products of the data of matrix Ai and the data of vector P;
In a specific embodiment, the basic processing circuit receives data of the processed matrix S together with the corresponding associated identifier data of those data in the first mask matrix, and at the same time also receives data of vector P. Accordingly, the basic processing circuit enables the second mapping circuit to process the received data of vector P based on the received identifier data in the first mask matrix, obtaining the data of the processed vector P. Further, the basic processing circuit enables the inner product operator circuit to perform the inner product operation on the received data of the processed matrix S and the data of the processed vector P, obtaining the result of the inner product operation. For example, the i-th basic processing circuit receives matrix Ai, the identifier matrix Bi associated with Ai, and vector P; at this time the second mapping circuit can be enabled to process vector P using Bi to obtain the processed vector P, and then the inner product operator circuit is enabled to perform the inner product operation on matrix Ai and the processed vector P.
In a specific embodiment, the basic processing circuit receives data of the processed vector P together with the corresponding associated identifier data of those data in the second mask matrix, and at the same time also receives data of matrix S. Accordingly, the basic processing circuit enables the second mapping circuit to process the received data of matrix S based on the received identifier data in the second mask matrix, obtaining the data of the processed matrix S. Further, the basic processing circuit enables the inner product operator circuit to perform the inner product operation on the received data of the processed vector P and the data of the processed matrix S, obtaining the result of the inner product operation. For example, the i-th basic processing circuit receives matrix Ai, the processed vector P and the second identifier matrix associated with vector P; at this time the second mapping circuit can be enabled to process Ai using the second identifier matrix to obtain the processed matrix Ai, and then the inner product operator circuit is enabled to perform the inner product operation on the processed matrix Ai and the processed vector P.
In a specific embodiment, the basic processing circuit receives data of the processed matrix S together with the corresponding associated identifier data of those data in the first mask matrix, and at the same time also receives data of the processed vector P together with the corresponding associated identifier data of those data in the second mask matrix. Accordingly, the basic processing circuit enables the second mapping circuit to obtain a relation identifier matrix from the received identifier data in the first mask matrix and in the second mask matrix, and then uses the identifier data in the relation identifier matrix to process the received data of matrix S and of vector P respectively, obtaining the data of the processed matrix S and of the processed vector P. Further, the inner product operator circuit is enabled to perform the inner product operation on the data of the processed matrix S and the data of the processed vector P, obtaining the result of the inner product operation. For example, the i-th basic processing circuit receives matrix Ai, the identifier matrix Bi associated with Ai, vector P and the second identifier matrix associated with vector P; at this time the second mapping circuit can be enabled to obtain the relation identifier matrix from Bi and the second identifier matrix, and then to process matrix Ai and vector P, simultaneously or separately, using the relation identifier matrix, obtaining the processed matrix Ai and the processed vector P. Then the inner product operator circuit is enabled to perform the inner product operation on the processed matrix Ai and the processed vector P.
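The first of these embodiments can be sketched as follows (illustrative only; `masked_matvec` is an assumed name): each row of Ai carries its row of the mask matrix Bi, and the second mapping circuit uses that mask to select the elements of the broadcast vector P before the inner product is taken:

```python
import numpy as np

def masked_matvec(Ai, Bi, P):
    # Ai arrives with its mask matrix Bi (the part of the first mask
    # matrix for these rows); for each row, only the elements of the
    # broadcast vector P at positions marked 1 are kept, then the
    # inner product operator multiplies and accumulates.
    out = []
    for row, mask in zip(Ai, Bi):
        keep = mask.astype(bool)
        out.append(float(np.dot(row[keep], P[keep])))
    return out

Ai = np.array([[1.0, 0.0, 3.0], [2.0, 4.0, 0.0]])
Bi = np.array([[1, 0, 1], [1, 1, 0]], dtype=np.int8)
P  = np.array([10.0, 20.0, 30.0])
y = masked_matvec(Ai, Bi, P)
```

Because the masked-out entries of Ai are zero, the result equals the full inner product Ai·P while performing only the multiplications the mask keeps.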
Step S404: the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulation results, and transmit the accumulation results back to the main processing circuit in fixed-point form.
In an optional scheme, each basic processing circuit may transmit each partial sum obtained by its inner product operations (a partial sum is a part of an accumulation result; for example, if the accumulation result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) back to the main processing circuit for accumulation. The advantage is that the amount of computation inside the basic processing circuits is reduced, improving the operation efficiency of the basic processing circuits.
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit may also be kept in the register and/or on-chip cache of the basic processing circuit and accumulated there, and transmitted back to the main processing circuit after the accumulation is finished. The advantage is that the amount of data transmitted between the basic processing circuits and the main processing circuit is reduced, operation efficiency is improved, and data transmission power consumption is reduced.
In an optional scheme, the partial sums obtained by the inner product operations performed by each basic processing circuit may also, in some cases, be kept in the register and/or on-chip cache of the basic processing circuit and accumulated there, and in some cases be transmitted to the main processing circuit for accumulation, and be transmitted back to the main processing circuit after the accumulation is finished. The advantage is that the amount of data transmitted between the basic processing circuits and the main processing circuit is reduced, operation efficiency is improved, data transmission power consumption is reduced, the amount of computation inside the basic processing circuits is reduced, and the operation efficiency of the basic processing circuits is improved.
Neural network training method
All or part of the data involved in the neural network training process may be processed data, specifically obtained through processing by the first mapping circuit and/or the second mapping circuit as described in the foregoing embodiments, which is not repeated here.
It should be noted that at different moments of the training process (specifically, different iteration counts or the moment of initialization), at different stages of the training process (i.e., forward or backward operation), in different layers, for different data blocks within the same layer (i.e., multiple input data blocks and output data blocks), or for different sub-blocks into which the same data block is divided, the data referred to may all be processed data blocks.
The following illustrates a specific implementation of neural network training with a practical example. Fig. 1b is a schematic diagram of the computation of single-layer neural network training: the solid lines in Fig. 1b show the forward operation of a single-layer neural network, and the dotted lines show the backward operation of the single-layer neural network. Specifically, the forward operation of the layer is first executed according to the input data and the weight or parameters to obtain the output data, and then a preset-rule operation is performed on the output data (the preset rule may be set by the manufacturer according to its own needs; the specific operation steps of the preset-rule operation are not limited here) to obtain the output data gradient of the layer. Then, the backward operation of this neural network layer can be executed according to the input data, the weight or parameters, and the output data gradient of the layer, obtaining the input data gradient of the layer and the gradient of the weight or parameters; the weight or parameters of the layer are then updated using the computed gradient of the weight or parameters, which completes the training of this neural network layer.
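A minimal software sketch of one such single-layer training step, assuming for illustration a linear layer y = x·W (the activation function and the preset-rule operation that produces the output data gradient are outside the sketch, as in the text; `train_single_layer` and the learning rate are assumptions):

```python
import numpy as np

def train_single_layer(x, W, grad_out, lr=0.1):
    # Forward operation of this layer, then backward operation:
    # weight gradient, input data gradient, and the weight update.
    y = x @ W                        # forward operation -> output data
    grad_W = np.outer(x, grad_out)   # gradient of the weight
    grad_x = W @ grad_out            # input data gradient of this layer
    W_new = W - lr * grad_W          # update this layer's weight
    return y, grad_x, W_new

x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
grad_out = np.array([0.5, -0.5])
y, grad_x, W_new = train_single_layer(x, W, grad_out)
```

The returned input data gradient is what a multilayer network would pass on as the output data gradient of the previous layer.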
In specific implementation, the data involved in the forward operation or backward computation process may be processed data. Taking the forward operation as an example, the technical solution provided in the embodiments of the present application may determine, according to the operation instruction of the layer, whether to start the relevant mapping circuits (specifically the first mapping circuit and/or the second mapping circuit) to process the input data and/or the weight, and then execute this layer's operation using the processed input data and/or weight. For the principle of the above data processing, refer to the related elaboration in the foregoing embodiments, which is not repeated here. It should be understood that executing the neural network operation with processed data can significantly reduce the transmission overhead between calculators; in addition, for a calculator, data storage with fewer bits also occupies less space, i.e. the storage overhead can be smaller, and the amount of computation can also be reduced, i.e. the computation overhead can be reduced, so the overheads of both computation and storage can be reduced.
Taking Fig. 4e and Fig. 4f as examples, schematic structural diagrams of neural network training for matrix multiplication and convolution are given below. The operation mode of the layer shown in Fig. 4e is matrix multiplication, and the operation mode of the layer shown in Fig. 4f is a convolution operation. Suppose the input data and the weight of this layer are matrices; for convenience of explanation, the input data here is taken as matrix I and the weight as matrix W, where output data = matrix I * matrix W. Suppose matrix I and matrix W are sparse matrices of large dimensions; a sparse matrix here refers to a matrix containing many data whose absolute values are less than or equal to a preset threshold, or many data equal to 0. "Large dimensions" can be understood as meaning that the sum of the numbers of columns and rows of matrix I and matrix W is large, so that matrix I and matrix W occupy too much space in memory and/or registers and the amount of computation is also large. If processed according to conventional matrix multiplication at this point, the amount of data computation is large; to improve data processing efficiency, matrix I and matrix W need to be processed first, and only then is the matrix multiplication operation executed.
For example, if matrix I is a 1000*1000 sparse matrix and matrix W is also a 1000*1000 sparse matrix, the sum of the numbers of columns and rows is 2000; this quantity is very large, the corresponding amount of computation is larger still, and the inner product computations of the matrix-multiply-matrix operation amount to 10^9 multiplications. For this technical solution, since matrix I and matrix W are so large, it is impossible to transmit and compute them all at once, so the same data may be transmitted several times in order to compute all the data. If the data processing method is applied to the two sparse matrices at this point, the dimensions (i.e., the data amount) of matrix I and matrix W can be greatly reduced, so the amount of data transmitted and the amount of computation can be significantly reduced, which in turn reduces transmission overhead and computation overhead.
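A rough, illustrative estimate of the saving from mask-based processing for such a matrix (the 32-bit value slot, the 1-bit-per-entry mask cost, and the 5% density figure are all assumptions for illustration, not figures from the disclosure):

```python
def sparse_savings(M, density):
    # A dense M x M matrix stores M*M values; the processed form
    # stores only the kept values plus a 1-bit mask per position
    # (mask cost counted here in 32-bit value-slots).
    dense_slots = M * M
    kept = int(M * M * density)
    mask_slots = (M * M) / 32  # 1 bit per entry, 32 bits per slot
    return dense_slots, kept + mask_slots

dense, processed = sparse_savings(1000, 0.05)  # assume 5% entries kept
ratio = processed / dense
```

Under these assumptions the processed representation is about 8% of the dense one, which is the kind of reduction in transmitted data and computation the paragraph above describes.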
Fig. 4g and Fig. 4h show specific structural schematic diagrams of multilayer neural network training. As shown in Fig. 4g, the direction of the dotted arrows shows a backward operation. For the backward operation, the output of the backward operation is the output data gradient. When the output data gradient belongs to the last layer of the iterative computation of the multilayer neural network, the output data gradient is specifically obtained by performing a preset operation on the output data of the last layer of the current iteration (the preset operation may be set by the manufacturer according to its own needs; the specific operation steps of the preset operation are not limited here). When the output data gradient does not belong to the last layer of the iterative computation of the multilayer neural network, for example when the output data gradient belongs to the n-th layer of the current iteration, the output data gradient of the n-th layer may be the input data gradient computed by the backward operation of the (n+1)-th layer. Fig. 4h can be understood similarly; Fig. 4h may specifically be a schematic diagram of multilayer convolutional neural network training (including forward and backward operations), where the other operations in the diagram may represent layers other than the convolutional layer, or operations between layers, without limitation.
The present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network comprising multiple layers, the integrated circuit chip device comprising a processing circuit and an external interface;
the external interface is configured to receive a training instruction;
the processing circuit is configured to determine first-layer input data and first-layer weight data according to the training instruction, and to perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight data to obtain an n-th output result;
the processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, an n-th reverse operation instruction of the reverse operation of the n-th layer, and the n-th layer input data and n-th layer weight group data required by the n-th reverse operation instruction; and to perform the n-layer reverse operation of the neural network according to the n-th reverse operation instruction, the n-th output result gradient, the n-th layer input data and the n-th layer weight group data to obtain the n weight gradients of the n-layer operation;
the processing circuit is further configured to update the n weights of the n-layer operation using the n weight gradients.
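The flow just described — an n-layer forward operation, an output result gradient, an n-layer reverse operation yielding one weight gradient per layer, and a weight update — can be sketched in plain Python. Scalar "layers" and a squared-error preset operation are assumptions for brevity; the device itself operates on data blocks and weight groups, not scalars.

```python
# Hedged sketch of the train step the processing circuit implements:
# forward pass, output-result gradient, reverse pass producing n weight
# gradients, then an update of the n weights. Scalar layers are an
# assumed simplification.

def train_step(weights, x, target, lr=0.1):
    # n-layer forward operation: layer i computes a_i = w_i * a_{i-1}
    acts = [x]
    for w in weights:
        acts.append(w * acts[-1])

    # n-th output result gradient (assumed squared-error preset operation)
    grad = acts[-1] - target

    # n-layer reverse operation: one weight gradient per layer
    w_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        w_grads[i] = grad * acts[i]   # dL/dw_i
        grad = grad * weights[i]      # input gradient -> previous layer

    # update the n weights with the n weight gradients
    return [w - lr * g for w, g in zip(weights, w_grads)]


new_w = train_step([0.5, 1.5], x=2.0, target=0.0)
print(new_w)
```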
The present disclosure also discloses a neural network computation device, which includes one or more chips as shown in Fig. 3, configured to obtain the data to be operated on and control information from other processing devices, to perform the specified neural network operation, and to pass the execution result to peripheral equipment through an I/O interface. The peripheral equipment includes, for example, a camera, a display, a mouse, a keyboard, a network card, a WiFi interface, or a server. When more than one chip as shown in Fig. 3 is included, the chips may be linked and may transmit data through a specific structure — for example, interconnected and transmitting data through a PCIe bus — to support larger-scale neural network operations. In this case, the chips may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection mode may be any interconnection topology. Optionally, the neural network computation device has high compatibility and can be connected to various types of servers through a PCIe interface.
In one embodiment, the present disclosure provides a chip (such as Fig. 5) for executing all or part of the embodiments provided in the method embodiments described above.
In one embodiment, the present disclosure provides an electronic device comprising functional units for executing all or part of the embodiments in the method embodiments described above.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, camera, server, webcam, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device. The vehicle includes an aircraft, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound instrument and/or an electrocardiograph.
The specific embodiments described above further explain in detail the purpose, technical solutions and beneficial effects of the present disclosure. It should be understood that the foregoing are merely specific embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent substitution, improvement, etc., made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (12)

1. An integrated circuit chip device for performing training of a neural network, the neural network comprising n layers, where n is an integer greater than or equal to 2, wherein the integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are each configured to perform compression processing of data in the neural network operation;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to the n basic processing circuits of the 1st row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the 1st column;
the integrated circuit chip device is configured to receive a training instruction, to determine first-layer input data and first-layer weight group data according to the training instruction, and to perform the n-layer forward operation of the neural network on the first-layer input data and the first-layer weight group data to obtain an n-th output result of the forward operation;
the main processing circuit is further configured to obtain an n-th output result gradient according to the n-th output result; to obtain, according to the training instruction, an n-th reverse operation instruction of the reverse operation of the n-th layer, and the n-th layer input data and n-th layer weight group data required by the n-th reverse operation instruction; to divide the n-th output result gradient, the n-th layer input data and the n-th layer weight group data into a vertical data block and a horizontal data block according to the n-th reverse operation instruction; to determine, according to the operation control of the n-th reverse operation instruction, to start the first mapping circuit to process a first data block to obtain a processed first data block, the first data block including the horizontal data block and/or the vertical data block; and to send, according to the n-th reverse operation instruction, the processed first data block to at least one of the basic processing circuits connected to the main processing circuit;
the plurality of basic processing circuits are configured to determine, according to the operation control of the n-th reverse operation instruction, whether to start the second mapping circuit to process a second data block; to perform the operations in the neural network in a parallel manner according to the processed second data block to obtain operation results; and to transmit the operation results to the main processing circuit through the basic processing circuits connected to the main processing circuit; the second data block is a data block that the basic processing circuit determines to receive from the main processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit is further configured to process the operation results to obtain an n-th layer weight group gradient and an n-th layer input data gradient, and to update the n-th layer weight group data using the n-th layer weight group gradient;
the integrated circuit chip device is further configured to use the n-th layer input data gradient as the (n-1)-th output result gradient of the (n-1)-th layer, to perform the reverse operations of the remaining n-1 layers to obtain the weight group gradients of those layers, and to update the weight group data of each respective layer using its weight group gradient; the weight group data includes at least two weights.
2. The integrated circuit chip device according to claim 1, wherein, when the first data block includes the horizontal data block and the vertical data block,
the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block to obtain a processed horizontal data block with an identification data block associated with the horizontal data block, and a processed vertical data block with an identification data block associated with the vertical data block; to split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks; to distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to it;
the basic processing circuit is configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the basic data block and the identification data block associated with the vertical data block; to process the basic data block and the vertical data block according to the connection identification data block; to perform the reverse operation on the processed basic data block and the processed vertical data block to obtain an operation result; and to send the operation result to the main processing circuit.
3. The integrated circuit chip device according to claim 1, wherein, when the first data block includes the horizontal data block,
the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block to obtain a processed horizontal data block and an identification data block associated with the horizontal data block, or to start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block to obtain a processed horizontal data block; to split the processed horizontal data block and the identification data block associated with the horizontal data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks; to distribute the plurality of basic data blocks and their respectively associated identification data blocks to the basic processing circuits connected to it; and to broadcast the vertical data block to the basic processing circuits connected to it;
the basic processing circuit is configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, to perform the reverse operation on the processed vertical data block and the basic data block to obtain an operation result, and to send the operation result to the main processing circuit.
4. The integrated circuit chip device according to claim 1, wherein, when the first data block includes the vertical data block,
the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block to obtain a processed vertical data block and an identification data block associated with the vertical data block, or to start the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block to obtain a processed vertical data block; to split the horizontal data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the basic processing circuits connected to it; and to broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected to it;
the basic processing circuit is configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block; to perform the reverse operation on the processed basic data block and the processed vertical data block to obtain an operation result; and to send the operation result to the main processing circuit.
5. The integrated circuit chip device according to any one of claims 2-4, wherein
the basic processing circuit is specifically configured to perform the reverse operation on the basic data block and the vertical data block to obtain a reverse operation result, to accumulate the reverse operation result to obtain an operation result, and to send the operation result to the main processing circuit;
the main processing circuit is configured to accumulate the operation results to obtain an accumulation result, and to arrange the accumulation result to obtain the instruction result.
6. The integrated circuit chip device according to any one of claims 2-4, wherein
the main processing circuit is specifically configured to broadcast the vertical data block or the processed vertical data block to the plurality of basic processing circuits in a single broadcast; or
the main processing circuit is specifically configured to divide the vertical data block or the processed vertical data block into a plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks to the plurality of basic processing circuits in multiple broadcasts.
7. The integrated circuit chip device according to any one of claims 2-4, wherein
the main processing circuit is specifically configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks respectively associated with the plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the processed vertical data block;
the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; to process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block; and to perform the reverse operation on the processed vertical data block and the processed basic data block;
alternatively, the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and to perform the reverse operation on the processed basic data block and the partial vertical data block.
8. The integrated circuit chip device according to claim 1, wherein
the main processing circuit is specifically configured to: if the n-th reverse operation instruction is a multiplication instruction, determine that the n-th layer input data and the n-th layer weight group data are horizontal data blocks and that the n-th output result gradient is a vertical data block; and if the n-th reverse operation instruction is a convolution instruction, determine that the n-th layer input data and the n-th layer weight group data are vertical data blocks and that the n-th output result gradient is a horizontal data block.
9. The integrated circuit chip device according to any one of claims 1-7, wherein
the n-layer reverse operation further includes one or any combination of: a bias operation, a fully-connected operation, a GEMM operation, a GEMV operation, and an activation operation.
10. The integrated circuit chip device according to claim 9, wherein
the n-th output result gradient is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block and an n-dimensional data block;
the n-th layer input data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block and an n-dimensional data block;
the n-layer weight group data is one or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block and an n-dimensional data block.
11. A chip, wherein the chip integrates the device according to any one of claims 1-10.
12. An operation method of a neural network, wherein the method is applied to an integrated circuit chip device, the integrated circuit chip device comprising the integrated circuit chip device according to any one of claims 1-10, and the integrated circuit chip device is configured to perform the training operation of the neural network.
CN201810164844.8A 2018-02-27 2018-02-27 Integrated circuit chip device and related product Active CN110197275B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810164844.8A CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product
PCT/CN2019/076088 WO2019165946A1 (en) 2018-02-27 2019-02-25 Integrated circuit chip device, board card and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810164844.8A CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
CN110197275A true CN110197275A (en) 2019-09-03
CN110197275B CN110197275B (en) 2020-08-04

Family

ID=67751313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810164844.8A Active CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product

Country Status (1)

Country Link
CN (1) CN110197275B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021068243A1 (en) * 2019-10-12 2021-04-15 Baidu.Com Times Technology (Beijing) Co., Ltd. Method and system for accelerating ai training with advanced interconnect technologies

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1290367A (en) * 1998-02-05 2001-04-04 Intellix N-tuple or RAM based neural network classification system and method
CN106126481A (en) * 2016-06-29 2016-11-16 Huawei Technologies Co., Ltd. A kind of computing engines and electronic equipment
CN106447034A (en) * 2016-10-27 2017-02-22 Institute of Computing Technology, Chinese Academy of Sciences Neural network processor based on data compression, design method and chip
CN107609641A (en) * 2017-08-30 2018-01-19 Tsinghua University Sparse neural network framework and its implementation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1290367A (en) * 1998-02-05 2001-04-04 Intellix N-tuple or RAM based neural network classification system and method
CN106126481A (en) * 2016-06-29 2016-11-16 Huawei Technologies Co., Ltd. A kind of computing engines and electronic equipment
CN106447034A (en) * 2016-10-27 2017-02-22 Institute of Computing Technology, Chinese Academy of Sciences Neural network processor based on data compression, design method and chip
CN107609641A (en) * 2017-08-30 2018-01-19 Tsinghua University Sparse neural network framework and its implementation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUNJI CHEN: "DaDianNao: A Machine-Learning Supercomputer", 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021068243A1 (en) * 2019-10-12 2021-04-15 Baidu.Com Times Technology (Beijing) Co., Ltd. Method and system for accelerating ai training with advanced interconnect technologies
US11544067B2 (en) 2019-10-12 2023-01-03 Baidu Usa Llc Accelerating AI training by an all-reduce process with compression over a distributed system

Also Published As

Publication number Publication date
CN110197275B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN110197270A (en) Integrated circuit chip device and Related product
CN109993301A (en) Neural metwork training device and Related product
CN108170640A (en) The method of its progress operation of neural network computing device and application
CN110909872B (en) Integrated circuit chip device and related products
CN107957975A (en) A kind of computational methods and Related product
CN109993291A (en) Integrated circuit chip device and Related product
CN110197275A (en) Integrated circuit chip device and Related product
CN113837922A (en) Computing device, data processing method and related product
US11651202B2 (en) Integrated circuit chip device and related product
US11710031B2 (en) Parallel processing circuits for neural networks
CN110197265A (en) Integrated circuit chip device and Related product
TWI787430B (en) Integrated circuit chip apparatus, chip, electronic device, and computing method of neural network
CN110197273A (en) Integrated circuit chip device and Related product
CN110197271A (en) Integrated circuit chip device and Related product
US11704544B2 (en) Integrated circuit chip device and related product
CN110197272A (en) Integrated circuit chip device and Related product
CN108037908A (en) A kind of computational methods and Related product
CN110197274A (en) Integrated circuit chip device and Related product
TWI768168B (en) Integrated circuit chip device and related products
CN110197267A (en) Neural network processor board and Related product
CN113807510B (en) Integrated circuit chip device and related products
CN110197269A (en) Integrated circuit chip device and Related product
CN110197268A (en) Integrated circuit chip device and Related product
CN110197266A (en) Integrated circuit chip device and Related product
CN110197264A (en) Neural network processor board and Related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant