CN110197275B - Integrated circuit chip device and related product - Google Patents


Info

Publication number
CN110197275B
Authority
CN
China
Prior art keywords
data block
data
processing circuit
basic
processed
Prior art date
Legal status
Active
Application number
CN201810164844.8A
Other languages
Chinese (zh)
Other versions
CN110197275A (en)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201810164844.8A priority Critical patent/CN110197275B/en
Priority to PCT/CN2019/076088 priority patent/WO2019165946A1/en
Publication of CN110197275A publication Critical patent/CN110197275A/en
Application granted granted Critical
Publication of CN110197275B publication Critical patent/CN110197275B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure provides an integrated circuit chip device and related products. The device is used for training a neural network that comprises n layers, where n is an integer greater than or equal to 2. The integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits. The main processing circuit comprises a first mapping circuit, at least one of the plurality of basic processing circuits comprises a second mapping circuit, and the first mapping circuit and the second mapping circuit are used for compressing data in the neural network operation. The plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the mth row, and the m basic processing circuits of the 1st column. The technical scheme provided by the disclosure has the advantages of a small calculation amount and low power consumption.

Description

Integrated circuit chip device and related product
Technical Field
The present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
Background
Artificial Neural Networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is often referred to directly as a neural network or neural-like network. A neural network is a computational model consisting of a large number of interconnected nodes (or neurons). Existing neural network operations rely on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) to realize the forward operation of the neural network, which entails a large amount of computation and high power consumption.
Disclosure of Invention
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed and efficiency of a computing device.
In a first aspect, an integrated circuit chip apparatus for performing training of a neural network is provided, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip apparatus includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits (namely, part or all of the basic processing circuits) includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are used for compressing data in the neural network operation;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the mth row, and the m basic processing circuits of the 1st column;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the training instruction; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; determining to start a first mapping circuit to process a first data block according to the operation control of the nth reverse operation instruction to obtain a processed first data block; the first data block comprises the horizontal data block and/or the vertical data block; sending the processed first data block to at least one basic processing circuit in basic processing circuits connected with the main processing circuit according to the nth reverse operation instruction;
the plurality of basic processing circuits are used for determining whether to start a second mapping circuit to process a second data block according to the operation control of the nth reverse operation instruction, executing the operation in the neural network in a parallel mode according to the processed second data block to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit; the second data block is determined by the basic processing circuit to receive the data block sent by the main processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit is also used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
the integrated circuit chip device is also used for taking the nth layer input data gradient as the (n-1)th layer output result gradient, executing the (n-1)th layer reverse operation to obtain the (n-1)th layer weight group gradient, and updating the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient, wherein the weight group data comprises at least two weights.
In a second aspect, a neural network computing device is provided, which includes one or more integrated circuit chip devices provided in the first aspect.
In a third aspect, there is provided a combined processing apparatus comprising: the neural network operation device provided by the second aspect, a universal interconnection interface, and a general-purpose processing device;
the neural network operation device is connected with the general-purpose processing device through the universal interconnection interface.
In a fourth aspect, a chip is provided that integrates the apparatus of the first aspect, the apparatus of the second aspect, or the apparatus of the third aspect.
In a fifth aspect, an electronic device is provided, which comprises the chip of the fourth aspect.
It can be seen that, in the embodiments of the disclosure, a mapping circuit is provided to compress the data blocks before performing the operation, which saves transmission and computation resources; the scheme therefore has the advantages of low power consumption and a small computation amount.
Drawings
Fig. 1 is a schematic diagram of a training method of a neural network.
FIG. 1a is a schematic diagram of a forward operation of a neural network.
FIG. 1b is a schematic diagram of a neural network operation.
Fig. 2a is a schematic diagram of convolved input data.
Fig. 2b is a schematic diagram of a convolution kernel.
FIG. 2c is a diagram of an operation window of a three-dimensional data block of input data.
FIG. 2d is a diagram of another exemplary window for inputting a three-dimensional data block of data.
FIG. 2e is a diagram of another operation window of a three-dimensional data block of input data.
Fig. 3 is a schematic structural diagram of a neural network chip.
Fig. 4a is a schematic diagram of matrix multiplication.
Fig. 4b is a flow chart of a method for multiplying a matrix by a matrix.
FIG. 4c is a diagram of a matrix multiplied by a vector.
FIG. 4d is a flow chart of a method for multiplying a matrix by a vector.
Fig. 4e is a schematic diagram of neural network training.
FIG. 4f is a schematic diagram of another neural network training scheme.
FIG. 4g is a diagram illustrating the forward and backward operations of the neural network.
FIG. 4h is a diagram of a multi-layer structure for neural network training.
Fig. 5 is a schematic structural diagram of a neural network chip according to an embodiment of the present disclosure;
Fig. 6a and Fig. 6b are schematic structural diagrams of two mapping circuits provided by embodiments of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those skilled in the art, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
In the apparatus provided in the first aspect, the integrated circuit chip apparatus is configured to receive a training instruction, determine first layer input data and first layer weight group data according to the training instruction, and perform n layers of forward operations of a neural network on the first layer input data and the first layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the training instruction; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; determining to start a first mapping circuit to process a first data block according to the operation control of the nth reverse operation instruction to obtain a processed first data block; the first data block comprises the horizontal data block and/or the vertical data block; sending the processed first data block to at least one basic processing circuit in basic processing circuits connected with the main processing circuit according to the nth reverse operation instruction;
the plurality of basic processing circuits are used for determining whether to start a second mapping circuit to process a second data block according to the operation control of the nth reverse operation instruction, executing the operation in the neural network in a parallel mode according to the processed second data block to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit; the second data block is determined by the basic processing circuit to receive the data block sent by the main processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit is also used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
the integrated circuit chip device is also used for taking the nth layer input data gradient as the (n-1)th layer output result gradient, executing the (n-1)th layer reverse operation to obtain the (n-1)th layer weight group gradient, and updating the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient, wherein the weight group data comprises at least two weights.
In the apparatus provided in the first aspect, when the first data block includes a horizontal data block and a vertical data block, the main processing circuit is specifically configured to: start the first mapping circuit to process the horizontal data block and the vertical data block, obtaining a processed horizontal data block with its associated identification data block and a processed vertical data block with its associated identification data block; split the processed horizontal data block and its associated identification data block to obtain a plurality of basic data blocks and the identification data blocks associated with them; distribute the basic data blocks and their associated identification data blocks to the basic processing circuits connected with the main processing circuit; and broadcast the processed vertical data block and its associated identification data block to the basic processing circuits connected with the main processing circuit. The identification data block may specifically be represented by a direct index or a step index, and optionally may also be represented in a List of Lists (LIL), Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELL, or hybrid (HYB) compression format.
Taking an identification data block represented by a direct index as an example, the identification data block may specifically be a data block composed of 0s and 1s, where 0 indicates that the absolute value of the corresponding data (such as a weight or an input neuron) in the data block is less than or equal to a first threshold, and 1 indicates that it is greater than the first threshold. The first threshold is set in a customized manner by the user side or the device side, for example 0.05 or 0.
To reduce the amount of data transmitted and improve transmission efficiency, when the main processing circuit sends data to the basic processing circuits, it may specifically distribute only the target data of the plurality of basic data blocks together with the identification data blocks respectively associated with them; optionally, it may likewise broadcast only the target data of the processed vertical data block together with the identification data block associated with the vertical data block. The target data refers to the data whose absolute value is greater than the first threshold in a data block, or equivalently the non-zero data in a processed data block (specifically a processed horizontal data block or a processed vertical data block).
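As an illustration of the direct-index representation and the target-data optimization above, here is a minimal Python sketch; the function names and example values are our own and not from the patent:

```python
import numpy as np

def build_mask(block: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Direct-index identification data block: 1 where |x| > threshold, else 0."""
    return (np.abs(block) > threshold).astype(np.uint8)

def extract_target_data(block: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Target data: only the retained values, flattened in row order."""
    return block[mask.astype(bool)]

block = np.array([[1.0, 0.01], [0.06, 0.5]])   # example values, not the patent's
mask = build_mask(block)                        # [[1, 0], [1, 1]]
target = extract_target_data(block, mask)       # [1.0, 0.06, 0.5]
```

Only `target` and `mask` would need to be transmitted; the zeros are implied by the mask.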
Correspondingly, the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the vertical data block and the identification data block associated with the basic data block, and to process the vertical data block and the basic data block according to the connection identification data block, obtaining a processed vertical data block and a processed basic data block; it then performs the reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result, and sends the operation result to the main processing circuit. The reverse operation includes, but is not limited to, one or any combination of the following: convolution (inner product) operation, product operation, bias operation, fully connected operation, GEMM operation, GEMV operation, and activation operation;
and the main processing circuit is used for processing the operation result to obtain the instruction result.
For example, the horizontal data block is an M1 x N1 matrix and a basic data block is an M2 x N2 matrix, where M1 > M2 and N1 > N2. Correspondingly, the identification data block associated with the horizontal data block is likewise an M1 x N1 matrix, and the identification data block associated with a basic data block is likewise an M2 x N2 matrix. Take a 2 x 2 matrix as the basic data block, set as

[2 x 2 basic data block; values not recoverable from the source image]

If the first threshold is 0.05, the identification data block associated with the basic data block is

[2 x 2 identification (mask) data block; values not recoverable from the source image]
The processing of the data blocks with respect to the first mapping circuit and the second mapping circuit will be described in detail later.
In the apparatus provided in the first aspect, when the first data block includes a horizontal data block, the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block, obtaining a processed horizontal data block and an identification data block associated with it, or to start the first mapping circuit to process the horizontal data block according to a pre-stored identification data block associated with the horizontal data block, obtaining a processed horizontal data block; to split the processed horizontal data block and its associated identification data block to obtain a plurality of basic data blocks and the identification data blocks associated with them; to distribute the plurality of basic data blocks and their associated identification data blocks to the basic processing circuits connected with the main processing circuit; and to broadcast the vertical data block to the basic processing circuits connected with the main processing circuit;
the basic processing circuit is specifically configured to start the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, so as to obtain a processed vertical data block; and performing reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result, and sending the operation result to the main processing circuit.
In an optional embodiment, the main processing circuit is further specifically configured to split the vertical data block, or the processed vertical data block together with the identification data block associated with it, to obtain a plurality of partial vertical data blocks and the identification data blocks associated with them; to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
Correspondingly, the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block; to process the partial vertical data block and the basic data block according to the connection identification data block, obtaining a processed partial vertical data block and a processed basic data block; and to perform the reverse operation on the processed partial vertical data block and the processed basic data block.
The connection identification data block is obtained by performing an element-wise AND operation on the identification data block associated with the basic data block and the identification data block associated with the partial vertical data block. Optionally, the connection identification data block indicates the positions at which the data of both data blocks (specifically, the basic data block and the vertical data block) have absolute values greater than the threshold. Details are given later, and a sketch follows the example below.
For example, the identification data block associated with the horizontal data block is the 2 x 3 matrix

[2 x 3 identification data block; values not recoverable from the source image]

the identification data block associated with the partial vertical data block is the 2 x 2 matrix

[2 x 2 identification data block; values not recoverable from the source image]

and the connection identification data block obtained correspondingly is

[connection identification data block; values not recoverable from the source image]
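As promised above, a minimal sketch of how a connection identification data block can be formed and applied, assuming same-shaped blocks for simplicity (illustrative Python with our own names and values, not the patent's implementation):

```python
import numpy as np

def connection_mask(mask_a: np.ndarray, mask_b: np.ndarray) -> np.ndarray:
    """Element-wise AND of two identification data blocks of the same shape."""
    return mask_a & mask_b

def apply_mask(block: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the positions whose connection identification is 0."""
    return block * mask

mask_basic   = np.array([[1, 0], [1, 1]], dtype=np.uint8)
mask_partial = np.array([[1, 1], [0, 1]], dtype=np.uint8)
conn = connection_mask(mask_basic, mask_partial)   # [[1, 0], [0, 1]]
# Only positions where both masks are 1 contribute to the subsequent inner product.
```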
In the apparatus provided in the first aspect, when the first data block includes a vertical data block, the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block, obtaining a processed vertical data block and an identification data block associated with it, or to start the first mapping circuit to process the vertical data block according to a pre-stored identification data block associated with the vertical data block, obtaining a processed vertical data block; to split the horizontal data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the basic processing circuits connected with the main processing circuit; and to broadcast the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected with the main processing circuit;
the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block; and performing reverse operation on the processed vertical data block and the processed basic data block to obtain an operation result, and sending the operation result to the main processing circuit.
In an optional embodiment, the main processing circuit is further specifically configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with them; to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the vertical data block or the processed vertical data block.
Correspondingly, the basic processing circuit is specifically configured to process the basic data block according to the identification data block associated with the partial vertical data block, obtaining a processed basic data block, and to perform the reverse operation on the processed basic data block and the partial vertical data block.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to send the vertical data block (specifically, the vertical data block or the processed vertical data block) to the basic processing circuit connected thereto through one broadcast.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform a reverse operation on the basic data block (which may be the basic data block or the processed basic data block) and the vertical data block to obtain a reverse operation result, accumulate the reverse operation result to obtain an operation result, and send the operation result to the main processing circuit.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform inverse operation processing on the basic data block and the vertical data block to obtain a processing result, accumulate the processing result to obtain an operation result, and send the operation result to the main processing circuit;
and the main processing circuit is used for accumulating the operation results to obtain accumulation results, and arranging the accumulation results to obtain the instruction results.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to divide the vertical data block into a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the base processing circuit by multiple times; the plurality of partial vertical data blocks are combined to form the vertical data block.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to determine that the input data is a horizontal data block and the weight data is a vertical data block if the type of the first operation instruction is a multiplication instruction; if the type of the first operation instruction is a convolution instruction, determining that the input data is a vertical data block, and the weight data is a horizontal data block.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform an inner product processing on the partial vertical data block (specifically, the partial vertical data block or the processed partial vertical data block) and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to multiplex the partial vertical data block n times, performing an inner product operation between that partial vertical data block and each of n basic data blocks to obtain n inner product results, accumulating each of them to obtain n partial operation results, and sending the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
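A minimal sketch of this multiplexing pattern, assuming the broadcast partial vertical data block and the basic data blocks are flat vectors (illustrative Python; names and values are ours):

```python
import numpy as np

def multiplexed_inner_products(partial_vertical, basic_blocks):
    """Reuse one broadcast block against n basic data blocks:
    take the inner product with each, accumulating per block."""
    results = []
    for basic in basic_blocks:                  # n distributed basic data blocks
        products = basic * partial_vertical     # element-wise products
        results.append(products.sum())          # accumulate -> one partial result
    return results                              # n partial operation results

v = np.array([1.0, 0.0, 0.5])
blocks = [np.array([2.0, 3.0, 4.0]), np.array([0.5, 1.0, 2.0])]
print(multiplexed_inner_products(v, blocks))    # [4.0, 1.5]
```

The broadcast data is fetched once and reused n times, which is the point of the multiplexing.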
In the apparatus provided in the first aspect, the n-layer reverse operation further includes one or any combination of a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
In an apparatus provided in the first aspect, the main processing circuit includes: a main register or a main on-chip cache circuit;
the base processing circuit includes: basic registers or basic on-chip cache circuits.
In an apparatus provided in the first aspect, the main processing circuit includes: one or any combination of vector arithmetic unit circuit, arithmetic logic unit circuit, accumulator circuit, matrix transposition circuit, direct memory access circuit or data rearrangement circuit.
In the apparatus provided in the first aspect, the nth output result gradient is: one or any combination of vector, matrix, three-dimensional data block, four-dimensional data block and n-dimensional data block;
the nth layer input data may be represented by a tensor, which may specifically be: one or any combination of vector, matrix, three-dimensional data block, four-dimensional data block and n-dimensional data block;
the n layers of weight group data may be represented by tensors, which may specifically be: one or any combination of vectors, matrices, three-dimensional data blocks, four-dimensional data blocks, and n-dimensional data blocks.
As shown in fig. 1, the steps of neural network training are:
each layer in a (multi-layer) neural network performs forward operations in turn;
sequentially performing reverse operation according to the sequence of the opposite layers to obtain a weight gradient;
updating the weights used in the forward operation with the computed weight gradients.
These steps constitute one iteration of neural network training, and the whole training process repeats this iteration many times (i.e., multiple iterative computations).
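A minimal sketch of this training loop follows; the layer and data-loader interfaces are hypothetical and not part of the patent:

```python
def train(network, data_loader, learning_rate, num_iterations):
    """One possible shape of the iterative training described above."""
    for _ in range(num_iterations):
        x, target = data_loader.next_batch()
        # 1. Each layer performs the forward operation in turn.
        activations = [x]
        for layer in network.layers:
            activations.append(layer.forward(activations[-1]))
        # 2. Reverse operation in the opposite layer order yields weight gradients.
        grad = network.loss_gradient(activations[-1], target)
        for layer in reversed(network.layers):
            weight_grad, grad = layer.backward(grad)
            # 3. Update the forward-operation weights with the computed gradient.
            layer.weights -= learning_rate * weight_grad
```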
Referring to fig. 3, fig. 3 shows an integrated circuit chip device for performing training of a neural network, where the neural network includes n layers and n is an integer greater than or equal to 2. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit includes a first mapping circuit, at least one of the plurality of basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are used for compressing data in the neural network operation;
the plurality of basic processing circuits are distributed in an array; each basic processing circuit is connected with the adjacent basic processing circuits, and the main processing circuit is connected with the n basic processing circuits of the 1st row, the n basic processing circuits of the mth row, and the m basic processing circuits of the 1st column;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the training instruction; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; determining to start a first mapping circuit to process a first data block according to the operation control of the nth reverse operation instruction to obtain a processed first data block; the first data block comprises the horizontal data block and/or the vertical data block; sending the processed first data block to at least one basic processing circuit in basic processing circuits connected with the main processing circuit according to the nth reverse operation instruction;
the plurality of basic processing circuits are used for determining whether to start a second mapping circuit to process a second data block according to the operation control of the nth reverse operation instruction, executing the operation in the neural network in a parallel mode according to the processed second data block to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit; the second data block is determined by the basic processing circuit to receive the data block sent by the main processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit is also used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
the integrated circuit chip device is also used for taking the nth layer input data gradient as the (n-1)th layer output result gradient, executing the (n-1)th layer reverse operation to obtain the (n-1)th layer weight group gradient, and updating the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient, wherein the weight group data comprises at least two weights.
As shown in fig. 1a, for the forward operation of the neural network provided by the embodiment of the present disclosure, each layer uses its own input data and weight to calculate according to the operation rule specified by the type of the layer to obtain corresponding output data;
the forward operation process (also called inference) of the neural network is a process of processing input data of each layer by layer and obtaining output data through certain calculation, and has the following characteristics:
input to a certain layer:
the input of a certain layer can be input data of a neural network;
the input of a certain layer may be the output of other layers;
the input of a certain layer may be the output of the same layer at the previous time step (corresponding to the case of a recurrent neural network);
a layer may obtain input from a plurality of said input sources simultaneously;
output of a certain layer:
the output of a certain layer can be used as the output result of the neural network;
the output of a certain layer may be the input of other layers;
the output of a certain layer may be the input of the same layer at the next time step (in the case of a recurrent neural network);
the output of a certain layer can output results to the plurality of output directions;
specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:
convolutional layers (i.e., performing convolution operations);
fully-connected layers (i.e., performing fully-connected operations);
a normalization layer, including types such as the LRN (Local Response Normalization) layer and the BN (Batch Normalization) layer;
a pooling layer;
an activation layer, including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer;
the reverse operations of the layers, where each layer needs to perform two parts of computation: one part uses the (possibly sparsely represented) output data gradient and the (possibly sparsely represented) input data to compute the gradient of the weights (used to update the weights of this layer in the "weight update" step), and the other part uses the (possibly sparsely represented) output data gradient and the (possibly sparsely represented) weights to compute the gradient of the input data (used as the output data gradient of the next layer in the backward pass, for its own reverse operation);
the backward operation reversely transfers the gradient from the last layer in the reverse order of the forward operation.
In one alternative, the inverse calculated output data gradient for a layer may be from:
the gradient returned by the loss function (also called cost function) at the end of the neural network;
input data gradients for other layers;
the input data gradient of the same layer at the previous time step (corresponding to the case of the recurrent neural network);
a layer may simultaneously acquire output data gradients from a plurality of said sources;
after the reverse operation of the neural network is executed, the weight gradient of each layer has been computed; in this step, the first input cache and the second input cache of the device are used to store the weights of a layer and the gradients of those weights respectively, and the operation unit then updates the weights using the weight gradients;
In the forward operation, after the artificial neural network operation of the previous layer is completed, the operation instruction of the next layer takes the output data computed in the operation unit as the input data of the next layer (or performs some operation on that output data first), and the weights are simultaneously replaced by the weights of the next layer. In the reverse operation, after the reverse operation of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the input data gradient computed in the operation unit as the output data gradient of the next layer (or performs some operation on that input data gradient first), and the weights are simultaneously replaced by the weights of the next layer. This is illustrated in fig. 1b, where the dashed arrows indicate the backward operation, the solid arrows indicate the forward operation, and the labels below the figure explain its notation.
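For a fully connected layer, the two parts of the backward computation described above reduce to two matrix products. A minimal sketch, assuming y = W @ x and using our own names (this is standard backpropagation, not text from the patent):

```python
import numpy as np

def fc_backward(x, W, grad_y):
    """Backward pass of a fully connected layer y = W @ x.
    Returns the weight gradient (for the weight-update step) and the
    input-data gradient (the output-data gradient of the previous layer)."""
    grad_W = np.outer(grad_y, x)   # gradient of the weights
    grad_x = W.T @ grad_y          # gradient of the input data
    return grad_W, grad_x
```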
The data referred to in this application (i.e., the data in the data blocks) is data after compression processing; the compression may be implemented in the first mapping circuit and the second mapping circuit. It should be understood that, since a neural network is an algorithm with a high computation amount and high memory access, the more weights there are, the more the computation amount and the memory-access amount increase. In particular, for weights of small value (e.g., 0, or weights smaller than a set value), the data need to be compressed in order to increase the computation rate and reduce overhead. In practical applications, data compression applied to sparse neural networks shows the most obvious effect, such as reducing the workload of data computation, reducing data overhead, and improving the data computation rate.
The specific embodiment related to the data compression processing is explained by taking input data as an example. The input data includes, but is not limited to, at least one input neuron and/or at least one weight.
In a first embodiment:
after the first mapping circuit receives first input data (specifically, a data block to be calculated, such as a horizontal data block or a vertical data block, which is sent by the main processing circuit), the first mapping circuit may process the first input data to obtain identification mask data associated with the processed first input data by the first input data, where the mask data is used to indicate whether an absolute value of the first input data is greater than a first threshold, such as 0.5, 0, and so on.
Specifically, when the absolute value of the first input data is greater than the first threshold, the data is retained; otherwise the data is deleted or set to 0. For example, suppose the input matrix data block is

[input matrix data block; values not recoverable from the source image]

and the first threshold is 0.05. After processing by the first mapping circuit, the processed matrix data block is obtained as

[processed matrix data block; values not recoverable from the source image]

and the identification data block (also called the mask matrix) associated with the matrix data block is

[mask matrix; values not recoverable from the source image]

Further, to reduce the amount of data transmitted, when the main processing circuit distributes data to the basic processing circuits connected with it, it may send only the target data of the processed matrix data block (here 1, 0.06, and 0.5) together with the identification data block associated with the matrix data block. In a specific implementation, the main processing circuit may distribute the target data of the processed matrix data block to the basic processing circuits according to a set rule, for example sequentially in row order or in column order, which is not limited in this application. Accordingly, after receiving the target data and the associated identification data block, a basic processing circuit restores them to the processed matrix data block according to the same set rule (for example, row order). In this example, from the received data (1, 0.06, and 0.5) and the identification data block

[mask matrix; values not recoverable from the source image]

the basic processing circuit can recover the matrix data block corresponding to the data (i.e., the matrix data block processed by the first mapping circuit in the main processing circuit) as

[processed matrix data block; values not recoverable from the source image]
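The restoration rule on the receiving side can be sketched as follows (row order assumed, as in the example; the function name and values are ours):

```python
import numpy as np

def restore_block(target_data, mask):
    """Scatter the received target data back into the positions where the
    identification (mask) data block is 1, filling the rest with zeros."""
    block = np.zeros(mask.shape, dtype=float)
    block[mask.astype(bool)] = target_data   # row-order placement
    return block

mask = np.array([[1, 0], [1, 1]], dtype=np.uint8)
print(restore_block([1.0, 0.06, 0.5], mask))
# [[1.   0.  ]
#  [0.06 0.5 ]]
```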
In an embodiment of the present invention, the first input data may be a horizontal data block and/or a vertical data block.
Correspondingly, the second mapping circuit can process the second input data by using the identification data associated with the first input data, thereby obtaining the processed second input data; wherein the first input data is different from the second input data. For example, when the first input data is at least one weight, then the second input data may be at least one input neuron; alternatively, when the first input data is at least one input neuron, then the second input data may be at least one weight.
In an embodiment of the present invention, the second input data is different from the first input data, and the second input data may be any one of the following: a horizontal data block, a basic data block, a vertical data block, and a partial vertical data block.
For example, when the first input data is a horizontal data block, the second input data is a partial vertical data block. Suppose the second input data is the matrix data block

[matrix data block; values not recoverable from the source image]

Using the mask matrix from the above example

[mask matrix; values not recoverable from the source image]

to process it accordingly, the processed partial vertical data block is obtained as

[processed partial vertical data block; values not recoverable from the source image]
Since in practical applications the dimensions of the matrix data blocks involved in the input data are large, the examples in this application are only illustrative and should not be construed as limiting.
In a second embodiment:
the first mapping circuit may be configured to process first input data and second input data to obtain processed first input data and first identification mask data associated with the first input data, processed second input data and second identification mask data associated with the second input data. Wherein, the first mask data or the second mask data is used to indicate whether the absolute value of the first or the second input data is greater than a second threshold, and the second threshold is set by the user side or the device side in a self-defined way, such as 0.05, 0, etc.
The processed first input data or second input data may be compressed or uncompressed data. For example, suppose the first input data is a horizontal data block, such as the matrix data block in the above example

[matrix data block; values not recoverable from the source image]

After processing by the first mapping circuit, the processed horizontal data block may be either the original matrix data block

[original matrix data block; values not recoverable from the source image]

or the compressed matrix data block

[compressed matrix data block; values not recoverable from the source image]

It should be appreciated that, to reduce the amount of data transferred and improve the efficiency of data processing in the basic processing circuits, the processed input data in this application (such as the processed basic data block or partial vertical data block) is preferably the compressed data. Preferably, the data sent by the main processing circuit to a basic processing circuit may specifically be the target data of the processed input data, i.e., the data whose absolute value is greater than a preset threshold, or the non-zero data, and so on.
Correspondingly, in the basic processing circuit, the second mapping circuit may obtain connection identification data according to the first identification data associated with the first input data and the second identification data associated with the second input data; the connection identification data is used to indicate data whose absolute value is greater than a third threshold in the first input data and the second input data, where the third threshold is set by a user or a device in a user-defined manner, such as 0.05, 0, or the like. Further, the second mapping circuit may process the received first input data and the second input data according to the connection identification data, respectively, so as to obtain processed first input data and processed second input data.
For example, the first input data is the matrix data block

[first matrix data block; values not recoverable from the source image]

and the second input data is likewise a matrix data block

[second matrix data block; values not recoverable from the source image]

After processing by the first mapping circuit, the first identification data block associated with the first input data is obtained as

[first identification data block; values not recoverable from the source image]

and the processed first input data block as

[processed first input data block; values not recoverable from the source image]

Correspondingly, the second identification data block associated with the second input data is

[second identification data block; values not recoverable from the source image]

and the processed second input data block is

[processed second input data block; values not recoverable from the source image]
Correspondingly, to improve the data transmission rate, the main processing circuit may send to the basic processing circuit only the target data 1, 0.06 and 0.5 of the processed first input data block together with the first identification data block associated with it, and at the same time send the target data 1, 1.1, 0.6, 0.3 and 0.5 of the processed second input data block together with the second identification data block associated with it.
Correspondingly, after the basic processing circuit receives these data, it can perform an element-wise AND operation on the first identification data block and the second identification data block through the second mapping circuit to obtain the connection identification data block

[connection identification data block; values not recoverable from the source image]

The second mapping circuit then processes the processed first input data block and the processed second input data block with the connection identification data block, obtaining the processed first input data block as

[first input data block after connection-mask processing; values not recoverable from the source image]

and the processed second input data block as

[second input data block after connection-mask processing; values not recoverable from the source image]
In other words, the basic processing circuit can determine, from the first identification data block and the received target data of the first data block, the first data block corresponding to that target data (i.e., the first data block as processed by the first mapping circuit); correspondingly, from the second identification data block and the received target data of the second data block, it determines the second data block corresponding to that target data (i.e., the second data block as processed by the first mapping circuit). Then, once the second mapping circuit has obtained the connection identification data block, it uses the connection identification data block to filter, element by element, the determined first data block and the determined second data block respectively, so as to obtain the first data block and the second data block as processed by the second mapping circuit.
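Putting this second embodiment together, one possible sketch of the basic processing circuit's side (purely illustrative; all names are hypothetical and the masks must share a shape):

```python
import numpy as np

def basic_circuit_compute(target_a, mask_a, target_b, mask_b):
    """Restore both operands from their target data and masks, form the
    connection identification data block, filter both operands with it,
    and take the inner product over the surviving positions."""
    a = np.zeros(mask_a.shape); a[mask_a.astype(bool)] = target_a
    b = np.zeros(mask_b.shape); b[mask_b.astype(bool)] = target_b
    conn = mask_a & mask_b            # connection identification data block
    a, b = a * conn, b * conn         # blocks as processed by the second mapping circuit
    return float((a * b).sum())       # inner-product style operation result
```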
In the third embodiment:
the first mapping circuit is not arranged in the main processing circuit, but the main processing circuit can send third input data and third identification data which is pre-stored and associated with the third input data to a basic processing circuit connected with the main processing circuit. A second mapping circuit is disposed in the base processing circuit. A specific example of the data compression process involved in the second mapping circuit is set forth below.
It should be understood that the third input data includes, but is not limited to, a basic data block, a partial vertical data block, a vertical data block, and the like. Similarly, in the neural network processor, the third input data may also be at least one weight and/or at least one input neuron, which is not limited in this application.
In the second mapping circuit, the second mapping circuit may process the third input data according to third identification data associated with the received third input data, so as to obtain processed third input data, so as to subsequently perform a correlation operation, such as an inner product operation, on the processed third input data.
For example, the third input data received by the second mapping circuit is the matrix data block

[matrix data block; values not recoverable from the source image]

and the correspondingly prestored third identification data block (also referred to as a mask matrix data block) associated with the third input data is

[third identification data block; values not recoverable from the source image]

The second mapping circuit then processes the third input data block according to the third identification data block, obtaining the processed third input data block as

[processed third input data block; values not recoverable from the source image]
In addition, the input neurons and output neurons mentioned in the embodiments of the present invention do not refer to the neurons of the input layer and the output layer of the entire neural network; rather, for any two adjacent layers of the network, the neurons in the lower layer of the feedforward operation are the input neurons, and the neurons in the upper layer of the feedforward operation are the output neurons.
In the fourth embodiment:
the main processing circuit is not provided with a mapping circuit, and the basic processing circuit is provided with a first mapping circuit and a second mapping circuit. For data processing of the first mapping circuit and the second mapping circuit, reference may be made to the foregoing first embodiment to the third embodiment, which are not described herein again.
Alternatively, a fifth embodiment is also present. In a fifth embodiment, a mapping circuit is not disposed in the basic processing circuit, and both the first mapping circuit and the second mapping circuit are disposed in the main processing circuit, and for data processing of the first mapping circuit and the second mapping circuit, reference may be specifically made to the foregoing first to third embodiments, and details are not repeated here. That is, the main processing circuit completes the compression processing of the data, and sends the processed input data to the basic processing circuit, so that the basic processing circuit performs the corresponding operation by using the processed input data (specifically, the processed neurons and the processed weights).
The following sets forth specific structural schematics of the mapping circuits of the present application. Two possible mapping circuits are shown in fig. 6a and fig. 6b. The mapping circuit shown in fig. 6a comprises a comparator and selectors; the application does not limit the number of comparators and selectors. Fig. 6a shows one comparator and two selectors, where the comparator is used to determine whether the input data meets a preset condition. The preset condition may be set by the user or the device, for example that the absolute value of the input data is greater than or equal to a preset threshold. If the preset condition is met, the comparator determines that the input data is allowed to be output, and the identification data associated with that input data is 1; otherwise, the input data is not output (or is set to 0 by default), and the identification data associated with it is 0. That is, after passing through the comparator, the identification data associated with the input data is known.
Further, after the comparator has evaluated the preset condition on the input data, the resulting identification data may be input to a selector, so that the selector uses the identification data to decide whether to output the corresponding input data, i.e., to obtain the processed input data.
As shown in fig. 6a, taking the input data as a matrix data block as an example, each datum of the matrix data block may be checked against the preset condition by the comparator, yielding the identification data block (mask matrix) associated with the matrix data block. Further, the first selector can screen the matrix data block by the identification data block, retaining the data of the matrix data block whose absolute value is greater than or equal to the preset threshold (i.e., meeting the preset condition) and deleting the rest, so as to output the processed matrix data block. Optionally, the second selector may process other input data (e.g., a second matrix data block) with the same identification data block, for example performing an element-wise AND-style selection to retain the data of the second matrix data block whose absolute value is greater than or equal to the preset threshold, so as to output the processed second matrix data block.
It should be understood that, corresponding to the first and second embodiments described above, the specific structure of the first mapping circuit may include at least one comparator and at least one selector, such as the comparator and the first selector of fig. 6a in the above example; the specific structure of the second mapping circuit may comprise one or more selectors, such as the second selector of fig. 6a in the above example.
Fig. 6b shows a schematic diagram of another mapping circuit. As shown in fig. 6b, the mapping circuit includes selectors, and the number of the selectors is not limited, and may be one or more. Specifically, the selector is configured to select the input data according to identification data associated with the input data, so as to output data, of which an absolute value is greater than or equal to a preset threshold, from the input data, and delete/not output the remaining data, thereby obtaining processed input data.
Taking the input data as a matrix data block as an example, the matrix data block and the identification data block associated with it are input to the mapping circuit; the selector selects from the matrix data block according to the identification data block, outputs the data whose absolute value is greater than or equal to the preset threshold, and outputs no other data, thereby outputting the processed matrix data block.
It will be appreciated that the structure shown in fig. 6b may be applied to the second mapping circuit of the third embodiment described above, i.e., the specific structure of the second mapping circuit in the third embodiment may comprise at least one selector. Similarly, the first mapping circuit disposed in the main processing circuit and the second mapping circuit disposed in the basic processing circuit may be cross-combined or split in terms of the functional components shown in fig. 6a and fig. 6b, which is not limited in this application.
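To make the behavior of the two mapping-circuit variants concrete, the following is a minimal software sketch, in Python with NumPy, of the comparator/selector logic described above. The function names, the threshold value, and the use of 0/1 mask arrays are illustrative assumptions for this sketch, not part of the patented hardware.

    import numpy as np

    def comparator(data, threshold=0.5):
        # Emulates the comparator of fig. 6a: identification (mask) data is 1
        # where |x| >= threshold, and 0 otherwise.
        return (np.abs(data) >= threshold).astype(np.uint8)

    def selector(data, mask):
        # Emulates the selector of fig. 6b: keep only the values whose
        # associated identification data is 1 (a compressed data stream).
        return data[mask == 1]

    matrix = np.array([[0.9, 0.1], [0.0, -1.2]])
    mask = comparator(matrix)           # identification data block: [[1,0],[0,1]]
    processed = selector(matrix, mask)  # processed data block: [0.9, -1.2]
    # A second selector (fig. 6a) can screen another data block with the same mask:
    other = np.array([[2.0, 3.0], [4.0, 5.0]])
    processed_other = selector(other, mask)  # [2.0, 5.0]

Here the mask array plays the role of the identification data block, and screening a second block with the first block's mask corresponds to the second selector of fig. 6a.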
Based on the foregoing embodiments, several specific implementation methods of the neural network forward operation are given below by way of example. When the first operation instruction is a convolution instruction, the input data block (i.e., the data in the input data block) is the convolution input data, and the weight data (block) is the convolution kernel. The convolution operation is described as follows; in the figures below, one square represents one datum. The input data is shown in fig. 2a (N samples, each sample having C channels, each channel's feature map having height H and width W), and the weights, i.e., the convolution kernels, are shown in fig. 2b (M convolution kernels, each with C channels, of height KH and width KW). The rule of the convolution operation is the same for each of the N samples of input data, so the process of convolving one sample is explained below. On one sample, each of the M convolution kernels performs the same operation: each kernel computes one planar feature map, so the M kernels compute M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product is computed at each planar position of the sample, and the kernel then slides along the H and W directions. For example, fig. 2c shows a convolution kernel computing an inner product at the lower-right position of one sample of the input data; fig. 2d shows the convolution window slid one grid to the left, and fig. 2e shows it slid one grid upwards.
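As a reference for the sliding-window rule just described, the following is a minimal NumPy sketch of the convolution of one sample (stride 1 and no padding are assumed; the shapes and names are illustrative, not fixed by the patent).

    import numpy as np

    def conv_forward(x, w):
        # x: one sample, shape (C, H, W); w: kernels, shape (M, C, KH, KW)
        C, H, W = x.shape
        M, _, KH, KW = w.shape
        out = np.zeros((M, H - KH + 1, W - KW + 1))
        for m in range(M):                      # one planar feature map per kernel
            for i in range(H - KH + 1):         # slide along H
                for j in range(W - KW + 1):     # slide along W
                    window = x[:, i:i+KH, j:j+KW]
                    out[m, i, j] = np.sum(window * w[m])  # inner product
        return out

    x = np.random.rand(3, 8, 8)       # C=3, H=W=8
    w = np.random.rand(2, 3, 3, 3)    # M=2 kernels, KH=KW=3
    feature_maps = conv_forward(x, w) # shape (2, 6, 6): M feature maps

For N samples the same routine is applied sample by sample, matching the statement that the convolution rule is identical across the N samples.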
Specifically, the convolution processing may be performed using a chip structure as shown in fig. 3. The first mapping circuit of the main processing circuit may process the data in some or all of the convolution kernels of the weights to obtain the corresponding mask data and the processed weight data (i.e., the data in some or all of the convolution kernels of the processed weights).
The control circuit of the main processing circuit sends the data in some or all of the convolution kernels of the weights (which may be the original weight data or the processed weight data) to the basic processing circuits (also called basic units) directly connected to the main processing circuit through the horizontal data input interface; at the same time, the control circuit sends the mask data associated with that data to the same basic processing circuits;
in one alternative, the control circuit of the main processing circuit sends the data of a certain convolution kernel in the weights to a certain basic processing circuit one number, or one group of numbers, at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of row 3, the 2nd transmission sends the 2nd number of row 3, the 3rd transmission sends the 3rd number of row 3, and so on; or the 1st transmission sends the first two numbers of row 3, the 2nd transmission sends the 3rd and 4th numbers of row 3, the 3rd transmission sends the 5th and 6th numbers of row 3, and so on). At the same time, the control circuit sends the mask data corresponding to that convolution kernel in the weights to the same basic processing circuit in the same manner, one number or one group of numbers at a time;
in another alternative, the control circuit of the main processing circuit sends the data of several convolution kernels in the weights to a certain basic processing circuit one number per kernel at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of rows 3, 4 and 5; the 2nd transmission sends the 2nd number of each of rows 3, 4 and 5; the 3rd transmission sends the 3rd number of each of rows 3, 4 and 5, and so on; or the 1st transmission sends the first two numbers of each of rows 3, 4 and 5; the 2nd transmission sends the 3rd and 4th numbers of each of rows 3, 4 and 5; the 3rd transmission sends the 5th and 6th numbers of each of rows 3, 4 and 5, and so on). Correspondingly, the control circuit sends the mask data corresponding to those convolution kernels in the weights to that basic processing circuit in the same manner;
the control circuit of the main processing circuit divides the input data by convolution position and sends the data at some or all of the convolution positions of the input data to the basic processing circuits directly connected to the main processing circuit through the vertical data input interface; correspondingly, the control circuit also divides the mask data associated with the input data by convolution position and sends the mask data corresponding to the data at those convolution positions to the same basic processing circuits;
in one alternative, the control circuit of the main processing circuit sends the data at a certain convolution position of the input data, together with the mask data corresponding to it, to a certain basic processing circuit one number, or one group of numbers, at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of column 3, the 2nd transmission sends the 2nd number of column 3, the 3rd transmission sends the 3rd number of column 3, and so on; or the 1st transmission sends the first two numbers of column 3, the 2nd transmission sends the 3rd and 4th numbers of column 3, the 3rd transmission sends the 5th and 6th numbers of column 3, and so on);
in another alternative, the control circuit of the main processing circuit sends the data at several convolution positions of the input data, together with the corresponding mask data, to a certain basic processing circuit one number, or one group of numbers, at a time (for example, for a certain basic processing circuit: the 1st transmission sends the 1st number of each of columns 3, 4 and 5; the 2nd transmission sends the 2nd number of each of columns 3, 4 and 5; the 3rd transmission sends the 3rd number of each of columns 3, 4 and 5, and so on; or the 1st transmission sends the first two numbers of each of columns 3, 4 and 5; the 2nd transmission sends the 3rd and 4th numbers of each of columns 3, 4 and 5; the 3rd transmission sends the 5th and 6th numbers of each of columns 3, 4 and 5, and so on).
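The chunked streaming in these alternatives can be pictured with the following sketch, which splits one weight row and its mask into fixed-size groups and sends them one group per transmission (the chunk size, the generator abstraction, and the names are assumptions made for illustration only).

    import numpy as np

    def stream_row(row, mask_row, chunk=2):
        # Yield (data group, mask group) pairs, emulating the control circuit
        # sending one group of numbers per transmission to one basic unit.
        for start in range(0, len(row), chunk):
            yield row[start:start+chunk], mask_row[start:start+chunk]

    row = np.array([0.9, 0.0, -1.2, 0.3, 0.0, 0.7])
    mask = (np.abs(row) >= 0.5).astype(np.uint8)
    for i, (d, m) in enumerate(stream_row(row, mask), 1):
        print(f"transmission {i}: data={d}, mask={m}")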
After receiving weight data (specifically, the data of a convolution kernel in the weights, referred to as weight data for short, or the mask data associated with that weight data), a basic processing circuit transmits it to the next basic processing circuit connected to it through its horizontal data output interface; after receiving input data (which may be the input data sent by the main processing circuit together with the mask identification data associated with it), a basic processing circuit transmits it to the next basic processing circuit connected to it through its vertical data output interface;
specifically, the control circuit of the main processing circuit may send the input data and the mask data associated with it to a basic processing circuit together, and the basic processing circuit receives both;
each basic processing circuit operates on the received data. Specifically, the basic processing circuit may enable its second mapping circuit to derive connection identification data from the mask data associated with the input data and the mask data associated with the weight data (i.e., the mask data associated with the convolution kernel in the weights); it then uses the connection identification data to select, from the input data and the weight data, the data whose absolute value is greater than the preset threshold, and performs the multiplication on the selected data;
in one alternative, if the amount of data received by a basic processing circuit (specifically, the data blocks to be calculated, such as the data in a convolution kernel of the weights with its associated mask data, or the input data with its associated mask data) exceeds a preset threshold, the basic processing circuit stops accepting new input data, such as the data of further convolution kernels and their mask data subsequently sent by the main processing circuit, until it has enough buffer/storage space, after which it resumes receiving the newly sent data.
In one alternative, the base processing circuitry computes a multiplication of one or more sets of two data at a time, and then accumulates the results onto registers and/or on-chip caches;
in one alternative, the base processing circuitry computes the inner product of one or more sets of two vectors at a time, and then accumulates the results onto a register and/or on-chip cache;
after the basic processing circuit calculates the result, the result can be transmitted out from the data output interface;
in one alternative, the calculation result may be the final result or an intermediate result of the inner product operation;
specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted through that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit.
After receiving calculation results from other basic processing circuits, a basic processing circuit transmits them to the other basic processing circuits or to the main processing circuit connected to it;
the result is output in a direction in which it can reach the main processing circuit directly (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits transmit their operation results downwards through their vertical output interfaces);
the main processing circuit receives the inner product operation results from the basic processing circuits and thereby obtains the output result.
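A compact sketch of what one basic processing circuit computes in this flow: AND the two masks into connection identification data, select only the mutually retained values, then multiply and accumulate. This is a pure-Python illustration; in the hardware the second mapping circuit and the inner product operator perform these steps, and all names below are assumed.

    import numpy as np

    def basic_unit_mac(weights, w_mask, inputs, x_mask):
        # Connection identification data: positions retained by BOTH masks.
        conn = w_mask & x_mask
        # Select only the retained data, then multiply-accumulate.
        return float(np.dot(weights[conn == 1], inputs[conn == 1]))

    w = np.array([0.9, 0.0, -1.2, 0.3])
    x = np.array([1.0, 2.0, 0.0, 4.0])
    w_mask = (np.abs(w) >= 0.5).astype(np.uint8)    # [1, 0, 1, 0]
    x_mask = (np.abs(x) >= 0.5).astype(np.uint8)    # [1, 1, 0, 1]
    partial = basic_unit_mac(w, w_mask, x, x_mask)  # only index 0 survives: 0.9

Because only the positions kept by both masks enter the multiplier, the multiplication count drops with the sparsity of either operand, which is the stated computation and power advantage.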
The following describes how the apparatus shown in fig. 3 performs a tensor-times-tensor operation. The tensor here is the same as the data block described above and may be any one or a combination of a matrix, a vector, a three-dimensional data block, a four-dimensional data block, and a higher-dimensional data block; fig. 4b and fig. 4d show specific implementation methods of the matrix-times-matrix and matrix-times-vector operations, respectively.
Referring to fig. 4a, fig. 4a shows a matrix-times-matrix operation: when the forward operation indicated by the first operation instruction is a matrix-times-matrix operation, the input data is the first matrix of the operation and the weight is the second matrix of the operation.
Referring to fig. 4b, the matrix-times-matrix operation is performed using the apparatus shown in fig. 3;
the following describes the computation of the product of a matrix S with M rows and L columns and a matrix P with L rows and N columns (each row of matrix S is the same length as each column of matrix P, as shown in fig. 2d), assuming the neural network computing device possesses K basic processing circuits:
step S401b, the control circuit of the main processing circuit distributes each row of data in the matrix S to one of K basic processing circuits, and the basic processing circuits store the received data in an on-chip cache and/or a register;
in an optional scheme, the data of the matrix S is processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix S, obtaining the processed matrix S and a first identification (mask) matrix associated with it; or the first mapping circuit of the main processing circuit processes the matrix S according to a pre-stored first mask matrix associated with it, obtaining the processed matrix S. Further, each row of data of the processed matrix S, together with the identification data of that row in the first mask matrix, is sent through the control circuit to one or more of the K basic processing circuits. When sending data to the basic processing circuits, the main processing circuit may send only the data of the processed matrix S whose absolute value is greater than the preset threshold, or the non-zero data, so as to reduce the amount of data transmitted.
In one alternative, if the number of rows M of S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the S matrix to each of M basic processing circuits; optionally, the identification data of the corresponding row of the first identification matrix is sent at the same time;
in another alternative, if the number of rows M of S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit; optionally, the identification data of the corresponding row or rows of the first identification matrix is sent at the same time;
Mi rows of S are distributed to the ith basic processing circuit, and the set of these Mi rows is called Ai; fig. 2e represents the calculation to be performed on the ith basic processing circuit.
In one alternative, in each base processing circuit, for example, in the ith base processing circuit:
the matrix Ai distributed by the main processing circuit is received and stored in the register and/or on-chip cache of the ith basic processing circuit. This has the advantages of reducing the subsequent data transmission volume, improving calculation efficiency, and reducing power consumption.
Step S402b, the control circuit of the main processing circuit transmits the parts of the matrix P to each basic processing circuit by broadcasting;
in one alternative, the data (the parts) of the matrix P may be processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix P, obtaining the processed matrix P and a second identification (mask) matrix associated with it; or the first mapping circuit of the main processing circuit processes the matrix P according to a pre-stored second mask matrix associated with it, obtaining the processed matrix P. Further, the data (i.e., the parts) of the processed matrix P, together with the identification data associated with that data in the second mask matrix, are sent through the control circuit to one or more of the K basic processing circuits. When sending data to the basic processing circuits, the main processing circuit may send only the data of the processed matrix P whose absolute value is greater than the preset threshold, or the non-zero data, so as to reduce the amount of data transmitted.
In one alternative, the parts of the matrix P may be broadcast only once to the register or on-chip cache of each basic processing circuit, and the ith basic processing circuit fully multiplexes the data of the matrix P obtained in this single broadcast, completing the inner product operation corresponding to each row of the matrix Ai. Multiplexing in this embodiment specifically means repeated use of data by the basic processing circuit during calculation; for example, multiplexing the data of the matrix P means using that data multiple times.
In another alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P to the register or on-chip cache of each basic processing circuit multiple times, and the ith basic processing circuit does not multiplex the data of the matrix P obtained each time, instead completing the inner product operations corresponding to the rows of the matrix Ai in multiple passes;
in a further alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P to the register or on-chip cache of each basic processing circuit multiple times, and the ith basic processing circuit partially multiplexes the data of the matrix P obtained each time, completing the inner product operation corresponding to each row of the matrix Ai;
in one alternative, each basic processing circuit, for example the ith basic processing circuit, calculates the inner product of the data of matrix Ai and the data of matrix P;
in step S403b, the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit.
Optionally, before step S403b, the inner product operator circuit of each basic processing circuit needs to compute the inner product of the data of the matrix S and of the matrix P; the following specific embodiments describe this in detail.
In a specific embodiment, the basic processing circuit receives the processed data in the matrix S and the identification data associated with the data in the first mask matrix; and meanwhile, receiving the processed data in the matrix P. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the matrix P according to the identification data in the received first mask matrix, and the processed data of the matrix P is obtained. Further, the basic processing circuit enables the inner product arithmetic circuit to execute the inner product operation on the received data in the processed matrix S and the data of the processed matrix P, and the result of the inner product operation is obtained.
In a specific embodiment, the basic processing circuit receives the processed data in the matrix P and the identification data associated with the data in the second mask matrix; and simultaneously, receiving the processed data in the matrix S. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the matrix S according to the identification data in the received second mask matrix to obtain the processed data of the matrix S. Further, the basic processing circuit enables the inner product arithmetic circuit to execute inner product operation on the received data of the processed matrix P and the data in the processed matrix S, and the result of the inner product operation is obtained.
In a specific embodiment, the basic processing circuit receives the processed data in the matrix S and the identification data associated with the data in the first mask matrix; and meanwhile, receiving the processed data in the matrix P and the identification data which is associated in the second mask matrix and corresponds to the data. Correspondingly, the basic processing circuit starts a second mapping circuit to obtain a relation identification matrix according to the received identification data in the first mask matrix and the received identification data in the second mask matrix; and then, the identification data in the relation identification matrix is used for respectively processing the received data in the matrix S and the received data in the matrix P to obtain the processed data of the matrix S and the processed data of the matrix P. Further, the inner product arithmetic circuit is started to execute inner product operation on the data in the processed matrix S and the data of the processed matrix P, and the result of the inner product operation is obtained. For example, the ith basic processing circuit receives a matrix Ai, an identification matrix Bi associated with the Ai, a matrix P and a second identification matrix associated with the matrix P; at this time, the second mapping circuit can be started to obtain a relation identification matrix by using the Bi and the second identification matrix, and then the matrix Ai and the matrix P are processed simultaneously or respectively by using the relation identification matrix to obtain a processed matrix Ai and a processed matrix P. Next, the inner product operator circuit is enabled to perform inner product operation on the processed matrix Ai and the processed matrix P.
In one alternative, the base processing circuit may transmit the partial sums obtained by performing the inner product operation each time back to the main processing circuit for accumulation;
in an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time can be stored in a register and/or an on-chip cache of the basic processing circuit, and the partial sum is transmitted back to the main processing circuit after the accumulation is finished;
in an alternative, the partial sum obtained by the inner product operation performed by the basic processing circuit each time may be stored in a register and/or an on-chip buffer of the basic processing circuit in some cases for accumulation, and transmitted to the main processing circuit for accumulation in some cases, and transmitted back to the main processing circuit after the accumulation is finished.
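The matrix-times-matrix flow of steps S401b to S403b can be summarized in software as follows. This is a schematic emulation assuming the rows of S are dealt out round-robin and P is broadcast whole in one pass; the mask screening already sketched above is omitted, and the names are illustrative.

    import numpy as np

    def matmul_on_array(S, P, K):
        # Step S401b: distribute the rows of S over the K basic processing circuits.
        row_sets = [list(range(i, S.shape[0], K)) for i in range(K)]  # rows of each Ai
        out = np.zeros((S.shape[0], P.shape[1]))
        for i in range(K):
            Ai = S[row_sets[i]]   # rows held by the ith basic processing circuit
            # Step S402b: the matrix P is broadcast to every basic circuit.
            # Step S403b: each circuit computes its inner products and the
            # accumulated results are returned to the main processing circuit.
            out[row_sets[i]] = Ai @ P
        return out

    S = np.random.rand(6, 4)
    P = np.random.rand(4, 5)
    assert np.allclose(matmul_on_array(S, P, K=3), S @ P)

Each row of the result is the set of inner products between one row of S and the columns of P, which is exactly the work assigned to the circuit holding that row.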
Fig. 4c is a schematic diagram of a matrix multiplied by a vector. When the forward operation indicated by the first operation instruction is a matrix-times-vector operation, the input data is the first matrix of the operation and the weight is the vector of the operation. Referring to fig. 4d, fig. 4d provides an implementation method of matrix-times-vector, which may specifically include:
step S401, the control circuit of the main processing circuit distributes each row of data in the matrix S to one of K basic processing circuits, and the basic processing circuits store the received distributed data in an on-chip cache and/or a register of the basic processing circuits;
in an optional scheme, the data of the matrix S is processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix S, obtaining the processed matrix S and a first identification (mask) matrix associated with it; or the first mapping circuit of the main processing circuit processes the matrix S according to a pre-stored first mask matrix associated with it, obtaining the processed matrix S. Further, each row of data of the processed matrix S, together with the identification data of that row in the first mask matrix, is sent through the control circuit to one or more of the K basic processing circuits. When sending data to the basic processing circuits, the main processing circuit may send only the data of the processed matrix S whose absolute value is greater than the preset threshold, or the non-zero data, so as to reduce the amount of data transmitted. For example, the set of rows of the matrix S processed by the ith basic processing circuit is Ai, with Mi rows in total; correspondingly, an identification matrix Bi corresponding to Ai is distributed at the same time, where Bi is a part of the first mask matrix comprising at least Mi rows.
In one alternative, if the number of rows M of the matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits; optionally, the identification data of the corresponding row of the first identification matrix is sent at the same time;
in another alternative, if the number of rows M of the matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each basic processing circuit; optionally, the identification data of the corresponding row or rows of the first identification matrix is sent at the same time;
the set of rows of S distributed to the ith basic processing circuit is Ai, with Mi rows in total; fig. 2c shows the calculation to be performed on the ith basic processing circuit.
In one alternative, in each basic processing circuit, e.g., the ith basic processing circuit, the received distributed data, e.g., the matrix Ai, may be stored in the register and/or on-chip cache of the ith basic processing circuit. This has the advantages of reducing the amount of subsequent transmission of the distributed data, improving calculation efficiency, and reducing power consumption.
Step S402, the control circuit of the main processing circuit transmits the parts of the vector P to the K basic processing circuits by broadcasting;
in one alternative, the data (the parts) of the vector P may be processed data. Specifically, the main processing circuit enables the first mapping circuit to process the vector P, obtaining the processed vector P and a second identification (mask) matrix associated with it; or the first mapping circuit of the main processing circuit processes the vector P according to a pre-stored second mask matrix associated with it, obtaining the processed vector P. Further, the data (i.e., the parts) of the processed vector P, together with the identification data corresponding to that data in the second mask matrix, are sent through the control circuit to one or more of the K basic processing circuits. When sending data to the basic processing circuits, the main processing circuit may send only the data of the processed vector P whose absolute value is greater than the preset threshold, or the non-zero data, so as to reduce the amount of data transmitted.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once to the register or on-chip buffer of each basic processing circuit, and the ith basic processing circuit may fully multiplex the data of the vector P obtained this time, and perform the inner product operation corresponding to each row in the matrix Ai. The method has the advantages of reducing the data transmission quantity of repeated transmission of the vector P from the main processing circuit to the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit does not multiplex the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai for multiple times; the method has the advantages of reducing the data transmission quantity of the vector P of single transmission in the basic processing circuit, reducing the capacity of the cache and/or the register of the basic processing circuit, improving the execution efficiency, reducing the transmission power consumption and reducing the cost.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit performs partial multiplexing on the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai; the method has the advantages of reducing the data transmission quantity from the main processing circuit to the basic processing circuit, reducing the data transmission quantity in the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
Step S403, the inner product operator circuits of the K basic processing circuits compute the inner products of the data of the matrix S and of the vector P; for example, the ith basic processing circuit computes the inner product of the data of the matrix Ai and the data of the vector P;
in a specific embodiment, the basic processing circuit receives the processed data in the matrix S and the identification data associated with the data in the first mask matrix; and simultaneously receiving the data in the processed vector P. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the vector P according to the received identification data in the first mask matrix, and the processed data of the vector P is obtained. Further, the basic processing circuit enables the inner product arithmetic circuit to execute the inner product operation on the received data in the processed matrix S and the data of the processed vector P, and the result of the inner product operation is obtained. For example, the ith basic processing circuit receives a matrix Ai, an identification matrix Bi associated with the Ai, and a vector P; at the moment, a second mapping circuit can be started to process the vector P by utilizing Bi to obtain a processed vector P; and starting an inner product arithmetic circuit to carry out inner product operation on the matrix Ai and the processed vector P.
In a specific embodiment, the basic processing circuit receives the data in the processed vector P and the identification data associated with the data in the second mask matrix; and simultaneously, receiving the processed data in the matrix S. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the matrix S according to the identification data in the received second mask matrix to obtain the processed data of the matrix S. Further, the basic processing circuit enables the inner product arithmetic circuit to execute inner product operation on the received data of the processed vector P and the data in the processed matrix S, and the result of the inner product operation is obtained. For example, the ith basic processing circuit receives the matrix Ai, the processed vector P and a second identification matrix associated with the vector P; at the moment, a second mapping circuit can be started to process Ai by using a second identification matrix to obtain a processed matrix Ai; and then starting an inner product arithmetic circuit to carry out inner product operation on the processed matrix Ai and the processed vector P.
In a specific embodiment, the basic processing circuit receives the processed data in the matrix S and the identification data associated with the data in the first mask matrix; and meanwhile, receiving the data in the processed vector P and the identification data which is associated in the second mask matrix and corresponds to the data. Correspondingly, the basic processing circuit starts a second mapping circuit to obtain a relation identification matrix according to the received identification data in the first mask matrix and the received identification data in the second mask matrix; and then, respectively processing the received data in the matrix S and the received data in the vector P by using the identification data in the relation identification matrix to obtain the processed data of the matrix S and the processed data of the vector P. Further, an inner product arithmetic circuit is started to execute inner product operation on the data in the processed matrix S and the data of the processed vector P, and the result of the inner product operation is obtained. For example, the ith base processing circuit receives a matrix Ai, an identification matrix Bi associated with the Ai, a vector P, and a second identification matrix associated with the vector P; at this time, the second mapping circuit can be started to obtain a relation identification matrix by using the Bi and the second identification matrix, and then the matrix Ai and the vector P are processed simultaneously or respectively by using the relation identification matrix to obtain a processed matrix Ai and a processed vector P. Next, the inner product operator circuit is enabled to perform an inner product operation on the processed matrix Ai and the processed vector P.
Step S404, the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain the accumulated results and transmit them back to the main processing circuit in fixed-point form.
In one alternative, the partial sums obtained from each inner product operation performed by the basic processing circuit (a partial sum is a portion of the accumulated result: for example, if the accumulated result is F1×G1 + F2×G2 + F3×G3 + F4×G4 + F5×G5, a partial sum may be the value of F1×G1 + F2×G2 + F3×G3) may be transmitted back to the main processing circuit for accumulation. This has the advantages of reducing the amount of computation inside the basic processing circuit and improving its operation efficiency.
In another alternative, the partial sum obtained from each inner product operation performed by the basic processing circuit may be stored in the register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation is finished. This has the advantages of reducing the amount of data transmitted between the basic processing circuit and the main processing circuit, improving operation efficiency, and reducing data transmission power consumption.
In a further alternative, the partial sum obtained from each inner product operation performed by the basic processing circuit may, in some cases, be stored in the register and/or on-chip cache of the basic processing circuit for accumulation, and in other cases be transmitted to the main processing circuit for accumulation, being transmitted back to the main processing circuit once the accumulation is finished. This combines the advantages of the two preceding alternatives: less data traffic between the basic and main processing circuits, less data transmission power consumption, less computation inside the basic processing circuit, and higher operation efficiency.
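The accumulation alternatives differ only in where the partial sums live; the sketch below contrasts the first two, with chunked broadcasts of the vector P standing in for the multiple transmissions (illustrative Python; the chunk size and names are assumptions).

    import numpy as np

    def mxv_send_each_partial(Ai, P, chunk=2):
        # Alternative 1: every partial sum is sent straight back, and the
        # main processing circuit performs the accumulation.
        main_acc = np.zeros(Ai.shape[0])
        for s in range(0, len(P), chunk):                  # one broadcast per chunk
            main_acc += Ai[:, s:s+chunk] @ P[s:s+chunk]    # main side accumulates
        return main_acc

    def mxv_accumulate_locally(Ai, P, chunk=2):
        # Alternative 2: the basic circuit accumulates in its own register /
        # on-chip cache and returns only the finished result, one transfer.
        local = np.zeros(Ai.shape[0])
        for s in range(0, len(P), chunk):
            local += Ai[:, s:s+chunk] @ P[s:s+chunk]       # stays on-chip
        return local

    Ai = np.random.rand(3, 6)
    P = np.random.rand(6)
    assert np.allclose(mxv_send_each_partial(Ai, P), Ai @ P)
    assert np.allclose(mxv_accumulate_locally(Ai, P), Ai @ P)

Both produce the same result; the trade-off is transfer volume (alternative 1 sends one partial sum per chunk) versus on-chip storage (alternative 2 keeps an accumulator per row of Ai).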
Neural network training method
All or part of the data involved in the neural network training process may be processed data, which may be obtained by processing by the first mapping circuit and/or the second mapping circuit in the foregoing embodiments, and is not described herein again.
It should be noted that the data used at different time points in the training process (specifically, at different iterations or at initialization), at different stages of training (i.e., forward or reverse operation), in different layers, in different data blocks of the same layer (i.e., the multiple input data blocks and output data blocks), or in different sub-data blocks divided from the same data block may in each case be processed data blocks.
The following describes a practical implementation method of neural network training. Fig. 1b is a specific calculation diagram of neural network training with a single-layer operation: the solid lines show the forward operation of the single-layer neural network, and the dashed lines show its backward operation. Specifically, the forward operation of the layer is performed on the input data and the weights or parameters to obtain the output data; an operation following a preset rule is then performed on the output data to obtain the gradient of the layer's output data (the preset rule can be set by the manufacturer as needed, and the specific steps of this operation are not limited here). The backward operation of the layer can then be executed on the layer's input data, its weights or parameters, and the output data gradient, yielding the input data gradient and the gradient of the layer's weights or parameters; the computed weight or parameter gradients are used to update the layer's weights or parameters accordingly, completing the training of this layer of the neural network.
In a specific implementation process, the data involved in the forward or backward operation may be processed data. Taking the forward operation as an example, the technical solution provided in this embodiment of the present application may determine, according to the operation instruction of the layer, whether to enable the relevant mapping circuits (specifically, the first mapping circuit and/or the second mapping circuit) to process the input data and/or the weights, and then perform the layer's operation using the processed input data and/or weights. For the principle of this data processing, reference may be made to the related explanations in the foregoing embodiments, which are not repeated here. It should be understood that performing the neural network operation with processed data greatly reduces the transmission overhead between calculators; in addition, data represented with fewer bits occupies less storage space, so the storage overhead is smaller, and the amount of computation is likewise reduced, so the computation overhead decreases.
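A minimal single-layer training step matching fig. 1b might look as follows. This is a NumPy sketch assuming a plain linear layer, a squared-error "preset rule" for the output gradient, and vanilla gradient-descent updates; none of these choices is fixed by the patent.

    import numpy as np

    def train_one_layer(x, w, target, lr=0.01):
        # Forward operation (solid lines in fig. 1b): output = input x weights.
        y = x @ w
        # Preset-rule operation on the output: here, gradient of 0.5*||y - t||^2.
        grad_y = y - target
        # Backward operation (dashed lines): input gradient and weight gradient.
        grad_x = grad_y @ w.T
        grad_w = x.T @ grad_y
        # Update the layer's weights with the computed gradient.
        w_new = w - lr * grad_w
        return y, grad_x, w_new

    x = np.random.rand(2, 4)
    w = np.random.rand(4, 3)
    t = np.random.rand(2, 3)
    y, grad_x, w = train_one_layer(x, w, t)

In a multi-layer network, grad_x would serve as the output data gradient of the preceding layer, as described for the backward pass below.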
Taking fig. 4e and fig. 4f as an example, structural schematic diagrams of neural network training with matrix multiplication and with convolution are given below. The layer operation shown in fig. 4e is matrix multiplication, and the layer operation shown in fig. 4f is a convolution operation. Assume that both the input data and the weights of the layer are matrices; for convenience of description, the input data is exemplified by a matrix I and the weights by a matrix W, so the output data is matrix I × matrix W. Suppose the matrix I and the matrix W are both sparse matrices of large dimension, where a sparse matrix is a matrix containing a large amount of data that is 0 or whose absolute value is less than or equal to a preset threshold. "Large dimension" can be understood as meaning that the sum of the numbers of columns and rows of matrices I and W is large, i.e., they occupy a large amount of space in memory and/or registers and require a large amount of calculation. In this case a conventional matrix multiplication would involve a large amount of computation; to improve data processing efficiency, the matrices I and W are processed first, and the matrix multiplication is performed afterwards.
For example, if the matrix I is a 1000 × 1000 sparse matrix and the matrix W is also a 1000 × 1000 sparse matrix, the sum of the numbers of columns and rows is 2000 and the corresponding amount of calculation is large: the inner products of the matrix-times-matrix operation require 10^9 multiplications.
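The saving from skipping the masked-out entries can be estimated with a quick count (illustrative Python; the 90% sparsity figure is an assumption, not a value from the patent).

    n = 1000
    dense_mults = n * n * n                    # 1000x1000 by 1000x1000: 1e9 multiplies

    sparsity = 0.9                             # assume 90% of entries are masked out
    keep = 1.0 - sparsity
    # Only pairs where BOTH operands survive their masks are multiplied,
    # which is what the connection identification data enforces.
    masked_mults = dense_mults * keep * keep   # about 1e7 multiplies
    print(dense_mults, int(masked_mults))      # 1000000000 10000000

Under this assumption the multiplier workload drops by roughly two orders of magnitude, which is the motivation for processing I and W before the multiplication.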
Fig. 4g and fig. 4h show specific structural diagrams of multi-layer neural network training. As shown in fig. 4g, the direction of the dashed arrows indicates the inverse operation. For the inverse operation, its input is the output data gradient: if the layer is the last layer of the current iteration of the multi-layer neural network, the output data gradient is obtained by performing a preset operation on the output data of the last layer of the current iteration (the preset operation can be set by the manufacturer as needed, and its specific steps are not limited here); if the layer is not the last layer of the iteration, for example the nth layer of the current iteration, the output data gradient of the nth layer may be the input data gradient computed by the inverse operation of the (n+1)th layer. Similarly, fig. 4h may be understood as a schematic diagram of training a multi-layer convolutional neural network (including the forward and inverse operations), where the other operations in the diagram represent layers other than convolutional layers, or operations between layers, without limitation.
The present disclosure also provides an integrated circuit chip device for performing training of a neural network, the neural network including a plurality of layers, the integrated circuit chip device comprising: a processing circuit and an external interface;
the external interface is used for receiving a training instruction;
the processing circuit is used for determining first-layer input data and first-layer weight data according to the training instruction, and executing n layers of forward operations of the neural network through the first-layer input data and the first-layer weight data to obtain an nth output result;
the processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the training instruction; executing n layers of reverse operation of the neural network according to the nth reverse operation instruction, the nth output result gradient, the nth layer of input data and the nth layer of weight group data to obtain n weight gradients of n layers of operation;
the processing circuit is further configured to update the n weights of the n-layer operation by applying the n weight gradients.
The disclosure also discloses a neural network computing device, which includes one or more of the chips shown in fig. 3, and which is used to acquire data to be computed and control information from other processing devices, execute the specified neural network operation, and transmit the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one chip as shown in fig. 3 is included, the chips can be linked and transmit data to one another through a specific structure, for example interconnected via a PCIE bus, to support larger-scale neural network operation. In this case the chips may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, any interconnection topology may be used. Optionally, the neural network operation device has high compatibility and can be connected to various types of servers through a PCIE interface.
In one embodiment, the present invention discloses a chip (e.g., fig. 5) for performing all or part of the methods provided in the method embodiments described above.
In one embodiment, the invention discloses an electronic device comprising functional units for performing all or part of the methods described above.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, dashboard cameras, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices. The vehicles include airplanes, ships, and/or road vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance apparatuses, B-mode ultrasound apparatuses, and/or electrocardiographs.
The above-described embodiments, objects, technical solutions and advantages of the present disclosure are further described in detail, it should be understood that the above-described embodiments are only illustrative of the embodiments of the present disclosure, and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (12)

1. An integrated circuit chip apparatus, the apparatus being configured to perform training of a neural network, the neural network comprising n layers, where n is an integer greater than or equal to 2, the integrated circuit chip apparatus comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit comprises a first mapping circuit, at least one of the plurality of basic processing circuits comprises a second mapping circuit, and the first mapping circuit and the second mapping circuit are used for executing compression processing of data in the neural network operation;
the plurality of base processing circuits are distributed in an array; each basic processing circuit is connected with other adjacent basic processing circuits, the main processing circuit is connected with n basic processing circuits in a 1 st row, n basic processing circuits in an m th row and m basic processing circuits in a 1 st column, the arrays are m x n arrays, the value ranges of m and n are integers which are more than or equal to 1, and at least one value of m and n is more than or equal to 2;
the integrated circuit chip device is used for receiving a training instruction, determining first-layer input data and first-layer weight group data according to the training instruction, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
the main processing circuit is further configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the training instruction; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; determining to start a first mapping circuit to process a first data block according to the operation control of the nth reverse operation instruction to obtain a processed first data block; the first data block comprises the horizontal data block and/or the vertical data block; sending the processed first data block to at least one basic processing circuit in basic processing circuits connected with the main processing circuit according to the nth reverse operation instruction;
the plurality of basic processing circuits are used for determining whether to start a second mapping circuit to process a second data block according to the operation control of the nth reverse operation instruction, executing the operation in the neural network in a parallel mode according to the processed second data block to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit; the second data block is determined by the basic processing circuit to receive the data block sent by the main processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit is also used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
the integrated circuit chip device is also used for taking the input data gradient of the nth layer as the output result gradient of the nth-1 layer to execute reverse operation of the n-1 layer to obtain the weight group gradient of the n-1 layer, and updating the weight group data of the corresponding layer by applying the weight group gradient of the n-1 layer, wherein the weight group data comprises at least two weights.
2. The integrated circuit chip apparatus of claim 1, wherein when the first data block comprises a horizontal data block and a vertical data block,
the main processing circuit is specifically configured to start the first mapping circuit to process the horizontal data block and the vertical data block to obtain a processed horizontal data block, an identification data block associated with the horizontal data block, a processed vertical data block, and an identification data block associated with the vertical data block; splitting the processed transverse data block and the identification data block associated with the transverse data block to obtain a plurality of basic data blocks and identification data blocks associated with the basic data blocks; distributing the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to a basic processing circuit connected with the basic processing circuit, and broadcasting the vertical data blocks and the identification data blocks associated with the vertical data blocks to the basic processing circuit connected with the vertical data blocks;
the basic processing circuit is used for starting the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the basic data block and the identification data block associated with the vertical data block; and processing the basic data blocks and the vertical data blocks according to the connection identification data blocks, performing reverse operation on the processed basic data blocks and the processed vertical data blocks to obtain operation results, and sending the operation results to the main processing circuit.
3. The integrated circuit chip apparatus of claim 1, wherein when the first data block comprises a lateral data block,
the main processing circuit is specifically configured to start the first mapping circuit to process the transverse data block to obtain a processed transverse data block and an identification data block associated with the transverse data block, or start the first mapping circuit to process the transverse data block according to a pre-stored identification data block associated with the transverse data block to obtain a processed transverse data block; splitting the processed transverse data block and the identification data block associated with the transverse data block to obtain a plurality of basic data blocks and identification data blocks associated with the basic data blocks; distributing the plurality of elementary data blocks and the identification data blocks associated with each of the plurality of elementary data blocks to a base processing circuit connected thereto; broadcasting the vertical data block to a base processing circuit connected thereto;
the basic processing circuit is used for starting the second mapping circuit to process the vertical data block according to the identification data block associated with the basic data block, performing reverse operation on the processed vertical data block and the basic data block to obtain an operation result, and sending the operation result to the main processing circuit.
4. The integrated circuit chip apparatus of claim 1, wherein when the first data block comprises a vertical data block,
the main processing circuit is specifically configured to start the first mapping circuit to process the vertical data block to obtain a processed vertical data block and an identification data block associated with the vertical data block, or start the first mapping circuit to process the vertical data block according to a prestored identification data block associated with the vertical data block to obtain a processed vertical data block; splitting the transverse data block to obtain a plurality of basic data blocks; distributing the plurality of basic data blocks to the basic processing circuits connected thereto; broadcasting the processed vertical data block and the identification data block associated with the vertical data block to the basic processing circuits connected thereto;
the basic processing circuit is used for starting the second mapping circuit to process the basic data block according to the identification data block associated with the vertical data block to obtain a processed basic data block; and performing reverse operation on the processed basic data blocks and the processed vertical data blocks to obtain an operation result, and sending the operation result to the main processing circuit.
5. The integrated circuit chip apparatus of any of claims 2-4,
the basic processing circuit is specifically configured to perform a reverse operation on the basic data block and the vertical data block to obtain a reverse operation result, accumulate the reverse operation result to obtain an operation result, and send the operation result to the main processing circuit;
and the main processing circuit is used for obtaining accumulation results after accumulating the operation results and arranging the accumulation results to obtain instruction results.
6. The integrated circuit chip apparatus of any of claims 2-4,
the main processing circuit is specifically configured to broadcast the vertical data block or the processed vertical data block to the plurality of basic processing circuits at one time; or,
the main processing circuit is specifically configured to divide the vertical data block or the processed vertical data block into a plurality of partial vertical data blocks, and broadcast the plurality of partial vertical data blocks to the plurality of basic processing circuits for multiple times.
7. The integrated circuit chip apparatus of any of claims 2-4,
the main processing circuit is specifically configured to split the processed vertical data block and the identification data block associated with the vertical data block to obtain a plurality of partial vertical data blocks and the identification data blocks associated with the plurality of partial vertical data blocks, and to broadcast the plurality of partial vertical data blocks and their respectively associated identification data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial vertical data blocks combine to form the processed vertical data block;
the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identification data block from the identification data block associated with the partial vertical data block and the identification data block associated with the basic data block, to process the partial vertical data block and the basic data block according to the connection identification data block to obtain a processed vertical data block and a processed basic data block, and to perform a reverse operation on the processed vertical data block and the processed basic data block;
or, the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the partial vertical data block to obtain a processed basic data block, and to perform a reverse operation on the processed basic data block and the partial vertical data block.
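One plausible reading of the connection identification data block in claim 7 is an elementwise AND of the two operands' masks, so that only positions marked in both survive compression. The sketch below assumes exactly that; every name in it is hypothetical.

```python
import numpy as np

def connect_and_compress(partial_vertical, mask_v, basic_block, mask_b):
    """Second mapping circuit (assumed behavior): AND the two identification
    data blocks into a connection identification data block, then keep only
    the element pairs the connection mask marks."""
    connection = mask_v & mask_b        # connection identification data block
    keep = connection.astype(bool)
    return partial_vertical[keep], basic_block[keep]

mask_v = np.array([1, 1, 0, 1], dtype=np.uint8)
mask_b = np.array([0, 1, 1, 1], dtype=np.uint8)
pv = np.array([2.0, 3.0, 0.0, 4.0])
bb = np.array([0.0, 5.0, 6.0, 7.0])

v_proc, b_proc = connect_and_compress(pv, mask_v, bb, mask_b)
print(np.dot(v_proc, b_proc))  # 3.0*5.0 + 4.0*7.0 = 43.0
```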
8. The integrated circuit chip apparatus of claim 1,
the main processing circuit is specifically configured to determine, if the nth reverse operation instruction is a multiplication instruction, that the nth layer input data and the nth layer weight group data are both horizontal data blocks and that the nth output result gradient is a vertical data block; and to determine, if the nth reverse operation instruction is a convolution instruction, that the nth layer input data and the nth layer weight group data are both vertical data blocks and that the nth output result gradient is a horizontal data block.
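Claim 8's instruction-dependent assignment of data-block roles amounts to a small dispatch table; a hedged sketch, with the instruction strings chosen purely for illustration.

```python
def assign_roles(instruction):
    """Decide which operands are distributed (horizontal) and which are
    broadcast (vertical) for the nth reverse operation instruction."""
    if instruction == "multiply":
        return {"input_data": "horizontal",
                "weight_group_data": "horizontal",
                "output_result_gradient": "vertical"}
    if instruction == "convolution":
        return {"input_data": "vertical",
                "weight_group_data": "vertical",
                "output_result_gradient": "horizontal"}
    raise ValueError(f"unsupported reverse operation instruction: {instruction}")

print(assign_roles("convolution"))
```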
9. The integrated circuit chip apparatus of any of claims 2-4,
the n-layer reverse operation further comprises one or any combination of: a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
10. The integrated circuit chip apparatus of claim 9,
the nth output result gradient is one of, or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the nth layer input data is one of, or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block;
the nth layer weight group data is one of, or any combination of: a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
11. A chip comprising the integrated circuit chip apparatus of any one of claims 1-10.
12. A neural network operation method, implemented in an integrated circuit chip device comprising the integrated circuit chip apparatus of any one of claims 1-10, the integrated circuit chip apparatus being configured to perform a training operation of the neural network.
CN201810164844.8A 2018-02-27 2018-02-27 Integrated circuit chip device and related product Active CN110197275B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810164844.8A CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product
PCT/CN2019/076088 WO2019165946A1 (en) 2018-02-27 2019-02-25 Integrated circuit chip device, board card and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810164844.8A CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
CN110197275A (en) 2019-09-03
CN110197275B (en) 2020-08-04

Family

ID=67751313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810164844.8A Active CN110197275B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product

Country Status (1)

Country Link
CN (1) CN110197275B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102472282B1 2019-10-12 2022-11-29 Baidu.com Times Technology (Beijing) Co., Ltd. AI training acceleration method and system using advanced interconnection communication technology

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1290367A * 1998-02-05 2001-04-04 Intellix A/S N-tuple or RAM based neural network classification system and method
CN106126481A * 2016-06-29 2016-11-16 Huawei Technologies Co Ltd Computing engine and electronic device
CN106447034A * 2016-10-27 2017-02-22 Institute of Computing Technology, Chinese Academy of Sciences Neural network processor based on data compression, design method and chip
CN107609641A (en) * 2017-08-30 2018-01-19 清华大学 Sparse neural network framework and its implementation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DaDianNao: A Machine-Learning Supercomputer; Yunji Chen; 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture; 2014-12-31; full text *

Also Published As

Publication number Publication date
CN110197275A (en) 2019-09-03

Similar Documents

Publication Publication Date Title
CN110197270B (en) Integrated circuit chip device and related product
EP3654208A1 (en) Chip device and related products
CN109993301B (en) Neural network training device and related product
CN110909872B (en) Integrated circuit chip device and related products
CN109993291B (en) Integrated circuit chip device and related product
CN113837922A (en) Computing device, data processing method and related product
US11710031B2 (en) Parallel processing circuits for neural networks
US11651202B2 (en) Integrated circuit chip device and related product
CN110197275B (en) Integrated circuit chip device and related product
CN111160541A (en) Integrated circuit chip device and related product
CN110197265B (en) Integrated circuit chip device and related product
CN110197272B (en) Integrated circuit chip device and related product
US11704544B2 (en) Integrated circuit chip device and related product
CN110197273B (en) Integrated circuit chip device and related product
CN110197263B (en) Integrated circuit chip device and related product
CN110197271B (en) Integrated circuit chip device and related product
CN110197274B (en) Integrated circuit chip device and related product
CN110197268B (en) Integrated circuit chip device and related product
CN111091189A (en) Integrated circuit chip device and related product
CN110197267B (en) Neural network processor board card and related product
CN110197266B (en) Integrated circuit chip device and related product
CN111767997B (en) Integrated circuit chip device and related products
CN109993289B (en) Integrated circuit chip device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant