CN110909870B - Training device and method - Google Patents

Training device and method

Info

Publication number
CN110909870B
CN110909870B CN201811074120.0A
Authority
CN
China
Prior art keywords
data
unit
parameter
neural network
nth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811074120.0A
Other languages
Chinese (zh)
Other versions
CN110909870A
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201811074120.0A priority Critical patent/CN110909870B/en
Publication of CN110909870A publication Critical patent/CN110909870A/en
Application granted granted Critical
Publication of CN110909870B publication Critical patent/CN110909870B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The present disclosure relates to a training device and method. The device comprises: a parameter compression unit, which determines the parameters to be compressed of the neural network according to the received model data of the neural network and compresses the parameters to be compressed to obtain a semantic vector corresponding to the neural network; a parameter storage unit, which stores the semantic vector corresponding to the neural network and sends the semantic vector to the parameter decompression unit or the operation unit when receiving a data reading instruction; a parameter decompression unit, which, when receiving the semantic vector, decompresses it to obtain the decompression parameters of the neural network and sends them to the operation unit; and an operation unit, which trains the neural network using the received semantic vector or decompression parameters. The device and method can compress the parameters to be compressed, thereby effectively reducing the size of the neural network model, reducing the memory requirement, and effectively improving the data processing speed of the neural network.

Description

Training device and method
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training apparatus and method.
Background
Artificial Neural Networks (ANN) have been a research hotspot in the field of artificial intelligence since the 1980s. From the perspective of information processing, an ANN abstracts the neuron network of the human brain, establishes a simplified model, and forms different networks according to different connection modes. In engineering and academia it is also often referred to directly as a neural network or neural-like network. A neural network is a computational model formed by a large number of interconnected nodes (also called neurons). Existing neural network operations are usually based on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) to implement the forward operation and the forward or reverse training operation of the neural network, and these operations involve a large amount of computation and high power consumption.
Disclosure of Invention
In view of this, the present disclosure provides a training apparatus and method to implement training of a neural network and implement real-time compression and decompression of parameters in the training.
According to an aspect of the present disclosure, a neural network training apparatus supporting compression and decompression is provided, the apparatus being configured to perform training of a neural network, the apparatus including:
the parameter compression unit is used for determining parameters to be compressed of the neural network according to the received model data of the neural network, and compressing the parameters to be compressed by using an encoder to obtain semantic vectors corresponding to the neural network;
the parameter storage unit is connected to the parameter compression unit and used for storing semantic vectors corresponding to the neural network and sending the semantic vectors to the parameter decompression unit or the arithmetic unit when receiving a data reading instruction;
the parameter decompression unit is connected to the parameter storage unit and used for decompressing the semantic vector by using a decoder when the semantic vector is received to obtain a decompression parameter of the neural network and sending the decompression parameter to the operation unit; and
and the operation unit, respectively connected to the parameter storage unit and the parameter decompression unit, is configured to train the neural network using the received semantic vector or decompression parameters.
According to another aspect of the present disclosure, a neural network chip is provided. The neural network chip includes a machine learning operation device or a combined processing device, where the machine learning operation device includes one or more of the neural network training devices supporting compression and decompression, and is configured to obtain input data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to other processing devices through an I/O interface. When the machine learning operation device includes a plurality of the training devices, the plurality of training devices can be connected through a specific structure and transmit data; for example, the training devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support larger-scale machine learning operations. The plurality of training devices may share the same control system or have their own control systems, may share memory or have their own memories, and may be interconnected in any interconnection topology;
the combined processing device comprises the machine learning arithmetic device, a universal interconnection interface and other processing devices;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user;
the combination processing apparatus further includes: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
According to another aspect of the present disclosure, an electronic device is provided, and the electronic device includes the neural network chip.
According to another aspect of the present disclosure, a board card is provided, where the board card includes: the device comprises a storage device, an interface device, a control device and the neural network chip;
wherein the neural network chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip, wherein,
the storage device includes: a plurality of groups of storage units, each group of storage units being connected to the chip through a bus, the storage units being DDR SDRAM;
the chip includes: the DDR controller is used for controlling data transmission and data storage of each memory unit;
the interface device is as follows: a standard PCIE interface.
According to another aspect of the present disclosure, a neural network training method supporting compression and decompression is provided, the method is applied to a neural network training device to perform training of a neural network, the neural network training device includes a parameter compression unit, a parameter storage unit, a parameter decompression unit, and an operation unit, and the method includes:
the parameter compression unit determines parameters to be compressed of the neural network according to received model data of the neural network, and compresses the parameters to be compressed by using an encoder to obtain semantic vectors corresponding to the neural network;
the parameter storage unit stores semantic vectors corresponding to the neural network, and sends the semantic vectors to the parameter decompression unit or the arithmetic unit when receiving a data reading instruction;
when receiving the semantic vector, the parameter decompression unit decompresses the semantic vector by using a decoder to obtain the decompression parameters of the neural network and sends the decompression parameters to the arithmetic unit; and
the arithmetic unit trains the neural network using the received semantic vector or decompression parameters.
The method and the device can compress the parameters to be compressed, thereby effectively reducing the size of a model of the neural network, reducing the requirement on the memory and effectively improving the data processing speed of the neural network.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of parameter compression and decompression according to an embodiment of the present disclosure.
Fig. 3 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
Fig. 4 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
FIG. 5 shows a cache module schematic according to an embodiment of the present disclosure.
FIG. 6 shows a block diagram of a main processing circuit according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
Fig. 8 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
Fig. 10 shows a schematic diagram of a compression and decompression process according to an embodiment of the present disclosure.
Fig. 11 shows a schematic diagram of a compression and decompression process according to an embodiment of the present disclosure.
FIG. 12 shows a schematic diagram of a combined processing device according to an embodiment of the present disclosure.
FIG. 13 shows a schematic diagram of a combined processing device according to an embodiment of the present disclosure.
Fig. 14 shows a schematic diagram of a board according to an embodiment of the present disclosure.
Fig. 15 shows a flowchart of a neural network training method supporting compression and decompression according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Training a neural network requires huge costs in storage, computation, and energy consumption. If the neural network can be compressed and decompressed in real time during training, the amount of storage, the amount of computation, and the energy consumption can all be reduced. However, conventional computing platforms, such as CPUs and GPUs, and most special-purpose accelerators cannot compress and decompress the neural network in real time.
The present disclosure provides a neural network training device supporting compression and decompression, so as to implement real-time compression and decompression on a neural network when training the neural network, thereby reducing memory, computation and energy consumption.
Referring to fig. 1, fig. 1 is a block diagram illustrating a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
As shown in fig. 1, the apparatus includes a parameter compressing unit 10, a parameter storing unit 20, a parameter decompressing unit 30 and an operation unit 40, wherein the parameter compressing unit 10 is connected to the parameter storing unit 20, the parameter storing unit 20 is connected to the parameter decompressing unit 30, and the operation unit 40 is connected to the parameter storing unit 20 and the parameter decompressing unit 30.
The parameter compression unit 10 is configured to determine a parameter to be compressed of the neural network according to received model data of the neural network, and compress the parameter to be compressed by using an encoder to obtain a semantic vector corresponding to the neural network, where the parameter to be compressed includes a weight of the neural network.
In one possible embodiment, the model data of the neural network may include the input vectors, neurons, weights, gradients, topology, learning rate, activation functions, and other parameters of the neural network.
The parameter compression unit 10 may compress the model data of the neural network, compressing the multidimensional data therein into low-dimensional data and reducing the vector length of the data, thereby reducing the memory pressure of storing the parameters.
For example, the weights of the neural network may be compressed, and the multidimensional weights may be compressed into a semantic vector with a fixed length, where the semantic vector includes information of the weights before compression, and it should be understood that when the weights are selected for compression, any number of weights may be selected for compression.
In a possible embodiment, the encoder may include one or more of neural networks such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a BiRNN (Bidirectional RNN), a GRU (Gated Recurrent Unit), an LSTM (Long Short-Term Memory network), and the like, and may also compress the model data of the neural network by using methods such as entropy coding, quantization coding, and mapping coding.
For example, an RNN may be selected as the encoder to encode and compress the weights; the following description takes an RNN encoder as an example.
Referring to fig. 2, fig. 2 is a diagram illustrating parameter compression and decompression according to an embodiment of the disclosure.
When an RNN is adopted to encode and compress the parameters to be compressed, a layer-by-layer greedy algorithm can be adopted to train the deep network.
As shown in fig. 2, the RNN includes an input layer and a plurality of hidden layers (two layers are taken as an example). When compressing the parameters to be compressed with a layer-by-layer greedy algorithm, first, the first layer of the RNN is trained on a plurality of vectors (the input vectors and weights of the neural network), and the first layer converts these vectors into a first intermediate vector composed of the hidden-unit activation values of the first layer; then, the second layer of the RNN takes the intermediate vector passed by the first layer as its input and converts it into a second intermediate vector composed of the hidden-unit activation values of the second layer; the same strategy is then applied to the subsequent hidden layers, with the output of the previous layer used as the input of the next layer, training the RNN model layer by layer; finally, the state of the last hidden layer at the current time can be used as the semantic vector of the hidden layers.
In an RNN, the hidden layer state at the current time is determined by the hidden layer state at the previous time and the input at the current time, for example: ht = f(ht-1, xt), where ht is the hidden layer state at the current time (time t), ht-1 is the hidden layer state at the previous time (time t-1), and xt is the input of the hidden layer at the current time.
After the hidden layer states at each time are obtained, the hidden layer states hT1 to hTx at the times T1 to Tx (x being an integer greater than 1) can be aggregated to generate a final semantic vector c = q({hT1, ..., hTx}), where q represents some non-linear function.
However, in an RNN, the hidden layer states at earlier times can no longer be observed once the current time has been computed, so the hidden layer state at the last time (time Tx) can simply be used as the semantic vector c, i.e., c = hTx.
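As a concrete illustration of the encoding recurrence above, the following is a minimal NumPy sketch of an RNN encoder that folds a sequence of parameter vectors into a fixed-length semantic vector c = hTx. It is not the patented encoder; the function name rnn_encode, the tanh cell, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    """Minimal RNN encoder sketch: folds a sequence of parameter
    vectors x1..xT into a fixed-length semantic vector c = hT."""
    h = np.zeros(W_hh.shape[0])
    for x in inputs:                      # one time step per input vector
        # ht = f(h(t-1), xt); tanh stands in for the nonlinearity f
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h                              # last hidden state used as semantic vector c

# usage sketch: compress 8 weight vectors of length 64 into a 16-dim semantic vector
rng = np.random.default_rng(0)
weights_to_compress = [rng.standard_normal(64) for _ in range(8)]
W_xh = rng.standard_normal((16, 64)) * 0.1
W_hh = rng.standard_normal((16, 16)) * 0.1
b_h = np.zeros(16)
c = rnn_encode(weights_to_compress, W_xh, W_hh, b_h)   # len(c) == 16
```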
In one possible implementation, the parameters of the layers may also be adjusted using a back propagation algorithm.
In a possible implementation manner, the parameter compression unit 10 is further configured to determine whether the parameters to be compressed are sparse, and to send a sparsification flag corresponding to the semantic vector to the parameter storage unit when the parameters to be compressed are sparse.
Being sparse means that the parameter matrix to be compressed contains a relatively large amount of data that is 0 or whose absolute value is less than or equal to a preset threshold.
In one possible embodiment, the sparsification flag may be represented, for example, by a bool (Boolean) variable.
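The patent does not specify how sparsity is measured; the sketch below only illustrates one plausible test consistent with the definition above (a large share of entries that are 0 or below a threshold in absolute value). The threshold, the ratio, and the function name is_sparse are illustrative assumptions.

```python
import numpy as np

def is_sparse(params, threshold=1e-3, min_zero_ratio=0.5):
    """Sketch of the sparsity test: the parameter matrix is treated as
    sparse when a large share of entries is 0 or below the threshold
    in absolute value. Threshold and ratio are illustrative choices."""
    near_zero = np.abs(params) <= threshold
    return bool(near_zero.mean() >= min_zero_ratio)   # boolean sparsification flag

flag = is_sparse(np.array([[0.0, 0.0002, 1.2], [0.0, 0.0, -0.4]]))  # True: 4 of 6 entries near zero
```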
And a parameter storage unit 20, configured to store a semantic vector corresponding to the neural network, and send the semantic vector to the parameter decompression unit or the operation unit when receiving a data reading instruction.
In one possible embodiment, the parameter storage unit 20 may further store the input vector and the sparsification flag of the neural network.
In one possible embodiment, the data reading instruction may be issued by a controller external to the neural network operation device supporting compression and decompression, or may be issued by the operation unit or the parameter decompression unit within the device.
In a possible implementation, when receiving a data reading instruction, the parameter storage unit 20 sends the semantic vector to the parameter decompression unit or the arithmetic unit as follows:
when the data reading instruction is received and the parameter storage unit does not store a sparsification flag corresponding to the semantic vector, the semantic vector is sent to the parameter decompression unit;
when the data reading instruction is received and the parameter storage unit stores a sparsification flag corresponding to the semantic vector, the semantic vector is sent to the operation unit.
And the parameter decompressing unit 30 is configured to decompress the semantic vector by using a decoder when receiving the semantic vector, obtain a decompressing parameter of the neural network, and send the decompressing parameter to the arithmetic unit.
The parameter decompressing unit 30 may decode and decompress the semantic vector, so that the same number of decompressing parameters as the number of the parameters to be compressed may be obtained, and the decompressing parameters include information of the parameters to be compressed.
For example, when the parameters to be compressed are N weights, the parameter decompressing unit 30 may decode and decompress the semantic vector into N decompression parameters, which are approximately equal to the N original weights.
In a possible implementation manner, the decoder may include one or more of neural networks such as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a BiRNN (Bidirectional RNN), a GRU (Gated Recurrent Unit), an LSTM (Long Short-Term Memory network), and the like, and the decoder may also decompress the model data of the neural network by using methods such as entropy decoding, quantization decoding, and mapping decoding.
The selection of the decoder may correspond to the encoder, for example, when the encoder selects CNN, the decoder may be CNN. However, the selection of the decoder and the encoder may be arbitrary, and for example, when the encoder selects the CNN, the decoder may select any one or more of the CNN, RNN, and the like.
The decoding process will be described below by taking the decoder as an RNN.
Please refer to fig. 2. As shown in fig. 2, the RNN model used by the parameter decompression unit 30 to decompress the semantic vector includes a plurality of hidden layers (one layer is taken as an example in the figure) and an output layer for outputting the decompression parameters.
The process of decompressing the semantic vector by the parameter decompressing unit 30 can be regarded as the inverse process of the process of compressing the parameter to be compressed by the parameter compressing unit 10, and at the stage of decompressing, the next output can be predicted according to the generated output sequence, so as to decompress the semantic vector of the hidden layer into the decompressing parameter.
In an RNN, the decoding process can predict the next output yt given the aforementioned semantic vector c and the already generated output sequence y1, y2, ..., yt-1.
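The following is a minimal NumPy sketch of such a decoding recurrence, in which each output yt is predicted from the semantic vector c and the previously generated output. The conditioning scheme, the weight names, and the use of a tanh cell are assumptions made for illustration, not the device's actual decoder.

```python
import numpy as np

def rnn_decode(c, n_outputs, W_ch, W_yh, W_hy, b_h, b_y):
    """Minimal RNN decoder sketch: starting from the semantic vector c,
    predict each output yt from c and the previously generated output."""
    h = np.tanh(W_ch @ c + b_h)                  # initialise hidden state from c
    y = np.zeros(W_hy.shape[0])                  # y0: nothing generated yet
    outputs = []
    for _ in range(n_outputs):
        h = np.tanh(W_ch @ c + W_yh @ y + b_h)   # ht depends on c and y(t-1)
        y = W_hy @ h + b_y                        # yt: one decompressed parameter vector
        outputs.append(y)
    return outputs                                # as many outputs as parameters to recover

# usage sketch: recover 8 parameter vectors of length 4 from a 16-dim semantic vector
rng = np.random.default_rng(0)
params = rnn_decode(rng.standard_normal(16), n_outputs=8,
                    W_ch=rng.standard_normal((8, 16)) * 0.1,
                    W_yh=rng.standard_normal((8, 4)) * 0.1,
                    W_hy=rng.standard_normal((4, 8)) * 0.1,
                    b_h=np.zeros(8), b_y=np.zeros(4))
```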
And the arithmetic unit 40 is respectively connected to the parameter storage unit and the parameter decompression unit, and is used for training the neural network on the received semantic vector or the decompression parameter.
The training of the neural network includes forward operation, backward operation, weight updating, and the like.
In one possible implementation, the inverse operation may include: a bias operation, a fully-connected operation, a matrix-matrix multiplication (GEMM) operation, a matrix-vector multiplication (GEMV) operation, an activation operation, or any combination thereof.
In one possible embodiment, the neural network may comprise n layers, n being an integer greater than or equal to 2.
The forward training of the neural network usually performs a forward operation, in which each layer uses its own input data and weights to calculate the corresponding output data according to the operation rule specified by the type of the layer.
The forward operation process (also called inference) of the neural network processes the input data of each layer, layer by layer, and obtains the output data through certain calculations; it has the following characteristics:
input of a certain layer:
the input of a certain layer can be input data of a neural network;
the input of a certain layer can be the output of other layers;
the input of a certain layer can be the output of the same layer at the previous time (corresponding to the case of a recurrent neural network);
a layer may obtain input from a plurality of said input sources simultaneously;
output of a certain layer:
the output of a certain layer can be used as the output result of the neural network;
the output of a certain layer may be the input of other layers;
the output of a certain layer may be the input of the same layer at the next time (in the case of a recurrent neural network);
the output of a certain layer may send results to a plurality of the above output destinations;
specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:
convolutional layers (i.e., performing convolution operations);
fully-connected layers (i.e., performing fully-connected operations);
normalization (regularization) layer: including LRN (Local Response Normalization) layer, BN (Batch Normalization) layer, etc.;
a pooling layer;
an activation layer: including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer;
the inverse operations of the layers: each layer needs to perform two parts of operations. One part uses the gradient of the output data (which may be sparsely represented) and the input data (which may be sparsely represented) to calculate the gradient of the weights, which is used to update the weights of the present layer in the "weight update" step; the other part uses the gradient of the output data (which may be sparsely represented) and the weights (which may be sparsely represented) to calculate the gradient of the input data, which is used as the gradient of the output data of the next layer in the inverse operation so that that layer can perform its own inverse operation;
the backward operation propagates the gradient backward from the last layer, in the reverse order of the forward operation.
In one possible embodiment, the output data gradient used in a layer's inverse operation may come from:
the gradient returned by the final loss function (also called cost function) of the neural network;
the input data gradients of other layers;
the input data gradient of the same layer at the previous time (corresponding to the case of a recurrent neural network);
a layer may simultaneously acquire output data gradients from a plurality of the above sources;
after the reverse operation of the neural network is executed, the weight gradient of each layer is calculated, and then the weight gradient is used in the operation unit to update the weight;
in the forward operation, after the execution of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the output data calculated in the operation unit as the input data of the next layer for operation (or performs some operation on the output data and then takes the output data as the input data of the next layer), and at the same time, the weight is replaced by the weight of the next layer; in the reverse operation, after the reverse operation of the artificial neural network in the previous layer is completed, the next layer of operation instruction takes the input data gradient calculated in the operation unit as the output data gradient of the next layer for operation (or performs some operation on the input data gradient and then takes the input data gradient as the output data gradient of the next layer), and at the same time, replaces the weight with the weight of the next layer
In a possible implementation manner, when the parameter to be compressed is a weight of the neural network, if the weight is sparse, the semantic vector after compressing the parameter can be directly used by the arithmetic unit 40 to train the neural network; if the weight is not sparse, the corresponding semantic vector needs to be decompressed to generate a decompression parameter, and the decompression parameter can be directly used by the arithmetic unit 40 to train the neural network.
In a possible implementation, training the neural network on the received semantic vector or decompression parameters may include the following steps (a sketch of this flow is given after the list):
determining first-layer input data and first-layer weight group data according to the semantic vector or the decompression parameter, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
obtaining an nth output result gradient according to the nth output result, and obtaining an nth reverse operation instruction of an nth layer of reverse operation and nth input data and nth weight group data required by the nth reverse operation instruction according to the semantic vector or the decompression parameter;
dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction;
executing operation in a neural network in a parallel mode according to a second data block to obtain an operation result, wherein the second data block is associated with the processed first data block;
processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
and taking the nth-layer input data gradient as the (n-1)th output result gradient, performing the (n-1)th-layer reverse operation to obtain the (n-1)th-layer weight group gradient, and updating the weight group data of the corresponding layer by applying the (n-1)th-layer weight group gradient, wherein the weight group data comprises at least two weights.
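The sketch below strings the steps above into a single training step over n layers, using plain fully-connected layers (no biases or activations) as stand-ins and a squared-error output gradient; the blocking into vertical and horizontal data blocks and the master/slave distribution are omitted. All of these simplifications are assumptions made for illustration only.

```python
import numpy as np

def train_step(x, weights, target, lr=0.01):
    """Sketch of the n-layer flow described above: run the forward
    operation layer by layer, form the nth output-result gradient,
    then run the reverse operation from layer n down to layer 1,
    updating each layer's weight-group data as it goes."""
    activations = [x]
    for W in weights:                         # forward operation, layers 1..n
        activations.append(activations[-1] @ W)
    grad = activations[-1] - target           # nth output result gradient (squared-error example)
    for i in reversed(range(len(weights))):   # reverse operation, layers n..1
        a = activations[i]
        dW = a.T @ grad                       # layer-i weight group gradient
        grad = grad @ weights[i].T            # becomes the output result gradient of layer i-1
        weights[i] -= lr * dW                 # update layer-i weight group data
    return weights

# usage sketch: two layers (n = 2)
layers = [np.random.randn(8, 16), np.random.randn(16, 4)]
layers = train_step(np.random.randn(32, 8), layers, np.zeros((32, 4)))
```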
The neural network operation device supporting compression and decompression according to the present disclosure may be implemented by a hardware circuit (for example, but not limited to, an application specific integrated circuit ASIC), and the parameter compression unit 10, the parameter storage unit 20, the parameter decompression unit 30, and the operation unit 40 may be integrated into a single chip (for example, a neural network chip).
The neural network operation device supporting compression and decompression according to the present disclosure may be applied in the following (including but not limited to) scenarios: the system comprises various electronic products such as a data processing device, a robot, a computer, a printer, a scanner, a telephone, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device and a wearable device; various vehicles such as airplanes, ships, vehicles, and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
Through the above device, the parameters to be compressed of the neural network are determined according to the received model data of the neural network, and the encoder is used to compress them to obtain the semantic vector corresponding to the neural network, the parameters to be compressed including the weights of the neural network. The semantic vector corresponding to the neural network is stored, and is sent to the parameter decompression unit or the operation unit when a data reading instruction is received. When the semantic vector is received by the parameter decompression unit, the decoder is used to decompress it to obtain the decompression parameters of the neural network, which are sent to the operation unit; the operation unit then trains the neural network using the received semantic vector or decompression parameters.
Referring to fig. 3, fig. 3 is a block diagram illustrating a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
The training apparatus is configured to perform training of a neural network, where the neural network includes n layers, n is an integer greater than or equal to 2, as shown in fig. 3, the training apparatus includes an operation unit 40, a controller unit 11, a parameter compression unit 10, and a parameter decompression unit 30, and the operation unit 40 includes: a master processing circuit 401 and a plurality of slave processing circuits 402.
The parameter decompressing unit 30 is configured to receive a semantic vector, decompress the semantic vector by using a decoder, and obtain a decompressing parameter of the neural network, where the semantic vector is compressed data of model data of the neural network.
The parameter decompressing unit 30 may decode and decompress the semantic vector, so that the same number of decompressing parameters as the number of the parameters to be compressed may be obtained, and the decompressing parameters include information of the parameters to be compressed.
The controller unit 11 is electrically connected to the parameter decompressing unit 30, and is configured to obtain a decompressing parameter, determine first layer input data and first layer weight group data according to the decompressing parameter, perform n layers of forward operations on the first layer input data and the first layer weight group data to obtain an nth output result of the forward operation, and send the nth output result to the main processing circuit.
A main processing circuit 401, electrically connected to the controller unit 11, configured to obtain an nth output result gradient according to the nth output result, and obtain an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the decompression parameter; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; and sending the first data block to at least one slave processing circuit in a plurality of slave processing circuits connected with the master processing circuit according to the nth reverse operation instruction.
In one possible embodiment, the nth output result gradient is: one or any combination of vectors, matrices, three-dimensional data blocks, four-dimensional data blocks, and n-dimensional data blocks.
The plurality of slave processing circuits 402, electrically connected to the master processing circuit 401, are configured to execute operations in a neural network in a parallel manner according to a second data block to obtain an operation result, and transmit the operation result to the master processing circuit through a slave processing circuit connected to the master processing circuit, where the second data block is a data block determined by the slave processing circuit to receive data sent by the master processing circuit, and the second data block is associated with the processed first data block.
The main processing circuit 401 is further configured to process the operation result to obtain an nth layer weight group gradient and an nth layer input data gradient, and update the nth layer weight group data by applying the nth layer weight group gradient.
The controller unit 11 is further configured to take the nth-layer input data gradient as the (n-1)th output result gradient, perform the (n-1)-layer reverse operations to obtain the (n-1)-layer weight group gradients, and update the weight group data of the corresponding layers by using the (n-1)-layer weight group gradients, where the weight group data includes at least two weights;
and the parameter compression unit 10 is electrically connected to the controller unit, the main processing circuit, and the slave processing circuits, and is configured to take data generated by the controller unit, the main processing circuit, and the slave processing circuits as data to be compressed, and to compress the data to be compressed by using an encoder to obtain the corresponding semantic vector.
In one possible embodiment, parameters, such as weights, of the various layers of the neural network may also be adjusted using a back propagation algorithm.
In a possible implementation manner, the main processing circuit 401 is further configured to determine that the nth layer input data and the nth layer weight group data are both horizontal data blocks and the nth output result gradient is a vertical data block when the nth inverse operation instruction is a multiplication instruction; and when the nth reverse operation instruction is a convolution instruction, determining that the nth layer of input data and the nth layer of weight group data are both vertical data blocks, and determining that the nth output result gradient is a horizontal data block.
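The sketch below only illustrates one possible reading of the horizontal/vertical block partition: splitting a matrix along rows or along columns before distributing it to the slave processing circuits. The mapping of "horizontal" to the row axis, the function name split_blocks, and the block counts are assumptions; the patent does not define the partition at this level of detail.

```python
import numpy as np

def split_blocks(data, kind, n_slaves):
    """Illustrative partition only: a 'horizontal' data block is split
    along rows for distribution to slave processing circuits, a
    'vertical' data block along columns."""
    axis = 0 if kind == "horizontal" else 1
    return np.array_split(data, n_slaves, axis=axis)

# for a matrix-multiplication instruction (per the rule above):
# input data and weight-group data -> horizontal blocks, output-result gradient -> vertical blocks
input_blocks = split_blocks(np.ones((64, 32)), "horizontal", 4)
grad_blocks  = split_blocks(np.ones((64, 16)), "vertical", 4)
```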
In one possible embodiment, the nth output result gradient is: one or any combination of vector, matrix, three-dimensional data block, four-dimensional data block and n-dimensional data block;
in a possible implementation manner, the nth layer input data is: one or any combination of vector, matrix, three-dimensional data block, four-dimensional data block and n-dimensional data block;
in a possible implementation manner, the n layers of weight group data are: one or any combination of vectors, matrices, three-dimensional data blocks, four-dimensional data blocks, and n-dimensional data blocks.
With this neural network training device, the parameter decompression unit uses a decoder to decompress the semantic vector to obtain the decompression parameters of the neural network. The controller unit determines the first-layer input data and first-layer weight group data according to the decompression parameters, executes the n layers of forward operations of the neural network on the first-layer input data and first-layer weight group data to obtain the nth output result of the forward operation, and sends the nth output result to the main processing circuit. The main processing circuit obtains the nth output result gradient according to the nth output result, and obtains the nth reverse operation instruction of the nth-layer reverse operation as well as the nth-layer input data and nth-layer weight group data required by the nth reverse operation instruction according to the decompression parameters; it divides the nth output result gradient, the nth-layer input data, and the nth-layer weight group data into vertical data blocks and horizontal data blocks according to the nth reverse operation instruction. The controller unit takes the nth-layer input data gradient as the (n-1)th output result gradient to execute the (n-1)-layer reverse operations, obtaining the (n-1)-layer weight group gradients, and applies them to update the weight group data of the corresponding layers. The parameter compression unit takes the data generated by the controller unit, the main processing circuit, and the slave processing circuits as data to be compressed, and compresses it by using the encoder to obtain the corresponding semantic vector. The method and the device can compress the parameters to be compressed, thereby effectively reducing the size of the model of the neural network, reducing the requirement on memory, and effectively improving the data processing speed of the neural network.
Referring to fig. 4, fig. 4 is a block diagram illustrating a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
As shown in fig. 4, the training apparatus may further include:
the storage unit 50 is electrically connected to the parameter compression unit 10 and the parameter decompression unit 30, and can be used for storing the semantic vector. After the parameter compression unit 10 completes compression of the parameters of the neural network, the semantic vector generated after compression may be stored by the storage unit 50.
In one possible implementation, the storage unit 50 may include a cache 502, a register 501, and a data I/O unit 503.
In this embodiment, the cache 502 may include a neuron cache and a weight cache. The neuron cache is used for storing data related to the neurons in the neural network, and the weight cache is used for storing data related to the weights in the neural network.
Wherein the storage unit includes: the register 501 and the buffer 502 are combined arbitrarily.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating a cache module according to an embodiment of the disclosure.
As shown in fig. 5, the neuron cache may include:
an input neuron cache 5021, configured to store data related to an input neuron in a neural network, where the input neuron cache 5021 further includes an input neuron index cache 5030 and an input neuron gradient cache 5031, the input neuron index cache 5030 is configured to store an input neuron index, and the input neuron gradient cache 5031 is configured to store an input neuron gradient in an inverse computation process;
an output neuron cache 5024, configured to store data related to output neurons in the neural network, where the output neuron cache 5024 may include an output neuron index cache 5040 and an output neuron gradient cache 5041, where the output neuron index cache 5040 is configured to store output neuron indexes, and the output neuron gradient cache 5041 is configured to store output neuron gradients during inverse computation.
The weight caching comprises the following steps:
an input weight cache 5027, configured to store data related to an input weight in a neural network, where the input weight cache 5027 may include an input weight index cache 5050 and an input weight gradient cache 5051, the input weight index cache 5050 is configured to store an input weight, and the input weight gradient cache 5051 is configured to store an input weight gradient in a reverse calculation process;
the output weight buffer 5029 is used for storing data related to output weights in the neural network, wherein the output weight buffer further comprises an output weight index buffer 5060 and an output weight gradient buffer 5061, the output weight index buffer 5060 is used for storing the output weight index, and the output weight gradient buffer 5061 is used for storing the output weight gradient in the reverse calculation process.
Of course, the above division of the cache 502 is exemplary and is not intended to limit the present disclosure. Besides the above examples, each unit in the cache may be multiplexed, and when the cache is multiplexed, the number of cache units in the cache 502 may be reduced or increased. For example, the input neuron index cache 5030 in the input neuron cache 5021 may be used to store both input neuron indexes and input neuron gradients, and when different data needs to be stored, the input neuron index cache 5030 may be used to store any other type of data, which is not limited in the present disclosure.
The direct memory access unit 15 is electrically connected to the storage unit 50, and is configured to obtain the semantic vector from the storage unit 50 or store the semantic vector to the storage unit.
In a possible implementation manner, the parameter compression unit 10 is further configured to determine whether the parameters to be compressed are sparse, and to send a sparsification flag corresponding to the semantic vector to the storage unit 50 when the parameters to be compressed are sparse;
the storage unit 50 is further configured to store the sparsification flag,
the storage unit 50 is further configured to send the semantic vector to the parameter decompression unit or the controller unit when receiving the data reading instruction, and includes:
when the data reading instruction is received and the storage unit 50 stores therein the sparsification flag corresponding to the semantic vector, sending the semantic vector to the controller unit 11;
the controller unit 11 is further configured to obtain the semantic vector, and determine first-layer input data and first-layer weight group data according to the semantic vector.
In a possible implementation, the storage unit 50 sends the semantic vector to the parameter decompression unit 30 or the operation unit 40 when receiving a data reading instruction, and further includes:
when the data reading instruction is received and the sparsification flag corresponding to the semantic vector is not stored in the storage unit 50, the semantic vector is sent to the parameter decompression unit 30.
In a possible embodiment, the controller unit 11 may comprise: instruction storage unit 110, instruction processing unit 111, and store queue unit 113.
The instruction storage unit 110 is configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions;
the storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes: and a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue.
In a possible embodiment, the controller unit 11 further comprises: a dependency processing unit;
the dependency relationship processing unit is configured to determine whether an association relationship exists between a first operation instruction and a zeroth operation instruction before the first operation instruction, if the association relationship exists between the first operation instruction and the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and after the zeroth operation instruction is executed, extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit;
the determining whether the first operation instruction and a zeroth operation instruction before the first operation instruction have an association relation or not comprises the following steps:
extracting a first storage address interval of required data in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required data in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relation, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relation.
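The address-interval test described above can be captured in a few lines; the sketch below assumes half-open (start, end) intervals and a helper name has_dependency, both of which are illustrative choices rather than anything specified in the patent.

```python
def has_dependency(first_interval, zeroth_interval):
    """Association check described above: two operation instructions are
    associated when the storage address intervals of their required
    data overlap. Intervals are (start, end) pairs, end exclusive."""
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start < z_end and z_start < f_end

# the first instruction must wait for the zeroth one before being issued
assert has_dependency((0x100, 0x180), (0x140, 0x200)) is True
assert has_dependency((0x100, 0x140), (0x140, 0x200)) is False
```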
FIG. 6 shows a block diagram of a main processing circuit according to an embodiment of the present disclosure.
In one possible implementation, as shown in fig. 6, the main processing circuit 401 may further include: one or any combination of the conversion processing circuit 4011, the activation processing circuit 4012, and the addition processing circuit 4013.
A conversion processing circuit 4011, configured to perform an interchange between a first data structure and a second data structure (e.g., conversion between continuous data and discrete data) on the data block or intermediate result received by the main processing circuit 401, or to perform an interchange between a first data type and a second data type (e.g., conversion between a fixed-point type and a floating-point type) on the data block or intermediate result received by the main processing circuit.
An activation processing circuit 4012 is configured to perform an activation operation of data in the main processing circuit 401.
And an addition processing circuit 4013 configured to perform addition or accumulation.
In one possible implementation, the slave processing circuit may include a multiplication processing circuit, an optional forwarding processing circuit, and an accumulation processing circuit.
The multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result;
the forwarding processing circuit (optional) is configured to forward the received data block or the product result;
and the accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain the intermediate result.
Referring to fig. 7, fig. 7 is a block diagram illustrating a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
In a possible implementation, as shown in fig. 7, the operation unit 40 may further include: a tree module 60, the tree module 60 comprising: a root port 601 and a plurality of branch ports 602, wherein the root port 601 of the tree module 60 is connected with the master processing circuit 401, and the plurality of branch ports 602 of the tree module 60 are respectively connected with one slave processing circuit 402 of the plurality of slave processing circuits;
the tree module 60 is configured to forward the data blocks, the weight forward operation instructions, and the reverse operation instructions between the master processing circuit 401 and the multiple slave processing circuits 402.
In an alternative embodiment, as shown in fig. 7, the arithmetic unit comprises: a tree module 60, said tree module comprising: a root port 601 and a plurality of branch ports 602, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module 60 is configured to forward data blocks, weights, and operation instructions between the master processing circuit and the plurality of slave processing circuits.
Optionally, the tree module 60 is an optional component of the training apparatus and may include at least one layer of nodes, which are line structures with a forwarding function and which may themselves have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module 60 may have an e-ary tree structure, for example, the binary tree structure shown in fig. 7, or a ternary tree structure, where e is an integer greater than or equal to 2. The specific value of e is not limited in the embodiments of the present disclosure; the number of layers may be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example, to nodes of the last layer as shown in fig. 7.
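As an illustration of forwarding-only tree nodes, the sketch below derives the root-to-leaf forwarding chain for each slave processing circuit in a full binary tree, using heap-style node indexing. The indexing scheme and the function name tree_route are assumptions made for illustration; the patent only specifies that the nodes forward data without computing.

```python
def tree_route(n_slaves):
    """Sketch of the forwarding paths in a binary tree module: slave i
    (a leaf) is reached from the root port through the chain of
    internal forwarding nodes returned here (heap-style indexing)."""
    paths = {}
    first_leaf = n_slaves - 1                 # internal nodes 0..n-2, leaves n-1..2n-2
    for leaf in range(first_leaf, 2 * n_slaves - 1):
        node, path = leaf, []
        while node > 0:
            node = (node - 1) // 2            # parent of the current node
            path.append(node)
        paths[leaf - first_leaf] = list(reversed(path))   # root-to-leaf forwarding chain
    return paths

print(tree_route(4))   # {0: [0, 1], 1: [0, 1], 2: [0, 2], 3: [0, 2]}
```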
Fig. 8 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
In a possible implementation, as shown in fig. 8, the operation unit 40 may further include: a branch processing circuit 403.
The main processing circuit 401 is configured to allocate an input neuron into a plurality of data blocks, and send at least one of the data blocks, a weight, and at least one of a plurality of operation instructions to the branch processing circuit 403;
the branch processing circuit 403 is configured to forward data blocks, weights, and operation instructions between the master processing circuit 401 and the plurality of slave processing circuits 402;
the slave processing circuits 402 are configured to perform operations on the received data block and the weights according to the operation instruction to obtain an intermediate result, and to transmit the intermediate result to the branch processing circuit 403;
the main processing circuit 401 is configured to perform subsequent processing on the intermediate result sent by the branch processing circuit 403 to obtain a result of the calculation instruction, and send the result of the calculation instruction to the controller unit 11.
Fig. 9 shows a block diagram of a neural network training device supporting compression and decompression according to an embodiment of the present disclosure.
In one possible implementation, as shown in fig. 9, the plurality of slave processing circuits 402 may be distributed in an array; each slave processing circuit 402 is connected to the other adjacent slave processing circuits 402, and the master processing circuit 401 is connected to K slave processing circuits 402 among the plurality of slave processing circuits 402, the K slave processing circuits 402 being: the a slave processing circuits 402 in row 1, the a slave processing circuits 402 in row b, and the b slave processing circuits 402 in column 1;
the K slave processing circuits 402, configured to forward data and instructions between the master processing circuit 401 and the plurality of slave processing circuits 402;
the master processing circuit 401 is configured to allocate an input data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of a plurality of operation instructions to the K slave processing circuits 402;
the K slave processing circuits 402 are configured to forward data between the master processing circuit 401 and the plurality of slave processing circuits 402;
the slave processing circuits 402 are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and transmit the operation result to the K slave processing circuits 402;
the main processing circuit 401 is configured to perform subsequent processing on the intermediate results sent by the K slave processing circuits 402 to obtain the result of the calculation instruction, and to send the result of the calculation instruction to the controller unit 11.
It should be noted that the above-described training apparatus shows the case where the parameter decompression unit 30 and the parameter compression unit 10 are provided outside the arithmetic unit 40; in other embodiments, the parameter compression unit 10 and/or the parameter decompression unit 30 may be provided inside the arithmetic unit 40, or inside the controller unit 11.
The parameter compression unit 10 and the parameter decompression unit 30 may also compress and decompress by other methods. For example, the parameter compression unit 10 and the parameter decompression unit 30 may perform compression and decompression by using a mapping coding method. Of course, the parameter compression unit 10 and the parameter decompression unit 30 may be separate units or may be combined units.
In one possible embodiment, the parameter compression unit 10 may compress sparse parameters to be compressed by using a Multiplexer (MUX) to generate a fixed-length semantic vector, so as to implement compression of the parameters to be compressed. In contrast, the parameter decompression unit 30 may decompress the fixed-length semantic vectors using a Demultiplexer (DEMUX), thereby obtaining decompression parameters.
The following describes the compression and decompression processes by taking MUX as an example of the parameter compression unit 10 and DEMUX as an example of the parameter decompression unit 30.
Referring to fig. 10, fig. 10 is a schematic diagram illustrating a compression and decompression process according to an embodiment of the present disclosure.
As shown in fig. 10, when the parameter to be compressed is a sparse neuron, the neuron index may be used to identify the non-zero neuron data, and the MUX selects the non-zero neurons; the non-zero neurons together with the neuron index may then be used as the fixed-length semantic vector, so that the parameter decompression unit 30 can later decompress the data. Correspondingly, when the parameter decompression unit 30 obtains the fixed-length semantic vector, it uses the non-zero neurons and the neuron index to determine where each value belongs, and obtains the decompression parameters (the neurons) through DEMUX decompression.
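A software analogue of this MUX/DEMUX path may help. The sketch below is our own illustrative code, not the hardware itself, and the fixed-length framing of the semantic vector is not modelled: compression keeps only the non-zero neurons together with their neuron index, and decompression scatters them back.

```python
import numpy as np

def mux_compress(neurons):
    """MUX-like selection: keep only the non-zero neurons plus a neuron index."""
    index = np.nonzero(neurons)[0]           # neuron index of the non-zero entries
    values = neurons[index]                  # compressed non-zero neurons
    return values, index, neurons.shape[0]   # (values, index) form the semantic vector

def demux_decompress(values, index, length):
    """DEMUX-like scatter: rebuild the dense neuron vector from the semantic vector."""
    neurons = np.zeros(length, dtype=values.dtype)
    neurons[index] = values
    return neurons

if __name__ == "__main__":
    sparse = np.array([0.0, 1.5, 0.0, 0.0, -2.0, 0.0])
    v, idx, n = mux_compress(sparse)
    assert np.array_equal(demux_decompress(v, idx, n), sparse)
```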
In a possible implementation, the parameter compression unit 10 may implement compression of the parameters to be compressed by using quantization coding and entropy coding to obtain compression parameters (fixed-length semantic vectors). Correspondingly, the parameter decompressing unit 30 may decompress the fixed-length semantic vector by using entropy decoding and quantization decoding to obtain the decompressed parameter.
The following describes the compression and decompression processes by taking the parameter compression unit 10 to implement compression of the weight matrix by using quantization coding and entropy coding methods, and the parameter decompression unit 30 to implement decoding by using entropy decoding and quantization decoding methods.
Referring to fig. 11, fig. 11 is a schematic diagram illustrating a compression and decompression process according to an embodiment of the present disclosure.
As shown in fig. 11, the parameter compression unit 10 may quantize the weight matrix by a quantization coding method to obtain a corresponding codebook and dictionary, then process the codebook and dictionary by an entropy coding method, and use the processed codebook and dictionary as the compressed fixed-length semantic vector. Correspondingly, after the parameter decompression unit 30 receives the fixed-length semantic vector, it may perform entropy decoding on the fixed-length semantic vector to obtain the entropy-decoded codebook and dictionary, and then decode them by a quantization decoding method to obtain the decompression parameter (the decompressed weight matrix).
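A rough software sketch of this two-stage pipeline is given below. It is our own illustration rather than the device's encoder: it applies the entropy-coding stage only to the dictionary and uses zlib's Huffman-based coder as a stand-in for that stage. Reconstruction is exact up to the quantization error.

```python
import numpy as np
import zlib

def quantize(weights, levels=16):
    """Quantization coding: map each weight to the nearest codebook entry.
    Returns the codebook and a dictionary of codebook indices."""
    codebook = np.quantile(weights, np.linspace(0.0, 1.0, levels))
    dictionary = np.abs(weights.reshape(-1, 1) - codebook).argmin(axis=1)
    return codebook, dictionary.astype(np.uint8), weights.shape

def compress(weights, levels=16):
    codebook, dictionary, shape = quantize(weights, levels)
    # zlib stands in for the entropy-coding stage applied to the dictionary.
    packed = zlib.compress(dictionary.tobytes())
    return codebook, packed, shape

def decompress(codebook, packed, shape):
    dictionary = np.frombuffer(zlib.decompress(packed), dtype=np.uint8)
    return codebook[dictionary].reshape(shape)   # quantization decoding

if __name__ == "__main__":
    w = np.random.randn(64, 64).astype(np.float32)
    cb, packed, shape = compress(w)
    w_hat = decompress(cb, packed, shape)
    print("max quantization error:", np.abs(w - w_hat).max())
```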
Of course, in other embodiments, the parameter compressing unit 10 and the parameter decompressing unit 30 may also use other methods to implement compression and decompression, and the disclosure is not limited thereto.
The present disclosure also provides a machine learning operation device, which includes one or more training devices mentioned in the present disclosure, and is configured to obtain data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one training device is included, the training devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the training devices may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has higher compatibility and can be connected with various types of servers through PCIE interfaces.
The present disclosure also provides a combined processing device, which includes the above machine learning arithmetic device, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
Referring to fig. 12, fig. 12 is a schematic diagram of a combined processing device according to an embodiment of the disclosure.
Other processing devices, as shown in FIG. 12, include one or more types of general-purpose or special-purpose processors, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transportation, and complete basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device obtains the required input data from the other processing devices and writes it into a storage device on the machine learning arithmetic device chip; it can obtain control instructions from the other processing devices and write them into a control cache on the machine learning arithmetic device chip; it can also read the data in the storage module of the machine learning arithmetic device and transmit it to the other processing devices.
Referring to fig. 13, fig. 13 is a schematic diagram of a combined processing device according to an embodiment of the disclosure.
As shown in fig. 13, the apparatus may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can be used as the SoC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, a chip including the above machine learning operation device or the combination processing device is also provided.
In some embodiments, the disclosure further provides a chip packaging structure, which includes the chip.
In some embodiments, the present disclosure further provides a board card including the chip package structure.
Referring to fig. 14 in conjunction with fig. 1 and fig. 4, fig. 14 is a schematic diagram of a board card according to an embodiment of the disclosure.
As shown in fig. 14, the board card may include, besides the chip 389, other supporting components, which include, but are not limited to: a memory device 390, an interface device 391 and a control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of storage units 393. Each group of storage units is connected to the chip through a bus. It is understood that each group of storage units may be a DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be transferred on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are adopted in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A DDR controller is provided in the chip for controlling the data transmission and data storage of each group of storage units.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted from the server to the chip through the standard PCIE interface, so as to implement data transfer. When a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the disclosure does not limit the specific form of that interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
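Both theoretical bandwidth figures (25600 MB/s for DDR4-3200 and 16000 MB/s for PCIE 3.0 x16) follow from simple arithmetic, as the short sketch below shows; it assumes only the stated transfer rates and data widths and ignores protocol overhead such as PCIE 128b/130b encoding.

```python
# DDR4-3200: 3200 MT/s, 64 data bits per transfer (the extra 8 bits carry ECC).
ddr4_bandwidth_mb_s = 3200 * 64 // 8      # = 25600 MB/s per group of storage units

# PCIE 3.0 x16: 8 GT/s per lane, 16 lanes, 1 bit per transfer per lane.
pcie3_x16_mb_s = 8 * 1000 * 16 // 8       # = 16000 MB/s before encoding overhead

print(ddr4_bandwidth_mb_s, pcie3_x16_mb_s)
```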
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). Since the chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits and may drive a plurality of loads, the chip can be in different working states such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores and/or the plurality of processing circuits in the chip.
In some embodiments, the present disclosure further provides an electronic device, which includes the above board card.
Electronic devices include data processing apparatus, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, tachographs, navigators, sensors, cameras, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Referring to fig. 15, fig. 15 is a flowchart illustrating a neural network training method supporting compression and decompression according to an embodiment of the present disclosure.
As shown in fig. 15, the method is applied to a neural network training device to perform training of a neural network, the neural network training device includes a parameter compression unit, a parameter storage unit, a parameter decompression unit, and an operation unit, and the method includes:
step S110, a parameter compression unit determines parameters to be compressed of the neural network according to received model data of the neural network, and compresses the parameters to be compressed by using an encoder to obtain semantic vectors corresponding to the neural network;
step S120, a parameter storage unit stores semantic vectors corresponding to the neural network, and sends the semantic vectors to the parameter decompression unit or the arithmetic unit when receiving a data reading instruction;
step S130, when the parameter decompressing unit receives the semantic vector, the parameter decompressing unit decompresses the semantic vector by using a decoder to obtain a decompressing parameter of the neural network, and sends the decompressing parameter to the arithmetic unit; and
step S140, the arithmetic unit trains the neural network according to the received semantic vector or the decompression parameter.
By the above method, the parameter to be compressed of the neural network is determined according to the received model data of the neural network, and the encoder compresses it to obtain the semantic vector corresponding to the neural network. The semantic vector is stored and, when a data reading instruction is received, sent to the parameter decompression unit or the operation unit. When the parameter decompression unit receives the semantic vector, the decoder decompresses it to obtain the decompression parameter of the neural network and sends the decompression parameter to the operation unit, and the operation unit performs training of the neural network using the received semantic vector or decompression parameter.
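To illustrate steps S110 to S140, the sketch below models the encoder/decoder pair as a simple linear projection that compresses a multidimensional parameter block into a low-dimensional semantic vector and reconstructs an approximation of it. This is only an assumed illustration: the class and method names are ours, the actual encoder structure is not specified here, and the reconstruction is lossy.

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterCodec:
    """Toy encoder/decoder: compress a flattened parameter block into a
    fixed-length semantic vector with a linear projection, and decompress
    with the pseudo-inverse projection."""

    def __init__(self, param_size, semantic_dim):
        self.encoder = rng.standard_normal((semantic_dim, param_size)) / np.sqrt(param_size)
        self.decoder = np.linalg.pinv(self.encoder)

    def compress(self, params):                    # step S110
        return self.encoder @ params.ravel()       # semantic vector

    def decompress(self, semantic_vector, shape):  # step S130
        return (self.decoder @ semantic_vector).reshape(shape)

if __name__ == "__main__":
    weights = rng.standard_normal((32, 32))
    codec = ParameterCodec(weights.size, semantic_dim=128)
    z = codec.compress(weights)                    # stored by the parameter storage unit (S120)
    w_hat = codec.decompress(z, weights.shape)     # handed to the operation unit (S140)
    print(z.shape, w_hat.shape)
```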
In one possible embodiment, the method further comprises:
the parameter compression unit judges whether the parameter to be compressed is sparse, and sends a sparse flag corresponding to the semantic vector to the parameter storage unit when the parameter to be compressed is sparse;
the parameter storage unit stores the sparse flag,
when receiving the data reading instruction, the parameter storage unit sends the semantic vector to the parameter decompression unit or the arithmetic unit, including:
when the data reading instruction is received and the parameter storage unit stores the sparse flag corresponding to the semantic vector, sending the semantic vector to the arithmetic unit.
In a possible implementation, the parameter storage unit, when receiving a data reading instruction, sends the semantic vector to the parameter decompression unit or the arithmetic unit, and further includes:
when the data reading instruction is received and the parameter storage unit does not store the sparse flag corresponding to the semantic vector, sending the semantic vector to the parameter decompression unit.
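The routing rule in the two implementations above reduces to a single check on the stored sparse flag, as the minimal sketch below illustrates (the function and argument names are ours).

```python
def route_semantic_vector(semantic_vector, sparse_flags, vector_id):
    """Parameter storage unit routing on a data reading instruction:
    if a sparse flag is stored for this semantic vector, it goes directly
    to the operation unit; otherwise it goes to the parameter decompression unit."""
    if vector_id in sparse_flags:
        return ("operation_unit", semantic_vector)
    return ("parameter_decompression_unit", semantic_vector)

# Example: vector 7 was marked sparse at compression time, vector 3 was not.
flags = {7}
print(route_semantic_vector([0.1, 0.9], flags, 7)[0])  # -> operation_unit
print(route_semantic_vector([0.4, 0.2], flags, 3)[0])  # -> parameter_decompression_unit
```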
In a possible embodiment, the neural network includes n layers, where n is an integer greater than or equal to 2, and the operation unit performs training of the neural network according to the received semantic vector or the decompression parameter, including the following steps (illustrated by the sketch after this list):
determining first-layer input data and first-layer weight group data according to the semantic vector or the decompression parameter, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
obtaining an nth output result gradient according to the nth output result, and obtaining an nth reverse operation instruction of nth reverse operation and nth input data and nth weight group data required by the nth reverse operation instruction according to the semantic vector or the decompression parameter;
dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction;
performing operations in the neural network in parallel according to a second data block to obtain an operation result, wherein the second data block is the data block received for processing and is associated with the divided first data block (the vertical data block and/or the horizontal data block);
processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
and taking the input data gradient of the nth layer as the output result gradient of the nth-1 layer, performing reverse operation of the n-1 layer to obtain a weight group gradient of the n-1 layer, and updating weight group data of a corresponding layer by applying the weight group gradient of the n-1 layer, wherein the weight group data comprises at least two weights.
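The scheme above is essentially layer-wise backpropagation in which the input data gradient of layer n becomes the output result gradient of layer n-1. The numpy sketch below walks through one such pass for a purely linear n-layer network; it is illustrative only, the division into vertical and horizontal data blocks and the parallel execution are omitted, and a squared-error loss is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(first_layer_input, weight_groups, target, lr=0.01):
    """One forward pass through n layers followed by the n..1 reverse pass,
    updating each layer's weight group with its weight group gradient."""
    # Forward operation: layer by layer, keeping each layer's input data.
    layer_inputs = [first_layer_input]
    for w in weight_groups:
        layer_inputs.append(layer_inputs[-1] @ w)
    nth_output = layer_inputs[-1]

    # nth output result gradient (squared-error loss assumed for illustration).
    output_grad = nth_output - target

    # Reverse operation: from layer n down to layer 1.
    for layer in reversed(range(len(weight_groups))):
        layer_input = layer_inputs[layer]
        weight_grad = layer_input.T @ output_grad          # layer weight group gradient
        input_grad = output_grad @ weight_groups[layer].T  # layer input data gradient
        weight_groups[layer] -= lr * weight_grad           # update the weight group data
        output_grad = input_grad   # becomes the (n-1)th output result gradient
    return nth_output

if __name__ == "__main__":
    x = rng.standard_normal((4, 8))
    ws = [rng.standard_normal((8, 8)) * 0.1, rng.standard_normal((8, 2)) * 0.1]
    y = rng.standard_normal((4, 2))
    for _ in range(3):
        train_step(x, ws, y)
```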
In a possible implementation, the apparatus further comprises a controller unit, wherein the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits, the method further comprising:
the controller unit acquires the decompression parameters, determines first-layer input data and first-layer weight group data according to the decompression parameters, executes n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations, and sends the nth output result to the main processing circuit;
the main processing circuit obtains an nth output result gradient according to the nth output result, and obtains an nth reverse operation instruction of nth reverse operation and nth input data and nth weight group data required by the nth reverse operation instruction according to the decompression parameter; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; sending the first data block to at least one slave processing circuit in a plurality of slave processing circuits connected with the main processing circuit according to the nth reverse operation instruction;
the plurality of slave processing circuits execute operation in a neural network in a parallel mode according to a second data block to obtain an operation result, and the operation result is transmitted to the master processing circuit through the slave processing circuit connected with the master processing circuit, wherein the second data block is determined by the slave processing circuit to receive the data block sent by the master processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit processes the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updates the nth layer weight group data by applying the nth layer weight group gradient;
the controller unit takes the input data gradient of the nth layer as the output result gradient of the (n-1) th layer, performs reverse operation of the (n-1) th layer to obtain the weight group gradient of the (n-1) th layer, and updates the weight group data of the corresponding layer by applying the weight group gradient of the (n-1) th layer, wherein the weight group data comprises at least two weights;
and the parameter compression unit takes the data generated by the controller unit, the master processing circuit and the slave processing circuits as data to be compressed, and compresses the data to be compressed by using an encoder to obtain a corresponding semantic vector.
In one possible implementation, the apparatus further includes a direct memory access unit, and the method further includes:
the direct memory access unit acquires the semantic vector from the storage unit or stores the semantic vector to the storage unit;
wherein the parameter storage unit includes: any combination of registers and caches.
In one possible implementation, the cache includes a neuron cache including an input neuron cache and an output neuron cache, the input neuron cache including an input neuron index cache and an input neuron gradient cache, and the output neuron cache including an output neuron index cache and an output neuron gradient cache, wherein the neuron cache in the cache stores data related to neurons in a neural network, including:
the neuron cache stores data related to the neurons in the neural network;
the input neuron cache stores data related to input neurons in a neural network, wherein the input neuron index cache stores input neuron indexes, and the input neuron gradient cache stores input neuron gradients in the inverse calculation process;
and the output neuron cache stores data related to the output neurons in the neural network, wherein the output neuron gradient cache stores output neuron gradients in the inverse calculation process.
In a possible embodiment, the cache includes a weight cache, the weight cache includes an input weight cache and an output weight cache, the input weight cache includes an input weight index cache and an input weight gradient cache, and the output weight cache includes an output weight index cache and an output weight gradient cache, wherein the weight cache in the cache stores data related to the weights in the neural network, including:
the weight cache stores data related to the weight of the neural network;
the input weight cache stores data related to the input weights in the neural network, wherein the input weight index cache stores the input weight index, and the input weight gradient cache stores the input weight gradient in the reverse calculation process;
and the output weight cache stores data related to the output weight in the neural network, wherein the output weight index cache stores an output weight index, and the output weight gradient cache stores an output weight gradient in the reverse calculation process.
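The cache hierarchy described in the last two implementations can be summarised as a nested data structure; the dataclass sketch below merely mirrors the naming above and is not a hardware description.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedCache:
    index: dict = field(default_factory=dict)     # e.g. input neuron index / weight index
    gradient: dict = field(default_factory=dict)  # gradients used in the reverse calculation
    data: dict = field(default_factory=dict)      # the neuron / weight data itself

@dataclass
class NeuronCache:
    input_neurons: IndexedCache = field(default_factory=IndexedCache)
    output_neurons: IndexedCache = field(default_factory=IndexedCache)

@dataclass
class WeightCache:
    input_weights: IndexedCache = field(default_factory=IndexedCache)
    output_weights: IndexedCache = field(default_factory=IndexedCache)

@dataclass
class ParameterCache:
    neurons: NeuronCache = field(default_factory=NeuronCache)
    weights: WeightCache = field(default_factory=WeightCache)

cache = ParameterCache()
cache.neurons.input_neurons.index["layer_1"] = [0, 3, 7]   # an input neuron index entry
```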
In one possible implementation, the decompression parameters include neurons, neuron gradients, weights, weight gradients.
In one possible embodiment, the controller unit comprises: an instruction storage unit, an instruction processing unit, a storage queue unit and a dependency processing unit, wherein,
the instruction storage unit stores a calculation instruction associated with the artificial neural network operation;
the instruction processing unit analyzes the calculation instruction to obtain a plurality of operation instructions;
the store queue unit stores an instruction queue, the instruction queue comprising: a plurality of operation instructions or calculation instructions to be executed in the front-to-back order of the queue;
the dependency relationship processing unit determines whether a first operation instruction and a zeroth operation instruction before the first operation instruction have an association relationship, if the first operation instruction and the zeroth operation instruction have the association relationship, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction is executed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;
the determining whether the first operation instruction and a zeroth operation instruction before the first operation instruction have an association relation or not comprises the following steps:
extracting a first storage address interval of required data in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required data in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relation, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relation.
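The association check is therefore an interval-overlap test on the storage address intervals touched by the two instructions; a direct sketch follows (the helper name and hexadecimal example addresses are ours).

```python
def has_dependency(first_interval, zeroth_interval):
    """Return True when the first operation instruction's storage address interval
    overlaps the zeroth instruction's interval, i.e. the instructions are associated
    and the first one must wait until the zeroth one has executed."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start <= zeroth_end and zeroth_start <= first_end

# Overlapping address ranges -> association, so the first instruction is buffered.
print(has_dependency((0x100, 0x1FF), (0x180, 0x2FF)))   # True
# Disjoint ranges -> no association, the first instruction may issue immediately.
print(has_dependency((0x100, 0x1FF), (0x300, 0x3FF)))   # False
```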
In a possible implementation, the arithmetic unit comprises: a tree module, the tree module comprising: a root port and a plurality of branch ports, the root port of the tree module is connected with the main processing circuit, the plurality of branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits, wherein,
and the tree module forwards the data blocks, the weight forward operation instructions and the reverse operation instructions between the main processing circuit and the plurality of slave processing circuits.
In a possible implementation, the arithmetic unit further comprises a branch processing circuit,
the main processing circuit distributes an input neuron into a plurality of data blocks, and sends at least one data block in the data blocks, a weight value and at least one operation instruction in a plurality of operation instructions to the branch processing circuit;
the branch processing circuit forwards data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits execute operation on the received data blocks and the weight according to the operation instruction to obtain intermediate results, and transmit the intermediate results to the branch processing circuit;
and the main processing circuit carries out subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the calculation instruction, and sends the result of the calculation instruction to the controller unit.
In one possible embodiment, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits among the plurality of slave processing circuits, and the K slave processing circuits are: the a slave processing circuits in row 1, the a slave processing circuits in row b, and the b slave processing circuits in column 1;
forwarding of data and instructions by the K slave processing circuits between the master processing circuit and a plurality of slave processing circuits;
the main processing circuit distributes an input data into a plurality of data blocks, and sends at least one data block in the data blocks and at least one operation instruction in a plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits converting data between the master processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmit the intermediate results to the K slave processing circuits;
and the main processing circuit performs subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a result of the calculation instruction, and sends the result of the calculation instruction to the controller unit.
It should be noted that the above neural network training method is a method item corresponding to the above neural network training device, and for a specific introduction, reference is made to the description of the device item before, which is not repeated herein.
It is noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series of acts or a combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, indirect coupling or communication connection between devices or units, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, and the memory may include: a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like. The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements over the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (27)

1. An apparatus for neural network training supporting compression and decompression, the apparatus being configured to perform training of a neural network, the apparatus comprising:
the parameter compression unit is used for determining parameters to be compressed of the neural network according to received model data of the neural network, compressing the parameters to be compressed by using an encoder, compressing multidimensional data in the parameters to be compressed into low-dimensional data and obtaining semantic vectors corresponding to the neural network;
the parameter storage unit is connected to the parameter compression unit and used for storing semantic vectors corresponding to the neural network and sending the semantic vectors to the parameter decompression unit or the arithmetic unit when receiving a data reading instruction;
the parameter decompression unit is connected to the parameter storage unit and used for decompressing the semantic vector by using a decoder when receiving the semantic vector to obtain a decompression parameter of the neural network and sending the decompression parameter to the operation unit; and
the arithmetic unit is respectively connected to the parameter storage unit and the parameter decompression unit and is used for training the neural network on the received semantic vector or the decompression parameter;
the parameter compression unit is also used for judging whether the parameter to be compressed is sparse, and sending a sparse flag corresponding to the semantic vector to the parameter storage unit when the parameter to be compressed is sparse;
the parameter storage unit is also used for storing the sparse flag,
when receiving the data reading instruction, the parameter storage unit sends the semantic vector to the parameter decompression unit or the arithmetic unit, including:
and when the data reading instruction is received and the parameter storage unit stores the sparse flag corresponding to the semantic vector, sending the semantic vector to the operation unit.
2. The apparatus according to claim 1, wherein the parameter storage unit sends the semantic vector to the parameter decompression unit or the arithmetic unit when receiving a data read instruction, further comprising:
and when the data reading instruction is received and the parameter storage unit does not store the sparse flag corresponding to the semantic vector, sending the semantic vector to the parameter decompression unit.
3. The apparatus of claim 1, wherein the neural network comprises n layers, n being an integer greater than or equal to 2, the training comprises forward training and backward training, and the forward training or the backward training of the neural network on the received semantic vector or the decompression parameter comprises:
determining first-layer input data and first-layer weight group data according to the semantic vector or the decompression parameter, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
obtaining an nth output result gradient according to the nth output result, and obtaining an nth reverse operation instruction of nth reverse operation and nth input data and nth weight group data required by the nth reverse operation instruction according to the semantic vector or the decompression parameter;
dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction;
performing operation in a neural network in a parallel mode according to the vertical data block and/or the horizontal data block to obtain an operation result;
processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
and taking the input data gradient of the nth layer as the output result gradient of the nth-1 layer, performing reverse operation of the n-1 layer to obtain a weight group gradient of the n-1 layer, and updating weight group data of a corresponding layer by applying the weight group gradient of the n-1 layer, wherein the weight group data comprises at least two weights.
4. The apparatus of claim 1, wherein the neural network comprises n layers, n being an integer greater than or equal to 2, the apparatus further comprising a controller unit, wherein the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the controller unit is electrically connected to the parameter decompressing unit and is used for acquiring the decompressing parameters, determining first-layer input data and first-layer weight group data according to the decompressing parameters, executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations, and sending the nth output result to the main processing circuit;
the main processing circuit is electrically connected to the controller unit and is used for obtaining an nth output result gradient according to the nth output result, and obtaining an nth reverse operation instruction of an nth layer of reverse operation and nth layer of input data and nth layer of weight group data required by the nth reverse operation instruction according to the decompression parameter; dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction; and sending the vertical data block and/or the horizontal data block to at least one slave processing circuit in a plurality of slave processing circuits connected with the main processing circuit according to the nth reverse operation instruction;
the plurality of slave processing circuits are electrically connected with the main processing circuit and used for executing the operation in the neural network in a parallel mode according to the vertical data blocks and/or the horizontal data blocks to obtain operation results and transmitting the operation results to the main processing circuit through the slave processing circuit connected with the main processing circuit,
the main processing circuit is also used for processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
the controller unit is further configured to take the nth layer input data gradient as the output result gradient of the (n-1)th layer, perform the reverse operation of the (n-1)th layer to obtain an (n-1)th layer weight group gradient, and update the weight group data of the corresponding layer by applying the (n-1)th layer weight group gradient, wherein the weight group data comprises at least two weights;
the parameter compression unit is electrically connected to the controller unit, the master processing circuit and the slave processing circuit, and is further configured to use data generated by the controller unit, the master processing circuit and the slave processing circuit as data to be compressed, and compress the data to be compressed by using an encoder to obtain a corresponding semantic vector.
5. The apparatus of claim 4, further comprising:
the storage unit is electrically connected with the parameter compression unit and the parameter decompression unit and is used for storing the semantic vector;
the direct memory access unit is electrically connected with the storage unit and is used for acquiring the semantic vector from the storage unit or storing the semantic vector to the storage unit;
wherein the storage unit includes: register, cache.
6. The apparatus of claim 5, wherein the cache comprises:
the neuron cache is used for storing data related to neurons in the neural network;
the neuron cache includes:
the input neuron cache is used for storing data related to input neurons in a neural network, and further comprises an input neuron index cache and an input neuron gradient cache, wherein the input neuron index cache is used for storing input neuron indexes, and the input neuron gradient cache is used for storing input neuron gradients in the inverse calculation process;
the output neuron cache is used for storing data related to output neurons in a neural network, and further comprises an output neuron index cache and an output neuron gradient cache, wherein the output neuron index cache is used for storing output neuron indexes, and the output neuron gradient cache is used for storing output neuron gradients in a reverse calculation process.
7. The apparatus of claim 5, wherein the caching comprises:
the weight cache is used for storing data related to the weight of the neural network;
the weight caching comprises:
the input weight cache is used for storing data related to input weights in a neural network, and further comprises an input weight index cache and an input weight gradient cache, wherein the input weight index cache is used for storing the input weight index, and the input weight gradient cache is used for storing the input weight gradient in the reverse calculation process;
the output weight cache is used for storing data related to the output weight in the neural network, and further comprises an output weight index cache and an output weight gradient cache, wherein the output weight index cache is used for storing the output weight index, and the output weight gradient cache is used for storing the output weight gradient in the reverse calculation process.
8. The apparatus of claim 4, wherein the decompression parameters comprise neurons, neuron gradients, weights, weight gradients.
9. The apparatus of claim 4, wherein the controller unit comprises: the device comprises an instruction storage unit, an instruction processing unit, a storage queue unit and a dependency relationship processing unit;
the instruction storage unit is used for storing a calculation instruction associated with the neural network operation;
the instruction processing unit is used for analyzing the calculation instruction to obtain a plurality of operation instructions;
the storage queue unit is used for storing an instruction queue, and the instruction queue comprises: storing a plurality of operation instructions or calculation instructions to be executed according to the sequence of the queue;
the dependency relationship processing unit is configured to determine whether an association relationship exists between a first operation instruction and a zeroth operation instruction before the first operation instruction, if the association relationship exists between the first operation instruction and the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and after the zeroth operation instruction is executed, extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit;
the determining whether the first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction comprises:
extracting a first storage address interval of required data in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required data in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relation, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relation.
10. The apparatus according to claim 4, wherein the arithmetic unit comprises: a tree module, the tree module comprising: the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module is used for forwarding data blocks, weights, forward operation instructions and backward operation instructions between the main processing circuit and the plurality of slave processing circuits.
11. The apparatus of claim 8, wherein the arithmetic unit further comprises branch processing circuitry,
the main processing circuit is used for distributing an input neuron into a plurality of data blocks and sending at least one data block in the data blocks, a weight value and at least one operation instruction in a plurality of operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for executing operation on the received data blocks and the weight according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;
and the main processing circuit is used for carrying out subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the operation instruction, and sending the result of the operation instruction to the controller unit.
12. The apparatus of claim 8, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits among the plurality of slave processing circuits, and the K slave processing circuits are: the a slave processing circuits in row 1, the a slave processing circuits in row b, and the b slave processing circuits in column 1;
the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits;
the main processing circuit is used for distributing an input data into a plurality of data blocks and sending at least one data block in the data blocks and at least one operation instruction in a plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits are used for converting data between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks according to the operation instruction to obtain intermediate results and transmitting the intermediate results to the K slave processing circuits;
and the main processing circuit is used for carrying out subsequent processing on the intermediate results sent by the K slave processing circuits to obtain the result of the operation instruction, and sending the result of the operation instruction to the controller unit.
13. A neural network chip, comprising machine learning computing means or combinatorial processing means, wherein,
the machine learning operation device comprises one or more neural network training devices supporting compression and decompression as claimed in any one of claims 1 to 12, and is used for acquiring input data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to other processing devices through an I/O interface; when the machine learning operation device comprises a plurality of training devices, the plurality of training devices can be connected through a specific structure and transmit data; the training devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus so as to support larger-scale machine learning operations; the plurality of training devices share the same control system or have their own respective control systems; the training devices share a memory or have their own respective memories; and the interconnection mode of the plurality of training devices is any interconnection topology;
the combined processing device comprises the machine learning arithmetic device, a universal interconnection interface and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete the calculation operation designated by the user;
the combination processing apparatus further includes: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
14. An electronic device, characterized in that it comprises a chip according to claim 13.
15. A board, the board comprising: a memory device, an interface apparatus and a control device and the neural network chip of claim 13;
wherein the neural network chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip, wherein,
the memory device includes: the multi-group memory cell, each group the memory cell with the chip passes through bus connection, the memory cell is: DDR SDRAM;
the chip includes: the DDR controller is used for controlling data transmission and data storage of each memory unit;
the interface device is as follows: a standard PCIE interface.
16. A neural network training method supporting compression and decompression is applied to a neural network training device to perform training of a neural network, the neural network training device comprises a parameter compression unit, a parameter storage unit, a parameter decompression unit and an operation unit, and the method comprises the following steps:
the parameter compression unit determines parameters to be compressed of the neural network according to received model data of the neural network, compresses the parameters to be compressed by using an encoder, compresses multidimensional data in the parameters to be compressed into low-dimensional data, and obtains semantic vectors corresponding to the neural network;
the parameter storage unit stores semantic vectors corresponding to the neural network, and sends the semantic vectors to the parameter decompression unit or the arithmetic unit when receiving a data reading instruction;
when receiving the semantic vector, the parameter decompressing unit decompresses the semantic vector by using a decoder to obtain a decompressing parameter of the neural network and sends the decompressing parameter to the arithmetic unit; and
the arithmetic unit trains the neural network on the received semantic vector or the decompression parameter;
the parameter compression unit judges whether the parameter to be compressed is sparse, and sends a sparse flag corresponding to the semantic vector to the parameter storage unit when the parameter to be compressed is sparse;
the parameter storage unit stores the sparse flag,
when receiving the data reading instruction, the parameter storage unit sends the semantic vector to the parameter decompression unit or the arithmetic unit, including:
when the data reading instruction is received and the parameter storage unit stores the sparse flag corresponding to the semantic vector, sending the semantic vector to the operation unit.
17. The method of claim 16, wherein the parameter storage unit sends the semantic vector to the parameter decompression unit or arithmetic unit when receiving a data read instruction, further comprising:
and when the data reading instruction is received and the parameter storage unit does not store the sparse flag corresponding to the semantic vector, sending the semantic vector to the parameter decompression unit.
18. The method of claim 16, wherein the neural network comprises n layers, n is an integer greater than or equal to 2, and the operation unit trains the neural network according to the received semantic vector or the decompression parameter, comprising:
determining first-layer input data and first-layer weight group data according to the semantic vector or the decompression parameter, and executing n layers of forward operations of a neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations;
obtaining an nth output result gradient according to the nth output result, and obtaining an nth reverse operation instruction of nth reverse operation and nth input data and nth weight group data required by the nth reverse operation instruction according to the semantic vector or the decompression parameter;
dividing the nth output result gradient, the nth layer of input data and the nth layer of weight group data into a vertical data block and a horizontal data block according to the nth reverse operation instruction;
performing operation in a neural network in a parallel mode according to the vertical data block and/or the horizontal data block to obtain an operation result; processing the operation result to obtain the nth layer weight group gradient and the nth layer input data gradient, and updating the nth layer weight group data by applying the nth layer weight group gradient;
and taking the input data gradient of the nth layer as the output result gradient of the (n-1) th layer, performing reverse operation on the (n-1) th layer to obtain the weight group gradient of the (n-1) th layer, and updating weight group data of a corresponding layer by applying the weight group gradient of the (n-1) th layer, wherein the weight group data comprises at least two weights.
19. The method of claim 16, wherein the neural network comprises n layers, n being an integer greater than or equal to 2, the device further comprises a controller unit, and the arithmetic unit comprises a main processing circuit and a plurality of slave processing circuits, wherein:
the controller unit acquires the decompression parameters, determines first-layer input data and first-layer weight group data according to the decompression parameters, executes the n layers of forward operations of the neural network on the first-layer input data and the first-layer weight group data to obtain an nth output result of the forward operations, and sends the nth output result to the main processing circuit;
the main processing circuit obtains an nth output result gradient according to the nth output result, and obtains an nth-layer reverse operation instruction together with the nth-layer input data and the nth-layer weight group data required by the instruction according to the decompression parameter; divides the nth output result gradient, the nth-layer input data and the nth-layer weight group data into a vertical data block and a horizontal data block according to the nth-layer reverse operation instruction; and sends the vertical data block and/or the horizontal data block to at least one of the plurality of slave processing circuits connected with the main processing circuit according to the nth-layer reverse operation instruction; the plurality of slave processing circuits perform neural network operations in parallel on the vertical data block and/or the horizontal data block to obtain operation results, and transmit the operation results to the main processing circuit through the slave processing circuits connected with the main processing circuit;
the main processing circuit processes the operation results to obtain an nth-layer weight group gradient and an nth-layer input data gradient, and updates the nth-layer weight group data by applying the nth-layer weight group gradient;
the controller unit takes the nth-layer input data gradient as the (n-1)th-layer output result gradient, performs the (n-1)th-layer reverse operation to obtain the (n-1)th-layer weight group gradient, and updates the weight group data of the corresponding layer by applying the (n-1)th-layer weight group gradient, wherein the weight group data comprises at least two weights;
and the parameter compression unit takes the data generated by the controller unit, the main processing circuit and the slave processing circuits as data to be compressed, and compresses the data to be compressed by using an encoder to obtain a corresponding semantic vector.
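A minimal software analogue of the main/slave split in claim 19, assuming the weight group gradient is accumulated from horizontal (row-wise) blocks handled by the slave circuits; the thread workers and the helper name parallel_weight_gradient are illustrative assumptions, not the claimed circuits:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_weight_gradient(x, grad, num_slaves=4):
    """Split the nth-layer input data and the nth output result gradient into
    horizontal (row) blocks, let each 'slave' compute a partial product, and
    let the 'main' circuit accumulate the partials into the weight group gradient."""
    x_blocks = np.array_split(x, num_slaves, axis=0)     # horizontal data blocks
    g_blocks = np.array_split(grad, num_slaves, axis=0)

    def slave_op(args):
        xb, gb = args
        return xb.T @ gb                                  # slave's partial result

    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(slave_op, zip(x_blocks, g_blocks)))

    return sum(partials)                                  # main circuit combines
```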
20. The method of claim 16, wherein the device further comprises a storage unit and a direct memory access unit, the storage unit being electrically connected to the parameter compression unit and the parameter decompression unit for storing the semantic vector, and the method further comprises: the direct memory access unit acquiring the semantic vector from the storage unit or storing the semantic vector into the storage unit;
wherein the parameter storage unit comprises a register and a cache.
21. The method of claim 20, wherein the cache comprises a neuron cache, the neuron cache comprises an input neuron cache and an output neuron cache, the input neuron cache comprises an input neuron index cache and an input neuron gradient cache, and the output neuron cache comprises an output neuron index cache and an output neuron gradient cache, wherein:
the neuron cache stores data related to the neurons in the neural network;
the input neuron cache stores data related to the input neurons in the neural network, wherein the input neuron index cache stores input neuron indexes, and the input neuron gradient cache stores input neuron gradients in the reverse calculation process;
and the output neuron cache stores data related to the output neurons in the neural network, wherein the output neuron gradient cache stores output neuron gradients in the reverse calculation process.
22. The method of claim 20, wherein the cache comprises a weight cache, the weight cache comprises an input weight cache and an output weight cache, the input weight cache comprises an input weight index cache and an input weight gradient cache, and the output weight cache comprises an output weight index cache and an output weight gradient cache, wherein:
the weight cache stores data related to the weights of the neural network;
the input weight cache stores data related to the input weights in the neural network, wherein the input weight index cache stores input weight indexes, and the input weight gradient cache stores input weight gradients in the reverse calculation process;
and the output weight cache stores data related to the output weights in the neural network, wherein the output weight index cache stores output weight indexes, and the output weight gradient cache stores output weight gradients in the reverse calculation process.
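The cache hierarchy of claims 21 and 22 can be pictured as a nested data structure; the Python dataclasses below are only an illustrative model of the index and gradient sub-caches, not a description of the hardware:

```python
from dataclasses import dataclass, field

@dataclass
class SubCache:
    index: dict = field(default_factory=dict)     # e.g. input neuron index cache
    gradient: dict = field(default_factory=dict)  # e.g. input neuron gradient cache
    data: dict = field(default_factory=dict)      # remaining neuron/weight data

@dataclass
class NeuronCache:
    input: SubCache = field(default_factory=SubCache)   # input neuron cache
    output: SubCache = field(default_factory=SubCache)  # output neuron cache

@dataclass
class WeightCache:
    input: SubCache = field(default_factory=SubCache)   # input weight cache
    output: SubCache = field(default_factory=SubCache)  # output weight cache

@dataclass
class ParameterCache:
    neurons: NeuronCache = field(default_factory=NeuronCache)
    weights: WeightCache = field(default_factory=WeightCache)
```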
23. The method of claim 16, wherein the decompression parameters comprise neurons, neuron gradients, weights and weight gradients.
24. The method of claim 19, wherein the controller unit comprises an instruction storage unit, an instruction processing unit, a storage queue unit and a dependency relationship processing unit, wherein:
the instruction storage unit stores a calculation instruction associated with the neural network operation;
the instruction processing unit analyzes the calculation instruction to obtain a plurality of operation instructions;
the storage queue unit stores an instruction queue, the instruction queue comprising a plurality of operation instructions or calculation instructions to be executed in the order of the queue;
the dependency relationship processing unit determines whether a first operation instruction has an association relationship with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction and the zeroth operation instruction have an association relationship, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction has finished executing, the first operation instruction is extracted from the instruction storage unit and transmitted to the arithmetic unit;
wherein determining whether the first operation instruction has an association relationship with the zeroth operation instruction preceding the first operation instruction comprises:
extracting a first storage address interval of the data required by the first operation instruction according to the first operation instruction, and extracting a zeroth storage address interval of the data required by the zeroth operation instruction according to the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping area, determining that the first operation instruction and the zeroth operation instruction have an association relationship, and if the first storage address interval and the zeroth storage address interval do not have an overlapping area, determining that the first operation instruction and the zeroth operation instruction do not have an association relationship.
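The association-relationship test in claim 24 reduces to an address-interval overlap check; a minimal sketch follows, in which the interval representation and the has_dependency name are assumptions:

```python
def has_dependency(first_interval, zeroth_interval):
    """True if the first operation instruction must wait for the zeroth one,
    i.e. their storage address intervals have an overlapping area."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start <= zeroth_end and zeroth_start <= first_end

# Example: [0x100, 0x1FF] overlaps [0x180, 0x2FF], so the first instruction is
# cached until the zeroth instruction finishes executing.
assert has_dependency((0x100, 0x1FF), (0x180, 0x2FF))
assert not has_dependency((0x100, 0x1FF), (0x200, 0x2FF))
```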
25. The method of claim 19, wherein the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports, wherein the root port of the tree module is connected with the main processing circuit, and each of the plurality of branch ports of the tree module is connected with one of the plurality of slave processing circuits, wherein
the tree module forwards data blocks, weights, forward operation instructions and reverse operation instructions between the main processing circuit and the plurality of slave processing circuits.
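Claim 25 only requires the tree module to forward data between the main and slave processing circuits; one common arrangement, assumed here purely for illustration, is a binary reduction tree that combines the slave results pairwise on the way back to the root port:

```python
def tree_gather(partial_results):
    """Pairwise reduction up an assumed binary tree: each branch node adds the
    results of its two children until a single value reaches the root port."""
    level = list(partial_results)
    while len(level) > 1:
        if len(level) % 2:          # odd node count: pad so every node has a pair
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# tree_gather([1, 2, 3, 4, 5]) == 15
```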
26. The method of claim 19, wherein the arithmetic unit further comprises a branch processing circuit, wherein
the main processing circuit divides an input neuron into a plurality of data blocks, and sends at least one of the data blocks, a weight and at least one of a plurality of operation instructions to the branch processing circuit;
the branch processing circuit forwards the data blocks, the weight and the operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform operations on the received data blocks and the weight according to the operation instructions to obtain intermediate results, and transmit the intermediate results to the branch processing circuit;
and the main processing circuit performs subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction, and sends the result of the operation instruction to the controller unit.
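A software sketch of the data flow in claim 26, with a plain function standing in for the branch processing circuit; the block split, the matrix product and all names are illustrative assumptions:

```python
import numpy as np

def branch_forward(data_blocks, weight, op):
    """The 'branch processing circuit': forwards each data block, the weight and
    the operation to a slave circuit and collects the intermediate results."""
    return [op(block, weight) for block in data_blocks]

def run_instruction(input_neurons, weight, num_slaves=4):
    blocks = np.array_split(input_neurons, num_slaves, axis=0)  # main distributes blocks
    intermediates = branch_forward(blocks, weight, lambda b, w: b @ w)
    return np.concatenate(intermediates, axis=0)                # main: subsequent processing
```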
27. The method of claim 19, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, the main processing circuit is connected with K slave processing circuits of the plurality of slave processing circuits, and the K slave processing circuits are: the a slave processing circuits in the 1st row, the a slave processing circuits in the bth row, and the b slave processing circuits in the 1st column;
the K slave processing circuits forward data and instructions between the main processing circuit and the remaining slave processing circuits;
the main processing circuit divides an item of input data into a plurality of data blocks, and sends at least one of the data blocks and at least one of a plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits convert data between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and transmit the intermediate results to the K slave processing circuits;
and the main processing circuit performs subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a result of the operation instruction, and sends the result of the operation instruction to the controller unit.
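For claim 27, the K directly connected slave processing circuits sit on three edges of the array; the helper below simply enumerates those positions for an assumed b-row by a-column grid (the zero-based indexing convention and function name are assumptions):

```python
def k_slave_positions(a, b):
    """(row, col) indices of the K slave processing circuits wired to the main
    processing circuit: the a circuits of the 1st row, the a circuits of the
    bth row, and the b circuits of the 1st column (corners counted once)."""
    positions = set()
    positions.update((0, col) for col in range(a))       # 1st row
    positions.update((b - 1, col) for col in range(a))   # bth row
    positions.update((row, 0) for row in range(b))       # 1st column
    return sorted(positions)

# For b = 4 rows and a = 5 columns: K = 5 + 5 + 4 - 2 = 12 edge circuits.
```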
CN201811074120.0A 2018-09-14 2018-09-14 Training device and method Active CN110909870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811074120.0A CN110909870B (en) 2018-09-14 2018-09-14 Training device and method

Publications (2)

Publication Number Publication Date
CN110909870A CN110909870A (en) 2020-03-24
CN110909870B true CN110909870B (en) 2022-12-09

Family

ID=69812223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811074120.0A Active CN110909870B (en) 2018-09-14 2018-09-14 Training device and method

Country Status (1)

Country Link
CN (1) CN110909870B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651207B (en) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 Neural network model operation chip, method, device, equipment and medium
CN112261023A (en) * 2020-10-15 2021-01-22 苏州浪潮智能科技有限公司 Data transmission method and device of convolutional neural network
CN112819145A (en) * 2021-02-26 2021-05-18 上海阵量智能科技有限公司 Chip, neural network training system, memory management method, device and equipment
CN113326927B (en) * 2021-08-03 2022-04-22 北京壁仞科技开发有限公司 Method and device for optimizing operation of neural network and computer equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5831298B2 (en) * 2012-03-06 2015-12-09 富士通株式会社 Program, information processing apparatus, and index generation method
US10089576B2 (en) * 2015-07-28 2018-10-02 Microsoft Technology Licensing, Llc Representation learning using multi-task deep neural networks
JP6680126B2 (en) * 2016-07-25 2020-04-15 富士通株式会社 Encoding program, encoding device, encoding method, and search method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103649905A (en) * 2011-03-10 2014-03-19 特克斯特怀茨有限责任公司 Method and system for unified information representation and applications thereof
WO2016186564A1 (en) * 2015-05-21 2016-11-24 Zeropoint Technologies Ab Methods, devices and systems for semantic-value data compression and decompression
WO2017177901A1 (en) * 2016-04-12 2017-10-19 芋头科技(杭州)有限公司 Semantic matching method and smart device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on an Algorithm for Generating Image Semantics Based on Neural Networks; Wu Xiaoqin et al.; Computer Engineering and Applications; 2007-11-01 (Issue 31); full text *
Fractal Image Compression and Decompression Based on Neural Networks; Li Mingliang et al.; Journal of Henan University of Science and Technology (Natural Science Edition); 2006-10-30 (Issue 05); full text *

Also Published As

Publication number Publication date
CN110909870A (en) 2020-03-24

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN110909870B (en) Training device and method
CN109522052B (en) Computing device and board card
CN110163363B (en) Computing device and method
WO2019157812A1 (en) Computing device and method
CN111353591A (en) Computing device and related product
CN109711540B (en) Computing device and board card
CN111930681B (en) Computing device and related product
CN111488963B (en) Neural network computing device and method
CN110059797B (en) Computing device and related product
CN109711538B (en) Operation method, device and related product
CN109740729B (en) Operation method, device and related product
CN111368967A (en) Neural network computing device and method
CN111382853B (en) Data processing device, method, chip and electronic equipment
CN111382852B (en) Data processing device, method, chip and electronic equipment
CN111382856B (en) Data processing device, method, chip and electronic equipment
CN111368987B (en) Neural network computing device and method
CN111198714B (en) Retraining method and related product
CN113238976A (en) Cache controller, integrated circuit device and board card
CN111047024B (en) Computing device and related product
CN113238975A (en) Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN111382848A (en) Computing device and related product
CN111367567A (en) Neural network computing device and method
CN111381878A (en) Data processing device, method, chip and electronic equipment
CN111738428A (en) Computing device, method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant