WO2022095675A1 - Neural network sparsification apparatus and method and related product - Google Patents

Neural network sparsification apparatus and method and related product Download PDF

Info

Publication number
WO2022095675A1
Authority
WO
WIPO (PCT)
Prior art keywords
mask
neural network
tensor
network parameters
parameter
Prior art date
Application number
PCT/CN2021/123879
Other languages
French (fr)
Chinese (zh)
Inventor
高钰峰
朱时兵
刘少礼
张曦珊
何得园
Original Assignee
安徽寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安徽寒武纪信息科技有限公司 filed Critical 安徽寒武纪信息科技有限公司
Priority to US18/003,821 priority Critical patent/US20230259780A1/en
Publication of WO2022095675A1 publication Critical patent/WO2022095675A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present disclosure relates generally to the field of processors. More specifically, the present disclosure relates to a method, device, chip, board and readable storage medium for sparse training a neural network model by a data processing device.
  • the network parameter sparsification is to reduce the redundant components in the larger network by appropriate methods, so as to reduce the network's demand for computation and storage space.
  • although the existing fine-grained parameter sparsification methods perform well in model accuracy, they are not friendly to hardware memory access; that is, on-chip and off-chip input/output has high overhead and low performance.
  • although structured sparsification based on channels and convolution kernels improves hardware performance, the loss of model accuracy is relatively large.
  • most of the existing sparsification algorithms are offline fine-tuning methods; that is, a pre-trained model is sparsified and then fine-tuned.
  • the offline fine-tuning method has many restrictions and cannot be used during model training, where more substantial performance gains could be obtained.
  • the solution of the present disclosure provides an apparatus, board, method and readable storage medium for sparse training of a neural network model.
  • the present disclosure discloses a method for sparse training of a neural network model performed by a data processing device, comprising: in forward propagation, performing sparse processing on at least the neural network parameters based on a mask tensor to calculate the value of the loss function; in backpropagation, calculating neuron gradients and neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
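The three steps of the claimed method can be sketched for a single linear layer. Everything here (the layer shape, the squared-error loss, the function name `sparse_training_step`) is an illustrative assumption, not the disclosed implementation:

```python
import numpy as np

def sparse_training_step(w, x, y, mask, lr=0.1):
    """One iteration of the claimed method, sketched for one linear layer.

    Forward: sparsify the parameters with the mask tensor, compute the loss.
    Backward: compute neuron and parameter gradients from the loss.
    Update: apply the parameter gradient to the (dense) parameters.
    """
    w_sparse = w * mask                  # sparse processing based on the mask tensor
    y_pred = x @ w_sparse                # forward propagation
    loss = 0.5 * np.sum((y_pred - y) ** 2)

    grad_out = y_pred - y                # dL/dy_pred
    grad_x = grad_out @ w_sparse.T       # neuron gradient (propagated backward)
    grad_w = x.T @ grad_out              # neural network parameter gradient
    w_new = w - lr * grad_w              # update the parameters with their gradient
    return w_new, loss, grad_x

# toy shapes: 4 input channels, of which the mask keeps the first 2
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
mask = np.array([[1], [1], [0], [0]]) * np.ones((4, 3))
x = rng.standard_normal((2, 4))
y = rng.standard_normal((2, 3))
w_new, loss, grad_x = sparse_training_step(w, x, y, mask)
```

Note that the update is applied to the dense parameters, so masked-out positions can still evolve across iterations, which matches training-time (rather than offline fine-tuning) sparsification.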
  • the present disclosure provides a computer-readable storage medium on which computer program code for sparse training of a neural network model is stored; when the computer program code is executed by a processing device, it performs the method of any embodiment of the aforementioned first aspect.
  • the present disclosure provides a data processing apparatus comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein: the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparse training on a neural network model; the storage circuit is configured to store information including at least neural network parameters and mask tensors; and the arithmetic circuit is configured to perform the following operations under the control of the control circuit: in forward propagation, sparsifying at least the neural network parameters based on the mask tensor to calculate the value of the loss function; in backpropagation, calculating the neuron gradients and the neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
  • the present disclosure provides a chip including the data processing apparatus of any embodiment of the foregoing third aspect.
  • the present disclosure provides a board including the chip of any embodiment of the foregoing fourth aspect.
  • the sparsification scheme can support sparsification in the forward propagation process of training, such as sparsification of the input channel dimension, or simultaneous sparsification of the input channel and output channel dimensions.
  • when forward propagation performs simultaneous sparsification of the input channel and output channel dimensions, simultaneous sparsification of both dimensions may also be supported in backpropagation, thereby further optimizing performance.
  • the sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can use different structured sparse data flow structures to perform related operations to obtain optimized operation and IO performance.
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure
  • FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure
  • FIG. 7 illustrates a method performed in an iterative process according to an embodiment of the present disclosure
  • FIG. 8A illustrates a masking process for an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure
  • FIG. 8B illustrates the masking process of an exemplary two-dimensional mask tensor according to an embodiment of the present disclosure
  • FIG. 9 is a schematic diagram illustrating an exemplary mask vector update
  • FIG. 10 is a schematic diagram illustrating an exemplary sum-of-product calculation process
  • FIG. 11 is a flowchart illustrating a sparse training method according to another embodiment of the present disclosure.
  • FIG. 12 is a flowchart illustrating a sparse training method entering a mask fixing stage according to another embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram illustrating several embodiments of the present disclosure when the neural network model is sparsely trained.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices.
  • the combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage capacity and computing capacity of the platform.
  • the board 10 in this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete a user-specified operation.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write the input data into the storage device on-chip of the computing device 201 .
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and the like.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • such processors include but are not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store the data to be processed; it is a DDR memory, typically 16 GB or larger in size, used to save the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • the single-core computing device 301 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the single-core computing device 301 includes three modules: a control module 31 , an arithmetic module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks; it comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 312 decodes the acquired instruction, and sends the decoding result to the operation module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, that is, matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store the convolution kernel of the deep learning network, that is, weights;
  • the DMA 333 is connected to the DRAM 204 through the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores.
  • the multi-core computing device 41 adopts a layered structure design: it is a system-on-chip that includes at least one cluster, and each cluster includes multiple processor cores. In other words, the multi-core computing device 41 is organized as a hierarchy of system-on-chip, cluster, and processor core.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnect module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to perform tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401 , the peripheral communication module 402 and the multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • the plurality of clusters 405 are the computing cores of the multi-core computing device 41; four are exemplarily shown in the figure. With the development of hardware, the multi-core computing device 41 of the present disclosure may include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
  • each cluster 405 includes multiple processor cores (IPU cores) 406 and one memory core (MEM core) 407 .
  • the processor cores 406 are exemplarily shown as four in the figure, and the present disclosure does not limit the number of the processor cores 406 . Its internal structure is shown in Figure 5.
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and also includes three major modules: a control module 51 , an arithmetic module 52 and a storage module 53 .
  • the functions and structures of the control module 51 , the arithmetic module 52 and the storage module 53 are substantially the same as those of the control module 31 , the arithmetic module 32 and the storage module 33 , and will not be described again.
  • the storage module 53 includes an input/output direct memory access (IODMA) 533 and a move direct memory access (MVDMA) 534.
  • the IODMA 533 controls the memory access of the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control the memory access of the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
  • the storage core 407 is mainly used for storage and communication, that is, to store the data shared between the processor cores 406 or intermediate results, and to carry out the communication between the cluster 405 and the DRAM 204, the communication between clusters 405, the communication among the processor cores 406, and so on.
  • the memory core 407 has scalar operation capability for performing scalar operations.
  • the storage core 407 includes an SRAM 408 , a broadcast bus 409 , a cluster direct memory access (CDMA) 410 and a global direct memory access (GDMA) 411 .
  • the SRAM 408 assumes the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be fetched from the DRAM 204 by each processor core 406 separately, but is relayed among the processor cores 406 through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, so as to improve the communication efficiency between the cores and greatly reduce the on-chip and off-chip I/O accesses.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are used to perform the communication between the processor cores 406, the communication between the clusters 405 and the data transmission between the clusters 405 and the DRAM 204, respectively. They will be explained separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405.
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • unicast refers to point-to-point data transmission (e.g., from a single processor core to a single processor core); multicast is a communication method that transmits a piece of data from the SRAM 408 to specific processor cores 406; and broadcast, which transmits a copy of the data from the SRAM 408 to all processor cores 406, is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 within the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control the memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 431 or the WRAM 432 can be implemented through two channels.
  • the first channel is to directly connect the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then between the SRAM 408 and the NRAM 431 or WRAM 432 through the MVDMA 534.
  • a data transmission channel can be selected according to its own hardware conditions.
  • GDMA 411 and the functionality of IODMA 533 may be integrated in the same component.
  • GDMA 411 and IODMA 533 are regarded as different components.
  • the function of GDMA 411, the function of IODMA 533, the function of CDMA 410, and the function of MVDMA 534 can also be realized by the same component.
  • the training of the neural network is to adjust the parameters of each layer by inputting training samples, so that the results calculated by the neural network are as close as possible to the real results.
  • neural network training includes forward propagation and backpropagation. Forward propagation is based on the existing model: the input training samples are calculated layer by layer through the neural network, and the input feature map is gradually extracted into abstract features. After forward propagation, an output value called the predicted value is obtained. In backpropagation, a loss function is computed from the predicted value and the real value obtained by forward propagation, and the gradient descent method is used, via the chain rule, to calculate the partial derivative of the loss function with respect to each parameter so as to update the parameters. Under the chain rule, the derivative of the error with respect to the weights of the last layer of the neural network is calculated first.
  • when the sample data set is large, it needs to be divided into multiple blocks; each block is transmitted to the computing device in turn, and the weights of the neural network are updated correspondingly after each block of data is processed.
  • one complete pass over the whole sample data set is called an epoch.
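The block-wise update described above can be sketched as a minimal loop; the linear model, learning rate, and all names here are illustrative assumptions rather than the disclosed training procedure:

```python
import numpy as np

def run_epoch(w, samples, targets, block_size, lr=0.01):
    """One epoch: split the sample set into blocks and update the weights
    after each block is processed (a plain mini-batch gradient step)."""
    n = len(samples)
    for start in range(0, n, block_size):       # one block of the data set at a time
        xb = samples[start:start + block_size]
        yb = targets[start:start + block_size]
        grad = xb.T @ (xb @ w - yb) / len(xb)   # gradient for a linear model
        w = w - lr * grad                       # weight update after this block
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 3))
t = rng.standard_normal((16, 1))
w0 = np.zeros((3, 1))
w1 = run_epoch(w0, X, t, block_size=4)          # 16 samples = 4 blocks per epoch
```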
  • this embodiment provides a solution for sparse training of a neural network model.
  • the neural network parameters are sparsed at least in forward propagation.
  • the sparsification can be one-dimensional (e.g., the input channel dimension), or multi-dimensional, such as two-dimensional (e.g., the input channel dimension and the output channel dimension sparsified simultaneously).
  • simultaneous sparsification of the input channel and output channel dimensions may also be supported in backpropagation, thereby further optimizing performance.
  • the sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can use different structured sparse data flow structures to perform related operations to obtain optimized operation and IO performance.
  • FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure.
  • the data processing device 600 may be implemented, for example, in the computing device 201 of FIG. 2 . As shown, the data processing apparatus 600 may include a control circuit 610 , a storage circuit 620 and an arithmetic circuit 630 .
  • the control circuit 610 may be similar to the control module 31 in FIG. 3, and may include, for example, an instruction fetch unit for acquiring instructions from, for example, the processing device 203 in FIG. 2, and an instruction decode unit that decodes the acquired instructions and sends the decoded result to the operation circuit 630 and the storage circuit 620 as control information.
  • control circuit 610 may be configured to control the storage circuit 620 and the arithmetic circuit 630 to perform sparse training on the neural network model.
  • Storage circuitry 620 may be configured to store information, which may include at least neural network parameters.
  • the storage circuit 620 may also store mask tensors.
  • the storage circuit may be, for example, the WRAM 332 and NRAM 331 of FIG. 3 .
  • the operation circuit 630 may be configured to perform sparse training on the neural network model under the control of the control circuit 610, so as to perform the method for sparse training as shown in FIG. 7 .
  • Figure 7 illustrates a method performed during one iteration according to an embodiment of the present disclosure.
  • step 710 in forward propagation, at least the neural network parameters are sparsed based on the mask tensor to calculate the value of the loss function.
  • the mask tensor may exist in various situations.
  • in one case, the mask tensor is a one-dimensional tensor that sparsifies a specified dimension of the data; for example, the mask tensor sparsifies the input channel dimension of the neural network parameters.
  • the sparsification process may be a structured sparsification; for example, according to a sparsification rule, n data elements are selected as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n.
  • n can also take other values, such as 1 or 3.
  • the mask tensor can be a one-dimensional vector, which can be divided into multiple intervals of length m; each interval has n elements equal to 1, representing the retained data positions, and m-n elements equal to 0, representing the masked-out data positions.
  • FIG. 8A illustrates the masking process of an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure.
  • FIG. 8A takes the convolutional layer operation of the convolutional neural network as an example to illustrate the sparsification-based convolution operation in forward propagation.
  • the dimension to be sparse is the input channel dimension.
  • An exemplary mask tensor is a vector of length 16, divided into 4 bins of length 4, each bin has 2 elements of 1, as shown by the black squares in the figure.
  • the input channel dimension of the weights is divided into corresponding segments; each segment corresponds to one interval of the mask tensor, which determines the retained weight values.
  • the input channel dimension of the neuron is similarly sparsed using the same mask tensor.
  • the sparse weights and the sparse neurons are then operated, such as multiply-accumulate operations.
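The interval-wise selection and the masked multiply-accumulate described above might be sketched as follows. The magnitude-based selection rule is an assumption for illustration; the disclosure only requires that n of every m positions be marked 1:

```python
import numpy as np

def make_nm_mask(values, m=4, n=2):
    """Build a one-dimensional structured mask: in every interval of
    length m, keep the n entries of largest magnitude (an assumed
    selection rule) by setting their mask positions to 1."""
    mask = np.zeros_like(values, dtype=np.int64)
    for start in range(0, len(values), m):
        seg = values[start:start + m]
        keep = np.argsort(-np.abs(seg))[:n]     # n positions to retain
        mask[start + keep] = 1
    return mask

# 16 input channels, split into 4 intervals of length 4 with 2 ones each,
# matching the exemplary mask vector of FIG. 8A
rng = np.random.default_rng(2)
weights = rng.standard_normal(16)
neurons = rng.standard_normal(16)
mask = make_nm_mask(weights)

# the same mask sparsifies both the weights and the neurons along the
# input channel dimension, followed by a multiply-accumulate
acc = np.sum((weights * mask) * (neurons * mask))
```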
  • in another case, the mask tensor is a two-dimensional tensor that simultaneously sparsifies two specified dimensions of the data; for example, the mask tensor sparsifies both the input channel dimension and the output channel dimension of the neural network parameters.
  • the sparsification process may be a structured sparsification; for example, according to a sparsification rule, n data elements are selected as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n.
  • n can also take other values, such as 1 or 3.
  • the mask tensor can be a two-dimensional matrix, which can be divided into a plurality of m × m squares; in each square, any row has n elements equal to 1 and m-n elements equal to 0, and any column likewise has n elements equal to 1 and m-n elements equal to 0, where "1" represents a retained data position and "0" represents a masked data position.
  • in this embodiment, m is 4 and n is 2.
  • FIG. 8B shows an exemplary masking process. It is assumed that the input channels and output channels of the convolution layer form a 4 × 4 channel matrix 801 whose elements are a11 to a44, and the channel matrix 801 holds the neural network parameters.
  • the figure also shows an exemplary mask matrix 802, one of the aforementioned 90 4 × 4 mask matrices, used for mask-sparsifying the channel matrix 801. Specifically, if the corresponding element in the mask matrix 802 is 1, the operation circuit 630 retains the element of the channel matrix 801; if the corresponding element in the mask matrix 802 is 0, the operation circuit 630 masks that element of the channel matrix 801, setting its value to 0.
  • for example, where the corresponding element in the mask matrix 802 is 0, the corresponding element in the masked parameter matrix 803 is masked and its value is 0. All element values of the masked parameter matrix 803 are obtained in this way. Since half of the elements of the channel matrix 801 are masked out, about half of the computation is saved.
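As a sanity check on the figure of 90 mentioned above, one can enumerate all 4 × 4 binary matrices with exactly two 1s in every row and every column, then apply one of them element-wise to a stand-in channel matrix (the variable names are illustrative):

```python
import itertools
import numpy as np

# every length-4 row containing exactly two 1s (there are C(4,2) = 6 of them)
rows_with_two_ones = [r for r in itertools.product((0, 1), repeat=4) if sum(r) == 2]

# keep only the matrices whose columns also each sum to 2
masks = [
    np.array(m)
    for m in itertools.product(rows_with_two_ones, repeat=4)
    if all(sum(col) == 2 for col in zip(*m))
]

# element-wise masking: a 0 in the mask matrix zeroes the channel element
channel = np.arange(1, 17).reshape(4, 4)   # stand-in for a11 .. a44
masked = channel * masks[0]
```

Half of the 16 elements survive in every such mask, which is the "about half of the computation is saved" claim above.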
  • for each training sample, the arithmetic circuit 630 masks the parameters of the neural network based on the mask tensor in forward propagation, then performs the calculation, and finally obtains the value of the loss function, which corresponds to the output error of the neural network.
  • step 720 in backpropagation, neuron gradients and neural network parameter gradients are calculated based on the loss function.
  • the sparsification process may or may not be selectively applied in the back propagation.
  • when sparsification is not applied in backpropagation, the neuron gradients and neural network parameter gradients may be computed based on the unsparsified neural network parameters, and the neural network parameters are then updated based on the neural network parameter gradients.
  • the unsparsified neural network parameters may be the neural network parameters before sparsification, or may be obtained by applying anti-sparse (de-sparsification) processing to the sparsified parameters.
  • the anti-sparse processing may include restoring the sparsified neural network parameters to their corresponding positions before sparsification according to the indication of the mask tensor, and filling the remaining positions with predetermined information (e.g., 0), so as to restore the shape before the sparsification.
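The anti-sparse processing just described amounts to a scatter guided by the mask tensor; a minimal sketch (function and variable names are hypothetical):

```python
import numpy as np

def desparsify(compact, mask, fill=0.0):
    """Restore sparsified parameters: place the retained values back at
    the positions the mask tensor marks with 1, and fill the remaining
    positions with predetermined information (0 here), recovering the
    shape before sparsification."""
    out = np.full(mask.shape, fill, dtype=float)
    out[mask == 1] = compact    # scatter retained values to their original slots
    return out

mask = np.array([1, 0, 1, 0, 0, 1, 1, 0])    # 1 = retained position
compact = np.array([0.5, -1.2, 2.0, 0.3])    # the 4 retained parameter values
restored = desparsify(compact, mask)
```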
  • alternatively, sparsification can also be applied in backpropagation; that is, the neural network parameter gradients and neuron gradients are calculated based on the sparsified neural network parameters, and the neural network parameters are then updated based on the neural network parameter gradients.
  • top_diff and bottom_diff are the neuron gradients at the output side and the input side of the layer, respectively
  • W is the weight of this iteration
  • ⁇ W is the weight gradient calculated by this iteration
  • the weight gradient computation in backpropagation is similar to the convolution operation.
  • the bottom_diff of the previous layer is the top_diff of the current layer, and the bottom_diff of the current layer is the top_diff of the next layer, so that the error is propagated backward layer by layer.
  • in backpropagation, the layout of the weight W differs from that in forward propagation, so the accumulation direction in the operation also differs.
  • in forward propagation, the weights are used in the (Co, Kh, Kw, Ci) dimension order or dimension shape, where Ci is the input channel dimension, Co is the output channel dimension, Kh is the convolution kernel height dimension, and Kw is the convolution kernel width dimension.
  • the operation results are accumulated in the Ci direction.
  • in backpropagation, the weights are used in the (Ci, Kh, Kw, Co) dimension order or dimension shape.
  • the result of the operation is accumulated in the Co direction. Therefore, to keep the gradients mathematically consistent in backpropagation, both the Ci and Co directions need to be sparsified simultaneously.
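Sparsifying both the Ci and Co directions of a weight tensor, as the text requires, can be sketched as follows (a minimal NumPy illustration of a 2-of-4 scheme applied along each of the two channel axes of a (Co, Kh, Kw, Ci) weight; the helper name and shapes are our assumptions):

```python
import numpy as np

def two_of_four_mask(w, axis):
    """2-of-4 mask along one axis: keep the 2 largest-|value| elements in
    every group of 4 along `axis` (axis length assumed divisible by 4)."""
    moved = np.moveaxis(np.abs(w), axis, -1)
    grouped = moved.reshape(*moved.shape[:-1], -1, 4)
    mask = np.zeros_like(grouped)
    # indices of the 2 largest |values| per group of 4
    np.put_along_axis(mask, np.argsort(-grouped, axis=-1)[..., :2], 1.0, axis=-1)
    return np.moveaxis(mask.reshape(moved.shape), -1, axis)

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 3, 3, 8))   # (Co, Kh, Kw, Ci)

# mask the Ci direction (accumulated over in forward propagation) and the
# Co direction (accumulated over in backpropagation)
m_ci = two_of_four_mask(w, axis=3)
m_co = two_of_four_mask(w, axis=0)
mask = m_ci * m_co
sparse_w = w * mask
```

Each single-axis mask keeps exactly half the elements; their product keeps at most half, sparsifying both accumulation directions at once.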
  • a reverse mask tensor can be used to mask the neural network parameters to obtain sparsely processed neural network parameters.
  • ideally, the reverse mask tensor would be identical to the mask tensor used in forward propagation; however, because of the different weight layouts and accumulation directions in backpropagation described above, the forward mask tensor cannot be used directly.
  • the mask tensor (or called forward mask tensor) used in forward propagation can be dimensionally transformed before being used.
  • various existing dimension transformation methods (e.g., dimension transposition, data reshaping) can be used to convert the mask tensor into the layout required in backpropagation, and the result is used as the reverse mask tensor.
  • the mask tensor generation process used in the forward propagation process can also be repeated during the backpropagation process to generate the reverse mask tensor.
  • for example, the mask calculation in the Ci direction is performed during forward propagation, while the mask calculation in the Co direction is performed during backpropagation.
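The dimension-transposition route for obtaining the reverse mask tensor can be sketched as follows (a minimal illustration; the concrete shapes Co=4, Kh=Kw=3, Ci=8 are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# forward mask laid out like the forward weights: (Co, Kh, Kw, Ci)
forward_mask = rng.integers(0, 2, size=(4, 3, 3, 8))

# backpropagation consumes the weights in (Ci, Kh, Kw, Co) order, so one way
# to obtain the reverse mask is a dimension transposition of the forward mask
reverse_mask = forward_mask.transpose(3, 1, 2, 0)   # (Ci, Kh, Kw, Co)
```

The transposition swaps the Ci and Co axes while leaving Kh and Kw in place, so each mask element still governs the same weight element, just addressed in the backward layout.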
  • step 730 the neural network parameters are updated based on the neural network parameter gradients.
  • the sparsification training of embodiments of the present disclosure may include several training stages, such as a maskless stage, a mask adjustment stage, and a mask fixation stage.
  • the updates to the neural network parameters can also be different.
  • in some training stages, updating the neural network parameters means updating the unsparsified parameters. For example, in the mask adjustment stage, the unsparsified neural network parameters are updated in each iteration. Further, in the mask adjustment stage, every K (K≥1) iterations, an updated mask tensor can be generated based on the updated unsparsified neural network parameters, so that the mask tensor is continuously optimized during training to improve performance.
  • updating the neural network parameters may be updating the sparse-processed neural network parameters.
  • in the mask fixing stage, the sparsity pattern of the neural network parameters is fixed, that is, the valid data elements in the neural network parameters are fixed, so the parameter update may touch only the valid data elements; in other words, only the sparsified neural network parameters are updated.
  • in these embodiments, updating the neural network parameters may include: sparsifying the neuron gradients using the mask tensor; and updating the sparsified neural network parameters based on the sparsified neuron gradients.
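A minimal sketch of the mask-fixing-stage update is shown below. For brevity it masks the parameter gradients directly (the text speaks of sparsifying the neuron gradients from which these are computed); either way, only the valid elements of the parameters ever change. The function name and the toy values are our assumptions:

```python
import numpy as np

def update_sparse_params(params, grads, mask, lr=0.01):
    """Mask-fixing-stage update: sparsify the gradient with the fixed mask,
    then apply a plain gradient-descent step, so only the valid (unmasked)
    elements of the sparsified parameters are updated."""
    sparse_grads = grads * mask          # zero out gradients of masked elements
    return params - lr * sparse_grads

mask = np.array([1.0, 0.0, 1.0, 0.0])
params = np.array([0.5, 0.0, -0.3, 0.0])   # already sparsified: zeros at masked slots
grads = np.array([0.2, 9.9, -0.1, 7.7])    # raw gradients, including masked slots
new_params = update_sparse_params(params, grads, mask, lr=0.1)
```

The masked slots stay exactly zero after the update, so the fixed sparsity pattern is preserved across iterations.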
  • the mask tensor fixed in the mask fixing stage may be the mask tensor finally determined in the previous training stage (eg, the mask adjustment stage).
  • each element of the mask vector masks a single parameter.
  • a mask tensor can be generated based on the unsparsified neural network parameters. For example, from every m data elements along a specified dimension of the neural network parameters, select the n elements with the larger absolute values as valid data elements, where m>n; and generate the mask tensor based on the positions of the n valid data elements among the m data elements.
  • the aforementioned specified dimension may be the input channel dimension (Ci).
  • specifically, the parameters are divided into multiple intervals of m parameters each, and the parameters in each interval are sorted by absolute value; the mask elements corresponding to the n parameters with the larger absolute values in each interval are set to 1, and those corresponding to the m-n parameters with smaller absolute values are set to 0, because a mask adjustment parameter with a larger absolute value carries a more pronounced feature and is more worth keeping in the computation.
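The n-of-m mask generation just described can be sketched as follows (a minimal NumPy illustration with m=4, n=2; the function name and sample values are our assumptions):

```python
import numpy as np

def make_mask(params, m=4, n=2):
    """Build a mask that keeps, in every interval of m consecutive
    parameters, the n with the largest absolute values (2-of-4 here)."""
    flat = params.reshape(-1, m)
    mask = np.zeros_like(flat)
    # positions of the n largest |values| per interval
    top = np.argsort(-np.abs(flat), axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(params.shape)

w = np.array([0.1, -0.9, 0.5, 0.2,   # interval 1: keep -0.9 and 0.5
              0.3, -0.1, 0.8, -0.7]) # interval 2: keep 0.8 and -0.7
mask = make_mask(w)
```

Each interval of four parameters contributes exactly two ones to the mask, matching the 4-pick-2 update described for FIG. 9.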
  • FIG. 9 is a schematic diagram of an exemplary mask vector update, illustrating the aforementioned update mask vector by way of example.
  • the figure shows a parameter vector 901 with 64 parameters in total, namely b01 to b64.
  • each element value of the mask vector is updated to keep the mask adjustment parameters with larger absolute values and mask out those with smaller absolute values.
  • the updated mask adjustment parameters are divided into multiple intervals in units of every 4 mask adjustment parameters (that is, m is 4).
  • b01 to b04 form the first interval 902, b05 to b08 form the second interval 903, and so on until b61 to b64 form the sixteenth interval 917.
  • the mask adjustment parameters in each interval are sorted by absolute value. Assume the order in the first interval 902 is b02 > b01 > b04 > b03, the order in the second interval 903 is b07 > b05 > b06 > b08, and the order in the sixteenth interval 917 is b64 > b63 > b61 > b62.
  • after sorting, the mask elements at the positions of the first 2 (that is, n is 2) mask adjustment parameters with larger absolute values in each interval are set to 1, and the mask elements at the remaining positions are set to 0. Taking the first interval 902 as an example, the elements corresponding to b02 and b01 in the mask vector are set to 1, and the elements corresponding to b04 and b03 are set to 0.
  • each interval is adjusted in this way, and finally the updated mask vector 918 is obtained. The updated mask vector 918 retains the mask adjustment parameters with larger absolute values and masks those with smaller absolute values. In summary, every 4 mask adjustment parameters form an interval, and the element values of the mask vector are updated in a 4-pick-2 manner per interval.
  • in this example, the mask adjustment parameters in each interval are fully sorted to identify the n with larger absolute values and the m-n with smaller absolute values, but the present disclosure does not require a complete sort; it suffices to identify which n items have the larger absolute values and which m-n items have the smaller ones, while the ordering within each group is not needed. Taking the first interval 902 as an example, it is only necessary to determine that b01 and b02 are the two with larger absolute values and b03 and b04 the two with smaller absolute values; the relative order of b01 and b02, or of b03 and b04, is not critical, so the full sort can be omitted to save computing resources.
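The partial selection described above (identifying the n larger-|value| items without fully sorting) is exactly what a partition-based selection provides; a minimal sketch with NumPy's `argpartition` (the sample interval is our assumption):

```python
import numpy as np

group = np.array([0.4, -0.9, 0.1, 0.6])   # one interval of m=4 parameters
n = 2

# argpartition finds the positions of the n largest |values| without fully
# sorting the interval, which is all the scheme above requires
keep = np.argpartition(-np.abs(group), n)[:n]
mask = np.zeros(4)
mask[keep] = 1.0
```

Here the two largest-magnitude entries (-0.9 and 0.6) are kept; their relative order is never computed, saving the comparisons a full sort would spend.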
  • the training data can be multiplied element-wise with each masked parameter tensor to obtain parameter evaluation values.
  • the purpose of the parameter evaluation value is to measure the amount of information retained after masking by the mask tensor. A high evaluation value means that the mask has not lost too much information, so the mask tensor reduces the amount of computation while retaining most of the information and is a high-quality mask tensor; conversely, a low evaluation value indicates that too much information is lost after masking, and the mask tensor is not a high-quality one.
  • the two-dimensional mask tensor can be determined as follows: a specific number of two-dimensional mask tensors are preset, and then one of the preset two-dimensional mask tensors is selected as the mask tensor to be used.
  • Each dimension of these two-dimensional mask tensors includes m elements, where n elements are 1, m-n elements are 0, and m>n.
  • selecting one of this specific number (e.g., 90) of two-dimensional mask tensors may include: masking the two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain a masked parameter tensor; performing a product-sum operation between each masked parameter tensor and the training data of the neural network layer to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that produces the largest of all parameter evaluation values as the mask tensor to use.
  • the two dimensions specified above may be the input channel dimension and the output channel dimension.
  • the above product-sum operation can also be regarded as a convolution that does not accumulate along the input channel dimension but only in the depth direction, so it may also be called a depthwise convolution, where the depth direction is the Kw×Kh dimension.
  • FIG. 10 shows an exemplary sum-of-product calculation process.
  • the training data matrix 1001 is one piece of training data in the training set; it would originally be computed with the channel matrix 801 in FIG. 8, and is now instead multiplied with the masked parameter matrix 803 to measure the amount of information remaining after masking.
  • in one scheme, the corresponding elements of the training data matrix 1001 and the masked parameter matrix 803 are multiplied, and the absolute values of the products are summed to obtain the parameter evaluation value S1; in another scheme, the absolute values of the corresponding elements are multiplied and then summed to obtain the parameter evaluation value S2. Both evaluation values reflect the result of a similar absolute-value computation.
  • the parameter evaluation value S1 or S2 indicates the amount of information retained after masking: the higher the value, the more information is retained. In one application scenario, either calculation method may be selected; in another, S1 and S2 may be used at the same time. The present disclosure does not restrict the choice.
  • masking is performed with every candidate mask tensor and the corresponding parameter evaluation values are collected; in the preceding example, this means all 90 4×4 mask matrices are applied and 90 parameter evaluation values are obtained.
  • the mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, that is, the parameter mask tensor.
  • there are many ways to find the maximum parameter evaluation value. For example, all evaluation values can be sorted numerically to obtain the largest, or a two-input comparator can be used repeatedly: the larger of two values is kept and compared with the next value, until all 90 evaluation values have been compared and the largest remains. If multiple mask tensors share the same maximum evaluation value, one of them may be selected according to certain rules or hardware characteristics, such as first found, last found, or randomly selected.
  • the mask tensor with the largest parameter evaluation value is the mask tensor that retains the most information, and this embodiment uses the mask tensor as the parameter mask tensor.
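The evaluate-and-select procedure can be sketched as follows (a minimal illustration: the score here is an S1-style sum of absolute products, since the exact formulas behind S1/S2 are in figures not reproduced in this text; function names and candidate masks are our assumptions):

```python
import numpy as np

def evaluate(mask, weights, data):
    """Parameter evaluation value: mask the weights, multiply element-wise
    with the training data, and sum the absolute values of the products
    (an S1-style score)."""
    return np.abs(data * (weights * mask)).sum()

def select_mask(candidate_masks, weights, data):
    """Pick the candidate mask tensor with the largest evaluation value,
    i.e. the one retaining the most information."""
    scores = [evaluate(m, weights, data) for m in candidate_masks]
    return candidate_masks[int(np.argmax(scores))]

w = np.array([[1.0, 2.0], [3.0, 5.0]])
x = np.ones((2, 2))
m1 = np.array([[1, 0], [0, 1]])   # keeps weights 1 and 5
m2 = np.array([[0, 1], [1, 0]])   # keeps weights 2 and 3
best = select_mask([m1, m2], w, x)
```

With uniform data, `m1` retains more weight magnitude (1+5 vs 2+3), so it is selected as the parameter mask tensor.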
  • the mask tensor may be updated every iteration or every generation of training. If the neural network parameters are updated after each training sample, the mask tensor is preferably updated in each iteration; if the neural network parameters are updated once per generation, the mask tensor is preferably updated at the end of each generation of training.
  • when the mask tensor is generated for the first time, it can be generated in a similar manner; the only difference is the neural network parameters on which it is based. Depending on the stages involved in the training process, these may be randomly initialized parameters or the neural network parameters determined after training in the unmasked stage.
  • the sparsification training of embodiments of the present disclosure may include several training stages, such as a maskless stage, a mask adjustment stage, and a mask fixation stage.
  • FIG. 11 shows an exemplary flow diagram including a no-mask stage and a mask-adjustment stage.
  • in the unmasked stage, the processing device 203 only trains the neural network parameters, that is, no mask sparsification is applied to them. After the unmasked stage ends and the mask adjustment stage begins, the parameters and the mask tensor are updated simultaneously.
  • in step 1101, the control circuit 610 is first set to enter the unmasked stage. In this stage, the neural network parameters are not masked, all parameter elements participate in the training, and the parameter values can be randomly generated at the start of training; these parameters are called unmasked parameters.
  • step 1102 the arithmetic circuit 630 calculates the value of the loss function based on the unmasked parameters in the forward pass.
  • the computing circuit 630 calculates the loss function as in the prior art: in forward propagation, the input training samples are processed by each layer of the neural network, abstract features are gradually extracted from the input feature map, and the loss function is calculated from the forward propagation result and the ground truth.
  • the arithmetic circuit 630 calculates the partial derivative of the loss function with respect to the unmasked parameter in backpropagation.
  • the arithmetic circuit 630 uses the gradient descent method to calculate the partial derivative of the loss function for each unmasked parameter through the chain rule.
  • the arithmetic circuit 630 updates the unmasked parameter based on the partial derivative, and uses the updated unmasked parameter as the initial value of the mask adjustment parameter.
  • the arithmetic circuit 630 updates the unmasked parameters of the entire neural network by multiplying each partial derivative by the step size, according to each unmasked parameter's influence on the error.
  • the arithmetic circuit 630 may also update the unmasked parameter based on the partial derivative in each training sample or each iteration.
  • step 1102, step 1103 and step 1104 can be repeated over a certain number of training passes to update the unmasked parameters multiple times. After the last update, the updated unmasked parameters serve as the initial values of the mask adjustment parameters in the next stage.
  • the control circuit 610 is set to enter the mask adjustment stage, that is, it starts to mask some parameters by using the mask tensor.
  • the prior art only trains on all parameters (such as weights, biases, etc.), and usually does not mask the parameters.
  • the purpose of parameter masking in this embodiment is to reduce the participation of parameters in the training phase, avoiding overfitting and reducing the amount of computation.
  • at the beginning of the mask adjustment stage, as mentioned earlier, the initial values of the mask adjustment parameters are the unmasked parameters finally updated in the unmasked stage, and the initial mask tensor can be obtained based on those finally updated unmasked parameters.
  • step 1106 the mask adjustment parameters are masked based on the mask tensor in the forward pass to calculate the value of the loss function.
  • step 1107 the partial derivatives of the loss function to the mask adjustment parameters are calculated in backpropagation.
  • step 1108 the mask adjustment parameters are updated based on the partial derivatives.
  • step 1109 the mask tensor is updated based on the updated mask adjustment parameters.
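The two-stage flow of steps 1101-1109 can be sketched on a toy problem (a minimal illustration only, not the patent's hardware flow; the quadratic loss, learning rate, iteration counts, and 2-of-4 grouping are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)              # step 1101: randomly generated parameters
target = np.ones(8)                 # toy ground truth
lr = 0.1

def grad(w_eff):
    # gradient of a toy quadratic loss 0.5 * ||w_eff - target||^2
    return w_eff - target

# unmasked stage (steps 1102-1104): every parameter element trains
for _ in range(100):
    w -= lr * grad(w)

def make_mask(p, m=4, n=2):
    # 2-of-4 mask keeping the larger-|value| parameters per interval
    flat = np.abs(p).reshape(-1, m)
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, np.argsort(-flat, axis=1)[:, :n], 1.0, axis=1)
    return mask.reshape(p.shape)

# mask adjustment stage (steps 1105-1109): mask, train, refresh the mask
mask = make_mask(w)                 # initial mask from the trained parameters
for _ in range(100):
    w -= lr * mask * grad(w * mask) # steps 1106-1108: forward masks the parameters
    mask = make_mask(w)             # step 1109: update the mask tensor
```

The unmasked stage lets all parameters converge first, so the initial mask is computed from trained rather than random values, as the text motivates.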
  • this embodiment does not limit the number of generations of training in the unmasked stage and the mask adjustment stage; those skilled in the art can arrange them according to the specific situation, and the numbers of generations in the two stages are not necessarily the same.
  • Another embodiment of the present disclosure also provides a solution for sparse training of a neural network model based on the aforementioned hardware environment.
  • the difference from the previous embodiment is that the training is divided into three stages: no mask stage, mask adjustment stage and mask fixation stage.
  • the processing device 203 In the unmasked stage, the processing device 203 only trains the parameters without masking the parameters.
  • in the mask fixing stage, the processing device 203 takes the updated mask adjustment parameters and the updated mask tensor from the mask adjustment stage as initial values, and continues to train the parameters without changing or updating the mask tensor.
  • step 1201 the control circuit 610 is set to enter the mask fixing stage.
  • the control circuit 610 uses the mask adjustment parameter updated in the mask adjustment stage as the initial value of the parameter (hereinafter referred to as the mask fixing parameter) in this stage.
  • since the mask tensor was already updated in the mask adjustment stage, it is not updated in this stage; instead, the mask fixing parameters are masked based on the mask tensor finally updated in the mask adjustment stage, and training continues so that the mask fixing parameters are updated.
  • This embodiment repeats the following steps in at least one generation of training.
  • step 1202 the arithmetic circuit 630 masks the mask fixed parameters in forward propagation based on the mask tensor updated in the mask adjustment stage to calculate the value of the loss function.
  • step 1203 the arithmetic circuit 630 calculates the partial derivatives of the loss function with respect to the fixed parameters of the mask in backpropagation.
  • step 1204 the update module 64 updates the mask fixed parameters based on the partial derivatives.
  • This embodiment is divided into three stages during training.
  • the unmasked stage no mask tensor masks the parameters, and only the parameters are trained to speed up the convergence of the parameters.
  • the mask adjustment stage since the initial values of the parameters are no longer randomly generated, but the unmasked parameters that have been trained, it is helpful to quickly obtain an ideal mask tensor.
  • after the mask tensor is updated, the mask fixing stage begins, and training continues using the updated mask tensor, so that the finally trained parameters better match the mask tensor.
  • Embodiment 1301 has only a mask adjustment stage: the initial parameter value W0 is randomly generated, the initial mask tensor M0 is determined based on W0, and the parameters and the mask matrix are updated simultaneously during training to obtain the trained parameters Wf and the updated mask tensor Mf.
  • Embodiment 1302 has only a no-mask stage and a mask-adjustment stage.
  • the unmasked stage only the parameters are trained, the initial value of the parameter W0 is randomly generated, and the updated parameter W1 is obtained after training.
  • in the mask adjustment stage, the parameters and the mask matrix are updated simultaneously; the initial parameter value in this stage is the updated parameter W1, the initial mask tensor M0 is obtained from W1, and finally the trained parameters Wf and the updated mask tensor Mf are obtained.
  • Embodiment 1303 has only a mask adjustment stage and a mask fixation stage.
  • the initial value of the parameter W0 is randomly generated
  • the initial value of the mask tensor M0 is determined based on the initial value of the parameter W0
  • the parameters and the mask matrix are updated simultaneously during training to obtain the updated parameters W1 and the updated mask tensor Mf.
  • in the mask fixing stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter value in this stage is W1, and finally the trained parameters Wf are obtained.
  • Embodiment 1304 has a no-mask stage, a mask-adjustment stage, and a mask-fixing stage.
  • the unmasked stage only the parameters are trained, the initial value of the parameter W0 is randomly generated, and the updated parameter W1 is obtained after training.
  • in the mask adjustment stage, the parameters and the mask matrix are updated simultaneously; the initial parameter value in this stage is the updated parameter W1, and the initial mask tensor M0 is obtained from W1, finally yielding the updated parameters W2 and the updated mask tensor Mf.
  • in the mask fixing stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter value in this stage is W2, and finally the trained parameters Wf are obtained.
  • in addition to an unmasked stage, a mask adjustment stage, and a mask fixing stage, Embodiment 1305 also has other training stages (shown with dotted lines) between the unmasked stage and the mask adjustment stage, and between the mask adjustment stage and the mask fixing stage.
  • the unmasked stage only the parameters are trained, the initial value of the parameter W0 is randomly generated, and the updated parameter W1 is obtained after training.
  • any training stage disclosed or undisclosed in the present disclosure can be inserted here to continue training the parameters or updating the mask matrix. Assuming this stage is a mask fixing stage, the initial parameter value in this stage is the updated parameter W1, while the initial mask tensor M0 is obtained from W1, yielding the updated parameters W2.
  • in the next stage, the initial parameter value is the updated parameter W2, and the initial mask tensor is still M0, so as to obtain the updated parameters W3 and the updated mask tensor M1.
  • assume the next inserted stage is a parameter fixing stage, that is, the parameters are fixed and not trained, and only the mask tensor is trained. The initial parameter value in this stage is the updated parameter W3, and the initial mask tensor is the updated mask tensor M1, yielding the updated mask tensor Mf.
  • in the mask fixing stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter value in this stage is W3, and finally the trained parameters Wf are obtained.
  • FIG. 13 The various embodiments shown in FIG. 13 are only examples, and those skilled in the art can expand other embodiments without creative efforts after referring to the present disclosure, and these embodiments all belong to the scope of the disclosure of the present disclosure.
  • the present disclosure does not limit the number of generations of training performed in the various embodiments; those skilled in the art can arrange it according to specific circumstances, and the number of generations performed in each stage is not necessarily the same.
  • the aforementioned embodiments do not necessarily have to complete all of the preset number of generations of training.
  • the control circuit 610 may further determine whether the percentage of elements of the parameter mask tensor that did not change over two consecutive generations of training reaches a threshold. If so, the training results have essentially converged and further training would bring limited accuracy improvement, so the mask adjustment stage ends and the training is complete.
  • the threshold is generally set above 70%; that is, if the percentage of unchanged elements of the parameter mask tensor exceeds 70%, training is stopped.
  • the present disclosure does not limit the threshold, which may be 80%, 90%, 100%, or any other percentage.
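The convergence check just described can be sketched as follows (a minimal illustration; the function name and sample masks are our assumptions, and the 70% default follows the text, with 80%, 90%, or 100% equally valid choices):

```python
import numpy as np

def mask_converged(prev_mask, new_mask, threshold=0.7):
    """Return True when the fraction of mask tensor elements that did NOT
    change between two consecutive generations of training reaches the
    threshold, signalling the mask adjustment stage can end."""
    unchanged = (prev_mask == new_mask).mean()
    return bool(unchanged >= threshold)

prev = np.array([1, 0, 1, 0, 1, 0, 1, 0])
new = np.array([1, 0, 1, 0, 1, 0, 0, 1])    # 6 of 8 elements unchanged (75%)
```

With 75% of elements unchanged, the 70% threshold is met and training would stop; a fully flipped mask would not meet it.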
  • different sparse data flow structures can be used to perform the related operations in different stages of training, so as to obtain optimal computation and IO performance.
  • in the mask adjustment stage, the mask tensor may be updated based on the updated neural network parameters, and the results of the update process may include the sparsified neural network parameters (e.g., sparsified weights) and the mask tensor.
  • This mask tensor can be used for sparsification of training data.
  • subsequent operations may be performed based on the sparsed neural network parameters and the sparsed training data.
  • in backpropagation, neuron gradients and neural network parameter gradients can be calculated based on the current unsparsified neural network parameters, and the unsparsified parameters are updated accordingly.
  • alternatively, the neural network parameters can be sparsified based on the mask tensor used in forward propagation, the neuron gradients and neural network parameter gradients can be computed from the sparsified parameters, and the unsparsified neural network parameters are updated accordingly.
  • the sparsification in the backpropagation process is described in the previous description, and will not be repeated here.
  • the mask tensor in the mask fixing stage, is fixed and does not need to be updated in real time. Therefore, the fixed mask tensor can be stored in the storage circuit for subsequent use.
  • Fixed mask tensors can include the forward mask tensor used in forward propagation, and the reverse mask tensor used in back propagation.
  • Neural network parameters can have different storage schemes.
  • in one scheme, the storage circuit stores the unsparsified neural network parameters.
  • in forward propagation, the stored mask tensor is used to sparsify the neural network parameters.
  • in backpropagation, the unsparsified neural network parameters directly participate in the neuron gradient calculation (for example, the above formula (1)), and the updated unsparsified parameters are stored back in the storage circuit.
  • the memory circuit may store the thinned neural network parameters.
  • the sparsely processed neural network parameters can directly participate in the forward operation, and no further sparse processing is required.
  • the sparsely processed neural network parameters need to be updated, so the mask tensor stored in the storage circuit can be used to sparse the neural network parameter gradient, and then the sparsely processed neural network parameters can be updated.
  • when sparsification is not applied in backpropagation, de-sparsification must be performed on the sparsified neural network parameters, and the neuron gradients are then calculated based on the de-sparsified parameters.
  • when sparsification is applied in backpropagation, the reverse mask tensor stored in the storage circuit can be used to sparsify the de-sparsified neural network parameters, and the neuron gradients are then calculated based on the result.
  • Another embodiment of the present disclosure is a computer-readable storage medium on which computer program codes for sparse training of a neural network model are stored.
  • the above integrated units may be implemented in the form of software program modules. If implemented as a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory.
  • the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a CD, or other media that can store program code.
  • the updated parameter mask tensor is used to block the parameters after training, so as to control the processing area of the feature map input to the neural network model.
  • in this way, when the computing device 201 performs inference, the amount of computation is reduced, achieving the purpose of sparsification.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, terminal, etc.
  • according to the solution of the present disclosure, electronic equipment or apparatuses with high computing power can be applied to cloud devices (e.g., cloud servers), while electronic equipment or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or the edge device, to simulate the hardware resources of the terminal device and/or the edge device, thereby completing the unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings herein, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments; that is, the actions or modules involved therein are not necessarily required to realize one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of this, for parts that are not described in detail in a certain embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and the like.
  • a method of sparsification training of a neural network model, performed by a data processing apparatus, comprising:
  • in forward propagation, at least the neural network parameters are sparsified based on the mask tensor to calculate the value of the loss function
  • the neural network parameters are updated based on the neural network parameter gradients.
  • the neuron gradients and the neural network parameter gradients are calculated based on the non-sparsified neural network parameters; and the neural network parameters are updated based on the neural network parameter gradients.
  • de-sparsification is performed on the sparsified neural network parameters to obtain the non-sparsified neural network parameters.
  • the neuron gradients and the neural network parameter gradients are calculated based on the sparsified neural network parameters; and the neural network parameters are updated based on the neuron gradients.
  • the neural network parameters are sparsified based on the reverse mask tensor to obtain the sparsified neural network parameters.
  • Clause 7 The method of clause 6, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  • Clause 8 The method of any of clauses 1-5, wherein the mask tensor is a two-dimensional tensor.
  • Clause 9 The method of Clause 8, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  • Clause 10 The method of Clause 5, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by dimensionally transforming the mask tensor.
  • Clause 11 The method of Clause 1, wherein updating the neural network parameters comprises updating the non-sparsified neural network parameters.
  • the mask tensor is generated based on the updated non-sparsified neural network parameters.
  • the mask tensor is determined based on the positions of the n valid data elements among the m data elements.
  • each dimension of the two-dimensional mask tensor includes m elements, wherein n elements are 1, m-n elements are 0, and m>n;
  • a product-sum operation is performed with the training data of the neural network to obtain parameter evaluation values; and the two-dimensional mask tensor that yields the largest of all parameter evaluation values is selected as the mask tensor.
  • Clause 15 The method of any of clauses 1-14, wherein the method is performed in multiple iterations in a mask adjustment phase of the sparsification training.
  • Clause 17 The method of clause 16, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 18 The method of any of clauses 1-10, wherein the method is performed in multiple iterations in a mask-fixing stage of sparsification training, and the mask tensor is fixed to be the mask tensor finalized in the previous stage.
  • the sparsified neural network parameters are updated based on the sparsified neuron gradients.
  • Clause 21 The method of any of clauses 18-20, wherein during the mask-fixing stage, the fixed mask tensor and the sparsified neural network parameters are stored.
  • Clause 22 The method of any of clauses 18-20, wherein during the mask-fixing stage, the fixed mask tensor and the non-sparsified neural network parameters are stored.
  • Clause 23 A computer-readable storage medium having stored thereon computer program code for sparsification training of a neural network model, which, when executed by a processing device, performs the method of any one of clauses 1-22.
  • a data processing apparatus comprising a control circuit, a storage circuit and an arithmetic circuit, wherein:
  • the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparsification training on a neural network model
  • the storage circuit is configured to store information including at least neural network parameters and mask tensors
  • the arithmetic circuit is configured to perform the following operations under the control of the control circuit:
  • in forward propagation, at least the neural network parameters are sparsified based on the mask tensor to calculate the value of the loss function
  • the neural network parameters are updated based on the neural network parameter gradients.
  • de-sparsification is performed on the sparsified neural network parameters to obtain the non-sparsified neural network parameters.
  • the neural network parameters are updated based on the neuron gradients.
  • the neural network parameters are sparsified based on the reverse mask tensor to obtain the sparsified neural network parameters.
  • Clause 30 The apparatus of clause 29, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  • Clause 32 The apparatus of clause 31, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  • Clause 33 The apparatus of Clause 28, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by the arithmetic circuit by dimensionally transforming the mask tensor.
  • the mask tensor is generated based on the updated non-sparsified neural network parameters.
  • the mask tensor is determined based on the positions of the n valid data elements among the m data elements.
  • each dimension of the two-dimensional mask tensor includes m elements, wherein n elements are 1, m-n elements are 0, and m>n;
  • a product-sum operation is performed with the training data of the neural network to obtain parameter evaluation values; and the two-dimensional mask tensor that yields the largest of all parameter evaluation values is selected as the mask tensor.
  • Clause 38 The apparatus of any of clauses 24-37, wherein the arithmetic circuit is configured to perform the operations in a plurality of iterations in a mask adjustment stage of sparsification training.
  • Clause 39 The apparatus of Clause 38, wherein the arithmetic circuit is further configured to: in the mask adjustment stage, determine whether the percentage of elements of the mask tensor whose values have not changed over successive training iterations reaches a threshold; and
  • Clause 40 The device of clause 39, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 41 The apparatus of any of clauses 24-33, wherein the arithmetic circuit is configured to perform the operations in a plurality of iterations in a mask-fixing stage of sparsification training, and the mask tensor is fixed to be the mask tensor finalized in the previous stage.
  • the sparsified neural network parameters are updated based on the sparsified neuron gradients.
  • Clause 44 The apparatus of any of clauses 41-43, wherein during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the sparsified neural network parameters.
  • Clause 45 The apparatus of any of clauses 41-43, wherein during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the non-sparsified neural network parameters.
  • Clause 46 A chip comprising a data processing device according to any of clauses 24-45.
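Clauses 12-14 above generate the mask tensor from the positions of the n valid data elements among every m data elements. Below is a minimal NumPy sketch of one plausible reading, assuming magnitude is the criterion for "valid" elements and that the flattened parameter length is divisible by m; the function name and the m, n values are illustrative, not fixed by the clauses:

```python
import numpy as np

def n_of_m_mask(weights, m=4, n=2):
    """Build an n-of-m mask: within every group of m consecutive
    elements, the n largest-magnitude elements get mask value 1,
    the remaining m-n elements get mask value 0."""
    flat = weights.reshape(-1, m)
    mask = np.zeros_like(flat)
    # indices of the n largest-magnitude elements in each group of m
    top = np.argsort(-np.abs(flat), axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(weights.shape)

w = np.array([0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.03, 0.6])
mask = n_of_m_mask(w, m=4, n=2)
sparse_w = w * mask  # sparsified parameters
```

With m=4 and n=2 this keeps half the parameters per group, which is what makes the sparsity pattern predictable enough for structured hardware data flows.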


Abstract

A method and apparatus for sparsification training of a neural network model, a board, and a readable storage medium. A combined processing apparatus (20) comprises a computing apparatus (201), an interface apparatus (202), a processing apparatus (203), and a storage apparatus (204). The computing apparatus (201) interacts with the processing apparatus (203) to jointly complete a computing operation specified by a user. The storage apparatus (204) is connected to the computing apparatus (201) and the processing apparatus (203), respectively, and stores data of the computing apparatus (201) and the processing apparatus (203).

Description

Apparatus and Method for Neural Network Sparsification, and Related Products
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 2020112169035, filed on November 4, 2020, entitled "Device, Method and Corresponding Product for Neural Network Sparsification", and to Chinese Patent Application No. 2020115632599, filed on December 25, 2020, entitled "Apparatus, Method and Related Products for Neural Network Sparsification".
TECHNICAL FIELD
The present disclosure relates generally to the field of processors. More specifically, the present disclosure relates to a method, apparatus, chip, board, and readable storage medium for sparsification training of a neural network model by a data processing apparatus.
BACKGROUND
In recent years, the rapid development of deep learning has brought leapfrog progress to algorithm performance in a series of fields such as computer vision and natural language processing. However, deep learning algorithms are computation-intensive and storage-intensive tools. As information processing tasks become increasingly complex and the demands on algorithm real-time performance and accuracy keep rising, neural networks tend to be designed deeper and deeper, so that their computation and storage requirements grow ever larger. As a result, existing deep-learning-based artificial intelligence technology is difficult to apply directly on mobile phones, satellites, or embedded devices with limited hardware resources.
Therefore, the compression, acceleration, and optimization of deep neural network models have become extremely important. A large number of studies attempt to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for the engineering application of deep learning technology on embedded and mobile terminals. Sparsification is one such model lightweighting method.
Network parameter sparsification reduces the redundant components of a large network by appropriate methods, so as to reduce the network's demand for computation and storage space. Although existing fine-grained parameter sparsification methods perform well at the model level, they are unfriendly to hardware memory access, i.e., on-chip/off-chip input/output overhead is high and performance is low. On the other hand, structured sparsification methods based on channels or convolution kernels improve hardware performance, but incur a large loss in model accuracy. Finally, most existing sparsification algorithms work by offline fine-tuning, i.e., a pre-trained model is sparsified and then fine-tuned; offline fine-tuning is subject to many restrictions and cannot yield more substantial performance gains during model training.
Therefore, there is a need for a scheme capable of sparsification training of neural network models.
SUMMARY OF THE INVENTION
In order to at least partially solve one or more technical problems mentioned in the background, the solution of the present disclosure provides an apparatus, a board, a method, and a readable storage medium for sparsification training of a neural network model.
In a first aspect, the present disclosure discloses a method, performed by a data processing apparatus, for sparsification training of a neural network model, comprising: in forward propagation, sparsifying at least the neural network parameters based on a mask tensor to calculate the value of a loss function; in backward propagation, calculating neuron gradients and neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
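The three steps of the first aspect (a masked forward pass, gradient computation in the backward pass, and a parameter update) can be sketched for a single linear layer with a squared-error loss. This is only an illustrative NumPy sketch: the function name, the learning rate, and the fixed mask are assumptions for illustration, not part of the claimed method.

```python
import numpy as np

def sparse_training_step(w, mask, x, y, lr=0.01):
    # Forward propagation: sparsify the parameters with the mask tensor
    w_sparse = w * mask
    y_hat = x @ w_sparse
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward propagation: neuron gradients, then parameter gradients
    grad_neuron = y_hat - y            # gradient w.r.t. the output neurons
    grad_w = x.T @ grad_neuron         # gradient w.r.t. the parameters
    # Update the parameters with the parameter gradients; masked positions
    # never affect the loss because the mask is re-applied every forward pass
    return w - lr * grad_w, loss

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
y = rng.standard_normal((8, 3))
w = rng.standard_normal((4, 3))
mask = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]], dtype=float)

losses = []
for _ in range(100):
    w, loss = sparse_training_step(w, mask, x, y)
    losses.append(loss)
```

Because the forward pass always multiplies by the mask, the loss only depends on the unmasked coordinates, so plain gradient descent on the full parameter tensor still drives the loss down.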
In a second aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer program code for sparsification training of a neural network model, which, when executed by a processing device, performs the method of any embodiment of the foregoing first aspect.
In a third aspect, the present disclosure provides a data processing apparatus comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein: the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparsification training on a neural network model; the storage circuit is configured to store information including at least neural network parameters and a mask tensor; and the arithmetic circuit is configured to perform the following operations under the control of the control circuit: in forward propagation, sparsifying at least the neural network parameters based on the mask tensor to calculate the value of a loss function; in backward propagation, calculating neuron gradients and neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
In a fourth aspect, the present disclosure provides a chip comprising the data processing circuit of any embodiment of the foregoing third aspect.
In a fifth aspect, the present disclosure provides a board comprising the chip of any embodiment of the foregoing fourth aspect.
Through the data processing apparatus provided above, the method of sparsification training of a neural network model using the data processing apparatus, and related products, the embodiments of the present disclosure provide a scheme for sparsification during the training process of a neural network. This sparsification scheme can support sparsification in the forward propagation of training, for example, sparsification of the input channel dimension, or simultaneous sparsification of the input channel dimension and the output channel dimension. In some embodiments, when forward propagation performs simultaneous sparsification of the input and output channel dimensions, backward propagation can also support simultaneous sparsification of the input and output channel dimensions, thereby further optimizing performance. The sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can adopt different structured-sparse data flow structures for the related operations, so as to obtain optimized computation and IO performance.
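For concreteness, sparsifying the input channel dimension alone versus both channel dimensions can be sketched as below, a NumPy illustration under the common (Co, Ci, Kh, Kw) convolution-weight layout; the layout, function name, and mask values are assumptions for illustration, not fixed by the disclosure:

```python
import numpy as np

def mask_channels(w, ci_mask, co_mask=None):
    """Zero out whole input channels of a conv weight of shape
    (Co, Ci, Kh, Kw); optionally zero out whole output channels too."""
    out = w * ci_mask[None, :, None, None]        # input-channel sparsification
    if co_mask is not None:
        out = out * co_mask[:, None, None, None]  # output-channel sparsification
    return out

w = np.ones((8, 4, 3, 3))                 # (Co, Ci, Kh, Kw)
ci_mask = np.array([1.0, 0.0, 1.0, 0.0])  # keep input channels 0 and 2
co_mask = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

w_ci = mask_channels(w, ci_mask)             # input-channel dimension only
w_both = mask_channels(w, ci_mask, co_mask)  # both dimensions at once
```

Because whole channels are zeroed, the corresponding rows and columns of the unfolded matrix multiply can be skipped entirely, which is what makes this form of sparsity friendly to hardware memory access, as noted in the background section.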
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals refer to like or corresponding parts, wherein:
FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure;
FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure;
FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 7 illustrates a method performed in one iteration according to an embodiment of the present disclosure;
FIG. 8A illustrates the masking process of an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure;
FIG. 8B illustrates the masking process of an exemplary two-dimensional mask tensor according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating an exemplary mask vector update;
FIG. 10 is a schematic diagram illustrating an exemplary product-sum calculation process;
FIG. 11 is a flowchart illustrating a sparsification training method according to another embodiment of the present disclosure;
FIG. 12 is a flowchart illustrating a sparsification training method entering the mask-fixing stage according to another embodiment of the present disclosure; and
FIG. 13 is a schematic diagram illustrating several implementations of sparsification training of a neural network model according to the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth", which may be used in the claims, description, and drawings of the present disclosure, are used to distinguish different objects rather than to describe a specific order. The terms "comprising" and "including" used in the description and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this description and the claims, the term "if" may be interpreted contextually as "when", "once", "in response to determining", or "in response to detecting".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in FIG. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; a notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing power of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having huge off-chip storage, on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102. The computation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. According to different application scenarios, the external interface device 102 may have different interface forms, such as a PCIe interface.
The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and transmits data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
FIG. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning computations; it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read the data in the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including, but not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like, and the number thereof may be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The DRAM 204 is used to store data to be processed; it is a DDR memory, typically 16 GB or larger in size, and is used to save data of the computing device 201 and/or the processing device 203.
FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core. The single-core computing device 301 is used to process input data for computer vision, speech, natural language, data mining, and the like, and includes three main modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is used to acquire instructions from the processing device 203, and the instruction decode unit 312 decodes the acquired instructions and sends the decoding results to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used to store the convolution kernels of the deep learning network, that is, the weights; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores. The multi-core computing device 41 adopts a layered design: as a system on chip, it includes at least one cluster, and each cluster in turn includes multiple processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip / cluster / processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and multiple clusters 405.
There may be multiple external storage controllers 401, two of which are shown in the figure by way of example. They are used to access external storage devices, such as the DRAM 204 in FIG. 2, in response to access requests issued by the processor cores, so as to read data from or write data to off-chip memory. The peripheral communication module 402 is used to receive control signals from the processing device 203 through the interface device 202 and start the computing device 201 to perform tasks. The on-chip interconnect module 403 connects the external storage controller 401, the peripheral communication module 402, and the multiple clusters 405, and is used to transmit data and control signals among these modules. The synchronization module 404 is a global barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information. The multiple clusters 405 are the computing cores of the multi-core computing device 41; four are shown in the figure by way of example. With the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in FIG. 4, each cluster 405 includes multiple processor cores (IPU cores) 406 and one memory core (MEM core) 407.
Four processor cores 406 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 406. Their internal architecture is shown in FIG. 5. Each processor core 406 is similar to the single-core computing device 301 of FIG. 3 and likewise includes three main modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again. It should be particularly noted that the storage module 53 includes an input/output direct memory access module (IODMA) 533 and a move direct memory access module (MVDMA) 534. The IODMA 533 controls memory access between the NRAM 531/WRAM 532 and the DRAM 204 through a broadcast bus 409; the MVDMA 534 is used to control memory access between the NRAM 531/WRAM 532 and a storage unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is mainly used for storage and communication, that is, for storing shared data or intermediate results among the processor cores 406, and for carrying out communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capability for performing scalar operations.
The memory core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access module (CDMA) 410, and a global direct memory access module (GDMA) 411. The SRAM 408 assumes the role of a high-performance data transfer station: data reused among different processor cores 406 within the same cluster 405 does not need to be obtained from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 via the SRAM 408. The memory core 407 only needs to quickly distribute the reused data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and also greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409, the CDMA 410, and the GDMA 411 are used to carry out the communication among the processor cores 406, the communication among the clusters 405, and the data transmission between the clusters 405 and the DRAM 204, respectively. They are described separately below.
The broadcast bus 409 is used to carry out high-speed communication among the processor cores 406 within a cluster 405. The broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point (for example, single processor core to single processor core) data transmission; multicast is a communication mode that transmits one piece of data from the SRAM 408 to several specific processor cores 406; and broadcast, which transmits one piece of data from the SRAM 408 to all processor cores 406, is a special case of multicast.
The CDMA 410 is used to control memory access to the SRAM 408 among different clusters 405 within the same computing device 201.
The GDMA 411 cooperates with the external storage controller 401 to control memory access from the SRAM 408 of a cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be realized through two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel first transfers data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then transfers data between the SRAM 408 and the NRAM 431 or WRAM 432 via the MVDMA 534. Although on the surface the second channel requires more components and a longer data path, in fact, in some embodiments, the bandwidth of the second channel is much greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present disclosure may select a data transmission channel according to their own hardware conditions.
In other embodiments, the function of the GDMA 411 and the function of the IODMA 533 may be integrated into the same component. For convenience of description, the present disclosure regards the GDMA 411 and the IODMA 533 as different components; for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they fall within the protection scope of the present disclosure. Further, the function of the GDMA 411, the function of the IODMA 533, the function of the CDMA 410, and the function of the MVDMA 534 may also be realized by the same component.
The training of a neural network adjusts the parameters of each layer by inputting training samples, so that the results computed by the neural network are as close as possible to the true results. Neural network training includes forward propagation and backpropagation. Forward propagation, based on the existing model, passes the input training samples through the computation of each layer of the neural network, gradually extracting the input feature maps into abstract features. After forward propagation, an output value called the predicted value is obtained. Backpropagation uses a loss function computed from the predicted value obtained by forward propagation and the true value, and applies gradient descent: the partial derivative of the loss function with respect to each parameter is computed through the chain rule so as to update the parameters. In the chain rule, the derivatives of the error value with respect to the weights of the last layer of the neural network are computed first. These derivatives are called gradients, and they are then used to compute the gradients of the second-to-last layer of the neural network. This process is repeated until the gradient corresponding to each weight in the neural network is obtained. Finally, the corresponding gradient is subtracted from each weight in the neural network, thereby updating the weights once so as to reduce the error value. Training then continues with the updated parameters, and this is repeated many times until the computation result of forward propagation finally meets expectations.
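The forward/backward/update cycle described above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the disclosed apparatus: a single linear layer with a squared-error loss, and hypothetical data, shapes, and learning rate.

```python
import numpy as np

W = np.zeros((4, 3))                           # weights of one linear layer
x = np.array([1.0, 0.5, -0.5, 2.0])            # one training sample (hypothetical)
y_true = np.array([0.3, -1.2, 0.8])            # its true value (hypothetical)

lr = 0.1
for _ in range(200):                           # repeated iterations
    y_pred = x @ W                             # forward propagation: predicted value
    loss = 0.5 * np.sum((y_pred - y_true)**2)  # loss function (output error)
    top_diff = y_pred - y_true                 # chain rule starts at the last layer
    grad_W = np.outer(x, top_diff)             # weight gradient via the chain rule
    W -= lr * grad_W                           # subtract gradient: one weight update

print(np.allclose(x @ W, y_true))              # → True (prediction meets expectation)
```

After enough iterations the forward-propagation result matches the label, which is the convergence behavior the paragraph above describes.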
In the neural network training process, each time the neural network completes one forward propagation of a signal and one corresponding backpropagation of the error, the weights in the neural network are updated once using the gradients; this is called one iteration. To obtain a neural network whose accuracy meets expectations, a very large sample data set is required during training. In this case, it is impossible to input the entire sample data set into the computer at once. Therefore, to solve this problem, the sample data set needs to be divided into multiple blocks, and each block is passed to the computer; after each block of the data set is processed forward, the weights of the neural network are correspondingly updated once. When a complete sample data set has passed through the neural network for one forward processing and has returned one corresponding weight update, this process is called an epoch. In practice, passing the complete data set through the neural network once is not enough; the complete data set needs to be passed through the same neural network multiple times, that is, multiple epochs are required, to finally obtain a neural network whose accuracy meets expectations.
Based on the aforementioned hardware environment, this embodiment provides a solution for sparse training of a neural network model. In more detail, in each iteration comprising a forward propagation and a backpropagation process, the neural network parameters are sparsified at least in the forward propagation. The sparsification may be one-dimensional (for example, along the input channel dimension) or multi-dimensional, for example two-dimensional (for example, sparsifying the input channel dimension and the output channel dimension simultaneously). In some embodiments, when forward propagation sparsifies the input channel dimension and the output channel dimension simultaneously, backpropagation may also support simultaneous sparsification of the input channel dimension and the output channel dimension, thereby further optimizing performance. The sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can use different structured sparse data flow structures to perform the related operations, so as to obtain optimized computation and IO performance.
FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure.
The data processing apparatus 600 may be implemented, for example, in the computing device 201 of FIG. 2. As shown, the data processing apparatus 600 may include a control circuit 610, a storage circuit 620, and an operation circuit 630.
The function of the control circuit 610 may be similar to that of the control module 31 of FIG. 3. It may include, for example, an instruction fetch unit for acquiring instructions from, for example, the processing device 203 of FIG. 2, and an instruction decode unit for decoding the acquired instructions and sending the decoding results to the operation circuit 630 and the storage circuit 620 as control information.
In one embodiment, the control circuit 610 may be configured to control the storage circuit 620 and the operation circuit 630 to perform sparse training on a neural network model.
The storage circuit 620 may be configured to store information, which includes at least neural network parameters. In embodiments of the present disclosure, the storage circuit 620 may also store mask tensors. In this embodiment, the storage circuit may be, for example, the WRAM 332 or the NRAM 331 of FIG. 3.
The operation circuit 630 may be configured to perform, under the control of the control circuit 610, sparse training on the neural network model, so as to carry out the sparse training method shown in FIG. 7.
FIG. 7 shows a method performed in one iteration according to an embodiment of the present disclosure.
In step 710, in forward propagation, at least the neural network parameters are sparsified based on a mask tensor, so as to compute the value of the loss function.
In embodiments of the present disclosure, the mask tensor may take various forms.
In some embodiments, the mask tensor is a one-dimensional tensor, which sparsifies one specified dimension of the data. For example, the mask tensor sparsifies the input channel dimension of the neural network parameters.
In some embodiments, the sparsification may be structured sparsification, for example, selecting, according to a sparsity rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n. In one implementation, m = 4 and n = 2. In other implementations, when m = 4, n may also take other values, such as 1 or 3.
In this case, the mask tensor may be a one-dimensional vector, which can be divided into multiple intervals of length m; within each interval, n elements are 1, representing retained data positions, and m − n elements are 0, representing masked-out data positions.
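The n-of-m interval structure just described can be illustrated as follows. This sketch keeps, in each interval of m = 4 weights, the n = 2 elements of largest magnitude; magnitude-based selection is one plausible sparsity rule assumed here for illustration, since the disclosure does not fix how the n valid elements are chosen.

```python
import numpy as np

def make_1d_mask(weights, m=4, n=2):
    """Build a 0/1 mask with exactly n ones in every interval of length m,
    keeping the n largest-magnitude weights of each interval (assumed rule)."""
    assert weights.size % m == 0
    mask = np.zeros_like(weights, dtype=np.int8)
    for start in range(0, weights.size, m):
        interval = np.abs(weights[start:start + m])
        keep = np.argsort(interval)[-n:]       # indices of the n largest elements
        mask[start + keep] = 1
    return mask

w = np.array([0.1, -2.0, 0.3, 1.5,  0.0, 0.2, -0.1, 0.05])
mask = make_1d_mask(w)
print(mask)        # → [0 1 0 1 0 1 1 0]
```

Every length-4 interval of the resulting vector contains exactly two ones (retained positions) and two zeros (masked-out positions), matching the structure described above.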
In forward propagation, neurons (for example, training data) are operated on (for example, convolved) with the neural network parameters (for example, weights). The mask tensor can be used to apply the same sparsification to the neurons, so that the corresponding operation is performed based on the sparsified results.
FIG. 8A shows the masking process of an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure. Taking the convolution layer operation of a convolutional neural network as an example, FIG. 8A illustrates the sparsification-based convolution operation in forward propagation.
As shown in the figure, the dimension to be sparsified is the input channel dimension. The exemplary mask tensor is a vector of length 16, divided into 4 intervals of length 4, with 2 elements equal to 1 in each interval, as shown by the black squares in the figure. The input channel dimension of the weights is correspondingly segmented, with each segment corresponding to one interval of the mask tensor; the two interact (for example, through element-wise multiplication by the multipliers in the operation circuit 630) to obtain the masked weights. The data of the neurons along the input channel dimension is sparsified in the same way using the same mask tensor. The sparsified weights and the sparsified neurons then undergo the operation, for example a multiply-accumulate operation.
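A minimal numeric sketch of this step (hypothetical values, not the disclosed hardware): one length-16 mask with two ones per length-4 interval is applied to both the weights and the neurons along the input channel dimension, and the masked operands are then multiply-accumulated.

```python
import numpy as np

Ci = 16
mask = np.array([1,1,0,0, 0,1,1,0, 1,0,1,0, 0,0,1,1])  # 2 ones per interval of 4
w = np.arange(1.0, Ci + 1)          # weights along the input channel dimension
x = np.ones(Ci)                     # neurons along the input channel dimension

w_masked = w * mask                 # masked weights
x_masked = x * mask                 # same sparsification applied to the neurons
acc = np.dot(w_masked, x_masked)    # multiply-accumulate over Ci

# Only the 8 retained channels (w = 1,2,6,7,9,11,15,16) contribute:
print(acc)   # → 67.0
```

Half of the sixteen input channels are masked out, so half of the multiply-accumulate work is skipped, which is the computation saving described for FIG. 8A.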
In other embodiments, the mask tensor is a two-dimensional tensor, which sparsifies two specified dimensions of the data simultaneously. For example, the mask tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters simultaneously.
In some embodiments, the sparsification may be structured sparsification, for example, selecting, according to a sparsity rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n. In one implementation, m = 4 and n = 2. In other implementations, when m = 4, n may also take other values, such as 1 or 3.
In this case, the mask tensor may be a two-dimensional matrix, which can be divided into multiple m×m blocks; in each block, any row has n elements equal to 1 and m − n elements equal to 0, and any column likewise has n elements equal to 1 and m − n elements equal to 0, where "1" represents a retained data position and "0" represents a masked-out data position. In some embodiments, assuming m is 4 and n is 2, there are 90 such 4×4 mask matrices in total, and these mask matrices may be pre-stored in the DRAM 204.
FIG. 8B shows an exemplary masking process. Assume the input channels and output channels of a convolution layer form a 4×4 channel matrix 801 whose elements are a11 to a44; the channel matrix 801 constitutes the neural network parameters. The figure also shows one exemplary mask matrix 802 among the aforementioned ninety 4×4 mask matrices, which is used to perform mask sparsification on the channel matrix 801. Specifically, if the corresponding element in the mask matrix 802 is 1, the operation circuit 630 retains the element of the channel matrix 801; if the corresponding element in the mask matrix 802 is 0, the operation circuit 630 masks out the element of the channel matrix 801, setting its value to 0. Taking a11 in the channel matrix 801 as an example, its corresponding element in the mask matrix 802 is 0, so the corresponding element of the masked parameter matrix 803 is masked out and its value is 0. All element values of the masked parameter matrix 803 are obtained in this way. Since half of the elements of the channel matrix 801 are masked out, about half of the computation is saved.
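The count of ninety can be checked by enumeration. The sketch below (an illustrative check, not the disclosed apparatus) enumerates all 4×4 binary matrices whose every row and every column contains exactly two ones, confirms that there are 90 of them, and applies one of them to a hypothetical 4×4 parameter matrix.

```python
from itertools import combinations
import numpy as np

# All length-4 0/1 rows with exactly two ones (C(4,2) = 6 of them).
rows = [np.bincount(c, minlength=4) for c in combinations(range(4), 2)]

masks = []
for r0 in rows:
    for r1 in rows:
        for r2 in rows:
            for r3 in rows:
                m = np.array([r0, r1, r2, r3])
                if (m.sum(axis=0) == 2).all():  # every column also has two ones
                    masks.append(m)

print(len(masks))                 # → 90 such 4x4 mask matrices

# Applying one mask to a 4x4 channel matrix zeroes half of its elements.
a = np.arange(1, 17).reshape(4, 4).astype(float)
masked = a * masks[0]
print(int((masked != 0).sum()))   # → 8
```

Exactly half of the sixteen elements survive the mask, matching the roughly-halved computation noted for FIG. 8B.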
For each training sample, the operation circuit 630 masks the neural network parameters based on the mask tensor in forward propagation and then performs the computation, finally obtaining the value of the loss function, which corresponds to the output error of the neural network.
Returning to FIG. 7, in step 720, in backpropagation, the neuron gradients and the neural network parameter gradients are computed based on the loss function. In embodiments of the present disclosure, depending on the mask tensor used in forward propagation, sparsification may or may not be selectively applied in backpropagation.
In some embodiments, regardless of the mask tensor used in forward propagation, in backpropagation the neuron gradients and the neural network parameter gradients may be computed based on the unsparsified neural network parameters, and the neural network parameters are then updated based on the neural network parameter gradients.
Depending on the information stored in the storage circuit, in some implementations the unsparsified neural network parameters may be the neural network parameters before sparsification, or may be obtained by de-sparsifying the already sparsified neural network parameters. De-sparsification may include restoring, as indicated by the mask tensor, the sparsified neural network parameters to their corresponding positions before sparsification, and filling the remaining positions with predetermined information (for example, 0) to restore the shape before sparsification.
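De-sparsification as just described can be sketched as a scatter guided by the mask (illustrative code with hypothetical values):

```python
import numpy as np

mask = np.array([0, 1, 0, 1, 1, 1, 0, 0])      # 1 = retained position
sparse_vals = np.array([2.0, 1.5, 0.2, -0.1])  # the compressed valid elements

def desparsify(sparse_vals, mask, fill=0.0):
    """Scatter the retained elements back to their pre-sparsification
    positions; fill masked-out positions with predetermined info (0)."""
    dense = np.full(mask.shape, fill)
    dense[mask == 1] = sparse_vals
    return dense

print(desparsify(sparse_vals, mask).tolist())
# → [0.0, 2.0, 0.0, 1.5, 0.2, -0.1, 0.0, 0.0]
```

The result has the original pre-sparsification shape, with zeros at the positions the mask tensor marked as masked out.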
In other embodiments, when the mask tensor used in forward propagation is a two-dimensional tensor, sparsification may also be applied in backpropagation; that is, the neural network parameter gradients and the neuron gradients are computed based on the sparsified neural network parameters, and the neural network parameters are then updated based on the neuron gradients.
In the backpropagation of training, the computation of the neuron gradients and the weight gradients is involved, as follows:

bottom_diff = top_diff ⊛ W  (1)

ΔW = top_diff ⊛ bottom_data  (2)

where top_diff and bottom_diff are the neuron gradients, W is the weight of the current iteration, ΔW is the weight gradient computed in the current iteration (bottom_data denoting the input neurons of the layer), and ⊛ is the computation in backpropagation, which is similar to a convolution operation. With respect to the backpropagation direction, the bottom_diff of the previous layer is the top_diff of the current layer, and the bottom_diff of the current layer is the top_diff of the next layer, so that the error can be propagated backward layer by layer.
In the computation of formula (1), the layout of the weight W is different from that in the forward propagation process, so the accumulation direction in the operation is also different. In forward propagation, the weights are used in the (Co, Kh, Kw, Ci) dimension order or dimension shape, where Ci denotes the input channel dimension, Co the output channel dimension, Kh the convolution kernel height dimension, and Kw the convolution kernel width dimension. In the convolution operation of forward propagation, the operation results are accumulated along the Ci direction. In backpropagation, by contrast, the weights are used in the (Ci, Kh, Kw, Co) dimension order or dimension shape, and the operation results are accumulated along the Co direction. Therefore, in order to maintain the mathematical consistency of the gradients computed in backpropagation, the Ci and Co directions need to be sparsified simultaneously.
When sparsification is performed in backpropagation, a reverse mask tensor can be used to mask the neural network parameters to obtain the sparsified neural network parameters.
The reverse mask tensor may be consistent with the mask tensor used in forward propagation. However, because of the different weight layout in backpropagation mentioned above, the accumulation direction during the operation is also different, so the mask tensor from forward propagation cannot be used directly. In some implementations, the mask tensor used in forward propagation (also called the forward mask tensor) may be dimension-transformed before use. Various existing dimension transformation methods (for example, dimension transposition or data reshaping) may be used to convert the mask tensor into the layout required in backpropagation, for use as the reverse mask tensor. In other implementations, the mask tensor generation process used in forward propagation may be repeated once during backpropagation to generate the reverse mask tensor; the difference is that the mask computation along the Ci direction is performed in forward propagation, while the mask computation along the Co direction is performed in backpropagation.
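The dimension transformation mentioned here can be sketched as a simple axis permutation. The mask shape below is hypothetical, and swapping the Co and Ci axes is one plausible transposition consistent with the (Co, Kh, Kw, Ci) → (Ci, Kh, Kw, Co) layouts described above.

```python
import numpy as np

Co, Kh, Kw, Ci = 8, 3, 3, 16
rng = np.random.default_rng(1)
# Hypothetical forward mask laid out as (Co, Kh, Kw, Ci), matching the
# forward-propagation weight layout.
fwd_mask = (rng.random((Co, Kh, Kw, Ci)) < 0.5).astype(np.int8)

# Backpropagation uses the (Ci, Kh, Kw, Co) layout, so permute the axes
# to obtain the reverse mask tensor.
rev_mask = np.transpose(fwd_mask, (3, 1, 2, 0))

print(rev_mask.shape)                                # → (16, 3, 3, 8)
print(fwd_mask[2, 1, 0, 5] == rev_mask[5, 1, 0, 2])  # same element → True
```

No mask values change; only the memory layout is rearranged to match the accumulation direction used in backpropagation.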
继续图7,在步骤730中,基于神经网络参数梯度更新神经网络参数。Continuing with FIG. 7, in step 730, the neural network parameters are updated based on the neural network parameter gradients.
本披露实施例的稀疏化训练可以包括若干训练阶段,例如无掩码阶段、掩码调整阶段和掩码固定阶段。后面将结合附图详细描述各个阶段的处理。The sparsification training of embodiments of the present disclosure may include several training stages, such as a maskless stage, a mask adjustment stage, and a mask fixation stage. The processing of each stage will be described in detail later with reference to the accompanying drawings.
Depending on the stage that the sparsification training is in, the update to the neural network parameters may also differ.
In some embodiments, updating the neural network parameters may mean updating the unsparsified neural network parameters. For example, in the mask adjustment stage, the unsparsified neural network parameters are updated in each iteration. Further, in the mask adjustment stage, every K (K ≥ 1) iterations, an updated mask tensor may be generated based on the updated unsparsified neural network parameters, so that the mask tensor is optimized during training and performance is improved.
In other embodiments, updating the neural network parameters may mean updating the sparsified neural network parameters. For example, in the mask-fixed stage, since the mask tensor is already fixed, the sparsity pattern of the neural network parameters is fixed; that is, the valid data elements among the neural network parameters are fixed. The update of the neural network parameters may therefore update only the valid data elements, that is, update the sparsified neural network parameters. In one implementation, updating the neural network parameters may include: sparsifying the neuron gradients using the mask tensor; and updating the sparsified neural network parameters based on the sparsified neuron gradients.
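A minimal sketch of the update rule just described, assuming plain SGD: the gradient is first sparsified with the fixed mask, so only the valid (mask = 1) positions of the sparsified parameters are ever updated. The array values and learning rate are illustrative only.

```python
import numpy as np

def update_sparse(w_sparse, grad, mask, lr=0.1):
    # sparsify the gradient with the fixed mask, then apply a plain SGD step;
    # positions where mask == 0 receive no update and stay zero
    return w_sparse - lr * (grad * mask)

w = np.array([0.5, 0.0, -0.3, 0.0])   # sparsified parameters (zeros are masked out)
g = np.array([0.2, 0.9, -0.1, 0.4])   # raw gradient before sparsification
m = np.array([1, 0, 1, 0])            # fixed mask tensor
w_new = update_sparse(w, g, m)        # → [0.48, 0.0, -0.29, 0.0]
```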
The mask tensor fixed in the mask-fixed stage may be the mask tensor finally determined in the previous training stage (for example, the mask adjustment stage). Depending on the form of the mask tensor, there may be different ways to generate or update it.
When the mask tensor is a one-dimensional tensor, that is, a mask vector, the mask vector can only mask the parameters along a single dimension. The mask tensor may be generated based on the unsparsified neural network parameters. For example, from every m data elements along a specified dimension of the neural network parameters, n data elements with larger absolute values are selected as valid data elements, where m > n; and the mask tensor is generated based on the positions of these n valid data elements among the m data elements. In some implementations, the specified dimension may be the input channel dimension (Ci). Specifically, this embodiment divides the parameters into multiple intervals in units of a specific number m of parameters, and the parameters within each interval are sorted by absolute value. The elements of the mask tensor whose positions correspond to the n parameters with larger absolute values in each interval are set to 1, and the elements whose positions correspond to the m−n parameters with smaller absolute values are set to 0. The reason is that the mask adjustment parameters with larger absolute values carry more salient features and are more worth keeping for further computation. There are many ways to select the mask adjustment parameters with larger absolute values, and the present disclosure is not limited in this respect.
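The n-out-of-m selection just described can be sketched as follows (a hypothetical NumPy helper, not the disclosed hardware implementation): every group of m consecutive elements along the chosen dimension is examined, and the mask marks the n elements with the largest absolute values as valid.

```python
import numpy as np

def make_mask(params, m=4, n=2):
    # params: 1-D parameter array whose length is a multiple of m
    groups = params.reshape(-1, m)
    order = np.argsort(-np.abs(groups), axis=1)       # indices by descending |value|
    mask = np.zeros_like(groups, dtype=np.int8)
    np.put_along_axis(mask, order[:, :n], 1, axis=1)  # mark the n largest per group
    return mask.reshape(params.shape)

mask = make_mask(np.array([0.1, -0.9, 0.5, 0.2, -0.4, 0.3, 0.0, 0.8]))
# first group keeps -0.9 and 0.5; second group keeps -0.4 and 0.8
```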
FIG. 9 is a schematic diagram of an exemplary mask vector update, illustrating the aforementioned mask vector update by way of example. The figure shows a parameter vector 901 with 64 parameters in total, namely b01 to b64. In this step, each element value of the mask vector is updated so as to retain the mask adjustment parameters with larger absolute values and mask out those with smaller absolute values. The updated mask adjustment parameters are divided into multiple intervals in units of every 4 mask adjustment parameters (that is, m is 4). As shown in the figure, b01 to b04 form the first interval 902, b05 to b08 form the second interval 903, and b61 to b64 form the sixteenth interval 917. The mask adjustment parameters within each interval are then sorted by absolute value. Suppose that in the first interval 902 the absolute values are ordered b02 > b01 > b04 > b03, in the second interval 903 they are ordered b07 > b05 > b06 > b08, and in the sixteenth interval 917 they are ordered b64 > b63 > b61 > b62. Then, the elements of the mask vector whose positions correspond to the top 2 (that is, n is 2) mask adjustment parameters with larger absolute values in each interval are set to 1, and the elements whose positions correspond to the 2 (that is, m−n = 2) mask adjustment parameters with smaller absolute values are set to 0. Taking the first interval 902 as an example, the elements of the mask vector corresponding to b02 and b01 are set to 1, and the elements corresponding to b04 and b03 are set to 0. Each interval is adjusted in this way, finally yielding the updated mask vector 918. The updated mask vector 918 retains the updated mask adjustment parameters with larger absolute values and masks out those with smaller absolute values. In summary, every 4 mask adjustment parameters form an interval, and the element values of the mask vector are updated in a 2-out-of-4 manner for each interval.
This embodiment fully sorts the mask adjustment parameters within each interval to identify the n with larger absolute values and the m−n with smaller absolute values, but the present disclosure does not necessarily require a full sort; it suffices to identify the n with larger absolute values and the m−n with smaller absolute values, while the ordering within the n larger ones and within the m−n smaller ones is not necessary information. Taking the first interval 902 as an example, the present disclosure only needs to determine that b01 and b02 are the 2 with larger absolute values and that b03 and b04 are the 2 with smaller absolute values; the relative magnitudes of b01 versus b02 and of b03 versus b04 are not critical, and the sorting may be omitted to save computing resources.
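One way to realize the observation above in a NumPy-style environment (an assumed illustration, not the disclosed circuit): `argpartition` only separates the n largest-magnitude positions from the rest, without ordering either side, so no full sort of the interval is performed.

```python
import numpy as np

def topn_flags(group, n=2):
    # separate the n largest-|value| positions without fully sorting the interval
    keep = np.argpartition(-np.abs(group), n - 1)[:n]
    flags = np.zeros(len(group), dtype=np.int8)
    flags[keep] = 1
    return flags

flags = topn_flags(np.array([0.1, -0.9, 0.5, 0.2]))   # keeps -0.9 and 0.5
```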
If the mask tensor is multi-dimensional (for example, two-dimensional), the training data may be subjected to a product-sum computation with each masked parameter tensor to obtain a parameter evaluation value. The purpose of obtaining the parameter evaluation value is to measure how much information is retained after masking by the mask tensor. If the parameter evaluation value is high, not much information has been lost to the mask; the mask tensor reduces the computation load while retaining most of the information and is a high-quality mask tensor. Conversely, if the parameter evaluation value is low, too much information is lost after masking, and the mask tensor is not a high-quality one.
Specifically, a two-dimensional mask tensor may be determined as follows: a specific number of two-dimensional mask tensors are preset, and then one of these preset two-dimensional mask tensors is selected as the mask tensor to be used. Each dimension of these two-dimensional mask tensors includes m elements, of which n elements are 1 and m−n elements are 0, with m > n. As mentioned earlier, under the condition m = 4, n = 2, there are 90 such 4×4 mask matrices in total, so one of these 90 mask matrices is to be selected as the mask tensor.
Selecting one from this specific number (for example, 90) of two-dimensional mask tensors may include: masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain a masked parameter tensor; performing, based on each masked parameter tensor, a product-sum operation on the training data of the neural network layer to obtain a parameter evaluation value; and selecting the two-dimensional mask tensor that produces the largest of all the parameter evaluation values as the chosen mask tensor. In some implementations, the two specified dimensions may be the input channel dimension and the output channel dimension. The product-sum operation may also be regarded as a kind of convolution operation, except that it does not accumulate along the input channel dimension but only along the depth direction; it may therefore also be called a depthwise convolution operation, where the depth direction is the Kw×Kh dimensions.
FIG. 10 shows an exemplary product-sum computation process. Suppose the training data matrix 1001 is one of the training data in the training set. It would originally be computed with the channel matrix 801 of FIG. 8, but it is instead subjected to a product-sum computation with the masked parameter matrix 803 in order to gauge the amount of information remaining after masking. Such a product-sum computation can be performed in several ways; for example, the corresponding elements of the training data matrix 1001 and the masked parameter matrix 803 are multiplied and the absolute values of the products are summed to obtain the parameter evaluation value S1, that is:
S1 = |d31·a31| + |d41·a41| + |d12·a12| + |d42·a42| + |d13·a13| + |d23·a23| + |d24·a24| + |d34·a34|
As another example, the absolute values of the corresponding elements of the training data matrix 1001 and the masked parameter matrix 803 are multiplied and then summed to obtain the parameter evaluation value S2, that is:
S2 = |d31|·|a31| + |d41|·|a41| + |d12|·|a12| + |d42|·|a42| + |d13|·|a13| + |d23|·|a23| + |d24|·|a24| + |d34|·|a34|
The parameter evaluation value reflects the result of an absolute-value-like computation. The parameter evaluation value S1 or S2 indicates how much information is retained after masking; the higher the value, the more information is retained. In one application scenario, either the S1 or the S2 computation may be chosen, while in another application scenario the S1 and S2 computations may be used simultaneously; the present disclosure imposes no limitation in this respect.
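The two evaluation values can be sketched directly from the formulas above (with toy matrices, not the actual matrices of FIG. 10). Note that for real-valued data |d·a| = |d|·|a|, so S1 and S2 coincide mathematically; the two operation orders could differ only through rounding behavior in fixed-point hardware.

```python
import numpy as np

d = np.array([[1.0, -2.0], [0.5, 3.0]])   # toy training data block
a = np.array([[0.0, 4.0], [-1.0, 0.0]])   # toy masked parameter block (zeros masked out)

s1 = float(np.sum(np.abs(d * a)))          # multiply first, then take absolute values
s2 = float(np.sum(np.abs(d) * np.abs(a)))  # take absolute values first, then multiply
```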
All the mask tensors are applied and a parameter evaluation value is obtained for each. In the preceding example, this means that all 90 of the 4×4 mask matrices are applied and 90 parameter evaluation values are obtained. The mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, that is, the parameter mask tensor. There are many ways to select the largest parameter evaluation value; for example, all the parameter evaluation values may be sorted by magnitude to obtain the largest one, or a simple two-input comparator may be used, keeping the larger value for comparison with the next parameter evaluation value; after all 90 parameter evaluation values have been compared, the one remaining is the largest. If multiple mask tensors share the same largest parameter evaluation value, one of them may be selected based on a specific rule or hardware characteristic, for example the first in order, the last in order, the first retained, the last retained, or one chosen at random.
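The candidate enumeration and selection can be sketched as follows for m = 4, n = 2. The enumeration confirms the count of 90 stated above; `pick_mask` is a hypothetical helper that scores each candidate with the S1-style product-sum and keeps the highest-scoring one (tie-breaking here is simply "first encountered").

```python
import itertools
import numpy as np

def candidate_masks(m=4, n=2):
    # all m x m 0/1 matrices with exactly n ones in every row and every column
    rows = [r for r in itertools.product((0, 1), repeat=m) if sum(r) == n]
    for combo in itertools.product(rows, repeat=m):
        mat = np.array(combo, dtype=np.int8)
        if (mat.sum(axis=0) == n).all():
            yield mat

masks = list(candidate_masks())   # 90 candidates for m = 4, n = 2

def pick_mask(d, w):
    # keep the candidate whose masked product-sum retains the most information
    return max(masks, key=lambda mk: float(np.sum(np.abs(d * w * mk))))
```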
The mask tensor with the largest parameter evaluation value is the mask tensor that retains the most information, and this embodiment uses that mask tensor as the parameter mask tensor.
In this embodiment, the mask tensor may be updated in each iteration or in each generation of training. If, during training, the neural network parameters are updated after each training sample, the mask tensor is preferably updated in each iteration; if the neural network parameters are updated in each iteration, the parameter mask tensor is preferably updated only at the end of each generation of training.
Those skilled in the art will understand that, although the generation of the mask tensor is described above in terms of the update process, when the mask tensor is generated for the first time it may be generated in a similar manner, except that the neural network parameters on which it is based will differ. Depending on the stages included in the training process, when the mask tensor is first generated, the underlying neural network parameters may be randomly initialized parameters or may be neural network parameters determined after training in the maskless stage.
As mentioned earlier, the sparsification training of embodiments of the present disclosure may include several training stages, for example, a maskless stage, a mask adjustment stage, and a mask-fixed stage. The processing of each stage is described in detail below with reference to the accompanying drawings.
FIG. 11 shows an exemplary flowchart including a maskless stage and a mask adjustment stage. In the maskless stage, the processing device 203 only trains the neural network parameters, that is, it does not apply mask sparsification to them; only after the maskless stage ends and the mask adjustment stage begins are the parameters trained while the mask tensor is simultaneously updated.
As shown in FIG. 11, in step 1101, the control circuit 610 first sets entry into the maskless stage. In the maskless stage, this embodiment does not mask the neural network parameters; all elements of the parameters participate in training, and the parameter values may be randomly generated at the start of training. For ease of identification, the parameters participating in training during the maskless stage are called unmasked parameters.
In step 1102, the arithmetic circuit 630 computes the value of the loss function based on the unmasked parameters in forward propagation. In this step, the arithmetic circuit 630 computes the loss function in the existing manner: in forward propagation, the input training samples are computed through each layer of the neural network, the input feature maps are progressively extracted into abstract features, and the loss function is computed from the forward propagation result and the ground truth.
In step 1103, the arithmetic circuit 630 computes, in backpropagation, the partial derivatives of the loss function with respect to the unmasked parameters. The arithmetic circuit 630 employs gradient descent and computes, via the chain rule, the partial derivative of the loss function with respect to each unmasked parameter.
In step 1104, the arithmetic circuit 630 updates the unmasked parameters based on the partial derivatives and uses the updated unmasked parameters as the initial values of the mask adjustment parameters. First, the arithmetic circuit 630 updates the unmasked parameters of the entire neural network according to each unmasked parameter's influence on the error, multiplied by the step size. In this embodiment, the arithmetic circuit 630 may likewise update the unmasked parameters based on the partial derivatives for each training sample or in each iteration.
This embodiment may repeat steps 1102, 1103, and 1104 over a specific number of generations of training to update the unmasked parameters multiple times. After the final update, the updated unmasked parameters serve as the initial values of the mask adjustment parameters in the next stage.
In step 1105, the control circuit 610 sets entry into the mask adjustment stage, that is, it begins masking some of the parameters with the mask tensor. During training, the prior art trains only on all parameters (such as weights and biases) and usually does not mask them. The purpose of masking the parameters in this embodiment is to reduce parameter participation already during the training stage, avoiding overfitting and reducing the computation load, while also letting the mask tensor be updated along with the parameters during training so as to obtain a more ideal mask tensor. At the beginning of the mask adjustment stage, as stated earlier, the initial values of the mask adjustment parameters are the unmasked parameters as finally updated in the maskless stage, and the mask tensor may be obtained based on those finally updated unmasked parameters in the same manner as the mask tensor generation described above, which will not be repeated here.
In step 1106, the mask adjustment parameters are masked based on the mask tensor in forward propagation to compute the value of the loss function. In step 1107, the partial derivatives of the loss function with respect to the mask adjustment parameters are computed in backpropagation. In step 1108, the mask adjustment parameters are updated based on the partial derivatives. In step 1109, the mask tensor is updated based on the updated mask adjustment parameters. For these steps, reference may be made to the foregoing description in conjunction with FIG. 7, which will not be repeated here.
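Steps 1106 to 1109 can be illustrated with a toy NumPy regression. Everything here is hypothetical — the model, data, learning rate, and update period K = 10 — and the gradient of the masked loss is applied straight through the mask to the dense parameters, which is just one simple choice among the update variants this disclosure describes.

```python
import numpy as np

def make_mask(w, m=4, n=2):
    # 2-out-of-4 mask: keep the largest-|value| half of every group of 4
    groups = w.reshape(-1, m)
    mask = np.zeros_like(groups, dtype=np.int8)
    order = np.argsort(-np.abs(groups), axis=1)
    np.put_along_axis(mask, order[:, :n], 1, axis=1)
    return mask.reshape(w.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))            # toy training inputs
y = x @ rng.normal(size=(8,))           # regression targets
w = rng.normal(size=(8,))               # dense (unsparsified) parameters
mask = make_mask(w)

def masked_loss(w, mask):
    r = x @ (w * mask) - y
    return float(r @ r) / len(x)

loss_before = masked_loss(w, mask)
for step in range(200):
    r = x @ (w * mask) - y              # step 1106: masked forward pass
    grad = 2 * x.T @ r / len(x)         # step 1107: loss gradient (straight through the mask)
    w -= 0.05 * grad                    # step 1108: update the dense parameters
    if step % 10 == 9 and step < 150:
        mask = make_mask(w)             # step 1109: refresh the mask every K = 10 iterations
loss_after = masked_loss(w, mask)
```

Because masked-out positions still receive gradient, a weight that grows large can re-enter the mask at the next refresh, which is what allows the mask to adapt during this stage.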
This embodiment does not limit the number of generations of training performed in the maskless stage and the mask adjustment stage; those skilled in the art may arrange them according to the specific situation, and the numbers of generations of training in the two stages need not be the same.
Another embodiment of the present disclosure, likewise based on the aforementioned hardware environment, provides a scheme for sparsification training of a neural network model. It differs from the preceding embodiment in that training is divided into three stages: a maskless stage, a mask adjustment stage, and a mask-fixed stage. In the maskless stage, the processing device 203 only trains the parameters without masking them; in the mask adjustment stage, the processing device 203 uses the updated unmasked parameters as initial values and trains the parameters and the mask tensor simultaneously; in the mask-fixed stage, the processing device 203 uses the mask adjustment parameters and the mask tensor as updated in the mask adjustment stage as initial values and continues training the parameters without changing or updating the mask tensor.
The processes performed in the maskless stage and the mask adjustment stage of this embodiment are as shown in FIG. 11 and are therefore not repeated. After the mask-fixed stage is entered, the flow is as shown in FIG. 12.
In step 1201, the control circuit 610 sets entry into the mask-fixed stage. In the mask-fixed stage, the control circuit 610 uses the mask adjustment parameters as updated in the mask adjustment stage as the initial values of the parameters of this stage (hereinafter called the mask-fixed parameters). In this embodiment, the mask tensor was fully updated during the mask adjustment stage, so it is no longer updated in this stage; instead, the mask-fixed parameters are masked based on the mask tensor as finally updated in the mask adjustment stage, and training of the mask-fixed parameters continues.
This embodiment repeats the following steps over at least one generation of training.
In step 1202, the arithmetic circuit 630 masks the mask-fixed parameters in forward propagation based on the mask tensor as updated in the mask adjustment stage to compute the value of the loss function.
In step 1203, the arithmetic circuit 630 computes, in backpropagation, the partial derivatives of the loss function with respect to the mask-fixed parameters.
In step 1204, the update module 64 updates the mask-fixed parameters based on the partial derivatives.
For the above steps, reference may be made to the foregoing description in conjunction with FIG. 7, which will not be repeated here.
This embodiment divides training into three stages. In the maskless stage, no mask tensor masks the parameters; only the parameters are trained, so as to accelerate their convergence. In the mask adjustment stage, since the initial parameter values are no longer randomly generated but are the already-trained unmasked parameters, an ideal mask tensor can be obtained quickly. After the mask tensor has been updated, the mask-fixed stage is entered and the updated mask tensor is used to continue training the parameters; the finally trained parameters will better match the mask tensor.
In summary, those skilled in the art will understand that, when the present disclosure performs sparsification training on a neural network model, several implementations as shown in FIG. 13 are possible.
Implementation 1301 has only a mask adjustment stage. The initial parameter values W0 are randomly generated, the initial mask tensor M0 is determined based on the initial parameter values W0, and the parameters are trained while the mask matrix is simultaneously updated, to obtain the trained parameters Wf and the updated mask tensor Mf.
Implementation 1302 has only a maskless stage and a mask adjustment stage. In the maskless stage only the parameters are trained; the initial parameter values W0 are randomly generated, and the updated parameters W1 are obtained after training. In the mask adjustment stage the parameters are trained while the mask matrix is simultaneously updated; the initial parameter values of this stage are the updated parameters W1, and the initial mask tensor M0 is obtained using the updated parameters W1, finally yielding the trained parameters Wf and the updated mask tensor Mf.
Implementation 1303 has only a mask adjustment stage and a mask-fixed stage. In the mask adjustment stage the initial parameter values W0 are randomly generated, the initial mask tensor M0 is determined based on the initial parameter values W0, and the parameters are trained while the mask matrix is simultaneously updated, to obtain the updated parameters W1 and the updated mask tensor Mf. In the mask-fixed stage training continues with the parameters masked by the updated mask tensor Mf; the initial parameter values of this stage are the updated parameters W1, finally yielding the trained parameters Wf.
Implementation 1304 has a maskless stage, a mask adjustment stage, and a mask-fixed stage. In the maskless stage only the parameters are trained; the initial parameter values W0 are randomly generated, and the updated parameters W1 are obtained after training. In the mask adjustment stage the parameters are trained while the mask matrix is simultaneously updated; the initial parameter values of this stage are the updated parameters W1, and the initial mask tensor M0 is obtained using the updated parameters W1, finally yielding the updated parameters W2 and the updated mask tensor Mf. In the mask-fixed stage training continues with the parameters masked by the updated mask tensor Mf; the initial parameter values of this stage are the updated parameters W2, finally yielding the trained parameters Wf.
In addition to a maskless stage, a mask adjustment stage, and a mask-fixed stage, implementation 1305 also has other training stages (shown with dashed lines) between the maskless stage and the mask adjustment stage and between the mask adjustment stage and the mask-fixed stage. In the maskless stage only the parameters are trained; the initial parameter values W0 are randomly generated, and the updated parameters W1 are obtained after training. Thereafter, any training stage disclosed or not disclosed in the present disclosure may follow, in which the parameters are trained or the mask matrix is updated. Supposing that this stage is a mask-fixed stage, the initial parameter values of this stage are the updated parameters W1, and the initial mask tensor M0 is obtained using the updated parameters W1, to obtain the updated parameters W2.
The mask adjustment stage is then entered, and the parameters are trained while the mask matrix is simultaneously updated; the initial parameter values of this stage are the updated parameters W2, while the initial mask tensor remains the mask tensor M0, to obtain the updated parameters W3 and the updated mask tensor M1. Thereafter, any stage disclosed or not disclosed in the present disclosure may again follow, in which the parameters are trained or the mask matrix is updated. Supposing that this stage is a parameter-fixed stage, that is, the parameters are fixed and not trained and only the mask tensor is trained, the initial parameter values of this stage are the updated parameters W3, and the initial mask tensor is the updated mask tensor M1, to obtain the updated mask tensor Mf.
Finally, in the mask-fixed stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter values of this stage are the updated parameters W3, finally yielding the trained parameters Wf.
The various implementations shown in FIG. 13 are merely examples; after referring to the present disclosure, those skilled in the art can extend them to other implementations without creative effort, and such implementations all fall within the scope of the present disclosure.
The present disclosure does not limit the number of generations of training performed in each stage of the various implementations; those skilled in the art may arrange this according to the specific situation, and the number of generations of training performed in each stage need not be the same.
The foregoing embodiments need not necessarily execute all of the preset specific number of generations of training. The control circuit 610 may further determine whether, over two consecutive generations of training, the percentage of all element values of the parameter mask tensor that remain unchanged reaches a threshold. If so, the training result has essentially converged and further training would yield only a limited improvement in accuracy, so the mask adjustment stage is ended and training is completed. Such a threshold is generally set above 70%; that is, training stops once the percentage of unchanged element values of the parameter mask tensor exceeds 70%. The present disclosure does not limit the threshold, which may be 80%, 90%, 100%, or any other percentage.
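The convergence test described here can be sketched as a small helper (hypothetical, NumPy-based): it measures the fraction of mask elements that kept their value across two consecutive generations of training and compares it against the threshold.

```python
import numpy as np

def mask_converged(prev_mask, new_mask, threshold=0.70):
    # fraction of parameter-mask elements left unchanged between two generations
    unchanged = float(np.mean(prev_mask == new_mask))
    return unchanged >= threshold

# 3 of 4 elements unchanged → 75% ≥ 70%, so training may stop
stop = mask_converged(np.array([1, 1, 0, 0]), np.array([1, 1, 0, 1]))
```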
In the embodiments of this disclosure, to reduce the overhead of the sparsification and de-sparsification processes, different sparse data flow structures may be used for the relevant operations at different stages of training, so as to obtain optimal computation and I/O performance.
In some embodiments, in the mask adjustment stage, the mask tensor may be updated based on the updated neural network parameters, and the results of the update process may include the sparsification results of the neural network parameters (for example, sparsified weights) as well as the mask tensor. This mask tensor may be used to sparsify the training data. Subsequent operations may then be performed on the sparsified neural network parameters and the sparsified training data. During backpropagation in the mask adjustment stage, the neuron gradients and the neural network parameter gradients may be computed based on the current unsparsified neural network parameters, and the unsparsified neural network parameters updated accordingly. Alternatively, in backpropagation in the mask adjustment stage, the neural network parameters may be sparsified based on the mask tensor used in forward propagation, the neuron gradients and the neural network parameter gradients computed based on the sparsified neural network parameters, and the unsparsified neural network parameters updated accordingly. Sparsification during backpropagation is covered in the preceding description and is not repeated here.
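The forward part of one mask-adjustment iteration described above — deriving an updated mask from the current weights, then operating on sparsified weights and sparsified training data — can be sketched as follows. The 2-out-of-4 grouping along the last axis and the dot product standing in for the network's forward computation are illustrative assumptions.

```python
import numpy as np

def n_of_m_mask(w, m=4, n=2):
    """Illustrative n-out-of-m rule: within every m consecutive weights,
    keep the n with the largest absolute values (mask element 1),
    zeroing out the rest (mask element 0)."""
    groups = w.reshape(-1, m)
    mask = np.zeros_like(groups)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(w.shape)

# One forward pass of the mask adjustment stage (sketch):
w = np.array([0.1, -2.0, 0.3, 1.5, 0.0, 0.7, -0.2, 0.05])
x = np.ones_like(w)              # stand-in for one training-data vector
mask = n_of_m_mask(w)            # mask updated from the current weights
y = np.dot(w * mask, x * mask)   # sparsified weights times sparsified data
```

Only the unmasked positions contribute to `y`, which is what allows a sparse arithmetic unit to skip the corresponding multiply-accumulates.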
In other embodiments, in the mask-fixing stage, the mask tensor is fixed and need not be updated in real time. The fixed mask tensor can therefore be stored in the storage circuit for subsequent use. The fixed mask tensors may include the forward mask tensor used in forward propagation and the reverse mask tensor used in backpropagation. The neural network parameters may be stored under different schemes.
In one implementation, the storage circuit may store the unsparsified neural network parameters. In this case, in forward propagation, the neural network parameters must be sparsified with the stored mask tensor. In backpropagation, the unsparsified neural network parameters participate directly in the neuron gradient computation (for example, formula (1) above), and the unsparsified neural network parameters are updated and stored in the storage circuit again. Alternatively, in backpropagation, the reverse mask tensor stored in the storage circuit may be used to sparsify the unsparsified neural network parameters, the neuron gradients computed on that basis, and the unsparsified neural network parameters updated accordingly.
In another implementation, the storage circuit may store the sparsified neural network parameters. In this case, in forward propagation, the sparsified neural network parameters can participate directly in the forward computation, with no further sparsification needed. In backpropagation, the sparsified neural network parameters must be updated, so the mask tensor stored in the storage circuit may be used to sparsify the neural network parameter gradients, after which the sparsified neural network parameters are updated. For the neuron gradient computation during backpropagation, one may choose whether or not to apply sparsification. When sparsification is not applied, the sparsified neural network parameters must be de-sparsified, and the neuron gradients computed from the de-sparsified neural network parameters. When sparsification is applied, the reverse mask tensor stored in the storage circuit may be used to re-sparsify the de-sparsified neural network parameters, and the neuron gradients computed on that basis.
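For the second storage scheme above (sparsified parameters in storage), the parameter update in backpropagation can be sketched as follows. The function name, gradient values, and learning rate are illustrative assumptions.

```python
import numpy as np

def update_stored_sparse_params(w_sparse, grad_w, mask, lr=0.1):
    """Mask-fixing stage with sparsified parameters in storage: the
    parameter gradient is first sparsified with the stored mask tensor,
    so pruned positions remain exactly zero after the update."""
    return w_sparse - lr * (grad_w * mask)
```

Because the gradient is masked before the subtraction, the stored parameters stay in sparsified form and can be written back to the storage circuit without a separate re-sparsification step.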
Another embodiment of this disclosure is a computer-readable storage medium storing computer program code for sparsification training of a neural network model; when the computer program code is run by a processor, the methods of the foregoing embodiments are performed. In some implementation scenarios, the integrated units described above may be implemented in the form of software program modules. If implemented as a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solutions of this disclosure are embodied as a software product (for example, a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (for example, a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
In the foregoing embodiments, after training is complete, the computing apparatus 201 uses the updated parameter mask tensor to mask the trained parameters during inference, so as to control which region of the feature map input to the neural network model is processed. On the one hand, this achieves the expected accuracy; on the other hand, it reduces the amount of computation during inference, fulfilling the purpose of sparsification.
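The inference-time use of the mask described above can be sketched as a masked matrix product. The shapes, values, and function name are illustrative assumptions.

```python
import numpy as np

def sparse_inference(feature_map, w_trained, param_mask):
    """Apply the updated parameter mask tensor to the trained parameters
    before the product, so masked weights contribute nothing to the
    output and the corresponding multiply-accumulates can be skipped
    by sparsity-aware hardware."""
    return feature_map @ (w_trained * param_mask)
```

The result equals a dense product with the masked positions excluded; the compute saving comes from a sparse arithmetic unit never issuing those operations.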
Depending on the application scenario, the electronic device or apparatus of this disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of this disclosure may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare.
Further, the electronic device or apparatus of this disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the solutions of this disclosure may be applied to a cloud device (for example, a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a webcam). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative operation of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, this disclosure presents some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of this disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teachings herein, those skilled in the art will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in this disclosure may be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for the realization of one or more solutions of this disclosure. In addition, depending on the solution, the descriptions of the various embodiments place different emphases. In view of this, for parts not described in detail in one embodiment of this disclosure, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teachings herein, those skilled in the art will understand that several embodiments disclosed herein may also be realized in other ways not disclosed in this document. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical function, but other divisions are possible in an actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above with reference to the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In this disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of this disclosure. Furthermore, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit, or each unit may exist physically on its own.
In other implementation scenarios, the integrated units described above may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) may be implemented by a suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including a magnetic storage medium, a magneto-optical storage medium, and the like), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and so on.
The foregoing may be better understood in light of the following clauses:
Clause 1. A method, performed by a data processing apparatus, of sparsification training of a neural network model, comprising:
in forward propagation, sparsifying at least neural network parameters based on a mask tensor, so as to compute the value of a loss function;
in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
updating the neural network parameters based on the neural network parameter gradients.
Clause 2. The method according to clause 1, further comprising:
in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and updating the neural network parameters based on the neural network parameter gradients.
Clause 3. The method according to clause 2, further comprising:
de-sparsifying the sparsified neural network parameters to obtain the unsparsified neural network parameters.
Clause 4. The method according to clause 1, further comprising:
in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and updating the neural network parameters based on the neuron gradients.
Clause 5. The method according to clause 4, further comprising:
in backpropagation, sparsifying the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
Clause 6. The method according to any one of clauses 1-3, wherein the mask tensor is a one-dimensional tensor.
Clause 7. The method according to clause 6, wherein the one-dimensional tensor sparsifies the input-channel dimension of the neural network parameters.
Clause 8. The method according to any one of clauses 1-5, wherein the mask tensor is a two-dimensional tensor.
Clause 9. The method according to clause 8, wherein the two-dimensional tensor sparsifies the input-channel dimension and the output-channel dimension of the neural network parameters.
Clause 10. The method according to clause 5, wherein, when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by performing a dimension transformation on the mask tensor.
Clause 11. The method according to clause 1, wherein updating the neural network parameters comprises updating the unsparsified neural network parameters.
Clause 12. The method according to clause 11, further comprising:
generating the mask tensor based on the updated unsparsified neural network parameters.
Clause 13. The method according to clause 12, wherein, when the mask tensor is a one-dimensional tensor, the method generates the mask tensor as follows:
from every m data elements along a specified dimension of the neural network parameters, selecting the n data elements with the largest absolute values as valid data elements, where m > n; and
determining the mask tensor based on the positions of the n valid data elements among the m data elements.
Clause 14. The method according to clause 12, wherein, when the mask tensor is a two-dimensional tensor, the method generates the mask tensor as follows:
presetting a specific number of two-dimensional mask tensors, each dimension of which includes m elements, of which n elements are 1 and m−n elements are 0, where m > n;
masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor, to obtain masked parameter tensors; and
performing a product-and-sum operation on the training data of the neural network based on each masked parameter tensor, to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that produces the largest of all the parameter evaluation values as the mask tensor.
Clause 15. The method according to any one of clauses 1-14, wherein the method is performed over multiple iterations in a mask adjustment stage of the sparsification training.
Clause 16. The method according to clause 15, wherein the mask adjustment stage further comprises:
determining whether the percentage of elements of the mask tensor whose values remain unchanged over multiple consecutive training iterations reaches a threshold; and
if so, ending the mask adjustment stage.
Clause 17. The method according to clause 16, wherein the threshold is one of 80%, 90%, and 100%.
Clause 18. The method according to any one of clauses 1-10, wherein the method is performed over multiple iterations in a mask-fixing stage of the sparsification training, and the mask tensor is fixed as the mask tensor finally determined in a preceding stage.
Clause 19. The method according to clause 18, wherein updating the neural network parameters comprises updating the sparsified neural network parameters.
Clause 20. The method according to clause 19, wherein updating the neural network parameters further comprises:
sparsifying the neuron gradients with the mask tensor; and
updating the sparsified neural network parameters based on the sparsified neuron gradients.
Clause 21. The method according to any one of clauses 18-20, wherein, during the mask-fixing stage, the fixed mask tensor and the sparsified neural network parameters are stored.
Clause 22. The method according to any one of clauses 18-20, wherein, during the mask-fixing stage, the fixed mask tensor and the unsparsified neural network parameters are stored.
Clause 23. A computer-readable storage medium storing computer program code for sparsification training of a neural network model, wherein, when the computer program code is run by a processing apparatus, the method of any one of clauses 1 to 22 is performed.
Clause 24. A data processing apparatus comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein:
the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparsification training on a neural network model;
the storage circuit is configured to store information including at least neural network parameters and a mask tensor; and
the arithmetic circuit is configured to perform the following operations under the control of the control circuit:
in forward propagation, sparsifying at least the neural network parameters based on the mask tensor, so as to compute the value of a loss function;
in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
updating the neural network parameters based on the neural network parameter gradients.
Clause 25. The apparatus according to clause 24, wherein the arithmetic circuit is further configured to:
in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and
update the neural network parameters based on the neural network parameter gradients.
Clause 26. The apparatus according to clause 25, wherein the arithmetic circuit is further configured to:
de-sparsify the sparsified neural network parameters to obtain the unsparsified neural network parameters.
Clause 27. The apparatus according to clause 24, wherein the arithmetic circuit is further configured to:
in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and
update the neural network parameters based on the neuron gradients.
Clause 28. The apparatus according to clause 27, wherein the arithmetic circuit is further configured to:
in backpropagation, sparsify the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
Clause 29. The apparatus according to any one of clauses 24-26, wherein the mask tensor is a one-dimensional tensor.
Clause 30. The apparatus according to clause 29, wherein the one-dimensional tensor sparsifies the input-channel dimension of the neural network parameters.
Clause 31. The apparatus according to any one of clauses 24-28, wherein the mask tensor is a two-dimensional tensor.
Clause 32. The apparatus according to clause 31, wherein the two-dimensional tensor sparsifies the input-channel dimension and the output-channel dimension of the neural network parameters.
Clause 33. The apparatus according to clause 28, wherein, when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by the arithmetic circuit performing a dimension transformation on the mask tensor.
Clause 34. The apparatus according to clause 24, wherein the arithmetic circuit is further configured to:
update the unsparsified neural network parameters.
Clause 35. The apparatus according to clause 34, wherein the arithmetic circuit is further configured to:
generate the mask tensor based on the updated unsparsified neural network parameters.
Clause 36. The apparatus according to clause 35, wherein, when the mask tensor is a one-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
from every m data elements along a specified dimension of the neural network parameters, selecting the n data elements with the largest absolute values as valid data elements, where m > n; and
determining the mask tensor based on the positions of the n valid data elements among the m data elements.
Clause 37. The apparatus according to clause 35, wherein, when the mask tensor is a two-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
presetting a specific number of two-dimensional mask tensors, each dimension of which includes m elements, of which n elements are 1 and m−n elements are 0, where m > n;
masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor, to obtain masked parameter tensors; and
performing a product-and-sum operation on the training data of the neural network based on each masked parameter tensor, to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that produces the largest of all the parameter evaluation values as the mask tensor.
Clause 38. The apparatus according to any one of clauses 24-37, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask adjustment stage of the sparsification training.
Clause 39. The apparatus according to clause 38, wherein the arithmetic circuit is further configured to: in the mask adjustment stage, determine whether the percentage of elements of the mask tensor whose values remain unchanged over multiple consecutive training iterations reaches a threshold; and
if so, end the mask adjustment stage.
Clause 40. The apparatus according to clause 39, wherein the threshold is one of 80%, 90%, and 100%.
Clause 41. The apparatus according to any one of clauses 24-33, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask-fixing stage of the sparsification training, and the mask tensor is fixed as the mask tensor finally determined in a preceding stage.
Clause 42. The apparatus according to clause 41, wherein the arithmetic circuit is further configured to:
update the sparsified neural network parameters.
Clause 43. The apparatus according to clause 42, wherein the arithmetic circuit is further configured to:
sparsify the neuron gradients with the mask tensor; and
update the sparsified neural network parameters based on the sparsified neuron gradients.
Clause 44. The apparatus according to any one of clauses 41-43, wherein, during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the sparsified neural network parameters.
Clause 45. The apparatus according to any one of clauses 41-43, wherein, during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the unsparsified neural network parameters.
Clause 46. A chip comprising the data processing apparatus according to any one of clauses 24-45.
Clause 47. A board card comprising the chip according to clause 46.
The embodiments of this disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of this disclosure. The descriptions of the above embodiments are intended only to help in understanding the methods of this disclosure and their core ideas. Meanwhile, those of ordinary skill in the art, following the ideas of this disclosure, may make changes to the specific implementations and the scope of application. In summary, the contents of this description should not be construed as limiting this disclosure.

Claims (47)

  1. A method, performed by a data processing apparatus, for sparsification training of a neural network model, comprising:
    in forward propagation, performing sparsification on at least the neural network parameters based on a mask tensor, so as to compute a value of a loss function;
    in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
    updating the neural network parameters based on the neural network parameter gradients.
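By way of illustration only and not as claim language: the forward-propagation sparsification recited in claim 1 amounts to an element-wise product of the parameters with a binary mask tensor, broadcast over the masked dimension. A minimal sketch, assuming NumPy arrays and a hypothetical helper name `sparsify_forward`:

```python
import numpy as np

def sparsify_forward(params, mask):
    # Element-wise product with a binary mask; broadcasting lets a
    # one-dimensional mask cover, e.g., the input-channel dimension.
    return params * mask

# Toy (out_channels, in_channels) weights and a 1-D mask that keeps
# 2 of the 4 input channels.
w = np.array([[1.0, -2.0, 3.0, -4.0],
              [0.5, -0.5, 1.5, -1.5]])
mask = np.array([0.0, 1.0, 0.0, 1.0])
w_sparse = sparsify_forward(w, mask)
```

The loss value would then be computed from `w_sparse` in place of the dense weights during forward propagation.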
  2. The method of claim 1, further comprising:
    in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and
    updating the neural network parameters based on the neural network parameter gradients.
  3. The method of claim 2, further comprising:
    performing de-sparsification on the sparsified neural network parameters to obtain the unsparsified neural network parameters.
  4. The method of claim 1, further comprising:
    in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and
    updating the neural network parameters based on the neuron gradients.
  5. The method of claim 4, further comprising:
    in backpropagation, performing sparsification on the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
  6. The method of any one of claims 1-3, wherein the mask tensor is a one-dimensional tensor.
  7. The method of claim 6, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  8. The method of any one of claims 1-5, wherein the mask tensor is a two-dimensional tensor.
  9. The method of claim 8, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  10. The method of claim 5, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by performing a dimension transformation on the mask tensor.
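For illustration only: one natural reading of the "dimension transformation" in claim 10, assuming the two masked dimensions are the input and output channels, is a transpose of the two-dimensional mask, since backpropagation multiplies by the transposed weight matrix. The helper name `reverse_mask` is hypothetical:

```python
import numpy as np

def reverse_mask(mask2d):
    # Swap the two masked dimensions so the mask aligns with the
    # transposed weights used in backpropagation.
    return mask2d.T

m = np.array([[1.0, 1.0],
              [0.0, 0.0]])
rm = reverse_mask(m)
```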
  11. The method of claim 1, wherein updating the neural network parameters comprises updating the unsparsified neural network parameters.
  12. The method of claim 11, further comprising:
    generating the mask tensor based on the updated unsparsified neural network parameters.
  13. The method of claim 12, wherein when the mask tensor is a one-dimensional tensor, the method generates the mask tensor as follows:
    selecting, from every m data elements in a specified dimension of the neural network parameters, the n data elements with the largest absolute values as valid data elements, where m > n; and
    determining the mask tensor based on the positions of the n valid data elements among the m data elements.
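For illustration only: the n-of-m selection in claim 13 can be sketched as follows, assuming NumPy and a hypothetical helper name `make_1d_mask`; every group of m consecutive elements along the specified dimension keeps its n largest-magnitude elements:

```python
import numpy as np

def make_1d_mask(params, m, n, axis=-1):
    # Move the target axis last, split it into groups of m, and mark
    # the n largest-|value| positions of each group with 1.
    moved = np.moveaxis(params, axis, -1)
    groups = moved.reshape(-1, m)
    mask = np.zeros_like(groups)
    top = np.argsort(-np.abs(groups), axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return np.moveaxis(mask.reshape(moved.shape), -1, axis)

w = np.array([0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.3, 0.8])
mask = make_1d_mask(w, m=4, n=2)
```

With m=4 and n=2 this reproduces a 2-of-4 structured-sparsity pattern: the mask marks -0.9 and 0.4 in the first group of four, and 0.7 and 0.8 in the second.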
  14. The method of claim 12, wherein when the mask tensor is a two-dimensional tensor, the method generates the mask tensor as follows:
    presetting a specific number of two-dimensional mask tensors, wherein each dimension of a two-dimensional mask tensor comprises m elements, of which n elements are 1 and m-n elements are 0, where m > n;
    masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain masked parameter tensors;
    performing, with each masked parameter tensor, a product-sum operation on the training data of the neural network to obtain a parameter evaluation value; and
    selecting, as the mask tensor, the two-dimensional mask tensor that yields the largest of all the parameter evaluation values.
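For illustration only: a sketch of the candidate-evaluation loop in claim 14, with two assumptions not fixed by the claim language — the preset candidates are taken to be outer products of n-of-m binary vectors, and the "product-sum" score is taken to be the element-wise product-then-sum of the masked parameters with a training-data tile. The helper names are hypothetical:

```python
import numpy as np
from itertools import combinations

def nofm_vectors(m, n):
    # All length-m binary vectors with exactly n ones.
    vecs = []
    for idx in combinations(range(m), n):
        v = np.zeros(m)
        v[list(idx)] = 1.0
        vecs.append(v)
    return vecs

def select_2d_mask(params, data, m, n):
    # Score every preset candidate mask and keep the best one.
    best_mask, best_score = None, -np.inf
    for r in nofm_vectors(m, n):
        for c in nofm_vectors(m, n):
            cand = np.outer(r, c)                    # m x m candidate mask
            score = np.sum((params * cand) * data)   # product-sum evaluation
            if score > best_score:
                best_mask, best_score = cand, score
    return best_mask

w = np.arange(16, dtype=float).reshape(4, 4)
x = np.ones((4, 4))
mask = select_2d_mask(w, x, m=4, n=2)
```

On this toy input the winning mask keeps the 2x2 block of largest weights (rows 2-3, columns 2-3).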
  15. The method of any one of claims 1-14, wherein the method is performed over multiple iterations in a mask adjustment phase of the sparsification training.
  16. The method of claim 15, wherein the mask adjustment phase further comprises:
    determining whether the percentage of element values of the mask tensor that remain unchanged over multiple consecutive training iterations reaches a threshold; and
    if so, ending the mask adjustment phase.
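For illustration only: the stopping test of claim 16 can be sketched as tracking, over consecutive iterations, the fraction of mask elements that never changed; the helper name `mask_adjustment_done` is hypothetical:

```python
import numpy as np

def mask_adjustment_done(mask_history, threshold=0.9):
    # Fraction of elements identical across all recorded mask
    # snapshots; the adjustment phase ends once it reaches the
    # threshold.
    first = np.asarray(mask_history[0])
    unchanged = np.ones(first.shape, dtype=bool)
    for m in mask_history[1:]:
        unchanged &= (np.asarray(m) == first)
    return bool(unchanged.mean() >= threshold)

# Three consecutive mask snapshots; element 3 flipped once, so
# 3 of 4 elements (75%) stayed unchanged throughout.
history = [[1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 1, 0]]
```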
  17. The method of claim 16, wherein the threshold is one of 80%, 90%, and 100%.
  18. The method of any one of claims 1-10, wherein the method is performed over multiple iterations in a mask-fixing phase of the sparsification training, and the mask tensor is fixed to the mask tensor finalized in the preceding phase.
  19. The method of claim 18, wherein updating the neural network parameters comprises updating the sparsified neural network parameters.
  20. The method of claim 19, wherein updating the neural network parameters further comprises:
    performing sparsification on the neuron gradients using the mask tensor; and
    updating the sparsified neural network parameters based on the sparsified neuron gradients.
  21. The method of any one of claims 18-20, wherein during the mask-fixing phase, the fixed mask tensor and the sparsified neural network parameters are stored.
  22. The method of any one of claims 18-20, wherein during the mask-fixing phase, the fixed mask tensor and the unsparsified neural network parameters are stored.
  23. A computer-readable storage medium having stored thereon computer program code for sparsification training of a neural network model, wherein when the computer program code is run by a processing apparatus, the method of any one of claims 1-22 is performed.
  24. A data processing apparatus, comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein:
    the control circuit is configured to control the storage circuit and the arithmetic circuit so as to perform sparsification training on a neural network model;
    the storage circuit is configured to store information comprising at least neural network parameters and a mask tensor; and
    the arithmetic circuit is configured to perform, under the control of the control circuit, the following operations:
    in forward propagation, performing sparsification on at least the neural network parameters based on the mask tensor, so as to compute a value of a loss function;
    in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
    updating the neural network parameters based on the neural network parameter gradients.
  25. The apparatus of claim 24, wherein the arithmetic circuit is further configured to:
    in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and
    update the neural network parameters based on the neural network parameter gradients.
  26. The apparatus of claim 25, wherein the arithmetic circuit is further configured to:
    perform de-sparsification on the sparsified neural network parameters to obtain the unsparsified neural network parameters.
  27. The apparatus of claim 24, wherein the arithmetic circuit is further configured to:
    in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and
    update the neural network parameters based on the neuron gradients.
  28. The apparatus of claim 27, wherein the arithmetic circuit is further configured to:
    in backpropagation, perform sparsification on the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
  29. The apparatus of any one of claims 24-26, wherein the mask tensor is a one-dimensional tensor.
  30. The apparatus of claim 29, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  31. The apparatus of any one of claims 24-28, wherein the mask tensor is a two-dimensional tensor.
  32. The apparatus of claim 31, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  33. The apparatus of claim 28, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by the arithmetic circuit performing a dimension transformation on the mask tensor.
  34. The apparatus of claim 24, wherein the arithmetic circuit is further configured to:
    update the unsparsified neural network parameters.
  35. The apparatus of claim 34, wherein the arithmetic circuit is further configured to:
    generate the mask tensor based on the updated unsparsified neural network parameters.
  36. The apparatus of claim 35, wherein when the mask tensor is a one-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
    selecting, from every m data elements in a specified dimension of the neural network parameters, the n data elements with the largest absolute values as valid data elements, where m > n; and
    determining the mask tensor based on the positions of the n valid data elements among the m data elements.
  37. The apparatus of claim 35, wherein when the mask tensor is a two-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
    presetting a specific number of two-dimensional mask tensors, wherein each dimension of a two-dimensional mask tensor comprises m elements, of which n elements are 1 and m-n elements are 0, where m > n;
    masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain masked parameter tensors;
    performing, with each masked parameter tensor, a product-sum operation on the training data of the neural network to obtain a parameter evaluation value; and
    selecting, as the mask tensor, the two-dimensional mask tensor that yields the largest of all the parameter evaluation values.
  38. The apparatus of any one of claims 24-37, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask adjustment phase of the sparsification training.
  39. The apparatus of claim 38, wherein the arithmetic circuit is further configured to: in the mask adjustment phase, determine whether the percentage of element values of the mask tensor that remain unchanged over multiple consecutive training iterations reaches a threshold; and
    if so, end the mask adjustment phase.
  40. The apparatus of claim 39, wherein the threshold is one of 80%, 90%, and 100%.
  41. The apparatus of any one of claims 24-33, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask-fixing phase of the sparsification training, and the mask tensor is fixed to the mask tensor finalized in the preceding phase.
  42. The apparatus of claim 41, wherein the arithmetic circuit is further configured to:
    update the sparsified neural network parameters.
  43. The apparatus of claim 42, wherein the arithmetic circuit is further configured to:
    perform sparsification on the neuron gradients using the mask tensor; and
    update the sparsified neural network parameters based on the sparsified neuron gradients.
  44. The apparatus of any one of claims 41-43, wherein during the mask-fixing phase, the storage circuit is configured to store the fixed mask tensor and the sparsified neural network parameters.
  45. The apparatus of any one of claims 41-43, wherein during the mask-fixing phase, the storage circuit is configured to store the fixed mask tensor and the unsparsified neural network parameters.
  46. A chip comprising the data processing apparatus of any one of claims 24-45.
  47. A board card comprising the chip of claim 46.
PCT/CN2021/123879 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product WO2022095675A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/003,821 US20230259780A1 (en) 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011216903 2020-11-04
CN202011216903.5 2020-11-04
CN202011563259.9A CN114444680A (en) 2020-11-04 2020-12-25 Neural network sparsing device and method and related product
CN202011563259.9 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022095675A1 true WO2022095675A1 (en) 2022-05-12

Family

ID=81362120

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2021/123881 WO2022095676A1 (en) 2020-11-04 2021-10-14 Neural network sparsification device and method, and corresponding product
PCT/CN2021/123879 WO2022095675A1 (en) 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123881 WO2022095676A1 (en) 2020-11-04 2021-10-14 Neural network sparsification device and method, and corresponding product

Country Status (3)

Country Link
US (2) US20230259780A1 (en)
CN (2) CN114444680A (en)
WO (2) WO2022095676A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170917B (zh) * 2022-06-20 2023-11-07 Midea Group (Shanghai) Co., Ltd. Image processing method, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886164A (zh) * 2017-12-20 2018-04-06 Neusoft Corporation Convolutional neural network training and testing method, and training and testing apparatus
WO2020097217A1 (en) * 2018-11-06 2020-05-14 Emory University Systems and Methods for Training an Autoencoder Neural Network Using Sparse Data
CN111652366A (zh) * 2020-05-09 2020-09-11 Harbin Institute of Technology Combined neural network model compression method based on channel pruning and quantization training

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779786B1 (en) * 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration


Also Published As

Publication number Publication date
CN114444681A (en) 2022-05-06
WO2022095676A1 (en) 2022-05-12
US20220230069A1 (en) 2022-07-21
CN114444680A (en) 2022-05-06
US20230259780A1 (en) 2023-08-17

Similar Documents

Publication Publication Date Title
JP6880160B2 (en) Arithmetic logic unit and calculation method
CN111047022B (en) Computing device and related product
WO2023045445A1 (en) Data processing device, data processing method, and related product
CN112633490A (en) Data processing device and method for executing neural network model and related products
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
WO2022134873A1 (en) Data processing device, data processing method, and related product
WO2022095675A1 (en) Neural network sparsification apparatus and method and related product
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
Belabed et al. Low cost and low power stacked sparse autoencoder hardware acceleration for deep learning edge computing applications
CN113469337A (en) Compiling method for optimizing neural network model and related product
CN114692844A (en) Data processing device, data processing method and related product
WO2022134688A1 (en) Data processing circuit, data processing method, and related products
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
WO2022135599A1 (en) Device, board and method for merging branch structures, and readable storage medium
CN111047024A (en) Computing device and related product
WO2023236929A1 (en) Method and device for reading target data in data based on instruction
WO2022063217A1 (en) Device for forward fusion of neural network, board, method, and readable storage medium
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium
CN115599738A (en) Method for optimizing neural network model and related product
WO2022135600A1 (en) Computational neural network apparatus, card, method, and readable storage medium
CN114444678A (en) Apparatus, method, and storage medium for thinning neural network layer
CN115600657A (en) Processing device, equipment and method and related products thereof
CN114429194A (en) Device, board card, method and readable storage medium for processing neural network calculation
CN116484926A (en) Self-adaptive splitting optimization equipment and method
CN114692846A (en) Data processing device, data processing method and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888372

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888372

Country of ref document: EP

Kind code of ref document: A1