WO2022095675A1 - Neural network sparsification apparatus and method and related product - Google Patents

Neural network sparsification apparatus and method and related product Download PDF

Info

Publication number
WO2022095675A1
Authority
WO
WIPO (PCT)
Prior art keywords
mask
neural network
tensor
network parameters
parameter
Prior art date
Application number
PCT/CN2021/123879
Other languages
French (fr)
Chinese (zh)
Inventor
高钰峰
朱时兵
刘少礼
张曦珊
何得园
Original Assignee
安徽寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安徽寒武纪信息科技有限公司 filed Critical 安徽寒武纪信息科技有限公司
Priority to US18/003,821 priority Critical patent/US20230259780A1/en
Publication of WO2022095675A1 publication Critical patent/WO2022095675A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present disclosure relates generally to the field of processors. More specifically, the present disclosure relates to a method, device, chip, board and readable storage medium for sparse training a neural network model by a data processing device.
  • the network parameter sparsification is to reduce the redundant components in the larger network by appropriate methods, so as to reduce the network's demand for computation and storage space.
  • although the existing fine-grained parameter sparsification methods perform well in model accuracy, they are not friendly to hardware memory access; that is, on-chip and off-chip input/output has high overhead and low performance.
  • although structured sparsification based on channels and convolution kernels improves hardware performance, the loss of model accuracy is relatively large.
  • most of the existing sparsification algorithms are offline fine-tuning methods; that is, a pre-trained model is sparsified and then fine-tuned.
  • the offline fine-tuning method has many restrictions and cannot be used during model training, where more substantial performance gains could be obtained.
  • the solution of the present disclosure provides an apparatus, board, method and readable storage medium for sparse training of a neural network model.
  • the present disclosure discloses a method for sparse training of a neural network model performed by a data processing device, comprising: in forward propagation, performing sparse processing on at least the neural network parameters based on a mask tensor to calculate the value of the loss function; in backpropagation, calculating neuron gradients and neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
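The three steps of the claimed method can be sketched for a single linear layer. Everything here (the layer shape, the squared-error loss, the function name `sparse_training_step`) is an illustrative assumption, not the disclosed implementation:

```python
import numpy as np

def sparse_training_step(w, x, y, mask, lr=0.1):
    """One iteration of the claimed method, sketched for one linear layer.

    Forward: sparsify the parameters with the mask tensor, compute the loss.
    Backward: compute neuron and parameter gradients from the loss.
    Update: apply the parameter gradient to the (dense) parameters.
    """
    w_sparse = w * mask                  # sparse processing based on the mask tensor
    y_pred = x @ w_sparse                # forward propagation
    loss = 0.5 * np.sum((y_pred - y) ** 2)

    grad_out = y_pred - y                # dL/dy_pred
    grad_x = grad_out @ w_sparse.T       # neuron gradient (propagated backward)
    grad_w = x.T @ grad_out              # neural network parameter gradient
    w_new = w - lr * grad_w              # update the parameters with their gradient
    return w_new, loss, grad_x

# toy shapes: 4 input channels, of which the mask keeps the first 2
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
mask = np.array([[1], [1], [0], [0]]) * np.ones((4, 3))
x = rng.standard_normal((2, 4))
y = rng.standard_normal((2, 3))
w_new, loss, grad_x = sparse_training_step(w, x, y, mask)
```

Note that the update is applied to the dense parameters, so masked-out positions can still evolve across iterations, which matches training-time (rather than offline fine-tuning) sparsification.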
  • the present disclosure provides a computer-readable storage medium on which computer program code for sparse training of a neural network model is stored; when the computer program code is executed by a processing device, it performs the method of any embodiment of the aforementioned first aspect.
  • the present disclosure provides a data processing apparatus comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein: the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparse training on a neural network model; the storage circuit is configured to store information including at least neural network parameters and mask tensors; and the arithmetic circuit is configured to perform the following operations under the control of the control circuit: in forward propagation, sparsifying at least the neural network parameters based on the mask tensor to calculate the value of the loss function; in backpropagation, calculating the neuron gradients and the neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
  • the present disclosure provides a chip including the data processing apparatus of any embodiment of the foregoing third aspect.
  • the present disclosure provides a board including the chip of any embodiment of the foregoing fourth aspect.
  • the sparsification scheme can support sparsification in the forward propagation process of training, such as sparsification of the input channel dimension, or simultaneous sparsification of the input channel and output channel dimensions.
  • when forward propagation performs simultaneous sparsification of the input channel and output channel dimensions, simultaneous sparsification of both dimensions may also be supported in backpropagation, thereby further optimizing performance.
  • the sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can use different structured sparse data flow structures to perform related operations to obtain optimized operation and IO performance.
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure
  • FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure
  • FIG. 7 illustrates a method performed in an iterative process according to an embodiment of the present disclosure
  • FIG. 8A illustrates a masking process for an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure
  • FIG. 8B illustrates the masking process of an exemplary two-dimensional mask tensor according to an embodiment of the present disclosure
  • FIG. 9 is a schematic diagram illustrating an exemplary mask vector update
  • FIG. 10 is a schematic diagram illustrating an exemplary sum-of-product calculation process
  • FIG. 11 is a flowchart illustrating a sparse training method according to another embodiment of the present disclosure.
  • FIG. 12 is a flowchart illustrating a sparse training method entering a mask fixing stage according to another embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram illustrating several embodiments of the present disclosure when the neural network model is sparsely trained.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices.
  • the combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage capacity and computing capacity of the platform.
  • the board 10 in this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete a user-specified operation.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write the input data into the storage device on-chip of the computing device 201 .
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and the like.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • such processors include but are not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store the data to be processed; it is a DDR memory, typically 16 GB or larger in size, used to save the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • the single-core computing device 301 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the single-core computing device 301 includes three modules: a control module 31 , an arithmetic module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks; it comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 312 decodes the acquired instruction, and sends the decoding result to the operation module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, that is, matrix multiplication and convolution.
  • the storage module 33 is used to store or transport related data, including a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store the convolution kernel of the deep learning network, that is, weights;
  • the DMA 333 is connected to the DRAM 204 through the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores.
  • the multi-core computing device 41 adopts a layered structure design: it is a system-on-chip that includes at least one cluster, and each cluster includes multiple processor cores. In other words, the multi-core computing device 41 is organized as a hierarchy of system-on-chip, cluster, and processor core.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnect module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to perform tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401 , the peripheral communication module 402 and the multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • the plurality of clusters 405 are the computing cores of the multi-core computing device 41; four are exemplarily shown in the figure. With the development of hardware, the multi-core computing device 41 of the present disclosure may include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
  • each cluster 405 includes multiple processor cores (IPU cores) 406 and one memory core (MEM core) 407 .
  • the processor cores 406 are exemplarily shown as four in the figure, and the present disclosure does not limit the number of the processor cores 406 . Its internal structure is shown in Figure 5.
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and also includes three major modules: a control module 51 , an arithmetic module 52 and a storage module 53 .
  • the functions and structures of the control module 51 , the arithmetic module 52 and the storage module 53 are substantially the same as those of the control module 31 , the arithmetic module 32 and the storage module 33 , and will not be described again.
  • the storage module 53 includes an input/output direct memory access (IODMA) 533 and a move direct memory access (MVDMA) 534.
  • the IODMA 533 controls the memory access of the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control the memory access of the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
  • the storage core 407 is mainly used for storage and communication, that is, to store the data shared between the processor cores 406 or intermediate results, and to carry out the communication between the cluster 405 and the DRAM 204, the communication between clusters 405, the communication among the processor cores 406, and so on.
  • the memory core 407 has scalar operation capability for performing scalar operations.
  • the storage core 407 includes an SRAM 408 , a broadcast bus 409 , a cluster direct memory access (CDMA) 410 and a global direct memory access (GDMA) 411 .
  • the SRAM 408 assumes the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be fetched from the DRAM 204 by each processor core 406 separately, but is relayed among the processor cores 406 through the SRAM 408.
  • the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, so as to improve the communication efficiency between the cores and greatly reduce the on-chip and off-chip I/O accesses.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are used to perform the communication between the processor cores 406, the communication between the clusters 405 and the data transmission between the clusters 405 and the DRAM 204, respectively. They will be explained separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405.
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • unicast refers to point-to-point data transmission (e.g., from a single processor core to a single processor core); multicast is a communication method that transmits a piece of data from the SRAM 408 to specific processor cores 406; and broadcast, which transmits a copy of the data from the SRAM 408 to all processor cores 406, is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 within the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control the memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 431 or the WRAM 432 can be implemented through two channels.
  • the first channel is to directly connect the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then between the SRAM 408 and the NRAM 431 or WRAM 432 through the MVDMA 534.
  • a data transmission channel can be selected according to its own hardware conditions.
  • GDMA 411 and the functionality of IODMA 533 may be integrated in the same component.
  • GDMA 411 and IODMA 533 are regarded as different components.
  • the function of GDMA 411, the function of IODMA 533, the function of CDMA 410, and the function of MVDMA 534 can also be realized by the same component.
  • the training of the neural network is to adjust the parameters of each layer by inputting training samples, so that the results calculated by the neural network are as close as possible to the real results.
  • neural network training includes forward propagation and backpropagation. Forward propagation is based on the existing model: the input training samples are calculated layer by layer through the neural network, and the input feature map is gradually extracted into abstract features. After forward propagation, an output value called the predicted value is obtained. In backpropagation, a loss function is computed from the predicted value and the real value obtained by forward propagation, and the gradient descent method is used, via the chain rule, to calculate the partial derivative of the loss function with respect to each parameter so as to update the parameters. Under the chain rule, the derivative of the error with respect to the weights of the last layer of the neural network is calculated first.
  • when the sample data set is large, it needs to be divided into multiple blocks; each block is transmitted to the computing device in turn, and the weights of the neural network are updated correspondingly after each block of data is processed.
  • one complete pass over the whole sample data set is called an epoch.
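The block-wise update described above can be sketched as a minimal loop; the linear model, learning rate, and all names here are illustrative assumptions rather than the disclosed training procedure:

```python
import numpy as np

def run_epoch(w, samples, targets, block_size, lr=0.01):
    """One epoch: split the sample set into blocks and update the weights
    after each block is processed (a plain mini-batch gradient step)."""
    n = len(samples)
    for start in range(0, n, block_size):       # one block of the data set at a time
        xb = samples[start:start + block_size]
        yb = targets[start:start + block_size]
        grad = xb.T @ (xb @ w - yb) / len(xb)   # gradient for a linear model
        w = w - lr * grad                       # weight update after this block
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 3))
t = rng.standard_normal((16, 1))
w0 = np.zeros((3, 1))
w1 = run_epoch(w0, X, t, block_size=4)          # 16 samples = 4 blocks per epoch
```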
  • this embodiment provides a solution for sparse training of a neural network model.
  • the neural network parameters are sparsed at least in forward propagation.
  • the sparsification can be one-dimensional (e.g., the input channel dimension), or multi-dimensional, such as two-dimensional (e.g., the input channel dimension and the output channel dimension sparsified simultaneously).
  • simultaneous sparsification of the input channel and output channel dimensions may also be supported in backpropagation, thereby further optimizing performance.
  • the sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can use different structured sparse data flow structures to perform related operations to obtain optimized operation and IO performance.
  • FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure.
  • the data processing device 600 may be implemented, for example, in the computing device 201 of FIG. 2 . As shown, the data processing apparatus 600 may include a control circuit 610 , a storage circuit 620 and an arithmetic circuit 630 .
  • the control circuit 610 may be similar to the control module 31 in FIG. 3, and may include, for example, an instruction fetch unit for acquiring instructions from, for example, the processing device 203 in FIG. 2, and an instruction decode unit that decodes the acquired instructions and sends the decoded result to the operation circuit 630 and the storage circuit 620 as control information.
  • control circuit 610 may be configured to control the storage circuit 620 and the arithmetic circuit 630 to perform sparse training on the neural network model.
  • Storage circuitry 620 may be configured to store information, which may include at least neural network parameters.
  • the storage circuit 620 may also store mask tensors.
  • the storage circuit may be, for example, the WRAM 332 and NRAM 331 of FIG. 3 .
  • the operation circuit 630 may be configured to perform sparse training on the neural network model under the control of the control circuit 610, so as to perform the method for sparse training as shown in FIG. 7 .
  • Figure 7 illustrates a method performed during one iteration according to an embodiment of the present disclosure.
  • step 710 in forward propagation, at least the neural network parameters are sparsed based on the mask tensor to calculate the value of the loss function.
  • the mask tensor may exist in various situations.
  • in one case, the mask tensor is a one-dimensional tensor that sparsifies a specified dimension of the data; for example, the mask tensor sparsifies the input channel dimension of the neural network parameters.
  • the sparsification process may be a structured sparsification; for example, according to a sparsification rule, n data elements are selected as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n.
  • n can also take other values, such as 1 or 3.
  • the mask tensor can be a one-dimensional vector, which can be divided into multiple intervals of length m; each interval has n elements equal to 1, representing the retained data positions, and m-n elements equal to 0, representing the masked-out data positions.
  • FIG. 8A illustrates the masking process of an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure.
  • FIG. 8A takes the convolutional layer operation of the convolutional neural network as an example to illustrate the sparsification-based convolution operation in forward propagation.
  • the dimension to be sparse is the input channel dimension.
  • An exemplary mask tensor is a vector of length 16, divided into 4 bins of length 4, each bin has 2 elements of 1, as shown by the black squares in the figure.
  • the input channel dimension of the weights is divided into corresponding segments; each segment corresponds to one interval of the mask tensor, which determines the retained weight values.
  • the input channel dimension of the neuron is similarly sparsed using the same mask tensor.
  • the sparse weights and the sparse neurons are then operated, such as multiply-accumulate operations.
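The interval-wise selection and the masked multiply-accumulate described above might be sketched as follows. The magnitude-based selection rule is an assumption for illustration; the disclosure only requires that n of every m positions be marked 1:

```python
import numpy as np

def make_nm_mask(values, m=4, n=2):
    """Build a one-dimensional structured mask: in every interval of
    length m, keep the n entries of largest magnitude (an assumed
    selection rule) by setting their mask positions to 1."""
    mask = np.zeros_like(values, dtype=np.int64)
    for start in range(0, len(values), m):
        seg = values[start:start + m]
        keep = np.argsort(-np.abs(seg))[:n]     # n positions to retain
        mask[start + keep] = 1
    return mask

# 16 input channels, split into 4 intervals of length 4 with 2 ones each,
# matching the exemplary mask vector of FIG. 8A
rng = np.random.default_rng(2)
weights = rng.standard_normal(16)
neurons = rng.standard_normal(16)
mask = make_nm_mask(weights)

# the same mask sparsifies both the weights and the neurons along the
# input channel dimension, followed by a multiply-accumulate
acc = np.sum((weights * mask) * (neurons * mask))
```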
  • in another case, the mask tensor is a two-dimensional tensor that simultaneously sparsifies two specified dimensions of the data; for example, the mask tensor sparsifies both the input channel dimension and the output channel dimension of the neural network parameters.
  • the sparsification process may be a structured sparsification; for example, according to a sparsification rule, n data elements are selected as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n.
  • n can also take other values, such as 1 or 3.
  • the mask tensor can be a two-dimensional matrix, which can be divided into a plurality of m × m squares; in each square, any row has n elements equal to 1 and m-n elements equal to 0, and any column likewise has n elements equal to 1 and m-n elements equal to 0, where "1" represents a retained data position and "0" represents a masked data position.
  • in this embodiment, m is 4 and n is 2.
  • FIG. 8B shows an exemplary masking process. It is assumed that the input channels and output channels of the convolution layer form a 4 × 4 channel matrix 801 whose elements are a11 to a44, and the channel matrix 801 holds the neural network parameters.
  • the figure also shows an exemplary mask matrix 802, one of the aforementioned 90 4 × 4 mask matrices, used for mask-sparsifying the channel matrix 801. Specifically, if the corresponding element in the mask matrix 802 is 1, the operation circuit 630 retains the element of the channel matrix 801; if the corresponding element in the mask matrix 802 is 0, the operation circuit 630 masks that element of the channel matrix 801, setting its value to 0.
  • for example, where the corresponding element in the mask matrix 802 is 0, the corresponding element in the masked parameter matrix 803 is masked and its value is 0. All element values of the masked parameter matrix 803 are obtained in this way. Since half of the elements of the channel matrix 801 are masked out, about half of the computation is saved.
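As a sanity check on the figure of 90 mentioned above, one can enumerate all 4 × 4 binary matrices with exactly two 1s in every row and every column, then apply one of them element-wise to a stand-in channel matrix (the variable names are illustrative):

```python
import itertools
import numpy as np

# every length-4 row containing exactly two 1s (there are C(4,2) = 6 of them)
rows_with_two_ones = [r for r in itertools.product((0, 1), repeat=4) if sum(r) == 2]

# keep only the matrices whose columns also each sum to 2
masks = [
    np.array(m)
    for m in itertools.product(rows_with_two_ones, repeat=4)
    if all(sum(col) == 2 for col in zip(*m))
]

# element-wise masking: a 0 in the mask matrix zeroes the channel element
channel = np.arange(1, 17).reshape(4, 4)   # stand-in for a11 .. a44
masked = channel * masks[0]
```

Half of the 16 elements survive in every such mask, which is the "about half of the computation is saved" claim above.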
  • for each training sample, the arithmetic circuit 630 masks the parameters of the neural network based on the mask tensor in forward propagation, then performs the calculation, and finally obtains the value of the loss function, which corresponds to the output error of the neural network.
  • step 720 in backpropagation, neuron gradients and neural network parameter gradients are calculated based on the loss function.
  • the sparsification process may or may not be selectively applied in the back propagation.
  • when sparsification is not applied in backpropagation, the neuron gradients and neural network parameter gradients may be computed based on the unsparsified neural network parameters, and the neural network parameters are then updated based on the neural network parameter gradients.
  • the unsparsified neural network parameters may be the neural network parameters before sparsification, or may be obtained by applying anti-sparse (de-sparsification) processing to the sparsified parameters.
  • the anti-sparse processing may include restoring the sparsified neural network parameters to their corresponding positions before sparsification according to the indication of the mask tensor, and filling the remaining positions with predetermined information (e.g., 0), so as to restore the shape before the sparsification.
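The anti-sparse processing just described amounts to a scatter guided by the mask tensor; a minimal sketch (function and variable names are hypothetical):

```python
import numpy as np

def desparsify(compact, mask, fill=0.0):
    """Restore sparsified parameters: place the retained values back at
    the positions the mask tensor marks with 1, and fill the remaining
    positions with predetermined information (0 here), recovering the
    shape before sparsification."""
    out = np.full(mask.shape, fill, dtype=float)
    out[mask == 1] = compact    # scatter retained values to their original slots
    return out

mask = np.array([1, 0, 1, 0, 0, 1, 1, 0])    # 1 = retained position
compact = np.array([0.5, -1.2, 2.0, 0.3])    # the 4 retained parameter values
restored = desparsify(compact, mask)
```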
  • alternatively, sparsification can also be applied in backpropagation; that is, the neural network parameter gradients and neuron gradients are calculated based on the sparsified neural network parameters, and the neural network parameters are then updated based on the neural network parameter gradients.
  • top_diff and bottom_diff are the neuron gradients at the output side and the input side of the layer, respectively
  • W is the weight of this iteration
  • ⁇ W is the weight gradient calculated by this iteration
  • the weight gradient computation in backpropagation is similar to the convolution operation.
  • the bottom_diff of the previous layer is the top_diff of the current layer, and the bottom_diff of the current layer is the top_diff of the next layer, so that the error is propagated backward layer by layer.
  • in backpropagation, the layout of the weight W differs from that in forward propagation, so the accumulation direction in the operation also differs.
  • in forward propagation, the weights are used in the (Co, Kh, Kw, Ci) dimension order or dimension shape, where Ci is the input channel dimension, Co is the output channel dimension, Kh is the convolution kernel height dimension, and Kw is the convolution kernel width dimension.
  • the operation results are accumulated in the Ci direction.
  • in backpropagation, the weights are used in the (Ci, Kh, Kw, Co) dimension order or dimension shape.
  • the result of the operation is accumulated in the Co direction. Therefore, to keep the gradients mathematically consistent in backpropagation, both the Ci and Co directions need to be sparsified simultaneously.
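Sparsifying both the Ci and Co directions of a weight tensor, as the text requires, can be sketched as follows (a minimal NumPy illustration of a 2-of-4 scheme applied along each of the two channel axes of a (Co, Kh, Kw, Ci) weight; the helper name and shapes are our assumptions):

```python
import numpy as np

def two_of_four_mask(w, axis):
    """2-of-4 mask along one axis: keep the 2 largest-|value| elements in
    every group of 4 along `axis` (axis length assumed divisible by 4)."""
    moved = np.moveaxis(np.abs(w), axis, -1)
    grouped = moved.reshape(*moved.shape[:-1], -1, 4)
    mask = np.zeros_like(grouped)
    # indices of the 2 largest |values| per group of 4
    np.put_along_axis(mask, np.argsort(-grouped, axis=-1)[..., :2], 1.0, axis=-1)
    return np.moveaxis(mask.reshape(moved.shape), -1, axis)

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 3, 3, 8))   # (Co, Kh, Kw, Ci)

# mask the Ci direction (accumulated over in forward propagation) and the
# Co direction (accumulated over in backpropagation)
m_ci = two_of_four_mask(w, axis=3)
m_co = two_of_four_mask(w, axis=0)
mask = m_ci * m_co
sparse_w = w * mask
```

Each single-axis mask keeps exactly half the elements; their product keeps at most half, sparsifying both accumulation directions at once.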
  • a reverse mask tensor can be used to mask the neural network parameters to obtain sparsely processed neural network parameters.
  • ideally, the reverse mask tensor would be identical to the mask tensor used in forward propagation; however, because of the different weight layouts and accumulation directions in backpropagation described above, the forward mask tensor cannot be used directly.
  • the mask tensor (or called forward mask tensor) used in forward propagation can be dimensionally transformed before being used.
  • various existing dimension transformation methods (e.g., dimension transposition, data reshaping) can be used to convert the mask tensor into the layout required in backpropagation, and the result is used as the reverse mask tensor.
  • the mask tensor generation process used in the forward propagation process can also be repeated during the backpropagation process to generate the reverse mask tensor.
  • for example, the mask calculation in the Ci direction is performed during forward propagation, while the mask calculation in the Co direction is performed during backpropagation.
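The dimension-transposition route for obtaining the reverse mask tensor can be sketched as follows (a minimal illustration; the concrete shapes Co=4, Kh=Kw=3, Ci=8 are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# forward mask laid out like the forward weights: (Co, Kh, Kw, Ci)
forward_mask = rng.integers(0, 2, size=(4, 3, 3, 8))

# backpropagation consumes the weights in (Ci, Kh, Kw, Co) order, so one way
# to obtain the reverse mask is a dimension transposition of the forward mask
reverse_mask = forward_mask.transpose(3, 1, 2, 0)   # (Ci, Kh, Kw, Co)
```

The transposition swaps the Ci and Co axes while leaving Kh and Kw in place, so each mask element still governs the same weight element, just addressed in the backward layout.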
  • step 730 the neural network parameters are updated based on the neural network parameter gradients.
  • the sparsification training of embodiments of the present disclosure may include several training stages, such as a maskless stage, a mask adjustment stage, and a mask fixation stage.
  • the updates to the neural network parameters can also be different.
  • in some training stages, updating the neural network parameters means updating the unsparsified parameters. For example, in the mask adjustment stage, the unsparsified neural network parameters are updated in each iteration. Further, in the mask adjustment stage, every K (K≥1) iterations, an updated mask tensor can be generated based on the updated unsparsified neural network parameters, so that the mask tensor is continuously optimized during training to improve performance.
  • updating the neural network parameters may be updating the sparse-processed neural network parameters.
  • in the mask fixing stage, the sparsity pattern of the neural network parameters is fixed, that is, the valid data elements in the neural network parameters are fixed, so the parameter update may touch only the valid data elements; in other words, only the sparsified neural network parameters are updated.
  • in these embodiments, updating the neural network parameters may include: sparsifying the neuron gradients using the mask tensor; and updating the sparsified neural network parameters based on the sparsified neuron gradients.
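A minimal sketch of the mask-fixing-stage update is shown below. For brevity it masks the parameter gradients directly (the text speaks of sparsifying the neuron gradients from which these are computed); either way, only the valid elements of the parameters ever change. The function name and the toy values are our assumptions:

```python
import numpy as np

def update_sparse_params(params, grads, mask, lr=0.01):
    """Mask-fixing-stage update: sparsify the gradient with the fixed mask,
    then apply a plain gradient-descent step, so only the valid (unmasked)
    elements of the sparsified parameters are updated."""
    sparse_grads = grads * mask          # zero out gradients of masked elements
    return params - lr * sparse_grads

mask = np.array([1.0, 0.0, 1.0, 0.0])
params = np.array([0.5, 0.0, -0.3, 0.0])   # already sparsified: zeros at masked slots
grads = np.array([0.2, 9.9, -0.1, 7.7])    # raw gradients, including masked slots
new_params = update_sparse_params(params, grads, mask, lr=0.1)
```

The masked slots stay exactly zero after the update, so the fixed sparsity pattern is preserved across iterations.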
  • the mask tensor fixed in the mask fixing stage may be the mask tensor finally determined in the previous training stage (eg, the mask adjustment stage).
  • each element of the mask vector masks a single parameter.
  • a mask tensor can be generated based on the unsparsified neural network parameters. For example, from every m data elements along a specified dimension of the neural network parameters, select the n elements with the larger absolute values as valid data elements, where m>n; and generate the mask tensor based on the positions of the n valid data elements among the m data elements.
  • the aforementioned specified dimension may be the input channel dimension (Ci).
  • specifically, the parameters are divided into multiple intervals of m parameters each, and the parameters in each interval are sorted by absolute value; the mask elements corresponding to the n parameters with the larger absolute values in each interval are set to 1, and those corresponding to the m-n parameters with smaller absolute values are set to 0, because a mask adjustment parameter with a larger absolute value carries a more pronounced feature and is more worth keeping in the computation.
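The n-of-m mask generation just described can be sketched as follows (a minimal NumPy illustration with m=4, n=2; the function name and sample values are our assumptions):

```python
import numpy as np

def make_mask(params, m=4, n=2):
    """Build a mask that keeps, in every interval of m consecutive
    parameters, the n with the largest absolute values (2-of-4 here)."""
    flat = params.reshape(-1, m)
    mask = np.zeros_like(flat)
    # positions of the n largest |values| per interval
    top = np.argsort(-np.abs(flat), axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(params.shape)

w = np.array([0.1, -0.9, 0.5, 0.2,   # interval 1: keep -0.9 and 0.5
              0.3, -0.1, 0.8, -0.7]) # interval 2: keep 0.8 and -0.7
mask = make_mask(w)
```

Each interval of four parameters contributes exactly two ones to the mask, matching the 4-pick-2 update described for FIG. 9.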
  • FIG. 9 is a schematic diagram of an exemplary mask vector update, illustrating the aforementioned update mask vector by way of example.
  • the figure shows a parameter vector 901 with 64 parameters in total, namely b01 to b64.
  • each element value of the mask vector is updated to keep the mask adjustment parameters with larger absolute values and mask out those with smaller absolute values.
  • the updated mask adjustment parameters are divided into multiple intervals in units of every 4 mask adjustment parameters (that is, m is 4).
  • b01 to b04 form the first interval 902, b05 to b08 form the second interval 903, and so on until b61 to b64 form the sixteenth interval 917.
  • the mask adjustment parameters in each interval are sorted by absolute value. Assume the order in the first interval 902 is b02 > b01 > b04 > b03, the order in the second interval 903 is b07 > b05 > b06 > b08, and the order in the sixteenth interval 917 is b64 > b63 > b61 > b62.
  • after sorting, the mask elements at the positions of the first 2 (that is, n is 2) mask adjustment parameters with larger absolute values in each interval are set to 1, and the mask elements at the remaining positions are set to 0. Taking the first interval 902 as an example, the elements corresponding to b02 and b01 in the mask vector are set to 1, and the elements corresponding to b04 and b03 are set to 0.
  • each interval is adjusted in this way, and finally the updated mask vector 918 is obtained. The updated mask vector 918 retains the mask adjustment parameters with larger absolute values and masks those with smaller absolute values. In summary, every 4 mask adjustment parameters form an interval, and the element values of the mask vector are updated in a 4-pick-2 manner per interval.
  • in this example, the mask adjustment parameters in each interval are fully sorted to identify the n with larger absolute values and the m-n with smaller absolute values, but the present disclosure does not require a complete sort; it suffices to identify which n items have the larger absolute values and which m-n items have the smaller ones, while the ordering within each group is not needed. Taking the first interval 902 as an example, it is only necessary to determine that b01 and b02 are the two with larger absolute values and b03 and b04 the two with smaller absolute values; the relative order of b01 and b02, or of b03 and b04, is not critical, so the full sort can be omitted to save computing resources.
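The partial selection described above (identifying the n larger-|value| items without fully sorting) is exactly what a partition-based selection provides; a minimal sketch with NumPy's `argpartition` (the sample interval is our assumption):

```python
import numpy as np

group = np.array([0.4, -0.9, 0.1, 0.6])   # one interval of m=4 parameters
n = 2

# argpartition finds the positions of the n largest |values| without fully
# sorting the interval, which is all the scheme above requires
keep = np.argpartition(-np.abs(group), n)[:n]
mask = np.zeros(4)
mask[keep] = 1.0
```

Here the two largest-magnitude entries (-0.9 and 0.6) are kept; their relative order is never computed, saving the comparisons a full sort would spend.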
  • the training data can be multiplied element-wise with each masked parameter tensor to obtain parameter evaluation values.
  • the purpose of the parameter evaluation value is to measure the amount of information retained after masking by the mask tensor. A high evaluation value means that the mask has not lost too much information, so the mask tensor reduces the amount of computation while retaining most of the information and is a high-quality mask tensor; conversely, a low evaluation value indicates that too much information is lost after masking, and the mask tensor is not a high-quality one.
  • the two-dimensional mask tensor can be determined as follows: a specific number of two-dimensional mask tensors are preset, and then one of the preset two-dimensional mask tensors is selected as the mask tensor to be used.
  • Each dimension of these two-dimensional mask tensors includes m elements, where n elements are 1, m-n elements are 0, and m>n.
  • selecting one of this specific number (e.g., 90) of two-dimensional mask tensors may include: masking the two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain a masked parameter tensor; performing a product-sum operation between each masked parameter tensor and the training data of the neural network layer to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that produces the largest of all parameter evaluation values as the mask tensor to use.
  • the two dimensions specified above may be the input channel dimension and the output channel dimension.
  • the above product-sum operation can also be regarded as a convolution that does not accumulate along the input channel dimension but only in the depth direction, so it may also be called a depthwise convolution, where the depth direction is the Kw×Kh dimension.
  • FIG. 10 shows an exemplary sum-of-product calculation process.
  • the training data matrix 1001 is one piece of training data in the training set; it would originally be computed with the channel matrix 801 in FIG. 8, and is now instead multiplied with the masked parameter matrix 803 to measure the amount of information remaining after masking.
  • in one scheme, the corresponding elements of the training data matrix 1001 and the masked parameter matrix 803 are multiplied, and the absolute values of the products are summed to obtain the parameter evaluation value S1; in another scheme, the absolute values of the corresponding elements are multiplied and then summed to obtain the parameter evaluation value S2. Both evaluation values reflect the result of a similar absolute-value computation.
  • the parameter evaluation value S1 or S2 indicates the amount of information retained after masking: the higher the value, the more information is retained. In one application scenario, either calculation method may be selected; in another, S1 and S2 may be used at the same time. The present disclosure does not restrict the choice.
  • masking is performed with every candidate mask tensor and the corresponding parameter evaluation values are collected; in the preceding example, this means all 90 4×4 mask matrices are applied and 90 parameter evaluation values are obtained.
  • the mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, that is, the parameter mask tensor.
  • there are many ways to find the maximum parameter evaluation value. For example, all evaluation values can be sorted numerically to obtain the largest, or a two-input comparator can be used repeatedly: the larger of two values is kept and compared with the next value, until all 90 evaluation values have been compared and the largest remains. If multiple mask tensors share the same maximum evaluation value, one of them may be selected according to certain rules or hardware characteristics, such as first found, last found, or randomly selected.
  • the mask tensor with the largest parameter evaluation value is the mask tensor that retains the most information, and this embodiment uses the mask tensor as the parameter mask tensor.
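The evaluate-and-select procedure can be sketched as follows (a minimal illustration: the score here is an S1-style sum of absolute products, since the exact formulas behind S1/S2 are in figures not reproduced in this text; function names and candidate masks are our assumptions):

```python
import numpy as np

def evaluate(mask, weights, data):
    """Parameter evaluation value: mask the weights, multiply element-wise
    with the training data, and sum the absolute values of the products
    (an S1-style score)."""
    return np.abs(data * (weights * mask)).sum()

def select_mask(candidate_masks, weights, data):
    """Pick the candidate mask tensor with the largest evaluation value,
    i.e. the one retaining the most information."""
    scores = [evaluate(m, weights, data) for m in candidate_masks]
    return candidate_masks[int(np.argmax(scores))]

w = np.array([[1.0, 2.0], [3.0, 5.0]])
x = np.ones((2, 2))
m1 = np.array([[1, 0], [0, 1]])   # keeps weights 1 and 5
m2 = np.array([[0, 1], [1, 0]])   # keeps weights 2 and 3
best = select_mask([m1, m2], w, x)
```

With uniform data, `m1` retains more weight magnitude (1+5 vs 2+3), so it is selected as the parameter mask tensor.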
  • the mask tensor may be updated every iteration or every generation of training. If the neural network parameters are updated after each training sample, the mask tensor is preferably updated in each iteration; if the neural network parameters are updated once per generation, the mask tensor is preferably updated at the end of each generation of training.
  • when the mask tensor is generated for the first time, it can be generated in a similar manner; the only difference is the neural network parameters on which it is based. Depending on the stages involved in the training process, these may be randomly initialized parameters or the neural network parameters determined after training in the unmasked stage.
  • the sparsification training of embodiments of the present disclosure may include several training stages, such as a maskless stage, a mask adjustment stage, and a mask fixation stage.
  • FIG. 11 shows an exemplary flow diagram including a no-mask stage and a mask-adjustment stage.
  • in the unmasked stage, the processing device 203 only trains the neural network parameters, that is, no mask sparsification is applied to them. After the unmasked stage ends and the mask adjustment stage begins, the parameters and the mask tensor are updated simultaneously.
  • in step 1101, the control circuit 610 is first set to enter the unmasked stage. In this stage, the neural network parameters are not masked, all parameter elements participate in the training, and the parameter values can be randomly generated at the start of training; these parameters are called unmasked parameters.
  • step 1102 the arithmetic circuit 630 calculates the value of the loss function based on the unmasked parameters in the forward pass.
  • the computing circuit 630 calculates the loss function as in the prior art: in forward propagation, the input training samples are processed by each layer of the neural network, abstract features are gradually extracted from the input feature map, and the loss function is calculated from the forward propagation result and the ground truth.
  • the arithmetic circuit 630 calculates the partial derivative of the loss function with respect to the unmasked parameter in backpropagation.
  • the arithmetic circuit 630 uses the gradient descent method to calculate the partial derivative of the loss function for each unmasked parameter through the chain rule.
  • the arithmetic circuit 630 updates the unmasked parameter based on the partial derivative, and uses the updated unmasked parameter as the initial value of the mask adjustment parameter.
  • the arithmetic circuit 630 updates the unmasked parameters of the entire neural network by multiplying each partial derivative by the step size, according to each unmasked parameter's influence on the error.
  • the arithmetic circuit 630 may also update the unmasked parameter based on the partial derivative in each training sample or each iteration.
  • step 1102, step 1103 and step 1104 can be repeated over a certain number of training passes to update the unmasked parameters multiple times. After the last update, the updated unmasked parameters serve as the initial values of the mask adjustment parameters in the next stage.
  • the control circuit 610 is set to enter the mask adjustment stage, that is, it starts to mask some parameters by using the mask tensor.
  • the prior art only trains on all parameters (such as weights, biases, etc.), and usually does not mask the parameters.
  • the purpose of parameter masking in this embodiment is to reduce the participation of parameters in the training phase, avoiding overfitting and reducing the amount of computation.
  • at the beginning of the mask adjustment stage, as mentioned earlier, the initial values of the mask adjustment parameters are the unmasked parameters finally updated in the unmasked stage, and the initial mask tensor can be obtained based on those finally updated unmasked parameters.
  • step 1106 the mask adjustment parameters are masked based on the mask tensor in the forward pass to calculate the value of the loss function.
  • step 1107 the partial derivatives of the loss function to the mask adjustment parameters are calculated in backpropagation.
  • step 1108 the mask adjustment parameters are updated based on the partial derivatives.
  • step 1109 the mask tensor is updated based on the updated mask adjustment parameters.
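The two-stage flow of steps 1101-1109 can be sketched on a toy problem (a minimal illustration only, not the patent's hardware flow; the quadratic loss, learning rate, iteration counts, and 2-of-4 grouping are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)              # step 1101: randomly generated parameters
target = np.ones(8)                 # toy ground truth
lr = 0.1

def grad(w_eff):
    # gradient of a toy quadratic loss 0.5 * ||w_eff - target||^2
    return w_eff - target

# unmasked stage (steps 1102-1104): every parameter element trains
for _ in range(100):
    w -= lr * grad(w)

def make_mask(p, m=4, n=2):
    # 2-of-4 mask keeping the larger-|value| parameters per interval
    flat = np.abs(p).reshape(-1, m)
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, np.argsort(-flat, axis=1)[:, :n], 1.0, axis=1)
    return mask.reshape(p.shape)

# mask adjustment stage (steps 1105-1109): mask, train, refresh the mask
mask = make_mask(w)                 # initial mask from the trained parameters
for _ in range(100):
    w -= lr * mask * grad(w * mask) # steps 1106-1108: forward masks the parameters
    mask = make_mask(w)             # step 1109: update the mask tensor
```

The unmasked stage lets all parameters converge first, so the initial mask is computed from trained rather than random values, as the text motivates.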
  • this embodiment does not limit the number of generations of training in the unmasked stage and the mask adjustment stage; those skilled in the art can arrange them according to the specific situation, and the numbers of generations in the two stages are not necessarily the same.
  • Another embodiment of the present disclosure also provides a solution for sparse training of a neural network model based on the aforementioned hardware environment.
  • the difference from the previous embodiment is that the training is divided into three stages: no mask stage, mask adjustment stage and mask fixation stage.
  • the processing device 203 In the unmasked stage, the processing device 203 only trains the parameters without masking the parameters.
  • in the mask fixing stage, the processing device 203 takes the updated mask adjustment parameters and the updated mask tensor from the mask adjustment stage as initial values, and continues to train the parameters without changing or updating the mask tensor.
  • step 1201 the control circuit 610 is set to enter the mask fixing stage.
  • the control circuit 610 uses the mask adjustment parameter updated in the mask adjustment stage as the initial value of the parameter (hereinafter referred to as the mask fixing parameter) in this stage.
  • since the mask tensor was already updated in the mask adjustment stage, it is not updated in this stage; instead, the mask fixing parameters are masked based on the mask tensor finally updated in the mask adjustment stage, and training continues so that the mask fixing parameters are updated.
  • This embodiment repeats the following steps in at least one generation of training.
  • step 1202 the arithmetic circuit 630 masks the mask fixed parameters in forward propagation based on the mask tensor updated in the mask adjustment stage to calculate the value of the loss function.
  • step 1203 the arithmetic circuit 630 calculates the partial derivatives of the loss function with respect to the fixed parameters of the mask in backpropagation.
  • step 1204 the update module 64 updates the mask fixed parameters based on the partial derivatives.
  • This embodiment is divided into three stages during training.
  • the unmasked stage no mask tensor masks the parameters, and only the parameters are trained to speed up the convergence of the parameters.
  • the mask adjustment stage since the initial values of the parameters are no longer randomly generated, but the unmasked parameters that have been trained, it is helpful to quickly obtain an ideal mask tensor.
  • after the mask tensor is updated, the mask fixing stage begins, and training continues using the updated mask tensor, so that the finally trained parameters better match the mask tensor.
  • Embodiment 1301 has only a mask adjustment stage: the initial parameter value W0 is randomly generated, the initial mask tensor M0 is determined based on W0, and the parameters and the mask matrix are updated simultaneously during training to obtain the trained parameters Wf and the updated mask tensor Mf.
  • Embodiment 1302 has only a no-mask stage and a mask-adjustment stage.
  • the unmasked stage only the parameters are trained, the initial value of the parameter W0 is randomly generated, and the updated parameter W1 is obtained after training.
  • in the mask adjustment stage, the parameters and the mask matrix are updated simultaneously; the initial parameter value in this stage is the updated parameter W1, the initial mask tensor M0 is obtained from W1, and finally the trained parameters Wf and the updated mask tensor Mf are obtained.
  • Embodiment 1303 has only a mask adjustment stage and a mask fixation stage.
  • the initial value of the parameter W0 is randomly generated
  • the initial value of the mask tensor M0 is determined based on the initial value of the parameter W0
  • the parameters and the mask matrix are updated simultaneously during training to obtain the updated parameters W1 and the updated mask tensor Mf.
  • in the mask fixing stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter value in this stage is W1, and finally the trained parameters Wf are obtained.
  • Embodiment 1304 has a no-mask stage, a mask-adjustment stage, and a mask-fixing stage.
  • the unmasked stage only the parameters are trained, the initial value of the parameter W0 is randomly generated, and the updated parameter W1 is obtained after training.
  • in the mask adjustment stage, the parameters and the mask matrix are updated simultaneously; the initial parameter value in this stage is the updated parameter W1, and the initial mask tensor M0 is obtained from W1, finally yielding the updated parameters W2 and the updated mask tensor Mf.
  • in the mask fixing stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter value in this stage is W2, and finally the trained parameters Wf are obtained.
  • in addition to an unmasked stage, a mask adjustment stage, and a mask fixing stage, Embodiment 1305 also has other training stages (shown with dotted lines) between the unmasked stage and the mask adjustment stage, and between the mask adjustment stage and the mask fixing stage.
  • the unmasked stage only the parameters are trained, the initial value of the parameter W0 is randomly generated, and the updated parameter W1 is obtained after training.
  • any training stage disclosed or undisclosed in the present disclosure can be inserted here to continue training the parameters or updating the mask matrix. Assuming this stage is a mask fixing stage, the initial parameter value in this stage is the updated parameter W1, while the initial mask tensor M0 is obtained from W1, yielding the updated parameters W2.
  • in the next stage, the initial parameter value is the updated parameter W2, and the initial mask tensor is still M0, so as to obtain the updated parameters W3 and the updated mask tensor M1.
  • assume the next inserted stage is a parameter fixing stage, that is, the parameters are fixed and not trained, and only the mask tensor is trained. The initial parameter value in this stage is the updated parameter W3, and the initial mask tensor is the updated mask tensor M1, yielding the updated mask tensor Mf.
  • in the mask fixing stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter value in this stage is W3, and finally the trained parameters Wf are obtained.
  • FIG. 13 The various embodiments shown in FIG. 13 are only examples, and those skilled in the art can expand other embodiments without creative efforts after referring to the present disclosure, and these embodiments all belong to the scope of the disclosure of the present disclosure.
  • the present disclosure does not limit the number of generations of training performed in the various embodiments; those skilled in the art can arrange it according to specific circumstances, and the number of generations performed in each stage is not necessarily the same.
  • the aforementioned embodiments do not necessarily have to complete all of the preset number of generations of training.
  • the control circuit 610 may further determine whether the percentage of elements of the parameter mask tensor that did not change over two consecutive generations of training reaches a threshold. If so, the training results have essentially converged and further training would bring limited accuracy improvement, so the mask adjustment stage ends and the training is complete.
  • the threshold is generally set above 70%; that is, if the percentage of unchanged elements of the parameter mask tensor exceeds 70%, training is stopped.
  • the present disclosure does not limit the threshold, which may be 80%, 90%, 100%, or any other percentage.
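The convergence check just described can be sketched as follows (a minimal illustration; the function name and sample masks are our assumptions, and the 70% default follows the text, with 80%, 90%, or 100% equally valid choices):

```python
import numpy as np

def mask_converged(prev_mask, new_mask, threshold=0.7):
    """Return True when the fraction of mask tensor elements that did NOT
    change between two consecutive generations of training reaches the
    threshold, signalling the mask adjustment stage can end."""
    unchanged = (prev_mask == new_mask).mean()
    return bool(unchanged >= threshold)

prev = np.array([1, 0, 1, 0, 1, 0, 1, 0])
new = np.array([1, 0, 1, 0, 1, 0, 0, 1])    # 6 of 8 elements unchanged (75%)
```

With 75% of elements unchanged, the 70% threshold is met and training would stop; a fully flipped mask would not meet it.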
  • different sparse data flow structures can be used to perform the related operations in different stages of training, so as to obtain optimal computation and IO performance.
  • in the mask adjustment stage, the mask tensor may be updated based on the updated neural network parameters, and the results of the update process may include the sparsified neural network parameters (e.g., sparsified weights) and the mask tensor.
  • This mask tensor can be used for sparsification of training data.
  • subsequent operations may be performed based on the sparsed neural network parameters and the sparsed training data.
  • in backpropagation, neuron gradients and neural network parameter gradients can be calculated based on the current unsparsified neural network parameters, and the unsparsified parameters are updated accordingly.
  • alternatively, the neural network parameters can be sparsified based on the mask tensor used in forward propagation, the neuron gradients and neural network parameter gradients can be computed from the sparsified parameters, and the unsparsified neural network parameters are updated accordingly.
  • the sparsification in the backpropagation process is described in the previous description, and will not be repeated here.
  • the mask tensor in the mask fixing stage, is fixed and does not need to be updated in real time. Therefore, the fixed mask tensor can be stored in the storage circuit for subsequent use.
  • Fixed mask tensors can include the forward mask tensor used in forward propagation, and the reverse mask tensor used in back propagation.
  • Neural network parameters can have different storage schemes.
  • in one scheme, the storage circuit stores the unsparsified neural network parameters.
  • in forward propagation, the stored mask tensor is used to sparsify the neural network parameters.
  • in backpropagation, the unsparsified neural network parameters directly participate in the neuron gradient calculation (for example, the above formula (1)), and the updated unsparsified parameters are stored back in the storage circuit.
  • the memory circuit may store the thinned neural network parameters.
  • the sparsely processed neural network parameters can directly participate in the forward operation, and no further sparse processing is required.
  • the sparsely processed neural network parameters need to be updated, so the mask tensor stored in the storage circuit can be used to sparse the neural network parameter gradient, and then the sparsely processed neural network parameters can be updated.
  • when sparsification is not applied in backpropagation, de-sparsification must be performed on the sparsified neural network parameters, and the neuron gradients are then calculated based on the de-sparsified parameters.
  • when sparsification is applied in backpropagation, the reverse mask tensor stored in the storage circuit can be used to sparsify the de-sparsified neural network parameters, and the neuron gradients are then calculated based on the result.
  • Another embodiment of the present disclosure is a computer-readable storage medium on which computer program codes for sparse training of a neural network model are stored.
  • the above integrated units may be implemented in the form of software program modules. If implemented as a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory.
  • the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a CD, or other media that can store program code.
  • the updated parameter mask tensor is used to block the parameters after training, so as to control the processing area of the feature map input to the neural network model.
  • in this way, when the computing device 201 performs inference, the amount of computation is reduced, achieving the purpose of sparsification.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, terminal, etc.
  • according to the solution of the present disclosure, electronic equipment or apparatuses with high computing power can be applied to cloud devices (e.g., cloud servers), while electronic equipment or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or the edge device, to simulate the hardware resources of the terminal device and/or the edge device, thereby completing the unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings herein, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments; that is, the actions or modules involved therein are not necessarily required to realize one or more solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure have different emphases. In view of this, for parts that are not described in detail in a certain embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and the like.
  • a method of sparsification training of a neural network model, performed by a data processing apparatus, comprising:
  • in forward propagation, at least the neural network parameters are sparsified based on the mask tensor to calculate the value of the loss function
  • the neural network parameters are updated based on the neural network parameter gradients.
  • the neuron gradients and the neural network parameter gradients are calculated based on the non-sparsified neural network parameters; and the neural network parameters are updated based on the neural network parameter gradients.
  • de-sparsification is performed on the sparsified neural network parameters to obtain the non-sparsified neural network parameters.
  • the neuron gradients and the neural network parameter gradients are calculated based on the sparsified neural network parameters; and the neural network parameters are updated based on the neuron gradients.
  • the neural network parameters are sparsified based on the reverse mask tensor to obtain the sparsified neural network parameters.
  • Clause 7 The method of clause 6, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  • Clause 8 The method of any of clauses 1-5, wherein the mask tensor is a two-dimensional tensor.
  • Clause 9 The method of Clause 8, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  • Clause 10 The method of Clause 5, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by dimensionally transforming the mask tensor.
  • Clause 11 The method of Clause 1, wherein updating the neural network parameters comprises updating the non-sparsified neural network parameters.
  • the mask tensor is generated based on the updated non-sparsified neural network parameters.
  • the mask tensor is determined based on the positions of the n valid data elements among the m data elements.
  • each dimension of the two-dimensional mask tensor includes m elements, wherein n elements are 1, m-n elements are 0, and m>n;
  • a product-sum operation is performed with the training data of the neural network to obtain parameter evaluation values; and the two-dimensional mask tensor that yields the largest of all parameter evaluation values is selected as the mask tensor.
  • Clause 15 The method of any of clauses 1-14, wherein the method is performed in multiple iterations in a mask adjustment phase of the sparsification training.
  • Clause 17 The method of clause 16, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 18 The method of any of clauses 1-10, wherein the method is performed in multiple iterations in a mask-fixing stage of sparsification training, and the mask tensor is fixed to be the mask tensor finalized in the previous stage.
  • the sparsified neural network parameters are updated based on the sparsified neuron gradients.
  • Clause 21 The method of any of clauses 18-20, wherein during the mask-fixing stage, the fixed mask tensor and the sparsified neural network parameters are stored.
  • Clause 22 The method of any of clauses 18-20, wherein during the mask-fixing stage, the fixed mask tensor and the non-sparsified neural network parameters are stored.
  • Clause 23 A computer-readable storage medium having stored thereon computer program code for sparsification training of a neural network model, which, when executed by a processing device, performs the method of any one of clauses 1-22.
  • a data processing apparatus comprising a control circuit, a storage circuit and an arithmetic circuit, wherein:
  • the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparsification training on a neural network model
  • the storage circuit is configured to store information including at least neural network parameters and mask tensors
  • the arithmetic circuit is configured to perform the following operations under the control of the control circuit:
  • in forward propagation, at least the neural network parameters are sparsified based on the mask tensor to calculate the value of the loss function
  • the neural network parameters are updated based on the neural network parameter gradients.
  • de-sparsification is performed on the sparsified neural network parameters to obtain the non-sparsified neural network parameters.
  • the neural network parameters are updated based on the neuron gradients.
  • the neural network parameters are sparsified based on the reverse mask tensor to obtain the sparsified neural network parameters.
  • Clause 30 The apparatus of clause 29, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  • Clause 32 The apparatus of clause 31, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  • Clause 33 The apparatus of Clause 28, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by the arithmetic circuit by dimensionally transforming the mask tensor.
  • the mask tensor is generated based on the updated non-sparsified neural network parameters.
  • the mask tensor is determined based on the positions of the n valid data elements among the m data elements.
  • each dimension of the two-dimensional mask tensor includes m elements, wherein n elements are 1, m-n elements are 0, and m>n;
  • a product-sum operation is performed with the training data of the neural network to obtain parameter evaluation values; and the two-dimensional mask tensor that yields the largest of all parameter evaluation values is selected as the mask tensor.
  • Clause 38 The apparatus of any of clauses 24-37, wherein the arithmetic circuit is configured to perform the operations in a plurality of iterations in a mask adjustment stage of sparsification training.
  • Clause 39 The apparatus of Clause 38, wherein the arithmetic circuit is further configured to: in the mask adjustment stage, determine whether the percentage of elements of the mask tensor whose values have not changed over successive training iterations reaches a threshold; and
  • Clause 40 The device of clause 39, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 41 The apparatus of any of clauses 24-33, wherein the arithmetic circuit is configured to perform the operations in a plurality of iterations in a mask-fixing stage of sparsification training, and the mask tensor is fixed to be the mask tensor finalized in the previous stage.
  • the sparsified neural network parameters are updated based on the sparsified neuron gradients.
  • Clause 44 The apparatus of any of clauses 41-43, wherein during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the sparsified neural network parameters.
  • Clause 45 The apparatus of any of clauses 41-43, wherein during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the non-sparsified neural network parameters.
  • Clause 46 A chip comprising a data processing device according to any of clauses 24-45.
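Clauses 12-14 above generate the mask tensor from the positions of the n valid data elements among every m data elements. Below is a minimal NumPy sketch of one plausible reading, assuming magnitude is the criterion for "valid" elements and that the flattened parameter length is divisible by m; the function name and the m, n values are illustrative, not fixed by the clauses:

```python
import numpy as np

def n_of_m_mask(weights, m=4, n=2):
    """Build an n-of-m mask: within every group of m consecutive
    elements, the n largest-magnitude elements get mask value 1,
    the remaining m-n elements get mask value 0."""
    flat = weights.reshape(-1, m)
    mask = np.zeros_like(flat)
    # indices of the n largest-magnitude elements in each group of m
    top = np.argsort(-np.abs(flat), axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return mask.reshape(weights.shape)

w = np.array([0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.03, 0.6])
mask = n_of_m_mask(w, m=4, n=2)
sparse_w = w * mask  # sparsified parameters
```

With m=4 and n=2 this keeps half the parameters per group, which is what makes the sparsity pattern predictable enough for structured hardware data flows.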


Abstract

A method and apparatus for sparsification training of a neural network model, a board, and a readable storage medium. A combined processing apparatus (20) comprises a computing apparatus (201), an interface apparatus (202), a processing apparatus (203), and a storage apparatus (204). The computing apparatus (201) interacts with the processing apparatus (203) to jointly complete a computing operation specified by a user. The storage apparatus (204) is connected to the computing apparatus (201) and the processing apparatus (203), respectively, and stores data of the computing apparatus (201) and the processing apparatus (203).

Description

Apparatus and Method for Neural Network Sparsification, and Related Products
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 2020112169035, filed on November 4, 2020, entitled "Device, Method and Corresponding Product for Neural Network Sparsification", and to Chinese Patent Application No. 2020115632599, filed on December 25, 2020, entitled "Apparatus, Method and Related Products for Neural Network Sparsification".
TECHNICAL FIELD
The present disclosure relates generally to the field of processors. More specifically, the present disclosure relates to a method, apparatus, chip, board, and readable storage medium for sparsification training of a neural network model by a data processing apparatus.
BACKGROUND
In recent years, the rapid development of deep learning has brought leapfrog progress to algorithm performance in a series of fields such as computer vision and natural language processing. However, deep learning algorithms are computation-intensive and storage-intensive tools. As information processing tasks become increasingly complex and the demands on algorithm real-time performance and accuracy keep rising, neural networks tend to be designed deeper and deeper, so that their computation and storage requirements grow ever larger. As a result, existing deep-learning-based artificial intelligence technology is difficult to apply directly on mobile phones, satellites, or embedded devices with limited hardware resources.
Therefore, the compression, acceleration, and optimization of deep neural network models have become extremely important. A large number of studies attempt to reduce the computation and storage requirements of neural networks without affecting model accuracy, which is of great significance for the engineering application of deep learning technology on embedded and mobile terminals. Sparsification is one such model lightweighting method.
Network parameter sparsification reduces the redundant components of a large network by appropriate methods, so as to reduce the network's demand for computation and storage space. Although existing fine-grained parameter sparsification methods perform well at the model level, they are unfriendly to hardware memory access, i.e., on-chip/off-chip input/output overhead is high and performance is low. On the other hand, structured sparsification methods based on channels or convolution kernels improve hardware performance, but incur a large loss in model accuracy. Finally, most existing sparsification algorithms work by offline fine-tuning, i.e., a pre-trained model is sparsified and then fine-tuned; offline fine-tuning is subject to many restrictions and cannot yield more substantial performance gains during model training.
Therefore, there is a need for a scheme capable of sparsification training of neural network models.
SUMMARY OF THE INVENTION
In order to at least partially solve one or more technical problems mentioned in the background, the solution of the present disclosure provides an apparatus, a board, a method, and a readable storage medium for sparsification training of a neural network model.
In a first aspect, the present disclosure discloses a method, performed by a data processing apparatus, for sparsification training of a neural network model, comprising: in forward propagation, sparsifying at least the neural network parameters based on a mask tensor to calculate the value of a loss function; in backward propagation, calculating neuron gradients and neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
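The three steps of the first aspect (a masked forward pass, gradient computation in the backward pass, and a parameter update) can be sketched for a single linear layer with a squared-error loss. This is only an illustrative NumPy sketch: the function name, the learning rate, and the fixed mask are assumptions for illustration, not part of the claimed method.

```python
import numpy as np

def sparse_training_step(w, mask, x, y, lr=0.01):
    # Forward propagation: sparsify the parameters with the mask tensor
    w_sparse = w * mask
    y_hat = x @ w_sparse
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # Backward propagation: neuron gradients, then parameter gradients
    grad_neuron = y_hat - y            # gradient w.r.t. the output neurons
    grad_w = x.T @ grad_neuron         # gradient w.r.t. the parameters
    # Update the parameters with the parameter gradients; masked positions
    # never affect the loss because the mask is re-applied every forward pass
    return w - lr * grad_w, loss

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
y = rng.standard_normal((8, 3))
w = rng.standard_normal((4, 3))
mask = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]], dtype=float)

losses = []
for _ in range(100):
    w, loss = sparse_training_step(w, mask, x, y)
    losses.append(loss)
```

Because the forward pass always multiplies by the mask, the loss only depends on the unmasked coordinates, so plain gradient descent on the full parameter tensor still drives the loss down.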
In a second aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer program code for sparsification training of a neural network model, which, when executed by a processing device, performs the method of any embodiment of the foregoing first aspect.
In a third aspect, the present disclosure provides a data processing apparatus comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein: the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparsification training on a neural network model; the storage circuit is configured to store information including at least neural network parameters and a mask tensor; and the arithmetic circuit is configured to perform the following operations under the control of the control circuit: in forward propagation, sparsifying at least the neural network parameters based on the mask tensor to calculate the value of a loss function; in backward propagation, calculating neuron gradients and neural network parameter gradients based on the loss function; and updating the neural network parameters based on the neural network parameter gradients.
In a fourth aspect, the present disclosure provides a chip comprising the data processing circuit of any embodiment of the foregoing third aspect.
In a fifth aspect, the present disclosure provides a board comprising the chip of any embodiment of the foregoing fourth aspect.
Through the data processing apparatus provided above, the method of sparsification training of a neural network model using the data processing apparatus, and related products, the embodiments of the present disclosure provide a scheme for sparsification during the training process of a neural network. This sparsification scheme can support sparsification in the forward propagation of training, for example, sparsification of the input channel dimension, or simultaneous sparsification of the input channel dimension and the output channel dimension. In some embodiments, when forward propagation performs simultaneous sparsification of the input and output channel dimensions, backward propagation can also support simultaneous sparsification of the input and output channel dimensions, thereby further optimizing performance. The sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can adopt different structured-sparse data flow structures for the related operations, so as to obtain optimized computation and IO performance.
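For concreteness, sparsifying the input channel dimension alone versus both channel dimensions can be sketched as below, a NumPy illustration under the common (Co, Ci, Kh, Kw) convolution-weight layout; the layout, function name, and mask values are assumptions for illustration, not fixed by the disclosure:

```python
import numpy as np

def mask_channels(w, ci_mask, co_mask=None):
    """Zero out whole input channels of a conv weight of shape
    (Co, Ci, Kh, Kw); optionally zero out whole output channels too."""
    out = w * ci_mask[None, :, None, None]        # input-channel sparsification
    if co_mask is not None:
        out = out * co_mask[:, None, None, None]  # output-channel sparsification
    return out

w = np.ones((8, 4, 3, 3))                 # (Co, Ci, Kh, Kw)
ci_mask = np.array([1.0, 0.0, 1.0, 0.0])  # keep input channels 0 and 2
co_mask = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

w_ci = mask_channels(w, ci_mask)             # input-channel dimension only
w_both = mask_channels(w, ci_mask, co_mask)  # both dimensions at once
```

Because whole channels are zeroed, the corresponding rows and columns of the unfolded matrix multiply can be skipped entirely, which is what makes this form of sparsity friendly to hardware memory access, as noted in the background section.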
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals refer to like or corresponding parts, wherein:
FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present disclosure;
FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure;
FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 7 illustrates a method performed in one iteration according to an embodiment of the present disclosure;
FIG. 8A illustrates the masking process of an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure;
FIG. 8B illustrates the masking process of an exemplary two-dimensional mask tensor according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating an exemplary mask vector update;
FIG. 10 is a schematic diagram illustrating an exemplary product-sum calculation process;
FIG. 11 is a flowchart illustrating a sparsification training method according to another embodiment of the present disclosure;
FIG. 12 is a flowchart illustrating a sparsification training method entering the mask-fixing stage according to another embodiment of the present disclosure; and
FIG. 13 is a schematic diagram illustrating several implementations of sparsification training of a neural network model according to the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth", which may be used in the claims, description, and drawings of the present disclosure, are used to distinguish different objects rather than to describe a specific order. The terms "comprising" and "including" used in the description and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this description and the claims, the term "if" may be interpreted contextually as "when", "once", "in response to determining", or "in response to detecting".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in FIG. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; a notable feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing power of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having huge off-chip storage, on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102. The computation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. According to different application scenarios, the external interface device 102 may have different interface forms, such as a PCIe interface.
The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and transmits data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
FIG. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning computations; it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read the data in the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including, but not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like, and the number thereof may be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The DRAM 204 is used to store data to be processed; it is a DDR memory, typically 16 GB or larger in size, and is used to save data of the computing device 201 and/or the processing device 203.
FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core. The single-core computing device 301 is used to process input data for computer vision, speech, natural language, data mining, and the like, and includes three main modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is used to acquire instructions from the processing device 203, and the instruction decode unit 312 decodes the acquired instructions and sends the decoding results to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; the WRAM 332 is used to store the convolution kernels of the deep learning network, that is, the weights; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores. The multi-core computing device 41 adopts a layered design: as a system on chip, it includes at least one cluster, and each cluster in turn includes multiple processor cores. In other words, the multi-core computing device 41 is organized in a system-on-chip / cluster / processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and multiple clusters 405.
There may be multiple external storage controllers 401, two of which are shown in the figure by way of example. They are used to access external storage devices, such as the DRAM 204 in FIG. 2, in response to access requests issued by the processor cores, so as to read data from or write data to off-chip memory. The peripheral communication module 402 is used to receive control signals from the processing device 203 through the interface device 202 and start the computing device 201 to perform tasks. The on-chip interconnect module 403 connects the external storage controller 401, the peripheral communication module 402, and the multiple clusters 405, and is used to transmit data and control signals among these modules. The synchronization module 404 is a global barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information. The multiple clusters 405 are the computing cores of the multi-core computing device 41; four are shown in the figure by way of example. With the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in FIG. 4, each cluster 405 includes multiple processor cores (IPU cores) 406 and one memory core (MEM core) 407.
Four processor cores 406 are shown in the figure by way of example; the present disclosure does not limit the number of processor cores 406. Their internal architecture is shown in FIG. 5. Each processor core 406 is similar to the single-core computing device 301 of FIG. 3 and likewise includes three main modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again. It should be particularly noted that the storage module 53 includes an input/output direct memory access module (IODMA) 533 and a move direct memory access module (MVDMA) 534. The IODMA 533 controls memory access between the NRAM 531/WRAM 532 and the DRAM 204 through a broadcast bus 409; the MVDMA 534 is used to control memory access between the NRAM 531/WRAM 532 and a storage unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is mainly used for storage and communication, that is, for storing shared data or intermediate results among the processor cores 406, and for carrying out communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capability for performing scalar operations.
The memory core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access module (CDMA) 410, and a global direct memory access module (GDMA) 411. The SRAM 408 assumes the role of a high-performance data transfer station: data reused among different processor cores 406 within the same cluster 405 does not need to be obtained from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 via the SRAM 408. The memory core 407 only needs to quickly distribute the reused data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and also greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 409, the CDMA 410, and the GDMA 411 are used to carry out the communication among the processor cores 406, the communication among the clusters 405, and the data transmission between the clusters 405 and the DRAM 204, respectively. They are described separately below.
The broadcast bus 409 is used to carry out high-speed communication among the processor cores 406 within a cluster 405. The broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point (for example, single processor core to single processor core) data transmission; multicast is a communication mode that transmits one piece of data from the SRAM 408 to several specific processor cores 406; and broadcast, which transmits one piece of data from the SRAM 408 to all processor cores 406, is a special case of multicast.
The CDMA 410 is used to control memory access to the SRAM 408 among different clusters 405 within the same computing device 201.
The GDMA 411 cooperates with the external storage controller 401 to control memory access from the SRAM 408 of a cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be realized through two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel first transfers data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then transfers data between the SRAM 408 and the NRAM 431 or WRAM 432 via the MVDMA 534. Although on the surface the second channel requires more components and a longer data path, in fact, in some embodiments, the bandwidth of the second channel is much greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present disclosure may select a data transmission channel according to their own hardware conditions.
In other embodiments, the function of the GDMA 411 and the function of the IODMA 533 may be integrated into the same component. For convenience of description, the present disclosure regards the GDMA 411 and the IODMA 533 as different components; for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they fall within the protection scope of the present disclosure. Further, the function of the GDMA 411, the function of the IODMA 533, the function of the CDMA 410, and the function of the MVDMA 534 may also be realized by the same component.
The training of a neural network adjusts the parameters of each layer by inputting training samples, so that the results computed by the neural network are as close as possible to the true results. Neural network training includes forward propagation and backpropagation. Forward propagation, based on the existing model, passes the input training samples through the computation of each layer of the neural network, gradually extracting the input feature maps into abstract features. After forward propagation, an output value called the predicted value is obtained. Backpropagation uses a loss function computed from the predicted value obtained by forward propagation and the true value, and applies gradient descent: the partial derivative of the loss function with respect to each parameter is computed through the chain rule so as to update the parameters. In the chain rule, the derivatives of the error value with respect to the weights of the last layer of the neural network are computed first. These derivatives are called gradients, and they are then used to compute the gradients of the second-to-last layer of the neural network. This process is repeated until the gradient corresponding to each weight in the neural network is obtained. Finally, the corresponding gradient is subtracted from each weight in the neural network, thereby updating the weights once so as to reduce the error value. Training then continues with the updated parameters, and this is repeated many times until the computation result of forward propagation finally meets expectations.
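The forward/backward/update cycle described above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the disclosed apparatus: a single linear layer with a squared-error loss, and hypothetical data, shapes, and learning rate.

```python
import numpy as np

W = np.zeros((4, 3))                           # weights of one linear layer
x = np.array([1.0, 0.5, -0.5, 2.0])            # one training sample (hypothetical)
y_true = np.array([0.3, -1.2, 0.8])            # its true value (hypothetical)

lr = 0.1
for _ in range(200):                           # repeated iterations
    y_pred = x @ W                             # forward propagation: predicted value
    loss = 0.5 * np.sum((y_pred - y_true)**2)  # loss function (output error)
    top_diff = y_pred - y_true                 # chain rule starts at the last layer
    grad_W = np.outer(x, top_diff)             # weight gradient via the chain rule
    W -= lr * grad_W                           # subtract gradient: one weight update

print(np.allclose(x @ W, y_true))              # → True (prediction meets expectation)
```

After enough iterations the forward-propagation result matches the label, which is the convergence behavior the paragraph above describes.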
In the neural network training process, each time the neural network completes one forward propagation of a signal and one corresponding backpropagation of the error, the weights in the neural network are updated once using the gradients; this is called one iteration. To obtain a neural network whose accuracy meets expectations, a very large sample data set is required during training. In this case, it is impossible to input the entire sample data set into the computer at once. Therefore, to solve this problem, the sample data set needs to be divided into multiple blocks, and each block is passed to the computer; after each block of the data set is processed forward, the weights of the neural network are correspondingly updated once. When a complete sample data set has passed through the neural network for one forward processing and has returned one corresponding weight update, this process is called an epoch. In practice, passing the complete data set through the neural network once is not enough; the complete data set needs to be passed through the same neural network multiple times, that is, multiple epochs are required, to finally obtain a neural network whose accuracy meets expectations.
Based on the aforementioned hardware environment, this embodiment provides a solution for sparse training of a neural network model. In more detail, in each iteration comprising a forward propagation and a backpropagation process, the neural network parameters are sparsified at least in the forward propagation. The sparsification may be one-dimensional (for example, along the input channel dimension) or multi-dimensional, for example two-dimensional (for example, sparsifying the input channel dimension and the output channel dimension simultaneously). In some embodiments, when forward propagation sparsifies the input channel dimension and the output channel dimension simultaneously, backpropagation may also support simultaneous sparsification of the input channel dimension and the output channel dimension, thereby further optimizing performance. The sparsification scheme of the present disclosure can be performed in multiple stages of training, and different training stages can use different structured sparse data flow structures to perform the related operations, so as to obtain optimized computation and IO performance.
FIG. 6 shows an exemplary structural block diagram of a data processing apparatus according to an embodiment of the present disclosure.
The data processing apparatus 600 may be implemented, for example, in the computing device 201 of FIG. 2. As shown, the data processing apparatus 600 may include a control circuit 610, a storage circuit 620, and an operation circuit 630.
The function of the control circuit 610 may be similar to that of the control module 31 of FIG. 3. It may include, for example, an instruction fetch unit for acquiring instructions from, for example, the processing device 203 of FIG. 2, and an instruction decode unit for decoding the acquired instructions and sending the decoding results to the operation circuit 630 and the storage circuit 620 as control information.
In one embodiment, the control circuit 610 may be configured to control the storage circuit 620 and the operation circuit 630 to perform sparse training on a neural network model.
The storage circuit 620 may be configured to store information, which includes at least neural network parameters. In embodiments of the present disclosure, the storage circuit 620 may also store mask tensors. In this embodiment, the storage circuit may be, for example, the WRAM 332 or the NRAM 331 of FIG. 3.
The operation circuit 630 may be configured to perform, under the control of the control circuit 610, sparse training on the neural network model, so as to carry out the sparse training method shown in FIG. 7.
FIG. 7 shows a method performed in one iteration according to an embodiment of the present disclosure.
In step 710, in forward propagation, at least the neural network parameters are sparsified based on a mask tensor, so as to compute the value of the loss function.
In embodiments of the present disclosure, the mask tensor may take various forms.
In some embodiments, the mask tensor is a one-dimensional tensor, which sparsifies one specified dimension of the data. For example, the mask tensor sparsifies the input channel dimension of the neural network parameters.
In some embodiments, the sparsification may be structured sparsification, for example, selecting, according to a sparsity rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n. In one implementation, m = 4 and n = 2. In other implementations, when m = 4, n may also take other values, such as 1 or 3.
In this case, the mask tensor may be a one-dimensional vector, which can be divided into multiple intervals of length m; within each interval, n elements are 1, representing retained data positions, and m − n elements are 0, representing masked-out data positions.
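The n-of-m interval structure just described can be illustrated as follows. This sketch keeps, in each interval of m = 4 weights, the n = 2 elements of largest magnitude; magnitude-based selection is one plausible sparsity rule assumed here for illustration, since the disclosure does not fix how the n valid elements are chosen.

```python
import numpy as np

def make_1d_mask(weights, m=4, n=2):
    """Build a 0/1 mask with exactly n ones in every interval of length m,
    keeping the n largest-magnitude weights of each interval (assumed rule)."""
    assert weights.size % m == 0
    mask = np.zeros_like(weights, dtype=np.int8)
    for start in range(0, weights.size, m):
        interval = np.abs(weights[start:start + m])
        keep = np.argsort(interval)[-n:]       # indices of the n largest elements
        mask[start + keep] = 1
    return mask

w = np.array([0.1, -2.0, 0.3, 1.5,  0.0, 0.2, -0.1, 0.05])
mask = make_1d_mask(w)
print(mask)        # → [0 1 0 1 0 1 1 0]
```

Every length-4 interval of the resulting vector contains exactly two ones (retained positions) and two zeros (masked-out positions), matching the structure described above.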
In forward propagation, neurons (for example, training data) are operated on (for example, convolved) with the neural network parameters (for example, weights). The mask tensor can be used to apply the same sparsification to the neurons, so that the corresponding operation is performed based on the sparsified results.
FIG. 8A shows the masking process of an exemplary one-dimensional mask tensor according to an embodiment of the present disclosure. Taking the convolution layer operation of a convolutional neural network as an example, FIG. 8A illustrates the sparsification-based convolution operation in forward propagation.
As shown in the figure, the dimension to be sparsified is the input channel dimension. The exemplary mask tensor is a vector of length 16, divided into 4 intervals of length 4, with 2 elements equal to 1 in each interval, as shown by the black squares in the figure. The input channel dimension of the weights is correspondingly segmented, with each segment corresponding to one interval of the mask tensor; the two interact (for example, through element-wise multiplication by the multipliers in the operation circuit 630) to obtain the masked weights. The data of the neurons along the input channel dimension is sparsified in the same way using the same mask tensor. The sparsified weights and the sparsified neurons then undergo the operation, for example a multiply-accumulate operation.
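A minimal numeric sketch of this step (hypothetical values, not the disclosed hardware): one length-16 mask with two ones per length-4 interval is applied to both the weights and the neurons along the input channel dimension, and the masked operands are then multiply-accumulated.

```python
import numpy as np

Ci = 16
mask = np.array([1,1,0,0, 0,1,1,0, 1,0,1,0, 0,0,1,1])  # 2 ones per interval of 4
w = np.arange(1.0, Ci + 1)          # weights along the input channel dimension
x = np.ones(Ci)                     # neurons along the input channel dimension

w_masked = w * mask                 # masked weights
x_masked = x * mask                 # same sparsification applied to the neurons
acc = np.dot(w_masked, x_masked)    # multiply-accumulate over Ci

# Only the 8 retained channels (w = 1,2,6,7,9,11,15,16) contribute:
print(acc)   # → 67.0
```

Half of the sixteen input channels are masked out, so half of the multiply-accumulate work is skipped, which is the computation saving described for FIG. 8A.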
In other embodiments, the mask tensor is a two-dimensional tensor, which sparsifies two specified dimensions of the data simultaneously. For example, the mask tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters simultaneously.
In some embodiments, the sparsification may be structured sparsification, for example, selecting, according to a sparsity rule, n data elements as valid data elements from every m data elements of the dimension to be sparsified in the input data, where m > n. In one implementation, m = 4 and n = 2. In other implementations, when m = 4, n may also take other values, such as 1 or 3.
In this case, the mask tensor may be a two-dimensional matrix, which can be divided into multiple m×m blocks; in each block, any row has n elements equal to 1 and m − n elements equal to 0, and any column likewise has n elements equal to 1 and m − n elements equal to 0, where "1" represents a retained data position and "0" represents a masked-out data position. In some embodiments, assuming m is 4 and n is 2, there are 90 such 4×4 mask matrices in total, and these mask matrices may be pre-stored in the DRAM 204.
FIG. 8B shows an exemplary masking process. Assume the input channels and output channels of a convolution layer form a 4×4 channel matrix 801 whose elements are a11 to a44; the channel matrix 801 constitutes the neural network parameters. The figure also shows one exemplary mask matrix 802 among the aforementioned ninety 4×4 mask matrices, which is used to perform mask sparsification on the channel matrix 801. Specifically, if the corresponding element in the mask matrix 802 is 1, the operation circuit 630 retains the element of the channel matrix 801; if the corresponding element in the mask matrix 802 is 0, the operation circuit 630 masks out the element of the channel matrix 801, setting its value to 0. Taking a11 in the channel matrix 801 as an example, its corresponding element in the mask matrix 802 is 0, so the corresponding element of the masked parameter matrix 803 is masked out and its value is 0. All element values of the masked parameter matrix 803 are obtained in this way. Since half of the elements of the channel matrix 801 are masked out, about half of the computation is saved.
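The count of ninety can be checked by enumeration. The sketch below (an illustrative check, not the disclosed apparatus) enumerates all 4×4 binary matrices whose every row and every column contains exactly two ones, confirms that there are 90 of them, and applies one of them to a hypothetical 4×4 parameter matrix.

```python
from itertools import combinations
import numpy as np

# All length-4 0/1 rows with exactly two ones (C(4,2) = 6 of them).
rows = [np.bincount(c, minlength=4) for c in combinations(range(4), 2)]

masks = []
for r0 in rows:
    for r1 in rows:
        for r2 in rows:
            for r3 in rows:
                m = np.array([r0, r1, r2, r3])
                if (m.sum(axis=0) == 2).all():  # every column also has two ones
                    masks.append(m)

print(len(masks))                 # → 90 such 4x4 mask matrices

# Applying one mask to a 4x4 channel matrix zeroes half of its elements.
a = np.arange(1, 17).reshape(4, 4).astype(float)
masked = a * masks[0]
print(int((masked != 0).sum()))   # → 8
```

Exactly half of the sixteen elements survive the mask, matching the roughly-halved computation noted for FIG. 8B.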
For each training sample, the operation circuit 630 masks the neural network parameters based on the mask tensor in forward propagation and then performs the computation, finally obtaining the value of the loss function, which corresponds to the output error of the neural network.
Returning to FIG. 7, in step 720, in backpropagation, the neuron gradients and the neural network parameter gradients are computed based on the loss function. In embodiments of the present disclosure, depending on the mask tensor used in forward propagation, sparsification may or may not be selectively applied in backpropagation.
In some embodiments, regardless of the mask tensor used in forward propagation, in backpropagation the neuron gradients and the neural network parameter gradients may be computed based on the unsparsified neural network parameters, and the neural network parameters are then updated based on the neural network parameter gradients.
Depending on the information stored in the storage circuit, in some implementations the unsparsified neural network parameters may be the neural network parameters before sparsification, or may be obtained by de-sparsifying the already sparsified neural network parameters. De-sparsification may include restoring, as indicated by the mask tensor, the sparsified neural network parameters to their corresponding positions before sparsification, and filling the remaining positions with predetermined information (for example, 0) to restore the shape before sparsification.
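De-sparsification as just described can be sketched as a scatter guided by the mask (illustrative code with hypothetical values):

```python
import numpy as np

mask = np.array([0, 1, 0, 1, 1, 1, 0, 0])      # 1 = retained position
sparse_vals = np.array([2.0, 1.5, 0.2, -0.1])  # the compressed valid elements

def desparsify(sparse_vals, mask, fill=0.0):
    """Scatter the retained elements back to their pre-sparsification
    positions; fill masked-out positions with predetermined info (0)."""
    dense = np.full(mask.shape, fill)
    dense[mask == 1] = sparse_vals
    return dense

print(desparsify(sparse_vals, mask).tolist())
# → [0.0, 2.0, 0.0, 1.5, 0.2, -0.1, 0.0, 0.0]
```

The result has the original pre-sparsification shape, with zeros at the positions the mask tensor marked as masked out.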
In other embodiments, when the mask tensor used in forward propagation is a two-dimensional tensor, sparsification may also be applied in backpropagation; that is, the neural network parameter gradients and the neuron gradients are computed based on the sparsified neural network parameters, and the neural network parameters are then updated based on the neuron gradients.
In the backpropagation of training, the computation of the neuron gradients and the weight gradients is involved, as follows:

bottom_diff = top_diff ⊛ W  (1)

ΔW = top_diff ⊛ bottom_data  (2)

where top_diff and bottom_diff are the neuron gradients, W is the weight of the current iteration, ΔW is the weight gradient computed in the current iteration (bottom_data denoting the input neurons of the layer), and ⊛ is the computation in backpropagation, which is similar to a convolution operation. With respect to the backpropagation direction, the bottom_diff of the previous layer is the top_diff of the current layer, and the bottom_diff of the current layer is the top_diff of the next layer, so that the error can be propagated backward layer by layer.
In the computation of formula (1), the layout of the weight W is different from that in the forward propagation process, so the accumulation direction in the operation is also different. In forward propagation, the weights are used in the (Co, Kh, Kw, Ci) dimension order or dimension shape, where Ci denotes the input channel dimension, Co the output channel dimension, Kh the convolution kernel height dimension, and Kw the convolution kernel width dimension. In the convolution operation of forward propagation, the operation results are accumulated along the Ci direction. In backpropagation, by contrast, the weights are used in the (Ci, Kh, Kw, Co) dimension order or dimension shape, and the operation results are accumulated along the Co direction. Therefore, in order to maintain the mathematical consistency of the gradients computed in backpropagation, the Ci and Co directions need to be sparsified simultaneously.
When sparsification is performed in backpropagation, a reverse mask tensor can be used to mask the neural network parameters to obtain the sparsified neural network parameters.
The reverse mask tensor may be consistent with the mask tensor used in forward propagation. However, because of the different weight layout in backpropagation mentioned above, the accumulation direction during the operation is also different, so the mask tensor from forward propagation cannot be used directly. In some implementations, the mask tensor used in forward propagation (also called the forward mask tensor) may be dimension-transformed before use. Various existing dimension transformation methods (for example, dimension transposition or data reshaping) may be used to convert the mask tensor into the layout required in backpropagation, for use as the reverse mask tensor. In other implementations, the mask tensor generation process used in forward propagation may be repeated once during backpropagation to generate the reverse mask tensor; the difference is that the mask computation along the Ci direction is performed in forward propagation, while the mask computation along the Co direction is performed in backpropagation.
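The dimension transformation mentioned here can be sketched as a simple axis permutation. The mask shape below is hypothetical, and swapping the Co and Ci axes is one plausible transposition consistent with the (Co, Kh, Kw, Ci) → (Ci, Kh, Kw, Co) layouts described above.

```python
import numpy as np

Co, Kh, Kw, Ci = 8, 3, 3, 16
rng = np.random.default_rng(1)
# Hypothetical forward mask laid out as (Co, Kh, Kw, Ci), matching the
# forward-propagation weight layout.
fwd_mask = (rng.random((Co, Kh, Kw, Ci)) < 0.5).astype(np.int8)

# Backpropagation uses the (Ci, Kh, Kw, Co) layout, so permute the axes
# to obtain the reverse mask tensor.
rev_mask = np.transpose(fwd_mask, (3, 1, 2, 0))

print(rev_mask.shape)                                # → (16, 3, 3, 8)
print(fwd_mask[2, 1, 0, 5] == rev_mask[5, 1, 0, 2])  # same element → True
```

No mask values change; only the memory layout is rearranged to match the accumulation direction used in backpropagation.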
继续图7,在步骤730中,基于神经网络参数梯度更新神经网络参数。Continuing with FIG. 7, in step 730, the neural network parameters are updated based on the neural network parameter gradients.
本披露实施例的稀疏化训练可以包括若干训练阶段,例如无掩码阶段、掩码调整阶段和掩码固定阶段。后面将结合附图详细描述各个阶段的处理。The sparsification training of embodiments of the present disclosure may include several training stages, such as a maskless stage, a mask adjustment stage, and a mask fixation stage. The processing of each stage will be described in detail later with reference to the accompanying drawings.
Depending on the stage that the sparsification training is in, the update to the neural network parameters may also differ.
In some embodiments, updating the neural network parameters may mean updating the unsparsified neural network parameters. For example, in the mask adjustment stage, the unsparsified neural network parameters are updated in each iteration. Further, in the mask adjustment stage, every K (K ≥ 1) iterations, an updated mask tensor may be generated based on the updated unsparsified neural network parameters, so that the mask tensor is optimized during training and performance is improved.
In other embodiments, updating the neural network parameters may mean updating the sparsified neural network parameters. For example, in the mask-fixed stage, since the mask tensor is already fixed, the sparsity pattern of the neural network parameters is fixed; that is, the valid data elements among the neural network parameters are fixed. The update of the neural network parameters may therefore update only the valid data elements, that is, update the sparsified neural network parameters. In one implementation, updating the neural network parameters may include: sparsifying the neuron gradients using the mask tensor; and updating the sparsified neural network parameters based on the sparsified neuron gradients.
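A minimal sketch of the update rule just described, assuming plain SGD: the gradient is first sparsified with the fixed mask, so only the valid (mask = 1) positions of the sparsified parameters are ever updated. The array values and learning rate are illustrative only.

```python
import numpy as np

def update_sparse(w_sparse, grad, mask, lr=0.1):
    # sparsify the gradient with the fixed mask, then apply a plain SGD step;
    # positions where mask == 0 receive no update and stay zero
    return w_sparse - lr * (grad * mask)

w = np.array([0.5, 0.0, -0.3, 0.0])   # sparsified parameters (zeros are masked out)
g = np.array([0.2, 0.9, -0.1, 0.4])   # raw gradient before sparsification
m = np.array([1, 0, 1, 0])            # fixed mask tensor
w_new = update_sparse(w, g, m)        # → [0.48, 0.0, -0.29, 0.0]
```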
The mask tensor fixed in the mask-fixed stage may be the mask tensor finally determined in the previous training stage (for example, the mask adjustment stage). Depending on the form of the mask tensor, there may be different ways to generate or update it.
When the mask tensor is a one-dimensional tensor, that is, a mask vector, the mask vector can only mask the parameters along a single dimension. The mask tensor may be generated based on the unsparsified neural network parameters. For example, from every m data elements along a specified dimension of the neural network parameters, n data elements with larger absolute values are selected as valid data elements, where m > n; and the mask tensor is generated based on the positions of these n valid data elements among the m data elements. In some implementations, the specified dimension may be the input channel dimension (Ci). Specifically, this embodiment divides the parameters into multiple intervals in units of a specific number m of parameters, and the parameters within each interval are sorted by absolute value. The elements of the mask tensor whose positions correspond to the n parameters with larger absolute values in each interval are set to 1, and the elements whose positions correspond to the m−n parameters with smaller absolute values are set to 0. The reason is that the mask adjustment parameters with larger absolute values carry more salient features and are more worth keeping for further computation. There are many ways to select the mask adjustment parameters with larger absolute values, and the present disclosure is not limited in this respect.
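The n-out-of-m selection just described can be sketched as follows (a hypothetical NumPy helper, not the disclosed hardware implementation): every group of m consecutive elements along the chosen dimension is examined, and the mask marks the n elements with the largest absolute values as valid.

```python
import numpy as np

def make_mask(params, m=4, n=2):
    # params: 1-D parameter array whose length is a multiple of m
    groups = params.reshape(-1, m)
    order = np.argsort(-np.abs(groups), axis=1)       # indices by descending |value|
    mask = np.zeros_like(groups, dtype=np.int8)
    np.put_along_axis(mask, order[:, :n], 1, axis=1)  # mark the n largest per group
    return mask.reshape(params.shape)

mask = make_mask(np.array([0.1, -0.9, 0.5, 0.2, -0.4, 0.3, 0.0, 0.8]))
# first group keeps -0.9 and 0.5; second group keeps -0.4 and 0.8
```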
FIG. 9 is a schematic diagram of an exemplary mask vector update, illustrating the aforementioned mask vector update by way of example. The figure shows a parameter vector 901 with 64 parameters in total, namely b01 to b64. In this step, each element value of the mask vector is updated so as to retain the mask adjustment parameters with larger absolute values and mask out those with smaller absolute values. The updated mask adjustment parameters are divided into multiple intervals in units of every 4 mask adjustment parameters (that is, m is 4). As shown in the figure, b01 to b04 form the first interval 902, b05 to b08 form the second interval 903, and b61 to b64 form the sixteenth interval 917. The mask adjustment parameters within each interval are then sorted by absolute value. Suppose that in the first interval 902 the absolute values are ordered b02 > b01 > b04 > b03, in the second interval 903 they are ordered b07 > b05 > b06 > b08, and in the sixteenth interval 917 they are ordered b64 > b63 > b61 > b62. Then, the elements of the mask vector whose positions correspond to the top 2 (that is, n is 2) mask adjustment parameters with larger absolute values in each interval are set to 1, and the elements whose positions correspond to the 2 (that is, m−n = 2) mask adjustment parameters with smaller absolute values are set to 0. Taking the first interval 902 as an example, the elements of the mask vector corresponding to b02 and b01 are set to 1, and the elements corresponding to b04 and b03 are set to 0. Each interval is adjusted in this way, finally yielding the updated mask vector 918. The updated mask vector 918 retains the updated mask adjustment parameters with larger absolute values and masks out those with smaller absolute values. In summary, every 4 mask adjustment parameters form an interval, and the element values of the mask vector are updated in a 2-out-of-4 manner for each interval.
This embodiment fully sorts the mask adjustment parameters within each interval to identify the n with larger absolute values and the m−n with smaller absolute values, but the present disclosure does not necessarily require a full sort; it suffices to identify the n with larger absolute values and the m−n with smaller absolute values, while the ordering within the n larger ones and within the m−n smaller ones is not necessary information. Taking the first interval 902 as an example, the present disclosure only needs to determine that b01 and b02 are the 2 with larger absolute values and that b03 and b04 are the 2 with smaller absolute values; the relative magnitudes of b01 versus b02 and of b03 versus b04 are not critical, and the sorting may be omitted to save computing resources.
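One way to realize the observation above in a NumPy-style environment (an assumed illustration, not the disclosed circuit): `argpartition` only separates the n largest-magnitude positions from the rest, without ordering either side, so no full sort of the interval is performed.

```python
import numpy as np

def topn_flags(group, n=2):
    # separate the n largest-|value| positions without fully sorting the interval
    keep = np.argpartition(-np.abs(group), n - 1)[:n]
    flags = np.zeros(len(group), dtype=np.int8)
    flags[keep] = 1
    return flags

flags = topn_flags(np.array([0.1, -0.9, 0.5, 0.2]))   # keeps -0.9 and 0.5
```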
If the mask tensor is multi-dimensional (for example, two-dimensional), the training data may be subjected to a product-sum computation with each masked parameter tensor to obtain a parameter evaluation value. The purpose of obtaining the parameter evaluation value is to measure how much information is retained after masking by the mask tensor. If the parameter evaluation value is high, not much information has been lost to the mask; the mask tensor reduces the computation load while retaining most of the information and is a high-quality mask tensor. Conversely, if the parameter evaluation value is low, too much information is lost after masking, and the mask tensor is not a high-quality one.
Specifically, a two-dimensional mask tensor may be determined as follows: a specific number of two-dimensional mask tensors are preset, and then one of these preset two-dimensional mask tensors is selected as the mask tensor to be used. Each dimension of these two-dimensional mask tensors includes m elements, of which n elements are 1 and m−n elements are 0, with m > n. As mentioned earlier, under the condition m = 4, n = 2, there are 90 such 4×4 mask matrices in total, so one of these 90 mask matrices is to be selected as the mask tensor.
Selecting one from this specific number (for example, 90) of two-dimensional mask tensors may include: masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain a masked parameter tensor; performing, based on each masked parameter tensor, a product-sum operation on the training data of the neural network layer to obtain a parameter evaluation value; and selecting the two-dimensional mask tensor that produces the largest of all the parameter evaluation values as the chosen mask tensor. In some implementations, the two specified dimensions may be the input channel dimension and the output channel dimension. The product-sum operation may also be regarded as a kind of convolution operation, except that it does not accumulate along the input channel dimension but only along the depth direction; it may therefore also be called a depthwise convolution operation, where the depth direction is the Kw×Kh dimensions.
FIG. 10 shows an exemplary product-sum computation process. Suppose the training data matrix 1001 is one of the training data in the training set. It would originally be computed with the channel matrix 801 of FIG. 8, but it is instead subjected to a product-sum computation with the masked parameter matrix 803 in order to gauge the amount of information remaining after masking. Such a product-sum computation can be performed in several ways; for example, the corresponding elements of the training data matrix 1001 and the masked parameter matrix 803 are multiplied and the absolute values of the products are summed to obtain the parameter evaluation value S1, that is:
S1 = |d31·a31| + |d41·a41| + |d12·a12| + |d42·a42| + |d13·a13| + |d23·a23| + |d24·a24| + |d34·a34|
As another example, the absolute values of the corresponding elements of the training data matrix 1001 and the masked parameter matrix 803 are multiplied and then summed to obtain the parameter evaluation value S2, that is:
S2 = |d31|·|a31| + |d41|·|a41| + |d12|·|a12| + |d42|·|a42| + |d13|·|a13| + |d23|·|a23| + |d24|·|a24| + |d34|·|a34|
The parameter evaluation value reflects the result of an absolute-value-like computation. The parameter evaluation value S1 or S2 indicates how much information is retained after masking; the higher the value, the more information is retained. In one application scenario, either the S1 or the S2 computation may be chosen, while in another application scenario the S1 and S2 computations may be used simultaneously; the present disclosure imposes no limitation in this respect.
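The two evaluation values can be sketched directly from the formulas above (with toy matrices, not the actual matrices of FIG. 10). Note that for real-valued data |d·a| = |d|·|a|, so S1 and S2 coincide mathematically; the two operation orders could differ only through rounding behavior in fixed-point hardware.

```python
import numpy as np

d = np.array([[1.0, -2.0], [0.5, 3.0]])   # toy training data block
a = np.array([[0.0, 4.0], [-1.0, 0.0]])   # toy masked parameter block (zeros masked out)

s1 = float(np.sum(np.abs(d * a)))          # multiply first, then take absolute values
s2 = float(np.sum(np.abs(d) * np.abs(a)))  # take absolute values first, then multiply
```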
All the mask tensors are applied and a parameter evaluation value is obtained for each. In the preceding example, this means that all 90 of the 4×4 mask matrices are applied and 90 parameter evaluation values are obtained. The mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, that is, the parameter mask tensor. There are many ways to select the largest parameter evaluation value; for example, all the parameter evaluation values may be sorted by magnitude to obtain the largest one, or a simple two-input comparator may be used, keeping the larger value for comparison with the next parameter evaluation value; after all 90 parameter evaluation values have been compared, the one remaining is the largest. If multiple mask tensors share the same largest parameter evaluation value, one of them may be selected based on a specific rule or hardware characteristic, for example the first in order, the last in order, the first retained, the last retained, or one chosen at random.
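The candidate enumeration and selection can be sketched as follows for m = 4, n = 2. The enumeration confirms the count of 90 stated above; `pick_mask` is a hypothetical helper that scores each candidate with the S1-style product-sum and keeps the highest-scoring one (tie-breaking here is simply "first encountered").

```python
import itertools
import numpy as np

def candidate_masks(m=4, n=2):
    # all m x m 0/1 matrices with exactly n ones in every row and every column
    rows = [r for r in itertools.product((0, 1), repeat=m) if sum(r) == n]
    for combo in itertools.product(rows, repeat=m):
        mat = np.array(combo, dtype=np.int8)
        if (mat.sum(axis=0) == n).all():
            yield mat

masks = list(candidate_masks())   # 90 candidates for m = 4, n = 2

def pick_mask(d, w):
    # keep the candidate whose masked product-sum retains the most information
    return max(masks, key=lambda mk: float(np.sum(np.abs(d * w * mk))))
```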
The mask tensor with the largest parameter evaluation value is the mask tensor that retains the most information, and this embodiment uses that mask tensor as the parameter mask tensor.
In this embodiment, the mask tensor may be updated in each iteration or in each generation of training. If, during training, the neural network parameters are updated after each training sample, the mask tensor is preferably updated in each iteration; if the neural network parameters are updated in each iteration, the parameter mask tensor is preferably updated only at the end of each generation of training.
Those skilled in the art will understand that, although the generation of the mask tensor is described above in terms of the update process, when the mask tensor is generated for the first time it may be generated in a similar manner, except that the neural network parameters on which it is based will differ. Depending on the stages included in the training process, when the mask tensor is first generated, the underlying neural network parameters may be randomly initialized parameters or may be neural network parameters determined after training in the maskless stage.
As mentioned earlier, the sparsification training of embodiments of the present disclosure may include several training stages, for example, a maskless stage, a mask adjustment stage, and a mask-fixed stage. The processing of each stage is described in detail below with reference to the accompanying drawings.
FIG. 11 shows an exemplary flowchart including a maskless stage and a mask adjustment stage. In the maskless stage, the processing device 203 only trains the neural network parameters, that is, it does not apply mask sparsification to them; only after the maskless stage ends and the mask adjustment stage begins are the parameters trained while the mask tensor is simultaneously updated.
As shown in FIG. 11, in step 1101, the control circuit 610 first sets entry into the maskless stage. In the maskless stage, this embodiment does not mask the neural network parameters; all elements of the parameters participate in training, and the parameter values may be randomly generated at the start of training. For ease of identification, the parameters participating in training during the maskless stage are called unmasked parameters.
In step 1102, the arithmetic circuit 630 computes the value of the loss function based on the unmasked parameters in forward propagation. In this step, the arithmetic circuit 630 computes the loss function in the existing manner: in forward propagation, the input training samples are computed through each layer of the neural network, the input feature maps are progressively extracted into abstract features, and the loss function is computed from the forward propagation result and the ground truth.
In step 1103, the arithmetic circuit 630 computes, in backpropagation, the partial derivatives of the loss function with respect to the unmasked parameters. The arithmetic circuit 630 employs gradient descent and computes, via the chain rule, the partial derivative of the loss function with respect to each unmasked parameter.
In step 1104, the arithmetic circuit 630 updates the unmasked parameters based on the partial derivatives and uses the updated unmasked parameters as the initial values of the mask adjustment parameters. First, the arithmetic circuit 630 updates the unmasked parameters of the entire neural network according to each unmasked parameter's influence on the error, multiplied by the step size. In this embodiment, the arithmetic circuit 630 may likewise update the unmasked parameters based on the partial derivatives for each training sample or in each iteration.
This embodiment may repeat steps 1102, 1103, and 1104 over a specific number of generations of training to update the unmasked parameters multiple times. After the final update, the updated unmasked parameters serve as the initial values of the mask adjustment parameters in the next stage.
In step 1105, the control circuit 610 sets entry into the mask adjustment stage, that is, it begins masking some of the parameters with the mask tensor. During training, the prior art trains only on all parameters (such as weights and biases) and usually does not mask them. The purpose of masking the parameters in this embodiment is to reduce parameter participation already during the training stage, avoiding overfitting and reducing the computation load, while also letting the mask tensor be updated along with the parameters during training so as to obtain a more ideal mask tensor. At the beginning of the mask adjustment stage, as stated earlier, the initial values of the mask adjustment parameters are the unmasked parameters as finally updated in the maskless stage, and the mask tensor may be obtained based on those finally updated unmasked parameters in the same manner as the mask tensor generation described above, which will not be repeated here.
In step 1106, the mask adjustment parameters are masked based on the mask tensor in forward propagation to compute the value of the loss function. In step 1107, the partial derivatives of the loss function with respect to the mask adjustment parameters are computed in backpropagation. In step 1108, the mask adjustment parameters are updated based on the partial derivatives. In step 1109, the mask tensor is updated based on the updated mask adjustment parameters. For these steps, reference may be made to the foregoing description in conjunction with FIG. 7, which will not be repeated here.
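Steps 1106 to 1109 can be illustrated with a toy NumPy regression. Everything here is hypothetical — the model, data, learning rate, and update period K = 10 — and the gradient of the masked loss is applied straight through the mask to the dense parameters, which is just one simple choice among the update variants this disclosure describes.

```python
import numpy as np

def make_mask(w, m=4, n=2):
    # 2-out-of-4 mask: keep the largest-|value| half of every group of 4
    groups = w.reshape(-1, m)
    mask = np.zeros_like(groups, dtype=np.int8)
    order = np.argsort(-np.abs(groups), axis=1)
    np.put_along_axis(mask, order[:, :n], 1, axis=1)
    return mask.reshape(w.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))            # toy training inputs
y = x @ rng.normal(size=(8,))           # regression targets
w = rng.normal(size=(8,))               # dense (unsparsified) parameters
mask = make_mask(w)

def masked_loss(w, mask):
    r = x @ (w * mask) - y
    return float(r @ r) / len(x)

loss_before = masked_loss(w, mask)
for step in range(200):
    r = x @ (w * mask) - y              # step 1106: masked forward pass
    grad = 2 * x.T @ r / len(x)         # step 1107: loss gradient (straight through the mask)
    w -= 0.05 * grad                    # step 1108: update the dense parameters
    if step % 10 == 9 and step < 150:
        mask = make_mask(w)             # step 1109: refresh the mask every K = 10 iterations
loss_after = masked_loss(w, mask)
```

Because masked-out positions still receive gradient, a weight that grows large can re-enter the mask at the next refresh, which is what allows the mask to adapt during this stage.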
This embodiment does not limit the number of generations of training performed in the maskless stage and the mask adjustment stage; those skilled in the art may arrange them according to the specific situation, and the numbers of generations of training in the two stages need not be the same.
Another embodiment of the present disclosure, likewise based on the aforementioned hardware environment, provides a scheme for sparsification training of a neural network model. It differs from the preceding embodiment in that training is divided into three stages: a maskless stage, a mask adjustment stage, and a mask-fixed stage. In the maskless stage, the processing device 203 only trains the parameters without masking them; in the mask adjustment stage, the processing device 203 uses the updated unmasked parameters as initial values and trains the parameters and the mask tensor simultaneously; in the mask-fixed stage, the processing device 203 uses the mask adjustment parameters and the mask tensor as updated in the mask adjustment stage as initial values and continues training the parameters without changing or updating the mask tensor.
The processes performed in the maskless stage and the mask adjustment stage of this embodiment are as shown in FIG. 11 and are therefore not repeated. After the mask-fixed stage is entered, the flow is as shown in FIG. 12.
In step 1201, the control circuit 610 sets entry into the mask-fixed stage. In the mask-fixed stage, the control circuit 610 uses the mask adjustment parameters as updated in the mask adjustment stage as the initial values of the parameters of this stage (hereinafter called the mask-fixed parameters). In this embodiment, the mask tensor was fully updated during the mask adjustment stage, so it is no longer updated in this stage; instead, the mask-fixed parameters are masked based on the mask tensor as finally updated in the mask adjustment stage, and training of the mask-fixed parameters continues.
This embodiment repeats the following steps over at least one generation of training.
In step 1202, the arithmetic circuit 630 masks the mask-fixed parameters in forward propagation based on the mask tensor as updated in the mask adjustment stage to compute the value of the loss function.
In step 1203, the arithmetic circuit 630 computes, in backpropagation, the partial derivatives of the loss function with respect to the mask-fixed parameters.
In step 1204, the update module 64 updates the mask-fixed parameters based on the partial derivatives.
For the above steps, reference may be made to the foregoing description in conjunction with FIG. 7, which will not be repeated here.
This embodiment divides training into three stages. In the maskless stage, no mask tensor masks the parameters; only the parameters are trained, so as to accelerate their convergence. In the mask adjustment stage, since the initial parameter values are no longer randomly generated but are the already-trained unmasked parameters, an ideal mask tensor can be obtained quickly. After the mask tensor has been updated, the mask-fixed stage is entered and the updated mask tensor is used to continue training the parameters; the finally trained parameters will better match the mask tensor.
In summary, those skilled in the art will understand that, when the present disclosure performs sparsification training on a neural network model, several implementations as shown in FIG. 13 are possible.
Implementation 1301 has only a mask adjustment stage. The initial parameter values W0 are randomly generated, the initial mask tensor M0 is determined based on the initial parameter values W0, and the parameters are trained while the mask matrix is simultaneously updated, to obtain the trained parameters Wf and the updated mask tensor Mf.
Implementation 1302 has only a maskless stage and a mask adjustment stage. In the maskless stage only the parameters are trained; the initial parameter values W0 are randomly generated, and the updated parameters W1 are obtained after training. In the mask adjustment stage the parameters are trained while the mask matrix is simultaneously updated; the initial parameter values of this stage are the updated parameters W1, and the initial mask tensor M0 is obtained using the updated parameters W1, finally yielding the trained parameters Wf and the updated mask tensor Mf.
Implementation 1303 has only a mask adjustment stage and a mask-fixed stage. In the mask adjustment stage the initial parameter values W0 are randomly generated, the initial mask tensor M0 is determined based on the initial parameter values W0, and the parameters are trained while the mask matrix is simultaneously updated, to obtain the updated parameters W1 and the updated mask tensor Mf. In the mask-fixed stage training continues with the parameters masked by the updated mask tensor Mf; the initial parameter values of this stage are the updated parameters W1, finally yielding the trained parameters Wf.
Implementation 1304 has a maskless stage, a mask adjustment stage, and a mask-fixed stage. In the maskless stage only the parameters are trained; the initial parameter values W0 are randomly generated, and the updated parameters W1 are obtained after training. In the mask adjustment stage the parameters are trained while the mask matrix is simultaneously updated; the initial parameter values of this stage are the updated parameters W1, and the initial mask tensor M0 is obtained using the updated parameters W1, finally yielding the updated parameters W2 and the updated mask tensor Mf. In the mask-fixed stage training continues with the parameters masked by the updated mask tensor Mf; the initial parameter values of this stage are the updated parameters W2, finally yielding the trained parameters Wf.
In addition to a maskless stage, a mask adjustment stage, and a mask-fixed stage, implementation 1305 also has other training stages (shown with dashed lines) between the maskless stage and the mask adjustment stage and between the mask adjustment stage and the mask-fixed stage. In the maskless stage only the parameters are trained; the initial parameter values W0 are randomly generated, and the updated parameters W1 are obtained after training. Thereafter, any training stage disclosed or not disclosed in the present disclosure may follow, in which the parameters are trained or the mask matrix is updated. Supposing that this stage is a mask-fixed stage, the initial parameter values of this stage are the updated parameters W1, and the initial mask tensor M0 is obtained using the updated parameters W1, to obtain the updated parameters W2.
The mask adjustment stage is then entered, and the parameters are trained while the mask matrix is simultaneously updated; the initial parameter values of this stage are the updated parameters W2, while the initial mask tensor remains the mask tensor M0, to obtain the updated parameters W3 and the updated mask tensor M1. Thereafter, any stage disclosed or not disclosed in the present disclosure may again follow, in which the parameters are trained or the mask matrix is updated. Supposing that this stage is a parameter-fixed stage, that is, the parameters are fixed and not trained and only the mask tensor is trained, the initial parameter values of this stage are the updated parameters W3, and the initial mask tensor is the updated mask tensor M1, to obtain the updated mask tensor Mf.
Finally, in the mask-fixed stage, training continues with the parameters masked by the updated mask tensor Mf; the initial parameter values of this stage are the updated parameters W3, finally yielding the trained parameters Wf.
The various implementations shown in FIG. 13 are merely examples; after referring to the present disclosure, those skilled in the art can extend them to other implementations without creative effort, and such implementations all fall within the scope of the present disclosure.
The present disclosure does not limit the number of generations of training performed in each stage of the various implementations; those skilled in the art may arrange this according to the specific situation, and the number of generations of training performed in each stage need not be the same.
The foregoing embodiments need not necessarily execute all of the preset specific number of generations of training. The control circuit 610 may further determine whether, over two consecutive generations of training, the percentage of all element values of the parameter mask tensor that remain unchanged reaches a threshold. If so, the training result has essentially converged and further training would yield only a limited improvement in accuracy, so the mask adjustment stage is ended and training is completed. Such a threshold is generally set above 70%; that is, training stops once the percentage of unchanged element values of the parameter mask tensor exceeds 70%. The present disclosure does not limit the threshold, which may be 80%, 90%, 100%, or any other percentage.
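The convergence test described here can be sketched as a small helper (hypothetical, NumPy-based): it measures the fraction of mask elements that kept their value across two consecutive generations of training and compares it against the threshold.

```python
import numpy as np

def mask_converged(prev_mask, new_mask, threshold=0.70):
    # fraction of parameter-mask elements left unchanged between two generations
    unchanged = float(np.mean(prev_mask == new_mask))
    return unchanged >= threshold

# 3 of 4 elements unchanged → 75% ≥ 70%, so training may stop
stop = mask_converged(np.array([1, 1, 0, 0]), np.array([1, 1, 0, 1]))
```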
In the embodiments of this disclosure, to reduce the overhead of the sparsification and de-sparsification processes, different sparse data flow structures may be used for the relevant operations at different stages of training, so as to obtain optimal computation and I/O performance.
In some embodiments, in the mask adjustment stage, the mask tensor may be updated based on the updated neural network parameters, and the results of the update process may include the sparsification results of the neural network parameters (for example, sparsified weights) as well as the mask tensor. This mask tensor may be used to sparsify the training data. Subsequent operations may then be performed on the sparsified neural network parameters and the sparsified training data. During backpropagation in the mask adjustment stage, the neuron gradients and the neural network parameter gradients may be computed based on the current unsparsified neural network parameters, and the unsparsified neural network parameters updated accordingly. Alternatively, in backpropagation in the mask adjustment stage, the neural network parameters may be sparsified based on the mask tensor used in forward propagation, the neuron gradients and the neural network parameter gradients computed based on the sparsified neural network parameters, and the unsparsified neural network parameters updated accordingly. Sparsification during backpropagation is covered in the preceding description and is not repeated here.
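The forward part of one mask-adjustment iteration described above — deriving an updated mask from the current weights, then operating on sparsified weights and sparsified training data — can be sketched as follows. The 2-out-of-4 grouping along the last axis and the dot product standing in for the network's forward computation are illustrative assumptions.

```python
import numpy as np

def n_of_m_mask(w, m=4, n=2):
    """Illustrative n-out-of-m rule: within every m consecutive weights,
    keep the n with the largest absolute values (mask element 1),
    zeroing out the rest (mask element 0)."""
    groups = w.reshape(-1, m)
    mask = np.zeros_like(groups)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(w.shape)

# One forward pass of the mask adjustment stage (sketch):
w = np.array([0.1, -2.0, 0.3, 1.5, 0.0, 0.7, -0.2, 0.05])
x = np.ones_like(w)              # stand-in for one training-data vector
mask = n_of_m_mask(w)            # mask updated from the current weights
y = np.dot(w * mask, x * mask)   # sparsified weights times sparsified data
```

Only the unmasked positions contribute to `y`, which is what allows a sparse arithmetic unit to skip the corresponding multiply-accumulates.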
In other embodiments, in the mask-fixing stage, the mask tensor is fixed and need not be updated in real time. The fixed mask tensor can therefore be stored in the storage circuit for subsequent use. The fixed mask tensors may include the forward mask tensor used in forward propagation and the reverse mask tensor used in backpropagation. The neural network parameters may be stored under different schemes.
In one implementation, the storage circuit may store the unsparsified neural network parameters. In this case, in forward propagation, the neural network parameters must be sparsified with the stored mask tensor. In backpropagation, the unsparsified neural network parameters participate directly in the neuron gradient computation (for example, formula (1) above), and the unsparsified neural network parameters are updated and stored in the storage circuit again. Alternatively, in backpropagation, the reverse mask tensor stored in the storage circuit may be used to sparsify the unsparsified neural network parameters, the neuron gradients computed on that basis, and the unsparsified neural network parameters updated accordingly.
In another implementation, the storage circuit may store the sparsified neural network parameters. In this case, in forward propagation, the sparsified neural network parameters can participate directly in the forward computation, with no further sparsification needed. In backpropagation, the sparsified neural network parameters must be updated, so the mask tensor stored in the storage circuit may be used to sparsify the neural network parameter gradients, after which the sparsified neural network parameters are updated. For the neuron gradient computation during backpropagation, one may choose whether or not to apply sparsification. When sparsification is not applied, the sparsified neural network parameters must be de-sparsified, and the neuron gradients computed from the de-sparsified neural network parameters. When sparsification is applied, the reverse mask tensor stored in the storage circuit may be used to re-sparsify the de-sparsified neural network parameters, and the neuron gradients computed on that basis.
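For the second storage scheme above (sparsified parameters in storage), the parameter update in backpropagation can be sketched as follows. The function name, gradient values, and learning rate are illustrative assumptions.

```python
import numpy as np

def update_stored_sparse_params(w_sparse, grad_w, mask, lr=0.1):
    """Mask-fixing stage with sparsified parameters in storage: the
    parameter gradient is first sparsified with the stored mask tensor,
    so pruned positions remain exactly zero after the update."""
    return w_sparse - lr * (grad_w * mask)
```

Because the gradient is masked before the subtraction, the stored parameters stay in sparsified form and can be written back to the storage circuit without a separate re-sparsification step.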
Another embodiment of this disclosure is a computer-readable storage medium storing computer program code for sparsification training of a neural network model; when the computer program code is run by a processor, the methods of the foregoing embodiments are performed. In some implementation scenarios, the integrated units described above may be implemented in the form of software program modules. If implemented as a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solutions of this disclosure are embodied as a software product (for example, a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (for example, a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
In the foregoing embodiments, after training is complete, the computing apparatus 201 uses the updated parameter mask tensor to mask the trained parameters during inference, so as to control which region of the feature map input to the neural network model is processed. On the one hand, this achieves the expected accuracy; on the other hand, it reduces the amount of computation during inference, fulfilling the purpose of sparsification.
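The inference-time use of the mask described above can be sketched as a masked matrix product. The shapes, values, and function name are illustrative assumptions.

```python
import numpy as np

def sparse_inference(feature_map, w_trained, param_mask):
    """Apply the updated parameter mask tensor to the trained parameters
    before the product, so masked weights contribute nothing to the
    output and the corresponding multiply-accumulates can be skipped
    by sparsity-aware hardware."""
    return feature_map @ (w_trained * param_mask)
```

The result equals a dense product with the masked positions excluded; the compute saving comes from a sparse arithmetic unit never issuing those operations.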
Depending on the application scenario, the electronic device or apparatus of this disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of this disclosure may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare.
Further, the electronic device or apparatus of this disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the solutions of this disclosure may be applied to a cloud device (for example, a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a webcam). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that, according to the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative operation of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, this disclosure presents some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of this disclosure are not limited by the order of the described actions. Accordingly, based on the disclosure or teachings herein, those skilled in the art will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in this disclosure may be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for the realization of one or more solutions of this disclosure. In addition, depending on the solution, the descriptions of the various embodiments place different emphases. In view of this, for parts not described in detail in one embodiment of this disclosure, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teachings herein, those skilled in the art will understand that several embodiments disclosed herein may also be realized in other ways not disclosed in this document. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical function, but other divisions are possible in an actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above with reference to the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In this disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of this disclosure. Furthermore, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit, or each unit may exist physically on its own.
In other implementation scenarios, the integrated units described above may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) may be implemented by a suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including a magnetic storage medium, a magneto-optical storage medium, and the like), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and so on.
The foregoing may be better understood in light of the following clauses:
Clause 1. A method, performed by a data processing apparatus, of sparsification training of a neural network model, comprising:
in forward propagation, sparsifying at least neural network parameters based on a mask tensor, so as to compute the value of a loss function;
in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
updating the neural network parameters based on the neural network parameter gradients.
Clause 2. The method according to clause 1, further comprising:
in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and updating the neural network parameters based on the neural network parameter gradients.
Clause 3. The method according to clause 2, further comprising:
de-sparsifying the sparsified neural network parameters to obtain the unsparsified neural network parameters.
Clause 4. The method according to clause 1, further comprising:
in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and updating the neural network parameters based on the neuron gradients.
Clause 5. The method according to clause 4, further comprising:
in backpropagation, sparsifying the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
Clause 6. The method according to any one of clauses 1-3, wherein the mask tensor is a one-dimensional tensor.
Clause 7. The method according to clause 6, wherein the one-dimensional tensor sparsifies the input-channel dimension of the neural network parameters.
Clause 8. The method according to any one of clauses 1-5, wherein the mask tensor is a two-dimensional tensor.
Clause 9. The method according to clause 8, wherein the two-dimensional tensor sparsifies the input-channel dimension and the output-channel dimension of the neural network parameters.
Clause 10. The method according to clause 5, wherein, when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by performing a dimension transformation on the mask tensor.
Clause 11. The method according to clause 1, wherein updating the neural network parameters comprises updating the unsparsified neural network parameters.
Clause 12. The method according to clause 11, further comprising:
generating the mask tensor based on the updated unsparsified neural network parameters.
Clause 13. The method according to clause 12, wherein, when the mask tensor is a one-dimensional tensor, the method generates the mask tensor as follows:
from every m data elements along a specified dimension of the neural network parameters, selecting the n data elements with the largest absolute values as valid data elements, where m > n; and
determining the mask tensor based on the positions of the n valid data elements among the m data elements.
Clause 14. The method according to clause 12, wherein, when the mask tensor is a two-dimensional tensor, the method generates the mask tensor as follows:
presetting a specific number of two-dimensional mask tensors, each dimension of which includes m elements, of which n elements are 1 and m−n elements are 0, where m > n;
masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor, to obtain masked parameter tensors; and
performing a product-and-sum operation on the training data of the neural network based on each masked parameter tensor, to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that produces the largest of all the parameter evaluation values as the mask tensor.
Clause 15. The method according to any one of clauses 1-14, wherein the method is performed over multiple iterations in a mask adjustment stage of the sparsification training.
Clause 16. The method according to clause 15, wherein the mask adjustment stage further comprises:
determining whether the percentage of elements of the mask tensor whose values remain unchanged over multiple consecutive training iterations reaches a threshold; and
if so, ending the mask adjustment stage.
Clause 17. The method according to clause 16, wherein the threshold is one of 80%, 90%, and 100%.
Clause 18. The method according to any one of clauses 1-10, wherein the method is performed over multiple iterations in a mask-fixing stage of the sparsification training, and the mask tensor is fixed as the mask tensor finally determined in a preceding stage.
Clause 19. The method according to clause 18, wherein updating the neural network parameters comprises updating the sparsified neural network parameters.
Clause 20. The method according to clause 19, wherein updating the neural network parameters further comprises:
sparsifying the neuron gradients with the mask tensor; and
updating the sparsified neural network parameters based on the sparsified neuron gradients.
Clause 21. The method according to any one of clauses 18-20, wherein, during the mask-fixing stage, the fixed mask tensor and the sparsified neural network parameters are stored.
Clause 22. The method according to any one of clauses 18-20, wherein, during the mask-fixing stage, the fixed mask tensor and the unsparsified neural network parameters are stored.
Clause 23. A computer-readable storage medium storing computer program code for sparsification training of a neural network model, wherein, when the computer program code is run by a processing apparatus, the method of any one of clauses 1 to 22 is performed.
Clause 24. A data processing apparatus comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein:
the control circuit is configured to control the storage circuit and the arithmetic circuit to perform sparsification training on a neural network model;
the storage circuit is configured to store information including at least neural network parameters and a mask tensor; and
the arithmetic circuit is configured to perform the following operations under the control of the control circuit:
in forward propagation, sparsifying at least the neural network parameters based on the mask tensor, so as to compute the value of a loss function;
in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
updating the neural network parameters based on the neural network parameter gradients.
Clause 25. The apparatus according to clause 24, wherein the arithmetic circuit is further configured to:
in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and
update the neural network parameters based on the neural network parameter gradients.
Clause 26. The apparatus according to clause 25, wherein the arithmetic circuit is further configured to:
de-sparsify the sparsified neural network parameters to obtain the unsparsified neural network parameters.
Clause 27. The apparatus according to clause 24, wherein the arithmetic circuit is further configured to:
in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and
update the neural network parameters based on the neuron gradients.
Clause 28. The apparatus according to clause 27, wherein the arithmetic circuit is further configured to:
in backpropagation, sparsify the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
Clause 29. The apparatus according to any one of clauses 24-26, wherein the mask tensor is a one-dimensional tensor.
Clause 30. The apparatus according to clause 29, wherein the one-dimensional tensor sparsifies the input-channel dimension of the neural network parameters.
Clause 31. The apparatus according to any one of clauses 24-28, wherein the mask tensor is a two-dimensional tensor.
Clause 32. The apparatus according to clause 31, wherein the two-dimensional tensor sparsifies the input-channel dimension and the output-channel dimension of the neural network parameters.
Clause 33. The apparatus according to clause 28, wherein, when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by the arithmetic circuit performing a dimension transformation on the mask tensor.
Clause 34. The apparatus according to clause 24, wherein the arithmetic circuit is further configured to:
update the unsparsified neural network parameters.
Clause 35. The apparatus according to clause 34, wherein the arithmetic circuit is further configured to:
generate the mask tensor based on the updated unsparsified neural network parameters.
Clause 36. The apparatus according to clause 35, wherein, when the mask tensor is a one-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
from every m data elements along a specified dimension of the neural network parameters, selecting the n data elements with the largest absolute values as valid data elements, where m > n; and
determining the mask tensor based on the positions of the n valid data elements among the m data elements.
Clause 37. The apparatus according to clause 35, wherein, when the mask tensor is a two-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
presetting a specific number of two-dimensional mask tensors, each dimension of which includes m elements, of which n elements are 1 and m−n elements are 0, where m > n;
masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor, to obtain masked parameter tensors; and
performing a product-and-sum operation on the training data of the neural network based on each masked parameter tensor, to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that produces the largest of all the parameter evaluation values as the mask tensor.
Clause 38. The apparatus according to any one of clauses 24-37, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask adjustment stage of the sparsification training.
Clause 39. The apparatus according to clause 38, wherein the arithmetic circuit is further configured to: in the mask adjustment stage, determine whether the percentage of elements of the mask tensor whose values remain unchanged over multiple consecutive training iterations reaches a threshold; and
if so, end the mask adjustment stage.
Clause 40. The apparatus according to clause 39, wherein the threshold is one of 80%, 90%, and 100%.
Clause 41. The apparatus according to any one of clauses 24-33, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask-fixing stage of the sparsification training, and the mask tensor is fixed as the mask tensor finally determined in a preceding stage.
Clause 42. The apparatus according to clause 41, wherein the arithmetic circuit is further configured to:
update the sparsified neural network parameters.
Clause 43. The apparatus according to clause 42, wherein the arithmetic circuit is further configured to:
sparsify the neuron gradients with the mask tensor; and
update the sparsified neural network parameters based on the sparsified neuron gradients.
Clause 44. The apparatus according to any one of clauses 41-43, wherein, during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the sparsified neural network parameters.
Clause 45. The apparatus according to any one of clauses 41-43, wherein, during the mask-fixing stage, the storage circuit is configured to store the fixed mask tensor and the unsparsified neural network parameters.
Clause 46. A chip comprising the data processing apparatus according to any one of clauses 24-45.
Clause 47. A board card comprising the chip according to clause 46.
The embodiments of this disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of this disclosure. The descriptions of the above embodiments are intended only to help in understanding the methods of this disclosure and their core ideas. Meanwhile, those of ordinary skill in the art, following the ideas of this disclosure, may make changes to the specific implementations and the scope of application. In summary, the contents of this description should not be construed as limiting this disclosure.

Claims (47)

  1. A method, performed by a data processing apparatus, for sparsification training of a neural network model, comprising:
    in forward propagation, performing sparsification on at least the neural network parameters based on a mask tensor, so as to compute a value of a loss function;
    in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
    updating the neural network parameters based on the neural network parameter gradients.
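By way of illustration only and not as claim language: the forward-propagation sparsification recited in claim 1 amounts to an element-wise product of the parameters with a binary mask tensor, broadcast over the masked dimension. A minimal sketch, assuming NumPy arrays and a hypothetical helper name `sparsify_forward`:

```python
import numpy as np

def sparsify_forward(params, mask):
    # Element-wise product with a binary mask; broadcasting lets a
    # one-dimensional mask cover, e.g., the input-channel dimension.
    return params * mask

# Toy (out_channels, in_channels) weights and a 1-D mask that keeps
# 2 of the 4 input channels.
w = np.array([[1.0, -2.0, 3.0, -4.0],
              [0.5, -0.5, 1.5, -1.5]])
mask = np.array([0.0, 1.0, 0.0, 1.0])
w_sparse = sparsify_forward(w, mask)
```

The loss value would then be computed from `w_sparse` in place of the dense weights during forward propagation.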
  2. The method of claim 1, further comprising:
    in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and
    updating the neural network parameters based on the neural network parameter gradients.
  3. The method of claim 2, further comprising:
    performing de-sparsification on the sparsified neural network parameters to obtain the unsparsified neural network parameters.
  4. The method of claim 1, further comprising:
    in backpropagation, computing the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and
    updating the neural network parameters based on the neuron gradients.
  5. The method of claim 4, further comprising:
    in backpropagation, performing sparsification on the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
  6. The method of any one of claims 1-3, wherein the mask tensor is a one-dimensional tensor.
  7. The method of claim 6, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  8. The method of any one of claims 1-5, wherein the mask tensor is a two-dimensional tensor.
  9. The method of claim 8, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  10. The method of claim 5, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by performing a dimension transformation on the mask tensor.
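For illustration only: one natural reading of the "dimension transformation" in claim 10, assuming the two masked dimensions are the input and output channels, is a transpose of the two-dimensional mask, since backpropagation multiplies by the transposed weight matrix. The helper name `reverse_mask` is hypothetical:

```python
import numpy as np

def reverse_mask(mask2d):
    # Swap the two masked dimensions so the mask aligns with the
    # transposed weights used in backpropagation.
    return mask2d.T

m = np.array([[1.0, 1.0],
              [0.0, 0.0]])
rm = reverse_mask(m)
```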
  11. The method of claim 1, wherein updating the neural network parameters comprises updating the unsparsified neural network parameters.
  12. The method of claim 11, further comprising:
    generating the mask tensor based on the updated unsparsified neural network parameters.
  13. The method of claim 12, wherein when the mask tensor is a one-dimensional tensor, the method generates the mask tensor as follows:
    selecting, from every m data elements in a specified dimension of the neural network parameters, the n data elements with the largest absolute values as valid data elements, where m > n; and
    determining the mask tensor based on the positions of the n valid data elements among the m data elements.
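For illustration only: the n-of-m selection in claim 13 can be sketched as follows, assuming NumPy and a hypothetical helper name `make_1d_mask`; every group of m consecutive elements along the specified dimension keeps its n largest-magnitude elements:

```python
import numpy as np

def make_1d_mask(params, m, n, axis=-1):
    # Move the target axis last, split it into groups of m, and mark
    # the n largest-|value| positions of each group with 1.
    moved = np.moveaxis(params, axis, -1)
    groups = moved.reshape(-1, m)
    mask = np.zeros_like(groups)
    top = np.argsort(-np.abs(groups), axis=1)[:, :n]
    np.put_along_axis(mask, top, 1.0, axis=1)
    return np.moveaxis(mask.reshape(moved.shape), -1, axis)

w = np.array([0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.3, 0.8])
mask = make_1d_mask(w, m=4, n=2)
```

With m=4 and n=2 this reproduces a 2-of-4 structured-sparsity pattern: the mask marks -0.9 and 0.4 in the first group of four, and 0.7 and 0.8 in the second.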
  14. The method of claim 12, wherein when the mask tensor is a two-dimensional tensor, the method generates the mask tensor as follows:
    presetting a specific number of two-dimensional mask tensors, wherein each dimension of a two-dimensional mask tensor comprises m elements, of which n elements are 1 and m-n elements are 0, where m > n;
    masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain masked parameter tensors;
    performing, with each masked parameter tensor, a product-sum operation on the training data of the neural network to obtain a parameter evaluation value; and
    selecting, as the mask tensor, the two-dimensional mask tensor that yields the largest of all the parameter evaluation values.
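For illustration only: a sketch of the candidate-evaluation loop in claim 14, with two assumptions not fixed by the claim language — the preset candidates are taken to be outer products of n-of-m binary vectors, and the "product-sum" score is taken to be the element-wise product-then-sum of the masked parameters with a training-data tile. The helper names are hypothetical:

```python
import numpy as np
from itertools import combinations

def nofm_vectors(m, n):
    # All length-m binary vectors with exactly n ones.
    vecs = []
    for idx in combinations(range(m), n):
        v = np.zeros(m)
        v[list(idx)] = 1.0
        vecs.append(v)
    return vecs

def select_2d_mask(params, data, m, n):
    # Score every preset candidate mask and keep the best one.
    best_mask, best_score = None, -np.inf
    for r in nofm_vectors(m, n):
        for c in nofm_vectors(m, n):
            cand = np.outer(r, c)                    # m x m candidate mask
            score = np.sum((params * cand) * data)   # product-sum evaluation
            if score > best_score:
                best_mask, best_score = cand, score
    return best_mask

w = np.arange(16, dtype=float).reshape(4, 4)
x = np.ones((4, 4))
mask = select_2d_mask(w, x, m=4, n=2)
```

On this toy input the winning mask keeps the 2x2 block of largest weights (rows 2-3, columns 2-3).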
  15. The method of any one of claims 1-14, wherein the method is performed over multiple iterations in a mask adjustment phase of the sparsification training.
  16. The method of claim 15, wherein the mask adjustment phase further comprises:
    determining whether the percentage of element values of the mask tensor that remain unchanged over multiple consecutive training iterations reaches a threshold; and
    if so, ending the mask adjustment phase.
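For illustration only: the stopping test of claim 16 can be sketched as tracking, over consecutive iterations, the fraction of mask elements that never changed; the helper name `mask_adjustment_done` is hypothetical:

```python
import numpy as np

def mask_adjustment_done(mask_history, threshold=0.9):
    # Fraction of elements identical across all recorded mask
    # snapshots; the adjustment phase ends once it reaches the
    # threshold.
    first = np.asarray(mask_history[0])
    unchanged = np.ones(first.shape, dtype=bool)
    for m in mask_history[1:]:
        unchanged &= (np.asarray(m) == first)
    return bool(unchanged.mean() >= threshold)

# Three consecutive mask snapshots; element 3 flipped once, so
# 3 of 4 elements (75%) stayed unchanged throughout.
history = [[1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 1, 0]]
```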
  17. The method of claim 16, wherein the threshold is one of 80%, 90%, and 100%.
  18. The method of any one of claims 1-10, wherein the method is performed over multiple iterations in a mask-fixing phase of the sparsification training, and the mask tensor is fixed to the mask tensor finalized in the preceding phase.
  19. The method of claim 18, wherein updating the neural network parameters comprises updating the sparsified neural network parameters.
  20. The method of claim 19, wherein updating the neural network parameters further comprises:
    performing sparsification on the neuron gradients using the mask tensor; and
    updating the sparsified neural network parameters based on the sparsified neuron gradients.
  21. The method of any one of claims 18-20, wherein during the mask-fixing phase, the fixed mask tensor and the sparsified neural network parameters are stored.
  22. The method of any one of claims 18-20, wherein during the mask-fixing phase, the fixed mask tensor and the unsparsified neural network parameters are stored.
  23. A computer-readable storage medium having stored thereon computer program code for sparsification training of a neural network model, wherein when the computer program code is run by a processing apparatus, the method of any one of claims 1-22 is performed.
  24. A data processing apparatus, comprising a control circuit, a storage circuit, and an arithmetic circuit, wherein:
    the control circuit is configured to control the storage circuit and the arithmetic circuit so as to perform sparsification training on a neural network model;
    the storage circuit is configured to store information comprising at least neural network parameters and a mask tensor; and
    the arithmetic circuit is configured to perform, under the control of the control circuit, the following operations:
    in forward propagation, performing sparsification on at least the neural network parameters based on the mask tensor, so as to compute a value of a loss function;
    in backpropagation, computing neuron gradients and neural network parameter gradients based on the loss function; and
    updating the neural network parameters based on the neural network parameter gradients.
  25. The apparatus of claim 24, wherein the arithmetic circuit is further configured to:
    in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the unsparsified neural network parameters; and
    update the neural network parameters based on the neural network parameter gradients.
  26. The apparatus of claim 25, wherein the arithmetic circuit is further configured to:
    perform de-sparsification on the sparsified neural network parameters to obtain the unsparsified neural network parameters.
  27. The apparatus of claim 24, wherein the arithmetic circuit is further configured to:
    in backpropagation, compute the neuron gradients and the neural network parameter gradients based on the sparsified neural network parameters; and
    update the neural network parameters based on the neuron gradients.
  28. The apparatus of claim 27, wherein the arithmetic circuit is further configured to:
    in backpropagation, perform sparsification on the neural network parameters based on a reverse mask tensor to obtain the sparsified neural network parameters.
  29. The apparatus of any one of claims 24-26, wherein the mask tensor is a one-dimensional tensor.
  30. The apparatus of claim 29, wherein the one-dimensional tensor sparsifies the input channel dimension of the neural network parameters.
  31. The apparatus of any one of claims 24-28, wherein the mask tensor is a two-dimensional tensor.
  32. The apparatus of claim 31, wherein the two-dimensional tensor sparsifies the input channel dimension and the output channel dimension of the neural network parameters.
  33. The apparatus of claim 28, wherein when the mask tensor is a two-dimensional tensor, the reverse mask tensor is generated by the arithmetic circuit performing a dimension transformation on the mask tensor.
  34. The apparatus of claim 24, wherein the arithmetic circuit is further configured to:
    update the unsparsified neural network parameters.
  35. The apparatus of claim 34, wherein the arithmetic circuit is further configured to:
    generate the mask tensor based on the updated unsparsified neural network parameters.
  36. The apparatus of claim 35, wherein when the mask tensor is a one-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
    selecting, from every m data elements in a specified dimension of the neural network parameters, the n data elements with the largest absolute values as valid data elements, where m > n; and
    determining the mask tensor based on the positions of the n valid data elements among the m data elements.
  37. The apparatus of claim 35, wherein when the mask tensor is a two-dimensional tensor, the arithmetic circuit is configured to generate the mask tensor as follows:
    presetting a specific number of two-dimensional mask tensors, wherein each dimension of a two-dimensional mask tensor comprises m elements, of which n elements are 1 and m-n elements are 0, where m > n;
    masking two specified dimensions of the neural network parameters with each preset two-dimensional mask tensor to obtain masked parameter tensors;
    performing, with each masked parameter tensor, a product-sum operation on the training data of the neural network to obtain a parameter evaluation value; and
    selecting, as the mask tensor, the two-dimensional mask tensor that yields the largest of all the parameter evaluation values.
  38. The apparatus of any one of claims 24-37, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask adjustment phase of the sparsification training.
  39. The apparatus of claim 38, wherein the arithmetic circuit is further configured to: in the mask adjustment phase, determine whether the percentage of element values of the mask tensor that remain unchanged over multiple consecutive training iterations reaches a threshold; and
    if so, end the mask adjustment phase.
  40. The apparatus of claim 39, wherein the threshold is one of 80%, 90%, and 100%.
  41. The apparatus of any one of claims 24-33, wherein the arithmetic circuit is configured to perform the operations over multiple iterations in a mask-fixing phase of the sparsification training, and the mask tensor is fixed to the mask tensor finalized in the preceding phase.
  42. The apparatus of claim 41, wherein the arithmetic circuit is further configured to:
    update the sparsified neural network parameters.
  43. The apparatus of claim 42, wherein the arithmetic circuit is further configured to:
    perform sparsification on the neuron gradients using the mask tensor; and
    update the sparsified neural network parameters based on the sparsified neuron gradients.
  44. The apparatus of any one of claims 41-43, wherein during the mask-fixing phase, the storage circuit is configured to store the fixed mask tensor and the sparsified neural network parameters.
  45. The apparatus of any one of claims 41-43, wherein during the mask-fixing phase, the storage circuit is configured to store the fixed mask tensor and the unsparsified neural network parameters.
  46. A chip comprising the data processing apparatus of any one of claims 24-45.
  47. A board card comprising the chip of claim 46.
PCT/CN2021/123879 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product WO2022095675A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/003,821 US20230259780A1 (en) 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011216903 2020-11-04
CN202011216903.5 2020-11-04
CN202011563259.9A CN114444680A (en) 2020-11-04 2020-12-25 Neural network sparsing device and method and related product
CN202011563259.9 2020-12-25

Publications (1)

Publication Number Publication Date
WO2022095675A1 true WO2022095675A1 (en) 2022-05-12

Family

ID=81362120

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2021/123881 WO2022095676A1 (en) 2020-11-04 2021-10-14 Neural network sparsification device and method, and corresponding product
PCT/CN2021/123879 WO2022095675A1 (en) 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123881 WO2022095676A1 (en) 2020-11-04 2021-10-14 Neural network sparsification device and method, and corresponding product

Country Status (3)

Country Link
US (2) US20230259780A1 (en)
CN (2) CN114444680A (en)
WO (2) WO2022095676A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170917B (zh) * 2022-06-20 2023-11-07 Midea Group (Shanghai) Co., Ltd. Image processing method, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886164A (zh) * 2017-12-20 2018-04-06 Neusoft Corporation Convolutional neural network training and testing method, and training and testing apparatus
WO2020097217A1 (en) * 2018-11-06 2020-05-14 Emory University Systems and Methods for Training an Autoencoder Neural Network Using Sparse Data
CN111652366A (zh) * 2020-05-09 2020-09-11 Harbin Institute of Technology Combined neural network model compression method based on channel pruning and quantization training

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779786B1 (en) * 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration


Also Published As

Publication number Publication date
CN114444681A (en) 2022-05-06
WO2022095676A1 (en) 2022-05-12
US20220230069A1 (en) 2022-07-21
CN114444680A (en) 2022-05-06
US20230259780A1 (en) 2023-08-17

Similar Documents

Publication Publication Date Title
JP6880160B2 (en) Arithmetic logic unit and calculation method
CN111047022B (en) Computing device and related product
WO2023045445A1 (en) Data processing device, data processing method, and related product
CN112633490A (en) Data processing device and method for executing neural network model and related products
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
WO2022134873A1 (en) Data processing device, data processing method, and related product
WO2022095675A1 (en) Neural network sparsification apparatus and method and related product
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
Belabed et al. Low cost and low power stacked sparse autoencoder hardware acceleration for deep learning edge computing applications
CN113469337A (en) Compiling method for optimizing neural network model and related product
CN114692844A (en) Data processing device, data processing method and related product
WO2022134688A1 (en) Data processing circuit, data processing method, and related products
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
WO2022135599A1 (en) Device, board and method for merging branch structures, and readable storage medium
CN111047024A (en) Computing device and related product
WO2023236929A1 (en) Method and device for reading target data in data based on instruction
WO2022063217A1 (en) Device for forward fusion of neural network, board, method, and readable storage medium
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium
CN115599738A (en) Method for optimizing neural network model and related product
WO2022135600A1 (en) Computational neural network apparatus, card, method, and readable storage medium
CN114444678A (en) Apparatus, method, and storage medium for thinning neural network layer
CN115600657A (en) Processing device, equipment and method and related products thereof
CN114429194A (en) Device, board card, method and readable storage medium for processing neural network calculation
CN116484926A (en) Self-adaptive splitting optimization equipment and method
CN114692846A (en) Data processing device, data processing method and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888372

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888372

Country of ref document: EP

Kind code of ref document: A1