US20220230069A1 - Neural network sparsification device and method, and related product

Neural network sparsification device and method, and related product

Info

Publication number
US20220230069A1
Authority
US
United States
Prior art keywords
mask
tensor
parameters
adjustment parameters
updated
Prior art date
Legal status
Pending
Application number
US17/557,802
Inventor
Yufeng Gao
Shibing ZHU
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Assigned to Anhui Cambricon Information Technology Co., Ltd. reassignment Anhui Cambricon Information Technology Co., Ltd. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GAO, Yufeng, ZHU, Shibing
Publication of US20220230069A1 publication Critical patent/US20220230069A1/en


Classifications

    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06F 18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06K 9/6227
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • the present disclosure relates generally to the field of neural networks. More particularly, the present disclosure relates to a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model.
  • Network parameter sparsification reduces redundant components in a large network by an appropriate method, thereby lowering the network's requirements for the amount of computation and for storage space.
  • Although an existing fine-grained parameter sparsification method has excellent model performance, it is unfriendly to hardware access; in other words, its on-chip and off-chip input/output overhead is high and its performance is low. Although a structured sparsity method based on channels and convolution kernels improves hardware performance, it incurs a greater loss of model accuracy. Finally, most existing sparse algorithms work in an off-line fine-tuning mode, in other words, a pre-trained model is fine-tuned after sparsification; this mode is greatly limited, and considerable performance benefits cannot be obtained from model training.
  • solutions of the present disclosure provide a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model.
  • the present disclosure provides a method of performing sparse training on a neural network model, which comprises a mask adjustment stage and a mask fixing stage.
  • the following steps are repeated in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; updating the mask adjustment parameters based on the partial derivatives; and updating the mask tensor based on the updated mask adjustment parameters.
  • the updated mask adjustment parameters in the mask adjustment stage may be taken as initial values of mask fixing parameters, and the following steps are repeated in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on an updated mask tensor to compute the value of the loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; and updating the mask fixing parameters based on the partial derivatives.
  • the updated mask fixing parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • the present disclosure provides a method of performing sparse training on a neural network model, comprising: in a mask adjustment stage, repeating the following steps in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; updating the mask adjustment parameters based on the partial derivatives; and updating the mask tensor based on updated mask adjustment parameters.
  • the updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • the present disclosure provides a computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model.
  • the computer program code, when executed by a processing device, performs the aforementioned method.
  • the present disclosure provides an integrated circuit device for performing sparse training on a neural network model, comprising: a processing device and a computation device.
  • the processing device comprises a control module, a computation module, and an updating module.
  • when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives, and update the mask tensor based on updated mask adjustment parameters.
  • the updating module takes the updated mask adjustment parameters as initial values of mask fixing parameters, and the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters.
  • the updating module updates the mask fixing parameters based on the partial derivatives.
  • the computation device is configured to shield the updated mask fixing parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • the present disclosure provides an integrated circuit device for performing sparse training on a neural network model, comprising: a processing device and a computation device.
  • the processing device comprises a control module, a computation module, and an updating module; when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives, and update the mask tensor based on the updated mask adjustment parameters.
  • the computation device is configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • the present disclosure provides a board card, comprising the aforementioned integrated circuit device.
  • the parameters are trained and the mask tensor is updated at the same time, which achieves the technical effects of reducing input/output overhead and improving accuracy.
  • FIG. 1 is a structural diagram illustrating a board card according to an embodiment of the present disclosure.
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram illustrating an internal structure of a single-core computation device according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram illustrating an internal structure of a multi-core computation device according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram illustrating an internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 6A is a schematic diagram illustrating an internal structure of a processing device according to an embodiment of the present disclosure.
  • FIG. 6B is a schematic diagram illustrating an internal structure of a processing device according to another embodiment of the present disclosure.
  • FIG. 7 is a flowchart illustrating a sparse training method according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram illustrating an exemplary masking process.
  • FIG. 9 is a schematic diagram illustrating exemplary mask vector updating.
  • FIG. 10 is a schematic diagram illustrating an exemplary product sum computation process.
  • FIG. 11 is a flowchart illustrating a sparse training method according to another embodiment of the present disclosure.
  • FIG. 12 is a flowchart illustrating a sparse training method in a mask fixing stage according to another embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram illustrating several implementations when performing sparse training on a neural network model according to the present disclosure.
  • a term “if” may be interpreted contextually as “when”, or “once”, or “in response to determining”, or “in response to detecting.”
  • a neural network is composed of an input layer, convolutional layers, activation functions, pooling layers and fully connected layers, and may have from several layers up to hundreds of layers.
  • One operator is executed for each layer, for example, a convolutional operator is executed for a convolutional layer, and the number of operators that need to be executed is as many as the number of the layers.
  • FIG. 1 illustrates a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure.
  • the board card 10 comprises a chip 101 , which is a system-level chip, or a system-on-chip, and has integrated thereon one or more combined processing devices.
  • the combined processing device is an artificial intelligence computation unit for supporting various deep learning and machine learning algorithms, to meet intelligent processing requirements in complex scenarios in fields such as computer vision, voice, natural language processing and data mining.
  • a deep learning technique has been widely applied to a cloud intelligence field, and one remarkable feature of a cloud intelligence application is a huge amount of input data, which results in a high requirement for storage capability and computational capability of a platform.
  • the board card 10 of this embodiment is suitable for cloud intelligence applications and has huge off-chip and on-chip storage as well as powerful computational capability.
  • the chip 101 is connected with an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a WIFI interface, or the like.
  • Data to be processed may, via the external interface device 102 , be transferred to the chip 101 by the external device 103 .
  • a computation result of the chip 101 may, via the external interface device 102 , be transmitted back to the external device 103 .
  • the external interface device 102 may have interfaces in different forms, for example, a PCIe interface, and the like, according to different application scenes.
  • the board card 10 further comprises a storage device 104 for data storage, which comprises one or more storage units 105 .
  • the storage device 104 is, through a bus, in connection and data transmission with a control device 106 and the chip 101 .
  • the control device 106 in the board card 10 is configured to regulate a state of the chip 101 .
  • the control device 106 may include a micro controller unit (MCU).
  • FIG. 2 is a structural diagram illustrating a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 comprises a computation device 201 , an interface device 202 , a processing device 203 , and a DRAM 204 .
  • the computation device 201 is configured to perform a user-specified operation, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform computation of deep learning or machine learning, and it may interact with the processing device 203 through the interface device 202 , to jointly complete the user-specified operation.
  • the interface device 202 is configured to transmit data and control instructions between the computation device 201 and the processing device 203 .
  • the computation device 201 may, via the interface device 202 , acquire input data from the processing device 203 , and write the input data into a storage device on the computation device 201 .
  • the computation device 201 may, via the interface device 202 , acquire a control instruction from the processing device 203 , and write the control instruction into a control cache on the computation device 201 .
  • the interface device 202 may also read data in the storage device of the computation device 201 and transmit the data to the processing device 203 .
  • the processing device 203 performs basic control that includes, but is not limited to, data moving, and starting and/or stopping the computation device 201.
  • the processing device 203 may be a processor of one or more of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor.
  • examples of the special-purpose processor include, but are not limited to, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic, a discrete hardware component, and the like, and the number of the processors may be determined according to actual needs.
  • the computation device 201 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
  • when the computation device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
  • the DRAM 204 is configured to store data to be processed, and is a DDR memory, typically 16 GB or larger in size, that stores data of the computation device 201 and/or the processing device 203.
  • FIG. 3 illustrates a schematic diagram of an internal structure of the computation device 201 as a single core.
  • the single-core computation device 301 is configured to process input data in fields such as computer vision, voice, natural language and data mining, and it comprises three modules: a control module 31, a computation module 32, and a storage module 33.
  • the control module 31 is configured to coordinate and control work of the computation module 32 and the storage module 33 to complete a deep learning task, and comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312 .
  • the instruction fetch unit 311 is configured to acquire an instruction from the processing device 203
  • the instruction decode unit 312 is configured to decode the acquired instruction and send a decoded result as control information to the computation module 32 and the storage module 33 .
  • the computation module 32 comprises a vector computation unit 321 and a matrix computation unit 322 .
  • the vector computation unit 321 is configured to perform vector computation, and may support complex computation such as vector multiplication, addition, nonlinear transformation; and the matrix computation unit 322 is responsible for core computation of a deep learning algorithm, in other words, matrix multiplication and convolution.
  • the storage module 33 is configured to store or move related data, and comprises a neuron RAM (NRAM) 331 , a weight RAM (WRAM) 332 , and a direct memory access (DMA) 333 .
  • the NRAM 331 is configured to store input neurons, output neurons and intermediate computation results
  • the WRAM 332 is configured to store a convolution kernel, in other words, a weight, of a deep learning network
  • the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data move between the single-core computation device 301 and the DRAM 204 .
  • FIG. 4 illustrates a schematic diagram of an internal structure of the computation device 201 as a multi-core.
  • the multi-core computation device 41 employs a hierarchical design.
  • the multi-core computation device 41 as a system on chip, comprises at least one cluster, and each cluster comprises a plurality of processor cores, in other words, the multi-core computation device 41 is composed of the hierarchy of system on chip-cluster-processor core.
  • the multi-core computation device 41 comprises, as shown in FIG. 4 , an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 , and a plurality of clusters 405 .
  • the external storage controller 401 is configured to access, in response to an access request issued by a processor core, an external storage device, for example, the DRAM 204 in FIG. 2, to read data from or write data to the off-chip memory.
  • the peripheral communication module 402 is configured to receive a control signal from the processing device 203 through the interface device 202 to start the computation device 201 to execute a task.
  • the on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and the plurality of clusters 405, to transmit data and control signals between the modules.
  • the synchronization module 404 is a global barrier controller (GBC), and is configured to coordinate work progress of the clusters to ensure information synchronization.
  • the plurality of clusters 405 are computation cores of the multi-core computation device 41, four of which are exemplarily shown in the figure. With the development of hardware, the multi-core computation device 41 of the present disclosure may further include 8, 16, 64, or even more clusters 405.
  • the cluster 405 is configured to efficiently execute a deep learning algorithm.
  • each cluster 405 comprises, as shown in FIG. 4 , a plurality of processor cores (IPU cores) 406 and a MEM core 407 .
  • each processor core 406 also comprises three modules: a control module 51 , a computation module 52 , and a storage module 53 . Functions and structures of the control module 51 , the computation module 52 and the storage module 53 are substantially the same as those of the control module 31 , the computation module 32 , and the storage module 33 , and thus are not repeated.
  • the storage module 53 comprises an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534 .
  • the IODMA 533 controls, through a broadcast bus 409 , access to NRAM 531 /WRAM 532 and DRAM 204 ; and the MVDMA 534 is configured to control access to the NRAM 531 /WRAM 532 and a storage unit (SRAM) 408 .
  • the MEM core 407 is primarily used for storage and communication, in other words, storing shared data or intermediate results among the processor cores 406 , and performing communication between the cluster 405 and the DRAM 204 , communication among the clusters 405 , communication among the processor cores 406 , and the like.
  • the MEM core 407 has the capability of scalar computation and is used for performing scalar computation.
  • the MEM core 407 comprises the SRAM 408 , the broadcast bus 409 , a cluster direct memory access (CDMA) module 410 , and a global direct memory access (GDMA) module 411 .
  • the SRAM 408 plays the role of a high-performance data transfer station. Data multiplexed between different processor cores 406 in a same cluster 405 does not need to be obtained individually from the DRAM 204 by each processor core 406, but is transferred among the processor cores 406 through the SRAM 408; the MEM core 407 only needs to rapidly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, so that inter-core communication efficiency is improved and on-chip/off-chip input/output access is greatly reduced.
  • the broadcast bus 409 , CDMA 410 , and GDMA 411 are used for performing communication among the processor cores 406 , communication among the clusters 405 , and data transmission between the cluster 405 and the DRAM 204 , respectively, which will be described separately below.
  • the broadcast bus 409 is used for completing high-speed communication among the processor cores 406 in the cluster 405 , and an inter-core communication mode supported by the broadcast bus 409 of this embodiment comprises unicast, multicast and broadcast.
  • the unicast refers to point-to-point (for example, from a single processor core to a single processor core) data transmission
  • the multicast is a communication mode in which a piece of data is transmitted from the SRAM 408 to several specific processor cores 406
  • the broadcast is a communication mode in which a piece of data is transmitted from the SRAM 408 to all the processor cores 406, which is a special case of the multicast.
  • the CDMA 410 is used for controlling access to SRAMs 408 between different clusters 405 within a same computation device 201 .
  • the GDMA 411 cooperates with the external storage controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204 or to read data from the DRAM 204 into the SRAM 408 .
  • the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be accomplished via two channels.
  • a first channel is to directly connect the DRAM 204 with the NRAM 531 or WRAM 532 through the IODMA 533; and a second channel is to transmit data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then between the SRAM 408 and the NRAM 531 or WRAM 532 via the MVDMA 534.
  • a data transmission channel can be selected according to the hardware conditions at hand.
  • functions of the GDMA 411 and IODMA 533 may be integrated in a same component.
  • the GDMA 411 and the IODMA 533 are regarded as different components in the present disclosure; for those skilled in the art, such a component falls within the scope of protection of the present disclosure as long as it realizes functions and technical effects similar to those of the present disclosure.
  • the functions of the GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 can also be implemented by a same component.
  • the training of the neural network is to adjust parameters of the layers by inputting training samples, so that a result computed by the neural network is as close to the real result as possible.
  • the neural network training comprises forward propagation and backward propagation.
  • the forward propagation is to, based on an existing model, input a training sample which is computed by the layers of the neural network, and to gradually extract an input feature map into abstract features.
  • the backward propagation is to compute a loss function according to a result of the forward propagation and a real value, compute partial derivatives of the loss function with respect to the parameters through a chain rule by adopting a gradient descent method, to update the parameters, and then perform training by using updated parameters, and repeat the above many times, such that a final computation result of the forward propagation is as anticipated.
  • an epoch refers to the process of performing training once by using all training samples; the set of these training samples is a training set, and training on one batchsize of the training samples is one iteration.
  • for example, if a training set has 1000 training samples and the batchsize is set to 10, then 10 training samples participate in training in each iteration, and there are 100 iterations in one epoch.
  • the training of a neural network model may go through a plurality of epochs.
  • the embodiment provides a solution of performing sparse training on a neural network model. More specifically, the processing device 203 trains parameters and a mask tensor at the same time in a neural network training stage. As shown in FIG. 6A, the processing device 203 comprises a random generation module 61, a control module 62, a computation module 63, and an updating module 64, to execute a sparse training method shown in FIG. 7.
  • in other embodiments, as shown in FIG. 6B, the processing device 203 comprises a random generation module 61, a control module 62, a computation module 63, an updating module 64, and a mask tensor determination module 65, to perform the sparse training method as shown in FIG. 7.
  • in step 701, it is set that a mask adjustment stage is entered. While performing training, the prior art only trains all the parameters (such as weights and biases), and usually does not mask the parameters.
  • the purpose of masking the parameters in this embodiment is to reduce participation of the parameters in the training stage, avoid over-fitting to reduce the amount of computation, and meanwhile, make the mask tensor updated with the updating of the parameters in the training process to obtain a more ideal mask tensor.
  • the control module 62 starts to enter the mask adjustment stage, in other words, begins to mask a part of the parameters by using the mask tensor.
  • in one embodiment, the parameters and the mask tensor are randomly generated at the beginning of the training, in other words, the random generation module 61 randomly generates initial values of both the mask tensor and the parameters.
  • in another embodiment, the mask tensor is generated according to the randomly generated parameters at the beginning of the training, in other words, the random generation module 61 randomly generates initial values of the parameters, and the mask tensor determination module 65 determines an initial value of the mask tensor according to the initial values of the parameters.
  • the mask tensor determination module 65 may determine the initial value of the mask tensor based on: selecting, from every m data elements of a specified dimension of the initial values of the above parameters, n data elements with larger absolute values as valid data elements, where m>n; and generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
  • the specified dimension may be an input channel dimension (c_in).
  • the parameters are divided into a plurality of intervals in units of a specific parameter count m, parameters in each interval are sorted according to their absolute values from large to small, then in the mask tensor, elements at positions that correspond to first n parameters with larger absolute values in each interval are set to 1, and elements at positions that correspond to m-n parameters with smaller absolute values in each interval are set to 0.
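  • for illustration only, the following is a minimal Python/NumPy sketch of this n-out-of-m selection along one dimension (the function name init_mask and the m=4, n=2 defaults are assumptions for the example, not part of the disclosure):

```python
import numpy as np

def init_mask(params: np.ndarray, m: int = 4, n: int = 2) -> np.ndarray:
    """Keep, in every interval of m consecutive elements along the last
    axis, the n elements with the largest absolute values: the returned
    0/1 tensor holds 1 at retained positions and 0 at masked positions."""
    flat = params.reshape(-1, m)                    # split into intervals of m
    top = np.argsort(-np.abs(flat), axis=1)[:, :n]  # n largest |values| per interval
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, top, 1.0, axis=1)       # mark retained positions
    return mask.reshape(params.shape)

w0 = np.random.randn(16, 8)       # randomly generated initial parameters
m0 = init_mask(w0)                # initial value of the mask tensor
assert m0.sum() == w0.size // 2   # exactly n of every m elements retained
```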
  • when the mask tensor is a two-dimensional tensor, the control module 62 will preset a specific count of two-dimensional mask tensors and then select one of the preset two-dimensional mask tensors as the initial value of the mask tensor.
  • Each dimension of these two-dimensional mask tensors comprises m elements, of which n elements are 1, and m-n elements are 0, where m>n.
  • the mask tensor of this embodiment is exemplarily set as a two-dimensional mask matrix for masking an input channel (c_in) and an output channel (c_out) of a convolution kernel of the convolutional layer, and assuming that m is 4 and n is 2, the mask matrix c_in × c_out is set to 4(m) × 4(m), where there are 2(n) elements of 1 and 2(m-n) elements of 0 in any row and any column. Since there are 90 such 4×4 mask matrices, the control module 62 presets, in this step, the 90 4×4 mask matrices with 2 elements of 1 and 2 elements of 0 in any row and any column, which are pre-stored in the DRAM 204.
  • although this embodiment is illustrated by taking the input channel (c_in) and the output channel (c_out) as an example, the present disclosure is not limited thereto, and any parameter can be masked according to the teaching of this embodiment.
  • selecting one from among the specific count (for example, 90) of two-dimensional mask tensors as the initial value may comprise: respectively masking two specified dimensions of the initial values of the parameters of the neural network layer based on each preset two-dimensional mask tensor to obtain masked parameter tensors; performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that yields the largest of all parameter evaluation values as the initial value of the mask tensor.
  • the above two specified dimensions can be an input channel dimension and an output channel dimension.
  • a masking process of the two-dimensional mask tensor can refer to the description made thereinafter in conjunction with FIG. 8 .
  • after entering the mask adjustment stage, the processing device 203 repeats the following steps in a plurality of epochs.
  • in step 702, the mask adjustment parameters are masked based on the mask tensor in forward propagation to compute a value of a loss function.
  • the parameters in the mask adjustment stage are defined as mask adjustment parameters.
  • the computation module 63 masks the input channel and the output channel respectively according to one mask matrix selected from the 90 mask matrices in the initialization step.
  • FIG. 8 illustrates an exemplary masking process. Assuming that the input channel and the output channel of the convolutional layer form a 4×4 channel matrix 801 whose elements are a11 to a44, the channel matrix 801 constitutes the mask adjustment parameters.
  • the computation module 63 performs masking based on an exemplarily selected mask matrix 802 of the aforementioned 90 4×4 mask matrices. An element in the channel matrix 801 is retained by the computation module 63 if the corresponding element in the mask matrix 802 is 1, and an element in the channel matrix 801 is masked by the computation module 63, so that its value becomes 0, if the corresponding element in the mask matrix 802 is 0.
  • the computation module 63 masks, in forward propagation, the mask adjustment parameters based on the mask tensor, then performs computation, and finally obtains the value of the loss function that corresponds to an output error of the neural network.
  • in step 703, partial derivatives of the loss function with respect to the mask adjustment parameters are computed in backward propagation.
  • the computation module 63 When the computation module 63 is in back propagation, it propagates the output error of the neural network from an output end of the neural network model to an input direction step by step, and in the process, an effect of each mask adjustment parameter on the loss function is computed by using a chain rule, in other words, the partial derivative of the loss function with respect to each mask adjustment parameter is computed.
  • in step 704, the mask adjustment parameters are updated based on the partial derivatives.
  • according to the effects of the mask adjustment parameters on the error, the updating module 64 multiplies the partial derivatives by a stride (step size) to update the mask adjustment parameters of the whole neural network.
  • the updating module 64 may update the mask adjustment parameters based on the partial derivatives for each training sample or each iteration. Taking the aforementioned example in which the epoch comprises a training set of 1000 training samples and the batchsize is 10, if the mask adjustment parameters are updated after each training sample is trained, the updating will be performed 1000 times in the epoch; and if the mask adjustment parameters are updated every iteration, the updating will be performed 100 times in the epoch.
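  • purely as an illustration of steps 702 to 704, a minimal PyTorch-style sketch follows (a single linear layer stands in for the network, and the helper name masked_linear, the squared-error loss and the stride value are assumptions; note that with a plain elementwise multiply the masked positions receive zero gradient, a detail the disclosure does not fix):

```python
import torch
import torch.nn.functional as F

def masked_linear(x, w, mask, bias=None):
    # Forward propagation with masking: parameters at positions where
    # the mask tensor is 0 contribute nothing to the output.
    return F.linear(x, w * mask, bias)

x = torch.randn(10, 16)                     # a batch of training data
w = torch.randn(8, 16, requires_grad=True)  # mask adjustment parameters
mask = (torch.rand(8, 16) > 0.5).float()    # illustrative 0/1 mask tensor
target = torch.randn(10, 8)

loss = (masked_linear(x, w, mask) - target).pow(2).mean()  # step 702
loss.backward()                             # step 703: partial derivatives
with torch.no_grad():
    w -= 0.01 * w.grad                      # step 704: derivative times stride
    w.grad = None
```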
  • in step 705, the mask tensor is updated based on the updated mask adjustment parameters.
  • the updating module 64 of this embodiment updates the mask tensor in a variety of ways.
  • when the mask tensor is one-dimensional, in other words, a mask vector, each element of the mask vector masks a single parameter.
  • the updating module 64 comprises a division unit 641 , a sorting unit 642 and an adjustment unit 643 , which are used for updating the mask vector.
  • when the updating module 64 updates the mask vector, it will set, to 1, the element(s) that correspond to mask adjustment parameters with larger absolute values, and set, to 0, the element(s) that correspond to mask adjustment parameters with smaller absolute values, because mask adjustment parameters with larger absolute values carry more obvious features and are more worthy of being retained for further computation.
  • There are many ways of screening those mask adjustment parameters with larger absolute values, one of which is exemplarily given below.
  • the division unit 641 divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit 642 sorts the mask adjustment parameters in each interval according to their absolute values from large to small; and the adjustment unit 643 sets, in the mask vector, to 1, the elements that correspond to the first n sorted mask adjustment parameters, and sets, to 0, the remaining elements that correspond to the m-n mask adjustment parameters with smaller absolute values. In other words, the first n mask adjustment parameters with larger absolute values are retained, and the m-n mask adjustment parameters with smaller absolute values are masked.
  • FIG. 9 is a schematic diagram of exemplary mask vector updating, which illustrates the above updating of the mask vector by way of example.
  • the figure shows a parameter vector 901 with 64 parameters, b01 to b64 respectively.
  • the updating module 64 updates element values of the mask vector to retain those mask adjustment parameters with larger absolute values and mask those mask adjustment parameters with smaller absolute values.
  • the division unit 641 divides the updated mask adjustment parameters into a plurality of intervals in units of every 4 mask adjustment parameters (in other words, m is 4); as shown in the figure, b01 to b04 are in a first interval 902, b05 to b08 are in a second interval 903, and b61 to b64 are in a sixteenth interval 917.
  • the sorting unit 642 sorts the mask adjustment parameters in each interval according to their absolute values from large to small. It is assumed that the absolute values of the parameters in the first interval 902 are sorted as b02>b01>b04>b03, the absolute values of the parameters in the second interval 903 are sorted as b07>b08>b06>b05, and the absolute values of the parameters in the sixteenth interval 917 are sorted as b64>b63>b61>b62.
  • n is 2
  • taking the first interval 902 as an example, in the mask vector, the elements that correspond to b02 and b01 are set to 1, and the elements that correspond to b03 and b04 are set to 0.
  • Each interval is adjusted in this way, and finally an updated mask vector 918 is completed.
  • the updated mask vector 918 retains the updated mask adjustment parameters with larger absolute values and masks the updated mask adjustment parameters with smaller absolute values.
  • the updating module 64 takes every 4 mask adjustment parameters as one interval, and updates element values of the mask vector in the way of 2 out of 4 in each interval.
  • in this example, the mask adjustment parameters in each interval are completely sorted to identify the n parameters with larger absolute values and the m-n parameters with smaller absolute values; however, the present disclosure does not necessarily need to perform complete sorting, as long as the n parameters with larger absolute values and the m-n parameters with smaller absolute values can be identified; the order among the n larger parameters and the order among the m-n smaller parameters are not necessary information.
  • taking the first interval 902 as an example, the present disclosure only needs to judge that b01 and b02 are the 2 parameters with larger absolute values and that b03 and b04 are the 2 parameters with smaller absolute values; the sorting of the absolute values of b01 and b02 from large to small and the sorting of the absolute values of b03 and b04 from large to small are not critical, and they may be left unsorted to save computation resources.
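  • the observation that complete sorting is unnecessary maps naturally onto a partial-selection primitive; below is an illustrative NumPy sketch (the helper name is hypothetical, and argpartition is just one way to separate the n largest absolute values from the rest without ordering either group):

```python
import numpy as np

def update_mask_vector(params: np.ndarray, m: int = 4, n: int = 2) -> np.ndarray:
    """Per interval of m mask adjustment parameters, set the mask to 1 for
    the n with the largest absolute values and to 0 for the other m-n.
    Unlike a full sort, argpartition does not order within the groups."""
    flat = params.reshape(-1, m)
    keep = np.argpartition(-np.abs(flat), n - 1, axis=1)[:, :n]
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(params.shape)

b = np.random.randn(64)           # the 64 parameters b01 to b64 of FIG. 9
mask_918 = update_mask_vector(b)  # updated mask vector: 2 out of every 4 kept
```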
  • the updating module 64 may perform product sum computation on the training data and each masked parameter tensor to obtain a parameter evaluation value.
  • the purpose of obtaining the parameter evaluation value is to measure the amount of information retained after the masking by the mask tensor. If the parameter evaluation value is high, it indicates that not too much information is lost due to the masking; the mask tensor reduces the amount of computation on the premise of retaining most information and is a high-quality mask tensor. On the contrary, if the parameter evaluation value is low, it indicates that too much information is lost after the masking, and the mask tensor is not a high-quality mask tensor.
  • An updating process of the multi-dimensional mask tensor is similar to the initialization process described above for the two-dimensional mask tensor, in other words, the mask tensor determination module 65 may be implemented as part of the updating module 64 .
  • FIG. 10 illustrates an exemplary product sum computation process. Assuming that a training data matrix 1001 is one piece of the training data in the training set, the computation that would previously have been performed with the channel matrix 801 in FIG. 8 is now changed into a product sum computation with the masked parameter matrix 803, to identify the amount of information after the masking. There are various ways to perform such product sum computation; for example, the training data matrix 1001 is multiplied with the corresponding elements of the masked parameter matrix 803, and then the absolute values of the products are summed to obtain a parameter evaluation value S1 = Σ_{i,j} |x_{ij} · w_{ij}|, where x_{ij} is an element of the training data matrix 1001 and w_{ij} is the corresponding element of the masked parameter matrix 803.
  • a parameter evaluation value S2 may be obtained from a similar absolute value computation; the parameter evaluation value S1 or S2 represents how much information is retained after the masking, and the higher the parameter evaluation value, the more information is retained.
  • in one application scenario, either of the parameter evaluation values S1 and S2 may be selected, while in another application scenario, both the parameter evaluation values S1 and S2 may be employed, which is not limited in the present disclosure.
  • the updating module 64 performs masking with all candidate mask tensors and obtains a parameter evaluation value for each.
  • in the aforementioned example, this means that masking is performed with each of the 90 4×4 mask matrices and 90 parameter evaluation values are obtained.
  • the mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, in other words, the parameter mask tensor.
  • the sorting unit 642 may sort all the parameter evaluation values according to the values from large to small to obtain the largest parameter evaluation value, or simply compare two values with a two-input comparator, then leave the larger for comparison with a next parameter evaluation value, and then the largest parameter evaluation value is left after all the 90 parameter evaluation values have been compared.
  • if more than one mask tensor yields the largest parameter evaluation value, the updating module 64 may select one of them based on a specific rule or hardware characteristic, for example, the top-ranked one, the bottom-ranked one, the first one left, the last one left, or one chosen randomly.
  • the mask tensor having the largest parameter evaluation value is a mask tensor that retains the most amount of information, and in this embodiment, this mask tensor is taken as the parameter mask tensor.
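  • the following sketch ties the pieces together under stated assumptions: it enumerates the 90 candidate matrices by brute force (one possible method), scores each with the S1 evaluation described with FIG. 10, and keeps the best; all names are illustrative:

```python
import numpy as np
from itertools import combinations, product

# Enumerate all 4x4 0/1 matrices with exactly two 1s in every row and
# every column; as stated in the text, there are 90 of them.
rows = [np.array([1.0 if i in c else 0.0 for i in range(4)])
        for c in combinations(range(4), 2)]          # the 6 possible rows
candidates = [np.stack(r) for r in product(rows, repeat=4)
              if (np.stack(r).sum(axis=0) == 2).all()]
assert len(candidates) == 90

def eval_s1(x: np.ndarray, w: np.ndarray, mask: np.ndarray) -> float:
    """Parameter evaluation value S1: multiply the training data with the
    corresponding elements of the masked parameters, then sum the absolute
    values of the products; a higher S1 means more information retained."""
    return float(np.abs(x * (w * mask)).sum())

x = np.random.randn(4, 4)       # training data matrix 1001
w = np.random.randn(4, 4)       # channel matrix 801 (mask adjustment parameters)
scores = [eval_s1(x, w, m) for m in candidates]
parameter_mask = candidates[int(np.argmax(scores))]  # largest evaluation wins
```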
  • the updating module 64 will update the parameter mask tensor in each iteration or each epoch. If in the step 704 , the mask adjustment parameters are updated after each training sample is trained, it is advantageous that the parameter mask tensor is updated every iteration; and if in the step 704 , the mask adjustment parameters are updated every iteration, it is advantageous that the parameter mask tensor is updated at the end of each epoch.
  • the parameters are trained and the mask matrix is updated at the same time.
  • the neural network training will perform a specific count of epochs, and the specific count may be 1, 5, 10 or others, which may be adjusted by those skilled in the art according to the specific training situation and is not limited in the present disclosure.
  • Another embodiment of the present disclosure provides, similarly based on the aforementioned hardware environment, a solution of performing sparse training on a neural network model. It is different from the previous embodiment in that a mask-free stage is entered before the mask adjustment stage. In the mask-free stage, the processing device 203 only trains the parameters, in other words, does not mask the parameters, and after the mask-free stage is finished and the mask adjustment stage is entered, the parameters are trained and the mask matrix is updated at the same time. Training flow of this embodiment is shown in FIG. 11 .
  • in step 1101, the control module 62 first sets that a mask-free stage is entered.
  • in the mask-free stage of this embodiment, the parameters are not masked and all of them participate in the training; at the beginning of the training, the random generation module 61 randomly generates the parameter values.
  • the parameters participating in the training in the mask-free stage are called mask-free parameters.
  • in step 1102, the computation module 63 computes a value of a loss function in forward propagation based on the mask-free parameters.
  • the computation module 63 computes the loss function in the prior-art way: a training sample is input in forward propagation and computed by the layers of the neural network, an input feature map is gradually extracted into abstract features, and the loss function is computed by using the forward propagation result and a real value.
  • in step 1103, the computation module 63 computes partial derivatives of the loss function with respect to the mask-free parameters in back propagation.
  • the computation module 63 computes the partial derivative of the loss function with respect to each mask-free parameter through a chain rule by employing a gradient descent method.
  • in step 1104, the updating module 64 updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters. According to the effects of the mask-free parameters on the error, the updating module 64 multiplies the partial derivatives by a stride to update the mask-free parameters of the whole neural network. In this embodiment, the updating module 64 may also update the mask-free parameters based on the partial derivatives for each training sample or each iteration.
  • the steps 1102 , 1103 and 1104 may be repeated in a specific count of epochs, to update the mask-free parameters many times, and after a last updating, the updated mask-free parameters will be taken as initial values of the mask adjustment parameters in a next stage.
  • in step 1105, it is set that a mask adjustment stage is entered.
  • the control module 62 sets that the mask adjustment stage is entered, in other words, it begins to use the mask tensor to mask part of the parameters.
  • the initial values of the mask adjustment parameters are the mask-free parameters finally updated in the mask-free stage, and the mask tensor can be generated in two ways: one is for the random generation module 61 to randomly generate the mask tensor; the other is to determine the initial value of the mask tensor based on the mask-free parameters finally updated in the mask-free stage, in the same manner as step 705, which is not repeated.
  • in step 1106, the mask adjustment parameters are masked in forward propagation based on the mask tensor to compute the value of the loss function.
  • in step 1107, the partial derivatives of the loss function with respect to the mask adjustment parameters are computed in back propagation.
  • in step 1108, the mask adjustment parameters are updated based on the partial derivatives.
  • in step 1109, the mask tensor is updated based on the updated mask adjustment parameters.
  • the counts of epochs in the mask-free stage and in the mask adjustment stage are not limited in this embodiment, and can be arranged by those skilled in the art according to a specific situation, and the counts of epochs in the mask-free stage and in the mask adjustment stage are not necessarily the same.
  • Another embodiment of the present disclosure provides, similarly based on the aforementioned hardware environment, a solution of performing sparse training on a neural network model. It is different from the above embodiment in that the training is divided into three stages: a mask-free stage, a mask adjustment stage and a mask fixing stage.
  • in the mask-free stage, the processing device 203 only trains the parameters and does not mask the parameters; in the mask adjustment stage, the processing device 203 takes the updated mask-free parameters as initial values and trains the parameters and the mask tensor at the same time; and in the mask fixing stage, the processing device 203 takes the updated mask adjustment parameters and the updated mask tensor in the mask adjustment stage as initial values, and continues to train the parameters without changing or updating the mask tensor.
  • the flow executed in the mask-free stage and the mask adjustment stage in this embodiment is shown in FIG. 11 and is therefore not repeated. After the mask fixing stage is entered, the flow is shown in FIG. 12.
  • in step 1201, the control module 62 sets that a mask fixing stage is entered.
  • the control module 62 takes the updated mask adjustment parameters in the mask adjustment stage as initial values of the parameters in this stage (hereinafter referred to as the mask fixing parameters). In this embodiment, since the mask tensor has been updated in the mask adjustment stage, the mask tensor is no longer updated in this stage; the mask fixing parameters are masked based on the mask tensor finally updated in the mask adjustment stage, and the mask fixing parameters continue to be trained.
  • the following steps are repeated in at least one epoch.
  • in step 1202, the computation module 63 masks the mask fixing parameters in forward propagation based on the mask tensor updated in the mask adjustment stage to compute the value of the loss function. This step is similar to the step 702 and is not repeated.
  • in step 1203, the computation module 63 computes the partial derivatives of the loss function with respect to the mask fixing parameters in backward propagation. This step is similar to the step 703 and is not repeated.
  • in step 1204, the updating module 64 updates the mask fixing parameters based on the partial derivatives. This step is similar to the step 704 and is not repeated.
  • the training is divided into three stages.
  • in the mask-free stage, no mask tensor masks the parameters, and only the parameters are trained, to speed up the convergence of the parameters.
  • in the mask adjustment stage, the initial values of the parameters are no longer randomly generated but are the trained mask-free parameters, which helps to quickly obtain an ideal mask tensor.
  • after the mask fixing stage is entered, the parameters are continually trained by using the updated mask tensor, and finally the trained parameters will better match the mask tensor. A high-level sketch of this three-stage schedule follows.
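  • the sketch below is illustrative only (train_epoch is a hypothetical stand-in for the forward/backward/update steps described above, init_mask and update_mask_vector are the helpers sketched earlier, and the epoch counts are arbitrary):

```python
import numpy as np

def train_epoch(w, data, mask=None, lr=0.01):
    """One illustrative epoch of least-squares gradient steps on a linear
    model y = x @ w, with optional elementwise masking of w; a stand-in
    for the real forward/backward computation of the disclosure."""
    for x, y in data:
        w_eff = w if mask is None else w * mask   # masking in forward prop
        grad = x.T @ (x @ w_eff - y) / len(x)     # partial derivatives
        w = w - lr * (grad if mask is None else grad * mask)
    return w

def sparse_train(w0, data, epochs=(5, 5, 5)):
    """Mask-free stage, then mask adjustment stage (parameters and mask
    trained at the same time), then mask fixing stage (mask frozen)."""
    w = w0                                  # W0: initial parameters
    for _ in range(epochs[0]):              # mask-free stage -> W1
        w = train_epoch(w, data, mask=None)
    mask = init_mask(w)                     # M0 derived from W1 (or random)
    for _ in range(epochs[1]):              # mask adjustment stage -> W2, Mf
        w = train_epoch(w, data, mask=mask)
        mask = update_mask_vector(w)        # mask follows the parameters
    for _ in range(epochs[2]):              # mask fixing stage -> Wf
        w = train_epoch(w, data, mask=mask)
    return w, mask                          # trained Wf and updated Mf
```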
  • as shown in FIG. 13, the implementation 1301 only has a mask adjustment stage, in which both an initial value W0 of the parameter and an initial value M0 of the mask tensor are randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is determined based on the initial value W0 of the parameter.
  • the parameters are trained and the mask matrix is updated at the same time, to obtain a trained parameter Wf and an updated mask tensor Mf.
  • the implementation 1302 only has a mask-free stage and a mask adjustment stage.
  • in the mask-free stage, only the parameters are trained; an initial value W0 of the parameter is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training.
  • in the mask adjustment stage, the parameters are trained and the mask matrix is updated at the same time; the initial value of the parameter in this stage is the updated parameter W1, and the initial value M0 of the mask tensor is randomly generated by the random generation module 61 or obtained by using the updated parameter W1; finally, a trained parameter Wf and an updated mask tensor Mf are obtained.
  • the implementation 1303 only has a mask adjustment stage and a mask fixing stage.
  • both an initial value W0 of the parameter and an initial value M0 of the mask tensor are randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is determined based on the initial value W0 of the parameter.
  • in the mask adjustment stage, the parameters are trained and the mask matrix is updated at the same time, to obtain an updated parameter W1 and an updated mask tensor Mf.
  • in the mask fixing stage, the parameters are masked by the updated mask tensor Mf and are continually trained; the initial value of the parameter in this stage is the updated parameter W1, and finally a trained parameter Wf is obtained.
  • the implementation 1304 has a mask-free stage, a mask adjustment stage, and a mask fixing stage.
  • in the mask-free stage, only the parameters are trained; an initial value W0 of the parameter is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training.
  • in the mask adjustment stage, the parameters are trained and the mask matrix is updated at the same time; the initial value of the parameter in this stage is the updated parameter W1, and the initial value M0 of the mask tensor is randomly generated by the random generation module 61 or obtained by using the updated parameter W1; finally, an updated parameter W2 and an updated mask tensor Mf are obtained.
  • in the mask fixing stage, the parameters are masked by the updated mask tensor Mf and are continually trained; the initial value of the parameter in this stage is the updated parameter W2, and finally a trained parameter Wf is obtained.
  • the implementation 1305 has, in addition to a mask-free stage, a mask adjustment stage, and a mask fixing stage, other training stages (indicated by dashed lines) between the mask-free stage and the mask adjustment stage, and between the mask adjustment stage and the mask fixing stage.
  • in the mask-free stage, only the parameters are trained; an initial value W0 of the parameter is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training.
  • the mask-free stage may be followed by any training stage disclosed or not disclosed in this disclosure, in which the parameters are trained or the mask matrix is updated.
  • in that following stage, the initial value of the parameter is the updated parameter W1, and an initial value M0 of the mask tensor is randomly generated by the random generation module 61 or obtained by using the updated parameter W1, to obtain an updated parameter W2.
  • then the mask adjustment stage is entered, in which the parameters are trained and the mask matrix is updated at the same time; the initial value of the parameter in this stage is the updated parameter W2, and the initial value of the mask tensor is still the mask tensor M0, to obtain an updated parameter W3 and an updated mask tensor M1. This stage is then followed by any stage disclosed or not disclosed in the present disclosure, in which the parameters are trained or the mask matrix is updated.
  • for example, this stage is a parameter fixing stage, in other words, the parameters are fixed and not trained, and only the mask tensor is trained; the initial value of the parameter in this stage is the updated parameter W3, and the initial value of the mask tensor is the updated mask tensor M1, to obtain an updated mask tensor Mf.
  • finally, in the mask fixing stage, the parameters are masked by the updated mask tensor Mf and are continually trained; the initial value of the parameter in this stage is the updated parameter W3, and finally a trained parameter Wf is obtained.
  • The count of epochs in each stage of the various implementations is not limited in the present disclosure; it can be arranged by those skilled in the art according to the specific situation, and the count of epochs in each stage is not necessarily the same (a code sketch of such a staged schedule follows this list).
  • Moreover, the aforementioned embodiments do not necessarily require that the preset count of epochs in each stage be completed in full.
  • The control module 62 may further judge whether the percentage of element values of the parameter mask tensor that remain unchanged over 2 consecutive epochs reaches a threshold. If the threshold is reached, the training result has essentially converged and further training brings limited improvement in accuracy; the mask adjustment stage is therefore ended to complete the training.
  • The threshold is typically set above 70%; in other words, if the percentage of unchanged element values of the parameter mask tensor is above 70%, the training will be stopped.
  • The threshold is not limited in the present disclosure and may be 80%, 90%, 100%, or any other percentage.
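  • To make the staged schedules and the early-stopping check above concrete, the following is a minimal sketch of a three-stage training loop. It is an illustration only: the helper names train_epoch and update_mask, the default epoch counts, and the 0.8 threshold are assumptions, not prescribed by this disclosure.

```python
import numpy as np

def mask_unchanged_ratio(prev_mask, mask):
    """Fraction of mask tensor elements left unchanged between two
    consecutive epochs."""
    return float(np.mean(prev_mask == mask))

def staged_sparse_training(W, mask, train_epoch, update_mask,
                           free_epochs=2, adjust_epochs=10, fix_epochs=5,
                           threshold=0.8):
    # Mask-free stage: only the parameters are trained.
    for _ in range(free_epochs):
        W = train_epoch(W, mask=None)

    # Mask adjustment stage: parameters and mask tensor are updated together.
    prev_mask = mask.copy()
    for _ in range(adjust_epochs):
        W = train_epoch(W, mask=mask)
        mask = update_mask(W)
        # End the stage early once the mask has essentially converged, e.g.,
        # at least 80% of its elements unchanged over 2 consecutive epochs.
        if mask_unchanged_ratio(prev_mask, mask) >= threshold:
            break
        prev_mask = mask.copy()

    # Mask fixing stage: the mask tensor is frozen and the masked parameters
    # continue to be trained.
    for _ in range(fix_epochs):
        W = train_epoch(W, mask=mask)
    return W, mask
```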
  • Another embodiment of the present disclosure is a computer-readable storage medium that has stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processor, performs the methods of the embodiments described above.
  • the above integrated units may be implemented in a form of a software program module. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory.
  • the software product may be stored in a memory, and it may include several instructions to cause a computer device (for example, a personal computer, a server, a network device, or the like) to perform some or all of the steps of the methods described in the embodiments of the present disclosure.
  • The above memory may include, but is not limited to, various media that can store program code, for example, a USB flash drive (U disk), a flash disk, a read only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disk, or the like.
  • The trained parameters are shielded by using the updated parameter mask tensor to control the processing area of the feature map input to the neural network model, so that, on the one hand, the expected accuracy can be reached and, on the other hand, the amount of computation can be reduced during inference to achieve sparsification.
  • an electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, an earphone, a mobile memory, a wearable device, a visual terminal, an automatic driving terminal, transportation, a household appliance, and/or medical device.
  • the transportation comprises an airplane, a ship and/or a vehicle;
  • the household appliance comprises a television set, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood;
  • the medical device comprises a nuclear magnetic resonance instrument, a B-ultrasonic scanner and/or an electrocardiograph.
  • the electronic device or apparatus of the present disclosure may also be applied to fields such as Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunication, finance, retail, construction site, medical care, and the like.
  • the electronic device or apparatus of the present disclosure may also be used in application scenes related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal.
  • an electronic device or apparatus with high computational capability according to the present disclosure may be applied to a cloud device (for example, cloud server), and an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a camera).
  • appropriate hardware resources can be matched from hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate hardware resources of the terminal device and/or the edge device, so that unified management, scheduling and cooperative work of terminal-cloud integration or cloud-edge-terminal integration are achieved.
  • the units are split based on logic functions considered herein, and they may be split in other ways in a practical implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled.
  • the connection discussed above in conjunction with the accompanying drawings may be direct or indirect coupling between the units or components.
  • the foregoing direct or indirect coupling involves a communication connection that uses an interface, wherein the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
  • A unit described as a separate part may or may not be physically separate, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be in a same position or distributed over a plurality of network units.
  • some or all of the units can be selected to achieve the objectives of the solutions described in the embodiments of the present disclosure.
  • a plurality of units in embodiments of the present disclosure may be integrated into one unit or each unit exists physically separately.
  • the above integrated unit may also be implemented in a form of hardware, in other words, a specific hardware circuit, which may include a digital circuit, and/or an analog circuit, and the like.
  • a physical implementation of a hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various devices described herein may be implemented by a suitable hardware processor, such as a central processing unit, GPU, FPGA, DSP, ASIC, and the like.
  • the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium, a magneto-optical storage medium, or the like), and it may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
  • Clause 1 A method of performing sparse training on a neural network model, comprising:
  • Clause 2 The method of Clause 1, further comprising:
  • Clause 3 The method of Clause 2, further comprising:
  • Clause 4 The method of Clause 1, further comprising:
  • Clause 6 The method of Clause 5, wherein the specified dimension is an input channel dimension.
  • Clause 7 The method of Clause 4, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor includes:
  • Clause 8 The method of Clause 7, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 9 The method of Clause 1, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
  • Clause 10 The method of Clause 1, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
  • Clause 12 The method of Clause 11, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 13 The method of Clauses 5 to 8 or 10, wherein m is 4 and n is 2.
  • Clause 14 The method of Clause 10, wherein the specific count is 1.
  • Clause 15 A computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of any of Clauses 1 to 12.
  • Clause 16 An integrated circuit device for performing sparse training on a neural network model, comprising:
  • Clause 17 The integrated circuit device of Clause 16, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
  • Clause 18 The integrated circuit device of Clause 17, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
  • Clause 19 The integrated circuit device of Clause 16, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
  • Clause 21 The integrated circuit device of Clause 20, wherein the specified dimension is an input channel dimension.
  • Clause 23 The integrated circuit device of Clause 22, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 24 The integrated circuit device of Clause 16, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
  • Clause 25 The integrated circuit device of Clause 16, wherein the updating module includes a division unit, a sorting unit, and an adjustment unit, and in the mask adjustment stage, after a specific count of epochs have been performed, the division unit divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit sorts the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small; and the adjustment unit sets, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval to 1, and sets, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval to 0.
  • Clause 26 The integrated circuit device of Clause 25, wherein in the mask adjustment stage, the control module judges whether a percentage of all unchanged element values of the mask tensor reaches a threshold in 2 consecutive epochs; and if the threshold is reached, the mask adjustment stage is ended.
  • Clause 27 The integrated circuit device of Clause 26, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 28 The integrated circuit device of Clauses 20 to 23 or 25, wherein m is 4 and n is 2.
  • Clause 29 The integrated circuit device of Clause 25, wherein the specific count is 1.
  • Clause 30 A board card, comprising the integrated circuit device of any of Clauses 16 to 29.
  • Clause 31 A method of performing sparse training on a neural network model, comprising:
  • Clause 32 The method of Clause 31, further comprising:
  • Clause 33 The method of Clause 32, further comprising:
  • Clause 34 The method of Clause 31, further comprising:
  • Clause 35 The method of Clause 34, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
  • Clause 36 The method of Clause 35, wherein the specified dimension is an input channel dimension.
  • Clause 37 The method of Clause 34, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor comprises:
  • Clause 38 The method of Clause 37, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 39 The method of Clause 31, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
  • Clause 40 The method of Clause 31, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
  • Clause 42 The method of Clause 41, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 43 The method of Clauses 35 to 38 or 40, wherein m is 4 and n is 2.
  • Clause 44 The method of Clause 40, wherein the specific count is 1.
  • Clause 45 A computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of any of Clauses 31 to 42.
  • Clause 46 An integrated circuit device for performing sparse training on a neural network model, comprising:
  • Clause 47 The integrated circuit device of Clause 46, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
  • Clause 48 The integrated circuit device of Clause 47, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
  • Clause 49 The integrated circuit device of Clause 46, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
  • Clause 50 The integrated circuit device of Clause 49, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
  • Clause 51 The integrated circuit device of Clause 50, wherein the specified dimension is an input channel dimension.
  • Clause 53 The integrated circuit device of Clause 52, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 54 The integrated circuit device of Clause 46, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
  • Clause 55 The integrated circuit device of Clause 46, wherein the updating module includes a division unit, a sorting unit, and an adjustment unit, and in the mask adjustment stage, after a specific count of epochs have been performed, the division unit divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit sorts the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small; and the adjustment unit sets, in the mask tensor, elements at positions that correspond to the first n mask adjustment parameters with larger absolute values in each interval to 1, and sets, in the mask tensor, elements at positions that correspond to the m-n mask adjustment parameters with smaller absolute values in each interval to 0.
  • Clause 56 The integrated circuit device of Clause 55, wherein in the mask adjustment stage, the control module judges whether a percentage of all unchanged element values of the mask tensor reaches a threshold in 2 consecutive epochs; and if the threshold is reached, the mask adjustment stage is ended.
  • Clause 57 The integrated circuit device of Clause 56, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 58 The integrated circuit device of Clauses 50 to 53 or 55, wherein m is 4 and n is 2.
  • Clause 60 A board card, comprising the integrated circuit device of any of Clauses 46 to 59.

Abstract

This disclosure relates to a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model, wherein a processing device of the present disclosure is included in an integrated circuit device, and the integrated circuit device comprises a universal interconnection interface and a computation device. The computation device interacts with the processing device to jointly complete computing operations specified by the user. The integrated circuit device further comprises a storage device, and the storage device is connected to the computation device and the processing device, respectively, for data storage of the computation device and the processing device.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to Chinese patent application No. 2020112169035, filed on Nov. 4, 2020 and entitled "NEURAL NETWORK SPARSIFICATION DEVICE, METHOD AND CORRESPONDING PRODUCT," and to Chinese patent application No. 2020115661411, filed on Dec. 25, 2020 and entitled "NEURAL NETWORK SPARSIFICATION DEVICE, METHOD AND CORRESPONDING PRODUCT."
  • TECHNICAL FIELD
  • The present disclosure relates generally to the field of neural networks. More particularly, the present disclosure relates to a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model.
  • BACKGROUND
  • In recent years, with the rapid development of deep learning, the performance of algorithms in a range of fields such as computer vision and natural language processing has progressed by leaps and bounds. However, deep learning algorithms are computation-intensive and storage-intensive. As information processing tasks become increasingly complex, the demands on the real-time performance and accuracy of these algorithms keep rising, and neural networks are designed deeper and deeper, so their requirements for computation and storage space keep growing. As a result, it is difficult to directly apply existing deep-learning-based artificial intelligence techniques to mobile phones, satellites, or embedded devices with limited hardware resources.
  • Therefore, compression, acceleration, and optimization of deep neural network models become particularly important. A large body of research attempts to reduce the computation and storage requirements of a neural network without affecting model accuracy, which is of great significance for engineering applications of deep learning on embedded and mobile devices; sparsification is one such model-lightweighting method.
  • Network parameter sparsification reduces redundant components in a large network through a proper method, to lower the network's requirements for computation and storage space. Although existing fine-grained parameter sparsification methods achieve excellent model performance, they are unfriendly to hardware access; in other words, on-chip and off-chip input/output overhead is high and performance is low. Structured sparsity methods based on channels and convolution kernels improve hardware performance, but incur a greater loss of model accuracy. Finally, most existing sparsification algorithms work in an off-line fine-tuning mode, in other words, a pre-trained model is fine-tuned after sparsification; this mode is greatly limited, and considerable performance benefits cannot be obtained from the model training itself.
  • Therefore, a solution of performing inference by using sparse, online-trained parameter tensors is urgently needed.
  • SUMMARY
  • In order to at least partially solve technical problems mentioned in the background, solutions of the present disclosure provide a device, a board card, a method and a readable storage medium for performing sparse training on a neural network model.
  • In one aspect of the present disclosure, the present disclosure provides a method of performing sparse training on a neural network model, which comprises a mask adjustment stage and a mask fixing stage. In the mask adjustment stage, the following steps are repeated in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; updating the mask adjustment parameters based on the partial derivatives; and updating the mask tensor based on the updated mask adjustment parameters. In the mask fixing stage, the updated mask adjustment parameters in the mask adjustment stage may be taken as initial values of mask fixing parameters, and the following steps are repeated in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on an updated mask tensor to compute the value of the loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; and updating the mask fixing parameters based on the partial derivatives. The updated mask fixing parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • In another aspect of the present disclosure, the present disclosure provides a method of performing sparse training on a neural network model, comprising: in a mask adjustment stage, repeating the following steps in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; updating the mask adjustment parameters based on the partial derivatives; and updating the mask tensor based on updated mask adjustment parameters. The updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • In another aspect of the present disclosure, the present disclosure provides a computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model. The computer program code, when executed by a processing device, performs the aforementioned method.
  • In another aspect of the present disclosure, the present disclosure provides an integrated circuit device for performing sparse training on a neural network model, comprising: a processing device and a computation device. The processing device comprises a control module, a computation module, and an updating module. When the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives, and update the mask tensor based on updated mask adjustment parameters. When the control module sets that a mask fixing stage is entered, the updating module takes the updated mask adjustment parameters as initial values of mask fixing parameters, and the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters. The updating module updates the mask fixing parameters based on the partial derivatives. The computation device is configured to shield the updated mask fixing parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • In another aspect of the present disclosure, the present disclosure provides an integrated circuit device for performing sparse training on a neural network model, comprising: a processing device and a computation device. The processing device comprises a control module, a computation module, and an updating module; when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives, and update the mask tensor based on the updated mask adjustment parameters. The computation device is configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • In another aspect of the present disclosure, the present disclosure provides a board card, comprising the aforementioned integrated circuit device.
  • According to the present disclosure, in the model training stage, the parameters are trained and the mask tensor is updated at the same time, which achieves the technical effects of reducing input/output overhead and improving accuracy.
  • BRIEF DESCRIPTION OF DRAWINGS
  • The above and other objectives, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are illustrated exemplarily rather than restrictively, and identical or corresponding reference numerals refer to identical or corresponding parts, in which:
  • FIG. 1 is a structural diagram illustrating a board card according to an embodiment of the present disclosure;
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram illustrating an internal structure of a single-core computation device according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram illustrating an internal structure of a multi-core computation device according to an embodiment of the disclosure;
  • FIG. 5 is a schematic diagram illustrating an internal structure of a processor core according to an embodiment of the disclosure;
  • FIG. 6A is a schematic diagram illustrating an internal structure of a processing device according to an embodiment of the present disclosure;
  • FIG. 6B is a schematic diagram illustrating an internal structure of a processing device according to another embodiment of the present disclosure;
  • FIG. 7 is a flowchart illustrating a sparse training method according to an embodiment of the disclosure;
  • FIG. 8 is a schematic diagram illustrating an exemplary masking process;
  • FIG. 9 is a schematic diagram illustrating exemplary mask vector updating;
  • FIG. 10 is a schematic diagram illustrating an exemplary product sum computation process;
  • FIG. 11 is a flowchart illustrating a sparse training method according to another embodiment of the present disclosure;
  • FIG. 12 is a flowchart illustrating a sparse training method in a mask fixing stage according to another embodiment of the disclosure; and
  • FIG. 13 is a schematic diagram illustrating several implementations when performing sparse training on a neural network model according to the present disclosure.
  • DETAILED DESCRIPTION
  • Technical solutions in the embodiments of the present disclosure will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only some of the embodiments of the present disclosure, but not all of them. All other embodiments, which can be derived by those skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
  • It should be understood that terms "first", "second", "third", "fourth", and the like, in the claims, description, and drawings of the present disclosure are used for distinguishing different objects, rather than for describing a specific order. Terms "including" and "comprising", when used in the description and claims of this disclosure, indicate the presence of stated features, entities, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, entities, steps, operations, elements, components, and/or combinations thereof.
  • It should also be understood that the terms used in the description of the present disclosure are for the purpose of describing specific embodiments only and are not intended to limit the present disclosure. As used in the description and claims of this disclosure, "a", "an" and "this" in the singular are intended to include the plural, unless other cases are clearly indicated in the context. It should be further understood that the term "and/or" used in the description and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes these combinations.
  • As used in the description and claims, a term “if” may be interpreted contextually as “when”, or “once”, or “in response to determining”, or “in response to detecting.”
  • Specific implementations of the present disclosure will be described in detail below in conjunction with the accompanying drawings.
  • A neural network is composed of an input layer, a convolutional layer, an activation function, a pooling layer and a fully connected layer, and may have from a few layers up to hundreds of layers. One operator is executed for each layer; for example, a convolutional operator is executed for a convolutional layer, and the number of operators that need to be executed is as large as the number of layers. In this disclosure, when a specific layer is mentioned, the operator to which the layer corresponds is indicated.
  • FIG. 1 illustrates a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. As shown in FIG. 1, the board card 10 comprises a chip 101, which is a system-level chip, or a system-on-chip, and has integrated thereon one or more combined processing devices. The combined processing device is an artificial intelligence computation unit for supporting various deep learning and machine learning algorithms, to meet intelligent processing requirements in complex scenes in fields such as computer vision, voice, natural language processing, and data mining. In particular, deep learning techniques have been widely applied in the cloud intelligence field; one remarkable feature of cloud intelligence applications is the huge amount of input data, which places high requirements on the storage capability and computational capability of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications, and has huge off-chip storage, huge on-chip storage, and powerful computational capability.
  • The chip 101 is connected with an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a WIFI interface, or the like. Data to be processed may, via the external interface device 102, be transferred to the chip 101 by the external device 103. A computation result of the chip 101 may, via the external interface device 102, be transmitted back to the external device 103. The external interface device 102 may have interfaces in different forms, for example, a PCIe interface, and the like, according to different application scenes.
  • The board card 10 further comprises a storage device 104 for data storage, which comprises one or more storage units 105. The storage device 104 is, through a bus, in connection and data transmission with a control device 106 and the chip 101. The control device 106 in the board card 10 is configured to regulate a state of the chip 101. For this reason, in an application scene, the control device 106 may include a micro controller unit (MCU).
  • FIG. 2 is a structural diagram illustrating a combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device 20 comprises a computation device 201, an interface device 202, a processing device 203, and a DRAM 204.
  • The computation device 201 is configured to perform a user-specified operation, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform computation of deep learning or machine learning, and it may interact with the processing device 203 through the interface device 202, to jointly complete the user-specified operation.
  • The interface device 202 is configured to transmit data and control instructions between the computation device 201 and the processing device 203. For example, the computation device 201 may, via the interface device 202, acquire input data from the processing device 203, and write the input data into a storage device on the computation device 201. Further, the computation device 201 may, via the interface device 202, acquire a control instruction from the processing device 203, and write the control instruction into a control cache on the computation device 201. Alternatively or optionally, the interface device 202 may also read data in the storage device of the computation device 201 and transmit the data to the processing device 203.
  • The processing device 203, as a general-purpose processing device, performs basic controls that include, but are not limited to, data move, start and/or stop of the computation device 201, and the like. Depending on different implementations, the processing device 203 may be a processor of one or more of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor. These processors include, but are not limited to, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic, a discrete hardware component, and the like, and the number of the processors may be determined according to actual needs. As mentioned above, for the computation device 201 of the present disclosure alone, it may be regarded as having a single-core structure or an isomorphic multi-core structure. However, when the computation device 201 and the processing device 203 are integrated to be considered together, the two are regarded as forming a heterogeneous multi-core structure.
  • The DRAM 204 is configured to store data to be processed, and is a DDR memory with a size of typically 16G or larger, to store data of the computation device 201 and/or the processing device 203.
  • FIG. 3 illustrates a schematic diagram of an internal structure of the computation device 201 as a single core. The single-core computation device 301 is configured to process input data such as computer vision, voice, natural language, data mining and the like, and it comprises three modules: a control module 31, a computation module 32, and a storage module 33.
  • The control module 31 is configured to coordinate and control work of the computation module 32 and the storage module 33 to complete a deep learning task, and comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is configured to acquire an instruction from the processing device 203, and the instruction decode unit 312 is configured to decode the acquired instruction and send a decoded result as control information to the computation module 32 and the storage module 33.
  • The computation module 32 comprises a vector computation unit 321 and a matrix computation unit 322. The vector computation unit 321 is configured to perform vector computation, and may support complex computation such as vector multiplication, addition, nonlinear transformation; and the matrix computation unit 322 is responsible for core computation of a deep learning algorithm, in other words, matrix multiplication and convolution.
  • The storage module 33 is configured to store or move related data, and comprises a neuron RAM (NRAM) 331, a weight RAM (WRAM) 332, and a direct memory access (DMA) 333. The NRAM 331 is configured to store an input neuron, output neuron and computed intermediate result; the WRAM 332 is configured to store a convolution kernel, in other words, a weight, of a deep learning network; and the DMA 333 is connected to the DRAM 204 through a bus 34 and is responsible for data move between the single-core computation device 301 and the DRAM 204.
  • FIG. 4 illustrates a schematic diagram of an internal structure of the computation device 201 as a multi-core device. The multi-core computation device 41 employs a hierarchical design. As a system on chip, the multi-core computation device 41 comprises at least one cluster, and each cluster comprises a plurality of processor cores; in other words, the multi-core computation device 41 is organized in a hierarchy of system on chip, cluster, and processor core.
  • From the perspective of the system-on-chip level, the multi-core computation device 41 comprises, as shown in FIG. 4, an external storage controller 401, a peripheral communication module 402, an on-chip interconnection module 403, a synchronization module 404, and a plurality of clusters 405.
  • There may be a plurality of external storage controllers 401, two of which are exemplarily shown in the figure. The external storage controller 401 is configured to access, in response to an access request issued by a processor core, an external storage device, for example, the DRAM 204 in FIG. 2, to read data from or write data to off-chip. The peripheral communication module 402 is configured to receive a control signal from the processing device 203 through the interface device 202 to start the computation device 201 to execute a task. The on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and the plurality of clusters 405, to transmit data and control signals between the modules. The synchronization module 404 is a global barrier controller (GBC) and is configured to coordinate the work progress of the clusters to ensure information synchronization. The plurality of clusters 405 are the computation cores of the multi-core computation device 41, four of which are exemplarily shown in the figure. With the development of hardware, the multi-core computation device 41 of the present disclosure may further include 8, 16, 64, or even more clusters 405. The clusters 405 are configured to efficiently execute deep learning algorithms.
  • From the perspective of the cluster level, each cluster 405 comprises, as shown in FIG. 4, a plurality of processor cores (IPU cores) 406 and a MEM core 407.
  • There are 4 processor cores 406 exemplarily shown in the figure, and the number of the processor cores 406 is not limited in the present disclosure. An internal structure of the processor core is shown in FIG. 5. Similar to the single-core computation device 301 of FIG. 3, each processor core 406 also comprises three modules: a control module 51, a computation module 52, and a storage module 53. Functions and structures of the control module 51, the computation module 52 and the storage module 53 are substantially the same as those of the control module 31, the computation module 32, and the storage module 33, and thus are not repeated. It should be specially noted that the storage module 53 comprises an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls, through a broadcast bus 409, access to NRAM 531/WRAM 532 and DRAM 204; and the MVDMA 534 is configured to control access to the NRAM 531/WRAM 532 and a storage unit (SRAM) 408.
  • Returning to FIG. 4, the MEM core 407 is primarily used for storage and communication, in other words, for storing shared data or intermediate results among the processor cores 406, and for performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the MEM core 407 has the capability of scalar computation and is used for performing scalar computation.
  • The MEM core 407 comprises the SRAM 408, the broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in a same cluster 405 does not need to be individually fetched from the DRAM 204 by each processor core 406, but is transferred among the processor cores 406 through the SRAM 408, and the MEM core 407 only needs to rapidly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, so that inter-core communication efficiency is improved and on-chip and off-chip input/output accesses are greatly reduced.
  • The broadcast bus 409, CDMA 410, and GDMA 411 are used for performing communication among the processor cores 406, communication among the clusters 405, and data transmission between the cluster 405 and the DRAM 204, respectively, which will be described separately below.
  • The broadcast bus 409 is used for completing high-speed communication among the processor cores 406 in the cluster 405. The inter-core communication modes supported by the broadcast bus 409 of this embodiment comprise unicast, multicast and broadcast. The unicast refers to point-to-point (for example, from a single processor core to a single processor core) data transmission; the multicast is a communication mode in which a piece of data is transmitted from the SRAM 408 to several specific processor cores 406; and the broadcast is a communication mode in which a piece of data is transmitted from the SRAM 408 to all the processor cores 406, which is a special case of the multicast.
  • The CDMA 410 is used for controlling access to SRAMs 408 between different clusters 405 within a same computation device 201.
  • The GDMA 411 cooperates with the external storage controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be learned from the foregoing, the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be accomplished via two channels. A first channel is to directly connect the DRAM 204 with the NRAM 531 or WRAM 532 through the IODMA 533; and a second channel is to transmit data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then between the SRAM 408 and the NRAM 531 or WRAM 532 via the MVDMA 534. Although outwardly the second channel requires more elements to participate and the data flow is longer, in fact the bandwidth of the second channel is much greater than that of the first channel in some embodiments, and therefore the communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient through the second channel. In an embodiment of the present disclosure, a data transmission channel may be selected according to the hardware conditions at hand.
  • In other embodiments, functions of the GDMA 411 and IODMA 533 may be integrated in a same component. For ease of description, the GDMA 411 and the IODMA 533 are regarded as different components in the present disclosure, and for those skilled in the art, the component will fall within the scope of protection of the present disclosure as long as it realizes functions and technical effects similar to the present disclosure. Further, the functions of the GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 can be also implemented by a same component.
  • The training of a neural network is to adjust the parameters of the layers by inputting training samples, so that the result computed by the neural network is as close to the real result as possible. The neural network training comprises forward propagation and backward propagation. The forward propagation is to, based on an existing model, input a training sample that is computed layer by layer through the neural network, gradually extracting the input feature map into abstract features. The backward propagation is to compute a loss function according to the result of the forward propagation and the real value, compute the partial derivatives of the loss function with respect to the parameters through the chain rule by adopting a gradient descent method to update the parameters, then continue training with the updated parameters, and repeat the above many times, such that the final computation result of the forward propagation meets expectations.
  • In this embodiment, an epoch refers to the process of performing training once by using all the training samples; the set of these training samples is a training set, and training on one batchsize of training samples is an iteration. For example, if a training set has 1000 training samples and the batchsize is set to 10, then 10 training samples participate in training in each iteration, and there are 100 iterations in one epoch. In practice, the training of a neural network model may go through a plurality of epochs.
  • Based on the aforementioned hardware environment, the embodiment provides a solution of performing sparse training on a neural network model. More specifically, the processing device 203 trains parameters and a mask tensor at the same time in a neural network training stage. As shown in FIG. 6A, the processing device 203 comprises a random generation module 61, a control module 62, a computation module 63, and an updating module 64, to execute a sparse training method shown in FIG. 7. In other embodiments, as shown in FIG. 6B, the processing device 203 comprises a random generation module 61, a control module 62, a computation module 63, an updating module 64, and a mask tensor determination module 65, to perform the sparse training method as shown in FIG. 7.
  • In step 701, it is set that a mask adjustment stage is entered. While performing training, the prior art only trains all the parameters (such as weights and offsets) and usually does not mask the parameters. The purpose of masking the parameters in this embodiment is to reduce the participation of the parameters in the training stage and avoid over-fitting, thereby reducing the amount of computation, and meanwhile to have the mask tensor updated along with the parameters in the training process, to obtain a more ideal mask tensor. The control module 62 starts to enter the mask adjustment stage, in other words, begins to mask a part of the parameters by using the mask tensor. In one application scene, the parameters and the mask tensor are randomly generated at the beginning of the training, and the random generation module 61 randomly generates initial values of the mask tensor and the parameters. In another application scene, the mask tensor is generated according to the randomly generated parameters at the beginning of the training; in other words, the random generation module 61 randomly generates initial values of the parameters, and the mask tensor determination module 65 determines an initial value of the mask tensor according to the initial values of the parameters.
  • In some embodiments, when the mask tensor is a one-dimensional tensor (in other words, a vector), the mask tensor determination module 65 may determine the initial value of the mask tensor as follows: selecting, from every m data elements of a specified dimension of the initial values of the above parameters, the n data elements with larger absolute values as valid data elements, where m>n; and generating the initial value of the mask tensor based on the positions of the n valid data elements among the m data elements. In some implementations, the specified dimension may be the input channel dimension (Cin). Specifically, in this embodiment, the parameters are divided into a plurality of intervals in units of a specific parameter count m, the parameters in each interval are sorted according to their absolute values from large to small, and then, in the mask tensor, elements at positions that correspond to the first n parameters with larger absolute values in each interval are set to 1, and elements at positions that correspond to the m-n parameters with smaller absolute values in each interval are set to 0.
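  • As an illustration of this top-n-of-m selection, a minimal numpy sketch follows; the function name init_mask_1d is hypothetical, and the grouped dimension is assumed to be divisible by m.

```python
import numpy as np

def init_mask_1d(W, m=4, n=2):
    """Group every m consecutive values along the grouped dimension, set the
    mask to 1 for the n values with the largest absolute value in each group,
    and to 0 for the rest. Assumes W.size is divisible by m."""
    groups = W.reshape(-1, m)                    # intervals of m parameters
    order = np.argsort(-np.abs(groups), axis=1)  # |w| sorted large -> small
    mask = np.zeros_like(groups, dtype=np.int8)
    np.put_along_axis(mask, order[:, :n], 1, axis=1)  # first n -> 1
    return mask.reshape(W.shape)
```

With m=4 and n=2, exactly half of the parameters in every interval are retained, matching the 2-out-of-4 pattern used in the examples of this disclosure.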
  • In some other embodiments, when the mask tensor is a two-dimensional tensor, the control module 62 will preset a specific count of two-dimensional mask tensors and then select one of the preset two-dimensional mask tensors as the initial value of the mask tensor. Each dimension of these two-dimensional mask tensors comprises m elements, of which n elements are 1 and m-n elements are 0, where m>n.
  • The mask tensor of this embodiment is exemplarily set as a two-dimensional mask matrix for masking the input channel (cin) and the output channel (cout) of a convolution kernel of the convolutional layer; assuming that m is 4 and n is 2, the mask matrix cin×cout is set to 4(m)×4(m), where there are 2(n) elements of 1 and 2(m-n) elements of 0 in any row and any column. Since there are 90 such 4×4 mask matrices, the control module 62 presets, in this step, 90 4×4 mask matrices with 2 elements of 1 and 2 elements of 0 in any row and any column, which are pre-stored in the DRAM 204. Although this embodiment is illustrated by taking the input channel (cin) and the output channel (cout) as an example, the present disclosure is not limited thereto, and any parameter can be masked according to the teaching of this embodiment.
  • Selecting one from among this specific count (for example, 90) of two-dimensional mask tensors as the initial value may comprise: respectively masking the two specified dimensions of the initial values of the parameters of the neural network layer based on each preset two-dimensional mask tensor to obtain masked parameter tensors; performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and selecting the two-dimensional mask tensor that yields the largest of all parameter evaluation values as the initial value of the mask tensor. In some implementations, the above two specified dimensions can be an input channel dimension and an output channel dimension. A masking process of the two-dimensional mask tensor can refer to the description made hereinafter in conjunction with FIG. 8.
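  • The following sketch enumerates the 90 candidate 4×4 mask matrices and selects one by evaluation value. The scoring line is a simplified stand-in for the product-sum evaluation described above, and the shapes assumed for W (cin×cout) and the sample input x (length cin) are illustrative assumptions.

```python
import numpy as np
from itertools import combinations, product

def all_2_of_4_masks():
    """All 4x4 0/1 matrices with exactly two 1s in every row and every
    column; there are exactly 90 of them."""
    row_patterns = []
    for cols in combinations(range(4), 2):        # C(4,2) = 6 possible rows
        row = np.zeros(4, dtype=np.int8)
        row[list(cols)] = 1
        row_patterns.append(row)
    masks = []
    for rows in product(row_patterns, repeat=4):  # 6**4 = 1296 candidates
        candidate = np.array(rows)
        if (candidate.sum(axis=0) == 2).all():    # two 1s per column as well
            masks.append(candidate)
    return masks

def pick_initial_mask(W, x, masks):
    """Mask W with each candidate, perform a product-sum with the sample
    input x, and keep the candidate with the largest evaluation value."""
    best_mask, best_score = None, -np.inf
    for mask in masks:
        score = np.abs(x @ (W * mask)).sum()      # masked product sum
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask

masks = all_2_of_4_masks()
assert len(masks) == 90
```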
  • After entering the mask adjustment stage, the processing device 203 repeats the following steps in a plurality of epochs.
  • In step 702, the mask adjustment parameters are masked based on the mask tensor in forward propagation to compute a value of the loss function. Here, for ease of identification, the parameters in the mask adjustment stage are defined as mask adjustment parameters. Taking the aforementioned 4×4 mask matrix as an example, in this step, the computation module 63 masks the input channel and the output channel respectively according to one mask matrix selected from the 90 mask matrices in the initialization step.
  • FIG. 8 illustrates an exemplary masking process. Assuming that the input channel and the output channel of the convolutional layer form a 4×4 channel matrix 801, whose elements are a11 to a44, the channel matrix 801 constitutes the mask adjustment parameters. In this step, the computation module 63 performs masking based on a mask matrix 802 exemplarily selected from the aforementioned 90 4×4 mask matrices. An element in the channel matrix 801 is retained by the computation module 63 if the corresponding element in the mask matrix 802 is 1, and is masked if the corresponding element in the mask matrix 802 is 0, so that its value becomes 0. Taking a11 in the channel matrix 801 as an example, its corresponding element in the mask matrix 802 is 0, so the corresponding element in the masked parameter matrix 803 is masked and its value is 0. In this way, all element values of the masked parameter matrix 803 are obtained. Since half of the elements in the channel matrix 801 are masked, about half of the amount of computation is saved. For each training sample, the computation module 63 masks, in forward propagation, the mask adjustment parameters based on the mask tensor, then performs computation, and finally obtains the value of the loss function that corresponds to the output error of the neural network.
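  • In code, the masking of FIG. 8 reduces to an element-wise product; a tiny sketch with made-up values follows (the mask below is one arbitrary member of the 90 candidates, not necessarily the matrix 802 of the figure).

```python
import numpy as np

channel = np.arange(1.0, 17.0).reshape(4, 4)    # stand-in for a11..a44
mask = np.array([[0, 1, 1, 0],
                 [1, 0, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1]], dtype=np.int8)  # two 1s per row and column
masked = channel * mask  # zeros wherever the mask is 0; ~half the work remains
```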
  • In step 703, partial derivatives of the loss function with respect to the mask adjustment parameters are computed in back propagation. In back propagation, the computation module 63 propagates the output error of the neural network step by step from the output end of the neural network model toward the input, and in the process, the effect of each mask adjustment parameter on the loss function is computed by using the chain rule; in other words, the partial derivative of the loss function with respect to each mask adjustment parameter is computed.
  • In step 704, the mask adjustment parameters are updated based on the partial derivatives. The updating module 64 updates the mask adjustment parameters of the whole neural network according to the effect of each mask adjustment parameter on the error; in other words, each partial derivative is multiplied by a stride (step size), and the result is used to update the corresponding mask adjustment parameter.
  • In this embodiment, the updating module 64 may update the mask adjustment parameters based on the partial derivatives after each training sample or each iteration. Taking the aforementioned example in which an epoch comprises a training set of 1000 training samples and the batchsize is 10: if the mask adjustment parameters are updated after each training sample is trained, the updating will be performed 1000 times in the epoch; and if the mask adjustment parameters are updated every iteration, the updating will be performed 100 times in the epoch.
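  • A hedged sketch of the update step and its frequency: the plain gradient-descent rule below is one common reading of scaling by a stride, and the names sgd_update and lr are assumptions.

```python
def sgd_update(W, grad, lr=0.01):
    # Move each mask adjustment parameter against its partial derivative,
    # scaled by the stride (learning rate).
    return W - lr * grad

# With a 1000-sample training set and batchsize 10:
#   updating per training sample -> 1000 updates per epoch
#   updating per iteration       -> 1000 / 10 = 100 updates per epoch
```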
  • In step 705, the mask tensor is updated based on the updated mask adjustment parameters. The updating module 64 of this embodiment updates the mask tensor in a variety of ways.
  • If the mask tensor is one-dimensional, in other words, a mask vector, each element of the mask vector masks a single parameter. As shown in FIG. 6, the updating module 64 comprises a division unit 641, a sorting unit 642 and an adjustment unit 643, which are used for updating the mask vector. When the updating module 64 updates the mask vector, it sets, to 1, the element(s) that correspond to mask adjustment parameters with larger absolute values, and sets, to 0, the element(s) that correspond to mask adjustment parameters with smaller absolute values, because the mask adjustment parameters with larger absolute values carry more salient features and are more worth retaining for further computation. There are many ways of screening the mask adjustment parameters with larger absolute values, one of which is exemplarily given below.
  • The division unit 641 divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit 642 sorts the mask adjustment parameters in each interval according to their absolute values from large to small; and the adjustment unit 643 sets, to 1, the elements of the mask vector that correspond to the first n sorted mask adjustment parameters, and sets, to 0, the remaining elements that correspond to the m-n mask adjustment parameters with smaller absolute values. In other words, the first n mask adjustment parameters with larger absolute values are retained, and the m-n mask adjustment parameters with smaller absolute values are masked.
  • FIG. 9 is a schematic diagram of exemplary mask vector updating, which illustrates the above updating of the mask vector by way of example. The figure shows a parameter vector 901 with 64 parameters, b01 to b64. In this step, the updating module 64 updates the element values of the mask vector to retain the mask adjustment parameters with larger absolute values and mask those with smaller absolute values. The division unit 641 divides the updated mask adjustment parameters into a plurality of intervals in units of every 4 mask adjustment parameters (in other words, m is 4): as shown in the figure, b01 to b04 are in a first interval 902, b05 to b08 are in a second interval 903, and b61 to b64 are in a sixteenth interval 917. The sorting unit 642 sorts the mask adjustment parameters in each interval according to their absolute values from large to small. It is assumed that the absolute values of the parameters in the first interval 902 are sorted as b02>b01>b04>b03, those in the second interval 903 as b07>b08>b06>b05, and those in the sixteenth interval 917 as b64>b63>b61>b62. The adjustment unit 643 sets, in the mask vector, to 1, the elements at positions that correspond to the mask adjustment parameters with the first 2 (in other words, n is 2) larger absolute values in each interval, and sets, to 0, the elements at positions that correspond to the 2 (in other words, m-n=2) mask adjustment parameters with smaller absolute values in each interval. Taking the first interval 902 as an example, in the mask vector, the elements that correspond to b02 and b01 are set to 1, and the elements that correspond to b03 and b04 are set to 0. Each interval is adjusted in this way, and finally an updated mask vector 918 is obtained. The updated mask vector 918 retains the updated mask adjustment parameters with larger absolute values and masks those with smaller absolute values. To sum up, the updating module 64 takes every 4 mask adjustment parameters as one interval, and updates the element values of the mask vector in the way of 2 out of 4 in each interval.
  • In this embodiment, the mask adjustment parameters in each interval are completely sorted to identify the n parameters with larger absolute values and the m-n parameters with smaller absolute values, but the present disclosure does not necessarily need to perform a complete sort: it suffices to identify the n parameters with larger absolute values and the m-n parameters with smaller absolute values, and the order within each of the two groups is not necessary information. Taking the first interval 902 as an example, the present disclosure only needs to judge that b01 and b02 are the 2 parameters with larger absolute values and that b03 and b04 are the 2 parameters with smaller absolute values; the relative order of b01 and b02, and that of b03 and b04, is not critical, and they may be left unsorted to save computation resources.
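  • A sketch of this interval-wise updating for m=4 and n=2 follows; as just noted, a full sort is unnecessary, so np.argpartition is used here only to separate the n larger absolute values from the rest. The function name is illustrative.

```python
import numpy as np

def update_mask_vector(params, m=4, n=2):
    """Keep, per interval of m parameters, the n with larger absolute values."""
    groups = np.abs(params).reshape(-1, m)                # one row per interval
    top = np.argpartition(-groups, n - 1, axis=1)[:, :n]  # n largest, without full sorting
    mask = np.zeros_like(groups, dtype=np.int8)
    np.put_along_axis(mask, top, 1, axis=1)               # 1 = retained, 0 = masked
    return mask.reshape(-1)

params = np.array([0.1, -0.9, 0.3, 0.2, 0.5, -0.05, 0.7, 0.6])
print(update_mask_vector(params))  # [0 1 1 0 0 0 1 1], 2 out of 4 per interval
```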
  • If the mask tensor is multi-dimensional, the updating module 64 may perform product sum computation on the training data and each masked parameter tensor to obtain a parameter evaluation value. The purpose of obtaining the parameter evaluation value is to compute the amount of information retained after the masking by the mask tensor. If the parameter evaluation value is high, it is indicated that not too much information is lost due to the masking; the mask tensor reduces the amount of computation on the premise of retaining most of the information and is a high-quality mask tensor. On the contrary, if the parameter evaluation value is low, it is indicated that too much information is lost after the masking, and the mask tensor is not a high-quality mask tensor. An updating process of the multi-dimensional mask tensor is similar to the initialization process described above for the two-dimensional mask tensor; in other words, the mask tensor determination module 65 may be implemented as part of the updating module 64.
  • FIG. 10 illustrates an exemplary product sum computation process. Assuming that a training data matrix 1001 is one piece of the training data in the training set, the computation previously performed with the channel matrix 801 in FIG. 8 is now performed as a product sum computation with the masked parameter matrix 803, to measure the amount of information retained after the masking. There are various ways to perform such product sum computation. For example, the elements of the training data matrix 1001 are multiplied with the corresponding elements of the masked parameter matrix 803, and the absolute values of the products are then summed to obtain a parameter evaluation value S1:
  • S_1 = \sum_{i=1}^{4} \sum_{j=1}^{4} \left| x_{ij} \cdot a'_{ij} \right|, where x_{ij} denotes an element of the training data matrix 1001 and a'_{ij} denotes the corresponding element of the masked parameter matrix 803.
  • For another example, the absolute values of the elements of the training data matrix 1001 are multiplied with the absolute values of the corresponding elements of the masked parameter matrix 803, and the products are then summed to obtain a parameter evaluation value S2:
  • S_2 = \sum_{i=1}^{4} \sum_{j=1}^{4} \left| x_{ij} \right| \cdot \left| a'_{ij} \right|.
  • Both parameter evaluation values are computed from absolute values, and the parameter evaluation value S1 or S2 represents how much information is retained after the masking: the higher the parameter evaluation value, the more information is retained. In one application scene, either of the parameter evaluation values S1 and S2 may be selected, while in another application scene, both may be employed, which is not limited in this disclosure.
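  • A brief sketch of the two parameter evaluation values, with X standing in for the training data matrix 1001 and W_masked for the masked parameter matrix 803; note that for real-valued matrices the two coincide, since |x·w| = |x|·|w| element by element.

```python
import numpy as np

def evaluate_S1(X, W_masked):
    return np.abs(X * W_masked).sum()            # absolute values of the products, summed

def evaluate_S2(X, W_masked):
    return (np.abs(X) * np.abs(W_masked)).sum()  # products of the absolute values, summed
```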
  • The updating module 64 performs the masking with every mask tensor and obtains the corresponding parameter evaluation values. In the foregoing example, this means that all 90 of the 4×4 mask matrices are applied and 90 parameter evaluation values are obtained. The mask tensor with the largest parameter evaluation value is selected as the updated mask tensor, in other words, the parameter mask tensor. There are many ways of selecting the largest parameter evaluation value. For example, the sorting unit 642 may sort all the parameter evaluation values from large to small to obtain the largest one, or simply compare two values with a two-input comparator, keep the larger for comparison with the next parameter evaluation value, and so on until the largest parameter evaluation value remains after all 90 parameter evaluation values have been compared. If a plurality of mask tensors share the same largest parameter evaluation value, the updating module 64 may select one of them based on a specific rule or hardware characteristic, for example, the top-ranked one, the bottom-ranked one, the first one left, the last one left, or one chosen randomly.
  • The mask tensor having the largest parameter evaluation value is a mask tensor that retains the most amount of information, and in this embodiment, this mask tensor is taken as the parameter mask tensor.
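  • Combining the pieces above, a hedged sketch of selecting the parameter mask tensor: every candidate is applied, evaluated, and the one with the largest parameter evaluation value is kept, with ties broken by taking the first candidate, one of the rules mentioned above. `candidate_masks` and `evaluate_S1` refer to the earlier sketches.

```python
import numpy as np

def select_mask(params, X, masks):
    scores = [evaluate_S1(X, params * m) for m in masks]  # e.g. 90 evaluation values
    return masks[int(np.argmax(scores))]                  # argmax returns the first maximum
```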
  • In this embodiment, the updating module 64 will update the parameter mask tensor in each iteration or each epoch. If in the step 704, the mask adjustment parameters are updated after each training sample is trained, it is advantageous that the parameter mask tensor is updated every iteration; and if in the step 704, the mask adjustment parameters are updated every iteration, it is advantageous that the parameter mask tensor is updated at the end of each epoch.
  • Through the flow shown in FIG. 7, in this embodiment, in the mask adjustment stage, the parameters are trained and the mask matrix is updated at the same time. Generally speaking, the neural network training will perform a specific count of epochs, and the specific count may be 1, 5, 10 or others, which may be adjusted by those skilled in the art according to the specific training situation and is not limited in the present disclosure.
  • Another embodiment of the present disclosure provides, similarly based on the aforementioned hardware environment, a solution of performing sparse training on a neural network model. It is different from the previous embodiment in that a mask-free stage is entered before the mask adjustment stage. In the mask-free stage, the processing device 203 only trains the parameters, in other words, does not mask the parameters, and after the mask-free stage is finished and the mask adjustment stage is entered, the parameters are trained and the mask matrix is updated at the same time. Training flow of this embodiment is shown in FIG. 11.
  • In step 1101, the control module 62 first sets that a mask-free stage is entered. In the mask-free stage of this embodiment, the parameters are not masked and all of them participate in the training; at the beginning of the training, the random generation module 61 randomly generates the parameter values. For ease of identification, the parameters participating in the training in the mask-free stage are called mask-free parameters.
  • In step 1102, the computation module 63 computes a value of a loss function in forward propagation based on the mask-free parameters. In this step, the computation module 63 computes the loss function in the manner of the prior art: a training sample is input in forward propagation, the layers of the neural network gradually extract the input feature map into abstract features through their computation, and the loss function is computed by using the forward propagation result and a real value.
  • In step 1103, the computation module 63 computes partial derivatives of the loss function with respect to the mask-free parameters in back propagation. The computation module 63 computes the partial derivative of the loss function with respect to each mask-free parameter through the chain rule, for use in a gradient descent method.
  • In step 1104, the updating module 64 updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters. According to the effects of the mask-free parameters on the error, the updating module 64 multiplies the partial derivatives by a stride and adjusts the parameters accordingly, to update the mask-free parameters of the whole neural network. In this embodiment, the updating module 64 may also update the mask-free parameters based on the partial derivatives after each training sample or each iteration.
  • In this embodiment, the steps 1102, 1103 and 1104 may be repeated in a specific count of epochs, to update the mask-free parameters many times, and after a last updating, the updated mask-free parameters will be taken as initial values of the mask adjustment parameters in a next stage.
  • In step 1105, it is set that a mask adjustment stage is entered. The control module 62 sets that the mask adjustment stage is entered; in other words, the mask tensor begins to be used to mask part of the parameters. At the beginning of the mask adjustment stage, as described above, the initial values of the mask adjustment parameters are the mask-free parameters finally updated in the mask-free stage, and the mask tensor can be generated in two ways: one is to randomly generate the mask tensor by the random generation module 61, and the other is to determine the mask tensor based on the mask-free parameters finally updated in the mask-free stage, in the same manner as the step 705, which is not repeated.
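  • The two ways of producing the mask tensor at the start of the mask adjustment stage can be sketched as below, reusing `candidate_masks` and `select_mask` from the earlier sketches; the function name and signature are illustrative.

```python
import numpy as np

def init_mask(params, X, masks, randomly=False, rng=None):
    if randomly:                                 # way 1: randomly generated mask tensor
        rng = rng or np.random.default_rng()
        return masks[rng.integers(len(masks))]
    return select_mask(params, X, masks)         # way 2: determined from the parameters
```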
  • In step 1106, the mask adjustment parameters are masked in forward propagation based on the mask tensor to compute the value of the loss function. In step 1107, the partial derivatives of the loss function with respect to the mask adjustment parameters are computed in back propagation. In step 1108, the mask adjustment parameters are updated based on the partial derivatives. In step 1109, the mask tensor is updated based on the updated mask adjustment parameters. These steps are the same as the steps 702, 703, 704 and 705, respectively, and are not repeated.
  • The counts of epochs in the mask-free stage and in the mask adjustment stage are not limited in this embodiment, and can be arranged by those skilled in the art according to a specific situation, and the counts of epochs in the mask-free stage and in the mask adjustment stage are not necessarily the same.
  • Another embodiment of the present disclosure provides, similarly based on the aforementioned hardware environment, a solution of performing sparse training on a neural network model. It is different from the above embodiment in that the training is divided into three stages: a mask-free stage, a mask adjustment stage and a mask fixing stage. In the mask-free stage, the processing device 203 only trains the parameters and does not mask the parameters; in the mask adjustment stage, the processing device 203 takes the updated mask-free parameters as initial values, trains the parameters and the mask tensor at the same time; and in the mask fixing stage, the processing device 203 takes the updated mask adjustment parameters and the updated mask tensor in the mask adjustment stage as initial values, and continues to train the parameters without changing or updating the mask tensor.
  • The flow executed in the mask-free stage and the mask adjustment stage in this embodiment is shown in FIG. 11 and is therefore not repeated. After the mask fixing stage is entered, the flow is shown in FIG. 12.
  • In step 1201, the control module 62 sets that a mask fixing stage is entered. In the mask fixing stage, the control module 62 takes the mask adjustment parameters updated in the mask adjustment stage as initial values of the parameters in this stage (hereinafter referred to as the mask fixing parameters). Since the mask tensor has been updated in the mask adjustment stage, it is no longer updated in this stage; the mask fixing parameters are masked based on the mask tensor finally updated in the mask adjustment stage and are continually trained.
  • In this embodiment, the following steps are repeated in at least one epoch.
  • In step 1202, the computation module 63 masks the mask fixing parameters in forward propagation based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function. This step is similar to the step 702 and is not repeated.
  • In step 1203, the computation module 63 computes the partial derivatives of the loss function with respect to the mask fixing parameters in backward propagation. This step is similar to the step 703 and is not repeated.
  • In step 1204, the updating module 64 updates the mask fixing parameters based on the partial derivatives. This step is similar to the step 704 and is not repeated.
  • In this embodiment, the training is divided into three stages. In the mask-free stage, no mask tensor masks the parameters, and only the parameters are trained to speed up the convergence of the parameters. In the mask adjustment stage, the initial values of the parameters are no longer randomly generated, but are trained mask-free parameters, which is helpful to quickly obtain an ideal mask tensor. After the mask tensor has been updated, the mask fixing stage is entered, parameters are continually trained by using the updated mask tensor, and finally, the trained parameters will better match the mask tensor.
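  • At a high level, the three stages can be outlined as below; `train_epoch_unmasked` and `train_epoch_masked` are hypothetical helpers standing in for the per-epoch flows of FIG. 11 and FIG. 12, while `candidate_masks`, `init_mask` and `select_mask` refer to the earlier sketches.

```python
def three_stage_training(params, X, data, free_epochs, adjust_epochs, fixed_epochs):
    for _ in range(free_epochs):                    # mask-free stage: parameters only
        params = train_epoch_unmasked(params, data)       # hypothetical helper
    mask = init_mask(params, X, candidate_masks())  # trained parameters seed the mask
    for _ in range(adjust_epochs):                  # mask adjustment stage: both trained
        params = train_epoch_masked(params, mask, data)   # hypothetical helper
        mask = select_mask(params, X, candidate_masks())  # step 705: refresh the mask
    for _ in range(fixed_epochs):                   # mask fixing stage: mask frozen
        params = train_epoch_masked(params, mask, data)
    return params, mask
```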
  • In view of the above, those skilled in the art will understand that there may be several implementations of the present disclosure as shown in FIG. 13 when performing sparse training on the neural network model.
  • The implementation 1301 only has a mask adjustment stage, in which both an initial value W0 of the parameter and an initial value M0 of the mask tensor are randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is determined based on the initial value W0 of the parameter. The parameters are trained and the mask matrix is updated at the same time, to obtain a trained parameter Wf and an updated mask tensor Mf.
  • The implementation 1302 only has a mask-free stage and a mask adjustment stage. In the mask-free stage, only the parameters are trained: an initial value W0 of the parameter is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training. In the mask adjustment stage, the parameters are trained and the mask matrix is updated at the same time; the initial value of the parameter in this stage is the updated parameter W1, and an initial value M0 of the mask tensor is randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is obtained by using the updated parameter W1. Finally a trained parameter Wf and an updated mask tensor Mf are obtained.
  • The implementation 1303 only has a mask adjustment stage and a mask fixing stage. In the mask adjustment stage, both an initial value W0 of the parameter and an initial value M0 of the mask tensor are randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is determined based on the initial value W0 of the parameter. The parameters are trained and the mask matrix is updated at the same time, to obtain an updated parameter W1 and an updated mask tensor Mf. In the mask fixing stage, the parameters are masked by the updated mask tensor Mf and are continually trained; the initial value of the parameter in this stage is the updated parameter W1, and finally a trained parameter Wf is obtained.
  • The implementation 1304 has a mask-free stage, a mask adjustment stage, and a mask fixing stage. In the mask-free stage, only the parameters are trained: an initial value W0 of the parameter is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training. In the mask adjustment stage, the parameters are trained and the mask matrix is updated at the same time; the initial value of the parameter in this stage is the updated parameter W1, and an initial value M0 of the mask tensor is randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is obtained by using the updated parameter W1. Finally an updated parameter W2 and an updated mask tensor Mf are obtained. In the mask fixing stage, the parameters are masked by the updated mask tensor Mf and are continually trained; the initial value of the parameter in this stage is the updated parameter W2, and finally a trained parameter Wf is obtained.
  • The implementation 1305 has, in addition to a mask-free stage, a mask adjustment stage, and a mask fixing stage, other training stages (indicated by dashed lines) between the mask-free stage and the mask adjustment stage, and between the mask adjustment stage and the mask fixing stage. In the mask-free stage, only the parameters are trained: an initial value W0 of the parameter is randomly generated by the random generation module 61, and an updated parameter W1 is obtained after the training. Then, the mask-free stage may be followed by any training stage disclosed or not disclosed in this disclosure, in which the parameters are trained or the mask matrix is updated. Assuming that this stage is a mask fixing stage, the initial value of the parameter in this stage is the updated parameter W1, and an initial value M0 of the mask tensor is randomly generated by the random generation module 61, or the initial value M0 of the mask tensor is obtained by using the updated parameter W1, to obtain an updated parameter W2.
  • Then the mask adjustment stage is entered, in which the parameters are trained and the mask matrix is updated at the same time; the initial value of the parameter in this stage is the updated parameter W2, and the initial value of the mask tensor is still the mask tensor M0, to obtain an updated parameter W3 and an updated mask tensor M1. This stage is then followed by any stage disclosed or not disclosed in the present disclosure, in which the parameters are trained or the mask matrix is updated. Assuming that this stage is a parameter fixing stage, in other words, the parameters are fixed and not trained and only the mask tensor is trained, the initial value of the parameter in this stage is the updated parameter W3, and the initial value of the mask tensor is the updated mask tensor M1, to obtain an updated mask tensor Mf.
  • Finally, in the mask fixing stage, the parameters are masked by the updated mask tensor Mf and are continually trained; the initial value of the parameter in this stage is the updated parameter W3, and finally a trained parameter Wf is obtained.
  • The various implementations shown in FIG. 13 are only examples, and after referring to the present disclosure, those skilled in the art can, without making creative efforts, expand other implementations, which shall fall within the scope of the present disclosure.
  • The count of epochs in each stage of various implementations is not limited in the present disclosure, and can be arranged by those skilled in the art according to a specific situation, and the count of epochs in each stage is not necessarily the same.
  • The aforementioned embodiments do not necessarily require that all of the preset specific count of epochs be completed. The control module 62 may further judge whether the percentage of element values of the parameter mask tensor that remain unchanged over 2 consecutive epochs reaches a threshold. If the threshold is reached, it is indicated that the training result has basically converged and more training brings limited improvement in accuracy, and therefore the mask adjustment stage is ended to complete the training. Such a threshold is typically set above 70%; in other words, if the percentage of unchanged element values of the parameter mask tensor is above 70%, the training will be stopped. The threshold is not limited in the present disclosure, and may be 80%, 90%, 100%, or any other percentage.
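  • The convergence test described above can be sketched as follows: compare the mask tensors of two consecutive epochs and end the mask adjustment stage once the fraction of unchanged element values reaches the threshold. The default of 0.8 is one of the example thresholds given above.

```python
import numpy as np

def mask_converged(prev_mask, mask, threshold=0.8):
    unchanged = np.mean(prev_mask == mask)   # percentage of unchanged element values
    return unchanged >= threshold            # True: end the mask adjustment stage
```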
  • Another embodiment of the present disclosure is a computer-readable storage medium that has stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processor, performs the methods of the embodiments described above. In some implementation scenes, the above integrated units may be implemented in the form of a software program module. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. Based on this, when the solutions of the present disclosure are embodied in the form of a software product (for example, a computer-readable storage medium), the software product may be stored in a memory, and it may include several instructions to cause a computer device (for example, a personal computer, a server, a network device, or the like) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The above memory may include, but is not limited to, various media that may have program code stored thereon, for example, a USB flash drive, a flash disk, a read only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or the like.
  • In the above embodiments, after the training is completed, when the computation device 201 performs inference, the trained parameters are shielded by using the updated parameter mask tensor, to control the processing area of the feature map input to the neural network model, so that on one hand, the expected accuracy can be reached, and on the other hand, the amount of computation can be reduced in the process of the inference to achieve sparsification.
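  • A minimal sketch of this inference-time shielding on a toy linear layer (the layer itself is an invention of this sketch): the trained parameters are masked once by the parameter mask tensor, and the zeroed weights then contribute nothing to subsequent computation, which sparse kernels can exploit to skip work.

```python
import numpy as np

def infer(trained_params, parameter_mask, feature_map):
    shielded = trained_params * parameter_mask   # shield the trained parameters once
    return shielded @ feature_map                # masked positions can be skipped by sparse kernels
```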
  • According to different application scenes, an electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, an earphone, a mobile memory, a wearable device, a visual terminal, an automatic driving terminal, transportation, a household appliance, and/or medical device. The transportation comprises an airplane, a ship and/or a vehicle; the household appliance comprises a television set, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; and the medical device comprises a nuclear magnetic resonance instrument, a B-ultrasonic scanner and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to fields such as Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunication, finance, retail, construction site, medical care, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in application scenes related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, an electronic device or apparatus with high computational capability according to the present disclosure may be applied to a cloud device (for example, cloud server), and an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a camera). In one or more embodiments, as hardware information of a cloud device and hardware information of a terminal device and/or an edge device are compatible with each other, appropriate hardware resources can be matched from hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate hardware resources of the terminal device and/or the edge device, so that unified management, scheduling and cooperative work of terminal-cloud integration or cloud-edge-terminal integration are achieved.
  • It should be noted that for the sake of brevity, this disclosure presents some methods and their embodiments as a series of actions and their combinations, but those skilled in the art will appreciate that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art will appreciate, in light of the disclosure or teachings of the present disclosure, that certain steps therein may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as alternative embodiments, in other words, the actions or modules involved therein are not necessarily required for the implementation of a certain solution or solutions of the present disclosure. In addition, according to different solutions, the description of some embodiments in the present disclosure may focus on different emphases. In view of this, those skilled in the art will appreciate that for portions that are not described in detail in one embodiment of the present disclosure, reference may also be made to the related description of other embodiments.
  • In a specific implementation, based on the disclosure and teachings of the present disclosure, those skilled in the art will appreciate that several embodiments disclosed in the present disclosure may also be implemented in other ways not disclosed herein. For example, as for units in the foregoing embodiments of the electronic device or apparatus, the units are split based on logic functions considered herein, and they may be split in other ways in a practical implementation. For another example, a plurality of units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. In terms of a connection relation between different units or components, the connection discussed above in conjunction with the accompanying drawings may be direct or indirect coupling between the units or components. In some scenes, the foregoing direct or indirect coupling involves a communication connection that uses an interface, wherein the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
  • In the present disclosure, a unit described as a separate part may or may not be physically separate, and a component shown as a unit may or may not be a physical unit. The aforementioned components or units may be in the same position or distributed over a plurality of network units. In addition, according to actual needs, some or all of the units can be selected to achieve the objectives of the solutions described in the embodiments of the present disclosure. In addition, in some scenes, a plurality of units in embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically separately.
  • In other implementation scenes, the above integrated unit may also be implemented in a form of hardware, in other words, a specific hardware circuit, which may include a digital circuit, and/or an analog circuit, and the like. A physical implementation of a hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, various devices described herein (for example, computation devices or other processing devices) may be implemented by a suitable hardware processor, such as a central processing unit, GPU, FPGA, DSP, ASIC, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium, a magneto-optical storage medium, or the like), and it may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, and the like.
  • The above content may be better understood in light of the following Clauses:
  • Clause 1. A method of performing sparse training on a neural network model, comprising:
      • in a mask adjustment stage, repeating the following steps in a plurality of epochs:
      • masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function;
      • computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters;
      • updating the mask adjustment parameters based on the partial derivatives; and
      • updating the mask tensor based on the updated mask adjustment parameters; and
      • in a mask fixing stage, taking the updated mask adjustment parameters in the mask adjustment stage as initial values of mask fixing parameters, repeating the following steps in a plurality of epochs:
      • masking, in forward propagation, the mask fixing parameters based on the updated mask tensor to compute a value of the loss function;
      • computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; and
      • updating the mask fixing parameters based on the partial derivatives,
      • wherein the updated mask fixing parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • Clause 2. The method of Clause 1, further comprising:
      • in a mask-free stage, repeating the following steps in a plurality of epochs:
      • computing, in forward propagation, the value of the loss function based on mask-free parameters;
      • computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and
      • updating the mask-free parameters based on the partial derivatives,
      • wherein the updated mask-free parameters are taken as initial values of the mask adjustment parameters.
  • Clause 3. The method of Clause 2, further comprising:
      • randomly generating initial values of the mask tensor and the mask-free parameters.
  • Clause 4. The method of Clause 1, further comprising:
      • determining the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
  • Clause 5. The method of Clause 4, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
      • selecting, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, where m>n; and
      • generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
  • Clause 6. The method of Clause 5, wherein the specified dimension is an input channel dimension.
  • Clause 7. The method of Clause 4, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor includes:
      • presetting a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1, and m-n elements are 0, where m>n;
      • masking two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
      • performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
      • selecting a two-dimensional mask tensor that yields the largest of all parameter evaluation values as the initial value of the mask tensor.
  • Clause 8. The method of Clause 7, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 9. The method of Clause 1, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
  • Clause 10. The method of Clause 1, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
      • after a specific count of epochs have been performed, dividing the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m;
      • sorting the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small;
      • setting, in the mask tensor, elements at positions that correspond to first n mask adjustment parameters with larger absolute values in each interval, to 1; and
      • setting, in the mask tensor, elements at positions that correspond to m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
  • Clause 11. The method of Clause 10, wherein the mask adjustment stage further comprises:
      • judging whether a percentage of all unchanged element values of the mask tensor reaches a threshold in a plurality of consecutive epochs; and
      • if the threshold is reached, ending the mask adjustment stage.
  • Clause 12. The method of Clause 11, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 13. The method of Clauses 5 to 8 or 10, wherein m is 4 and n is 2.
  • Clause 14. The method of Clause 10, wherein the specific count is 1.
  • Clause 15. A computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of any of Clauses 1 to 12.
  • Clause 16. An integrated circuit device for performing sparse training on a neural network model, comprising:
      • a processing device, comprising a control module, a computation module, and an updating module,
      • wherein when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives and update the mask tensor based on the updated mask adjustment parameters,
      • wherein when the control module sets that a mask fixing stage is entered, the updating module takes the updated mask adjustment parameters as initial values of mask fixing parameters, and the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, the mask fixing parameters based on the updated mask tensor in the mask adjustment stage to compute the value of the loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask fixing parameters; the updating module updates the mask fixing parameters based on the partial derivatives; and
      • a computation device configured to shield the updated mask fixing parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • Clause 17. The integrated circuit device of Clause 16, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
  • Clause 18. The integrated circuit device of Clause 17, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
  • Clause 19. The integrated circuit device of Clause 16, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
  • Clause 20. The integrated circuit device of Clause 19, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
      • select, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, where m>n; and
      • generate the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
  • Clause 21. The integrated circuit device of Clause 20, wherein the specified dimension is an input channel dimension.
  • Clause 22. The integrated circuit device of Clause 19, wherein when the mask tensor is a two-dimensional tensor, the mask tensor determination module is configured to:
      • preset a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
      • mask two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
      • perform product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
      • select a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
  • Clause 23. The integrated circuit device of Clause 22, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 24. The integrated circuit device of Clause 16, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
  • Clause 25. The integrated circuit device of Clause 16, wherein when the mask tensor is a one-dimensional tensor, the updating module includes a division unit, a sorting unit, and an adjustment unit, and in the mask adjustment stage, after a specific count of epochs have been performed, the division unit divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit sorts the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small; the adjustment unit sets, in the mask tensor, elements at positions that correspond to first n mask adjustment parameters with larger absolute values in each interval, to 1, and sets, in the mask tensor, elements at positions that correspond to m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
  • Clause 26. The integrated circuit device of Clause 25, wherein in the mask adjustment stage, the control module judges whether a percentage of all unchanged element values of the mask tensor reaches a threshold in 2 consecutive epochs; and if the threshold is reached, the mask adjustment stage is ended.
  • Clause 27. The integrated circuit device of Clause 26, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 28. The integrated circuit device of Clauses 20 to 23 or 25, wherein m is 4 and n is 2.
  • Clause 29. The integrated circuit device of Clause 25, wherein the specific count is 1.
  • Clause 30. A board card, comprising the integrated circuit device of any of Clauses 16 to 29.
  • Clause 31. A method of performing sparse training on a neural network model, comprising:
      • in a mask adjustment stage, repeating the following steps in a plurality of epochs:
      • masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function;
      • computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters;
      • updating the mask adjustment parameters based on the partial derivatives; and
      • updating the mask tensor based on the updated mask adjustment parameters,
      • wherein the updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • Clause 32. The method of Clause 31, further comprising:
      • in a mask-free stage, repeating the following steps in a plurality of epochs:
      • computing, in forward propagation, the value of the loss function based on mask-free parameters;
      • computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and
      • updating the mask-free parameters based on the partial derivatives;
      • wherein the updated mask-free parameters are taken as initial values of the mask adjustment parameters.
  • Clause 33. The method of Clause 32, further comprising:
      • randomly generating initial values of the mask tensor and the mask-free parameters.
  • Clause 34. The method of Clause 31, further comprising:
      • determining the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
  • Clause 35. The method of Clause 34, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
      • selecting, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
      • generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
  • Clause 36. The method of Clause 35, wherein the specified dimension is an input channel dimension.
  • Clause 37. The method of Clause 34, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor comprises:
      • presetting a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
      • masking two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
      • performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
      • selecting a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
  • Clause 38. The method of Clause 37, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 39. The method of Clause 31, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
  • Clause 40. The method of Clause 31, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
      • after a specific count of epochs have been performed, dividing the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m;
      • sorting the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small;
      • setting, in the mask tensor, elements at positions that correspond to first n mask adjustment parameters with larger absolute values in each interval, to 1; and
      • setting, in the mask tensor, elements at positions that correspond to m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
  • Clause 41. The method of Clause 40, wherein the mask adjustment stage further includes:
      • judging whether a percentage of all unchanged element values of the mask tensor reaches a threshold in 2 consecutive epochs; and
      • if the threshold is reached, ending the mask adjustment stage.
  • Clause 42. The method of Clause 41, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 43. The method of Clauses 35 to 38 or 40, wherein m is 4 and n is 2.
  • Clause 44. The method of Clause 40, wherein the specific count is 1.
  • Clause 45. A computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of any of Clauses 31 to 42.
  • Clause 46. An integrated circuit device for performing sparse training on a neural network model, comprising:
      • a processing device, comprising a control module, a computation module, and an updating module,
      • wherein when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives and update the mask tensor based on the updated mask adjustment parameters; and
      • a computation device configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
  • Clause 47. The integrated circuit device of Clause 46, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
  • Clause 48. The integrated circuit device of Clause 47, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
  • Clause 49. The integrated circuit device of Clause 46, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
  • Clause 50. The integrated circuit device of Clause 49, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
      • select, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
      • generate the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
  • Clause 51. The integrated circuit device of Clause 50, wherein the specified dimension is an input channel dimension.
  • Clause 52. The integrated circuit device of Clause 49, wherein when the mask tensor is a two-dimensional tensor, the mask tensor determination module is configured to:
      • preset a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
      • mask two specified dimensions of the initial values of the mask adjustment parameters of the neural network layer respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
      • perform product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
      • select a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
  • Clause 53. The integrated circuit device of Clause 52, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
  • Clause 54. The integrated circuit device of Clause 46, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
  • Clause 55. The integrated circuit device of Clause 46, wherein when the mask tensor is a one-dimensional tensor, the updating module includes a division unit, a sorting unit, and an adjustment unit, and in the mask adjustment stage, after a specific count of epochs have been performed, the division unit divides the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m; the sorting unit sorts the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small; the adjustment unit sets, in the mask tensor, elements at positions that correspond to first n mask adjustment parameters with larger absolute values in each interval, to 1, and sets, in the mask tensor, elements at positions that correspond to m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
  • Clause 56. The integrated circuit device of Clause 55, wherein in the mask adjustment stage, the control module judges whether a percentage of all unchanged element values of the mask tensor reaches a threshold in 2 consecutive epochs; and if the threshold is reached, the mask adjustment stage is ended.
  • Clause 57. The integrated circuit device of Clause 56, wherein the threshold is one of 80%, 90%, and 100%.
  • Clause 58. The integrated circuit device of Clauses 50 to 53 or 55, wherein m is 4 and n is 2.
  • Clause 59. The integrated circuit device of Clause 55, wherein the specific count is 1.
  • Clause 60. A board card, comprising the integrated circuit device of any of Clauses 46 to 59.
  • The embodiments of the present disclosure have been described above in detail, specific examples have been applied herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is only used to help understand the methods and core ideas of the present disclosure; meanwhile, for those of ordinary skill in the art, according to the ideas of the present disclosure, variations will be made in specific implementations and application scopes, and in summary, the contents of this specification should not be construed as restrictions on the present disclosure.

Claims (23)

1.-30. (canceled)
31. A method of performing sparse training on a neural network model, comprising:
in a mask adjustment stage, repeating following steps in a plurality of epochs:
masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function;
computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters;
updating the mask adjustment parameters based on the partial derivatives; and
updating the mask tensor based on the updated mask adjustment parameters,
wherein the updated mask adjustment parameters are shielded by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
32. The method of claim 31, further comprising:
in a mask-free stage, repeating following steps in a plurality of epochs:
computing, in forward propagation, the value of the loss function based on mask-free parameters;
computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and
updating the mask-free parameters based on the partial derivatives;
wherein the updated mask-free parameters are taken as initial values of the mask adjustment parameters.
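Claim 32's mask-free stage is ordinary dense training whose final weights seed the mask adjustment stage. A minimal sketch under the same illustrative setup as above (all names and shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 8))
Y = rng.standard_normal((32, 4))
W_free = rng.standard_normal((8, 4))   # randomly generated mask-free parameters
lr = 1e-2

for epoch in range(10):
    pred = X @ W_free                      # forward propagation, no mask
    grad = X.T @ (pred - Y) / X.shape[0]   # partial derivatives w.r.t. W_free
    W_free -= lr * grad                    # update the mask-free parameters

W = W_free.copy()   # updated mask-free parameters become the initial
                    # values of the mask adjustment parameters
```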
33. The method of claim 32, further comprising:
randomly generating initial values of the mask tensor and the mask-free parameters.
34. The method of claim 31, further comprising:
determining the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
35. The method of claim 34, wherein when the mask tensor is a one-dimensional tensor, the determining the initial value of the mask tensor includes:
selecting, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
generating the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
36. The method of claim 35, wherein the specified dimension is an input channel dimension.
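A minimal NumPy sketch of claims 35 and 36, assuming the mask adjustment parameters form a two-dimensional weight matrix whose axis 0 is the input channel dimension; the function name and layout are assumptions:

```python
import numpy as np

def init_mask_1d(w: np.ndarray, m: int = 4, n: int = 2) -> np.ndarray:
    """In every group of m consecutive input channels, mark the n
    elements with larger absolute values as valid (1) and the
    remaining m - n elements as invalid (0)."""
    c_in = w.shape[0]
    assert c_in % m == 0, "input channel count must be divisible by m"
    groups = np.abs(w).reshape(c_in // m, m, *w.shape[1:])
    order = np.argsort(-groups, axis=1)        # rank |w| within each group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, order[:, :n], 1.0, axis=1)
    return mask.reshape(w.shape)
```

With m = 4 and n = 2 (the values recited in Clause 58 above), this yields the familiar 2:4 structured-sparsity pattern along the input channel dimension.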
37. The method of claim 34, wherein when the mask tensor is a two-dimensional tensor, the determining the initial value of the mask tensor includes:
presetting a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
masking two specified dimensions of the initial values of the mask adjustment parameters of a neural network layer of the neural network model respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
performing product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
selecting a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
38. The method of claim 37, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
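Claims 37 and 38 select, among preset two-dimensional candidate masks, the one whose masked parameters best preserve the layer's product-sum output on training data. The sketch below is one possible reading: it assumes m-by-m candidates with exactly n ones per row and per column, takes c_in = c_out = m for brevity, and scores each candidate by the summed absolute product-sum output; the enumeration strategy, the scoring rule, and all identifiers are assumptions.

```python
import numpy as np
from itertools import combinations, product

def candidate_masks_2d(m: int = 4, n: int = 2):
    """Yield m-by-m binary masks in which every row and every column
    contains exactly n ones."""
    rows = []
    for cols in combinations(range(m), n):
        row = np.zeros(m)
        row[list(cols)] = 1.0
        rows.append(row)
    for choice in product(rows, repeat=m):
        cand = np.stack(choice)
        if np.all(cand.sum(axis=0) == n):   # enforce the column constraint
            yield cand

def select_mask_2d(w: np.ndarray, x: np.ndarray,
                   m: int = 4, n: int = 2) -> np.ndarray:
    """Mask the (input, output) dimensions of w with each candidate,
    run the product-sum over the training data x, and keep the
    candidate with the largest parameter evaluation value."""
    best_mask, best_score = None, -np.inf
    for cand in candidate_masks_2d(m, n):
        score = np.abs(x @ (w * cand)).sum()
        if score > best_score:
            best_mask, best_score = cand, score
    return best_mask

rng = np.random.default_rng(2)
w0 = rng.standard_normal((4, 4))    # initial mask adjustment parameters
x = rng.standard_normal((32, 4))    # training data of the layer
mask_2d = select_mask_2d(w0, x)     # initial value of the 2-D mask tensor
```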
39. The method of claim 31, wherein in the mask adjustment stage, the mask adjustment parameters are updated based on the partial derivatives in each iteration.
40. The method of claim 31, wherein in the mask adjustment stage, when the mask tensor is a one-dimensional tensor, the updating the mask tensor includes:
after a specific count of epochs have been performed, dividing the updated mask adjustment parameters into a plurality of intervals in units of a specific parameter count m;
sorting the mask adjustment parameters in each interval according to absolute values of the mask adjustment parameters from large to small;
setting, in the mask tensor, elements at positions that correspond to first n mask adjustment parameters with larger absolute values in each interval, to 1; and
setting, in the mask tensor, elements at positions that correspond to m-n mask adjustment parameters with smaller absolute values in each interval, to 0.
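As a standalone sketch of claim 40's update rule (the interval layout over a flattened parameter tensor is an assumption; the claim fixes only the interval size m):

```python
import numpy as np

def update_mask_1d(w: np.ndarray, m: int = 4, n: int = 2) -> np.ndarray:
    """Divide the updated mask adjustment parameters into intervals
    of m, sort each interval by absolute value from large to small,
    set the mask to 1 at the positions of the first n parameters and
    to 0 at the remaining m - n positions."""
    intervals = w.reshape(-1, m)                    # intervals of m parameters
    order = np.argsort(-np.abs(intervals), axis=1)  # descending by |value|
    mask = np.zeros_like(intervals)
    np.put_along_axis(mask, order[:, :n], 1.0, axis=1)
    return mask.reshape(w.shape)
```

m = 4 and n = 2 are the example values recited in Clause 58; any m > n fits the claim.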
41.-44. (canceled)
45. A non-transitory computer-readable storage medium, having stored thereon computer program code for performing sparse training on a neural network model, which, when executed by a processing device, performs the method of claim 31.
46. An integrated circuit device for performing sparse training on a neural network model, comprising:
a processing device, comprising a control module, a computation module, and an updating module,
wherein when the control module sets that a mask adjustment stage is entered, the computation module repeats the following operations in a plurality of epochs: masking, in forward propagation, mask adjustment parameters based on a mask tensor to compute a value of a loss function; and computing, in back propagation, partial derivatives of the loss function with respect to the mask adjustment parameters; and the updating module is configured to update the mask adjustment parameters based on the partial derivatives and update the mask tensor based on the updated mask adjustment parameters; and
a computation device configured to shield the updated mask adjustment parameters by using the updated mask tensor to control a processing area of a feature map input to the neural network model.
47. The integrated circuit device of claim 46, wherein when the control module sets that a mask-free stage is entered, the computation module repeats the following operations in a plurality of epochs: computing, in forward propagation, the value of the loss function based on mask-free parameters; and computing, in back propagation, partial derivatives of the loss function with respect to the mask-free parameters; and the updating module updates the mask-free parameters based on the partial derivatives, and takes the updated mask-free parameters as initial values of the mask adjustment parameters.
48. The integrated circuit device of claim 47, wherein the processing device further includes a random generation module configured to randomly generate initial values of the mask tensor and the mask-free parameters.
49. The integrated circuit device of claim 46, wherein the processing device further includes a mask tensor determination module configured to determine the initial value of the mask tensor based on the initial values of the mask adjustment parameters.
50. The integrated circuit device of claim 49, wherein when the mask tensor is a one-dimensional tensor, the mask tensor determination module is configured to:
select, from every m data elements of a specified dimension of the initial values of the mask adjustment parameters, n data elements with larger absolute values as valid data elements, wherein m>n; and
generate the initial value of the mask tensor based on positions of the n valid data elements in the m data elements.
51. The integrated circuit device of claim 50, wherein the specified dimension is an input channel dimension.
52. The integrated circuit device of claim 49, wherein when the mask tensor is a two-dimensional tensor, the mask tensor determination module is configured to:
preset a specific count of two-dimensional mask tensors, wherein each dimension of the two-dimensional mask tensors includes m elements, of which n elements are 1 and m-n elements are 0, where m>n;
mask two specified dimensions of the initial values of the mask adjustment parameters of a neural network layer of the neural network model respectively based on each preset two-dimensional mask tensor to obtain masked parameter tensors;
perform product sum computation on training data of the neural network layer based on each masked parameter tensor to obtain parameter evaluation values; and
select a two-dimensional mask tensor that yields the largest of all the parameter evaluation values as the initial value of the mask tensor.
53. The integrated circuit device of claim 52, wherein the two specified dimensions are an input channel dimension and an output channel dimension.
54. The integrated circuit device of claim 46, wherein in the mask adjustment stage, the updating module updates the mask adjustment parameters based on the partial derivatives in each iteration.
55.-60. (canceled)
US17/557,802 2020-11-04 2022-02-03 Neural network sparsification device and method, and related product Pending US20220230069A1 (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
CN202011216903 2020-11-04
CN202011216903.5 2020-11-04
CN202011566141.1A CN114444681A (en) 2020-11-04 2020-12-25 Neural network sparsing device, method and corresponding product
CN202011566141.1 2020-12-25
PCT/CN2021/123881 WO2022095676A1 (en) 2020-11-04 2021-10-14 Neural network sparsification device and method, and corresponding product

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123881 Continuation WO2022095676A1 (en) 2020-11-04 2021-10-14 Neural network sparsification device and method, and corresponding product

Publications (1)

Publication Number Publication Date
US20220230069A1 true US20220230069A1 (en) 2022-07-21

Family

ID=81362120

Family Applications (2)

Application Number Title Priority Date Filing Date
US18/003,821 Pending US20230259780A1 (en) 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product
US17/557,802 Pending US20220230069A1 (en) 2020-11-04 2022-02-03 Neural network sparsification device and method, and related product

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US18/003,821 Pending US20230259780A1 (en) 2020-11-04 2021-10-14 Neural network sparsification apparatus and method and related product

Country Status (3)

Country Link
US (2) US20230259780A1 (en)
CN (2) CN114444680A (en)
WO (2) WO2022095675A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170917B (en) * 2022-06-20 2023-11-07 美的集团(上海)有限公司 Image processing method, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779786B1 (en) * 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration
CN107886164A (en) * 2017-12-20 2018-04-06 东软集团股份有限公司 A kind of convolutional neural networks training, method of testing and training, test device
EP3877907A4 (en) * 2018-11-06 2023-11-01 Emory University Systems and methods for training an autoencoder neural network using sparse data
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training

Also Published As

Publication number Publication date
CN114444681A (en) 2022-05-06
WO2022095676A1 (en) 2022-05-12
CN114444680A (en) 2022-05-06
US20230259780A1 (en) 2023-08-17
WO2022095675A1 (en) 2022-05-12

Similar Documents

Publication Publication Date Title
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN112463159B (en) Compiling method, compiling device, electronic equipment and storage medium
US20220230069A1 (en) Neural network sparsification device and method, and related product
CN112463160A (en) Compiling method, compiling device, electronic equipment and storage medium
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
CN111199276B (en) Data processing method and related product
CN113469326A (en) Integrated circuit device and board card for executing pruning optimization in neural network model
CN116185942A (en) Data processing method, device, storage medium and electronic equipment
CN114692847B (en) Data processing circuit, data processing method and related products
CN114444678A (en) Apparatus, method, and storage medium for thinning neural network layer
CN116980277B (en) Data processing method, device, computer equipment and storage medium
CN113792867B (en) Arithmetic circuit, chip and board card
WO2023236929A1 (en) Method and device for reading target data in data based on instruction
US20230305840A1 (en) Computing apparatus, integrated circuit chip, board card, device and computing method
CN114444677A (en) Device, board card and method for sparse training and readable storage medium
CN114691083A (en) Matrix multiplication circuit, method and related product
CN113469328A (en) Device, board card, method and readable storage medium for executing revolution crossing
CN116484926A (en) Self-adaptive splitting optimization equipment and method
CN114764608A (en) Data processing device and method for executing neural network model and related products
CN117235424A (en) Computing device, computing method and related product
CN113469327A (en) Integrated circuit device for executing advance of revolution
CN114692820A (en) Data processing device and method for executing neural network model and related products
CN116483255A (en) Apparatus and method for accelerating data movement
CN116090519A (en) Compiling method of convolution operator and related product
CN113850376A (en) Computing device and method for executing neural network model and related products

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: ANHUI CAMBRICON INFORMATION TECHNOLOGY CO., LTD., CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ZHU, SHIBING;GAO, YUFENG;REEL/FRAME:060085/0334

Effective date: 20220510