WO2023022255A1

WO2023022255A1 - Method and system for lightening deep learning model using pruning

Info

Publication number: WO2023022255A1
Application number: PCT/KR2021/011008
Authority: WO
Inventors: 김태호; 강신한; 신은섭
Original assignee: 주식회사 노타
Priority date: 2021-08-17
Filing date: 2021-08-19
Publication date: 2023-02-23
Also published as: KR102597183B1; KR20230026014A

Abstract

Disclosed are a method and a system for lightening a deep learning model using pruning. A method for lightening a deep learning model according to one embodiment may comprise the steps of: determining a significance level for channels for each layer of a deep learning model to be compressed; assigning an initial value for each channel according to the significance level of the channel; and initiating compression of the deep learning model at a target compression rate on the basis of the assigned initial value.

Description

Method and system for lightening a deep learning model using pruning

The following description relates to a method and system for lightening a deep learning model using pruning.

Although deep learning models (or artificial intelligence models) show excellent performance, they are difficult to apply in practice because of their large size and computational complexity. A representative way to deploy a deep learning model in an actual application is model lightweighting. Lightweighting a deep learning model refers to functions, modules, and/or features that turn a given deep learning model into a smaller one. Here, 'smaller' may mean reducing the number of weights and biases constituting the model, reducing its storage size, or speeding up inference. It is very important that lightweighting does not degrade the model's performance.

There are various types of lightweighting techniques. The major categories include network pruning, quantization, knowledge distillation, neural architecture search (NAS), and filter decomposition, and each category contains many different techniques. Among these, pruning is a representative lightweighting technique.

There are two major pruning techniques: global pruning and local pruning. Local pruning receives, as parameters, how many weights should be pruned in each layer, and then decides which weights to prune in each layer. Global pruning covers what local pruning does and additionally determines how many weights should be pruned in each layer.
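As a rough illustration of the distinction above, the following Python sketch applies magnitude-based pruning in both styles to two toy layers. The layer names, the weight values, and the helper functions `local_prune` and `global_prune` are assumptions made for illustration and are not part of the disclosure.

```python
def local_prune(layers, ratios):
    """Local pruning: the caller supplies a pruning ratio for every layer."""
    pruned = {}
    for name, weights in layers.items():
        k = int(len(weights) * ratios[name])  # number of weights to remove here
        # Indices of the k smallest-magnitude weights in this layer only.
        drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
        pruned[name] = [0.0 if i in drop else w for i, w in enumerate(weights)]
    return pruned


def global_prune(layers, ratio):
    """Global pruning: one overall ratio; per-layer counts follow from the ranking."""
    entries = [(abs(w), name, i) for name, ws in layers.items() for i, w in enumerate(ws)]
    k = int(len(entries) * ratio)
    drop = {(name, i) for _, name, i in sorted(entries)[:k]}
    return {
        name: [0.0 if (name, i) in drop else w for i, w in enumerate(ws)]
        for name, ws in layers.items()
    }


# Hypothetical toy layers (flattened weights).
layers = {"conv1": [0.9, -0.8, 0.7, 0.6], "conv2": [0.05, -0.02, 0.5, 0.01]}
locally_pruned = local_prune(layers, {"conv1": 0.5, "conv2": 0.5})
globally_pruned = global_prune(layers, 0.375)
```

In this toy case the global variant removes weights only from `conv2`, because all of the smallest magnitudes sit in that layer. This shows how global pruning also decides how much each layer is pruned.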

Representative global pruning techniques include Growing, AutoSlim, and Nuclear Norm. These techniques are methodologies for determining the optimal number of channels in each layer of a deep learning model. Growing starts from one channel in each layer and searches for the number of channels by gradually increasing it. AutoSlim trains the entire model and then determines the optimal number of channels by reducing the channels of each layer. The nuclear norm technique, on the other hand, computes the nuclear norm from the singular values of each layer's feature map and then prunes by deleting channels whose values fall below a specific threshold.

When inference is performed in an edge computing environment using such pruning techniques, resources such as RAM (random access memory), memory size, and FLOPS (floating-point operations per second) are limited by the constraints of the edge hardware. For example, in the global pruning described above, the number of channels in each layer is adjusted to a limited number according to FLOPS.

In addition, because the Growing technique starts compressing the deep learning model with only one channel per layer, it takes a long training time to reach an optimal network. The AutoSlim technique also takes a long time to train because it trains the entire network for each layer. The nuclear norm technique has the problem that its results are not precise, because pruning is performed based on pre-trained weights.

Provided are a method and system for lightening a deep learning model that can reduce the training time of the deep learning model by initializing pruning in consideration of the importance of the channels in each layer of the deep learning model.

Provided is a model weight reduction method performed by a computer device including at least one processor, the method comprising: determining, by the at least one processor, an importance of a channel for each layer of a deep learning model to be compressed; allocating, by the at least one processor, an initial value for each channel according to the importance of the channel; and initiating, by the at least one processor, compression of the deep learning model at a target compression rate based on the assigned initial value.

According to one aspect, the model weight reduction method may further include determining, by the at least one processor, whether to prune an arbitrary channel of the deep learning model according to the value of a weight mask for the arbitrary channel, in a loss function configured using the weight mask based on the assigned initial value.

According to another aspect, the loss function may be configured using a cross-entropy loss for improving the training accuracy of the deep learning model, the weight mask, and L1 regularization of the weight mask.

According to another aspect, the L1 regularization may be configured as the product of an L1 norm of the weight mask and a hyperparameter for controlling the effect of the L1 norm on the loss function.

According to another aspect, the determining of whether to prune the arbitrary channel may include determining to keep using the arbitrary channel when the value of the weight mask is 1, and determining to prune the arbitrary channel when the value of the weight mask is 0.

According to another aspect, the weight mask may be composed of a sigmoid function that takes the product of a temperature parameter and a mask variable as its input, and the temperature parameter may be determined as the product of the assigned initial value and the gradient of the loss with respect to the value of the mask variable, so that a relatively larger weight is assigned to a channel that contributes relatively more to improving the performance of the deep learning model.

According to another aspect, the determining of the importance of the channel may include determining the importance of the channel for each layer of the deep learning model to be compressed using a global pruning technique.

According to another aspect, the global pruning technique may include Layer-Adaptive Magnitude-based Pruning (LAMP).

Provided is a computer program stored on a computer-readable recording medium which, in combination with a computer device, executes the method on the computer device.

Provided is a computer-readable recording medium on which a program for executing the method on a computer device is recorded.

Provided is a computer device including at least one processor implemented to execute computer-readable instructions, wherein the at least one processor determines the importance of a channel for each layer of a deep learning model to be compressed, allocates an initial value for each channel according to the importance of the channel, and starts compression of the deep learning model at a target compression rate based on the assigned initial value.

The learning time of the deep learning model can be reduced by initializing pruning considering the importance of the channel for each layer of the deep learning model.

FIG. 1 is a block diagram illustrating an example of a computer device according to one embodiment of the present invention.

FIG. 2 is a block diagram showing an example of an internal configuration of a deep learning model lightweighting system according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating an example of a deep learning model lightweighting method according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of starting compression of a deep learning model from a target compression rate according to an embodiment of the present invention.

Hereinafter, an embodiment will be described in detail with reference to the accompanying drawings.

A deep learning model lightweighting system according to embodiments of the present invention may be implemented by at least one computer device. A computer program according to an embodiment of the present invention may be installed and run on the computer device, and the computer device may perform the deep learning model lightweighting method according to embodiments of the present invention under the control of the running computer program. The computer program may be stored on a computer-readable recording medium so as to be combined with the computer device and execute the deep learning model lightweighting method on the computer.

FIG. 1 is a block diagram illustrating an example of a computer device according to one embodiment of the present invention. As shown in FIG. 1, a computer device 100 may include a memory 110, a processor 120, a communication interface 130, and an input/output (I/O) interface 140. The memory 110 is a computer-readable recording medium and may include a random access memory (RAM), a read only memory (ROM), and a permanent mass storage device such as a disk drive. Here, a permanent mass storage device such as a ROM or a disk drive may also be included in the computer device 100 as a separate permanent storage device distinct from the memory 110. An operating system and at least one program code may be stored in the memory 110. These software components may be loaded into the memory 110 from a computer-readable recording medium separate from the memory 110. Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. In another embodiment, software components may be loaded into the memory 110 through the communication interface 130 rather than from a computer-readable recording medium. For example, software components may be loaded into the memory 110 of the computer device 100 based on a computer program installed from files received through a network 160.

The processor 120 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 120 by the memory 110 or the communication interface 130. For example, the processor 120 may be configured to execute received instructions according to program code stored in a recording device such as the memory 110.

The communication interface 130 may provide a function for the computer device 100 to communicate with other devices through the network 160. For example, a request, command, data, or file generated by the processor 120 of the computer device 100 according to program code stored in a recording device such as the memory 110 may be transmitted to other devices through the network 160. Conversely, signals, commands, data, files, etc. from other devices may be received by the computer device 100 through the network 160 and the communication interface 130 of the computer device 100. Signals, commands, data, etc. received through the communication interface 130 may be transmitted to the processor 120 or the memory 110, and files, etc. may be stored in a storage medium (the permanent storage device described above) that the computer device 100 may further include.

The input/output interface 140 may be a means for interface with the input/output device (I/O device, 150). For example, the input device may include a device such as a microphone, keyboard, or mouse, and the output device may include a device such as a display or speaker. As another example, the input/output interface 140 may be a means for interface with a device in which functions for input and output are integrated into one, such as a touch screen. The input/output device 150 and the computer device 100 may be configured as one device.

Also, in other embodiments, the computer device 100 may include fewer or more components than those shown in FIG. 1. However, most conventional components need not be illustrated explicitly. For example, the computer device 100 may be implemented to include at least a portion of the above-described input/output device 150 or may further include other components such as a transceiver and a database.

FIG. 2 is a block diagram showing an example of the internal configuration of a deep learning model lightweighting system according to an embodiment of the present invention, and FIG. 3 is a flowchart showing an example of a deep learning model lightweighting method according to an embodiment of the present invention. The deep learning model lightweighting system 200 according to this embodiment may be implemented by at least one computer device 100. The deep learning model lightweighting system 200 of FIG. 2 may include a channel importance determination unit 210, a channel initial value allocation unit 220, a compression start unit 230, and a pruning decision unit 240. Here, the channel importance determination unit 210, the channel initial value allocation unit 220, the compression start unit 230, and the pruning decision unit 240 may be functional expressions of the operations that the processor 120 of the computer device 100 implementing the deep learning model lightweighting system 200 performs under the control of a computer program. For example, the processor 120 of the computer device 100 may be implemented to execute control instructions according to the code of an operating system or of at least one computer program included in the memory 110. Here, the processor 120 may control the computer device 100 so that the computer device 100 performs steps 310 to 340 of the method of FIG. 3 according to control instructions provided by code stored in the computer device 100. The channel importance determination unit 210, the channel initial value allocation unit 220, the compression start unit 230, and the pruning decision unit 240 may be used as functional expressions of the processor 120 for performing steps 310 to 340, respectively.

In step 310, the channel importance determination unit 210 may determine the importance of the channel for each layer of the deep learning model to be compressed. For example, the channel importance determination unit 210 may determine this importance using a global pruning technique. As a more specific example, Layer-Adaptive Magnitude-based Pruning (LAMP) may be used to determine the importance of the channels of each layer. By using a layer-specific sparsity selected for each layer, LAMP provides a pruning technique that can achieve a state-of-the-art trade-off between sparsity and performance compared with simple magnitude-based pruning.
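To make the importance computation in step 310 more concrete, the following Python sketch computes LAMP-style scores for one layer: each squared magnitude is divided by the sum of the squared magnitudes that are at least as large within the same layer. Applying the score to per-channel weight norms, the function name `lamp_scores`, and the sample values are assumptions made for illustration. The description only states that LAMP may be used, not how it is applied to channels.

```python
def lamp_scores(magnitudes):
    """LAMP-style score per entry of a single layer.

    score[i] = m[i]^2 / (sum of m[j]^2 over all j whose magnitude is >= m[i]).
    """
    order = sorted(range(len(magnitudes)), key=lambda i: abs(magnitudes[i]))
    scores = [0.0] * len(magnitudes)
    tail = 0.0
    # Walk from the largest entry down so that `tail` accumulates the suffix sum.
    for i in reversed(order):
        sq = magnitudes[i] ** 2
        tail += sq
        scores[i] = sq / tail if tail > 0.0 else 0.0
    return scores


# Hypothetical per-channel weight norms of one layer (assumed values).
channel_norms = [3.0, 1.0, 2.0]
scores = lamp_scores(channel_norms)
```

Because the scores are normalized within each layer, the largest channel of every layer always scores 1. Scores can therefore be compared across layers when a single global ranking is needed.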

In step 320, the channel initial value allocation unit 220 may allocate an initial value for each channel according to the importance of the channel. Here, the initial value allocated to each channel may be an initial value of a temperature parameter to be described later.

In step 330, the compression start unit 230 may start compression of the deep learning model at a target compression rate based on the assigned initial value. As described above, because the Growing technique starts compressing the deep learning model with only one channel per layer, it takes a long training time to reach an optimal network. The AutoSlim technique likewise takes a long time to train because it trains the entire network for each layer. In contrast, the deep learning model lightweighting method according to the present embodiment allocates an initial value for each channel according to the importance of the channel and can then start compressing the deep learning model from the target compression rate, thereby shortening the training time of the deep learning model.

In step 340, the pruning decision unit 240 may determine whether to prune an arbitrary channel of the deep learning model according to the value of the weight mask for that channel, in a loss function constructed using a weight mask based on the assigned initial value. For example, the loss function may be configured using a cross-entropy loss for improving the training accuracy of the deep learning model, the weight mask, and L1 regularization of the weight mask.

In this case, the pruning decision unit 240 may determine to keep using the arbitrary channel when the value of the weight mask is 1, and may determine to prune the arbitrary channel when the value of the weight mask is 0.

In addition, the L1 regularization may consist of the product of an L1 norm of the weight mask and a hyperparameter for adjusting the effect of the L1 norm on the loss function.

In addition, the weight mask may be composed of a sigmoid function that takes the product of a temperature parameter and a mask variable as its input. The temperature parameter may be determined as the product of the assigned initial value and the gradient of the loss with respect to the value of the mask variable, so that a relatively larger weight is assigned to a channel that contributes relatively more to improving the performance of the deep learning model.
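The decision rule of step 340 can be sketched as follows. Because a sigmoid only approaches 0 and 1 asymptotically, the sketch treats values within a small tolerance of 0 or 1 as having reached them. The tolerance `tol`, the function names, and the sample values are assumptions made for illustration and do not come from the description.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decide(mask_variable, temperature, tol=1e-3):
    """Return the pruning decision for one channel from its weight mask value."""
    mask = sigmoid(temperature * mask_variable)
    if mask >= 1.0 - tol:
        return "keep"       # mask is effectively 1: keep using the channel
    if mask <= tol:
        return "prune"      # mask is effectively 0: prune the channel
    return "undecided"      # mask has not yet converged to a discrete value


# Hypothetical (mask variable, temperature) pairs for three channels.
decisions = [decide(s, beta) for s, beta in [(2.0, 10.0), (-2.0, 10.0), (0.1, 1.0)]]
```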

As a more specific example, the network loss of the deep learning model may be calculated as in Equation 1 below.

[Equation 1]

$$L = L_{CE}\big(f(x;\, w \circ \sigma(\beta s))\big) + \lambda_1 \lVert \sigma(\beta s) \rVert_1$$

Here, “L” may mean the loss of the deep learning model, and “f” may mean the deep learning model that receives data “x” as an input. In addition, “L_CE” may be the cross-entropy loss for improving the training accuracy of the deep learning model “f”, “w” may be the weights, and “σ(βs)” may be a weight mask that performs weight masking. In this case, “σ” may be a sigmoid function, “β” may be a temperature parameter, and “s” may be a mask variable. Also, “∘” may mean a Hadamard product (or element-wise product), and “||x||_1” may mean the L1 norm of x. Here, the temperature parameter “β” may be scheduled so that the deep learning model “f” is trained in an optimal direction. Scheduling of the temperature parameter “β” will be described in more detail later.
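Reading the symbol definitions above as L = L_CE(f(x; w ∘ σ(βs))) + λ_1·||σ(βs)||_1, the loss can be sketched for a toy model. The single linear layer standing in for “f”, the channel-wise broadcasting of the mask, the explicit class label, and all numeric values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def masked_loss(x, label, w, s, beta, lam1):
    """Cross-entropy of a masked linear layer plus an L1 penalty on the mask."""
    # Weight mask sigma(beta * s), one value per input channel.
    mask = [sigmoid(b_c * s_c) for b_c, s_c in zip(beta, s)]
    # Toy model f: logits from element-wise masked weights (w ∘ mask).
    logits = [
        sum(w_jc * m_c * x_c for w_jc, m_c, x_c in zip(row, mask, x))
        for row in w
    ]
    # Cross-entropy via a stable log-sum-exp.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    cross_entropy = log_z - logits[label]
    # L1 norm of the mask, scaled by the hyperparameter lambda_1.
    l1_penalty = lam1 * sum(abs(m) for m in mask)
    return cross_entropy + l1_penalty


loss = masked_loss(
    x=[2.0, 2.0], label=0,
    w=[[1.0, 0.0], [0.0, 1.0]],
    s=[0.0, 0.0], beta=[1.0, 1.0], lam1=0.1,
)
```

With both mask variables at 0 the mask is 0.5 for each channel, the two logits are equal, and the loss is ln 2 plus the L1 penalty of 0.1.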

In this case, the pruning decision unit 240 may determine to keep using an arbitrary channel when the value of the weight mask “σ(βs)” is 1, and may determine to prune the arbitrary channel when the value of the weight mask “σ(βs)” is 0.

“λ_1” is a hyperparameter for adjusting the influence of the L1 norm on the loss, and may be scheduled as shown in Equation 2 below so that the deep learning model “f” is trained toward the target sparsity.

Here, “λ_1^base” may mean a hyperparameter, “u” may mean the target sparsity, and “u_c” may mean the current sparsity.
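The form of Equation 2 is not reproduced in this text, so the following Python sketch only illustrates the general idea of scheduling “λ_1” from the three quantities named above. The proportional rule used here (base value times the remaining gap between target and current sparsity) is purely an assumption and should not be read as the disclosed equation. The function name `schedule_lambda1` is likewise hypothetical.

```python
def schedule_lambda1(lam1_base, target_sparsity, current_sparsity):
    """Hypothetical schedule: scale the base coefficient by the sparsity gap.

    ASSUMPTION: this proportional rule is a stand-in, not Equation 2 itself.
    """
    gap = max(0.0, target_sparsity - current_sparsity)
    return lam1_base * gap


# Far from the target the L1 pressure is strong; past the target it vanishes.
lam_far = schedule_lambda1(0.5, target_sparsity=0.5, current_sparsity=0.25)
lam_past = schedule_lambda1(0.5, target_sparsity=0.5, current_sparsity=0.75)
```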

Meanwhile, as the value of the temperature parameter “β” in Equation 1 increases, the slope of the sigmoid function “σ” becomes steeper, so the weight mask “σ(βs)” converges toward discrete values. To improve training stability and convergence speed, the temperature parameter “β” may be scheduled so that a channel that contributes greatly to performance improvement is assigned a large weight and therefore continues to be used during training.
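The effect of the temperature parameter can be checked numerically. The short Python sketch below evaluates the mask for a fixed pair of mask variables at increasing temperatures. The specific values of “s” and “β” are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


mask_variables = [-0.5, 0.5]          # hypothetical values of s for two channels
masks_by_beta = {
    beta: [sigmoid(beta * s) for s in mask_variables]
    for beta in (1.0, 10.0, 100.0)    # increasing temperature
}
```

At β = 1 the two masks are about 0.38 and 0.62, which is far from a binary decision. As β grows, the same mask variables are pushed toward 0 and 1 respectively.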

For example, as shown in Equation 3 below, the temperature parameter “β” may be obtained by multiplying the initial value “β_0” by the degree of change of the loss with respect to the value of the mask variable, so that the more a change in the corresponding mask contributes to the loss, the greater the weight that is given.
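A minimal Python sketch of this scheduling idea follows. It takes the per-channel initial values and per-channel loss gradients with respect to the mask variables as plain lists. Using the absolute value of the gradient is an assumption, since the text above does not say whether the sign is kept. The function name `schedule_beta` and the numbers are hypothetical.

```python
def schedule_beta(beta0, loss_grad_wrt_s):
    """Per-channel temperature: initial value times the loss-gradient magnitude.

    ASSUMPTION: abs() is used so that the temperature stays non-negative.
    """
    return [b0 * abs(g) for b0, g in zip(beta0, loss_grad_wrt_s)]


# Channel 0 has a larger initial value and a larger gradient magnitude,
# so it receives the larger temperature.
betas = schedule_beta(beta0=[2.0, 1.0], loss_grad_wrt_s=[-0.5, 0.1])
```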

FIG. 4 is a diagram illustrating an example of starting compression of a deep learning model from a target compression rate according to an embodiment of the present invention. FIG. 4 shows an example in which, for the original model 410, training of the deep learning model is performed on all channels of each layer. FIG. 4 also shows, for the Growing technique 420, an example of starting compression of the deep learning model with only one channel per layer, and, for the AutoSlim technique 430, an example of training the entire network for each layer. Unlike these prior arts, FIG. 4 shows, for the proposed method 440 according to this embodiment, an example of starting compression of the deep learning model from the target compression rate after allocating an initial value for each channel according to the importance of the channel. Because compression starts from the target compression rate, the optimal number of channels for each layer can be determined more quickly, which reduces the training time.

As described above, according to embodiments of the present invention, the training time of the deep learning model can be reduced by initializing pruning in consideration of the importance of the channel for each layer of the deep learning model.

The system or device described above may be implemented as hardware components or as a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications running on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to execution of software. For convenience of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors or a processor and a controller. Other processing configurations are also possible, such as parallel processors.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer readable medium. The computer readable medium may include program instructions, data files, data structures, etc. alone or in combination. The medium may continuously store programs executable by a computer or temporarily store them for execution or download. In addition, the medium may be various recording means or storage means in the form of a single or combined hardware, but is not limited to a medium directly connected to a certain computer system, and may be distributed on a network. Examples of the medium include magnetic media such as hard disks, floppy disks and magnetic tapes, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and ROM, RAM, flash memory, etc. configured to store program instructions. In addition, examples of other media include recording media or storage media managed by an app store that distributes applications, a site that supplies or distributes various other software, and a server. Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes such as those produced by a compiler.

Although the embodiments have been described above with reference to limited examples and drawings, those skilled in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, etc. are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims are within the scope of the following claims.

Claims

1. A model weight reduction method performed by a computer device including at least one processor, the method comprising:

determining, by the at least one processor, an importance of a channel for each layer of a deep learning model to be compressed;

allocating, by the at least one processor, an initial value for each channel according to the importance of the channel; and

initiating, by the at least one processor, compression of the deep learning model at a target compression rate based on the assigned initial value.
2. The method of claim 1, further comprising:

determining, by the at least one processor, whether to prune an arbitrary channel of the deep learning model according to the value of a weight mask for the arbitrary channel, in a loss function constructed using the weight mask based on the assigned initial value.
3. The method of claim 2,

wherein the loss function is configured using a cross-entropy loss for improving the training accuracy of the deep learning model, the weight mask, and L1 regularization of the weight mask.
4. The method of claim 3,

wherein the L1 regularization is composed of a product of an L1 norm of the weight mask and a hyperparameter for controlling an effect of the L1 norm on the loss function.
5. The method of claim 2,

wherein the determining of whether to prune the arbitrary channel comprises:

determining to continue using the arbitrary channel when the value of the weight mask is 1, and determining to prune the arbitrary channel when the value of the weight mask is 0.
6. The method of claim 2,

wherein the weight mask is composed of a sigmoid function having a product of a temperature parameter and a mask variable as an input, and

wherein the temperature parameter is determined as a product of the assigned initial value and the gradient of the loss with respect to the value of the mask variable, so that a relatively larger weight is assigned to a channel that contributes relatively more to improving the performance of the deep learning model.
7. The method of claim 1,

wherein the determining of the importance of the channel comprises:

determining the importance of the channel for each layer of the deep learning model to be compressed using a global pruning technique.
8. The method of claim 7,

wherein the global pruning technique comprises Layer-Adaptive Magnitude-based Pruning (LAMP).
9. A computer program stored on a computer-readable recording medium to be combined with a computer device to execute the method of any one of claims 1 to 8 on the computer device.
10. A computer-readable recording medium on which a program for executing the method of any one of claims 1 to 8 on a computer device is recorded.
11. A computer device comprising:

at least one processor implemented to execute computer-readable instructions,

wherein the at least one processor is configured to:

determine an importance of a channel for each layer of a deep learning model to be compressed;

allocate an initial value for each channel according to the importance of the channel; and

initiate compression of the deep learning model at a target compression rate based on the assigned initial value.
12. The computer device of claim 11,

wherein the at least one processor is further configured to:

determine whether to prune an arbitrary channel of the deep learning model according to the value of a weight mask for the arbitrary channel, in a loss function constructed using the weight mask based on the assigned initial value.
13. The computer device of claim 12,

wherein the loss function is configured using a cross-entropy loss for improving the training accuracy of the deep learning model, the weight mask, and L1 regularization of the weight mask.
14. The computer device of claim 12,

wherein, to determine whether to prune the arbitrary channel, the at least one processor is configured to:

determine to continue using the arbitrary channel when the value of the weight mask is 1, and determine to prune the arbitrary channel when the value of the weight mask is 0.
15. The computer device of claim 12,

wherein the weight mask is composed of a sigmoid function having a product of a temperature parameter and a mask variable as an input, and

wherein the temperature parameter is determined as a product of the assigned initial value and the gradient of the loss with respect to the value of the mask variable, so that a relatively larger weight is assigned to a channel that contributes relatively more to improving the performance of the deep learning model.